AND THEN THERE WERE NONE
A Stall-Free Real-Time Garbage Collector for Reconfigurable Hardware
David F. Bacon Perry Cheng Sunil Shukla
IBM Research
IMPLEMENTING A PROGRAMMING LANGUAGE
[Figure (build): routes from Program to Circuit. Labels: Program, Source Code, Interpreter, Instruction Set, Processor, Compiler, Machine Code, Hardware Compiler, Circuit Layout, Circuit]
PROGRAMMING RECONFIGURABLE HARDWARE (FPGAS)
• Programmed at very low level of abstraction
• same as designing custom circuits (ASICs)
• Verilog, VHDL prevail: bits and bit arrays are main abstraction
HIGH LEVEL LANGUAGE
GARBAGE COLLECTION
SYSTEM = APPLICATION + COLLECTOR
[Figure labels: HAND-WRITTEN HDL, COLLECTOR & MEMORY]
RECONFIGURABLE HARDWARE BACKGROUND
CONFIGURABLE LOGIC
UP TO 300K SLICES = 2.4M FLIP-FLOPS
PROGRAMMABLE ROUTING NETWORK
SOURCE: WIKIMEDIA (CC) 2007
BLOCK-RAM MEMORIES (BRAMS)
[Figure (build): dual-ported BRAM with ports A and B; each port has R/W, Address, Data In, and Data Out signals. 36 Kbit per block, configurable as 36K x 1, 18K x 2, … 1K x 36; usable as RAM or FIFO. Later frames show several BRAMs combined.]
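To make the port structure concrete, here is a minimal software sketch of a dual-ported memory, not the actual FPGA primitive. The class name, the request tuples, and the rule that port B wins a same-address write conflict are all assumptions made for illustration. The read-before-write behavior is taken from the "ENABLERS FOR STALL-FREEDOM" slide.

```python
class DualPortBRAM:
    """Toy cycle-level model of a dual-ported block RAM (hypothetical API)."""

    def __init__(self, depth, init=0):
        self.cells = [init] * depth

    def cycle(self, port_a=None, port_b=None):
        """Advance one clock cycle.

        Each port request is None (idle), ("r", addr) or ("w", addr, data).
        Returns (data_out_a, data_out_b). An active port always outputs the
        contents stored before this cycle's writes (read-before-write).
        """
        requests = (port_a, port_b)
        # Phase 1: every active port samples the old contents at its address.
        outputs = tuple(
            self.cells[req[1]] if req is not None else None for req in requests
        )
        # Phase 2: writes are committed at the end of the cycle.
        # Model choice (assumption): if both ports write one address, B wins.
        for req in requests:
            if req is not None and req[0] == "w":
                self.cells[req[1]] = req[2]
        return outputs


ram = DualPortBRAM(depth=8)
ram.cycle(port_a=("w", 3, 42))
# Port A overwrites address 3 while port B reads it in the same cycle.
old_a, seen_b = ram.cycle(port_a=("w", 3, 7), port_b=("r", 3))
```

In this model both ports observe the value from before the write, and the new value only becomes visible in the following cycle.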
WHAT WE BUILT
COLLECTOR IN HARDWARE FOR HARDWARE
• Complete garbage collector
• NOT hardware-assist instructions (e.g., Azul, Lisp Machine)
• For on-chip FPGA memory
• NOT for large, general-purpose CPU DRAM
• With fixed object geometry (2 pointers + data)
• NOT for arbitrarily sized/shaped objects
• Snapshot-at-the-Beginning Algorithm [Yuasa 1990]
[Figure (build): collector block diagram. Memory subsystem interface signals: Alloc, Addr Alloc’d, Addr to Read/Write, Pointer to Write, Pointer Value. Components: Memory, Allocator, Sweep Engine, Mark Engine, Root Snapshot Engine; GC signal.]
MALLOCATOR (INCL. 1 MEMORY “COLUMN”)
[Figure (animation): Free Stack with Stack Top register and one Pointer Memory column with ports A and B. Signals: Alloc, Addr to Free, Address Allocated / Addr Alloc’d, Addr to Read/Write, Pointer to Write, Pointer Value, Address to Clear. Example values shown in the frames: 0, 1, 2, 5.]
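The free-stack allocator in the figure can be approximated in software as follows. This is a sketch under stated assumptions, not the hardware design: the class and method names are hypothetical, address 0 is assumed to be reserved as the null pointer, and each object is reduced to a single pointer field (one memory "column").

```python
NULL = 0  # assumption: address 0 is reserved as the null pointer


class FreeStackAllocator:
    """Software sketch of a stack-based allocator over fixed-size objects."""

    def __init__(self, num_objects):
        # Free Stack initially holds every address except NULL;
        # the end of the list plays the role of Stack Top.
        self.free_stack = list(range(num_objects - 1, 0, -1))
        # One pointer "column": one pointer field per object.
        self.pointer_memory = [NULL] * num_objects

    def alloc(self):
        """Pop an address from the free stack (Alloc -> Address Allocated)."""
        if not self.free_stack:
            return NULL  # heap exhausted
        return self.free_stack.pop()

    def free(self, addr):
        """Push an address back (Addr to Free) and clear its pointer field."""
        self.pointer_memory[addr] = NULL  # corresponds to Address to Clear
        self.free_stack.append(addr)

    def write_pointer(self, addr, value):
        self.pointer_memory[addr] = value

    def read_pointer(self, addr):
        return self.pointer_memory[addr]


allocator = FreeStackAllocator(num_objects=4)
first = allocator.alloc()
second = allocator.alloc()
allocator.write_pointer(first, second)
allocator.free(second)
reused = allocator.alloc()  # the most recently freed address comes back first
```

Allocation and freeing are each a single stack operation, which is what makes a one-operation-per-cycle hardware implementation plausible.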
WRITING A (POINTER) VALUE
[Figure (animation): same datapath as the allocator, stepping through a pointer write. Example values shown in the frames: 0, 5, 7.]
THE TRACE ENGINE
3 OPERATIONS
(a) Get a root pointer and mark it
(b) Dequeue a pointer from mark queue and mark it
(c) Perform write barrier and mark overwritten pointer
[Figure (animation, one sequence per operation): Pointer Memory and Mark Map with ports A and B, Mark Queue, Barrier Reg, and two MUXes. Signals: Addr to Read/Write, Pointer to Write, Pointer Value, Root to Add, Pointer to Trace, Addr to Clear. Example values shown in the frames: 3, 5, 7.]
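The three trace-engine operations can be sketched in software as below. This is an illustrative model only: the class and method names are hypothetical, address 0 is assumed to be null, and objects are assumed to hold two pointer fields (matching the "2 pointers + data" geometry). All three operations funnel into one marking routine, mirroring the way the figure multiplexes them onto a single datapath.

```python
from collections import deque

NULL = 0  # assumption: address 0 is the null pointer


class TraceEngine:
    """Software sketch of snapshot-at-the-beginning marking."""

    def __init__(self, heap_size, fields=2):
        self.pointer_memory = [[NULL] * fields for _ in range(heap_size)]
        self.mark_map = [False] * heap_size
        self.mark_queue = deque()
        self.tracing = False  # True while a collection is in its mark phase

    def _mark(self, ptr):
        """Mark ptr if it is new, then queue its fields for later tracing."""
        if ptr == NULL or self.mark_map[ptr]:
            return
        self.mark_map[ptr] = True
        for child in self.pointer_memory[ptr]:
            if child != NULL:
                self.mark_queue.append(child)

    def add_root(self, ptr):
        """(a) Get a root pointer and mark it."""
        self._mark(ptr)

    def trace_step(self):
        """(b) Dequeue a pointer from the mark queue and mark it."""
        if not self.mark_queue:
            return False
        self._mark(self.mark_queue.popleft())
        return True

    def write_pointer(self, addr, field, new_ptr):
        """(c) Write barrier: mark the overwritten pointer, then store."""
        old_ptr = self.pointer_memory[addr][field]
        if self.tracing:
            self._mark(old_ptr)
        self.pointer_memory[addr][field] = new_ptr


engine = TraceEngine(heap_size=6)
engine.write_pointer(1, 0, 2)   # object 1 -> 2
engine.write_pointer(1, 1, 3)   # object 1 -> 3
engine.write_pointer(3, 0, 4)   # object 3 -> 4; object 5 stays unreachable
engine.tracing = True
engine.add_root(1)
engine.write_pointer(3, 0, NULL)  # mutator unlinks 4 before 3 is traced
while engine.trace_step():
    pass
```

Object 4 was reachable when the snapshot was taken, so the barrier keeps it marked even though the mutator removed the only pointer to it before the tracer arrived.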
RESULTS
EVALUATE 3 SYSTEMS
(a) Malloc: Allocator, Memory
(b) Stop-the-World GC: Allocator, Sweep Engine, Mark Engine, Memory
(c) Real-Time GC: Allocator, Sweep Engine, Mark Engine, Snapshot Engine, Memory
EVALUATE SYSTEMS IN 3 CONTEXTS
(a) Collector in isolation (no application)
(b) Collector with Binary Tree benchmark (hand-written HDL)
(c) Collector with Double-ended Queue benchmark (hand-written HDL)
[Figure labels: BINARY TREE, DEQUEUE, COLLECTOR & MEMORY]
LOGIC (SLICE) USAGE - NO APPLICATION
Xilinx Virtex-5 LX330T: 51,840 Slices
• Tiny fraction of chip
• STW almost as complex as RTGC
SYNTHESIZED CLOCK FREQUENCY - NO APPLICATION
• Frequency goes down with design complexity
• Malloc is faster, but advantage narrows
EXECUTION TIME - DEQUEUE
• RTGC uniformly faster than STW
• Malloc is faster, but not by that much (almost even for Binary Tree)
CONCLUSIONS
• First complete garbage collector in hardware
• First garbage collector that NEVER pauses mutator
• Greatly expands expressiveness of hardware programs
• RTGC is faster, smaller, and cooler than STW
• RTGC in hardware is MUCH SIMPLER than in software
• Is something wrong with our processor designs?
Questions?
Suggestions:
• You only have 2 microbenchmarks. Isn’t that bogus?
• Isn’t a fixed object layout totally bogus?
• Can determinism be preserved with a more complex heap?
• Could this technique be applied to general-purpose systems?
• I don’t believe you never stall. Do you have a proof?
• Don’t you lose performance by reserving one of the ports?
• What unique hardware features made stall-freedom possible?
BACKUP
ROLE OF THE GARBAGE COLLECTOR
[Figure (build) labels: APPLICATION, HAND-WRITTEN HDL, LIME TASK, COLLECTOR & MEMORY]
PIPELINES IN THE LIME LANGUAGE
worker1(…) { … } worker2(…) { … } worker3(…) { … }
var pipeline = task worker1 => task worker2 => task worker3;
[Figure labels: port-to-stream connection, compound filter; type labels char, int, char, int[[5]]]
GARBAGE COLLECTING LIME TASKS
worker1(…) { … } worker2(…) { … } worker3(…) { … }
var pipeline = task worker1 => task worker2 => task worker3;
[Figure: the same pipeline; type labels char, int, char, int[[5]]]
REGISTER MODULE
[Figure: Mutator Register with W_EN, DATA_IN, and DATA_OUT; example value 101]
REGISTER MODULE + SNAPSHOT COMPONENT
[Figure (animation): Mutator Register and Shadow Register, each with W_EN, DATA_IN, and DATA_OUT; GC and ROOT_OUT signals. Example values shown in the frames: 101 and 000.]
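One way to read the register module with its snapshot component is sketched below. The class name and the `clock` interface are hypothetical, and the exact behavior is an assumption: when GC is asserted, the Shadow Register captures the value the Mutator Register held before that clock edge, so a simultaneous mutator write does not disturb the snapshot.

```python
class SnapshotRegister:
    """Sketch of a mutator register paired with a shadow (snapshot) register."""

    def __init__(self, value=0):
        self.mutator = value  # register the application reads and writes
        self.shadow = 0       # copy handed to the collector as a root

    def clock(self, w_en=False, data_in=0, gc=False):
        """One clock edge. Both updates are computed from the old state."""
        old_value = self.mutator
        if w_en:
            self.mutator = data_in
        if gc:
            # Assumption: the snapshot takes the pre-edge value, so the
            # mutator's write in the same cycle proceeds without stalling.
            self.shadow = old_value

    @property
    def data_out(self):
        return self.mutator

    @property
    def root_out(self):
        return self.shadow


reg = SnapshotRegister(value=0b101)
# The collector starts (gc=True) in the same cycle as a mutator write.
reg.clock(w_en=True, data_in=0b011, gc=True)
```

After the edge, the collector sees the old root on `root_out` while the mutator already sees its new value on `data_out`.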
[Figure: mutator stack with snapshot support. Labels: Mutator Stack (ports A and B), Stack Top, Scan Pointer, Push/Pop, GC, Push Value, Pop Value, Root to Add, Shadow Register, Mutator Register, Write Reg, Read Reg, MUX]
[Figure: trace engine datapath. Labels: Pointer Memory and Mark Map (ports A and B), Mark Queue, Barrier Reg, two MUXes, Addr to Read/Write, Pointer to Write, Pointer Value, Root to Add, Pointer to Trace, Addr to Clear]
[Figure: sweep engine and allocator datapath. Labels: Free Stack (ports A and B), Stack Top, Sweep Pointer, Mark Map, Used Map, MUX, "=10?" comparator, Alloc, GC, Address Allocated / Addr Alloc’d, Address to Free, Addr to Clear]
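A sweep over the Used Map and Mark Map could look roughly like the sketch below. The function name is hypothetical, and the figure's "=10?" comparator is not decoded here; the sketch simply assumes that an object is garbage when it is in use but unmarked, and that address 0 is reserved as null.

```python
def sweep(used_map, mark_map, free_stack):
    """Sketch of a sweep pass: free every used-but-unmarked object.

    The sweep pointer visits one address per step. Mark bits are cleared
    along the way so the next collection starts from a clean mark map.
    """
    for sweep_pointer in range(1, len(used_map)):  # skip address 0 (null)
        if used_map[sweep_pointer] and not mark_map[sweep_pointer]:
            used_map[sweep_pointer] = False
            free_stack.append(sweep_pointer)  # Address to Free
        mark_map[sweep_pointer] = False
    return free_stack


used = [False, True, True, True]
marks = [False, True, False, True]
freed = sweep(used, marks, free_stack=[])
```

Only address 2 is reclaimed in this example, because it is the one object that was allocated but never reached by the marker.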
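[Figure: mutator stack implementation. Labels: Mutator Stack memory (DATA_IN_A, W_EN_A, ADDR_IN_A, DATA_OUT_A, ADDR_IN_B, DATA_OUT_B, W_EN_B = 0), Top of Stack register, Scan Index register, State Machine, MUXes, +/- 1 and - 1 adders, PUSH, DATA_IN (PUSH), DATA_OUT (POP), W_EN, ROOT_OUT. Example register value: 101.]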
ENABLERS FOR STALL-FREEDOM
• Dual-ported Memory
• Read-before-Write Memory and Registers
• Simple, ubiquitous synchronization (clock edge)
• Forward reasoning about remote states (clock cycles)
• Determinism
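The clock-edge and read-before-write points can be illustrated with a tiny register-transfer sketch. Everything here is a hypothetical illustration rather than the hardware itself: `clock_edge` and the register names are invented, and the model assumes every register samples the old state before any register is updated.

```python
def clock_edge(state, next_state_fn):
    """Apply one synchronous update to a dict of registers.

    next_state_fn receives a copy of the old state and returns the registers
    to overwrite; all new values are committed together afterwards.
    """
    new_values = next_state_fn(dict(state))  # reads see only the old values
    state.update(new_values)                 # writes land at the clock edge
    return state


regs = {"a": 3, "b": 5}
# Two registers exchange contents in one cycle, with no temporary and no lock.
clock_edge(regs, lambda old: {"a": old["b"], "b": old["a"]})
```

Because every unit updates on the same edge and the outcome is deterministic, one unit can reason about what another will hold a known number of cycles ahead, which is the "forward reasoning" the slide refers to.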
EXECUTION TIME IN CYCLES - DEQUEUE
• STW burns cycles while stopping the world
• Malloc pays (a little) for explicit free operation
• Malloc can run in a smaller heap (but not as bad as software)
FIELD PROGRAMMABLE GATE ARRAYS
IOB
THE LIQUID METAL PROGRAMMING LANGUAGE
[Figure (build): Lime source feeds the Lime Compiler, which has four backends: CPU Backend (bytecode), GPU Backend (binary), Node Backend (binary), Verilog Backend (bitfile). Target labels: GPU, CPU, PowerEN, FPGA.]
EXECUTION, COMMUNICATION, AND REPLACEMENT
[Figure (animation) label: LVM]
STATEFUL TASKS
instance variables (local state):
double total; long count;
primitive filter (double in, double out):
double avg(double x) { total += x; return total/++count; }
var averager = task Averager().avg;
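For readers unfamiliar with Lime, the stateful task above behaves like the following Python sketch. This is an analogy, not Lime semantics: `run_task` is a hypothetical stand-in for the `task` operator, and a Python list stands in for the input stream.

```python
class Averager:
    """Running average with instance variables as local task state."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def avg(self, x):
        self.total += x
        self.count += 1
        return self.total / self.count


def run_task(filter_fn, stream):
    """Hypothetical stand-in for a task: apply the filter to each item."""
    for item in stream:
        yield filter_fn(item)


averager = Averager()
outputs = list(run_task(averager.avg, [2.0, 4.0, 9.0]))
```

Each output depends on all earlier inputs, which is what distinguishes a stateful task from a pure filter.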
VIRTUALIZATION OF DATA MOVEMENT
=>
INTERPRETATION VERSUS COMPILATION
[Figure labels: PROGRAM, getField, invokeVirtual, MOV, BLR, INTERPRETER, INSTRUCTION SET PROCESSOR]
GARBAGE COLLECTION
• Frees programmer from managing memory
• Simpler interfaces, easier debugging, memory safety
• Invented 1960 for IBM 704 with 18K
• Current large FPGAs have memory commensurate with a VAX 11/780
• Recent results:
• We built a garbage collector for data in on-chip BRAMs
• Able to handle a memory op each cycle without ever stalling
• Cost in slices and energy is ~0; cost in frequency and BRAM is small
• Algorithmically simpler than SW GC, yet achieves vastly better results
• Potential game-changer in scope of synthesizable code