On-Chip Communication Design and Latency-Insensitive Protocols
Luca P. Carloni
EECS Department, University of California at Berkeley
Enabling Systems on Silicon
[Chart: number of transistors (M), on-chip local clock (MHz), and power (×10 mW) per product generation for high-performance microprocessors and ASICs in deep sub-micron (DSM) technologies, across nodes 130 nm ('01), 115 nm ('02), 100 nm ('03), 90 nm ('04), 80 nm ('05), 70 nm ('06), 65 nm ('07), 45 nm ('10), 32 nm ('13), 22 nm ('16). Source: Executive Summary of the 2001 International Technology Roadmap for Semiconductors.]
The Productivity Gap Problem
[Chart: predicted peak design staff (log scale) across eight process generations. G. Spirakis (Intel), GSRC Retreat, Stanford, Feb 12, 1999: "Design staff has doubled each generation!"]
• While chip makers have been able to increase the number of logic transistors per chip by close to 60% each year, the productivity of design automation tools has been growing at a rate of only 21% per year [Electronic Business, June '99]
The On-Chip Interconnect Latency Roadblock

• The RC delay of an average metal line of constant length is worsening with each process generation despite:
– increases in metal layers and in wire aspect ratio
– one-time technological improvements such as copper metallization and low-κ dielectric insulators
• The intrinsic interconnect delay of a 1-mm interconnect for a 35-nm technology will be longer than the MOSFET switching delay by two orders of magnitude [Davis et al., IEEE Proc. ‘01]
• Shift from function-centric to communication-centric design
– Instead of being limited by the number of transistors that can be integrated on a single die (computation bound), designs will be limited by the amount of state and logic reachable within the required number of clock cycles (communication bound)
DSM: Percentage of Reachable Die
[Chart: percentage of die reachable (0–100%) within 1, 2, 4, 8, and 16 clock cycles, across technology nodes 250, 180, 130, 100, 80, and 60 nm.]
• “For a 60-nanometer process a signal can reach only 5% of the die’s length in a clock cycle” [D. Matzke, ‘97]
• Cause: Combination of high frequencies and slower wires
The Future of Wires [Horowitz et al., IEEE Proc., ’01]
• Local (scaled-length) wires
– span a fixed number of gates; scale well together with logic
• Global (fixed-length) wires
– span a fixed fraction of a die; do not scale with technology
Outline - The Impact of Wire Latency in SOC Design

• Interconnect latency
– is increasingly affecting microprocessor design
• the amount of state reachable in a clock cycle, not the number of transistors, is going to limit ILP growth [Burger et al., ISCA '00]
• drive stages in Intel Pentium 4 NetBurst
• clustered functional units and partitioned register file in Alpha 21264
– is hard to estimate because it is affected by many phenomena
• process variations, cross-talk, power-supply drop variations
– breaks the synchronous assumption
• that lies at the basis of design automation tool flows
• Towards distributed design
– wire pipelining is destined to become pervasive in SOC design
• trades off fixing a design exception against increasing wire latency
– rise of on-chip multiprocessor architectures and on-chip networks
– new design methodologies for component reuse/plug-and-play that are robust to interconnect latency variance
Traditional Design Flow & the Timing Closure Problem
• founded on the synchronous design methodology
– longest combinational path (critical path) dictates the maximum operating frequency
– operating frequency is often a design constraint
– design exception: a path with delay larger than clock period
• many costly iterations between synthesis and layout because
– steps are performed independently
– accurate estimations of global wire latencies are impractical
– statistical delay models badly estimate post-layout wire load capacitance
[Flow diagram, after Kapadia et al., DAC '99: RTL constraints with statistical wire-load models → logic synthesis → (constraints met?) → floorplanning / coarse placement → detailed placement / placement merge → (constraints met?) → re-optimization (buffering, sizing, fanout optimization, critical-path optimization) → routing / layout merge → (constraints met?) → in-place optimization (buffering, sizing) → final layout.]
Converging to Final Layout by Fixing Design Exceptions
• Re-placing, re-routing, re-designing
– do not alleviate the timing-closure problem
• Combining logic synthesis and physical design
– difficult because logic synthesis is inherently unstable
• small variations in the HDL RTL specification may lead to major variations in the output netlist and, consequently, in the final layout
• Wire buffering
– efficient, but subject to precise limitations
• there is a limit to the number of buffers that designers can insert on a wire and still reduce its delay
Wire Buffering and Wire Pipelining

• Wire delay
– grows quadratically with wire length
• Wire buffering
– if optimal, makes wire delay grow linearly with wire length
– reduces the growth of the wire-delay-to-gate-delay ratio in future process technologies
• from 2000X to 40X for global wires
• from 10X to 3X for local wires
• Wire pipelining
– is necessary to meet the specified clock period
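The quadratic-vs-linear behavior above can be illustrated with a toy Elmore-style delay model. All parameters below (resistance, capacitance, buffer delay, segment length) are made up for illustration and do not correspond to any real process:

```python
# Toy model: an unbuffered distributed RC wire has delay ~ 0.38 * r * c * L^2,
# so it grows quadratically with length L. Splitting the wire into fixed-length
# segments, each driven by a repeater, makes the total delay grow linearly.
def unbuffered_delay(length_mm, r=1.0, c=1.0):
    # r, c: illustrative resistance/capacitance per mm
    return 0.38 * r * c * length_mm ** 2

def buffered_delay(length_mm, r=1.0, c=1.0, t_buf=0.2, seg_mm=0.5):
    # split into ~fixed-length segments, each paying one buffer delay t_buf
    n = max(1, round(length_mm / seg_mm))
    seg = length_mm / n
    return n * (t_buf + 0.38 * r * c * seg ** 2)

# doubling the length quadruples the unbuffered delay but only doubles the buffered one
print(unbuffered_delay(2) / unbuffered_delay(1))   # 4.0
print(buffered_delay(8) / buffered_delay(4))       # 2.0
```

The sketch only reproduces the scaling trend quoted on the slide, not absolute delays.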
Buffering and Chip-Edge Long Global Wire Latency [Saraswat et al., ’02]
[Chart: wire latency in clock cycles (0–16) for a chip-edge-long global wire across technology nodes 180, 150, 120, 100, 70, 50, and 35 nm, for three cases: unbuffered, optimally buffered, and buffered with DP=25%.]
• Despite optimal buffering, wire latency increases ~8X in a 35-nm technology due to combined increases in clock frequency (~3X), chip length (~1.45X), and delay per unit length (~1.85X)
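The ~8X figure is simply the product of the three contributing factors quoted on the slide:

```python
# Combined wire-latency growth at 35 nm as the product of the three factors
freq_x, chip_len_x, delay_per_len_x = 3.0, 1.45, 1.85
print(round(freq_x * chip_len_x * delay_per_len_x, 2))  # 8.05 -> "~8X"
```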
Retiming of Global Interconnects [Cong et al., DAC ’03]
• By using sequential timing analysis theory, retiming and placement are combined in the physical design phase to optimize problematic interconnects (those that must be crossed in one cycle)
[Two retiming examples on small netlists with nodes a, b, c, d: in the first, retiming does not reduce the critical path delay (D=4); in the second, retiming reduces the critical path delay from D=5 to D=3.]
Beyond Retiming: Wire Pipelining by Flip-Flop Insertion

• A theoretical lower bound limits retiming
– the maximum delay-to-register cycle ratio [Papaefthymiou, '91]
• Wire pipelining
– trades off fixing a wire exception against increasing the wire latency by one or more cycles
– will become pervasive in DSM designs, where most global wires will be heavily pipelined anyway
• Combining wire buffering and wire pipelining
– buffer/FF feasible-region planning at the architectural floorplanning/wireplanning stage [Koh et al., DATE '02]
– performance-driven concurrent buffer/FF insertion for latency-constrained interconnects [Cocchini, ICCAD '02]
– both approaches assume that latency constraints are predefined by micro-architecture designers
Stateless Repeaters vs. Stateful Repeaters
• Both buffers and flip-flops are wire repeaters
– they regenerate the signals traveling on long wires
• Stateful repeaters
– storage elements, which carry a state
• flip-flops, latches, registers, relay stations…
• generally, the state must be initialized
• Inserting stateful repeaters has an impact on the surrounding control logic
– if the interface logic of two communicating modules assumed a certain latency, then costly rework is necessary to account for additional pipeline stages
– formal methods are necessary to enable automatic insertion of stateful repeaters
Breaking the Synchronous Assumption
• Presence of multi-cycle interconnect paths
– long global wires are pipelined
• "in high-end microprocessors the clock frequency is determined primarily by the time needed to complete local computation loops, not by the time needed for global communication" [2001 ITRS]
– Compaq Alpha 21264, Intel Pentium 4
• "there are plenty of places in the Pentium 4 where the wires were pipelined" [Sprangle & Carmean, ISCA '02]
• High variance of on-chip clock rate
– more than 2 orders of magnitude
• Today's high-end microprocessors and tomorrow's SOCs are distributed systems
– they demand design methodologies for component reuse/plug-and-play that are robust to interconnect latency variance
Microprocessors as Distributed Systems
[Chart, source Intel Corp.: core clock vs. bus clock frequency (0–2500 MHz) for Intel processors from the 386 through the 486-33, 486-DX2, 486-DX4, Pentium 100/150/200, Pentium II 300/400, Pentium III 600/700/800/1G, up to the Pentium 4 at 1.5 and 2.0 GHz.]

• Bus clock vs. processor core clock
– bus frequency is not keeping up with the processor core
• Increasing interconnect latency will penalize current memory-oriented microprocessor architectures
– based on the assumption of low-latency communication with centralized structures such as caches, register files, and rename/reorder tables
• Studies employing cache delay analysis tools predict that in a 35-nm design running at 10 GHz, accessing even a 4KB L1 cache will require 3 clock cycles [Burger, Keckler et al., ISCA '00]
The Evolution of Scalar Operand Networks [Agarwal et al. ’02]
• more functional units ⇒ more live values at greater distances ⇒more physical registers, register file ports, bypass paths… ⇒ more difficult to build larger, high-frequency scalar operand networks based on centralized register files
• In a pipelined processor with bypassing links and multiple ALUs, additional MUXes, pipeline registers, and bypass paths make the scalar operand network look more like a "traditional network"
The Impact of Wire Latency on Microprocessor Design

• Alpha 21264
– marks the point when wire delays can no longer be ignored at the micro-architectural level
– the integer unit is partitioned into two physically dispersed clusters, with a one-cycle penalty for communication of results between clusters
• Pentium 4
– two pipeline stages ("drive stages") serve solely for the traversal of long wires between remote components
• The architecture becomes a distributed system whose components must be designed while accounting for communication delays
Scalability Challenges
• Delay scalability
– maintain high clock frequencies as the design scales
• Logic (and wire) pipelining
– turning propagation delay into pipeline latency is the only option for building larger structures while maintaining high frequencies
• Bandwidth scalability
– a design scales without inordinately increasing the relative percentage of interconnect resources
• Communication efficiency
– replace broadcasting mechanisms (like busses in superscalars) with unicast routes and point-to-point protocols
• Deadlock & starvation
– fewer centralized structures mean more produce/consume mismatches
The RAW Microprocessor [Agarwal et al., ’02]
• The RAW scalar operand network addresses
– the delay scalability challenge through tiling
• a signal can travel across the logic of a tile in one clock cycle
• modulo building a good clock tree, the frequency does not decrease as more tiles are added
– the bandwidth scalability challenge by replacing buses with a point-to-point mesh interconnect
• the point-to-point static network is programmed to route operands only to those tiles that need them
[Die photo annotations: 16 programmable tiles, 122M transistors, 25 GB/s I/O bandwidth, 2MB SRAM, 43 GB/s memory bandwidth; each tile contains a PE with an 8-stage MIPS pipeline, a 4-stage FPU, 32KB cache, 96KB SRAM, static & dynamic routers, and 256 wires.]
• Scaling properties
– the design has no centralized resources, no global buses, and no structures that get larger as the tile or pin count increases
– the longest wire, the design complexity, and the verification complexity are all independent of transistor count
Architectural Solution: Tiling
• Granularity spectrum: FPGA (millions of gates) → PIM (256 PEs) → fine-grain CMP (64 in-order cores) → coarse-grain CMP (16 out-of-order cores) → TRIPS (4 ultra-large cores)
• Raw processor
– scalable ISA to provide a parallel software interface to on-chip physical resources, which become first-class architectural entities
• gates → tiles
• wire delays → network hops
• pins → I/O ports
– dynamic mapping of a sequential program to a small number of ALUs
– dynamic stalls for non-fast paths and mispredicted code
– speculative cache-miss handling (prefetching)
– wire delay is exposed to the software programmer
• to go from corner to corner of the processor takes 6 network hops, which corresponds to approximately 6 cycles of wire delay
• Classification of on-chip architectures based on the granularity of parallel processing elements and memories [Burger, Keckler et al., ISCA '03]
– the polymorphous TRIPS can be configured to adapt to various forms of parallelism
• data-level, instruction-level, thread-level
Parallels between Chip Multi-Processor Design and SOC Design

• Mapping applications on a tiled architecture
– resembles designing application-specific hardware circuits
• Compiler code optimizations
– similar to CAD-flow optimizations for SOC design
• Impact of interconnect latency
– balance communication and computation latencies in SOC design
– architecture (re-)configuration and application mapping must account for exposed communication latency
• Design of on-chip network– map power/performance trade-off to target
applications
Packet-Routing On-Chip Networks [Dally et al. ‘01]
• Regular tiled structure
– with an on-chip network using a two-dimensional folded torus topology
• rows/columns cyclically connected in the order 0, 2, 3, 1
• Packet routing– between any pair of tiles (not just neighbors)
• On-chip network
– implicit sharing of interconnect resources
– regular topology enables interconnect optimization (e.g., aggressive signaling circuits)
– higher bandwidth and multiple concurrent communications with respect to buses
[Diagram: a 4×4 grid of tiles addressed 00–33, connected in a folded torus.]
• Key features
– on-chip communication bandwidth is not an issue because of the many available wiring tracks
– unlike modern designs with their all-or-nothing locality, latency varies continuously with distance
– latency is controlled by placing data near their point of use, not just at their point of use
Latency Insensitive Design [ICCAD '99]

• Chip assembled using synchronous intellectual property (IP) cores exchanging data by means of point-to-point communication channels
• Channels use a simple communication protocol that tolerates arbitrary latency
• Interface logic blocks (the shells) encapsulate and "protect" the IP cores (the pearls)
• Assume-guarantee reasoning to formally verify IP cores and communication protocols in separate steps
• Recycling, a new design transformation
– an arbitrary number of sequential elements (relay stations) can be distributed on a long wire between any pair of shells to pipeline it and drive it at a higher frequency
The IP Encapsulation Approach
[Diagram: seven pearls (synchronous IP cores P1–P7), each wrapped by a shell (interface logic block), connected by channels implemented as short and long wires.]
Channel Segmentation
[Diagram: the same system, with relay stations (RS) inserted to segment the long-wire channels between the shells (interface logic blocks) wrapping the pearls P1–P7.]
LIP: Key Points
• Relax timing constraints during the early phases of the design, when correct measures of the path delays among the SOC modules are not available
• ASSUMPTION: the functionality of the design depends on the sequencing of the signals, not on their exact timing
– it is sufficient to require that IPs be stallable (via clock-gating)
• The design specification relies on the synchronous assumption; the design implementation can be synchronous, asynchronous, or GALS
The Theory of Latency Insensitive Protocols
[Diagram: synthesis transforms a strict system (processes P1, P2) into a patient system (P1', P2'); a strict signal S becomes a latency-equivalent stalled signal S':]
S = . . a . . b . . c . . d . .
S’ = . . . a . b . . . . c . . . d .
The Tagged-Signal Model [Lee & Sangiovanni '96]

• Event: a member of V × T
– V: set of values, T: set of tags
• Signal: a set of events
– s = { (v1, t1), (v2, t2), …, (vk, tk) }
• Process: a subset P of the set of N-tuples of signals
• Behavior: a tuple of signals b = (s1, s2, …, sN) which satisfies a process P
• System: a composition of processes P1, …, PM
– i.e., the intersection of their behaviors
• Synchronous events: events that have the same tag
– signals s1, s2 are synchronous if each event in s1 is synchronous with an event in s2 and vice versa
• Synchronous System: every signal in the system is synchronous with every other signal
• Timed System: the set T of tags (timestamps) is a totally ordered set.
– the ordering among the timestamps of a signal s induces a natural order on the set of events of s
Latency Insensitive Systems [Carloni et al. CAV '99]

• Latency insensitive system: a synchronous timed system with T = Z⁺ and V = Σ ∪ {τ}
– Σ is the set of informative symbols exchanged among the system processes
– τ ∉ Σ is the stalling symbol, representing the absence of an informative symbol
• Finite horizon: for each signal s there is a greatest timestamp T(s), which corresponds to the last informative event
– assume an infinite sequence of τ after T(s)
Informative Events and Stalling Events
• Example (timestamps 1 to 14, with T(s) = 14):
s = i1 i2 τ i1 i2 i3 τ i1 i2 τ τ τ i1 i5 τ τ τ τ τ τ ...
• Filtering operator:
Fi(s) = i1 i2 i1 i2 i3 i1 i2 i1 i5
Fi[3,13](s) = i1 i2 i3 i1 i2 i1
• Sequence length:
|Fi(s)| = 9
|Fi[3,13](s)| = 6
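The filtering operator is straightforward to mechanize. A minimal Python sketch (τ modeled as None, 1-based timestamps as in the example above; names are illustrative):

```python
TAU = None  # the stalling symbol τ

def filter_informative(signal, lo=None, hi=None):
    """F_i(s): drop all τ events; with lo/hi, F_i[lo,hi](s) restricts the
    signal to the events with timestamps in [lo, hi] first.
    Timestamps are 1-based, matching the slide's example."""
    events = list(enumerate(signal, start=1))
    if lo is not None:
        events = [(t, v) for t, v in events if lo <= t <= hi]
    return [v for t, v in events if v is not TAU]

# the slide's signal s over timestamps 1..14 (τ forever afterwards)
s = ["i1", "i2", TAU, "i1", "i2", "i3", TAU,
     "i1", "i2", TAU, TAU, TAU, "i1", "i5"]

print(filter_informative(s))         # F_i(s), 9 informative events
print(filter_informative(s, 3, 13))  # F_i[3,13](s), 6 informative events
```

Running this reproduces the slide's sequences and lengths, |Fi(s)| = 9 and |Fi[3,13](s)| = 6.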
Strict Signals and Stalled Signals

• Strict signal: all informative events precede all stalling events
s1 = i1 i2 i1 i2 i3 i1 i2 i1 i5 τ τ τ τ τ τ ...
• Stalled signal: a signal which is not strict
s2 = i1 i2 τ i1 i2 i3 τ i1 i2 τ τ τ i1 i5 τ τ ...
• Ordinal of an informative event ek of a signal s:
ord(ek) = |Fi[0, k](s)| − 1
Latency Equivalence

s1 ≡ s2 iff Fi(s1) = Fi(s2)

• Corresponding informative events have the same ordinals
• Example:
s1 = i1 i2 i1 i2 i3 i1 i2 i1 i5 τ τ τ τ τ τ ...
s2 = i1 i2 τ i1 i2 i3 τ i1 i2 τ τ τ i1 i5 τ τ ...
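Latency equivalence reduces to comparing the filtered signals. A minimal Python check on the two signals above (τ modeled as None, names illustrative):

```python
TAU = None  # the stalling symbol τ

def informative(signal):
    """F_i(s): the informative events of s, in order (i.e., by ordinal)."""
    return [v for v in signal if v is not TAU]

def latency_equivalent(s1, s2):
    """s1 ≡ s2 iff F_i(s1) = F_i(s2): same informative events, same ordinals."""
    return informative(s1) == informative(s2)

strict  = ["i1","i2","i1","i2","i3","i1","i2","i1","i5"]           # s1
stalled = ["i1","i2",TAU,"i1","i2","i3",TAU,"i1","i2",
           TAU,TAU,TAU,"i1","i5"]                                  # s2

print(latency_equivalent(strict, stalled))  # True
```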
Stalling

• A stall move postpones an informative event of a signal sj of a behavior b by 1 timestamp
• Example: b = (s1, s2, s3) and b' = stall(b, e3(s2)):

b:
s1 = i1 i2 i6 i2 τ i2 τ τ i1 i5 τ τ ...
s2 = i1 i3 i6 i1 τ i2 τ τ i1 i5 τ τ ...
s3 = i1 i2 i5 i4 τ i2 τ τ i1 i5 τ τ ...

b':
s1 = i1 i2 i6 i2 τ i2 τ τ i1 i5 τ τ ...
s2 = i1 i3 τ i6 i1 τ i2 τ τ i1 i5 τ τ ...
s3 = i1 i2 i5 i4 τ i2 τ τ i1 i5 τ τ ...
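On finite prefixes, a stall move is just the insertion of a τ in front of the chosen informative event, which shifts it (and everything after it) one timestamp later. A Python sketch reproducing the example above (τ as None, names illustrative):

```python
TAU = None  # the stalling symbol τ

def stall(signal, k):
    """Postpone the k-th informative event (1-based) of a signal by one
    timestamp: insert a τ just before it, shifting later events right."""
    seen = 0
    for pos, v in enumerate(signal):
        if v is not TAU:
            seen += 1
            if seen == k:
                return signal[:pos] + [TAU] + signal[pos:]
    raise ValueError("signal has fewer than k informative events")

# s2 from the slide; stalling its 3rd informative event (i6) yields s2 of b'
s2 = ["i1", "i3", "i6", "i1", TAU, "i2", TAU, TAU, "i1", "i5"]
print(stall(s2, 3))  # ['i1','i3',None,'i6','i1',None,'i2',None,None,'i1','i5']
```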
Ordering the Events

e1 ≤lo e2 iff ord(e1) < ord(e2), or ord(e1) = ord(e2) and s(e1) ≤c s(e2)

• To avoid cyclic behaviors when processing events with the same ordinal, we assume a well-founded order ≤c among the signals
– in real-life designs it corresponds to Mealy input-output relations
• The ordering among events is motivated by causality relations
– past events do not depend on future events
Procrastination Effects

• After a stall move on a signal sj, causality relations may induce procrastination effects on other signals sk of the behavior b
• Example: b' = stall(b, e3(s2)) and b'' ∈ PE[stall(b, e3(s2))]:

b':
s1 = i1 i2 i6 i2 τ i2 τ τ i1 i5 τ τ ...
s2 = i1 i3 τ i6 i1 τ i2 τ τ i1 i5 τ τ ...
s3 = i1 i2 i5 i4 τ i2 τ τ i1 i5 τ τ ...

b'':
s1 = i1 i2 i6 i2 τ i2 τ τ i1 i5 τ τ ...
s2 = i1 i3 τ i6 i1 τ i2 τ τ i1 i5 τ τ ...
s3 = i1 i2 i5 τ i4 τ i2 τ τ i1 i5 τ τ ...
Patient Processes
P is patient iff ∀b = (s1, …, sN) ∈ P, ∀j ∈ [1, N], ∀ek ∈ E(sj):
PE[stall(b, ek(sj))] ∩ P ≠ ∅

• A patient process can take stall moves on any signal of its behaviors by reacting with the appropriate procrastination effects
• Examples:
– if an input event is stalled, then some output events may be delayed
– if a downstream process requests to delay an output event (back-pressure), then future input events may be delayed
Compositionality
• For patient processes the notion of latency equivalence is compositional
• Th. 1: P1 and P2 patient ⇒ P1 ∩ P2 patient
• Th. 2: for all patient P1, Q1, P2, Q2:
P1 ≡ Q1 and P2 ≡ Q2 ⇒ (P1 ∩ P2) ≡ (Q1 ∩ Q2)
• Th. 3: for all strict P1, P2 and patient Q1, Q2:
P1 ≡ Q1 and P2 ≡ Q2 ⇒ (P1 ∩ P2) ≡ (Q1 ∩ Q2)
• Major result: if all processes in a strict system are replaced by corresponding patient processes, then the resulting system is latency equivalent to the original one
Strict Process and Channels
[Diagram: processes Pa through Pg connected by channels, among them C(1,2), C(3,5), and C(4,6).]

• A channel is a connection process C(j,k) constraining two signals to be identical:
b = (s1, …, sN) ∈ C(j,k) iff sj = sk
• A channel is NOT a patient process
– communication is based on the synchronous hypothesis
– unfortunately, the final system implementation may require more than one clock cycle to "travel" a channel
Buffers
• A buffer is a process B[c,f,b](i,j) such that:
– c: capacity
– f: minimum forward latency
– b: minimum backward latency

(s1, …, sN) ∈ B[c,f,b](i,j) iff si ≡ sj and ∀k ∈ N:
|Fi[0, k−f](si)| − |Fi[0, k](sj)| ≥ 0
|Fi[0, k](si)| − |Fi[0, k−b](sj)| ≤ c

• If (c = 0, f = 0, b = 0) then B[c,f,b](i,j) = C(i,j)
• Th. 4: (c ≥ 1, f = 1, b = 1) ⇒ B[c,f,b](i,j) is patient
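The two inequalities can be checked mechanically on finite signal prefixes. Below is a Python sketch (τ modeled as None, names illustrative) that tests a pair of signals against the definition of B[c,f,b](i,j), using the B[2,1,1] relay-station example that follows:

```python
TAU = None  # the stalling symbol τ

def n_inf(signal, upto):
    """|F_i[0,k](s)|: informative events with timestamp <= upto (0-based);
    returns 0 for negative upto, as the interval is then empty."""
    if upto < 0:
        return 0
    return sum(1 for v in signal[:upto + 1] if v is not TAU)

def is_buffer_behavior(si, sj, c, f, b):
    """Check a (si, sj) prefix pair against the definition of B[c,f,b](i,j),
    assuming both signals are τ forever after their listed prefixes."""
    if [v for v in si if v is not TAU] != [v for v in sj if v is not TAU]:
        return False                                # si ≡ sj must hold
    for k in range(max(len(si), len(sj))):
        if n_inf(si, k - f) - n_inf(sj, k) < 0:     # forward latency >= f
            return False
        if n_inf(si, k) - n_inf(sj, k - b) > c:     # at most c items in flight
            return False
    return True

# the B[2,1,1] relay-station example
si = ["i1","i2","i3",TAU,TAU,"i4","i5","i6",TAU,TAU,TAU,"i7",TAU,"i8","i9",TAU,TAU]
sj = [TAU,"i1","i2","i3",TAU,TAU,"i4",TAU,TAU,TAU,"i5","i6","i7",TAU,"i8","i9",TAU,TAU]
print(is_buffer_behavior(si, sj, c=2, f=1, b=1))  # True
```

Note that the identity pair (si, si) is rejected for f = 1, since a channel with zero forward latency is not a buffer behavior under these parameters.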
Relay Stations

• Example: B[1,1,1](i,j) with input si and output sj:
si = i1 τ i2 τ i3 τ i4 τ τ τ i5 τ i6 τ i7 τ i8 τ i9 τ τ ...
sj = τ i1 τ i2 τ i3 τ i4 τ τ τ i5 τ i6 τ i7 τ i8 τ i9 τ τ ...
• B[2,1,1](i,j) is called a relay station and is the minimum-capacity patient buffer able to "transfer" one informative unit per timestamp, thus allowing, in the best case, a maximum throughput equal to 1:
si = i1 i2 i3 τ τ i4 i5 i6 τ τ τ i7 τ i8 i9 τ τ ...
sj = τ i1 i2 i3 τ τ i4 τ τ τ i5 i6 i7 τ i8 i9 τ τ ...
Implementation of a Relay Station

[Diagram: relay station with dataIn/dataOut, stopIn/stopOut, and voidIn/voidOut ports; a main register (MainReg) and an auxiliary register (AuxReg), plus a StopReg, selected through a mux/demux pair driven by the control logic.]
• Stop signals implement the back-pressure mechanism of the protocol
• Void signals detect invalid data (stalling, or τ, events) due to unexpected latencies
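The MainReg/AuxReg structure can be approximated behaviorally. The following Python sketch (illustrative, not the actual RTL) models a relay station as a capacity-2 buffer with void and stop flags; for simplicity it assumes the producer honors stopOut in the same cycle it is raised, whereas the real circuit registers the stop signal and relies on AuxReg to absorb the in-flight datum:

```python
from collections import deque

class RelayStation:
    """Behavioral sketch of a relay station: a capacity-2 buffer with
    void (invalid data) and stop (back-pressure) flags."""
    def __init__(self):
        self.slots = deque()  # MainReg + AuxReg, modeled as a 2-place FIFO

    def clock(self, data_in, void_in, stop_in):
        # output side: forward the oldest datum unless the consumer stalls
        if self.slots and not stop_in:
            data_out, void_out = self.slots.popleft(), False
        else:
            data_out, void_out = None, True          # emit a τ event
        # input side: latch a valid incoming datum
        if not void_in:
            assert len(self.slots) < 2, "producer ignored stopOut"
            self.slots.append(data_in)
        stop_out = len(self.slots) == 2              # both registers busy
        return data_out, void_out, stop_out

rs = RelayStation()
trace = [rs.clock(*step) for step in [
    (1, False, False),    # latch i1; nothing to emit yet
    (2, False, True),     # consumer stalls: hold i1, absorb i2, raise stopOut
    (None, True, False),  # producer idles; stall released, emit i1
    (3, False, False),    # emit i2, latch i3
    (None, True, False),  # emit i3
]]
print([d for d, void, _ in trace if not void])  # [1, 2, 3]
```

The informative events come out in order and none is lost, which is the protocol property the relay station must preserve.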
Encapsulation of a Stallable Process

[Diagram: a shell process wrapping a stallable core process; ERS blocks, a stalling-signal generator, and an equalizer mediate the core's channels P1–P7.]

• Th. 5: Shell and Core are latency equivalent
Implementation of a Shell
[Diagram: shell implementation around a stallable core module; each input channel (dataIn1–dataIn3) feeds a queue (Queue 1–Queue 3) with matching voidIn/stopOut signals, and the control logic drives dataOut and voidOut while reacting to stopIn.]
• min queue length = 2
• min queue latency = 0
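A behavioral sketch of the shell (illustrative Python, not the actual RTL): per-input queues latch informative events with zero latency, and the core fires only when the equalizer sees data on every input and the downstream consumer is not stalling. The 2-input adder core here is a hypothetical stand-in for an arbitrary stallable core:

```python
from collections import deque

class Shell:
    """Behavioral sketch of a shell wrapping a stallable 2-input core.
    Fires the core only when every input queue holds an informative
    event and the consumer is not stalling; otherwise emits τ."""
    def __init__(self, core):
        self.core = core
        self.q = [deque(), deque()]

    def clock(self, ins, voids, stop_in):
        for i in range(2):                 # latch valid inputs (latency 0)
            if not voids[i]:
                self.q[i].append(ins[i])
        if all(self.q) and not stop_in:    # equalizer: all inputs present
            out = self.core(self.q[0].popleft(), self.q[1].popleft())
            void_out = False
        else:
            out, void_out = None, True     # stall the core, emit τ
        stops = [len(qi) >= 2 for qi in self.q]  # per-input back-pressure
        return out, void_out, stops

sh = Shell(lambda a, b: a + b)
# channel 1 delivers 1, τ, 2, 3 while channel 2 delivers 10, 20, τ, 30
streams = [((1, 10), (False, False)), ((None, 20), (True, False)),
           ((2, None), (False, True)), ((3, 30), (False, False))]
outs = [sh.clock(d, v, stop_in=False) for d, v in streams]
print([o for o, void, _ in outs if not void])  # [11, 22, 33]
```

Even though the two input streams are misaligned by τ events, the shell pairs up informative events by ordinal, so the core sees the same sequence of inputs as in the strict system.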
The IP Encapsulation Approach - reprise
[Diagram: the same picture, reinterpreted in terms of the theory: the pearls P1–P7 are strict processes, the shells are patient processes, and the channels are short and long wires.]
Channel Segmentation - reprise
[Diagram: relay stations (RS) segment the long-wire channels between the shells (patient processes) wrapping the pearls P1–P7 (strict processes).]
Remarks
• RTL design, layout and routing of individual blocks do not need to be changed to reflect any necessary changes in wire latencies during the chip-level layout
• Obviously the final result is satisfactory only to the extent that a sufficient throughput can be maintained after increasing channel latencies
Correct-by-Construction Design
1. Design and validate the chip as a collection of synchronous modules
2. Encapsulate each module within a shell to make it latency-insensitive
3. Apply traditional logic synthesis and place & route
4. Insert relay stations to meet the clock cycle
Case Study: PDLX Microprocessor
• Complete latency-insensitive design of PDLX, an out-of-order microprocessor with speculative execution
– ISA: same as Hennessy & Patterson's DLX
– PDLX architecture: based on an extended version of Tomasulo's algorithm, which combines traditional dynamic scheduling with hardware-based speculative execution
PDLX Architecture
[Block diagram, shown in successive refinement steps: the PDLX datapath comprises a Fetch Unit with I-Cache and MMU(i), a Branch Processing Unit, Decode and Dispatch Units, reservation stations (res) feeding two Load Units, two ALUs, an FP Unit, and two Store Units backed by the D-Cache and MMU(d), plus a Completion Unit (reorder buffer) and a System Register Unit; relay stations (RS) pipeline the long channels between the units.]
PDLX: IP Encapsulation
[Block diagram: the same PDLX architecture after IP encapsulation, with each unit wrapped by a shell and relay stations (RS) on the long channels.]
PDLX: Experimental Framework
[Flow diagram: an architecture spec and a channel-latency spec yield a particular PDLX implementation; a C program (binary search, permutations) is run through the DLX compiler to produce DLX assembly code, which drives both the PDLX implementation and the DLX simulator; the two logs of memory accesses are compared in a latency-equivalence test.]
PDLX: Performance Analysis
[Charts: throughput (0–0.8) and effective throughput (0–1.4) of three PDLX implementations (PDLX 1, PDLX 2, PDLX 3) across channel-latency configurations L.0.0 through L.5.2.]
Moving Around the Latency
[Diagram: a graph with source S, sink T, and nodes V1–V15; two relay stations (RS) inserted on channel C6 create a critical cycle with throughput thp(G) = 10 / (10 + 2) = 0.83.]

• Relay stations can be pushed around without the need to redesign any component, yielding a 38% performance gain in this example
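The critical-cycle figure can be computed directly: a cycle carrying n informative tokens that picks up r relay stations sustains at most n / (n + r) informative events per clock cycle. A one-line Python sketch using the slide's numbers:

```python
# Sustainable throughput of a cycle with n stages and r inserted relay stations
def cycle_throughput(n_stages, n_relay_stations):
    return n_stages / (n_stages + n_relay_stations)

# the slide's critical cycle: 10 stages, 2 relay stations
print(round(cycle_throughput(10, 2), 2))  # 0.83
```

The system throughput is the minimum of this ratio over all cycles, which is why moving relay stations off the critical cycle recovers performance.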
LIP Advantages

• By orthogonalizing communication and computation, latency-insensitive design addresses
– the timing closure problem
• it enables wire pipelining among sequential elements, regardless of feedback loops, thanks to the automatic synthesis of interface logic
– the productivity gap problem
• complex systems can be built by assembling pre-designed and pre-verified blocks, regardless of inter-communication latency
• Formal methods to build systems robust with respect to arbitrary latency variations
– on-chip communication/computation latencies can be rebalanced up to late phases of the design process by moving latency around across computational elements
– protocol to power down components (via stalling events)