clockless logic montek singh tue, apr 6, 2004. case study: an adaptively-pipelined mixed...
Post on 22-Dec-2015
Clockless Logic
Montek Singh
Tue, Apr 6, 2004
Case Study: An Adaptively-Pipelined Mixed Synchronous-Asynchronous System
Montek Singh, Univ. of North Carolina at Chapel Hill
Jose Tierno, Alexander Rylyakov and Sergey Rylov, IBM TJ Watson Research Center
Steven M. Nowick, Columbia University
3
Motivation
Read Channel Filters: key components of disk drives
  used in: hard disks, CD/DVD drives, IBM microdrives …
  used for: removing “inter-symbol interference”
    adjacent bits of data overlap due to dispersion of read pulses
[Figure: read head pickup signal vs. time for successive bits, in two panels: "10 years ago" and "Today"]
High-speed high-density drives: require high-speed high-resolution filters
4
Challenges
High data rates: ~1 Giga items/second
  To handle high disk rotational speeds (up to 10,000 RPM)
Complex filter architectures
  To handle high disk storage densities
Short design cycle/low designer effort
  To target consumer electronics ($5 per chip or less)
  To allow ease of migration to new silicon technology
Variable clock-rate operation
  To handle variable input data rates
    innermost tracks produce 1/5th data rate of outermost tracks!
Very low latency
  Filter is part of a tight feedback loop which aligns clock frequency and phase with input data
5
Contribution
A fabricated real-world read channel filter chip:
Provides high-speed operation (over 1.3 Giga items/sec)
  asynchronous portion estimated capable of 1.8 Giga items/sec
Adaptively-pipelined: provides variable “pipelining depth”
  behaves as a deep pipeline (7 clocked stages) for high input rates
  behaves as a shallow pipeline (4 clocked stages) at lowest rates
  latency can be kept low at all input rates!
Provides clocked interfaces
  can be embedded into a synchronous environment
Fairly low power consumption (400 mW at 1 GHz)
Easy to implement
  large fraction of design uses standard library components
  automated placement and routing
  no post-layout tweaking needed
8
Outline
Background
  Read Channel Filters
  High-Capacity (HC) Asynchronous Pipeline Style
New Filter Design
  Implementation
  Operation
  Performance Analysis
Layout and Fabrication
Experimental Results
Conclusions
9
Background: Read Channel Filters
Read channel filter: finite impulse response (FIR) filter
  output determined by a finite history of inputs
An “N-tap” filter computes the weighted sum:
  $y(k) = h_0 x(k) + h_1 x(k-1) + \dots + h_{N-1} x(k-N+1)$,
where:
  $x(k), \dots, x(k-N+1)$ is the input sequence
  $h_0, \dots, h_{N-1}$ are constant tap weights
[Figure: FIR filter block diagram with data input, shift reg, tap weights, compute engine, and data output]
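The weighted sum above can be written out directly. The following is a minimal sketch, not the chip's implementation: the function name `fir_filter` and the example tap weights are made up for illustration, and samples before the start of the input are assumed to be zero.

```python
def fir_filter(samples, taps):
    """Return y(k) = sum_i taps[i] * x(k - i) for each input sample.

    Samples before the start of the sequence are treated as 0 (assumption).
    """
    history = [0] * len(taps)          # shift register, newest sample first
    outputs = []
    for x in samples:
        history = [x] + history[:-1]   # shift in the new sample, drop the oldest
        outputs.append(sum(h * v for h, v in zip(taps, history)))
    return outputs


# Hypothetical 3-tap example: a single nonzero sample replays the tap weights.
example_output = fir_filter([1, 0, 0, 4], [1, 2, 3])
```

The `history` list plays the role of the shift register in the block diagram, and the `sum(...)` line is the compute engine.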
10
Background: HC Pipeline Style
High-Capacity Pipelines (HC) [Singh/Nowick WVLSI-00]
  bundled datapaths; dynamic logic function blocks
  latch-free: no explicit latches needed
    dynamic logic provides implicit latching
  novel highly-concurrent protocol maximizes storage capacity
    traditional latch-free approaches: “spacers” limit capacity to 50%
Key Idea: Obtain greater control of stage’s operation
  separate control of pull-up/pull-down
  result = new “isolate phase”
    stage holds outputs/impervious to input changes
Advantage: Each stage can hold a distinct data item
  100% storage capacity
Extra Benefit: Obtain greater concurrency
  High throughput
11
HC: Basic Structure
Key Idea: 2 independent control signals:
  pc: controls precharge
  eval: controls evaluation
Allows novel 3-phase cycle:
  Evaluate
  “Isolate” (hold)
  Precharge
Single-rail “Bundled Datapath”:
  matched delay: produces delayed “done” signal
  worst-case delay: longer than slowest path for data
[Figure: pipeline stages N, N+1, N+2, each with a stage controller driving pc and eval, a matched delay element, and ack signals between stages]
12
HC: Inside a Stage
Independent Controls of pull-up and pull-down:
  allows new 3rd phase: “isolate”
pc asserted: precharge
eval asserted: evaluate
pc and eval de-asserted: enter “isolate” (hold) phase
[Figure: dynamic gate with inputs, outputs and a “keeper”; pc controls precharge, eval controls evaluation]
13
HC: Protocol
Most Existing Protocols: 3 synchronization arcs
  1 forward arc: data dependency
  2 backward arcs: control synchronization
Our protocol: only 2 synchronization arcs
  only 1 backward arc
  once stage N+1 evaluates, N can complete entire next cycle!
[Figure: phase sequence Eval (pc=1, eval=1), Isolate (pc=1, eval=0), Precharge (pc=0, eval=0) drawn for Stage N and Stage N+1, with one arc between the stages marked X]
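The three-phase cycle and its single backward arc can be sketched as a small state machine for stage N. This is an illustrative model only: the names `Phase`, `CONTROLS`, `next_phase` and `trace` are invented here, the forward (data) arc from stage N-1 is left out, and reading pc=0 as "precharge asserted" is an inference from the control values shown on the slide.

```python
from enum import Enum


class Phase(Enum):
    EVALUATE = "evaluate"
    ISOLATE = "isolate"
    PRECHARGE = "precharge"


# (pc, eval) per phase, as annotated on the slide. Treating pc as
# active-low (pc=0 means precharge) is an assumption.
CONTROLS = {
    Phase.EVALUATE: (1, 1),
    Phase.ISOLATE: (1, 0),
    Phase.PRECHARGE: (0, 0),
}


def next_phase(phase, successor_evaluated):
    """Advance stage N by one phase under the single-backward-arc rule."""
    if phase is Phase.EVALUATE:
        return Phase.ISOLATE        # self-loop: no wait on neighbouring stages
    if phase is Phase.ISOLATE:
        # The one backward arc: hold data until stage N+1 has evaluated.
        return Phase.PRECHARGE if successor_evaluated else Phase.ISOLATE
    return Phase.EVALUATE           # precharge done: ready for the next item


def trace(steps, successor_evaluates_at):
    """Record stage N's phases; stage N+1 counts as evaluated from the given step on."""
    phase = Phase.EVALUATE
    history = [phase]
    for t in range(steps):
        phase = next_phase(phase, successor_evaluated=(t >= successor_evaluates_at))
        history.append(phase)
    return history
```

Calling `trace(4, 2)` follows one cycle: stage N evaluates, isolates, waits in isolate until stage N+1 has evaluated, then precharges and re-enters evaluate without any further synchronization with stage N+1. The model covers a single cycle only, since it never resets the successor's "evaluated" flag.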
14
HC: Stage Implementation
[Figure: stage controller circuit with req and ack inputs, a done output, and pc and eval control outputs, built from a NAND gate, an inverter (INV), an element marked "+", and a matched delay. Annotations: state variable is off the critical path; self-loop from current stage is key to fast “isolation”; early ack from next stage]
15
HC: Operation
[Figure: operation of stages N and N+1 in three numbered steps. Labelled events: N evaluates; N+1 starts to evaluate; N isolates; N precharges; N enables itself for next evaluation. Annotations: fast self-loop (twice), early ack]
Cycle Time = 8 CMOS gate delays
17
Filter Architecture: Overview
[Figure: filter datapath with input shift-reg, bit slice, table lookup, mux, partial sums, carry save adder, and carry look-ahead adder; bus-width labels 10w, 6, 18, 108, 25, 15]
18
Filter Architecture (contd.)
Distributed Arithmetic Architecture:
  Bit-slice the weighted sum computation
    each x is a 6-bit value; compute 6 partial sums in parallel
  Precompute all partial sums, store in registers/memory
Problem: Lookup table can get quite big…
  e.g., for 10-tap filter, all addresses are 10-bit words
    1024-word memory
Solution: Use two techniques to reduce table size:
  Partitioning: split data into odd and even interleaves
    each interleave generates 5-bit addresses
    memory requirement drops to two 32-word tables
  Exploit Symmetry: use signed-digit offset binary notation
    “1” means +1, while “0” means -1
    makes lookup table symmetric, allowing half to be discarded
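The table-lookup scheme can be illustrated with a short sketch. It is one possible reading of the slide, not the chip's design: the helper names, tap weights and sample values are hypothetical, the odd/even partitioning is modelled as splitting the taps by index, inputs are treated as unsigned 6-bit integers, and the six bit slices are processed in a loop rather than in parallel.

```python
def build_table(weights, signed_digits=False):
    """Precompute the partial sum for every address (one address bit per tap).

    With signed_digits=True, a 1 bit contributes +w and a 0 bit contributes -w.
    """
    table = []
    for address in range(1 << len(weights)):
        total = 0
        for i, w in enumerate(weights):
            bit = (address >> i) & 1
            if signed_digits:
                total += w if bit else -w
            else:
                total += w * bit
        table.append(total)
    return table


def address_for(bits):
    """Pack a list of 0/1 values into a table address, bit i taken from bits[i]."""
    return sum(bit << i for i, bit in enumerate(bits))


def da_output(window, taps, width=6):
    """Weighted sum of window[i] = x(k-i) using two half-size lookup tables."""
    even_table = build_table(taps[0::2])   # 2**5 = 32 entries for 10 taps
    odd_table = build_table(taps[1::2])
    result = 0
    for b in range(width):                 # one bit slice per input bit
        slice_bits = [(x >> b) & 1 for x in window]
        partial = (even_table[address_for(slice_bits[0::2])]
                   + odd_table[address_for(slice_bits[1::2])])
        result += partial * (1 << b)       # weight the slice by 2**b
    return result


taps = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]          # hypothetical tap weights
window = [17, 42, 0, 63, 8, 21, 35, 50, 1, 12]    # hypothetical 6-bit samples
direct = sum(h * x for h, x in zip(taps, window))  # reference weighted sum
signed_table = build_table(taps[0::2], signed_digits=True)
```

`da_output(window, taps)` matches `direct`, and each of the two tables holds 32 entries instead of 1024. In `signed_table`, complementing every address bit negates the entry, which is the symmetry that lets half of the table be discarded.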
19
Filter Architecture (contd.)
[Figure: filter pipeline from clocked latch L1 (sync-async interface) through Table Lookup, XOR (restores sign bit to partial sums), CSA Stages 1 to 5 and CLA Stages 1 to 3 to clocked latch L2 (async-sync interface). The asynchronous portion between the clocked ends is pipelined using the High-Capacity style; labels include "domino latch" and a dynamic/static legend]
20
Asynchronous Pipelined Portion
Challenge: Providing Adequate Control Buffering
[Figure: chain of pipeline stages linked by req and ack signals]
21
Asynchronous Pipelined Portion (cont.)
Optimization: Eliminating Buffer Delays
[Figure: chain of pipeline stages linked by req and ack signals]
22
FIR Filter: Sync-Async Interfaces
[Figure: pipeline from latch L1 (domino) through Table Lookup, XOR, CSA Stages 1 to 5 and CLA Stages 1 to 3 to latch L2. Signal labels: Clk, Clk’, req, ack (one marked X), data, and a programmed delay (shift-reg)]
24
Filter Operation
Performance Goals:
  operation desired over wide range of clock frequencies
    input data rate to a read channel can vary greatly
    data rate varies as the read head moves from innermost to outermost track
    variation up to factor of 5!
  low filter latency required at all clock frequencies
    filter is part of closed feedback loop (“clock recovery loop”)
    low loop latency critical to accurate alignment of clock w.r.t. data
Challenge:
  purely synchronous pipeline cannot easily satisfy above goals
    deep pipeline design required to meet highest data rates…
    … but: deep pipelining implies long clock cycle latency
    at lowest data rates: long clock cycle latency is unacceptable
25
Our Solution: Adaptive Pipelining
Key Idea: exploit constant ns latency of asynchronous pipelines
  behaves similar to a clocked pipeline with variable depth
[Figure: asynchronous pipeline between a clocked input side and a clocked output side, shown for a slow speed scenario and a high speed scenario]
Benefit:
  filter appears to the external clocked environment as a clocked pipeline with variable depth
  obtain 1 clock cycle latency at lowest data rates
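A rough way to see why a fixed nanosecond latency turns into a variable depth is to count how many clock periods it spans. The sketch below is a simplified model under loud assumptions: the 3.5 ns asynchronous latency and the three fixed clocked stages are illustrative values picked so the result lines up with the 7-stage and 4-stage behaviour quoted in the Contribution slide. They are not measurements from the chip, and setup and interface timing are ignored.

```python
import math

ASSUMED_ASYNC_LATENCY_NS = 3.5   # hypothetical, not a measured value
ASSUMED_FIXED_STAGES = 3         # hypothetical count of always-clocked stages


def latency_in_cycles(async_latency_ns, clock_mhz, fixed_clocked_stages=0):
    """Clock cycles seen by the environment for a fixed-latency async section.

    The result is captured on the first clock edge after the data arrives,
    so the asynchronous part costs ceil(latency / period) cycles.
    """
    period_ns = 1000.0 / clock_mhz
    return fixed_clocked_stages + math.ceil(async_latency_ns / period_ns)


fast = latency_in_cycles(ASSUMED_ASYNC_LATENCY_NS, 1000, ASSUMED_FIXED_STAGES)
slow = latency_in_cycles(ASSUMED_ASYNC_LATENCY_NS, 200, ASSUMED_FIXED_STAGES)
```

With these assumed numbers the asynchronous section spans four cycles at 1000 MHz and a single cycle at 200 MHz (a factor-of-5 slower clock), so the same hardware looks like a 7-stage pipeline at the high rate and a 4-stage pipeline at the low rate.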
26
Comparison: Adaptive vs. Wave Pipelining
Similarity:
  Both allow variable number of data tokens in the datapath
  Both allow interfacing with clocked environments
… But: significant differences:
  Wave pipelining: requires much designer effort
    at all levels of design: from architectural down to layout level
    needs accurate balancing of path delays (incl. data dependent)
    vulnerable to process, temperature and voltage variations
    cannot handle varying input/output data rates
  Adaptive pipelining: significantly more robust
    uses robust handshake protocol between stages
    is elastic: can handle stalls, congestion, etc.
27
Performance Analysis
Key Result: Filter behaves similar to a self-timed ring
  … with 9 ½ stages!
[Figure: clocked input/output closing the pipeline into a self-timed ring; plot of throughput vs. tokens in pipeline (axis marks 0, 1, 2, N) with levels 1/TF, 1/TB and 1/TC and the reachable throughput marked]
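The shape of the throughput-versus-tokens plot can be sketched with the usual occupancy model for self-timed rings. This is an assumption-laden illustration: the slide does not define TF, TB or TC, so they are read here as the total forward latency around the ring, the total backward (hole) latency, and the per-stage cycle time, and all numeric values are hypothetical.

```python
def ring_throughput(tokens, stages=9.5, t_forward=3.0, t_backward=3.0, t_cycle=1.0):
    """Items per ns for a ring holding `tokens` data items (simplified model).

    Few tokens: limited by how fast data travels forward  -> tokens / t_forward.
    Many tokens: limited by how fast holes travel backward -> (stages - tokens) / t_backward.
    Either way, never faster than one item per stage cycle time.
    All timing parameters are hypothetical, in ns.
    """
    data_limited = tokens / t_forward
    hole_limited = (stages - tokens) / t_backward
    return max(0.0, min(data_limited, hole_limited, 1.0 / t_cycle))


# Throughput for 0..9 tokens: rises, flattens at 1/t_cycle, then falls.
curve = [ring_throughput(k) for k in range(10)]
```

Under this model the curve rises while the ring is data-limited, stays flat at the cycle-time limit, and drops again once too few empty stages remain. The clock rate fixes how many tokens are in flight, which places the filter at a point on this curve.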
29
Layout and Fabrication
Technology:
  IBM’s CMOS-7SF technology with Cu interconnect
  0.18 micron process, 5 metal layers, and 1.8V supply
Layout: part standard-cell, part full-custom
  entire clocked portion: standard-cell
  asynchronous datapath: full-custom dynamic gates
  asynchronous control: standard-cell for basic gates, full-custom for C- and aC-elements
Placement and Routing (P&R): fully automated using the Silicon Ensemble tool
  chip partitioned into 8 parts, each P&R’ed automatically
  top-level P&R also automated
No resizing of gates performed after P&R
30
Results: Throughput
Over 1.3 Giga items/second
[Figure: throughput (Mega items/sec, axis 0 to 1200) vs. number of tokens (0 to 10), one curve each at 1.4V, 1.6V, 1.8V and 2.0V]
31
Results: Power Consumption
Less than 500 mW at 1 Giga items/sec
[Figure: power consumption (mW, axis 0 to 500) vs. number of tokens (0 to 10), one curve each at 1.4V, 1.6V, 1.8V and 2.0V]
32
Conclusions
Designed, fabricated and tested a real-world FIR filter:
  Hybrid synchronous-asynchronous design
  Exhibits adaptive pipelining
    variable number of tokens in the datapath
    enables low clock cycle latency operation at all frequencies
  Exceeds all performance specifications:
    obtains throughput over 1.3 GigaHertz
      15% faster than best existing read channel filter
      asynchronous portion estimated capable of up to 1.8 Giga items/sec
    obtains latency as low as 4 clock cycles
  Testable
  Required low designer effort:
    Layout: mostly using library components
    Placement and routing: fully automated