clockless logic montek singh tue, apr 6, 2004. case study: an adaptively-pipelined mixed...

Clockless Logic

Montek Singh
Tue, Apr 6, 2004

Case Study: An Adaptively-Pipelined Mixed Synchronous-Asynchronous System

Montek Singh
Univ. of North Carolina at Chapel Hill

Jose Tierno, Alexander Rylyakov and Sergey Rylov
IBM TJ Watson Research Center

Steven M. Nowick
Columbia University

Motivation

Read Channel Filters: key components of disk drives
  used in: hard disks, CD/DVD drives, IBM microdrives …
  used for: removing “inter-symbol interference”
    adjacent bits of data overlap due to dispersion of read pulses

[Figure: read head pickup signal vs. time for successive bits, 10 years ago and today]

High-speed high-density drives: require high-speed high-resolution filters

Challenges

High data rates: ~1 Giga items/second
  To handle high disk rotational speeds (up to 10,000 RPM)
Complex filter architectures
  To handle high disk storage densities
Short design cycle/low designer effort
  To target consumer electronics ($5 per chip or less)
  To allow ease of migration to new silicon technology
Variable clock-rate operation
  To handle variable input data rates
    innermost tracks produce 1/5th data rate of outermost tracks!
Very low latency
  Filter is part of a tight feedback loop which aligns clock frequency and phase with input data

Contribution

A fabricated real-world read channel filter chip:
  Provides high-speed operation (over 1.3 Giga items/sec)
    asynchronous portion estimated capable of 1.8 Giga items/sec
  Adaptively-pipelined: provides variable “pipelining depth”
    behaves as a deep pipeline (7 clocked stages) for high input rates
    behaves as a shallow pipeline (4 clocked stages) at lowest rates
    latency can be kept low at all input rates!
  Provides clocked interfaces
    can be embedded into a synchronous environment
  Fairly low power consumption (400 mW at 1 GHz)
  Easy to implement
    large fraction of design uses standard library components
    automated placement and routing
    no post-layout tweaking needed

Outline

Background
  Read Channel Filters
  High-Capacity (HC) Asynchronous Pipeline Style
New Filter Design
  Implementation
  Operation
  Performance Analysis
Layout and Fabrication
Experimental Results
Conclusions

Background: Read Channel Filters

Read channel filter:
  finite impulse response (FIR) filter
    output determined by a finite history of inputs

An “N-tap” filter computes the weighted sum:

  $y(k) = h_0 x(k) + h_1 x(k-1) + \dots + h_{N-1} x(k-N+1)$,

where:
  $x(k), \dots, x(k-N+1)$ is the input sequence
  $h_0, \dots, h_{N-1}$ are constant tap weights

[Figure: block diagram with data input, shift reg, tap weights, compute engine and data output]
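The weighted sum above can be sketched in a few lines of Python. This is only an illustration of the N-tap formula, not the chip's implementation; the names `fir_output` and `fir_filter` and the zero-initialised shift register are assumptions made for the example.

```python
def fir_output(weights, history):
    """Weighted sum y(k) = sum_i h_i * x(k-i); history[0] holds x(k)."""
    if len(weights) != len(history):
        raise ValueError("need exactly one input sample per tap weight")
    return sum(h * x for h, x in zip(weights, history))


def fir_filter(weights, samples):
    """Run an N-tap FIR over samples, shifting each new sample into a register.

    Assumption: the shift register starts out filled with zeros.
    """
    shift_reg = [0] * len(weights)
    outputs = []
    for x in samples:
        # Newest sample enters at the front, oldest sample falls off the end.
        shift_reg = [x] + shift_reg[:-1]
        outputs.append(fir_output(weights, shift_reg))
    return outputs
```

Feeding a single 1 followed by zeros returns the tap weights in order, which is the "finite impulse response" that gives the filter its name.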

Background: HC Pipeline Style

High-Capacity Pipelines (HC) [Singh/Nowick WVLSI-00]
  bundled datapaths; dynamic logic function blocks
  latch-free: no explicit latches needed
    dynamic logic provides implicit latching
  novel highly-concurrent protocol maximizes storage capacity
    traditional latch-free approaches: “spacers” limit capacity to 50%

Key Idea: Obtain greater control of stage’s operation
  separate control of pull-up/pull-down
  result = new “isolate phase”
    stage holds outputs/impervious to input changes

Advantage: Each stage can hold a distinct data item
  100% storage capacity

Extra Benefit: Obtain greater concurrency
  High throughput

HC: Basic Structure

Key Idea: 2 independent control signals:
  pc: controls precharge
  eval: controls evaluation

Allows novel 3-phase cycle:
  Evaluate
  “Isolate” (hold)
  Precharge

Single-rail “Bundled Datapath”:
  matched delay: produces delayed “done” signal
  worst-case delay: longer than slowest path for data

[Figure: stages N, N+1 and N+2 with stage controller, pc and eval signals, ack, and delay elements]
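The bundled-data timing requirement above (the matched delay must be longer than the slowest data path) can be written as a simple check. This is a sketch under stated assumptions: the function name, the optional `margin_ps` parameter and any picosecond figures passed in are hypothetical and do not come from the design.

```python
def matched_delay_ok(path_delays_ps, matched_delay_ps, margin_ps=0.0):
    """True if the matched delay is strictly longer than the slowest data path.

    margin_ps is an assumed extra safety margin; it is not part of the slides.
    """
    if not path_delays_ps:
        raise ValueError("need at least one data path delay")
    return matched_delay_ps > max(path_delays_ps) + margin_ps
```

For example, with hypothetical path delays of 120, 180 and 150 ps, a 200 ps matched delay passes the check while a 170 ps one does not.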

HC: Inside a Stage

Independent Controls of pull-up and pull-down:
  allows new 3rd phase: “isolate”
  pc asserted: precharge
  eval asserted: evaluate
  pc and eval de-asserted: enter “isolate” (hold) phase

[Figure: dynamic gate with inputs, outputs and a “keeper”; eval controls evaluation, pc controls precharge]
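The three phases can be captured in a small behavioural model. This is a minimal sketch, not a circuit description. It assumes that pc and eval are abstract "asserted" flags, that asserting both at once is disallowed (the slides do not say), and that `None` stands for the precharged state. The names `Phase`, `stage_phase` and `Stage` are invented for the example.

```python
from enum import Enum


class Phase(Enum):
    PRECHARGE = "precharge"
    EVALUATE = "evaluate"
    ISOLATE = "isolate"


def stage_phase(pc, eval_):
    """Map the two control signals onto the stage's phase."""
    if pc and eval_:
        # Assumption: both controls asserted together is treated as an error.
        raise ValueError("pc and eval must not be asserted together")
    if pc:
        return Phase.PRECHARGE
    if eval_:
        return Phase.EVALUATE
    return Phase.ISOLATE


class Stage:
    """Behavioural stand-in for one dynamic-logic stage."""

    def __init__(self, func):
        self.func = func
        self.output = None  # None models the precharged (reset) output

    def step(self, pc, eval_, inputs):
        phase = stage_phase(pc, eval_)
        if phase is Phase.PRECHARGE:
            self.output = None
        elif phase is Phase.EVALUATE:
            self.output = self.func(inputs)
        # ISOLATE: output is held and input changes are ignored.
        return self.output
```

In the isolate phase the model returns the previously evaluated value whatever `inputs` contains, which is the property that lets every stage hold a distinct data item.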

HC: Protocol

Most Existing Protocols: 3 synchronization arcs
  1 forward arc: data dependency
  2 backward arcs: control synchronization

Our protocol: only 2 synchronization arcs
  only 1 backward arc
  once stage N+1 evaluates, N can complete entire next cycle!

[Figure: Eval, Isolate and Precharge phases of Stage N and Stage N+1 with the synchronization arcs between them; labels pc=1 and eval=1; one arc marked X]

HC: Stage Implementation

[Figure: stage controller with NAND, INV and delay elements; signals req, done, ack, pc and eval]
  state variable: off the critical path
  self-loop: key to fast “isolation”
  from current stage
  from next stage
  early ack

HC: Operation

[Figure: stages N and N+1 with steps numbered 1 to 3; arcs labelled “fast self-loop” (twice) and “early Ack”]
  N evaluates
  N+1 starts to evaluate
  N isolates
  N precharges
  N enables itself for next evaluation

Cycle Time = 8 CMOS gate delays
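A cycle time expressed in gate delays converts to throughput once a per-gate delay is chosen. The sketch below assumes a uniform gate delay and one data item per cycle. Both function names are hypothetical, and the gate delay is an input to the sketch, not a figure from the slides.

```python
def cycle_time_ps(gate_delay_ps, gate_delays_per_cycle=8):
    """Cycle time in picoseconds for a uniform (assumed) gate delay."""
    return gate_delays_per_cycle * gate_delay_ps


def throughput_giga_items_per_sec(gate_delay_ps, gate_delays_per_cycle=8):
    """One item per cycle: 1000 ps per ns divided by the cycle time in ps."""
    return 1000.0 / cycle_time_ps(gate_delay_ps, gate_delays_per_cycle)
```

With an illustrative 100 ps gate delay the cycle is 800 ps, which is 1.25 Giga items/sec. Real gates differ in delay, so this is a rough unit-delay estimate only.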

Filter Architecture: Overview

[Figure: datapath block diagram with input shift-reg, bit slice, table lookup, mux, partial sums, carry save adder and carry look-ahead adder; bus labels 108, 10w, 18, 6, 25, 15]

Filter Architecture (contd.)

Distributed Arithmetic Architecture:
  Bit-slice the weighted sum computation
    each x is a 6-bit value; compute 6 partial sums in parallel
  Precompute all partial sums, store in registers/memory

Problem: Lookup table can get quite big…
  e.g., for 10-tap filter, all addresses are 10-bit words
    1024-word memory

Solution: Use two techniques to reduce table size:
  Partitioning: split data into odd and even interleaves
    each interleave generates 5-bit addresses
    memory requirement drops to two 32-word tables
  Exploit Symmetry: use signed-digit offset binary notation
    “1” means +1, while “0” means -1
    makes lookup table symmetric, allowing half to be discarded
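The bit-slicing, partitioning and symmetry ideas above can be illustrated with a short Python sketch. It is an interpretation under stated assumptions, not the chip's datapath. Each address bit is taken from one sample and read as a signed digit (1 means +1, 0 means -1). "Odd and even interleaves" is assumed to mean splitting the taps by index parity. All function names are hypothetical.

```python
def build_table(weights):
    """Precompute the partial sum for every address (one address bit per tap)."""
    return [
        sum(w if (addr >> i) & 1 else -w for i, w in enumerate(weights))
        for addr in range(1 << len(weights))
    ]


def signed_digit_value(x, bits=6):
    """Value of a word whose bit b contributes +2**b if set and -2**b if clear."""
    return sum((1 if (x >> b) & 1 else -1) << b for b in range(bits))


def da_weighted_sum(weights, samples, bits=6):
    """Bit-sliced weighted sum: one table lookup per bit position."""
    table = build_table(weights)
    total = 0
    for b in range(bits):
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> b) & 1) << i   # gather bit b of every sample
        total += table[addr] << b          # weight the partial sum by 2**b
    return total


def partitioned_sum(weights, samples, bits=6):
    """Same result from two smaller tables (even-index and odd-index taps)."""
    return (da_weighted_sum(weights[0::2], samples[0::2], bits)
            + da_weighted_sum(weights[1::2], samples[1::2], bits))
```

For 10 taps the single table has `1 << 10` = 1024 entries, while each 5-tap interleave needs `1 << 5` = 32. With signed digits, complementing every address bit negates the entry, so only half of each table has to be stored and the sign can be restored afterwards.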

Filter Architecture (contd.)

[Figure: filter pipeline: L1 domino latch, Table Lookup, XOR, CSA Stages 1 to 5, CLA Stages 1 to 3, L2]
  Region labels: Clocked, Asynchronous, Sync-Async interface, Async-Sync interface, dynamic, static
  Annotations: “Restores sign bit to partial sums”; “Pipelined using the High-Capacity Style”

Asynchronous Pipelined Portion

Challenge: Providing Adequate Control Buffering

[Figure: pipeline stages with req and ack control signals]

Asynchronous Pipelined Portion (cont.)

Optimization: Eliminating Buffer Delays

[Figure: pipeline stages with req and ack control signals]

FIR Filter: Sync-Async Interfaces

[Figure: filter pipeline (L1 domino, Table Lookup, XOR, CSA Stages 1 to 5, CLA Stages 1 to 3, L2) with interface labels Clk, Clk’, req, ack (one marked X), data, and programmed delay (shift-reg)]

Filter Operation

Performance Goals:
  operation desired over wide range of clock frequencies
    input data rate to a read channel can vary greatly
    data rate varies as the read head moves from innermost to outermost track
    variation up to factor of 5!
  low filter latency required at all clock frequencies
    filter is part of closed feedback loop (“clock recovery loop”)
    low loop latency critical to accurate alignment of clock w.r.t. data

Challenge:
  purely synchronous pipeline cannot easily satisfy above goals
    deep pipeline design required to meet highest data rates…
    … but: deep pipelining implies long clock cycle latency
    at lowest data rates: long clock cycle latency is unacceptable

Our Solution: Adaptive Pipelining

Key Idea: exploit the constant latency (in ns) of asynchronous pipelines
  behaves similarly to a clocked pipeline with variable depth

[Figure: asynchronous pipeline between clocked input side and clocked output side; slow speed scenario and high speed scenario]

Benefit: filter appears to the external clocked environment as a clocked pipeline with variable depth
  obtain 1 clock cycle latency at lowest data rates
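The variable-depth effect follows from dividing a fixed latency in nanoseconds by a changing clock period. The sketch below illustrates that arithmetic only. The function name is hypothetical, and any latency value passed in is an assumption of the example, not a measurement from the chip.

```python
import math


def latency_in_cycles(async_latency_ns, clock_ghz):
    """Clock cycles spanned by a fixed asynchronous latency.

    clock_ghz is cycles per nanosecond, so latency_ns * clock_ghz is the
    latency measured in clock periods; it is rounded up to whole cycles.
    """
    if async_latency_ns <= 0 or clock_ghz <= 0:
        raise ValueError("latency and clock rate must be positive")
    return max(1, math.ceil(async_latency_ns * clock_ghz))
```

With a made-up asynchronous latency of 3 ns, the same pipeline spans 4 cycles at a 1.3 GHz clock and 1 cycle at one fifth of that rate (0.26 GHz), so its apparent depth shrinks as the clock slows.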

Comparison: Adaptive vs. Wave Pipelining

Similarity:
  Both allow variable number of data tokens in the datapath
  Both allow interfacing with clocked environments

… But: significant differences:
  Wave pipelining: requires much designer effort
    at all levels of design: from architectural down to layout level
    needs accurate balancing of path delays (incl. data dependent)
    vulnerable to process, temperature and voltage variations
    cannot handle varying input/output data rates
  Adaptive pipelining: significantly more robust
    uses robust handshake protocol between stages
    is elastic: can handle stalls, congestion, etc.

Performance Analysis

Key Result: Filter behaves similar to a self-timed ring
  … with 9 ½ stages!

[Figure: clocked input/output closing a self-timed ring]
[Figure: Throughput vs. Tokens in Pipeline (ticks 0, 1, 2, N); labels 1/T_F, 1/T_B, 1/T_C and “Reachable Throughput”]
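A throughput-versus-tokens curve of this kind is commonly modelled as the minimum of three limits. The sketch below is one such model, and it rests on assumptions about the labels in the plot: T_F is taken to be the time for one token to travel once around the ring, T_B the corresponding time for one empty slot (hole) to travel around it, and T_C the cycle time of a single stage. The slides do not define these symbols, and the function name is hypothetical.

```python
def ring_throughput(tokens, stages, t_forward, t_backward, t_cycle):
    """Estimated throughput of a self-timed ring holding `tokens` data items."""
    if not 0 <= tokens <= stages:
        raise ValueError("tokens must lie between 0 and the number of stages")
    data_limited = tokens / t_forward              # too few tokens in the ring
    hole_limited = (stages - tokens) / t_backward  # too few empty slots
    stage_limited = 1.0 / t_cycle                  # a stage cannot cycle faster
    return min(data_limited, hole_limited, stage_limited)
```

Under this model throughput rises with the number of tokens, flattens at 1/T_C, and falls again as the ring fills up, reaching zero when no empty slot remains.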

Technology:
  IBM’s CMOS-7SF technology with Cu interconnect
  0.18 micron process, 5 metal layers, and 1.8V supply

Layout: part standard-cell, part full-custom
  entire clocked portion: standard-cell
  asynchronous datapath: full-custom dynamic gates
  asynchronous control: standard-cell for basic gates, full-custom for C- and aC-elements

Placement and Routing (P&R): fully automated using the Silicon Ensemble tool
  chip partitioned into 8 parts, each P&R’ed automatically
  top-level P&R also automated

No resizing of gates performed after P&R

Results: Throughput

Over 1.3 Giga items/second

[Figure: Throughput (Mega items/sec) vs. Number of Tokens (0 to 10), one curve each at 1.4V, 1.6V, 1.8V and 2.0V]

Results: Power Consumption

Less than 500 mW at 1 Giga items/sec

[Figure: Power Consumption (mW) vs. Number of Tokens (0 to 10), one curve each at 1.4V, 1.6V, 1.8V and 2.0V]

Designed, fabricated and tested a real-world FIR filter:
  Hybrid synchronous-asynchronous design
  Exhibits adaptive pipelining
    variable number of tokens in the datapath
    enables low clock cycle latency operation at all frequencies
  Exceeds all performance specifications:
    obtains throughput over 1.3 Giga items/sec
      15% faster than best existing read channel filter
      asynchronous portion estimated capable of up to 1.8 Giga items/sec
    obtains latency as low as 4 clock cycles
  Testable
  Required low designer effort:
    Layout: mostly using library components
    Placement and routing: fully automated
