july 20101 clock distribution shmuel wimer bar ilan univ. eng. faculty technion, ee faculty

July 2010 1

Clock Distribution

Shmuel WimerBar Ilan Univ. Eng. Faculty

Technion, EE Faculty

July 2010 2

Clock System Architecture

GatersBuffers

Clocked Elements

Clock Distribution

Clock Generator

clk1 clk2

External Clock

gclkext_clk

Clock generator adjusts the global clock to the external clock.

Global clock is distributed across the chip.

Chip receives external clock through I/O pad.

Local drivers and gaters drive the physical clocks to clocked elements.

July 2010 3

Global Clock Generation

• Receives external clock signal and produce the global

clock distributed across the die.

• A large skew occurs between external clock and the

physical clocks at clocked elements due to delay of

distribution network (wires, buffers, gaters).

• Therefore, data at clocked elements is no more in sync

with data at I/O pins.

• Phased Locked Loop (PLL) compensates this delay.

• PLL can perform frequency multiplication to obtain the

required on-chip frequencies.

July 2010 4

CLKinext_clk

ref_clk

fdbk_clk

clk_out

Synchronous Chip Interface with PLL

CLKout

Chip A Chip B

Chip A communicates synchronously with chip B

Chip B uses the clock sent by chip A. Data in and out must be synchronized to the common clock.

A PLL produces the global clock of chip B such that it is in sync with the external clock.

clkgclk

Clock Distribution

July 2010 5

How PLL Works?

Voltage Controlled Oscillator

clk_out

ref_clk

fdbk_clk

Phase Detect

Charge Pump

Loop Filter

July 2010 6

Phase – Frequency Detector (PFD)

Once the lagging signal arrives, a reset turns both Q_A and Q_B to zero.

Q_A: B should go faster

1 Q_B: B should go slower

The two flip-flops receive the signals at their clock input (one is usually a

reference and the other is the sampled).

The output of the leading flip-flop is 1 for the lead duration.

July 2010 7

A: reference

B: sampled

Q_A: sampled should go faster

The spikes at Q_B are a result of the delay of the AND gate driving the

CLR input of flip-flip and the internal delay from CLR to Q.

What happens when the reference and the sampled signals are a shift of each other?

Q_B: sampled should go slower

July 2010 8

A: reference

Sampled is more often 1-value than the reference is, since rising edge of

B occurs more often than rising edge of A.

What happens when the reference and the sampled signals have different frequencies?

B: sampled

Q_B: sampled should go slower

Q_A: sampled should go faster

July 2010 9

Charge Pump

CLK_ref

CLK_fdbk

faster

slower

Converts PFD error (digital) to charge (analog), which then controls PLL VCO.

cp up faster dn slowerCharge is proportional to PFD widths: .Q I t I t

July 2010 10

Charge pump consists of current mirrors which are sources of constant

current. Device N1 is in saturation since its gate is connected to high voltage.

Ids (=Iin) depends only on Vgs. Vgs is similar in N2, hence Iout=Iin.

This is an ideal current source with infinite output impedance since Iout is

independent of N2 load; a change in output voltage doesn’t affect Iout.

Iin IoutV

Current mirror works similarly for P transistors.

Current Mirror

July 2010 11

faster

slower

Switch – open when faster = 1

Switch – open when slower = 1

How Charge Pump Works?

P-type current mirror

N-type current mirror

R load – determines the current through current mirror

July 2010 12

faster=1

slower=0

Faster Mode Vout → Vcc

July 2010 13

faster=0

slower=1

Slower Mode Vout → Vss

July 2010 14

Loop Filter

Vctrl+

Differential amplifier connected as a unity-gain follower is used.

July 2010 15

Voltage Controlled Oscillator (VCO)

A Ring Oscillator cascades an odd number of inverters and feeds back the

last output to first inverter (even number of inverters will be stable). It starts

to oscillate spontaneously.

If is the delay of an inverter and is the number of inverters,

oscillation frequency is 1 2

Frequency can be controlled by number of inverters and supply voltage

of inverter (higher voltage obtains faster inverter).

July 2010 16

Ring of 5 inverters

Buffering for driving clk_out

Components of VCO

Level Converter from Vctrl-Vss to Vcc-Vss

clk_out

July 2010 17

Delay Locked Loop (DLL)

• It is a variant of PLL that uses voltage-controlled delay

line rather than oscillator.

• It adjusts phase only. Frequency multiplication is

impossible.

• It is simpler than PLL, less sensitive to Vctrl noise and

requires simpler loop filter.

• It is very difficult to correctly design PLL and DLL. It

requires expertise in control systems and analog circuit

design.

July 2010 18

How DLL Works?

Voltage Controlled Delay Line

clk_out

ref_clk

fdbk_clk

Phase Detect

Charge Pump

Loop Filter

July 2010 19

Delay Line

The signal from input to output is delayed

or not according to the control bit.

Out22n 1

1nD 2nD 0D

Connecting a series of delay elements, each of delay 2 , 1 0,

it is possible to control the dely of the line to any value from 0 to 2 1

dealy units by an -bit control word , , , .

n D D D

July 2010 20

Clock Distribution Networks

July 2010 21

Tree Clock Network (Unconstrained)

No constraints imposed on buffers and wires.

Used mostly by automatic tools in automatic synthesis flows.

Can be used for small blocks within large design.

Tools aim at minimizing the variance of clock delays.

July 2010 22

is the clock delay at a leaf, the variance exapression

should be minimized. n n

CLK CLK

Serpentine routing or extra buffers may be introduced

to obtain small variance.

Constraints on power can be imposed by limiting

number and size of clock buffers and width of wires.

July 2010 23

Clock Distribution with Grids

Grid feeds flops directly, no local buffers

Clock driver tree spans height of chipInternal levels shorted together

Low skew but high power

July 2010 24

Clock drivers are on perimeter Clock drivers are on grid points

July 2010 25

Delay and Skew in Grid Distribution

July 2010 26

DEC’s Alpha Microprocessor Clocking

July 2010 27

DEC Alpha 21264 Microprocessor Clock distribution

July 2010 28

Clock Distribution with Spines

July 2010 29

Intel’s Pentium4 Clock Distribution

July 2010 30

July 2010 31

Clock Distribution with Trees

Recursive pattern to distribute signals uniformly with equal delay over area

H-Tree

Each branch is individually routed to balance RC delay

RC-Tree

More skew but less power

July 2010 32

Clock H-Tree

clock / PLL

chip / functional block / IP

sequential elements

July 2010 33

IBM / Motorola PowerPC Clock Distribution

July 2010 34

Delay Calculation

We use Elmore delay model. Sub trees are modeled as capacitive loads

July 2010 35

Clock Skew and Jitter

• Clock should theoretically arrive simultaneously to all

sequential circuits.

• Practically it arrives in different times. The differences

are called clock skews.

• Skews result from paths mismatches, process variations

and ambient conditions, resulting physical clocks.

• Most system distribute a global clock and then use local

clock gaters located near clocked elements.

Clock skew consists of the following components:

July 2010 36

• Systematic is the portion existing under nominal

conditions. It can be minimized by appropriate design.

• Random is caused by process variations like devices’

channel length, oxide thickness, threshold voltage, wire

thickness, width and space. It can be measured on

silicon and adjusted by delay components.

• Drift is caused by time-dependent environmental

variations, occurring relatively slowly. Compensation of

those must takes place periodically.

• Jitter is rapid clock changes, occurring by power noise

and clock generator jitter. It cannot be compensated.

July 2010 37

Factors affecting clock skew, Intel 1998, 0.25u.

July 2010 38

Skew, Clock Cycle and Design Margins

Clock Jitter is the same order as skew, but far more difficult to compensate.

July 2010 39

Skew Modeling

Clock Generator

CLK1 1

CLK2 1

l d CC - capacitive load, - drive current, - voltage swingC I V

l CC dC V I

Point of divergence

July 2010 40

CC CClCC l d CC l d2

CC l d d d d

CC l d

Consider a small change in the delay, taking the linear term.

lV CVCV C I V C I

V C I I I I

Standard deviation of stage delay is = . is around 5%.

CLK1 CLK2

skew CLK2 CLK1 skew

If clock buffers delays are normally distributed and independent

of each other, then , and .

Skew is: | | .

T m T n

T T T m n

July 2010 41

Consider - level clock tree.m

Clock Distribution Switching Power

l_Let be the total load of its far-end driven sequential

elements.

Assuming a fixed fanout of each clock tree buffer, the

dynamic power is:

2CLK_ l_ CC ,m mP C V f

CCCLK_ , 0 1.m

CP V f j m

July 2010 42

Summing over whole clock tree:

1 2CLK CCl_0

1 l_2 2CC CC l_0

P C V f

C kV f V fC

How much of the power is consumed by the far end drivers of clock tree?

July 2010 43

Given the number of sequential elements in a block, at

least 50% of the switching power is consumed by the

far end drivers (clock tree is binary, k=2). This number

approaches 1 rapidly with k growth.

Example: Assume a block with 214 sequential elements

and H-tree clock distribution. Then k=4 and m=7. The

far end drivers consume nearly 75% of the clock tree

switching power, while adding the next upper level

drivers brings it to more than 90%.

July 2010 44

Active Clock De-Skewing

• Compensates process variability, temperature gradients,

imperfect design.

• Can be implemented for global fixes (small HW

overhead) or local fixes (high HW overhead).

• Can be used at testing for one time fix (variability

occurring during manufacturing), or dynamically

concurrently with chip operation.

• Its implementation is a difficult design challenge.

July 2010 45

Intel’s Pentium2 De-Skewing System

1998, 450MHz Clock

0.25u process

60pSec skew w/o fix

15pSec skew with fix

Two clock spines for two clock regions.

A phase detector detects relative shifts.

Clock of a region is shifted by a delay line.

July 2010 46

Delay line consists of two cascaded inverters.

Each has a programmable load consists of eight parallel P-N gate

capacitors.

The shift register stores a thermometer code for load programming

in steps of 12pSec.

July 2010 47

Intel’s IA64 Itanium1 De-Skewing System

2000, 800MHz clock

0.18u process

28pSec skew with fix

X4 increase w/o fix

30 independent de-skew regions. PLL

Each cluster is driven from a global H-tree.

Delay circuit in de-skew region are similar to Pentium3 with 20-bit registers.

July 2010 48

July 2010 49

Proposal for H-Tree Clock De-Skew – Hierarchical Approach

If a phase detector (PD) has a skew guard band g, then guard bands

may accumulate along tree paths.

For example, if a logic stage is shared between region B and C, it may

add 7g time units to path delay.

July 2010 50

Proposal for H-Tree Clock De-Skew – Mesh Approach

Clock is distributed by H-

tree, but de-skew takes

place by neighbor leaves

phase detection.

A delay buffer accepts

phase inputs from its 4

neighbors and then

decides of whether to

increase, decrease or

not change its delay.

July 2010 51

Clock Characteristics of Commercial Processors

Power Consumption in Chips

• Clock power may reach 50% of total (dynamic + static).

• Clock gating is very useful and standard design practice

• Four gating methods:– Synthesis based, automated by EDA tools, RTL

compilers, inserted into clock-tree– Clock enable signals manually defined by designer,

inserted into clock-tree, FFs’ clock input– Data-Driven clock gating, inserted at FF-level– Auto-gated FF, inserted at latch-level

FF Data Toggling (DSP core)

23k FFs, Test bench of 250k CLK cycles

10% of application run-time CLK is enabled.

Of which, only 1.6% CLK pulses are useful!

block count

FF Data Toggling 63 control blocks of ϻP, 200k FFs

Data-Driven CLK Gating

tight constraint

Net saving per FF

Gater’s disabling probability

Latch overhead amortized over k FFs

Probability of enabling FF

How Many FFs To Group ?

k: # flip-flops, q: FF probability of D=Q q=1-p

Worst case : All FF are toggling independently of each other.

latch w OFFsaving

w R1kc q c c k qc c c

Derivate by k: 2FF W latchln 0kq q c c c k

Optimal Flip-flop k-size Grouping

Given flip-flops and 1 clock cyclesn m

1, , is the activity (toggling) of flip-flopma aa

is the number of redundant clock pulses

ocurring by jointly clocking FF and FF

FF1: 0 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1 0

FF2: 0 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 1

FF Pairwise Activity Model

, , : FF pairwise activity graph.G V E w

: vertex matchingE E

corresponds to FF . i iv V

, is FF pairing.ij i je v v E

| is joint toggling.i ja a

is redundant clock pulses, hence a waste.ij i jw e a a

| |i ij ij

i i i j j i jv V e E e E a a a a a a a

i je EP a a

Total power:

j i ii ji

i i jv V e iV ije EE vwaa a ea

Essential + Waste

FF FFi j

Flip-Flop Grouping Algorithm

Minimize: FF FFi j

Can be repeated for groups of size 4, 8, 16 …

Optimal FFs pairing

can be solved by

minimal cost perfect

graph matching.

1a 3a 2a 7a 4a 6a 5a 8a

1 2|a a3 4|a a

7 8|a a

6 5|a a

Is repeated perfect matching optimal ?

No! Here is the optimal 4-size grouping

Master Latch

Slave Latch

1-bit flip-flop

CLKCLK

Master Latch

Slave Latch

CLKCLK

Master Latch

Slave Latch

CLK CLK

Master Latch

Slave Latch

2-bit flip-flop

Multi-Bit Flip-FlopSaving the power of internal CLK drivers

MBFF should be combined with data-driven CG to maximize energy savings.

Toggling vectors (VCD) are unfortunately not always available.

Data-to-clock toggling ratio (probability) is more often available.

How to utilize it for MBFF optimal grouping?

1 1 2j i i j i j i jp p p p p p p p

What is the energy waste in 2-bit MBFF?

j s ti ij i

For n FFs grouped in n/2 MBFFs it is

What is the optimal MBFF grouping?

1 2 np p p Group the FFs such that

Auto-Gated FF

Look-Ahead CLK Gating

Relaxed constraint

Power Saving Per FF

FF+CLKX O A FF O

FF+CLK FF O

savedyn

cp c kc c c c

p c c c

Which FF to Gate?

Results

july 20101 clock distribution shmuel wimer bar ilan univ. eng. faculty technion, ee faculty

clk clk

chip b chip b

global clock of chip

clk clock generator

clock input

common clock

reference b

clk fdbk

Documents

net-ordering for optimal circuit timing in nanometer...

1 the israeli cleantech by israel drori, shmuel ellis, mali...

li/cfx batteries the renaissance - shmuel de-leon - the...

issn 2314-1638 romero wimer, fernando y dalbosco, julia el

addition circuits shmuel wimer bar ilan university,...

even shmuel

convex hulls may 20121 shmuel wimer bar ilan univ., eng....

august 20101 vlsi memory design shmuel wimer bar ilan...

shmuel abuav 2015 master

specification and verification of aspects shmuel katz

asheville law group, by michael g. wimer and jake a ... ·...

march 20111 trapezoidal maps shmuel wimer bar ilan univ.,...

instruction-level parallelism dynamic scheduling prepared...

introduction to cmos vlsi design instructed by shmuel wimer...

track a-shmuel panijel, windriver

instruction-level parallelism compiler techniques and branch...

feb 20111 polygon triangulation shmuel wimer bar ilan univ.,...

may 20121 visible surface detection shmuel wimer bar ilan...

june 20101 3d object representation shmuel wimer bar ilan...

probability-driven multi bit flip-flop design...