Clock Distribution, Shmuel Wimer, Bar Ilan Univ. Eng. Faculty, Technion EE Faculty, July 2010

Post on 18-Dec-2015


July 2010 1

Clock Distribution

Shmuel Wimer

Bar Ilan Univ., Eng. Faculty

Technion, EE Faculty

July 2010 2

Clock System Architecture

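[Figure: chip block diagram. The external clock (ext_clk) feeds the clock generator, which outputs the global clock (gclk). The clock distribution network, buffers and gaters deliver the physical clocks clk1, clk2 and clk3 to the clocked elements.]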

Clock generator adjusts the global clock to the external clock.

Global clock is distributed across the chip.

Chip receives external clock through I/O pad.

Local drivers and gaters drive the physical clocks to clocked elements.

July 2010 3

Global Clock Generation

• Receives the external clock signal and produces the global clock distributed across the die.

• A large skew occurs between the external clock and the physical clocks at clocked elements due to the delay of the distribution network (wires, buffers, gaters).

• Therefore, data at clocked elements is no longer in sync with data at I/O pins.

• A Phase-Locked Loop (PLL) compensates for this delay.

• The PLL can also perform frequency multiplication to obtain the required on-chip frequencies.

July 2010 4

Synchronous Chip Interface with PLL

[Figure: chip A (Din, Dout, CLKout) connected to chip B (Din, Dout, CLKin = ext_clk). Inside chip B, a PLL (ref_clk, fdbk_clk, clk_out) drives gclk into the clock distribution, whose output is clk.]

Chip A communicates synchronously with chip B.

Chip B uses the clock sent by chip A. Data in and out must be synchronized to the common clock.

A PLL produces the global clock of chip B such that it is in sync with the external clock.

July 2010 5

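How Does a PLL Work?

[Figure: PLL block diagram: phase detector (inputs ref_clk and fdbk_clk, outputs Up and Down), charge pump (current sources I), loop filter (R, C, output Vctrl), voltage controlled oscillator (output clk_out), and dividers labeled M and N.]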

July 2010 6

Phase – Frequency Detector (PFD)

[Figure: PFD built of two D flip-flops with D tied to 1. Signal A clocks the flip-flop producing Q_A (B should go faster) and signal B clocks the flip-flop producing Q_B (B should go slower). Both flip-flops have a CLR input.]

The two flip-flops receive the signals at their clock input (one is usually a reference and the other is the sampled).

The output of the leading flip-flop is 1 for the lead duration.

Once the lagging signal arrives, a reset turns both Q_A and Q_B to zero.
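The PFD behavior can be sketched behaviorally. The following is a minimal Python sketch, assuming ideal flip-flops and a zero-delay reset (so the narrow spikes caused by the reset path are not modeled). The function name `pfd_pulse_widths` and the lists of rising-edge times are illustrative assumptions, not part of the slides.

```python
def pfd_pulse_widths(edges_a, edges_b):
    """Return (total time Q_A is 1, total time Q_B is 1).

    edges_a: rising-edge times of A (reference).
    edges_b: rising-edge times of B (sampled).
    Assumption: the reset is instantaneous once both outputs are 1.
    """
    events = sorted([(t, "A") for t in edges_a] + [(t, "B") for t in edges_b])
    q_a = q_b = False      # flip-flop outputs
    t_set = 0.0            # time at which the leading output went to 1
    up = down = 0.0        # accumulated Q_A ("faster") and Q_B ("slower") time
    for t, src in events:
        if src == "A":
            if q_b:                    # B was leading: A arrives, reset both
                down += t - t_set
                q_b = False
            elif not q_a:              # A leads: raise Q_A
                q_a, t_set = True, t
        else:
            if q_a:                    # A was leading: B arrives, reset both
                up += t - t_set
                q_a = False
            elif not q_b:              # B leads: raise Q_B
                q_b, t_set = True, t
    return up, down
```

With B lagging A by 2 time units on every cycle, only Q_A pulses, which tells the sampled signal to go faster; swapping the two edge lists makes only Q_B pulse.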

July 2010 7

What happens when the reference and the sampled signals are a shift of each other?

[Figure: waveforms of A (reference), B (sampled), Q_A (sampled should go faster) and Q_B (sampled should go slower).]

The spikes at Q_B are a result of the delay of the AND gate driving the CLR input of the flip-flop and the internal delay from CLR to Q.

July 2010 8

What happens when the reference and the sampled signals have different frequencies?

[Figure: waveforms of A (reference), B (sampled), Q_A (sampled should go faster) and Q_B (sampled should go slower).]

Q_B is 1 more often than Q_A, since the rising edge of B occurs more often than the rising edge of A.

July 2010 9

Charge Pump

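[Figure: the PFD (two D flip-flops with D tied to 1, clocked by CLK_ref and CLK_fdbk) produces the faster and slower signals, which operate switches Sup and Sdn connecting current sources Icp to the Vctrl node.]

Converts PFD error (digital) to charge (analog), which then controls PLL VCO.

Charge is proportional to PFD pulse widths: $Q_{cp} = I_{up}\, t_{faster} - I_{dn}\, t_{slower}$.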

July 2010 10

Current Mirror

Charge pump consists of current mirrors which are sources of constant current. Device N1 is in saturation since its gate is connected to high voltage. Ids (=Iin) depends only on Vgs. Vgs is the same in N2, hence Iout=Iin.

This is an ideal current source with infinite output impedance since Iout is independent of N2 load; a change in output voltage doesn't affect Iout.

[Figure: N-type current mirror (N1, N2, tied to Vss) and P-type current mirror (P1, P2, tied to Vcc), each with input current Iin, output current Iout and gate voltage V.]

Current mirror works similarly for P transistors.

July 2010 11

How Does the Charge Pump Work?

[Figure: charge pump schematic between Vcc and Vss. A P-type current mirror and an N-type current mirror each carry current I. R load: determines the current through the current mirror. One switch is open when faster = 1, the other is open when slower = 1. The output Vout connects to the R, C network.]

July 2010 12

Faster Mode: Vout → Vcc

[Figure: charge pump with faster=1 and slower=0; Vout, connected to the R, C network, is pulled toward Vcc.]

July 2010 13

Slower Mode: Vout → Vss

[Figure: charge pump with faster=0 and slower=1; Vout, connected to the R, C network, is pulled toward Vss.]

July 2010 14

Loop Filter

[Figure: loop filter: R, C network tied to Vcc, followed by a differential amplifier (+ and - inputs) producing Vctrl.]

A differential amplifier connected as a unity-gain follower is used.

July 2010 15

Voltage Controlled Oscillator (VCO)

A Ring Oscillator cascades an odd number of inverters and feeds back the last output to the first inverter (an even number of inverters will be stable). It starts to oscillate spontaneously.

[Figure: ring of inverters (inv) with output Vout.]

If $t$ is the delay of an inverter and $n$ is the number of inverters, the oscillation frequency is $f = \frac{1}{2nt}$.

Frequency can be controlled by the number of inverters and the supply voltage of the inverters (a higher voltage yields a faster inverter).
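As a small worked example of the ring oscillator relation $f = 1/(2nt)$, here is a minimal Python sketch. The function name and the 20 ps inverter delay used in the usage note are illustrative assumptions, not values from the slides.

```python
def ring_oscillator_frequency(n_inverters, inverter_delay):
    """Oscillation frequency f = 1 / (2 * n * t) of an ideal ring oscillator.

    n_inverters: number of inverters in the ring (must be odd).
    inverter_delay: delay t of one inverter, in seconds.
    """
    if n_inverters < 1 or n_inverters % 2 == 0:
        # An even ring settles into a stable state and does not oscillate.
        raise ValueError("a ring oscillator needs an odd number of inverters")
    if inverter_delay <= 0:
        raise ValueError("inverter delay must be positive")
    return 1.0 / (2 * n_inverters * inverter_delay)
```

For a ring of 5 inverters with an assumed delay of 20 ps each, the full period is 2 x 5 x 20 ps = 200 ps, so the function returns about 5 GHz.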

July 2010 16

Components of VCO

[Figure: VCO schematic with three parts: a level converter from Vctrl-Vss to Vcc-Vss, a ring of 5 inverters, and buffering for driving clk_out. Supplies are Vcc and Vss, the input is Vctrl and the output is clk_out.]

July 2010 17

Delay Locked Loop (DLL)

• It is a variant of PLL that uses a voltage-controlled delay line rather than an oscillator.

• It adjusts phase only. Frequency multiplication is impossible.

• It is simpler than a PLL, less sensitive to Vctrl noise and requires a simpler loop filter.

• It is very difficult to correctly design a PLL or a DLL. It requires expertise in control systems and analog circuit design.

July 2010 18

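How Does a DLL Work?

[Figure: DLL block diagram: phase detector (inputs ref_clk and fdbk_clk, outputs Up and Down), charge pump (current sources I), loop filter (R, C, output Vctrl), and voltage controlled delay line (output clk_out).]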

July 2010 19

Delay Line

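The signal from input to output is delayed or not according to the control bit.

[Figure: a single delay stage (In, Out, control bit selecting path 1 or 0) and a chain of $n$ such stages with delays $2^{n-1}, 2^{n-2}, \dots, 2^{0}$ controlled by bits $D_{n-1}, D_{n-2}, \dots, D_0$.]

Connecting a series of delay elements, each of delay $2^i$, $n-1 \ge i \ge 0$, it is possible to control the delay of the line to any value from $0$ to $2^n - 1$ delay units by an $n$-bit control word $D_{n-1}, D_{n-2}, \dots, D_0$.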
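The binary-weighted delay line can be illustrated with a short Python sketch. It assumes $n$ cascaded stages where stage $i$ adds $2^i$ delay units when its control bit $D_i$ is 1 and is bypassed otherwise. The function name and the `unit_delay` parameter are illustrative assumptions.

```python
def delay_line(control_word, n_bits, unit_delay=1.0):
    """Total delay of an n-stage binary-weighted delay line.

    control_word: integer whose bit i is the control bit D_i.
    n_bits: number of stages n.
    unit_delay: duration of one delay unit (assumed parameter).
    """
    if not 0 <= control_word < 2 ** n_bits:
        raise ValueError("control word does not fit in n_bits")
    total = 0.0
    for i in range(n_bits - 1, -1, -1):      # stage with delay 2^i
        bit = (control_word >> i) & 1         # control bit D_i
        if bit:
            total += (2 ** i) * unit_delay    # signal takes the delayed path
    return total
```

Because the stage delays are powers of two, every control word from 0 to 2^n - 1 maps to a distinct delay, equal to the word's value in delay units.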

July 2010 20

Clock Distribution Networks

July 2010 21

Tree Clock Network (Unconstrained)

[Figure: clock tree with leaves clk1, clk2, clk3, ..., clkn.]

No constraints imposed on buffers and wires.

Used mostly by automatic tools in automatic synthesis flows.

Can be used for small blocks within large design.

Tools aim at minimizing the variance of clock delays.

July 2010 22

If $T^i_{CLK}$ is the clock delay at leaf $i$, the variance expression

$$\sum_{i=1}^{n} \left( T^i_{CLK} - \frac{1}{n} \sum_{j=1}^{n} T^j_{CLK} \right)^2$$

should be minimized.

Serpentine routing or extra buffers may be introduced to obtain small variance.

Constraints on power can be imposed by limiting the number and size of clock buffers and the width of wires.
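The cost that such a tool minimizes, the sum of squared deviations of the leaf clock delays from their mean, can be sketched in a few lines of Python. The function names are illustrative assumptions; a real clock-tree synthesis tool would evaluate this on extracted delays.

```python
def delay_variance_cost(leaf_delays):
    """Sum over leaves of (T_i - mean(T))^2, the expression to be minimized."""
    n = len(leaf_delays)
    mean = sum(leaf_delays) / n
    return sum((t - mean) ** 2 for t in leaf_delays)


def max_skew(leaf_delays):
    """Largest difference between any two leaf arrival times."""
    return max(leaf_delays) - min(leaf_delays)
```

For leaf delays of 1, 2 and 3 time units the mean is 2 and the cost is 1 + 0 + 1 = 2; a perfectly balanced tree has cost 0.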

July 2010 23

Clock Distribution with Grids

Grid feeds flops directly, no local buffers

Clock driver tree spans height of chip

Internal levels shorted together

Low skew but high power

July 2010 24

Clock drivers are on perimeter

Clock drivers are on grid points

July 2010 25

Delay and Skew in Grid Distribution

July 2010 26

DEC’s Alpha Microprocessor Clocking

July 2010 27

DEC Alpha 21264 Microprocessor Clock distribution

July 2010 28

Clock Distribution with Spines

July 2010 29

Intel’s Pentium4 Clock Distribution

July 2010 30

July 2010 31

Clock Distribution with Trees

H-Tree: recursive pattern to distribute signals uniformly with equal delay over area

RC-Tree: each branch is individually routed to balance RC delay

More skew but less power

July 2010 32

Clock H-Tree

[Figure: H-tree from the clock / PLL, spanning the chip / functional block / IP, to the sequential elements.]

July 2010 33

IBM / Motorola PowerPC Clock Distribution

July 2010 34

Delay Calculation

We use Elmore delay model. Sub trees are modeled as capacitive loads
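As a rough illustration of the Elmore model, the sketch below computes the delay from the driver to every node of an RC tree, treating each subtree as a lumped capacitive load on the branch that feeds it. The tree representation (a `parent` map, per-branch resistance `r`, per-node capacitance `c`) and all names are assumptions made for this example.

```python
def elmore_delays(parent, r, c):
    """Elmore delay from the driver to every node of an RC tree.

    parent[n]: parent of node n (None for the root).
    r[n]: resistance of the branch feeding node n (driver resistance for the root).
    c[n]: capacitance at node n.
    """
    children = {n: [] for n in parent}
    root = None
    for n, p in parent.items():
        if p is None:
            root = n
        else:
            children[p].append(n)

    def subtree_cap(n):
        # A subtree is seen by its feeding branch as one capacitive load.
        return c[n] + sum(subtree_cap(ch) for ch in children[n])

    delays = {}

    def visit(n, upstream):
        # Each branch contributes R_branch * (capacitance downstream of it).
        delays[n] = upstream + r[n] * subtree_cap(n)
        for ch in children[n]:
            visit(ch, delays[n])

    visit(root, 0.0)
    return delays
```

For a driver `s` (resistance 1, no capacitance) feeding two identical branches `a` and `b` (resistance 2, capacitance 3 each), the delay to `s` is 1 x 6 = 6 and the delay to each leaf is 6 + 2 x 3 = 12, so the two leaves see zero skew.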

July 2010 35

Clock Skew and Jitter

• The clock should theoretically arrive simultaneously at all sequential circuits.

• Practically it arrives at different times. The differences are called clock skews.

• Skews result from path mismatches, process variations and ambient conditions, which affect the resulting physical clocks.

• Most systems distribute a global clock and then use local clock gaters located near clocked elements.

Clock skew consists of the following components:

July 2010 36

• Systematic is the portion existing under nominal conditions. It can be minimized by appropriate design.

• Random is caused by process variations in devices' channel length, oxide thickness and threshold voltage, and in wire thickness, width and spacing. It can be measured on silicon and adjusted by delay components.

• Drift is caused by time-dependent environmental variations, occurring relatively slowly. Compensation for those must take place periodically.

• Jitter is rapid clock changes, caused by power noise and clock generator jitter. It cannot be compensated.

July 2010 37

Factors affecting clock skew, Intel 1998, 0.25u.

July 2010 38

Skew, Clock Cycle and Design Margins

Clock Jitter is the same order as skew, but far more difficult to compensate.

July 2010 39

Skew Modeling

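[Figure: two clock paths leave the clock generator and separate at the point of divergence. One path passes through buffers 1, 2, ..., m and arrives with delay Tclk1; the other passes through buffers 1, 2, ..., n and arrives with delay Tclk2. The two clocks drive flip-flops (D, Q) connected through combinational logic (CL).]

$T_{CLK1} = \sum_{i=1}^{m} T_i = m\tau$

$T_{CLK2} = \sum_{i=1}^{n} T_i = n\tau$

$C_l$ - capacitive load, $I_d$ - drive current, $V_{CC}$ - voltage swing

$\tau = C_l V_{CC} / I_d$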

July 2010 40

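Consider a small change in the delay, taking the linear term.

$$\Delta\tau = \Delta\!\left(\frac{C_l V_{CC}}{I_d}\right) \approx \frac{V_{CC}}{I_d}\Delta C_l + \frac{C_l}{I_d}\Delta V_{CC} - \frac{C_l V_{CC}}{I_d^2}\Delta I_d = \frac{C_l V_{CC}}{I_d}\left(\frac{\Delta C_l}{C_l} + \frac{\Delta V_{CC}}{V_{CC}} - \frac{\Delta I_d}{I_d}\right)$$

Standard deviation of stage delay is $\sigma_\tau = \varepsilon\tau$. $\varepsilon$ is around 5%.

If clock buffers delays are normally distributed and independent of each other, then $\sigma_{T_{CLK1}} = \sigma_\tau\sqrt{m}$, and $\sigma_{T_{CLK2}} = \sigma_\tau\sqrt{n}$.

Skew is: $T_{skew} = |T_{CLK2} - T_{CLK1}|$, $\sigma_{T_{skew}} = \sigma_\tau\sqrt{m+n}$.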
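The square-root growth of skew with path length can be checked numerically. The sketch below assumes that each buffer stage delay is an independent normal variable with the same mean and standard deviation, and that the two paths have m and n stages after the point of divergence. It compares the closed form, stage sigma times sqrt(m + n), with a Monte Carlo estimate. Function names, trial count and seed are illustrative assumptions.

```python
import math
import random


def skew_sigma(stage_sigma, m, n):
    """Standard deviation of the skew between paths of m and n independent stages."""
    return stage_sigma * math.sqrt(m + n)


def simulate_skew_sigma(stage_delay, stage_sigma, m, n, trials=20000, seed=1):
    """Monte Carlo estimate of the skew standard deviation."""
    rng = random.Random(seed)
    skews = []
    for _ in range(trials):
        t1 = sum(rng.gauss(stage_delay, stage_sigma) for _ in range(m))
        t2 = sum(rng.gauss(stage_delay, stage_sigma) for _ in range(n))
        skews.append(t2 - t1)
    mean = sum(skews) / trials
    return math.sqrt(sum((s - mean) ** 2 for s in skews) / trials)
```

With an assumed 100 ps stage delay and 5% stage variation (5 ps), two 8-stage paths give a skew sigma of 5 x sqrt(16) = 20 ps, and the simulation lands close to that value.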

July 2010 41

Clock Distribution Switching Power

Consider an $m$-level clock tree.

Let $C_{l_m}$ be the total load of its far-end driven sequential elements.

Assuming a fixed fanout $k$ of each clock tree buffer, the dynamic power is:

$$P_{CLK_m} = C_{l_m} V_{CC}^2 f,$$

$$P_{CLK_{m-j}} = \frac{C_{l_m}}{k^j} V_{CC}^2 f, \quad 0 \le j \le m-1.$$

July 2010 42

Summing over whole clock tree:

$$P_{CLK} = \sum_{j=0}^{m-1} C_{l_{m-j}} V_{CC}^2 f = V_{CC}^2 f \sum_{j=0}^{m-1} \frac{C_{l_m}}{k^j} = V_{CC}^2 f\, C_{l_m} \frac{1 - 1/k^m}{1 - 1/k}$$

How much of the power is consumed by the far end drivers of clock tree?

$$\frac{P_{CLK_m}}{P_{CLK}} = \frac{1 - 1/k}{1 - 1/k^m}$$

July 2010 43

Given the number of sequential elements in a block, at least 50% of the switching power is consumed by the far end drivers (clock tree is binary, k=2). This fraction approaches 1 rapidly as k grows.

Example: Assume a block with $2^{14}$ sequential elements and H-tree clock distribution. Then k=4 and m=7. The far end drivers consume nearly 75% of the clock tree switching power, while adding the next upper level drivers brings it to more than 90%.
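The numbers in this example can be reproduced with a short Python sketch. It assumes, as in the derivation, an m-level tree with fixed fanout k in which the level j steps above the far end switches a load smaller by a factor of k^j, so each level's share of the total is (1/k^j) divided by the sum of the geometric series. The function name is an illustrative assumption.

```python
def level_power_fractions(k, m):
    """Fraction of clock-tree switching power per level.

    Index 0 is the far-end level (level m), index 1 the next level up,
    and so on, for an m-level tree with fixed fanout k.
    """
    weights = [1.0 / k ** j for j in range(m)]   # load of level m-j relative to level m
    total = sum(weights)                          # (1 - 1/k^m) / (1 - 1/k)
    return [w / total for w in weights]


fractions = level_power_fractions(4, 7)           # 4^7 = 2^14 sequential elements
far_end = fractions[0]                            # about 0.75
two_lowest_levels = fractions[0] + fractions[1]   # about 0.94
```

For a binary tree (k=2) the far-end share stays just above 0.5 for any depth, which matches the 50% lower bound stated above.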

July 2010 44

Active Clock De-Skewing

• Compensates for process variability, temperature gradients and imperfect design.

• Can be implemented for global fixes (small HW overhead) or local fixes (high HW overhead).

• Can be used at testing for a one-time fix (variability occurring during manufacturing), or dynamically, concurrently with chip operation.

• Its implementation is a difficult design challenge.

July 2010 45

Intel’s Pentium2 De-Skewing System

1998, 450MHz Clock

0.25u process

60pSec skew w/o fix

15pSec skew with fix

Two clock spines for two clock regions.

A phase detector detects relative shifts.

Clock of a region is shifted by a delay line.

July 2010 46

Delay line consists of two cascaded inverters.

Each has a programmable load consisting of eight parallel P-N gate capacitors.

The shift register stores a thermometer code for load programming in steps of 12pSec.

July 2010 47

Intel’s IA64 Itanium1 De-Skewing System

2000, 800MHz clock

0.18u process

28pSec skew with fix

X4 increase w/o fix

30 independent de-skew regions.

Each cluster is driven from a global H-tree.

Delay circuits in the de-skew regions are similar to those of the Pentium3, with 20-bit registers.

July 2010 48

July 2010 49

Proposal for H-Tree Clock De-Skew – Hierarchical Approach

If a phase detector (PD) has a skew guard band g, then guard bands

may accumulate along tree paths.

For example, if a logic stage is shared between region B and C, it may

add 7g time units to path delay.

July 2010 50

Proposal for H-Tree Clock De-Skew – Mesh Approach

Clock is distributed by H-tree, but de-skew takes place by neighbor leaves phase detection.

A delay buffer accepts phase inputs from its 4 neighbors and then decides whether to increase, decrease or not change its delay.

July 2010 51

Clock Characteristics of Commercial Processors

Power Consumption in Chips

• Clock power may reach 50% of total (dynamic + static).

• Clock gating is a very useful and standard design practice.

• Four gating methods:

– Synthesis based, automated by EDA tools, RTL compilers, inserted into clock-tree

– Clock enable signals manually defined by designer, inserted into clock-tree, FFs' clock input

– Data-Driven clock gating, inserted at FF-level

– Auto-gated FF, inserted at latch-level

52

FF Data Toggling (DSP core)

23k FFs, Test bench of 250k CLK cycles

53

10% of application run-time CLK is enabled.

Of which, only 1.6% CLK pulses are useful!

FF Data Toggling: 63 control blocks of µP, 200k FFs

[Figure: panels (a) and (b); axes: data activity / clock activity, cumulative normalized clock capacitive load, block count.]

Data-Driven CLK Gating

55

56

tight constraint

How Many FFs To Group?

$k$: number of flip-flops, $q$: FF probability of D=Q, $q = 1-p$

Worst case: all FFs are toggling independently of each other.

$$c^{FF}_{saving} = q^k (c_{FF} + c_w) - \frac{c_{latch}}{k} - (1-q)(c_w + c_O)$$

Net saving per FF: $c^{FF}_{saving}$. Gater's disabling probability: $q^k$. Latch overhead amortized over $k$ FFs: $c_{latch}/k$. Probability of enabling FF: $1-q$.

Differentiating by $k$: $q^k \ln q\,(c_{FF} + c_W) + \frac{c_{latch}}{k^2} = 0$
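The group size that maximizes the saving can also be found by direct search over integer k rather than by solving the stationarity condition. The sketch below assumes that the k-dependent part of the net saving per FF is $q^k (c_{FF} + c_W) - c_{latch}/k$, whose derivative gives $q^k \ln q\,(c_{FF} + c_W) + c_{latch}/k^2 = 0$; any term that does not depend on k is dropped because it does not move the optimum. The function name, the search bound `k_max` and the capacitance values in the usage note are hypothetical.

```python
def best_group_size(q, c_ff, c_w, c_latch, k_max=64):
    """Integer group size k maximizing q**k * (c_ff + c_w) - c_latch / k.

    q: probability that a flip-flop does not toggle (D = Q) in a cycle.
    c_ff, c_w, c_latch: flip-flop, wire and latch capacitances (any consistent unit).
    """
    def saving(k):
        # q**k: probability that the shared gater disables the clock.
        # c_latch / k: latch overhead amortized over the k flip-flops.
        return q ** k * (c_ff + c_w) - c_latch / k

    return max(range(1, k_max + 1), key=saving)
```

With hypothetical values q = 0.9, c_ff = 1.0, c_w = 0.2 and c_latch = 1.0 the search returns k = 3; a less active design (larger q) favors larger groups.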

57

58

59

Optimal Flip-flop k-size Grouping

Given $n$ flip-flops and $m+1$ clock cycles

$\mathbf{a} = (a_1, \dots, a_m)$ is the activity (toggling) of a flip-flop

$\|\mathbf{a}_i \oplus \mathbf{a}_j\|$ is the number of redundant clock pulses occurring by jointly clocking $FF_i$ and $FF_j$

FF1: 0 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1 0

FF2: 0 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 1
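The redundant-pulse count for the FF1 and FF2 traces can be worked out with a short Python sketch. It assumes that the two listed sequences are the flip-flop values over 17 clock cycles, so each yields a 16-entry toggle vector, and that a clock pulse is redundant for a flip-flop when its partner toggles and it does not (bitwise XOR of the toggle vectors). The helper names are illustrative.

```python
def toggles(values):
    """Activity vector: 1 where the value changes between consecutive cycles."""
    return [int(a != b) for a, b in zip(values, values[1:])]


def redundant_pulses(act_i, act_j):
    """Cycles in which exactly one of the two flip-flops toggles (XOR)."""
    return sum(x ^ y for x, y in zip(act_i, act_j))


def joint_pulses(act_i, act_j):
    """Cycles in which a shared clock must pulse (OR)."""
    return sum(x | y for x, y in zip(act_i, act_j))


ff1 = [0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0]
ff2 = [0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1]
a1, a2 = toggles(ff1), toggles(ff2)
waste = redundant_pulses(a1, a2)   # 9 redundant pulses
shared = joint_pulses(a1, a2)      # 15 cycles with a clock pulse
```

Under this reading FF1 toggles 10 times and FF2 11 times, the shared clock pulses in 15 cycles, and 2 x 15 - 10 - 11 = 9 of the delivered pulses are redundant.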

FF Pairwise Activity Model

60

$G(V, E, w)$: FF pairwise activity graph.

$v_i \in V$ corresponds to $FF_i$.

$e_{ij} = (v_i, v_j) \in E$ is FF pairing.

$E' \subseteq E$: vertex matching

$\mathbf{a}_i \,|\, \mathbf{a}_j$ is joint toggling.

$w(e_{ij}) = \|\mathbf{a}_i \oplus \mathbf{a}_j\|$ is redundant clock pulses, hence a waste.

61

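$$P = 2 \sum_{e_{ij} \in E'} \|\mathbf{a}_i | \mathbf{a}_j\| = \sum_{v_i \in V} \|\mathbf{a}_i\| + \sum_{e_{ij} \in E'} \left( \|\mathbf{a}_i | \mathbf{a}_j\| - \|\mathbf{a}_i\| \right) + \sum_{e_{ij} \in E'} \left( \|\mathbf{a}_i | \mathbf{a}_j\| - \|\mathbf{a}_j\| \right)$$

Total power:

$$P = \sum_{v_i \in V} \|\mathbf{a}_i\| + \sum_{e_{ij} \in E'} \|\mathbf{a}_i \oplus \mathbf{a}_j\| = \sum_{v_i \in V} \|\mathbf{a}_i\| + \sum_{e_{ij} \in E'} w(e_{ij})$$

Essential + Waste

[Figure: $FF_i$, $FF_j$ and $FF_i \oplus FF_j$.]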

Flip-Flop Grouping Algorithm

62

Minimize: $FF_i \oplus FF_j$

Optimal FFs pairing can be solved by minimal cost perfect graph matching.

Can be repeated for groups of size 4, 8, 16 …
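For a handful of flip-flops the optimal pairing can be found by exhaustive search, which makes the objective concrete. The sketch below is not a polynomial-time matching algorithm of the kind a production flow would use; it is a memoized brute-force search that is practical only for small n. It assumes each flip-flop is described by a toggle vector and that the cost of pairing two flip-flops is the number of cycles in which exactly one of them toggles. All names are illustrative.

```python
from functools import lru_cache


def min_cost_pairing(activity):
    """Return (total waste, pairs) of the cheapest perfect pairing.

    activity: list of equal-length 0/1 toggle vectors, one per flip-flop.
    Exhaustive search with memoization; exponential in the number of FFs.
    """
    n = len(activity)
    if n % 2:
        raise ValueError("a perfect pairing needs an even number of flip-flops")

    def waste(i, j):
        # Redundant clock pulses when FF i and FF j share one gated clock.
        return sum(x != y for x, y in zip(activity[i], activity[j]))

    @lru_cache(maxsize=None)
    def solve(free):
        # 'free' is a bitmask of flip-flops that are not yet paired.
        if free == 0:
            return 0, ()
        i = (free & -free).bit_length() - 1      # lowest unpaired index
        rest = free & ~(1 << i)
        best = None
        for j in range(i + 1, n):
            if (rest >> j) & 1:
                cost, pairs = solve(rest & ~(1 << j))
                cand = (cost + waste(i, j), ((i, j),) + pairs)
                if best is None or cand[0] < best[0]:
                    best = cand
        return best

    return solve((1 << n) - 1)
```

With four toggle vectors `(1,0,1,0)`, `(0,1,0,1)`, `(1,0,1,1)` and `(0,1,0,0)`, the search pairs flip-flops 0 with 2 and 1 with 3 for a total waste of 2, whereas pairing 0 with 1 and 2 with 3 would waste 8 pulses.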

63

[Figure: flip-flops $a_1, a_3, a_2, a_7, a_4, a_6, a_5, a_8$ paired by perfect matching into $a_1|a_2$, $a_3|a_4$, $a_7|a_8$ and $a_6|a_5$.]

Is repeated perfect matching optimal ?

64

No! Here is the optimal 4-size grouping

Multi-Bit Flip-Flop

Saving the power of internal CLK drivers

[Figure: two 1-bit flip-flops, each with a master latch, a slave latch and its own internal CLK drivers (D, Q), are merged into one 2-bit flip-flop (D1, D2, Q1, Q2) whose master and slave latches share the internal CLK drivers.]

65

66

67

MBFF should be combined with data-driven CG to maximize energy savings.

Toggling vectors (VCD) are unfortunately not always available.

Data-to-clock toggling ratio (probability) is more often available.

How to utilize it for MBFF optimal grouping?

68

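What is the energy waste in 2-bit MBFF?

$$(1 - p_j)\, p_i + (1 - p_i)\, p_j = p_i + p_j - 2 p_i p_j$$

For $n$ FFs grouped in $n/2$ MBFFs it is

$$\sum_{i=1}^{n} p_i - 2 \sum_{j=1}^{n/2} p_{s_j} p_{t_j}$$

What is the optimal MBFF grouping?

Group the FFs such that $p_1 \le p_2 \le \dots \le p_n$, pairing neighbors in this order.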
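The effect of grouping by toggling probability can be illustrated with a short Python sketch. It assumes the waste of pairing FFs i and j in one 2-bit MBFF is $p_i + p_j - 2 p_i p_j$ (exactly one of the two toggles), and it pairs neighbors after sorting by probability, which maximizes the sum of products $p_s p_t$ and therefore minimizes total waste. Function names and the example probabilities are hypothetical.

```python
def pair_sorted(p):
    """Pair flip-flops that are neighbors in toggling-probability order.

    p: list of data-to-clock toggling probabilities, one per flip-flop.
    Returns a list of index pairs (s, t).
    """
    if len(p) % 2:
        raise ValueError("need an even number of flip-flops")
    order = sorted(range(len(p)), key=lambda i: p[i])
    return [(order[i], order[i + 1]) for i in range(0, len(order), 2)]


def total_waste(p, pairs):
    """Expected wasted clock pulses per cycle, summed over all 2-bit MBFFs."""
    return sum(p[s] + p[t] - 2 * p[s] * p[t] for s, t in pairs)
```

For probabilities 0.1, 0.9, 0.2 and 0.8, sorted pairing groups (0.1, 0.2) and (0.8, 0.9) for a total waste of 0.52, while pairing (0.1, 0.9) and (0.2, 0.8) would waste 1.5.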

Auto-Gated FF

69

Look-Ahead CLK Gating

70

71

Relaxed constraint

Power Saving Per FF

72

FF+CLKX O A FF O

FF+CLK FF O

savedyn

int3

1k

cp c kc c c c

c

p c c c

Which FF to Gate?

73

Results

74

75
