CS 152 Computer Architecture and Engineering, Lecture 3: Metrics (cs152/sp14/lecnotes/lec2-1.pdf)


TRANSCRIPT

Page 1:

2014-1-28 John Lazzaro

(not a prof - “John” is always OK)

CS 152 Computer Architecture and Engineering

www-inst.eecs.berkeley.edu/~cs152/

TA: Eric Love

Lecture 3 – Metrics

Page 2:

Topics for today’s lecture

Metrics: Estimating the “goodness” of a CPU design ... so that we can redesign the CPU to be “better”.

A case study in microcode control: the Motorola 68000, the CPU that powered the original Macintosh. [See Lecture 5 slides for this topic.]

Short Break.

Administrivia: Will announce office hours soon ...

Page 3:

Todd Hamilton, iWatch concept.

On the drawing board ...

Page 4:

Todd Hamilton, iWatch concept.

Gray-scale computer graphics model.

Page 5:

Todd Hamilton, iWatch concept.

Color computer graphics model ...

Page 6:

Todd Hamilton, iWatch concept.

Animated model ...

Then the baton is passed to us. We use models to do stepwise refinement of the silicon that powers the consumer product.

Page 7:

Four metrics:

Performance

Execution time of a program.

Cost

How many dollars to manufacture.

Energy

Joules required to execute a program.

Time to Market

Will we ship a product before our competitors?

Performance is today's focus; cost, energy, and time to market are for later lectures.

Page 8:

Performance Measurement (as seen by the customer)

Page 9:

Who (sensibly) upgrades CPUs often? A professional who turns CPU cycles into money, and who is cycle-limited.

Artist tool: animation, video special effects.

Page 10:

How to decide to buy a new machine?

Measure After Effects “execution time” on a representative render “workload”

“Night flight”

City map and clouds computed "on the fly" with fractals. CPU intensive; trivial I/O.

(still shot from the movie)

Page 11:

Interpreting Execution Time

Performance = 1 / Execution Time

PowerBook G4, 1.25 GHz: Execution Time = 1265 seconds, so Performance = 2.85 renders/hour.

The 1.5 GHz PB (Y) is N times faster than the 1.25 GHz PB (X). N is ?

N = Performance (Y) / Performance (X) = Execution Time (X) / Execution Time (Y) = 1.19

PB 1.5 GHz: 3.4 renders/hour. PB 1.25 GHz: 2.85 renders/hour. Might make the difference in meeting a deadline ...
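A small Python sketch of this arithmetic (the 1265-second render time and the renders/hour figures come from the slide; the function name is just illustrative):

def renders_per_hour(execution_time_s):
    # Performance = 1 / Execution Time, scaled here to renders per hour.
    return 3600.0 / execution_time_s

perf_x = renders_per_hour(1265.0)    # PowerBook G4 1.25 GHz -> ~2.85 renders/hour
perf_y = 3.4                         # PowerBook G4 1.5 GHz, from the slide
print(f"N = {perf_y / perf_x:.2f}")  # ~1.19: Y is about 1.19x faster than X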

Page 12:

Execution Time: time for one job to complete

2 CPUs: Execution Time vs Throughput

Throughput: # of independent jobs/hour completed

However, the G5 and Opteron systems may have the same throughput.

Assume the G5 MP's execution time is faster because AE isn't parallelized on the Opteron CPUs.

1.8x faster. Implies parallel code on a Mac.

2 CPUs vs 1 CPU, otherwise similar.

Page 13:

Performance Measurement (as seen by a CPU designer)

Q. Why do we care about After Effects' performance? A. We want the CPU we are designing to run it well!

Page 14:

Step 1: Analyze the right measurement!

CPU Time: Time the CPU spends running the program under measurement.

Response Time:

Total time: CPU Time + time spent waiting (for disk, I/O, ...).

CPU Time guides CPU design; Response Time guides system design.

Measuring CPU time (Unix):
% time <program name>
25.77u 0.72s 0:29.17 90.8%
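A rough Python analogue of that measurement, hedged as a sketch (process_time approximates the user+system CPU time; perf_counter approximates the elapsed response time):

import time

start_wall = time.perf_counter()   # response time: includes waiting for disk, I/O, ...
start_cpu = time.process_time()    # CPU time: only time spent running this process

total = sum(i * i for i in range(2_000_000))   # stand-in for the program under measurement

cpu_s = time.process_time() - start_cpu
wall_s = time.perf_counter() - start_wall
print(f"CPU time {cpu_s:.2f}s, response time {wall_s:.2f}s, {100 * cpu_s / wall_s:.0f}% busy")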

Page 15:

CPU time: Proportional to Instruction Count

CPU Time / Program  ∝  Machine Instructions / Program

Q. Static count? (lines of program printout) Or dynamic count? (trace of execution) A. Dynamic.

Rationale: Every additional instruction you execute takes time.

Q. How does an architect influence the number of machine instructions needed to run an algorithm? A. Create new instructions: instruction set architect.

Q. Once the ISA is set, who can influence instruction count? A. Compiler writer, application developer.

Page 16:

CPU time: Proportional to Clock Period

Time / Program  ∝  Time / One Clock Period

Q. What ultimately limits an architect's ability to reduce the clock period?

Q. How can architects (not technologists) reduce clock period?

We will revisit these questions later in lecture ...

Rationale: We measure each instruction's execution time in "number of cycles". By shortening the period of each cycle, we shorten execution time.

Page 17:

Completing the performance equation

Seconds / Program  =  (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

We need all three terms, and only these terms, to compute CPU Time!

Q. When is it OK to compare clock rates?

What factors make different programs have different CPIs? Instruction mix varies.

Cache behavior varies.

Branch prediction varies.

"CPI" -- the average number of clock Cycles Per Instruction for the program (the Cycles/Instruction term).

A. When other RHS terms are equal.
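A minimal Python sketch of the equation (the numbers below are illustrative, not from the slides):

def cpu_time_s(instructions, cpi, clock_period_s):
    # Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
    return instructions * cpi * clock_period_s

print(cpu_time_s(instructions=2e9, cpi=1.5, clock_period_s=1 / 1.25e9))  # 2.4 seconds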

Page 18:

Consider Lecture 2 single-cycle CPU ...

All instructions take 1 cycle to execute every time they run.

CPI of any program running on machine? 1.0

“average CPI for the program” is a more-useful concept for more complicated machines ...

Page 19:

Recall Lecture 2: Multi-flow VLIW CPU

Two 32-bit instruction slots, each: opcode | rs | rt | rd | shamt | funct

Syntax: ADD $8 $9 $10   Semantics: $8 = $9 + $10
Syntax: ADD $7 $8 $9    Semantics: $7 = $8 + $9

N x 32-bit VLIW yields factor of N speedup! Multiflow: N = 7, 14, or 28 (3 CPUs in product family)

Seconds / Program  =  (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

Q. Which right-hand-side term decreases with "N"?

A. Instructions/Program is the term that gets smaller. We hope the other terms don't grow.

Page 20:

Consider machine with a data cache ...

Seconds / Program  =  (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

A program's load instructions "stride" through every memory address. The cache never "hits", so every load goes to DRAM (100x slower than loads that go to cache).

Thus, the average number of cycles for load instructions is higher for this program, so the average number of cycles for all instructions (Cycles/Instruction) is higher for this program.

Thus, the program (Seconds/Program) takes longer to run!
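To make the effect concrete, a hedged sketch using the standard miss-penalty formula (DRAM at roughly 100x the cache-hit latency, per the slide; the 2-cycle cache-hit load cost is an assumed number):

def load_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    # average cycles per load = hit cost + miss rate x extra cycles paid on a miss
    return hit_cycles + miss_rate * miss_penalty_cycles

print(load_cycles(2, 0.0, 200))   # cache-friendly loads: 2 cycles each
print(load_cycles(2, 1.0, 200))   # "striding" loads that always miss: 202 cycles each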

Page 21:

CPI as an analytical tool to guide design

Cycles per instruction class (throughput, not latency): Multiply 5, Other ALU 1, Load 2, Store 2, Branch 2.

Program instruction mix: Multiply 30%, Other ALU 20%, Load 20%, Store 10%, Branch 20%.

Machine CPI:
(5 × 30 + 1 × 20 + 2 × 20 + 2 × 10 + 2 × 20) / 100 = 2.7 cycles/instruction

Where the program spends its time: Multiply 56%, Other ALU 7% (20/270), Load 15%, Store 7%, Branch 15%.

Now we know how to optimize the design ...
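The same computation as a Python sketch, using the slide's cycle counts and instruction mix:

cycles = {"Multiply": 5, "Other ALU": 1, "Load": 2, "Store": 2, "Branch": 2}
mix    = {"Multiply": 0.30, "Other ALU": 0.20, "Load": 0.20, "Store": 0.10, "Branch": 0.20}

machine_cpi = sum(cycles[k] * mix[k] for k in cycles)   # 2.7 cycles/instruction
print(f"machine CPI = {machine_cpi:.2f}")
for k in cycles:
    share = cycles[k] * mix[k] / machine_cpi            # fraction of execution time
    print(f"{k:10s} {100 * share:4.1f}% of time")       # Multiply ~56%, Load ~15%, ...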

Page 22:

Final thoughts: Performance Equation

Seconds / Program  =  (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

Goal is to optimize execution time, not individual equation terms.

Cycles/Instruction: the CPI of the program. Reflects the program's instruction mix; machines are optimized with respect to program workloads.

Seconds/Cycle: the clock period. Optimize jointly with machine CPI.

Page 23:

Invented the "one ISA, many implementations" business model.

Page 24:

Amdahl’s Law (of Diminishing Returns)

If enhancement "E" makes multiply infinitely fast, but other instructions are unchanged, what is the maximum speedup "S"?

Where the program spends its time: Multiply 52%, Load 16%, Branch 16%, and the remaining two classes 8% each.

S = 1 / ((post-enhancement %) / 100%) = 1 / (48% / 100%) = 2.08

Attributed to Gene Amdahl -- “Amdahl’s Law”

What is the lesson of Amdahl’s Law? Must enhance computers in a balanced way!
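A hedged Python sketch of the law (the 52% multiply share is the slide's figure; the 4x case is an added illustration):

def amdahl_speedup(enhanced_fraction, enhancement_factor=float("inf")):
    # S = 1 / ((1 - f) + f / k): only the enhanced fraction f gets faster by k.
    return 1.0 / ((1.0 - enhanced_fraction) + enhanced_fraction / enhancement_factor)

print(f"{amdahl_speedup(0.52):.2f}")       # multiply made infinitely fast: ~2.08
print(f"{amdahl_speedup(0.52, 4.0):.2f}")  # a merely 4x faster multiplier: ~1.64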

Page 25:

A Program We Wish To Run On N CPUs: Serial 30%, Parallel 70%.

The program spends 30% of its time running code that cannot be recoded to run in parallel.

S = 1 / ((30% + (70% / N)) / 100%)

CPUs:    2     3     4     5     ∞
Speedup: 1.54  1.85  2.1   2.3   3.3

Amdahl’s Law in Action

[Plot: speedup S vs. number of CPUs, rising toward the S(∞) asymptote.]
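A short Python sketch that reproduces the table above (tiny rounding differences from the slide's figures are expected):

def parallel_speedup(serial_fraction, n_cpus):
    # S = 1 / (serial + parallel / N), with serial = 30% and parallel = 70% here.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

for n in (2, 3, 4, 5):
    print(f"{n} CPUs: {parallel_speedup(0.30, n):.2f}")                # 1.54, 1.88, 2.11, 2.27
print(f"infinite CPUs: {parallel_speedup(0.30, float('inf')):.2f}")    # 3.33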

Page 26:

Real-world 2006: 2 CPUs vs 4 CPUs

20-inch iMac: Core 2 Duo, 2.16 GHz, $1500.

Mac Pro: 2 dual-core Xeons, 2.66 GHz, $3200 w/ 20-inch display.

Page 27:

Real-world 2006: 2 CPUs vs 4 CPUs. iMac: 2 cores on one die. Mac Pro: 4 cores on two dies.

Caveat: Mac Pro CPUs are server-class and have architectural advantages (better I/O, ECC DRAM, etc.).

ZIPing a file: very difficult to parallelize.

Simple audio and video tasks: easier to parallelize.

Amdahl’s Law + Real-World Legacy Code Issues in action. Source: MACWORLD

Page 28:

Break

Page 29:

Timing

Page 30:

CPU time: Proportional to Clock Period

Time / Program  ∝  Time / One Clock Period

Q. What ultimately limits an architect's ability to reduce the clock period?

Q. How can architects (not technologists) reduce clock period?

In this part of the lecture we answer these questions ...

Rationale: We measure each instruction's execution time in "number of cycles". By shortening the period of each cycle, we shorten execution time.

Page 31:

Goal: Determine minimum clock period

[Figure: the Lecture 2 single-cycle datapath — PC and +4 logic, instruction memory, register file (rs1, rs2, ws, wd, rd1, rd2, RegWr), sign extender, ALU, data memory (Addr, Din, Dout, MemWr), and the control lines (RegDest, ExtOp, ALUsrc, ALUctr, MemToReg, PCSrc, Equal) driven by combinational logic from the op field, all clocked by Clk.]

Page 32:

A Logic Circuit Primer

“Models should be as simple as possible, but no simpler ...” Albert Einstein.

Page 33:

Inverters: A simple transistor model

(Pasted slides: ©UCB Spring 2004, CS152 / Kubiatowicz, Lec3.5-3.8 and Lec3.29-3.32.)

Design refinement: Informal System Requirement → Initial Specification → Intermediate Specification → Final Architectural Description → Intermediate Specification of Implementation → Final Internal Specification → Physical Implementation, with increasing level of detail at each refinement step.

Logic components: Wires carry signals from one point to another (a single bit, or a multi-bit bus with a size label). Combinational logic is like function evaluation: data goes in, results come out after some propagation delay. Flip-flops are storage elements: after a clock edge, the input is copied to the output; otherwise the flip-flop holds its value. (A "latch" is a storage element that is level-triggered.) These are the elements of the design zoo.

Basic combinational elements + DeMorgan equivalence: Wire: Out = In. Inverter: Out = NOT In. NAND gate: Out = NOT(A AND B). NOR gate: Out = NOT(A OR B). DeMorgan's Theorem: NOT(A + B) = NOT A • NOT B, and NOT(A • B) = NOT A + NOT B.

Delay model, CMOS. General C/L cell delay model: a combinational cell (symbol) is fully specified by its functional (input → output) behavior (truth table, logic equation, VHDL), the load factor of each input, and the critical propagation delay from each input to each output for each transition: THL(A, o) = Fixed Internal Delay + Load-Dependent Delay × Load. The linear model composes. [Figure: delay from Va to Vout plotted against output load Cout — a straight line whose slope is the delay per unit load and whose intercept is the internal delay.]

Basic technology: CMOS (Complementary Metal Oxide Semiconductor), built from NMOS (N-Type) and PMOS (P-Type) transistors. NMOS transistor: apply a HIGH (Vdd) to its gate and it turns into a "conductor"; apply a LOW (GND) to its gate and the conduction path shuts off. PMOS transistor: apply a HIGH (Vdd) to its gate and the conduction path shuts off; apply a LOW (GND) to its gate and it turns into a "conductor". (Here Vdd = 5 V, GND = 0 V.)

Basic components: the CMOS inverter — a PMOS transistor from Vdd to Out and an NMOS transistor from Out to GND, with both gates driven by In. Inverter operation: with In = Vdd ("1"), the pFET is open and the nFET discharges Out to "0"; with In = GND ("0"), the nFET is open and the pFET charges Out up to Vdd ("1"). pFET: a switch, "on" if its gate is grounded. nFET: a switch, "on" if its gate is at Vdd.

Correctly predicts logic output for simple static CMOS circuits.

Extensions to model subtler circuit families, or to predict timing, have not worked well ...

Page 34:

Transistors as water valves. If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets ... An "on" p-FET fills up the capacitor with charge.


An "on" n-FET empties the bucket.


[Figure: water-level cartoons — the output node charging toward "1" and discharging toward "0" over time.]

This model is often good enough ...

(Cartoon physics)

Page 35:

What is the bucket? A gate’s “fan-out”.

Driving other gates slows a gate down.

[Figure: gate switching behavior — inverter and NAND gate input/output waveforms. (EECS150 Spring 2003, Lec10-Timing)]

Driving wires slows a gate down.

“Fan-out”: The number of gate inputs driven by a gate’s output.

Driving its own parasitics slows a gate down.

Page 36:

Fanout

Page 37:

A closer look at fan-out ...

(Pasted slides: ©UCB Spring 2004, CS152 / Kubiatowicz, Lec3.37-3.40.)

Series connection: for two inverters G1 and G2 in series (G1 driving node V1 with capacitance C1, G2 driving Vout with load Cout), the total propagation delay is the sum of the individual delays, d1 + d2. Capacitance C1 has two components: the capacitance of the wire connecting the two gates, and the input capacitance of the second inverter. [Figure: Vin, V1, Vout waveforms crossing Vdd/2 after delays d1 and d2.]

Calculating aggregate delays: sum delays along serial paths. With G1 driving both G2 (to V2) and G3 (to V3), Delay(Vin → V2) ≠ Delay(Vin → V3): Delay(Vin → V2) = Delay(Vin → V1) + Delay(V1 → V2), and Delay(Vin → V3) = Delay(Vin → V1) + Delay(V1 → V3). The critical path is the longest among the N parallel paths. Here C1 = Wire C + Cin of Gate 2 + Cin of Gate 3.

Characterize a gate: specify the input capacitance of each input and, for each input-to-output path and each output transition type (H→L, L→H, H→Z, L→Z, ...), an internal delay (ns) and a load-dependent delay (ns/fF). Example, 2-input NAND gate: for A and B, Input Load (I.L.) = 61 fF; for either A → Out or B → Out, Tlh = 0.5 ns, Tlhf = 0.0021 ns/fF, Thl = 0.1 ns, Thlf = 0.0020 ns/fF. [Figure: delay of A → Out (Out: low → high) vs. Cout — intercept 0.5 ns, slope 0.0021 ns/fF.]

A specific example, 2-to-1 MUX: Y = (A and !S) or (B and S), built from an inverter and NAND gates (Gates 1-3, Wires 0-2). Input load: A, B: I.L.(NAND) = 61 fF; S: I.L.(INV) + I.L.(NAND) = 50 fF + 61 fF = 111 fF. Load-dependent delay (same as Gate 3): TAYlhf = 0.0021 ns/fF, TAYhlf = 0.0020 ns/fF; TBYlhf = 0.0021 ns/fF, TBYhlf = 0.0020 ns/fF; TSYlhf = 0.0021 ns/fF, TSYhlf = 0.0020 ns/fF.

The linear model works for reasonable fan-out.
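A Python sketch of that linear model, using the 2-input NAND numbers from the pasted slides (0.5 ns internal low-to-high delay, 0.0021 ns/fF load-dependent delay, 61 fF per driven input); the 30 fF wire capacitance is an assumed, illustrative figure:

def gate_delay_ns(internal_ns, slope_ns_per_fF, load_fF):
    # delay = fixed internal delay + load-dependent delay x output load
    return internal_ns + slope_ns_per_fF * load_fF

NAND_INPUT_LOAD_fF = 61.0
WIRE_fF = 30.0   # assumed wire capacitance, not from the slides
for fanout in (1, 2, 4, 8):
    load = fanout * NAND_INPUT_LOAD_fF + WIRE_fF
    print(f"fan-out {fanout}: {gate_delay_ns(0.5, 0.0021, load):.3f} ns low-to-high")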

Fan-out (EECS150 Spring 2003, Lec10-Timing): the delay of a gate is proportional to its output capacitance, because gates #2 and #3 turn on/off at a later time. (It takes longer for the output of gate #1 to reach the switching threshold of gates #2 and #3 as we add more output capacitance.)

Delay time of an inverter driving 4 inverters. FO4: fan-out of four delay.

Driving more gates adds delay.


Page 38:

Propagation delay graphs ...

[Figure: cascaded gates (EECS150 Spring 2003, Lec10-Timing) — Vin and Vout waveforms for a chain of inverters; a 1→0 transition at the input produces a 0→1 transition at the output after the propagation delay, with the inverter transfer function setting the switching threshold.]

Page 39:

Worst-case delay through combinational logic

Fan-in (EECS150 Spring 2003, Lec10-Timing): What is the delay in this circuit? The critical path is the path with the maximum delay, from any input to any output. (In general, we include register set-up and clk-to-Q times in the critical path calculation.) Why do we care about the critical path?

x = g(a, b, c, d, e, f)

T2 might be the worst-case delay path (critical path). If d going 0-to-1 switches x 0-to-1, the delay is T1. If a going 0-to-1 switches x 0-to-1, the delay is T2.

It would be surprising if T1 > T2.


Page 40:

Why “might”? Wires have delay too ...

Wire delay (EECS150 Spring 2003, Lec10-Timing): Even in those cases where the transmission-line effect is negligible, wires possess distributed resistance and capacitance, and the time constant associated with the distributed RC is proportional to the square of the length. For short wires on ICs, resistance is insignificant (relative to the effective R of transistors), but C is important: typically around half of the C of the gate load is in the wires. For long wires on ICs (busses, clock lines, global control signals, etc.), resistance is significant, so the distributed RC effect dominates; signals are typically "rebuffered" to reduce delay. [Figure: waveforms v1, v2, v3, v4 along a rebuffered wire vs. time.]


Looks benign, but ...

Page 41:

Clocked Logic Circuits

Page 42:

From Delay Models to Timing Analysis

[Pasted page: IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001 — process SEM cross-section and pipeline organization of a high-clock-rate, low-power ARM-family microprocessor.]

Example: parallel-to-serial converter (EECS150 Spring 2003, Lec10-Timing), with inputs a and b, clocked by clk. For correct operation:

T ≥ time(clk→Q) + time(mux) + time(setup), i.e. T ≥ τclk→Q + τmux + τsetup

Clock frequency f vs. period T: 1 MHz → 1 μs, 10 MHz → 100 ns, 100 MHz → 10 ns, 1 GHz → 1 ns.

Timing Analysis: What is the smallest T that produces correct operation?
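A minimal Python sketch of that question, assuming the delay terms from the slide (the numeric values are illustrative only):

def min_clock_period_ns(t_clk_to_q_ns, t_logic_ns, t_setup_ns):
    # T >= t_clk->Q + t_combinational_logic + t_setup for correct operation.
    return t_clk_to_q_ns + t_logic_ns + t_setup_ns

T = min_clock_period_ns(t_clk_to_q_ns=0.2, t_logic_ns=1.4, t_setup_ns=0.1)
print(f"T >= {T:.1f} ns  ->  f <= {1e3 / T:.0f} MHz")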

Page 43:


Timing Analysis and Logic Delay

If our clock period T > worst-case delay through CL, does this ensure correct operation?


(Pasted slides: ©UCB Spring 2004, CS152 / Kubiatowicz, Lec3.9-3.12.)

General C/L cell delay model: a combinational cell (symbol) is fully specified by its functional (input → output) behavior (truth table, logic equation, VHDL), the input load factor of each input, and the propagation delay from each input to each output for each transition: THL(A, o) = Fixed Internal Delay + Load-Dependent Delay × Load. The linear model composes.

Storage element's timing model (D flip-flop clocked by Clk): Setup time — the input must be stable BEFORE the trigger clock edge. Hold time — the input must REMAIN stable after the trigger clock edge. Clock-to-Q time — the output cannot change instantaneously at the trigger clock edge; like the delay of logic gates it has two components, an internal clock-to-Q and a load-dependent clock-to-Q.

Clocking methodology: all storage elements are clocked by the same clock edge. For the combinational logic blocks, inputs are updated at each clock tick, and all outputs MUST be stable before the next clock tick.

Critical path and cycle time: the critical path is the slowest path between any two storage devices. Cycle time is a function of the critical path and must be greater than: Clock-to-Q + Longest Path through Combinational Logic + Setup. (A register is an array of flip-flops.)


Page 44:

Flip Flops have internal delays ...

D flip-flop: input D, output Q, clocked by CLK. The value of D is sampled on the positive clock edge; Q outputs the sampled value for the rest of the cycle. [Waveform: D must be stable for t_setup before the clock edge; Q changes t_clk-to-Q after the edge.]

Page 45:

Flip-Flop delays eat into the "time budget"


ALU “time budget”

General model of a synchronous circuit (EECS150 Spring 2003, Lec10-Timing): registers and combinational logic (CL) alternate, with a clock, inputs, outputs, and an optional feedback path. In general, for correct operation,

T ≥ time(clk→Q) + time(CL) + time(setup), i.e. T ≥ τclk→Q + τCL + τsetup

for all paths. How do we enumerate all paths? Any circuit input or register output to any register input or circuit output; the "setup time" for circuit outputs depends on what they connect to, and the "clk→Q time" for circuit inputs depends on where they come from.


Page 46:

Clock skew also eats into “time budget”

Clock skew (EECS150 Spring 2003, Lec10-Timing): with skew (delay in the clock distribution) between CLK and CLK', a clock period of exactly T = TCL + Tsetup + Tclk→Q will fail. Therefore: (1) control clock skew — careful clock distribution, equalizing the path delay from the clock source to all clock loads by controlling wire delay and buffer delay, and don't "gate" clocks; and (2) budget T ≥ TCL + Tsetup + Tclk→Q + worst-case skew. Most modern large high-performance chips (microprocessors) control end-to-end clock skew to a few tenths of a nanosecond.

With the clock buffer reversed, clock skew actually provides extra time (it adds to the effective clock period). This effect has been used to help run circuits at higher clock rates. Risky business!

As T → 0, which circuit fails first?
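A small Python sketch of that budget (the delay values are illustrative; only the form of the inequality comes from the slide):

def min_period_ns(t_cl_ns, t_setup_ns, t_clk_to_q_ns, worst_case_skew_ns=0.0):
    # T >= T_CL + T_setup + T_clk->Q + worst-case skew
    return t_cl_ns + t_setup_ns + t_clk_to_q_ns + worst_case_skew_ns

print(min_period_ns(1.4, 0.1, 0.2))        # 1.7 ns without skew
print(min_period_ns(1.4, 0.1, 0.2, 0.3))   # 2.0 ns with a few tenths of a ns of skew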


Page 47:

UC Regents Fall 2006 © UCB CS 152 L5: Timing

Clocks have dedicated wires (low skew)

Spartan-3 FPGA Family: Functional Description

www.xilinx.com, DS099-2 (v1.3), August 24, 2004, Preliminary Product Specification

width of the die. In turn, the horizontal spine branches out into a subsidiary clock interconnect that accesses the CLBs.

2. The clock input of either DCM on the same side of the die — top or bottom — as the BUFGMUX element in use.

A Global clock input is placed in a design using either a BUFGMUX element or the BUFG (Global Clock Buffer) element. For the purpose of minimizing the dynamic power dissipation of the clock network, the Xilinx development software automatically disables all clock line segments that a design does not use.

Figure 18: Spartan-3 Clock Network (Top View)

[Figure 18 residue: the clock network is drawn as top and bottom spines feeding a horizontal spine across the die, with four DCMs, BUFGMUX elements, and global clock inputs GCLK0-GCLK7 distributed on 4- and 8-wide clock line bundles.]

From: Xilinx Spartan 3 data sheet. Virtex is similar.

“Clock tree”

Flip flop clock inputs are the “leaves” of the “tree”.
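As a rough sketch of why the tree structure keeps skew low (delays invented, not Xilinx or IBM data): skew is just the spread of root-to-leaf insertion delays, so the distribution network is tuned to make those sums match.

# Each node carries its own buffer/wire delay plus its children; leaves are flip-flop clock pins.
tree = {
    "root":      {"delay_ps": 0.0,   "children": ["spine_top", "spine_bot"]},
    "spine_top": {"delay_ps": 120.0, "children": ["ff_a", "ff_b"]},
    "spine_bot": {"delay_ps": 150.0, "children": ["ff_c", "ff_d"]},
    "ff_a": {"delay_ps": 40.0, "children": []},
    "ff_b": {"delay_ps": 45.0, "children": []},
    "ff_c": {"delay_ps": 10.0, "children": []},
    "ff_d": {"delay_ps": 20.0, "children": []},
}

def leaf_delays(node, acc=0.0):
    acc += tree[node]["delay_ps"]
    if not tree[node]["children"]:
        yield node, acc
    for child in tree[node]["children"]:
        yield from leaf_delays(child, acc)

arrival = dict(leaf_delays("root"))
print(arrival)
print("skew =", max(arrival.values()) - min(arrival.values()), "ps")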

Page 48:

Gold wires form clock tree.

Die photo: Xilinx Virtex Pro

Page 49:

UC Regents Fall 2013 © UCB CS 250 L3: Timing

the total wire delay is similar to the total buffer delay. A patented tuning algorithm [16] was required to tune the more than 2000 tunable transmission lines in these sector trees to achieve low skew, visualized as the flatness of the grid in the 3D visualizations. Figure 8 visualizes four of the 64 sector trees containing about 125 tuned wires driving 1/16th of the clock grid. While symmetric H-trees were desired, silicon and wiring blockages often forced more complex tree structures, as shown. Figure 8 also shows how the longer wires are split into multiple-fingered transmission lines interspersed with Vdd and ground shields (not shown) for better inductance control [17, 18]. This strategy of tunable trees driving a single grid results in low skew among any of the 15 200 clock pins on the chip, regardless of proximity.

From the global clock grid, a hierarchy of short clock routes completed the connection from the grid down to the individual local clock buffer inputs in the macros. These clock routing segments included wires at the macro level from the macro clock pins to the input of the local clock buffer, wires at the unit level from the macro clock pins to the unit clock pins, and wires at the chip level from the unit clock pins to the clock grid.

Design methodology and results

This clock-distribution design method allows a highly productive combination of top-down and bottom-up design perspectives, proceeding in parallel and meeting at the single clock grid, which is designed very early. The trees driving the grid are designed top-down, with the maximum wire widths contracted for them. Once the contract for the grid had been determined, designers were insulated from changes to the grid, allowing necessary adjustments to the grid to be made for minimizing clock skew even at a very late stage in the design process. The macro, unit, and chip clock wiring proceeded bottom-up, with point tools at each hierarchical level (e.g., macro, unit, core, and chip) using contracted wiring to form each segment of the total clock wiring. At the macro level, short clock routes connected the macro clock pins to the local clock buffers. These wires were kept very short, and duplication of existing higher-level clock routes was avoided by allowing the use of multiple clock pins. At the unit level, clock routing was handled by a special tool, which connected the macro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed

Figure 6: Schematic diagram of global clock generation and distribution. [Figure residue: PLL with bypass, reference clock in/out, clock distribution, clock out.]

Figure 7: 3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines. [Axis residue: delay axis showing the grid, tuned sector trees, sector buffers, buffer level 2, buffer level 1.]

Figure 8: Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale.


Clock Tree Delays, IBM "Power" CPU

Page 50:

UC Regents Fall 2013 © UCB CS 250 L3: Timing


Clock Tree Delays, IBM Power

clock grid was completed with a tool run at the chip level, connecting unit-level pins to the grid. At this point, the clock tuning and the bottom-up clock routing process still have a great deal of flexibility to respond rapidly to even late changes. Repeated practice routing and tuning were performed by a small, focused global clock team as the clock pins and buffer placements evolved to guarantee feasibility and speed the design process.

Measurements of jitter and skew can be carried out using the I/Os on the chip. In addition, approximately 100 top-metal probe pads were included for direct probing of the global clock grid and buffers. Results on actual POWER4 microprocessor chips show long-distance skews ranging from 20 ps to 40 ps (cf. Figure 9). This is improved from early test-chip hardware, which showed as much as 70 ps skew from across-chip channel-length variations [19]. Detailed waveforms at the input and output of each global clock buffer were also measured and compared with simulation to verify the specialized modeling used to design the clock grid. Good agreement was found. Thus, we have achieved a "correct-by-design" clock-distribution methodology. It is based on our design experience and measurements from a series of increasingly fast, complex server microprocessors. This method results in a high-quality global clock without having to use feedback or adjustment circuitry to control skews.

Circuit design

The cycle-time target for the processor was set early in the project and played a fundamental role in defining the pipeline structure and shaping all aspects of the circuit design as implementation proceeded. Early on, critical timing paths through the processor were simulated in detail in order to verify the feasibility of the design point and to help structure the pipeline for maximum performance. Based on this early work, the goal for the rest of the circuit design was to match the performance set during these early studies, with custom design techniques for most of the dataflow macros and logic synthesis for most of the control logic—an approach similar to that used previously [20]. Special circuit-analysis and modeling techniques were used throughout the design in order to allow full exploitation of all of the benefits of the IBM advanced SOI technology.

The sheer size of the chip, its complexity, and the number of transistors placed some important constraints on the design which could not be ignored in the push to meet the aggressive cycle-time target on schedule. These constraints led to the adoption of a primarily static-circuit design strategy, with dynamic circuits used only sparingly in SRAMs and other critical regions of the processor core. Power dissipation was a significant concern, and it was a key factor in the decision to adopt a predominantly static-circuit design approach. In addition, the SOI technology,

including uncertainties associated with the modeling of the floating-body effect [21–23] and its impact on noise immunity [22, 24–27] and overall chip decoupling capacitance requirements [26], was another factor behind the choice of a primarily static design style. Finally, the size and logical complexity of the chip posed risks to meeting the schedule; choosing a simple, robust circuit style helped to minimize overall risk to the project schedule with most efficient use of CAD tool and design resources. The size and complexity of the chip also required rigorous testability guidelines, requiring almost all cycle boundary latches to be LSSD-compatible for maximum dc and ac test coverage.

Another important circuit design constraint was the limit placed on signal slew rates. A global slew rate limit equal to one third of the cycle time was set and enforced for all signals (local and global) across the whole chip. The goal was to ensure a robust design, minimizing the effects of coupled noise on chip timing and also minimizing the effects of wiring-process variability on overall path delay. Nets with poor slew also were found to be more sensitive to device process variations and modeling uncertainties, even where long wires and RC delays were not significant factors. The general philosophy was that chip cycle-time goals also had to include the slew-limit targets; it was understood from the beginning that the real hardware would function at the desired cycle time only if the slew-limit targets were also met.

The following sections describe how these design constraints were met without sacrificing cycle time. The latch design is described first, including a description of the local clocking scheme and clock controls. Then the circuit design styles are discussed, including a description

Figure 9: Global clock waveforms showing 20 ps of measured skew. [Axis residue: volts (V) versus time (ps).]


Page 51:

UC Regents Fall 2013 © UCB CS 250 L3: Timing

Some Flip Flops have “hold” time ...

[Figure: a timing diagram showing the setup (t_setup) and hold (t_hold) window around the CLK edge during which D must stay stable, and a circuit in which a D flip-flop's Q output drives an inverter (delay t_inv) whose output feeds back to the flip-flop's D input.]

Does flip-flop hold time affect operation of this circuit? Under what conditions?

What is the intended function of this circuit?

For correct operation: t_clk-to-Q + t_inv > t_hold
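A minimal sketch of that hold check with invented numbers (the circuit is commonly read as a divide-by-two toggle, since Q is inverted back into D every cycle): the new Q value races back through the inverter and must not reach D before the hold window closes.

T_CLK_TO_Q_PS = 40.0  # assumed
T_HOLD_PS = 60.0      # assumed (chosen large to expose a failure case)

def hold_met(t_inv_ps):
    # The slide's condition for correct operation: t_clk-to-Q + t_inv > t_hold
    return T_CLK_TO_Q_PS + t_inv_ps > T_HOLD_PS

print(hold_met(t_inv_ps=30.0))  # True: 40 + 30 > 60
print(hold_met(t_inv_ps=10.0))  # False: a very fast inverter violates hold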

Page 52:

UC Regents Fall 2008 © UCB CS 194-6 L6: Timing

[Figure: single-cycle MIPS datapath under inspection for its critical path: PC register, instruction memory, register file (rs1, rs2, ws, rd1, rd2, wd), sign extender, ALU, data memory, PC+4 and branch adders, with control lines (RegDest, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg, RegWr, PCSrc, Equal); everything between the Clk-edge-triggered state elements is combinational logic.]

Searching for processor critical path
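A hedged sketch of what that search means in practice: enumerate register-to-register paths, sum the combinational delays along each, and take the maximum. Every block delay below is invented for illustration; none is a measurement of this datapath.

# Invented combinational delays (ps) for the blocks a path may cross.
block_ps = {"imem": 500, "regfile_read": 300, "mux": 50, "alu": 450,
            "dmem": 600, "adder": 250, "regfile_write_setup": 150}

# Candidate register-to-register paths, named by launch/capture point.
paths = {
    "PC -> PC (branch)": ["imem", "regfile_read", "mux", "adder", "mux"],
    "PC -> RegFile (ALU op)": ["imem", "regfile_read", "mux", "alu", "mux",
                               "regfile_write_setup"],
    "PC -> RegFile (load)": ["imem", "regfile_read", "mux", "alu", "dmem",
                             "mux", "regfile_write_setup"],
}

delay_ps = {name: sum(block_ps[b] for b in blocks) for name, blocks in paths.items()}
critical = max(delay_ps, key=delay_ps.get)
print(delay_ps)
print("critical path:", critical, "=", delay_ps[critical], "ps")

Under these made-up numbers the load path dominates, which matches the usual intuition for a single-cycle machine.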

Page 53:

UC Regents Fall 2013 © UCB CS 250 L3: Timing

Searching for processor critical path

IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001, p. 1600

Fig. 1. Process SEM cross section.

The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-

Timing Analysis: What is the smallest T that produces correct operation? Must consider all connected register pairs.

Q. Why might I suspect this one? A. Very long wire on the path.
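Why does a very long wire raise suspicion? A back-of-the-envelope sketch (single-segment approximation, per-mm values invented): both wire resistance and capacitance grow with length, so the wire's own RC term grows roughly with the square of the length.

R_PER_MM_OHM = 200.0  # assumed wire resistance per mm
C_PER_MM_FF = 200.0   # assumed wire capacitance per mm
R_DRIVER_OHM = 500.0  # assumed driver output resistance

def wire_delay_ps(length_mm):
    r_wire = R_PER_MM_OHM * length_mm
    c_wire_f = C_PER_MM_FF * length_mm * 1e-15  # fF -> F
    # Driver charges all of the wire cap; the distributed wire itself adds ~0.5*R*C.
    return (R_DRIVER_OHM * c_wire_f + 0.5 * r_wire * c_wire_f) * 1e12  # s -> ps

for mm in (1, 2, 5, 10):
    print(f"{mm:2d} mm wire: ~{wire_delay_ps(mm):.0f} ps")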

Page 54:

UC Regents Fall 2013 © UCB CS 250 L3: Timing

Combinational paths for IBM Power 4 CPU

From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

netlist. Of these, 121 713 were top-level chip global nets, and 21 711 were processor-core-level global nets. Against this model 3.5 million setup checks were performed in late mode at points where clock signals met data signals in latches or dynamic circuits. The total number of timing checks of all types performed in each chip run was 9.8 million. Depending on the configuration of the timing run and the mix of actual versus estimated design data, the amount of real memory required was in the range of 12 GB to 14 GB, with run times of about 5 to 6 hours to the start of timing-report generation on an RS/6000* Model S80 configured with 64 GB of real memory. Approximately half of this time was taken up by reading in the netlist, timing rules, and extracted RC networks, as well as building and initializing the internal data structures for the timing model. The actual static timing analysis typically took 2.5–3 hours. Generation of the entire complement of reports and analysis required an additional 5 to 6 hours to complete. A total of 1.9 GB of timing reports and analysis were generated from each chip timing run. This data was broken down, analyzed, and organized by processor core and GPS, individual unit, and, in the case of timing contracts, by unit and macro. This was one component of the 24-hour-turnaround time achieved for the chip-integration design cycle. Figure 26 shows the results of iterating this process: a histogram of the final nominal path delays obtained from static timing for the POWER4 processor.

The POWER4 design includes LBIST and ABIST (Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testing on pre-final POWER4 chips revealed that several circuit macros ran slower than predicted from static timing. The speed of the critical paths in these macros was increased in the final design. Typical fast ac LBIST laboratory test results measured on POWER4 after these paths were improved are shown in Figure 27.

Summary

The 174-million-transistor >1.3-GHz POWER4 chip, containing two microprocessor cores and an on-chip memory subsystem, is a large, complex, high-frequency chip designed by a multi-site design team. The performance and schedule goals set at the beginning of the project were met successfully. This paper describes the circuit and physical design of POWER4, emphasizing aspects that were important to the project's success in the areas of design methodology, clock distribution, circuits, power, integration, and timing.

Figure 25

POWER4 timing flow. This process was iterated daily during the physical design phase to close timing.

[Figure 25 residue: the daily flow runs extraction (Dracula, Harmony) on core or chip wiring, feeds Chipbench/EinsTimer for non-uplift timing, noise impact on timing, uplift analysis, and capacitance adjustment, and produces timer files, asserts, and reports, with per-step turnaround times of less than 12 to 48 hours. Notes: executed 2–3 months prior to tape-out, on fully extracted data from routed designs, with hierarchical extraction, custom logic handled separately, and extraction done for early and late.]

Figure 26

Histogram of the POWER4 processor path delays.

[Axis residue: late-mode timing checks (thousands, 0–200) versus timing slack (ps, −40 to 280).]


Most wires have hundreds of picoseconds to spare; the critical path is the check with the least slack, at the left edge of the histogram.
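A hedged sketch of what sits behind each bar of that histogram (arrival and required times invented): a static timing tool computes slack = required time - arrival time for every check, and the critical path is the one with the smallest slack.

from collections import Counter

# (arrival_ps, required_ps) for a handful of made-up endpoint checks.
checks = {
    "alu_bypass": (980.0, 1000.0),
    "lsu_addr": (940.0, 1000.0),
    "fpu_round": (820.0, 1000.0),
    "issue_select": (995.0, 1000.0),  # smallest slack -> the critical path
}

slack_ps = {name: req - arr for name, (arr, req) in checks.items()}
critical = min(slack_ps, key=slack_ps.get)
print("critical path:", critical, "slack =", slack_ps[critical], "ps")

# Bucket the slacks into 20 ps bins, in the spirit of Figure 26.
histogram = Counter(20 * int(s // 20) for s in slack_ps.values())
print(dict(sorted(histogram.items())))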

Page 55:

UC Regents Fall 2013 © UCB CS 250 L3: Timing

Power 4: Timing Estimation, Closure

Timing Estimation: Predicting a processor's clock rate early in the project.

From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

Page 56:

UC Regents Fall 2013 © UCB CS 250 L3: Timing

Power 4: Timing Estimation, Closure

Timing Closure: Meeting (or exceeding!) the timing estimate.

From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

Page 57:

UC Regents Fall 2013 © UCB CS 250 L3: Timing


Floorplanning: Essential to meet timing. (Intel XScale 80200)

Page 58:

Page 59:

UC Regents Fall 2006 © UCB CS 152 L6: Performance

CPU time: Proportional to Clock Period

Q. What ultimately limits an architect's ability to reduce clock period?

Time/Program ∝ Time/One Clock Period (the clock period)

A. Clock-to-Q, setup times, 2-D floorplanning geometry.

Q. How can architects (not technologists) reduce clock period?

A. Shorten the machine's critical path.

Rationale: We measure each instruction's execution time in "number of cycles". By shortening the period for each cycle, we shorten execution time.
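A small worked illustration of that rationale (cycle count and periods invented, and the cycle count is held fixed as the period shrinks): execution time is cycles times clock period, so shortening the period shortens execution time proportionally.

CYCLES = 2_000_000_000  # assumed total cycles for some program

def cpu_time_s(period_ns):
    # Execution time = cycle count x clock period.
    return CYCLES * period_ns * 1e-9

for period_ns in (1.25, 1.0, 0.8):  # 800 MHz, 1 GHz, 1.25 GHz clocks
    print(f"T = {period_ns} ns -> {cpu_time_s(period_ns):.2f} s")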

Page 60:

On Thursday

Pipeline design - with enough detail to do a design.

Have fun in section!
