
Lazy Logic

Mikko H. Lipasti
Associate Professor
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
http://www.ece.wisc.edu/~pharm

July 6, 2007. Mikko Lipasti, University of Wisconsin. Seminar, University of Toronto

CMOS History
  CMOS has been a faithful servant
    40+ years since invention
    Tremendous advances
      Device size, integration level
      Voltage scaling
      Yield, manufacturability, reliability
    Nearly 20 years now as high-performance workhorse
  Result: life has been easy for architects
    Ease leads to complacency and laziness

CMOS Futures
  "The reports of my demise are greatly exaggerated." (Mark Twain)
  CMOS has some life left in it
    Device scaling will continue
    What comes after CMOS...
  Many new challenges
    Process variability
    Device reliability
    Leakage power
    Dynamic power (focus of this talk)

Dynamic Power
  Static CMOS: current flows when transistors switch
    Combinational logic evaluates new inputs
    Flip-flop, latch captures new value (clock edge)
  Terms
    C: capacitance of circuit (wire length, number and size of transistors)
    V: supply voltage
    A: activity factor
    f: frequency
  Architects can/should focus on Ci x Ai
    Reduce capacitance of each unit
    Reduce activity of each unit

  $$P_{dyn} = k \sum_{i \in units} C_i V_i^2 A_i f$$
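A minimal sketch of the dynamic power sum, assuming each unit has its own capacitance, activity factor, and supply voltage while all units share one clock frequency (which terms carry the per-unit index is an assumption here). The `Unit` class, the function name, and every number are hypothetical, chosen only to show that halving a unit's activity halves its dynamic power.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    capacitance: float  # C_i, switched capacitance in farads (hypothetical value)
    activity: float     # A_i, fraction of cycles in which the unit switches
    voltage: float      # V_i, supply voltage in volts


def dynamic_power(units, frequency_hz, k=1.0):
    """Return k * f * sum_i(C_i * V_i^2 * A_i) in watts."""
    return k * frequency_hz * sum(
        u.capacitance * u.voltage ** 2 * u.activity for u in units
    )


# Same unit, same clock; only the activity factor differs.
busy = [Unit("alu", 1e-9, 0.50, 1.0)]
lazy = [Unit("alu", 1e-9, 0.25, 1.0)]

p_busy = dynamic_power(busy, 2e9)
p_lazy = dynamic_power(lazy, 2e9)
print(p_busy, p_lazy)
```

Because the sum is linear in Ci x Ai, adding units is acceptable under this model as long as the total of the Ci x Ai products goes down.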

Design Objective Inversion
  Historically, hardware was expensive
    Every gate, wire, cable, unit mattered
    Squeeze maximum utilization from each
  Now, power is expensive
    On-chip devices and wires, not so much
    Should minimize Ci x Ai
  Logic should be simple, infrequently used
    Both sequential and combinational
  Lazy Logic

Talk Outline
  Motivation
  What is Lazy Logic?
  Applications of Lazy Logic
    Circuit-switched coherence
    Stall-cycle redistribution
    Dynamic scheduling
  Conclusions
  Research Group Overview

What is Lazy Logic?
  Design philosophy
  Some overall principles
    Minimize unit utilization
    Minimize unit complexity
    OK to increase number of units/wires/devices
      As long as reduced Ai (activity) compensates
      Don't forget leakage
  Result
    Reject conventional "good ideas"
    Reduce power without loss of performance
    Sometimes improve performance

Lazy Logic Applications
  CMP interconnection networks
    Old: packet-switched, store-and-forward
    New: circuit-switched, reconfigurable
  Stall cycle redistribution
    Transparent pipelines want fine-grained stalls
    Redistribute coarse stalls into fine stalls
  High-performance dynamic scheduling
    Cycle time goal achieved by replicating ALUs

CMP Interconnection Networks
  Options
    Buses don't scale
    Crossbars are too expensive
    Rings are too slow
    Packet-switched mesh
      Attractive for all the DSM reasons
        Scalable
        Low latency
        High link utilization

CMP Interconnection Networks
  But...
    Cables/traces are now on-chip wires
      Fast, cheap, plentiful
      Short: 1 cycle per hop
    Router latency adds up
      3-4 CPU cycles per hop
    Store-and-forward
      Lots of activity/power
  Is this the right answer?

Circuit-switched Interconnects
  Communication patterns
    Spatial locality to memory
    Pairwise communication
  Circuit-switched links
    Avoid switching/routing
    Reduce latency
    Save power?

Router Design
  Switches can be logically configured to appear as wires (no routing overhead)
  Can also act as packet-switched network
  Can switch back and forth very easily
  Detailed router design not presented here
  [Figure: router with ports labeled N, S, E, W, and P]

Dirty Miss Coverage
  [Figure: % of dirty misses covered (y-axis, 40% to 100%) vs. number of circuit-switched connections per processor (x-axis, 1 to 15); series: SPECjbb, SPECweb, TPC-H, TPC-W]

Directory Protocol
  Initial 3-hop miss establishes CS path
  Subsequent miss requests
    Sent directly on CS path to predicted owner
    Also in parallel to home node
    Predicted owner sources data early
    Directory acks update to sharing list
  Benefits
    Reduced 3-hop latency
    Less activity, less power

Circuit-switched Performance
  [Figure: normalized cycle count (y-axis, 0 to 1.2) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, Radiosity; series: Base; Fully connected, Oracle; Limit 1, Oracle; Limit 1, Region Prediction]

Link Activity
  [Figure: normalized link activity (y-axis, 0% to 100%) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, Radiosity; series: Limit 1, Oracle; Limit 1, Region Prediction]

Buffer Activity
  [Figure: normalized input buffer activity (y-axis, 0% to 100%) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, Radiosity; series: Limit 1, Oracle; Limit 1, Region Prediction]

Circuit-switched Coherence Summary
  Reconfigurable interconnect
    Circuit-switched links
    Some performance benefit
    Substantial reduction in activity
  Current status (slides are out of date)
    Router design and physical/area models
    Protocol tuning and tweaks, etc.
    Initial results in CA Letters paper




May 9, 2023 Eric L. Hill – Preliminary Exam 20

Pipeline Clocking Revisited
  [Figure: work units A and B moving through the pipeline latches. Two units of work, 10 clock pulses]
  Latches clocked to propagate data
  Conventional pipeline clock gating
    Each valid work unit gets clocked into each latch
    This is needlessly conservative


Transparent Pipeline Gating
  [Figure: work units A and B moving through the pipeline latches. Two units of work, 5 clock pulses]
  Transparent pipelining: novel approach to clocking [Jacobsen 2004, 2005]
    Both master and slave latch can remain transparent
    Gating logic ensures no races
    Pipeline registers are clocked lazily, only when a race occurs
  Quite effective for low utilization pipelines
    Gaps between valid work units enable transparent mode

Applications
  Best suited for low utilization pipelines
    E.g. FP, media processing functional units
  High utilization pipelines see least benefit
    E.g. instruction fetch pipelines
  To benefit from transparent approach:
    Valid data items need fine-grained gaps (stalls)
    1-cycle gap provides lion's share (50%)
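The difference between coarse and fine-grained gaps can be illustrated with a toy occupancy stream. This is only a sketch under stated assumptions: a stream is modeled as a list of booleans (True for a valid item, False for a stall), and the hypothetical `redistribute` helper spreads the existing stalls evenly between valid items without changing the number of items or the total length. It does not model the latch-level gating logic.

```python
def items_followed_by_gap(stream):
    """Count valid slots whose next slot is a stall."""
    return sum(1 for cur, nxt in zip(stream, stream[1:]) if cur and not nxt)


def redistribute(stream):
    """Spread the stalls of `stream` as evenly as possible after valid slots.

    The number of valid slots and the total length stay unchanged.
    """
    valid = sum(stream)
    stalls = len(stream) - valid
    if valid == 0:
        return list(stream)
    base, extra = divmod(stalls, valid)
    out = []
    for i in range(valid):
        out.append(True)
        out.extend([False] * (base + (1 if i < extra else 0)))
    return out


# Four back-to-back items followed by one coarse four-cycle stall.
coarse = [True] * 4 + [False] * 4
fine = redistribute(coarse)

print(items_followed_by_gap(coarse), items_followed_by_gap(fine))
```

In the coarse stream only the last item has a gap behind it; after redistribution every item does, which is the situation the transparent approach needs.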

Application: Front-end Pipelines
  Provide back-end with sufficient supply of instructions to find ILP
    High branch prediction accuracy
    Low instruction cache miss rates
  Little opportunity for clock gating
    Designed to feed peak demand
  Poor match for transparent pipeline gating

In-Order Execution Model
  In-order cores
    Power efficient
    Low design complexity
    Throughput oriented
  CMP systems trending towards simple cores (e.g. Sun Niagara)
  Data dependences cause fine-grained stalls at dispatch
  Can we project these back to fetch?
    Exploit fetch slack


Pipeline Diagram
  [Figure: front-end pipeline with PC, branch predictor (Bpred) and bpred update, instruction fetch, issue buffer, clock vector, and execution core]

Available Fetch Slack
  [Figure: fraction of instruction groups observed (y-axis, 0.00 to 1.00), broken down by fetch slack of 0, 1, 2, 3, 4, 5, 6, and 7+]

Implementation
  Stall cycle bits embedded in BTB
    EPIC ISAs (IA64) could use stop bits
  Verify prediction by observing unperturbed groups
    Let high confidence groups periodically execute unperturbed
    Observe overall increase in execution time
  Modeled Cell PPU-like PowerPC core with aggressive clock gating
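One way to picture the stall-bit prediction and its periodic verification is the sketch below. Everything in it is a hypothetical illustration rather than the modeled hardware: the table is a plain dictionary keyed by fetch address instead of BTB fields, and the class and method names are invented. The 4-bit confidence counter, the "greater than 4" threshold, and the 10% check rate are taken from the experimental setup listed in the backup slides.

```python
import random


class StallEntry:
    def __init__(self):
        self.stall_cycles = 0  # predicted fetch slack for this instruction group
        self.confidence = 0    # saturating counter, 4 bits: 0..15


class StallPredictor:
    CONF_MAX = 15       # 4 confidence bits
    HIGH_CONF = 4       # confidence > 4 counts as high
    CHECK_RATE = 0.10   # share of high-confidence groups run unperturbed

    def __init__(self, rng=None):
        self.table = {}
        self.rng = rng or random.Random(0)

    def stalls_to_insert(self, pc):
        """Stall cycles to insert at fetch for the group at `pc`."""
        entry = self.table.get(pc)
        if entry is None or entry.confidence <= self.HIGH_CONF:
            return 0  # not confident yet: run unperturbed and observe
        if self.rng.random() < self.CHECK_RATE:
            return 0  # periodic check: run unperturbed
        return entry.stall_cycles

    def update(self, pc, observed_slack):
        """Train on the slack observed while the group ran unperturbed."""
        entry = self.table.setdefault(pc, StallEntry())
        if entry.stall_cycles == observed_slack:
            entry.confidence = min(self.CONF_MAX, entry.confidence + 1)
        else:
            entry.stall_cycles = observed_slack
            entry.confidence = 0


predictor = StallPredictor()
for _ in range(6):
    predictor.update(0x40, observed_slack=2)
print(predictor.table[0x40].confidence)
```

Resetting confidence to zero on any mismatch is a deliberately conservative choice for the sketch; a real design could decrement instead.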

Latch Activity Reduction
  [Figure: normalized latch activity factor (y-axis, 0 to 1.2); series: scr, scr+tcg]

FE Energy Delay Product
  [Figure: normalized front-end energy-delay product (J*s) (y-axis, 0 to 1.2) for base, scr, and scr+tpg; components: fe_latch, bpred, icache]

Stall Cycle Redistribution Summary [ISLPED 2006]
  Transparent pipelines reduce latch activity
  Not effective in pipelines with coarse-grained stalls (e.g. fetch)
  Coarse-grained stalls can be redistributed without affecting performance (fetch slack)
  Benefits
    Equivalent performance, lower power
    Transparent fetch pipeline now attractive

A Brief Scheduler Overview
  [Figure: pipeline timing diagrams]
    Atomic Sched/Exe: Fetch, Decode, Sched/Exe, Writeback, Commit
    Pipelined wakeup/select: Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback, Commit
    Speculative wakeup/select: Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback/Recover, Commit
      Speculatively issued instructions are re-scheduled when latency is mispredicted (latency changed, invalid input value)
  Data capture / non-data capture scheduler
  Speculative scheduling
    Data capture scheduler desirable for many reasons
      Cycle time is not competitive because of data path delay
    Current machines use speculative scheduling
      Misscheduled/replayed instructions burn power
      Depending on recovery policy, up to 17% of issued instructions need to replay

Slicing the Core
  Bitslice the core: narrow (16b) and wide (64b)
  Narrow core can be full data capture
    Still makes aggressive cycle time (with lazy logic)
    Completely nonspeculative, virtually no replays
    Further power benefits (not in this talk)
  [Figure: Front-End, OoO Core, Back-End]

Dynamic Scheduling with Partial Operand Values
  Narrow core
    Computes partial operand
    Determines load latency
    Avoids misscheduling
  Wide core
    Computes the rest of the operand (if needed)
  [Figure: pipeline with wakeup/select: Fetch, Decode, Sched & Nrw Exe, Dispatch, RF, Exe (the rest of the data), Writeback/Recover, Commit]

Scheduler w/ Narrow Data-Path
  Non-data capture scheduler
    Select -> mux -> tag bcast & compare -> ready wr
  Naive narrow data capture scheduler
    Select -> mux -> tag bcast & compare -> ready wr
    Select -> mux -> narrow ALU -> data bcast -> data wr
    Increased cycle time
  [Figure (a): scheduler entries (ROB ID, Data1, Tag1, Data2, Tag2, Dest) with tag comparators and select logic, feeding an Int ALU, LSQ, cache, and adder, then on to the wide data path]

Scheduler w/ Embedded ALUs
  With embedded ALUs
    Select -> mux -> tag bcast & compare -> ready wr
    Max(select, data bcast -> mux -> narrow ALU) -> mux -> latch setup
  Lazy Logic
    Replicated ALUs
    Low utilization
    Off critical delay path
  [Figure (b): scheduler entries (ROB ID, Data1, Tag1, R, Data2, Tag2, R, Dest) with tag comparators, select logic, embedded Int ALUs, muxes (M), and a latch, feeding the LSQ, cache, and wide data path]
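The three critical-path expressions can be compared with a small delay calculator. This is a sketch under stated assumptions: the component delays are hypothetical values in arbitrary units (the slides do not give them), and the cycle time is taken to be the longer of the wakeup path and the data path, which is an interpretation of the listed paths rather than something the slides state.

```python
def non_data_capture(d):
    """Select -> mux -> tag bcast & compare -> ready wr."""
    return d["select"] + d["mux"] + d["tag_bcast_compare"] + d["ready_wr"]


def naive_narrow(d):
    """Wakeup path in parallel with a serial select -> ALU -> broadcast data path."""
    data = (d["select"] + d["mux"] + d["narrow_alu"]
            + d["data_bcast"] + d["data_wr"])
    return max(non_data_capture(d), data)


def embedded_alus(d):
    """Narrow ALUs evaluate in parallel with select, then mux and latch setup."""
    data = (max(d["select"], d["data_bcast"] + d["mux"] + d["narrow_alu"])
            + d["mux"] + d["latch_setup"])
    return max(non_data_capture(d), data)


# Hypothetical component delays in arbitrary units.
delays = {
    "select": 4, "mux": 1, "tag_bcast_compare": 4, "ready_wr": 1,
    "narrow_alu": 3, "data_bcast": 2, "data_wr": 1, "latch_setup": 1,
}

print(non_data_capture(delays), naive_narrow(delays), embedded_alus(delays))
```

With these made-up numbers the naive design lengthens the cycle, while the embedded-ALU design hides the ALU delay behind select and matches the non-data capture scheduler.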

Cycle Time, Area, Energy
  32 entries, implemented using Verilog
  Synthesized using Synopsys Design Compiler and LSI Logic's gflxp 0.11um

  Scheduler                       Cycle Time (ns)
  Non-Data Capture                1.28
  Narrow-Data Capture             1.71
  Narrow-Data Capture w/ ALUs     1.28
  Full-Data Capture               2.04

  Area (mm2): 1.43, 1.53, 1.49, 1.98
  Energy (nJ): 1.54, 1.48, 1.46, 1.40

Dynamic Scheduling Summary
  Benefits: [JILP 2007]
    Save 25-30% of total OoO window energy
      => 12-18% total dynamic chip power
    Reduce misspeculated loads by 75%-80%
    Slightly improved IPC
    Comparable cycle time
  Enabled by:
    Lazy narrow ALUs
    ALUs are cheap, so compute in parallel with scheduling select logic

Conclusions
  Lazy Logic
    Promising new design philosophy
  Some overall principles
    Minimize unit utilization
    Minimize unit complexity
    OK to increase number of units/wires/devices
  Initial results
    Circuit-switched CMP interconnects
    Stall cycle redistribution
    Dynamic scheduling

Who Are We?
  Faculty: Mikko Lipasti
  Current Ph.D. students:
    Profligate execution: Gordie Bell (joining IBM in 2006)
    Coarse-grained coherence: Jason Cantin (joining IBM in 2006)
    Lazy Logic
      Circuit-switched coherence: Natalie Enright
      Stall cycle redistribution: Eric Hill
      Dynamic scheduling: Erika Gunadi
    Dynamic code optimization: Lixin Su
    SMT/CMP scheduling/resource allocation: Dana Vantrease
  Pharmed out:
    IBM: Trey Cain, Brian Mestan
    AMD: Kevin Lepak
    Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
    Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka

Research Group Overview
  Faculty: Mikko Lipasti, since 1999
  Current MS/PhD students
    Gordie Bell, Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
  Graduates, current employment:
    AMD: Kevin Lepak
    IBM: Trey Cain, Jason Cantin, Brian Mestan
    Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
    Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka

Current Focus Areas
  Multiprocessors
    Coherence protocol optimization
    Interconnection network design
    Fairness issues in hierarchical systems
  Microprocessor design
    Complexity-effective microarchitecture
    Scalable dynamic scheduling hardware
    Speculation reduction for power savings
    Transparent clock gating
    Domain-specific ISA extensions
  Software
    Java Virtual Machine run-time optimization
    Workload development and characterization

Funding
  IBM
    Faculty Partnership Awards
    Shared University Research equipment
  Intel
    Research council support
    Equipment donations
  National Science Foundation
    CSA, ITR, NGS, CPA
    Career Award
  Schneider ECE Faculty Fellowship
  UW Graduate School

Questions?
  http://www.ece.wisc.edu/~pharm

Backup slides



Technology Parameters
  65 nm technology generation
  16 tiled processors
    Approximately 4 mm x 4 mm
  Signal can travel approximately 4 mm/cycle
  Circuit-switched interconnect consists of 5 mm unidirectional links

Broadcast Protocol
  Broadcast to all nodes
  Establish circuit-switched path with owner of data
  Future broadcasts will use circuit-switched path to reduce power
    Predict when CS path will suffice
  Use LRU information to tear down old paths when resources need to be claimed by a new path
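The LRU teardown policy can be sketched as a small per-processor table. This is an illustration under assumptions, not the router's actual mechanism: the `PathTable` class and its method names are hypothetical, destinations are plain integers, and the limit of two paths per processor is borrowed from the local LRU results in these backup slides.

```python
from collections import OrderedDict


class PathTable:
    """Tracks one processor's circuit-switched paths, least recently used first."""

    def __init__(self, max_paths=2):
        self.max_paths = max_paths
        self.paths = OrderedDict()  # destination -> True, oldest entry first

    def use(self, dest):
        """Send to `dest`. Returns (hit, torn_down_destination_or_None)."""
        if dest in self.paths:
            self.paths.move_to_end(dest)  # mark as most recently used
            return True, None
        victim = None
        if len(self.paths) >= self.max_paths:
            victim, _ = self.paths.popitem(last=False)  # tear down LRU path
        self.paths[dest] = True
        return False, victim


table = PathTable(max_paths=2)
trace = [3, 7, 3, 9, 3, 7]  # hypothetical sequence of data owners
results = [table.use(dest) for dest in trace]
hits = sum(1 for hit, _ in results if hit)
print(hits, results)
```

A hit means the request can travel on an established path; a miss falls back to the broadcast and claims a path, tearing down the least recently used one if the table is full.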

Switch Design from paper
  [Figure: switch with N, S, E, W, and Processor ports, a buffer, and configuration memories. CM = Configuration Memory]

Race example from paper (1 of 2)
  [Figure: message sequence among P0, P1, P2, and Dir3]
    1a. CS Req
    1b. CS Notify
    2. Upgrade
    3.
    4. CS Resp (S)
    5. Invalidate
    6. Inval Resp
    7. Downgrade

Race example (2 of 2)
  [Figure: message sequence among P0, P1, P2, and Dir3]
    1a. CS Req
    1b. CS Notify
    2. Upgrade
    3.
    4a. CS Resp (S)
    4b. Nack
    5. Invalidate
    6. Inval Resp

LRU pairs for Dirty Misses
  23 or fewer pairs capture >80% of dirty misses for 3 out of 4 benchmarks (16p)
  [Figure: % of dirty misses captured (y-axis, 0% to 100%) vs. number of LRU pairs (x-axis, 1 to 235); series: SPECjbb, SPECweb, TPC-H, TPC-W]

Local LRU pairs
  2 circuit-switched paths per processor cover between 55% and 85% of dirty misses
  [Figure: Miss Rate (Local LRU) (y-axis, 0% to 70%) vs. number of paths per processor (x-axis, 1 to 15); series: SPECjbb, SPECweb, TPC-H, TPC-W]

Concurrent Links
  5 concurrent links cover 90% of necessary pairs
  Captures 50%-77% of overall opportunity
  [Figure: 2 Circuit-Switched Paths per Processor. Coverage (y-axis, 0% to 110%) vs. number of concurrent links (x-axis, 1 to 9); series: SPECjbb, SPECweb, TPC-H, TPC-W]

Experimental Setup
  PHARMsim
    Activity-based power model based on Wattch added
    In-order issue
    4/2/2 fetch/issue/commit (based on Cell PPU)
    10 stage transparent front-end pipeline (conventional latches at endpoints)
    Gshare (8k entry) branch predictor, 1024 set, 4-way BTB
    32KB I/D cache (1/4), 512KB L2 cache (12)
    4 confidence bits / >4 high conf threshold / predictions checked randomly 10% of the time
  Benchmarks simulated for 250M instructions

Branch Predictor Activity
  [Figure: normalized bpred activity (y-axis, 0 to 1.2); series: scr_extra, normal]

Related Work
  Removing wrong path instructions [Manne 1998]
  Flow based throttling techniques [Baniasadi 2001, Karkhanis 2002]

Future Work
  Explore performance of other fetch gating schemes with transparent pipelining
  Explore dependence driven gating on Itanium machine model
  Explore latch soft error vulnerability (TVF) when lazy clocking is used
  Explore change in AVF when fetch gating is used
    Less ACE state in-flight

Scheduling Replay Example
  Squashing/non-selective replay (Alpha 21264)
    Replays all dependent and independent instructions issued under load shadow
    Analogous to squashing recovery in branch misprediction
    Simple but high performance penalty
      Independent instructions are unnecessarily replayed
  [Figure: LD, ADD, OR, AND, BR flowing through Sched, Disp, RF, Exe, Retire. On a cache miss, ALL instructions in the load shadow are invalidated and replayed once the miss is resolved]
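The cost of non-selective replay can be seen by comparing it with a replay that follows only true dependences. This is a sketch under assumptions: the issue cycles, the extent of the load shadow, and the dependence edges among LD, ADD, OR, AND, and BR are all hypothetical, since the slide does not spell them out.

```python
def squash_replay(issued, load_shadow):
    """Non-selective replay: everything issued in the load shadow replays."""
    return [inst for inst, cycle in issued if cycle in load_shadow]


def selective_replay(issued, load_shadow, deps, load="LD"):
    """Replay only shadow instructions that depend, directly or not, on the load."""
    tainted = {load}
    replay = []
    for inst, cycle in issued:  # `issued` is in program order
        if inst != load and tainted & set(deps.get(inst, ())):
            tainted.add(inst)
            if cycle in load_shadow:
                replay.append(inst)
    return replay


# Hypothetical issue cycles and dependences for the slide's instruction names.
issued = [("LD", 0), ("ADD", 1), ("OR", 1), ("AND", 2), ("BR", 2)]
load_shadow = {1, 2}  # cycles in which dependents issue assuming a cache hit
deps = {"ADD": ["LD"], "OR": [], "AND": ["ADD"], "BR": []}

print(squash_replay(issued, load_shadow))
print(selective_replay(issued, load_shadow, deps))
```

Under these assumptions, squashing replays four instructions while only two actually depend on the load, which is the penalty the slide attributes to the simpler scheme.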

Narrow Core
  Narrow scheduler
    Captures partial operands
    Determines load latency (hit/miss)
  Narrow data-path
    Narrow ALU: provides partial data to consumers
    Narrow LSQ and partial tag cache
      Finds only possible load data source
  Uses least significant 16 bits
    Large enough to help predict load latency
    Small enough to achieve fast cycle time

L/S Disambiguation & Partial Tag Matching
  Exploits operand significance [Brooks et al. 1999, Canal et al. 2000]
  Load/store disambiguation
    10 bits finds 99% of matching stores
  Partial tag match
    16 bits for 97% (mcf) - 99% (bzip2) accuracy
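Partial disambiguation can be sketched as a comparison of low-order address bits. The helper names and the example addresses are hypothetical, and access sizes and alignment are ignored. The property the sketch shows is general: a store that matches on the full address always matches on the low bits, so a partial comparison can produce extra candidates but cannot miss the true match.

```python
def partial_match(addr_a, addr_b, bits=10):
    """True if the low `bits` bits of the two addresses are equal."""
    mask = (1 << bits) - 1
    return (addr_a & mask) == (addr_b & mask)


def candidate_stores(load_addr, store_addrs, bits=10):
    """Stores whose low address bits match the load's."""
    return [s for s in store_addrs if partial_match(load_addr, s, bits)]


# Hypothetical addresses: one true match, one alias in the low 10 bits, one miss.
load = 0x12345678
stores = [0x12345678, 0xABCD5678, 0x12345000]

partial = candidate_stores(load, stores, bits=10)
full = candidate_stores(load, stores, bits=32)
print([hex(a) for a in partial], [hex(a) for a in full])
```

Widening the comparison removes aliases such as the second store at the cost of a wider, slower comparator, which is the trade-off behind the 10-bit and 16-bit figures above.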

Outline
  Motivation
  Dynamic Scheduling with Narrow Values
    Scheduler with Narrow Data-Path
    Pipelined Data Cache
    Pipeline Integration
  Implementation and Experiments
  Conclusions and Future Work

Dynamic Scheduling with Partial Operands
  Stores a subset of operands in scheduler
  Exploits partial operand knowledge
    Load-store disambiguation
    Partial tag match
  [Figure: Front-End, OoO Core, Back-End]

Pipelined Cache w/ Early Bits
  [Figure: pipelined cache split into a narrow bank and a wide bank, each with tag and data arrays, row and subarray decoders, comparators, muxes, and latches. Partial bits go to the narrow data path, full bits to the wide data path. Stages: Agen, Disp1, Disp2]
  Narrow bank for partial access, wide bank for the rest
    Uses partial tag match in narrow bank
    Saves power in wide bank
    Hides wide cache bank latency by starting early

Narrow LSQ
  Stores partial addresses of stores
  Used for partial load-store disambiguation
  Accessed in parallel with narrow bank
  Saves power in the wide LSQ
    Cheaper direct-mapped access rather than full associative search

Pipeline Integration
  [Figure: pipeline stages Fetch, Decode, Rename, Queue, Sched, Disp, Disp, followed by Int ALU, Partial Load, Mult/Div (3 stages), Agen, Cache, WB, Commit]
  Simple ALU insts link dependences in back-to-back cycles
  Complex ALU insts link dependences non-speculatively
  Load insts need another cycle to schedule dependences

Pipelined Data Cache & LSQ
  Modeled using modified CACTI 3.0
  Configuration: 16KB, 4-way, 64B blocks

                                            Conventional Data Cache   Pipelined Data Cache
  Access Latency, Narrow Bank               N/A                       0.80 ns
  Access Latency, Wide Bank                 1.24 ns                   0.60 ns
  Total Energy Consumption (Cache + LSQ)    (0.62 + 0.11) nJ          (0.37 + 0.08) nJ
  Total Area                                (1.21 + 0.40) mm2         (1.50 + 0.40) mm2

Experiments
  Simplescalar / Alpha 3.0 tool set
  Machine model
    64-entry ROB
    4-wide fetch/issue/commit
    16-entry SQ, 16-entry LQ
    32-entry scheduler
    13-stage pipeline
    64KB I-Cache (2-cyc), 16KB D-Cache (2-cyc)
    2-cycle store to load forwarding

Energy Dissipation
  On average, narrow data capture scheduling consumes 25% less energy than non-data capture scheduling
  [Figure: total energy (y-axis, 0 to 1) for bzip2, mcf, parser, vpr, avg; series: narrow_refetch, narrow_squash, squash, parallel_selective]

Mispredicted Load Instructions
  Reduce misspeculated loads by 75%-80%
  [Figure: number of misscheduled load instructions in millions (y-axis, 0 to 14) for bzip2, mcf, parser, vpr; categories: miss-forward, store no-data, misalign store, cache alias, cache miss]

Optimized Model
  Using refetch replay scheme to reduce replay complexity
  Clear the scheduler entries once instructions are issued
    Decreases scheduler occupancy
    Instructions enter OoO window sooner
  Reduce L1 cache latency from 2 cycles to 1 cycle

Optimized Model Performance
  Small variations
  Always performs as well or better
  [Figure: speedup (y-axis, 0.5 to 2) for bzip2, mcf, parser, vpr, avg; series: improved narrow_refetch, narrow_refetch, narrow_squash, squash, selective]

Future Work
  Implement a more accurate dynamic power model
  Study custom design vs. synthesized model
  Study opportunities for leakage power reduction

Delay Model
  Processor 0 can reach Processor 15 in 9 fewer cycles
  Cycles from Processor 0 to each tile of the 4 x 4 array

  Circuit-Switched Interconnect
    --   2   3   4
     2   3   4   6
     3   4   6   7
     4   6   7   9

  Baseline Store-and-Forward Mesh
    --   3   6   9
     3   6   9  12
     6   9  12  15
     9  12  15  18
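The baseline numbers on this slide are consistent with a fixed 3 cycles per hop on a 4 x 4 mesh, which the sketch below reproduces. It assumes row-major tile numbering (Processor 0 and Processor 15 in opposite corners) and dimension-ordered routing; both are assumptions, and the function names are hypothetical. The circuit-switched latency of 9 cycles is read from the slide rather than computed.

```python
def manhattan_hops(src, dst, width=4):
    """Hop count between two tiles of a width x width mesh, numbered row-major."""
    src_row, src_col = divmod(src, width)
    dst_row, dst_col = divmod(dst, width)
    return abs(src_row - dst_row) + abs(src_col - dst_col)


def store_and_forward_cycles(src, dst, cycles_per_hop=3, width=4):
    """Latency when every hop pays the full per-hop router cost."""
    return cycles_per_hop * manhattan_hops(src, dst, width)


baseline = store_and_forward_cycles(0, 15)  # 6 hops at 3 cycles each
circuit_switched = 9                        # value read from the slide
savings = baseline - circuit_switched
print(baseline, circuit_switched, savings)
```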

Pipeline Unrolling
