Lazy Logic
Mikko H. Lipasti, Associate Professor
Department of Electrical and Computer Engineering
University of Wisconsin—Madison
http://www.ece.wisc.edu/~pharm
July 6, 2007. Mikko Lipasti, University of Wisconsin. Seminar, University of Toronto.
CMOS History

CMOS has been a faithful servant
- 40+ years since invention
- Tremendous advances: device size, integration level, voltage scaling, yield, manufacturability, reliability
- Nearly 20 years now as the high-performance workhorse
- Result: life has been easy for architects, and ease leads to complacency and laziness
CMOS Futures

"The reports of my demise are greatly exaggerated." – Mark Twain
CMOS has some life left in it
- Device scaling will continue
- What comes after CMOS?
Many new challenges
- Process variability
- Device reliability
- Leakage power
- Dynamic power (focus of this talk)
Dynamic Power

Static CMOS: current flows when transistors switch
- Combinational logic evaluates new inputs
- Flip-flop or latch captures the new value (clock edge)
Terms
- C: capacitance of the circuit (wire length, number and size of transistors)
- V: supply voltage
- A: activity factor
- f: frequency

    P_dyn = k * sum over units i of (C_i * V^2 * A_i * f)

Architects can and should focus on C_i x A_i
- Reduce the capacitance of each unit
- Reduce the activity of each unit
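The power equation can be sketched numerically. A minimal sketch; the unit capacitances and activity factors below are made-up illustrative values, not figures from the talk:

```python
def dynamic_power(units, V, f, k=1.0):
    """Sum per-unit dynamic power: k * C_i * V^2 * A_i * f."""
    return k * sum(C * V**2 * A * f for C, A in units)

# Two hypothetical units: (capacitance in farads, activity factor)
units = [(1e-12, 0.5), (2e-12, 0.1)]
p = dynamic_power(units, V=1.0, f=2e9)       # 1 V supply, 2 GHz clock

# The lazy-logic lever: halving each unit's activity halves its
# power contribution, with V and f untouched.
p_lazy = dynamic_power([(C, A / 2) for C, A in units], V=1.0, f=2e9)
assert abs(p_lazy - p / 2) < 1e-15
```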
Design Objective Inversion

Historically, hardware was expensive
- Every gate, wire, cable, and unit mattered
- Squeeze maximum utilization from each
Now, power is expensive
- On-chip devices and wires, not so much
- Should minimize C_i x A_i
- Logic should be simple and infrequently used, both sequential and combinational
=> Lazy Logic
Talk Outline

- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic: circuit-switched coherence, stall-cycle redistribution, dynamic scheduling
- Conclusions
- Research Group Overview
What is Lazy Logic?

A design philosophy with some overall principles
- Minimize unit utilization
- Minimize unit complexity
- OK to increase the number of units/wires/devices, as long as the reduced activity (A_i) compensates; don't forget leakage
Result
- Reject conventional "good ideas"
- Reduce power without loss of performance; sometimes improve performance
Lazy Logic Applications

CMP interconnection networks
- Old: packet-switched, store-and-forward
- New: circuit-switched, reconfigurable
Stall cycle redistribution
- Transparent pipelines want fine-grained stalls
- Redistribute coarse stalls into fine stalls
High-performance dynamic scheduling
- Cycle time goal achieved by replicating ALUs
CMP Interconnection Networks

Options
- Buses don't scale
- Crossbars are too expensive
- Rings are too slow
- Packet-switched mesh: attractive for all the DSM reasons (scalable, low latency, high link utilization)
CMP Interconnection Networks

But cables/traces are now on-chip wires
- Fast, cheap, plentiful
- Short: 1 cycle per hop
Router latency adds up
- 3-4 CPU cycles per hop
- Store-and-forward
- Lots of activity/power
Is this the right answer?
Circuit-switched Interconnects

Communication patterns
- Spatial locality to memory
- Pairwise communication
Circuit-switched links
- Avoid switching/routing
- Reduce latency
- Save power?
Router Design

- Switches can be logically configured to appear as wires (no routing overhead)
- Can also act as a packet-switched network
- Can switch back and forth very easily
- Detailed router design not presented here
[Diagram: 5-port router with N, S, E, W, and processor (P) ports]
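As a rough illustration of the configurable-switch idea (the detailed router design is deferred here), a toy port model: in circuit mode the port behaves like a wire with no buffering or routing activity, in packet mode it buffers and routes each flit. Class and method names are hypothetical:

```python
class SwitchPort:
    """Toy model of one router port: either part of a configured
    circuit (pass-through wire) or a packet-switched hop."""

    def __init__(self):
        self.circuit_out = None   # fixed output port while a circuit is up
        self.buffer = []          # input buffer, used only in packet mode

    def configure_circuit(self, out_port):
        self.circuit_out = out_port

    def teardown_circuit(self):
        self.circuit_out = None

    def receive(self, flit, route_fn):
        if self.circuit_out is not None:
            return self.circuit_out   # wire-like: no buffering, no route computation
        self.buffer.append(flit)      # packet mode: buffer, then route
        return route_fn(flit)

port = SwitchPort()
assert port.receive("flit0", route_fn=lambda f: "east") == "east"
port.configure_circuit("west")
assert port.receive("flit1", route_fn=lambda f: "east") == "west"
assert port.buffer == ["flit0"]   # the circuit-mode flit never touched the buffer
```

The activity saving comes from the circuit branch: no buffer write and no routing function evaluation per flit.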
Dirty Miss Coverage

[Chart: % of dirty misses covered (40%-100%) vs. number of circuit-switched connections per processor (1-15), for SPECjbb, SPECweb, TPC-H, and TPC-W]
Directory Protocol

- Initial 3-hop miss establishes the CS path
Subsequent miss requests
- Sent directly on the CS path to the predicted owner, and in parallel to the home node
- Predicted owner sources data early
- Directory acks the sharing-list update
Benefits
- Reduced 3-hop latency
- Less activity, less power
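The subsequent-miss flow can be sketched as a list of (source, destination, message) tuples; the message names are illustrative, not the protocol's actual message types:

```python
def subsequent_miss(requester, predicted_owner, home):
    """Messages for a miss after a CS path exists (names illustrative)."""
    return [
        (requester, predicted_owner, "data_req_cs"),  # direct, over the CS path
        (requester, home, "dir_req"),                 # in parallel, to the home node
        (predicted_owner, requester, "data_early"),   # predicted owner sources data early
        (home, requester, "dir_ack"),                 # directory acks sharing-list update
    ]

msgs = subsequent_miss("P0", "P2", "Dir3")
assert msgs[0] == ("P0", "P2", "data_req_cs")
assert ("Dir3", "P0", "dir_ack") in msgs
```

The latency win is visible in the first tuple: data is requested from the owner directly instead of waiting for a 3-hop directory indirection.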
Circuit-switched Performance

[Chart: normalized cycle count (0-1.2) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, and Radiosity, comparing Base, Fully Connected/Oracle, Limit 1/Oracle, and Limit 1/Region Prediction]
Link Activity

[Chart: normalized link activity (0%-100%) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, and Radiosity, comparing Limit 1/Oracle and Limit 1/Region Prediction]
Buffer Activity

[Chart: normalized input buffer activity (0%-100%) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, and Radiosity, comparing Limit 1/Oracle and Limit 1/Region Prediction]
Circuit-switched Coherence Summary

Reconfigurable interconnect with circuit-switched links
- Some performance benefit
- Substantial reduction in activity
Current status (slides are out of date)
- Router design and physical/area models
- Protocol tuning and tweaks
- Initial results in a CA Letters paper
Talk Outline

- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic: circuit-switched coherence, stall-cycle redistribution, dynamic scheduling
- Conclusions
- Research Group Overview
April 21, 2023. Eric L. Hill – Preliminary Exam.
Pipeline Clocking Revisited

[Diagram: pipeline carrying work units A and B; two units of work, 10 clock pulses]
- Latches are clocked to propagate data
- Conventional pipeline clock gating: each valid work unit gets clocked into each latch
- This is needlessly conservative
Transparent Pipeline Gating

[Diagram: pipeline carrying work units A and B; two units of work, 5 clock pulses]
Transparent pipelining: a novel approach to clocking [Jacobsen 2004, 2005]
- Both master and slave latches can remain transparent
- Gating logic ensures no races; pipeline registers are clocked lazily, only when a race occurs
- Quite effective for low-utilization pipelines: gaps between valid work units enable transparent mode
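A toy pulse-count model of this comparison, under the simplifying assumption that a latch must be clocked only to separate back-to-back valid items; the real gating logic in [Jacobsen 2004, 2005] is per-stage and more subtle:

```python
def conventional_pulses(items, n_latches):
    """Conventional clock gating: every latch is pulsed once per valid item."""
    return sum(items) * n_latches

def transparent_pulses(items, n_latches):
    """Transparent mode: pulse latches only when two valid items are
    back-to-back and would race through transparent latches."""
    races = sum(1 for a, b in zip(items, items[1:]) if a and b)
    return races * n_latches

stream = [1, 0, 1, 0]                      # two items with a 1-cycle gap
assert conventional_pulses(stream, 5) == 10
assert transparent_pulses(stream, 5) == 0  # gaps: latches stay transparent
assert transparent_pulses([1, 1], 5) == 5  # back-to-back: must clock
```

This also shows why a 1-cycle gap already provides most of the benefit: a single bubble between items removes the race entirely in this toy model.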
Applications

Best suited for low-utilization pipelines
- E.g., FP and media-processing functional units
High-utilization pipelines see the least benefit
- E.g., instruction fetch pipelines
To benefit from the transparent approach, valid data items need fine-grained gaps (stalls); a 1-cycle gap provides the lion's share (50%)
Application: Front-end Pipelines

Provide the back-end with a sufficient supply of instructions to find ILP
- High branch prediction accuracy
- Low instruction cache miss rates
- Little opportunity for clock gating: designed to feed peak demand
- Poor match for transparent pipeline gating
In-Order Execution Model

In-order cores
- Power efficient, low design complexity, throughput oriented
- CMP systems trending towards simple cores (e.g., Sun Niagara)
Data dependences cause fine-grained stalls at dispatch
- Can we project these back to fetch?
- Exploit fetch slack
Pipeline Diagram

[Diagram: front-end pipeline with branch predictor (PC, bpred update), instruction fetch, issue buffer, and clock vector feeding the execution core]
Available Fetch Slack

[Chart: fraction of instruction groups observed (0-1) with fetch slack of 0 through 7+ cycles]
Implementation

- Stall cycle bits embedded in the BTB; EPIC ISAs (IA-64) could use stop bits
- Verify predictions by observing unperturbed groups: let high-confidence groups periodically execute unperturbed and observe any overall increase in execution time
- Modeled a Cell PPU-like PowerPC core with aggressive clock gating
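The redistribution idea can be sketched as follows: given a predicted dispatch-stall count per instruction group (e.g., from bits stored in the BTB), fetch inserts that many fine-grained bubbles ahead of the group, so the transparent front-end sees 1-cycle gaps instead of a burst followed by a coarse stall. A simplified sketch, not the actual hardware:

```python
def redistribute(groups):
    """groups: list of (group_id, predicted_dispatch_stall_cycles).
    Returns a fetch schedule with the stalls spread as bubbles (None)
    ahead of each group, exploiting the available fetch slack."""
    schedule = []
    for gid, stall in groups:
        schedule.extend([None] * stall)  # fine-grained bubbles at fetch
        schedule.append(gid)
    return schedule

sched = redistribute([("g0", 0), ("g1", 2), ("g2", 1)])
assert sched == ["g0", None, None, "g1", None, "g2"]
```

Total cycles are unchanged (the stalls would have happened at dispatch anyway); only their position moves, which is why performance is preserved.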
Latch Activity Reduction

[Chart: normalized latch activity factor (0-1.2) per benchmark, comparing scr and scr+tcg]
FE Energy-Delay Product

[Chart: normalized front-end energy-delay product (J*s, 0-1.2), broken into fe_latch, bpred, and icache components, for base, scr, and scr+tpg]
Stall Cycle Redistribution Summary [ISLPED 2006]

- Transparent pipelines reduce latch activity, but are not effective in pipelines with coarse-grained stalls (e.g., fetch)
- Coarse-grained stalls can be redistributed without affecting performance (fetch slack)
Benefits
- Equivalent performance, lower power
- A transparent fetch pipeline is now attractive
Talk Outline

- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic: circuit-switched coherence, stall-cycle redistribution, dynamic scheduling
- Conclusions
- Research Group Overview
A Brief Scheduler Overview

[Diagrams: pipeline evolution from an atomic schedule/execute stage (Fetch, Decode, Sched/Exe, Writeback, Commit) to a pipelined scheduler (Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback, Commit) with a wakeup/select loop, and finally to speculative wakeup/select, where speculatively issued instructions must be re-scheduled when a latency is mispredicted, e.g., an invalid input value when a load's latency changes]

Data capture vs. non-data capture schedulers, and speculative scheduling
- A data capture scheduler is desirable for many reasons, but its cycle time is not competitive because of data-path delay
- Current machines use speculative scheduling instead
- Misscheduled/replayed instructions burn power: depending on the recovery policy, up to 17% of issued instructions need to replay
Slicing the Core

Bitslice the core: narrow (16b) and wide (64b)
- The narrow core can be full data capture
- Still makes aggressive cycle time (with lazy logic)
- Completely nonspeculative, virtually no replays
- Further power benefits (not in this talk)
[Diagram: front-end, bitsliced OoO core, back-end]
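The bitsliced computation can be illustrated with a 16-bit narrow add whose carry-out is handed to the wide slice. This is a sketch of the arithmetic split only, not the actual datapath:

```python
NARROW = 16
NMASK = (1 << NARROW) - 1

def narrow_add(a, b):
    """Narrow-slice add: returns (low 16 result bits, carry out).
    The low bits alone are enough to steer scheduling decisions."""
    s = (a & NMASK) + (b & NMASK)
    return s & NMASK, s >> NARROW

def wide_add(a, b, carry_in):
    """Wide-slice add of the upper bits, consuming the narrow carry."""
    return ((a >> NARROW) + (b >> NARROW) + carry_in) << NARROW

a, b = 0x0001FFFF, 0x00000001          # carry propagates across the slice
low, carry = narrow_add(a, b)
assert (wide_add(a, b, carry) | low) == a + b
```

The narrow result is available a slice earlier; the wide core only has to finish the upper bits if anyone needs them.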
Dynamic Scheduling with Partial Operand Values

Narrow core
- Computes the partial operand
- Determines load latency
- Avoids misscheduling
Wide core
- Computes the rest of the operand (if needed)
[Diagram: pipeline with Fetch, Decode, Sched & Narrow Exe, Dispatch, RF, Exe, Writeback/Recover, Commit; wakeup/select happens in the narrow slice, and the rest of the data follows in the wide slice]
Scheduler w/ Narrow Data-Path

Non-data capture scheduler critical loop:
- Select - mux - tag broadcast & compare - ready write
[Diagram (a): scheduler entry with ROB ID, Tag1/Data1, Tag2/Data2, comparators, select logic, and destination tags; outputs feed the int ALU, LSQ, cache, and adder, and on to the wide data path]
Naive narrow data capture scheduler:
- Select - mux - tag broadcast & compare - ready write
- Select - mux - narrow ALU - data broadcast - data write
- Increased cycle time
Scheduler w/ Embedded ALUs

[Diagram (b): scheduler entry with ROB ID, Tag1/Data1/R, Tag2/Data2/R, comparators, muxes, embedded int ALUs, a latch, and select logic; outputs feed the int ALU, LSQ, and cache, and on to the wide data path]
With embedded ALUs:
- Select - mux - tag broadcast & compare - ready write
- Max(select, data broadcast - mux - narrow ALU) - mux - latch setup
Lazy Logic: replicated ALUs, low utilization, off the critical delay path
Cycle Time, Area, Energy

32 entries, implemented in Verilog; synthesized using Synopsys Design Compiler and LSI Logic's gflxp 0.11um library

Scheduler                     Cycle Time (ns)   Area (mm2)   Energy (nJ)
Full-Data Capture             2.04              1.43         1.54
Non-Data Capture              1.28              1.53         1.48
Narrow-Data Capture w/ ALUs   1.28              1.49         1.46
Narrow-Data Capture           1.71              1.98         1.40
Dynamic Scheduling Summary

Benefits [JILP 2007]
- Saves 25-30% of total OoO window energy, i.e., 12-18% of total dynamic chip power
- Reduces misspeculated loads by 75%-80%
- Slightly improved IPC, comparable cycle time
Enabled by lazy narrow ALUs
- ALUs are cheap, so compute in parallel with the scheduling select logic
Talk Outline

- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic: circuit-switched coherence, stall-cycle redistribution, dynamic scheduling
- Conclusions
- Research Group Overview
Conclusions

Lazy Logic: a promising new design philosophy
Some overall principles
- Minimize unit utilization
- Minimize unit complexity
- OK to increase the number of units/wires/devices
Initial results
- Circuit-switched CMP interconnects
- Stall cycle redistribution
- Dynamic scheduling
Who Are We?

Faculty: Mikko Lipasti
Current Ph.D. students:
- Profligate execution: Gordie Bell (joining IBM in 2006)
- Coarse-grained coherence: Jason Cantin (joining IBM in 2006)
- Lazy Logic: circuit-switched coherence (Natalie Enright), stall cycle redistribution (Eric Hill), dynamic scheduling (Erika Gunadi)
- Dynamic code optimization: Lixin Su
- SMT/CMP scheduling and resource allocation: Dana Vantrease
Pharmed out:
- IBM: Trey Cain, Brian Mestan
- AMD: Kevin Lepak
- Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
- Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
Research Group Overview

Faculty: Mikko Lipasti, since 1999
Current MS/PhD students
- Gordie Bell, Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
Graduates, current employment:
- AMD: Kevin Lepak
- IBM: Trey Cain, Jason Cantin, Brian Mestan
- Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
- Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
Current Focus Areas

Multiprocessors
- Coherence protocol optimization
- Interconnection network design
- Fairness issues in hierarchical systems
Microprocessor design
- Complexity-effective microarchitecture
- Scalable dynamic scheduling hardware
- Speculation reduction for power savings
- Transparent clock gating
- Domain-specific ISA extensions
Software
- Java Virtual Machine run-time optimization
- Workload development and characterization
Funding

IBM
- Faculty Partnership Awards
- Shared University Research equipment
Intel
- Research council support
- Equipment donations
National Science Foundation
- CSA, ITR, NGS, CPA
- CAREER Award
Schneider ECE Faculty Fellowship
UW Graduate School
Questions?
http://www.ece.wisc.edu/~pharm
Backup slides
Technology Parameters

- 65 nm technology generation
- 16 tiled processors, each approximately 4 mm x 4 mm
- A signal can travel approximately 4 mm per cycle
- Circuit-switched interconnect consists of 5 mm unidirectional links
Broadcast Protocol

- Broadcast to all nodes; establish a circuit-switched path with the owner of the data
- Future broadcasts use the circuit-switched path to reduce power
- Predict when the CS path will suffice
- Use LRU information to tear down old paths when resources need to be claimed by a new path
Switch Design (from paper)

[Diagram: switch with N, S, E, W ports, a processor port, an input buffer, and per-port configuration memories (CM)]
Race Example (from paper, 1 of 2)

[Diagram: processors P0, P1, P2 and directory Dir3; message sequence: 1a. CS Req, 1b. CS Notify, 2. Upgrade, 3., 4. CS Resp (S), 5. Invalidate, 6. Inval Resp, 7. Downgrade]
Race Example (2 of 2)

[Diagram: processors P0, P1, P2 and directory Dir3; message sequence: 1a. CS Req, 1b. CS Notify, 2. Upgrade, 3., 4a. CS Resp (S), 4b. Nack, 5. Invalidate, 6. Inval Resp]
LRU Pairs for Dirty Misses

23 or fewer pairs capture >80% of dirty misses for 3 out of 4 benchmarks (16p)
[Chart: cumulative % of dirty misses (0%-100%) vs. number of LRU pairs (1-235), for SPECjbb, SPECweb, TPC-H, and TPC-W]
Local LRU Pairs

2 circuit-switched paths per processor cover between 55% and 85% of dirty misses
[Chart: miss rate with local LRU (0%-70%) vs. number of paths per processor (1-15), for SPECjbb, SPECweb, TPC-H, and TPC-W]
Concurrent Links

- 5 concurrent links cover 90% of the necessary pairs
- Captures 50%-77% of the overall opportunity
[Chart: coverage with 2 circuit-switched paths per processor (0%-110%) vs. number of concurrent links (1-9), for SPECjbb, SPECweb, TPC-H, and TPC-W]
Experimental Setup

PHARMsim, with an activity-based power model based on Wattch added
- In-order issue; 4/2/2 fetch/issue/commit (based on the Cell PPU)
- 10-stage transparent front-end pipeline (conventional latches at the endpoints)
- Gshare (8K-entry) branch predictor; 1024-set, 4-way BTB
- 32KB I/D caches (1/4), 512KB L2 cache (12)
- 4 confidence bits, >4 high-confidence threshold, predictions checked randomly 10% of the time
- Benchmarks simulated for 250M instructions
Branch Predictor Activity

[Chart: normalized branch predictor activity (0-1.2) per benchmark, comparing scr_extra and normal]
Related Work

- Removing wrong-path instructions [Manne 1998]
- Flow-based throttling techniques [Baniasadi 2001, Karkhanis 2002]
Future Work

- Explore the performance of other fetch gating schemes with transparent pipelining
- Explore dependence-driven gating on an Itanium machine model
- Explore latch soft-error vulnerability (TVF) when lazy clocking is used
- Explore the change in AVF when fetch gating is used (less ACE state in flight)
Scheduling Replay Example

[Diagram: dependence chain LD -> ADD -> OR -> AND -> BR flowing through Sched, Disp, RF, Exe, Retire; the cache miss resolves after dependents have issued, so ALL instructions in the load shadow are invalidated and replayed]

Squashing/non-selective replay (Alpha 21264)
- Replays all dependent and independent instructions issued under the load shadow
- Analogous to squashing recovery on a branch misprediction
- Simple, but a high performance penalty: independent instructions are unnecessarily replayed
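A minimal sketch of the squashing policy described above; the shadow window length is an assumed parameter, and dependence information is deliberately ignored, which is exactly what makes this scheme simple but wasteful:

```python
def squashing_replay(issued, load_issue_cycle, shadow):
    """issued: list of (inst, issue_cycle). Replay every instruction
    issued within `shadow` cycles after the load, dependent or not."""
    return [inst for inst, cycle in issued
            if load_issue_cycle < cycle <= load_issue_cycle + shadow]

issued = [("LD", 0), ("ADD", 1), ("OR", 1), ("AND", 2), ("BR", 2)]
# AND and BR may be independent of LD, but they replay anyway.
assert squashing_replay(issued, 0, 2) == ["ADD", "OR", "AND", "BR"]
```

A selective scheme would filter this list by dependence on the load; the nonspeculative narrow-core scheduler avoids the replay entirely.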
Narrow Core

Narrow scheduler
- Captures partial operands
- Determines load latency (hit/miss)
Narrow data-path
- Narrow ALU provides partial data to consumers
- Narrow LSQ and partial-tag cache find the only possible load data source
Uses the least significant 16 bits
- Large enough to help predict load latency
- Small enough to achieve fast cycle time
L/S Disambiguation & Partial Tag Matching

Exploits operand significance [Brooks et al. 1999, Canal et al. 2000]
- Load/store disambiguation: 10 bits find 99% of matching stores
- Partial tag match: 16 bits give 97% (mcf) to 99% (bzip2) accuracy
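A toy illustration of partial tag matching with the low 16 bits: the partial comparison usually agrees with the full one, with rare false matches (the accuracies quoted on the slide). The addresses below are made up:

```python
PARTIAL_BITS = 16
MASK = (1 << PARTIAL_BITS) - 1

def partial_match(addr_tag, stored_tag):
    """Cheap comparison on the low 16 tag bits only."""
    return (addr_tag & MASK) == (stored_tag & MASK)

def full_match(addr_tag, stored_tag):
    """The full-width comparison the wide bank performs later."""
    return addr_tag == stored_tag

assert partial_match(0x5555AAAA, 0x5555AAAA)    # true match agrees
assert partial_match(0xABCD1234, 0x00001234)    # rare false partial match
assert not full_match(0xABCD1234, 0x00001234)   # caught by the full compare
```

The false-match case is why the wide bank still verifies the full tag; the narrow match only needs to be accurate enough to schedule against.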
Outline

- Motivation
- Dynamic scheduling with narrow values
- Scheduler with narrow data-path
- Pipelined data cache
- Pipeline integration
- Implementation and experiments
- Conclusions and future work
Dynamic Scheduling with Partial Operands

- Stores a subset of operands in the scheduler
- Exploits partial operand knowledge: load-store disambiguation, partial tag match
[Diagram: front-end, OoO core, back-end]
Pipelined Cache w/ Early Bits

[Diagram: cache split into a narrow bank (tag/data subarrays, comparator, muxes, partial bits to the narrow data path) and a wide bank (tag/data arrays, comparator, muxes, full bits to the wide data path), with row and subarray decoders and pipeline latches spanning the Disp1, Disp2, and Agen stages]

- Narrow bank for partial access, wide bank for the rest
- Uses partial tag match in the narrow bank; saves power in the wide bank
- Hides the wide-bank latency by starting it early
Narrow LSQ

- Stores partial addresses of stores, used for partial load-store disambiguation
- Accessed in parallel with the narrow bank
- Saves power in the wide LSQ: cheaper direct-mapped access rather than fully associative search
Pipeline Integration

[Diagram: pipeline with Fetch, Decode, Rename, Queue, Sched, and Disp stages feeding partial load, int ALU, mult/div, and agen/cache units, then WB and Commit]
- Simple ALU instructions link dependences in back-to-back cycles
- Complex ALU instructions link dependences non-speculatively
- Load instructions need another cycle to schedule dependences
Pipelined Data Cache & LSQ

Modeled using modified CACTI 3.0; configuration: 16KB, 4-way, 64B blocks

                                  Conventional        Pipelined
                                  Data Cache          Data Cache
Access Latency - Narrow Bank      N/A                 0.80 ns
Access Latency - Wide Bank        1.24 ns             0.60 ns
Total Energy (Cache + LSQ)        (0.62 + 0.11) nJ    (0.37 + 0.08) nJ
Total Area                        (1.21 + 0.40) mm2   (1.50 + 0.40) mm2
Experiments

SimpleScalar / Alpha 3.0 tool set
Machine model
- 64-entry ROB; 4-wide fetch/issue/commit
- 16-entry SQ, 16-entry LQ; 32-entry scheduler
- 13-stage pipeline
- 64KB I-cache (2-cycle), 16KB D-cache (2-cycle)
- 2-cycle store-to-load forwarding
Energy Dissipation

On average, narrow-capture scheduling consumes 25% less energy than non-data-capture scheduling
[Chart: total energy (normalized, 0-1) for bzip2, mcf, parser, vpr, and avg, comparing narrow_refetch, narrow_squash, squash, and parallel_selective]
Mispredicted Load Instructions

Reduces misspeculated loads by 75%-80%
[Chart: number of misscheduled load instructions (millions, 0-14) for bzip2, mcf, parser, and vpr, broken into miss-forward, store no-data, misaligned store, cache alias, and cache miss]
Optimized Model

- Uses the refetch replay scheme to reduce replay complexity
- Clears scheduler entries once instructions are issued: decreases scheduler occupancy, so instructions enter the OoO window sooner
- Reduces L1 cache latency from 2 cycles to 1 cycle
Optimized Model Performance

Small variations; always performs as well or better
[Chart: speedup (0.5-2) for bzip2, mcf, parser, vpr, and avg, comparing improved narrow_refetch, narrow_refetch, narrow_squash, squash, and selective]
Future Work

- Implement a more accurate dynamic power model
- Study custom design vs. the synthesized model
- Study opportunities for leakage power reduction
Delay Model

Processor 0 can reach processor 15 in 9 fewer cycles
[Tables: per-tile cycle-count grids for the circuit-switched interconnect vs. the baseline store-and-forward mesh]
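A back-of-envelope check on the delay model; the per-hop costs and circuit setup overhead below are assumptions chosen to be consistent with the slide's 9-cycle figure, not exact values from the original tables:

```python
def mesh_latency(hops, per_hop=3):
    """Store-and-forward mesh: every hop pays wire plus router latency
    (assumed 3 cycles/hop here)."""
    return hops * per_hop

def circuit_latency(hops, per_hop=1, setup=3):
    """Configured circuit: ~1 cycle per hop plus an assumed fixed
    path overhead."""
    return hops * per_hop + setup

# Corner to corner on a 4x4 tiled CMP (processor 0 to 15): 6 hops
assert mesh_latency(6) - circuit_latency(6) == 9   # 9 cycles saved
```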
Pipeline Unrolling