Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin— Madison http://www.ece.wisc.edu/~pharm


Page 1: [PPT]Lazy Logic - PHARM--Computer Architecture Researchpharm.ece.wisc.edu/talks/lazy_logic_toronto_july07.ppt · Web viewApplications of Lazy Logic Circuit-switched coherence Stall-cycle

Lazy Logic
Mikko H. Lipasti
Associate Professor
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
http://www.ece.wisc.edu/~pharm


July 6, 2007 | Mikko Lipasti, University of Wisconsin | Seminar, University of Toronto

CMOS History
  CMOS has been a faithful servant
    40+ years since invention
    Tremendous advances
      Device size, integration level
      Voltage scaling
      Yield, manufacturability, reliability
    Nearly 20 years now as high-performance workhorse
  Result: life has been easy for architects
    Ease leads to complacency & laziness


CMOS Futures
  “The reports of my demise are greatly exaggerated.” (Mark Twain)
  CMOS has some life left in it
    Device scaling will continue
    What comes after CMOS…
  Many new challenges
    Process variability
    Device reliability
    Leakage power
    Dynamic power: focus of this talk


Dynamic Power
  Static CMOS: current flows when transistors switch
    Combinational logic evaluates new inputs
    Flip-flop, latch captures new value (clock edge)
  Terms
    C: capacitance of circuit (wire length, number and size of transistors)
    V: supply voltage
    A: activity factor
    f: frequency
  Architects can/should focus on Ci x Ai
    Reduce capacitance of each unit
    Reduce activity of each unit

  $$P_{dyn} = k \sum_{i \in units} C_i V_i^2 A_i f$$
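
As a rough illustration of the per-unit sum on this slide, the Python sketch below evaluates dynamic power for a list of units and compares the Ci x Ai products of two hypothetical designs. The unit names, capacitances, and activity factors are made-up values, and a single shared supply voltage and clock frequency are assumed for simplicity; none of these numbers come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    capacitance: float  # C_i: switched capacitance in farads (hypothetical)
    activity: float     # A_i: fraction of cycles in which the unit switches


def ca_product(units):
    """Sum of C_i * A_i, the quantity the slide asks architects to minimize."""
    return sum(u.capacitance * u.activity for u in units)


def dynamic_power(units, voltage, frequency, k=1.0):
    """P_dyn = k * sum_i(C_i * V^2 * A_i * f), assuming one shared V and f."""
    return k * voltage ** 2 * frequency * ca_product(units)


# One heavily used unit versus two simpler, rarely used units (made-up values).
busy = [Unit("shared_alu", 2.0e-12, 0.8)]
lazy = [Unit("alu0", 1.5e-12, 0.2), Unit("alu1", 1.5e-12, 0.2)]

if __name__ == "__main__":
    for label, design in (("busy", busy), ("lazy", lazy)):
        watts = dynamic_power(design, voltage=1.0, frequency=1e9)
        print(f"{label}: {watts * 1e3:.2f} mW")
```

With these assumed values the two-unit design has more total capacitance but a lower Ci x Ai sum, which is the trade the talk argues for.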


Design Objective Inversion
  Historically, hardware was expensive
    Every gate, wire, cable, unit mattered
    Squeeze maximum utilization from each
  Now, power is expensive
    On-chip devices & wires, not so much
    Should minimize Ci x Ai
  Logic should be simple, infrequently used
    Both sequential and combinational
  Lazy Logic


Talk Outline
  Motivation
  What is Lazy Logic?
  Applications of Lazy Logic
    Circuit-switched coherence
    Stall-cycle redistribution
    Dynamic scheduling
  Conclusions
  Research Group Overview


What is Lazy Logic?
  Design philosophy
  Some overall principles
    Minimize unit utilization
    Minimize unit complexity
    OK to increase number of units/wires/devices
      As long as reduced Ai (activity) compensates
      Don’t forget leakage
  Result
    Reject conventional “good ideas”
    Reduce power without loss of performance
    Sometimes improve performance
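
The "as long as reduced Ai compensates" and "don't forget leakage" principles can be read as a simple break-even check. The sketch below is one hedged way to express it: total power is modeled as dynamic power plus a fixed leakage cost per unit. The model, the function names, and every number are illustrative assumptions rather than anything taken from the talk.

```python
def total_power(n_units, cap_per_unit, activity_per_unit, leak_per_unit,
                voltage=1.0, frequency=1e9):
    """Dynamic power (n * C * V^2 * A * f) plus leakage (n * leak), in watts."""
    dynamic = n_units * cap_per_unit * voltage ** 2 * activity_per_unit * frequency
    leakage = n_units * leak_per_unit
    return dynamic + leakage


def replication_pays_off(base, replicated):
    """True if the replicated design burns less total power than the base."""
    return total_power(**replicated) < total_power(**base)


# Hypothetical designs: one complex busy unit vs. four simple, mostly idle ones.
base = dict(n_units=1, cap_per_unit=4e-12, activity_per_unit=0.6,
            leak_per_unit=0.2e-3)
replicated = dict(n_units=4, cap_per_unit=1e-12, activity_per_unit=0.15,
                  leak_per_unit=0.2e-3)
# Same replication, but in a leakier (assumed) process.
leaky = dict(n_units=4, cap_per_unit=1e-12, activity_per_unit=0.15,
             leak_per_unit=0.6e-3)

if __name__ == "__main__":
    print(replication_pays_off(base, replicated))
    print(replication_pays_off(base, leaky))
```

Under these assumptions replication wins when leakage per unit is small and loses once leakage dominates, which is why the slide flags leakage as the caveat.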


Lazy Logic Applications
  CMP interconnection networks
    Old: Packet-switched, store-and-forward
    New: Circuit-switched, reconfigurable
  Stall cycle redistribution
    Transparent pipelines want fine-grained stalls
    Redistribute coarse stalls into fine stalls
  High-performance dynamic scheduling
    Cycle time goal achieved by replicating ALUs


CMP Interconnection Networks
  Options
    Buses don’t scale
    Crossbars are too expensive
    Rings are too slow
    Packet-switched mesh
      Attractive for all the DSM reasons
        Scalable
        Low latency
        High link utilization


CMP Interconnection Networks
  But…
    Cables/traces are now on-chip wires
      Fast, cheap, plentiful
      Short: 1 cycle per hop
    Router latency adds up
      3-4 CPU cycles per hop
    Store-and-forward
      Lots of activity/power
  Is this the right answer?
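
To make the per-hop numbers on this slide concrete, here is a small back-of-the-envelope sketch in Python. It assumes a 2D mesh with dimension-order (Manhattan) routing, that every packet-switched hop pays one router traversal plus one link traversal, and that an already established circuit pays only the link delay; path setup cost and contention are ignored. The function names and the 4x4 example are illustrative assumptions.

```python
def mesh_hops(src, dst):
    """Manhattan distance between two (x, y) tiles in a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def packet_switched_latency(hops, router_cycles=3, link_cycles=1):
    """Each hop pays the router pipeline (3-4 cycles on the slide) plus the wire."""
    return hops * (router_cycles + link_cycles)


def circuit_switched_latency(hops, link_cycles=1):
    """Switches behave like wires, so each hop costs only the link delay."""
    return hops * link_cycles


if __name__ == "__main__":
    # Corner to corner in a hypothetical 4x4 mesh.
    hops = mesh_hops((0, 0), (3, 3))
    print(hops, packet_switched_latency(hops), circuit_switched_latency(hops))
```

Even in this simplified model the router pipeline, not the wire, dominates end-to-end latency, which is the observation that motivates the circuit-switched design on the following slides.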


Circuit-switched Interconnects
  Communication patterns
    Spatial locality to memory
    Pairwise communication
  Circuit-switched links
    Avoid switching/routing
    Reduce latency
    Save power?


Router Design
  Switches can be logically configured to appear as wires (no routing overhead)
  Can also act as packet-switched network
  Can switch back and forth very easily
  Detailed router design not presented here
  [Figure: router with ports labeled N, S, E, W, and P]


Dirty Miss Coverage
  [Figure: % of dirty misses (y-axis, 40% to 100%) versus number of circuit-switched connections per processor (x-axis, 1 to 15); series: SPECjbb, SPECweb, TPC-H, TPC-W]


Directory Protocol
  Initial 3-hop miss establishes CS path
  Subsequent miss requests
    Sent directly on CS path to predicted owner
    Also in parallel to home node
    Predicted owner sources data early
    Directory acks update to sharing list
  Benefits
    Reduced 3-hop latency
    Less activity, less power
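
The flow on this slide can be sketched as a toy hop counter. The Python below is only a cartoon of the idea: it keeps at most one circuit-switched path per requester, treats the remembered owner as the prediction, and charges 2 hops (requester to owner and back) when the prediction is right versus 3 hops for the conventional path through the home node. The 2-hop figure, the data structure, and the function name are assumptions made for illustration; the parallel message to the home node and the directory acknowledgment are not modeled.

```python
def handle_dirty_miss(cs_paths, requester, owner):
    """Return critical-path hops for one dirty miss and update the CS paths.

    cs_paths maps a requester id to the owner it currently has a
    circuit-switched path to (at most one path per requester here).
    """
    if cs_paths.get(requester) == owner:
        # Request travels directly on the CS path; the predicted owner
        # sources the data early. The home node is informed off the
        # critical path (not modeled).
        return 2
    # Conventional miss: requester -> home -> owner -> requester.
    # This initial 3-hop miss establishes the CS path for later requests.
    cs_paths[requester] = owner
    return 3


if __name__ == "__main__":
    paths = {}
    trace = [(0, 5), (0, 5), (0, 5), (0, 7), (0, 7)]
    print([handle_dirty_miss(paths, req, own) for req, own in trace])
```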


Circuit-switched Performance
  [Figure: normalized cycle count (y-axis, 0 to 1.2) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, Radiosity; series: Base; Fully connected, Oracle; Limit 1, Oracle; Limit 1, Region Prediction]


Link Activity
  [Figure: normalized link activity (y-axis, 0% to 100%) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, Radiosity; series: Limit 1, Oracle; Limit 1, Region Prediction]


Buffer Activity
  [Figure: normalized input buffer activity (y-axis, 0% to 100%) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, Radiosity; series: Limit 1, Oracle; Limit 1, Region Prediction]


Circuit-switched Coherence Summary
  Reconfigurable interconnect
    Circuit-switched links
    Some performance benefit
    Substantial reduction in activity
  Current status (slides are out of date)
    Router design and physical/area models
    Protocol tuning and tweaks, etc.
    Initial results in CA Letters paper


Talk Outline
  Motivation
  What is Lazy Logic?
  Applications of Lazy Logic
    Circuit-switched coherence
    Stall-cycle redistribution
    Dynamic scheduling
  Conclusions
  Research Group Overview


May 9, 2023 Eric L. Hill – Preliminary Exam 20

Pipeline Clocking Revisited
  [Figure: work units A and B moving through a pipeline; two units of work, 10 clock pulses; latches clocked to propagate data]
  Conventional pipeline clock gating
    Each valid work unit gets clocked into each latch
    This is needlessly conservative


Transparent Pipeline Gating
  [Figure: work units A and B moving through a pipeline; two units of work, 5 clock pulses]
  Transparent pipelining: novel approach to clocking [Jacobson 2004, 2005]
    Both master and slave latch can remain transparent
    Gating logic ensures no races
    Pipeline registers are clocked lazily, only when a race would occur
  Quite effective for low utilization pipelines
    Gaps between valid work units enable transparent mode


Applications
  Best suited for low utilization pipelines
    E.g. FP, media processing functional units
  High utilization pipelines see least benefit
    E.g. instruction fetch pipelines
  To benefit from transparent approach:
    Valid data items need fine-grained gaps (stalls)
    1-cycle gap provides lion’s share (50%)


Application: Front-end Pipelines
  Provide back-end with sufficient supply of instructions to find ILP
    High branch prediction accuracy
    Low instruction cache miss rates
  Little opportunity for clock gating
    Designed to feed peak demand
  Poor match for transparent pipeline gating


In-Order Execution Model
  In-order cores
    Power efficient
    Low design complexity
    Throughput oriented
  CMP systems trending towards simple cores (e.g. Sun Niagara)
  Data dependences cause fine-grained stalls at dispatch
  Can we project these back to fetch?
    Exploit fetch slack
  [Figure: pipeline timing diagram, time on the horizontal axis]


Pipeline Diagram
  [Figure: block diagram with labels Bpred, PC, bpred update, 0x0, RP, Instruction Fetch, Execution Core, clock vector, Issue Buffer]


Available Fetch Slack
  [Figure: fraction of instruction groups observed (y-axis, 0.00 to 1.00); series: 0, 1, 2, 3, 4, 5, 6, 7+]


Implementation
  Stall cycle bits embedded in BTB
    EPIC ISAs (IA64) could use stop bits
  Verify prediction by observing unperturbed groups
    Let high confidence groups periodically execute unperturbed
    Observe overall increase in execution time
  Modeled Cell PPU-like PowerPC core with aggressive clock gating
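
One way to picture the mechanism on this slide is a small table of predicted stall cycles with saturating confidence counters. The Python sketch below is a toy model, not the design evaluated in the talk: the dictionary stands in for stall bits stored in BTB entries, the training rule (increment on a matching observation, reset on a mismatch) is an assumption, and the 4-bit counter, the "greater than 4" threshold, and the 10% random check rate are borrowed from the experimental setup listed in the backup slides.

```python
import random


class StallPredictor:
    """Toy per-PC stall predictor with a saturating confidence counter."""

    CONF_MAX = 15   # 4 confidence bits
    HIGH_CONF = 4   # a prediction is used only when confidence > 4

    def __init__(self, check_rate=0.10, seed=0):
        self.check_rate = check_rate       # fraction of groups run unperturbed
        self.rng = random.Random(seed)
        self.table = {}                    # pc -> [stall_cycles, confidence]

    def train(self, pc, observed_stall):
        """Record the stall observed for an unperturbed instruction group."""
        entry = self.table.setdefault(pc, [observed_stall, 0])
        if entry[0] == observed_stall:
            entry[1] = min(entry[1] + 1, self.CONF_MAX)
        else:
            entry[0], entry[1] = observed_stall, 0

    def predict(self, pc):
        """Stall cycles to insert at fetch; 0 means fetch unperturbed."""
        entry = self.table.get(pc)
        if entry is None or entry[1] <= self.HIGH_CONF:
            return 0
        if self.rng.random() < self.check_rate:
            return 0  # periodic unperturbed run to verify the prediction
        return entry[0]


if __name__ == "__main__":
    predictor = StallPredictor()
    for _ in range(6):
        predictor.train(0x40, 2)
    print([predictor.predict(0x40) for _ in range(10)])
```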


Latch Activity Reduction
  [Figure: normalized latch activity factor (y-axis, 0 to 1.2); series: scr, scr+tcg]


FE Energy Delay Product
  [Figure: normalized front-end energy-delay product (J*s) (y-axis, 0 to 1.2); bars: base, scr, scr+tpg; components: fe_latch, bpred, icache]


Stall Cycle Redistribution Summary [ISLPED 2006]
  Transparent pipelines reduce latch activity
    Not effective in pipelines with coarse-grained stalls (e.g. fetch)
  Coarse-grained stalls can be redistributed without affecting performance (fetch slack)
  Benefits
    Equivalent performance, lower power
    Transparent fetch pipeline now attractive
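
The redistribution idea summarized above can be shown with a toy fetch schedule. In the Python sketch below, a schedule is a list of booleans (True for a cycle that fetches a valid instruction group, False for a stall cycle), and the function spreads the existing stall cycles out as single-cycle gaps between groups without changing the total cycle count. This ignores data dependences and the prediction hardware entirely; it is only meant to show what "coarse stalls into fine stalls" looks like, and the function names are illustrative assumptions.

```python
def redistribute_stalls(schedule):
    """Spread stall cycles out as 1-cycle gaps after valid groups.

    The result has the same length and the same number of valid groups
    and stall cycles as the input.
    """
    n_valid = sum(1 for slot in schedule if slot)
    n_stall = len(schedule) - n_valid
    out = []
    for _ in range(n_valid):
        out.append(True)
        if n_stall > 0:
            out.append(False)  # one fine-grained gap behind this group
            n_stall -= 1
    out.extend([False] * n_stall)  # leftover stalls stay at the end
    return out


def gapped_groups(schedule):
    """Count valid groups that are immediately followed by a stall cycle."""
    return sum(1 for cur, nxt in zip(schedule, schedule[1:]) if cur and not nxt)


if __name__ == "__main__":
    coarse = [True] * 4 + [False] * 4   # four groups, then one long stall
    fine = redistribute_stalls(coarse)
    print(fine, gapped_groups(coarse), gapped_groups(fine))
```

In the coarse schedule only one group has a gap behind it; after redistribution every group does, which is the condition the transparent pipeline needs in order to skip latch clocking.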


Talk Outline
  Motivation
  What is Lazy Logic?
  Applications of Lazy Logic
    Circuit-switched coherence
    Stall-cycle redistribution
    Dynamic scheduling
  Conclusions
  Research Group Overview


A Brief Scheduler Overview
  [Figure: animated pipeline diagrams
    Atomic Sched/Exe: Fetch, Decode, Sched/Exe, Writeback, Commit
    Wakeup/select: Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback, Commit
    Spec wakeup/select: Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback/Recover, Commit
    Annotations: speculatively issued instructions; re-schedule when latency mispredicted; latency changed; invalid input value
    Labels: data capture / non-data capture scheduler; speculative scheduling]
  Data capture scheduler desirable for many reasons
    Cycle time is not competitive because of data path delay
  Current machines use speculative scheduling
    Misscheduled/replayed instructions burn power
    Depending on recovery policy, up to 17% of issued instructions need to replay


Slicing the Core
  Bitslice the core: narrow (16b) and wide (64b)
  Narrow core can be full data capture
    Still makes aggressive cycle time (with lazy logic)
    Completely nonspeculative, virtually no replays
    Further power benefits (not in this talk)
  [Figure: Front-End, Back-End, OoO Core]


Dynamic Scheduling with Partial Operand Values
  Narrow core
    Computes partial operand
    Determines load latency
    Avoids misscheduling
  Wide core
    Computes the rest of the operand (if needed)
  [Figure: pipeline stages Fetch, Decode, Sched & Nrw Exe, Dispatch, RF, Exe, Writeback/Recover, Commit; labels: wakeup/select, the rest of the data]
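
The reason a narrow slice can compute a usable partial operand is that carries in an addition only propagate upward, so the low 16 bits of a sum depend only on the low 16 bits of the inputs. The Python sketch below checks that property and, as an assumed example of how a partial address might help determine load latency, shows that a cache set index taken from low-order address bits is the same whether it is derived from the narrow or the full result. The cache geometry (64-byte lines, 128 sets) and the function names are illustrative assumptions; the slides do not say how the narrow result is used beyond determining load latency.

```python
NARROW_BITS = 16
NARROW_MASK = (1 << NARROW_BITS) - 1
WIDE_MASK = (1 << 64) - 1


def narrow_add(a, b):
    """16-bit add as performed by the narrow core: low bits in, low bits out."""
    return ((a & NARROW_MASK) + (b & NARROW_MASK)) & NARROW_MASK


def wide_add(a, b):
    """Full 64-bit add as performed by the wide core."""
    return (a + b) & WIDE_MASK


def cache_set_index(address, line_bytes=64, num_sets=128):
    """Set index from address bits 6..12 (hypothetical cache geometry)."""
    return (address // line_bytes) % num_sets


if __name__ == "__main__":
    base, offset = 0x7FFFFFF01234FFF8, 0x18   # low-order add carries out of bit 15
    partial = narrow_add(base, offset)
    full = wide_add(base, offset)
    print(hex(partial), hex(full & NARROW_MASK))
    print(cache_set_index(partial), cache_set_index(full))
```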


Scheduler w/ Narrow Data-Path
  Non-data capture scheduler
    Select – mux – tag bcast & compare – ready wr
  Naïve narrow data capture scheduler
    Select – mux – tag bcast & compare – ready wr
    Select – mux – narrow ALU – data bcast – data wr
    Increased cycle time
  [Figure (a): scheduler entries with ROB ID, Data1, Tag1, Data2, Tag2 fields, tag comparators, select logic, and Dest; Int ALU, LSQ, Cache, Adder; output to wide data path; paths marked (1) and (2)]


Scheduler w/ Embedded ALUs
  With embedded ALUs
    Select – mux – tag bcast & compare – ready wr
    Max(select, data bcast – mux – narrow ALU) – mux – latch setup
  Lazy Logic
    Replicated ALUs
    Low utilization
    Off critical delay path
  [Figure (b): scheduler entries with ROB ID, Data1, Tag1, R, Data2, Tag2, R fields, tag comparators, embedded Int ALUs, muxes (M), latch, select logic, and Dest; LSQ, Cache; output to wide data path; paths marked (1) and (2)]
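
The path expressions on these two scheduler slides can be compared numerically. The Python sketch below encodes each path as a sum of stage delays and takes the scheduler cycle time as the slowest path. All delay values are invented for illustration (they are not the synthesis results reported in the talk), and the dictionary keys are just shorthand for the stage names used on the slides.

```python
def non_data_capture(d):
    """Select - mux - tag bcast & compare - ready wr."""
    return d["select"] + d["mux"] + d["tag_bcast_compare"] + d["ready_wr"]


def naive_narrow(d):
    """Adds a serial data path: select - mux - narrow ALU - data bcast - data wr."""
    data_path = (d["select"] + d["mux"] + d["narrow_alu"]
                 + d["data_bcast"] + d["data_wr"])
    return max(non_data_capture(d), data_path)


def embedded_alus(d):
    """Replicated ALUs compute in parallel with select:
    max(select, data bcast - mux - narrow ALU) - mux - latch setup."""
    data_path = (max(d["select"], d["data_bcast"] + d["mux"] + d["narrow_alu"])
                 + d["mux"] + d["latch_setup"])
    return max(non_data_capture(d), data_path)


# Hypothetical stage delays in nanoseconds.
delays = {
    "select": 0.5, "mux": 0.1, "tag_bcast_compare": 0.4, "ready_wr": 0.2,
    "narrow_alu": 0.3, "data_bcast": 0.4, "data_wr": 0.2, "latch_setup": 0.1,
}

if __name__ == "__main__":
    for name, fn in (("non-data capture", non_data_capture),
                     ("naive narrow", naive_narrow),
                     ("embedded ALUs", embedded_alus)):
        print(f"{name}: {fn(delays):.2f} ns")
```

With these assumed delays the naive design lengthens the cycle, while the embedded-ALU design keeps the data path shorter than the tag path, so its cycle time matches the non-data capture scheduler.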


Cycle Time, Area, Energy
  32 entries, implemented using Verilog
  Synthesized using Synopsys Design Compiler and LSI Logic’s gflxp 0.11um

  Scheduler                      Cycle Time (ns)
  Full-Data Capture              2.04
  Non-Data Capture               1.28
  Narrow-Data Capture            1.71
  Narrow-Data Capture w/ ALUs    1.28

  Area (mm2), as listed: 1.43, 1.53, 1.49, 1.98
  Energy (nJ), as listed: 1.54, 1.48, 1.46, 1.40


Dynamic Scheduling Summary
  Benefits: [JILP 2007]
    Save 25-30% of total OoO window energy => 12-18% total dynamic chip power
    Reduce misspeculated loads by 75%-80%
    Slightly improved IPC
    Comparable cycle time
  Enabled by:
    Lazy narrow ALUs
    ALUs are cheap, so compute in parallel with scheduling select logic


Talk Outline
  Motivation
  What is Lazy Logic?
  Applications of Lazy Logic
    Circuit-switched coherence
    Stall-cycle redistribution
    Dynamic scheduling
  Conclusions
  Research Group Overview


Conclusions
  Lazy Logic
    Promising new design philosophy
  Some overall principles
    Minimize unit utilization
    Minimize unit complexity
    OK to increase number of units/wires/devices
  Initial results
    Circuit-switched CMP interconnects
    Stall cycle redistribution
    Dynamic scheduling


Who Are We?
  Faculty: Mikko Lipasti
  Current Ph.D. students:
    Profligate execution: Gordie Bell (joining IBM in 2006)
    Coarse-grained coherence: Jason Cantin (joining IBM in 2006)
    Lazy Logic
      Circuit-switched coherence: Natalie Enright
      Stall cycle redistribution: Eric Hill
      Dynamic scheduling: Erika Gunadi
    Dynamic code optimization: Lixin Su
    SMT/CMP scheduling/resource allocation: Dana Vantrease
  Pharmed out:
    IBM: Trey Cain, Brian Mestan
    AMD: Kevin Lepak
    Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
    Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka


Research Group Overview
  Faculty: Mikko Lipasti, since 1999
  Current MS/PhD students
    Gordie Bell, Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
  Graduates, current employment:
    AMD: Kevin Lepak
    IBM: Trey Cain, Jason Cantin, Brian Mestan
    Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
    Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka


Current Focus Areas
  Multiprocessors
    Coherence protocol optimization
    Interconnection network design
    Fairness issues in hierarchical systems
  Microprocessor design
    Complexity-effective microarchitecture
    Scalable dynamic scheduling hardware
    Speculation reduction for power savings
    Transparent clock gating
    Domain-specific ISA extensions
  Software
    Java Virtual Machine run-time optimization
    Workload development and characterization


Funding
  IBM
    Faculty Partnership Awards
    Shared University Research equipment
  Intel
    Research council support
    Equipment donations
  National Science Foundation
    CSA, ITR, NGS, CPA
    Career Award
  Schneider ECE Faculty Fellowship
  UW Graduate School


Questions?
  http://www.ece.wisc.edu/~pharm



Backup slides


Technology Parameters
  65 nm technology generation
  16 tiled processors
    Approximately 4 mm x 4 mm
  Signal can travel approximately 4 mm/cycle
  Circuit-switched interconnect consists of 5 mm unidirectional links


Broadcast Protocol
  Broadcast to all nodes
  Establish circuit-switched path with owner of data
  Future broadcasts will use circuit-switched path to reduce power
    Predict when CS path will suffice
  Use LRU information for paths to tear down old paths when resources need to be claimed by new path
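
The LRU teardown policy in the last bullet can be sketched with an ordered dictionary. The Python below is a minimal illustration that assumes each processor can hold a fixed number of circuit-switched paths and tears down the least recently used one when a new destination needs a path; the class name, the capacity parameter, and the idea that capacity is the only contended resource are assumptions, since the slides do not describe the bookkeeping.

```python
from collections import OrderedDict


class CircuitPathTable:
    """Tracks a processor's circuit-switched paths in LRU order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.paths = OrderedDict()   # destination -> True, oldest first
        self.last_torn_down = None

    def use(self, dest):
        """Return True if a CS path to dest already exists.

        On a miss, establish a new path, tearing down the least recently
        used path first if the table is full.
        """
        if dest in self.paths:
            self.paths.move_to_end(dest)   # mark as most recently used
            return True
        if len(self.paths) >= self.capacity:
            self.last_torn_down, _ = self.paths.popitem(last=False)
        self.paths[dest] = True
        return False


if __name__ == "__main__":
    table = CircuitPathTable(capacity=2)
    print([table.use(d) for d in (1, 2, 1, 3, 2)])
    print(list(table.paths), table.last_torn_down)
```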


Switch Design from paper
  [Figure: switch with N, S, E, W ports, a processor port, a buffer, and configuration memories (CM = Configuration Memory)]


Race example from paper (1 of 2)
  [Figure: message diagram among P0, P1, P2, and Dir3; messages: 1a. CS Req; 1b. CS Notify; 2. Upgrade; 3.; 4. CS Resp (S); 5. Invalidate; 6. Inval Resp; 7. Downgrade]


Race example (2 of 2)
  [Figure: message diagram among P0, P1, P2, and Dir3; messages: 1a. CS Req; 1b. CS Notify; 2. Upgrade; 3.; 4a. CS Resp (S); 4b. Nack; 5. Invalidate; 6. Inval Resp]


LRU pairs for Dirty Misses
  23 or fewer pairs capture >80% of dirty misses for 3 out of 4 benchmarks (16p)
  [Figure: y-axis 0% to 100%; x-axis 1 to 235; series: SPECjbb, SPECweb, TPC-H, TPC-W]


Local LRU pairs
  2 circuit-switched paths per processor cover between 55% and 85% of dirty misses
  [Figure: Miss Rate (Local LRU); y-axis 0% to 70%; x-axis 1 to 15; series: SPECjbb, SPECweb, TPC-H, TPC-W]


Concurrent Links
  5 concurrent links cover 90% of necessary pairs
  Captures 50%-77% of overall opportunity
  [Figure: 2 Circuit-Switched Paths per Processor; y-axis 0% to 110%; x-axis 1 to 9; series: SPECjbb, SPECweb, TPC-H, TPC-W]


Experimental Setup
PHARMsim
  Activity-based power model based on Wattch added
  In-order issue, 4/2/2 fetch/issue/commit (based on Cell PPU)
  10-stage transparent front-end pipeline (conventional latches at endpoints)
  Gshare (8k-entry) branch predictor; 1024-set, 4-way BTB
  32KB I/D cache (1/4), 512KB L2 cache (12)
  4 confidence bits / >4 high-confidence threshold / predictions checked randomly 10% of the time
Benchmarks simulated for 250M instructions
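The confidence settings listed above (4 confidence bits, high confidence when the value exceeds 4) can be illustrated with a saturating counter. This is only a sketch: the slide gives the width and threshold but not the update rule, so the count-up-on-correct, reset-on-wrong policy and all names below are assumptions.

```python
CONF_BITS = 4                      # from the slide: 4 confidence bits
CONF_MAX = (1 << CONF_BITS) - 1    # 15, the saturation point
HIGH_CONF_THRESHOLD = 4            # from the slide: >4 means high confidence

def update_confidence(counter, prediction_correct):
    """Assumed policy: count up on a correct prediction, reset on a wrong one."""
    if prediction_correct:
        return min(counter + 1, CONF_MAX)
    return 0

def is_high_confidence(counter):
    return counter > HIGH_CONF_THRESHOLD

# Six correct predictions, one wrong, then two correct
counter = 0
history = []
for outcome in [True] * 6 + [False] + [True] * 2:
    counter = update_confidence(counter, outcome)
    history.append((counter, is_high_confidence(counter)))
```

Under this assumed policy the counter first reports high confidence after the fifth consecutive correct prediction and drops back to low confidence on the first wrong one.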


Branch Predictor Activity

[Figure: normalized bpred activity (y-axis, 0 to 1.2); legend: scr_extra, normal]


Related Work
Removing Wrong Path Instructions [Manne 1998]
Flow Based Throttling Techniques [Baniasadi 2001, Karkhanis 2002]


Future Work
Explore performance of other fetch gating schemes with transparent pipelining
Explore dependence-driven gating on Itanium machine model
Explore latch soft error vulnerability (TVF) when lazy clocking is used
Explore change in AVF when fetch gating is used
  Less ACE state in-flight


Scheduling Replay Example
[Diagram: dependence graph of LD, ADD, OR, AND, BR, with a cache miss on LD]
Squashing/non-selective replay (Alpha 21264)
  Replays all dependent and independent instructions issued under load shadow
  Analogous to squashing recovery in branch misprediction
  Simple but high performance penalty
  Independent instructions are unnecessarily replayed
[Diagram: pipeline stages Sched, Disp, RF, Exe, Retire, showing LD, ADD, OR, AND, BR in flight until the miss is resolved; caption: Invalidate & replay ALL instructions in the load shadow]
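The difference between squashing replay and a selective scheme can be sketched in a few lines. This is an illustration only: the dependence graph below is assumed (the slide's diagram did not survive extraction), and the function and variable names are hypothetical.

```python
def instructions_to_replay(issued, deps, load, selective):
    """Pick which instructions issued in the load shadow must replay.

    issued:    instruction names issued under the load shadow, in issue order
    deps:      maps an instruction to the producers it reads from
    load:      the load whose latency was mispredicted
    selective: False models squashing (replay everything in the shadow),
               True replays only transitive dependents of the load
    """
    if not selective:
        return list(issued)
    tainted = {load}
    replay = []
    for inst in issued:
        # An instruction is tainted if any producer is the load or a
        # previously tainted instruction (issue order respects dependences).
        if any(src in tainted for src in deps.get(inst, ())):
            tainted.add(inst)
            replay.append(inst)
    return replay

# Assumed graph: ADD and OR depend on LD; AND and BR are independent of it.
deps = {"ADD": ["LD"], "OR": ["ADD"], "AND": [], "BR": ["AND"]}
shadow = ["ADD", "OR", "AND", "BR"]
squash = instructions_to_replay(shadow, deps, "LD", selective=False)
select = instructions_to_replay(shadow, deps, "LD", selective=True)
```

With this assumed graph, squashing replays all four shadow instructions while the selective variant replays only ADD and OR, which is the unnecessary work the slide points to.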


Narrow Core
Narrow Scheduler
  Captures partial operands
  Determines load latency (hit/miss)
Narrow Data-Path
  Narrow ALU: provides partial data to consumers
  Narrow LSQ and partial tag cache
  Finds only possible load data source
Uses least significant 16 bits
  Large enough to help predict load latency
  Small enough to achieve fast cycle time


L/S Disambiguation & Partial Tag Matching
Exploits operand significance [Brooks et al. 1999, Canal et al. 2000]
Load/store disambiguation
  10 bits finds 99% of matching stores
Partial tag match
  16 bits for 97% (mcf) to 99% (bzip2) accuracy
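Partial load/store disambiguation can be illustrated by comparing only the low-order address bits. The sketch below assumes exact address equality is the full-match condition and ignores access size; the function names and example addresses are hypothetical, and the 99% figure above is the slide's measurement, not something this code reproduces.

```python
def partial_match(load_addr, store_addr, bits):
    """True when the low `bits` bits of the two addresses agree."""
    mask = (1 << bits) - 1
    return (load_addr & mask) == (store_addr & mask)

def candidate_stores(load_addr, store_addrs, bits=10):
    """Older stores a load might forward from, judged on partial addresses.

    A full address match always implies a partial match, so no true match
    is missed; the partial comparison can only add false positives.
    """
    return [s for s in store_addrs if partial_match(load_addr, s, bits)]

stores = [0x1000_0040, 0x2000_0440, 0x1000_0048]
narrow = candidate_stores(0x1000_0040, stores, bits=10)
wider = candidate_stores(0x1000_0040, stores, bits=16)
```

Here the 10-bit comparison returns the true match plus one aliasing store, and widening to 16 bits removes the alias, which is the trade-off between width and accuracy the slide quantifies.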


Outline
Motivation
Dynamic Scheduling with Narrow Values
Scheduler with Narrow Data-Path
Pipelined Data Cache
Pipeline Integration
Implementation and Experiments
Conclusions and Future Work


Dynamic Scheduling with Partial Operands
Stores a subset of operands in scheduler
Exploits partial operand knowledge
  Load-store disambiguation
  Partial tag match
[Diagram: Front-End, OoO Core, Back-End]


Pipelined Cache w/ Early Bits

[Diagram: pipelined cache with a Narrow Bank and a Wide Bank. Labels: Tag Array, Data Array, Tag Subarray, Data Sub-array, Row Decoder, Subarray Decoder, Comparator, Muxes, Latch, Partial Bits, Full Bits, To Narrow Data Path, To Wide Data Path; pipeline stages Disp1, Disp2, Agen]

Narrow bank for partial access, wide bank for the rest

Uses partial tag match in narrow bank
Saves power in wide bank
Hides wide cache bank latency by starting early


Narrow LSQ
Stores partial addresses of stores
Used for partial load-store disambiguation
Accessed in parallel with narrow bank
Saves power in the wide LSQ
Cheaper direct-mapped access rather than fully associative search


Pipeline Integration

Simple ALU insts link dependences in back-to-back cycle
Complex ALU insts link dependences non-speculatively
Load insts need another cycle to schedule dependences
[Diagram: pipeline stages Fetch, Decode, Rename, Queue, Sched, Disp, Disp, then execution paths labeled Partial Load, Int ALU, Mult/Div (three stages), and Agen, Cache, Cache, followed by WB and Commit]


Pipelined Data Cache & LSQ
Modeled using modified CACTI 3.0
Configuration: 16KB, 4-way, 64B blocks

                                        Conventional Data Cache   Pipelined Data Cache
Access Latency, Narrow Bank             N/A                       0.80 ns
Access Latency, Wide Bank               1.24 ns                   0.60 ns
Total Energy Consumption (Cache + LSQ)  (0.62 + 0.11) nJ          (0.37 + 0.08) nJ
Total Area                              (1.21 + 0.40) mm²         (1.50 + 0.40) mm²
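The energy and area figures on this slide can be turned into relative changes with a short calculation. This sketch assumes the first value in each flattened row belongs to the conventional cache and the second to the pipelined cache, and that the area pairs are (cache + LSQ) like the energy pairs; the variable and function names are hypothetical.

```python
# Values read from the slide; each entry is (cache, LSQ).
# Column assignment (conventional first, pipelined second) is an assumption.
energy_nj = {"conventional": (0.62, 0.11), "pipelined": (0.37, 0.08)}
area_mm2 = {"conventional": (1.21, 0.40), "pipelined": (1.50, 0.40)}

def percent_change(old, new):
    """Signed relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

energy_change = percent_change(sum(energy_nj["conventional"]),
                               sum(energy_nj["pipelined"]))
area_change = percent_change(sum(area_mm2["conventional"]),
                             sum(area_mm2["pipelined"]))
print(f"energy: {energy_change:+.1f}%  area: {area_change:+.1f}%")
```

Under that reading, the pipelined design uses roughly 38% less energy per access (0.45 nJ versus 0.73 nJ) for about 18% more area (1.90 mm² versus 1.61 mm²).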


Experiments
Simplescalar / Alpha 3.0 tool set
Machine Model
  64-entry ROB
  4-wide fetch/issue/commit
  16-entry SQ, 16-entry LQ
  32-entry scheduler
  13-stage pipeline
  64KB I-Cache (2-cyc), 16KB D-Cache (2-cyc)
  2-cycle store to load forwarding


Energy Dissipation

On average, narrow captured scheduling consumes 25% less energy than non-data-captured scheduling

[Figure: Total Energy (y-axis, 0 to 1) by benchmark (bzip2, mcf, parser, vpr, avg); series: narrow_refetch, narrow_squash, squash, parallel_selective]


Mispredicted Load Instructions

Reduce misspeculated loads by 75%-80%

[Figure: Number of Mis-scheduled Load Instructions in millions (y-axis, 0 to 14) by benchmark (bzip2, mcf, parser, vpr); categories: miss-forward, store no-data, misalign store, cache alias, cache miss]


Optimized model
Using refetch replay scheme to reduce replay complexity
Clear the scheduler entries once instructions are issued
  Decreases scheduler occupancy
  Instructions enter OoO window sooner
Reduce L1 cache latency from 2-cycle to 1-cycle


Optimized Model Performance

Small variations
Always performs as well or better

[Figure: Speed Up (y-axis, 0.5 to 2) by benchmark (bzip2, mcf, parser, vpr, avg); series: improved narrow_refetch, narrow_refetch, narrow_squash, squash, selective]


Future Work
Implement a more accurate dynamic power model
Study custom design vs. synthesized model
Study opportunities for leakage power reduction


Delay Model

Processor 0 can reach Processor 15 in 9 fewer cycles

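Circuit Switched Interconnect (cycles from Processor 0, marked --)

--  2  3  4
 2  3  4  6
 3  4  6  7
 4  6  7  9

Baseline Store and Forward Mesh (cycles from Processor 0, marked --)

--  3  6  9
 3  6  9 12
 6  9 12 15
 9 12 15 18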
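The two delay grids can be compared with a small calculation. This sketch makes two assumptions: the 4x4 layout places Processor 0 at one corner and Processor 15 at the opposite corner, and the baseline figures (3, 6, 9, up to 18) follow a fixed 3 cycles per hop of Manhattan distance, which is inferred from the numbers rather than stated on the slide. All names are hypothetical.

```python
# Circuit-switched delays in cycles from Processor 0, as read from the slide.
# Placing Processor 0 at [0][0] and Processor 15 at [3][3] is an assumption.
CIRCUIT_SWITCHED = [
    [0, 2, 3, 4],
    [2, 3, 4, 6],
    [3, 4, 6, 7],
    [4, 6, 7, 9],
]
CYCLES_PER_HOP = 3  # inferred from the baseline figures, not stated

def baseline_delay(row, col):
    """Store-and-forward mesh: fixed cost per hop of Manhattan distance."""
    return CYCLES_PER_HOP * (row + col)

# Cycles saved by the circuit-switched interconnect at each node
savings = [[baseline_delay(r, c) - CIRCUIT_SWITCHED[r][c] for c in range(4)]
           for r in range(4)]
```

The far-corner entry `savings[3][3]` is 18 - 9 = 9 cycles, matching the slide's statement about Processor 0 reaching Processor 15.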


Pipeline Unrolling