Reducing Delay and Power Consumption of the Wakeup Logic through Instruction Packing and Tag Memoization

Joseph Sharkey, Dmitry Ponomarev, Kanad Ghose, Oguz Ergin*
State University of New York at Binghamton
* Currently with Intel Labs Barcelona, Spain

Presented by: Joseph Sharkey

December 5th, 2004
Presented at the Fourth Workshop on Power-Aware Computer Systems (PACS’04)
http://www.cs.binghamton.edu/~lowpower


Introduction

• Modern superscalar processors use dynamic instruction scheduling

• Dynamic schedulers are implemented in the form of issue queues

• Destination tags are broadcast every cycle and associatively matched against the locally stored source tags
  – Large access delays
  – Significant power consumption

Related Work

• Partition queue into segments
  – Power off unused segments

• Energy efficient comparators
  – Dissipate energy predominately on a tag match

• Reducing the number of comparators used

Instruction Packing

• The idea: pack multiple instructions into a single issue queue entry
  – Significantly reduces the amount of CAM logic and the length of the tag busses

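[Figure: a traditional issue queue holding one instruction per entry (I1, I2, …, In) next to an issue queue with instruction packing, where a single entry holds two instructions (I1 and I2); both are attached to the tag broadcast lines.]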

Instruction Packing

• Motivation:

– Not all instructions require comparators for two source operands

• Only 17% require both comparators!

– Traditional hardware designed for “Worst Case Scenario” – not complexity-effective

Instruction Packing

[Figure: stacked bar chart (0% to 100%) showing, per benchmark, the share of instructions with 0, 1, and 2 non-ready operands. Benchmarks: gzip, vpr, gcc, mcf, parser, vortex, bzip2, twolf, wupwise, swim, mgrid, applu, mesa, art, equake, and Average.]

Traditional Issue Queue Entry

• Entry Allocated bit (A)
• Payload Area (opcode, FU type, destination register tag, literals)
• Tag, comparator, and valid bit for each source (Tag CAM 1, Tag CAM 2, V1, V2)
• Ready Bit (R)

Instruction Packing: Issue Queue Entry Format

• Left-half allocate bit (AL)
• Source tag and comparator (Tag CAM Left)
• Source valid left bit (SVL)
• Left Payload Area

Instruction Packing: Issue Queue Entry Format

• Entry allocated bit (A)
• Mode bit (Mode)
  – 0: Multiple instructions in the entry
  – 1: A single instruction in the entry
• Ready bit (R)
  – Used only if the Mode bit is 1
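The fields listed on the two format slides can be collected into one record. The following is only an illustrative sketch, not the hardware layout: the class and attribute names are invented here, the right half is assumed to mirror the left half (AR, SVR, Tag CAM Right, Right Payload Area), and the meaning given to the A bit (one instruction owns the whole entry) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HalfEntry:
    """One half (left or right) of a packed issue queue entry."""
    allocated: bool = False         # AL / AR
    src_tag: Optional[int] = None   # tag held in Tag CAM Left / Right
    src_valid: bool = False         # SVL / SVR: True once the source is ready
    payload: Optional[dict] = None  # opcode, destination tag, literals, ...


@dataclass
class PackedEntry:
    """A whole entry: two halves plus the shared A, Mode, and R bits."""
    allocated: bool = False  # A (assumed: one instruction owns both halves)
    mode: int = 0            # 0: multiple instructions, 1: a single instruction
    ready: bool = False      # R: meaningful only when mode == 1
    left: HalfEntry = field(default_factory=HalfEntry)
    right: HalfEntry = field(default_factory=HalfEntry)

    def free_halves(self) -> int:
        """Number of halves still available to one-comparator instructions."""
        if self.allocated:
            return 0
        return (not self.left.allocated) + (not self.right.allocated)
```

Each half carries a single comparator, which is why an entry can serve either two instructions with at most one pending source each or one instruction with two pending sources.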

Instruction Packing

[Figure: layout of a packed issue queue entry with the fields A, MODE, R, AL, SVL, Tag CAM Left, Left Payload Area, AR, SVR, Tag CAM Right, and Right Payload Area, plus a MUX switch.]

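Instruction Packing: Allocating an Instruction to a Half Entry

[Figure: the instruction ADD P71 P23 P49 (P71 = P23 + P49) passes through the MUX switch into a half entry beside the Right Payload Area. Values shown in the entry: 23, ADD, 71, 49; bit fields shown: 101 0.]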

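Instruction Packing: Allocating an Instruction to a Full Entry

[Figure: the instruction SUB P72 P48 P52 (P72 = P48 - P52) passes through the MUX switch into a full entry. Values shown in the entry: 48, SUB, 72, and 52 in the Right Payload Area; bit fields shown: 11 00 0.]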

Entry Allocation

• Search the AL, AR, and A bits in parallel
• If the instruction contains at most one non-ready source:
  – Allocate a “half” entry
• Otherwise:
  – Allocate a “full” entry
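The allocation rule above can be sketched in software as follows. This is a simplified model under stated assumptions: the function names and the dict keys are hypothetical, the hardware searches the bits in parallel whereas the loop below simply takes the first fit, and a full allocation is assumed to set only the A bit.

```python
def entry_kind(non_ready_sources: int) -> str:
    """Return the kind of entry an instruction needs when it is allocated."""
    return "half" if non_ready_sources <= 1 else "full"


def allocate(queue: list, non_ready_sources: int):
    """Claim space in `queue`; return (index, slot) or None if nothing fits.

    Each element of `queue` is a dict holding the bits "A", "AL", and "AR".
    """
    kind = entry_kind(non_ready_sources)
    for index, entry in enumerate(queue):
        if entry["A"]:
            continue  # already held by a single two-source instruction
        if kind == "full" and not entry["AL"] and not entry["AR"]:
            entry["A"] = True
            return index, "full"
        if kind == "half":
            for bit, slot in (("AL", "left"), ("AR", "right")):
                if not entry[bit]:
                    entry[bit] = True
                    return index, slot
    return None
```

For instance, ADD P71 P23 P49 with only P49 outstanding has one non-ready source and takes a half entry, while an instruction waiting on both sources needs an entry whose two halves are free.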

Instruction Wakeup

• Same as traditional design
  – Comparators set the corresponding valid bits on a match.
• For a “full” entry:
  – AND SVL and SVR bits to set the R bit
• For a “half” entry:
  – Simply set the associated SVL/SVR bit
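A behavioural sketch of this wakeup step, assuming a hypothetical dict representation of one entry (keys "mode", "tag_left", "tag_right", "SVL", "SVR", "R"); which source tag of a full-entry instruction sits in the left or right CAM is also an assumption.

```python
def wakeup(entry: dict, broadcast_tag: int) -> None:
    """Apply one destination-tag broadcast to a packed entry, in place."""
    for tag_key, valid_key in (("tag_left", "SVL"), ("tag_right", "SVR")):
        if entry[tag_key] == broadcast_tag:  # comparator match
            entry[valid_key] = True
    if entry["mode"] == 1:  # a single instruction occupies the whole entry
        entry["R"] = entry["SVL"] and entry["SVR"]


# Full entry waiting on P48 and P52, as in the SUB example.
sub = {"mode": 1, "tag_left": 48, "tag_right": 52,
       "SVL": False, "SVR": False, "R": False}
wakeup(sub, 48)  # one source ready, R stays clear
wakeup(sub, 52)  # both sources ready, R is set
```

In a half entry the R bit is not consulted, so setting SVL or SVR is all that wakeup has to do.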

Instruction Selection

• “Half” entry
  – Each half has its own select line, driven by the SVL and SVR bits, respectively.
• “Full” entry
  – “R” bit drives the request on the “right-half” request line.
  – “Left-half” request line is gated off
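The request-line behaviour can be written as a small truth function. This is a sketch only: the dict keys are hypothetical, and gating each request with the corresponding allocate bit (A, AL, AR) is an assumption that the slides do not spell out.

```python
def request_lines(entry: dict) -> tuple:
    """Return (left_request, right_request) for one packed entry."""
    if entry["mode"] == 1:
        # Full entry: R drives the right-half line, the left-half line is gated off.
        return False, bool(entry["A"] and entry["R"])
    # Half entries: each half requests selection on its own line.
    left = entry["AL"] and entry["SVL"]
    right = entry["AR"] and entry["SVR"]
    return bool(left), bool(right)
```

Routing a full entry's request through the right-half line lets the same select logic serve both kinds of entry.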

Instruction Packing: Instruction Issue

[Figure: two issue queue entries issuing through MUX switches to the register file. A half entry (values 23, ADD, 71, 49) issues ADD P71 P23 P49; a full entry (values 48, SUB, 72, 52) issues SUB P72 P48 P52.]

Benefits of Instruction Packing

• Area reduction
  – Significant reduction in the amount of CAM logic
  – Roughly the same amount of RAM (rearranged)
• Delay reduction
  – Due to shorter tag broadcast lines and bit lines
• Power reduction
  – Again due to shorter tag broadcast lines and bit lines
  – Fewer comparators
• Minimal IPC degradation
  – 83% of instructions only require a half-entry

Tag Memoization

• Motivation:
  – Higher order bits of consecutive tag broadcasts are likely to be the same
    • 35 → 0100011b
    • 32 → 0100000b
    • 41 → 0101001b
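The three example tags can be run through a small model of one tag bus. This is a minimal sketch of the idea rather than the circuit: the split into 4 upper and 3 lower bits is an assumption (the slides do not state where the tag is divided), the names are invented here, and the first broadcast is assumed to find nothing memoized.

```python
TAG_BITS = 7    # width of the tags in the example above
LOWER_BITS = 3  # assumed split point; not stated in the slides


def broadcast(tags):
    """Yield (tag, drive_upper) for consecutive broadcasts on one tag bus."""
    last_upper = None  # nothing memoized before the first broadcast
    for tag in tags:
        upper = tag >> LOWER_BITS
        drive_upper = upper != last_upper  # drive upper bits only when they change
        last_upper = upper
        yield tag, drive_upper


def bits_driven(tags):
    """Total number of tag bits driven for a sequence of broadcasts."""
    upper_bits = TAG_BITS - LOWER_BITS
    return sum(LOWER_BITS + (upper_bits if drive else 0)
               for _, drive in broadcast(tags))
```

With this split, 35 and 32 share the upper bits 0100, so the second broadcast drives only its lower bits; 41 has upper bits 0101 and drives all seven again.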

Tag Memoization

[Figure, shown as six animation frames: the tag memoization circuit, with elements labeled U, L, D, S, MUX, drive_upper, clk, and match. Two frames are annotated “match = 0” and two “match = 1”.]


Tag Memoization

• Targets width of tag broadcasts
  – Nicely complements Instruction Packing, which reduces the length of tag busses
• NO IMPACT ON IPC
  – DOES NOT interrupt, hinder, or change the order of tag broadcasts

Instruction Packing: Results

[Figure: bar chart (axis 0.000 to 1.600) for IntAvg, FPAvg, and Average, with series 32IQ and 16IQ_PACK.]

• Less than 0.5% IPC degradation on average when packing a 32-entry queue into 16 entries
  – Worst case: twolf, 4% IPC loss
  – 12 benchmarks have less than 0.1% IPC loss

Instruction Packing: Area and Delay Reduction

• 26.7% reduction in issue queue area
• 21.6% reduction in issue queue delay

[Figure: CMOS layouts of a CAM cell and an SRAM bitcell]

Instruction Packing: Results

• 38% reduction in wakeup power

Tag Memoization: Results

Tag Memoization: Results Summary

• 12.9% power savings
• 16.4% power savings with intelligent bus arbitration
• Only 4% increase in delay
  – Smaller comparators = faster
  – Added delay of a NAND gate and a pass transistor

Combining Tag Memoization & Instruction Packing

• Combined 44.7% power savings
• 16% reduction in delay
• 0.5% performance degradation

[Figure: per-benchmark bar chart (axis 0.00% to 50.00%) with series Tag Memoization, Instruction Packing, and Memo + Pack. Benchmarks: gzip, vpr, gcc, mcf, parser, vortex, bzip2, twolf, wupwise, swim, mgrid, applu, mesa, art, equake, IntAvg, FPAvg, and Average.]

Conclusions

• We proposed two complementary techniques for reducing the energy dissipation of the instruction wakeup logic
• Instruction Packing reduces the length of the tag busses by sharing one issue queue entry between two instructions
• Tag Memoization avoids broadcasting the upper-order tag bits if they match those most recently driven on the same bus
• Major results
  – Wakeup energy reduction: 44.7%
  – Wakeup delay reduction: 16%
  – IPC degradation: 0.5%

Questions/Comments?

http://www.cs.binghamton.edu/~lowpower

Joseph Sharkey

Tag Memoization: Delay Analysis

[Figure: the tag memoization circuit (U, L, D, S, MUX, drive_upper, clk, match) annotated with delays: 55ps; 18ps for the turned-on transmission gate within the MUX; 50ps LESS than a single long comparator.]

Tag Memoization: Delay Analysis of the Wakeup Logic

• Total delay increase: +23ps
• Base case delay: 569ps
• Yields 4% increase in delay
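As a check on the arithmetic, the three figures annotated on the preceding delay-analysis slide are consistent with the quoted total if the 55ps and 18ps items are read as added delay and the 50ps item as the saving from the shorter comparator. That reading is an assumption; the transcript does not label the 55ps item.

```latex
\[
\Delta t = 55\,\text{ps} + 18\,\text{ps} - 50\,\text{ps} = 23\,\text{ps},
\qquad
\frac{\Delta t}{t_{\text{base}}} = \frac{23\,\text{ps}}{569\,\text{ps}} \approx 0.040 = 4.0\%
\]
```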