1 power-aware system on a chip a. laffely, j. liang, r. tessier, c. a. moritz, w. burleson...

1

Power-Aware System on a Chip

A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson
University of Massachusetts Amherst

Boston Area Architecture Conference
30 Jan 2003

{alaffely, jliang, tessier, moritz, burleson}@ecs.umass.edu

This material is based upon work supported by the National Science Foundation under Grant No. 9988238. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

2

Motivation

• Problem:
  • Need low-power architectures for wireless DSP
  • How to support dynamic clock and voltage scaling in heterogeneous systems with data-dependent workloads (granularity, overhead, control)
• A Solution:
  • Use modularity of SoC; apply at IP core level
  • Apply discrete frequency and voltage scaling
  • Use interconnect utilization measures and data rate requirements to dynamically control scaling

3

Overview

• Adaptive System-on-a-Chip
• Implementation Approach
• Preliminary Results
• Conclusions and Challenges

4

Adaptive System-on-a-Chip

• Tiled architecture with mesh interconnect
  • Point-to-point communication pipeline
• Allows for heterogeneous cores
  • Differing sizes, clock rates, voltages
• Low-overhead core interface
  • On-chip bus substitute for streaming applications
• Based on static scheduling
  • Fast and predictable

[Figure: aSoC tile array with heterogeneous cores (Proc, FPGA, Multiplier). Each tile pairs a core with a communication interface (ctrl) that has North, South, East, and West ports.]

5

aSoC Implementation

[Figure: aSoC layout, full custom, 0.18 technology; dimension labels 2500 and 3000 (units not given).]

6

Some Results

• 9- and 16-core systems tested for IIR, MPEG encoding, and image processing applications
• ~2x the performance of the CoreConnect bus (burst and hierarchical)
• ~1.5x the performance of an oblivious routing network¹ (dynamic routing)
• Max speedup is 5x

1. W. Dally and H. Aoki, "Deadlock-free Adaptive Routing in Multicomputer Networks Using Virtual Channels", IEEE Transactions on Parallel and Distributed Systems, April 1993

7

Dynamic Properties of Statically Routed System??

• Dynamically parameterizable cores proposed to save power
  • Motion Estimation core (by P. Jain, UMass) changes from 256 cycles/pixel to 16 cycles/pixel based on input data
• Streams within the scheduled communication pipeline can be blocked and back up, or go unused
• Inefficient to simply run at fastest rate

[Figure: ME core feeding DCT core over scheduled communications; ME latency changes with data.]

8

Key Features for Dynamic Power Reduction

• SoC modularity
  • Sets a manageable granularity for voltage scaling
• Heterogeneous cores
  • Multiple on-chip clocks and voltages already supported
  • Core interface already handles synchronization and level conversion
• Statically scheduled
  • Interconnect traffic indicates system bottlenecks

9

Approach

• Stream-based cores
  • Limited buffering
• Core-ports
  • Single buffer for each stream to cross clock/voltage barrier between core and interface
  • Reading/writing success rates indicate core utilization
    • Input blocked: core too slow
    • Output blocked: core too fast
• Controller
  • Interprets core-port success rates to adjust local clock and voltage

[Figure: processing pipeline through the interconnect, input core-port (buffer), core, and output core-port. "Blocked" signals from both core-ports go to the clock and supply controller, which provides the local Vdd and local clock.]

10

Power-Aware System: Core Utilization Measurement

• Accumulate failures at each core-port to control clock change
  • Blocked: add 1
  • Success: subtract 1
• Threshold and compare input and output failure counts
  • Many input, few output: increase frequency
  • Many output, few input: decrease frequency
  • Many or few of both: do nothing

[Figure: counters on the input and output core-ports record "Blocked" events on the data path between the interconnect interface and the core; a compare-and-threshold block increases or decreases the local clock.]
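The counting and thresholding rule on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the authors' hardware: the class name `PortCounter`, the function `clock_decision`, the saturation limits (0 and 15), and the threshold of 8 are all assumptions, since the slides do not give counter widths or threshold values.

```python
from dataclasses import dataclass


@dataclass
class PortCounter:
    """Failure counter for one core-port (hypothetical software model)."""
    count: int = 0
    max_count: int = 15  # assumed saturation limit; not given in the slides

    def record(self, blocked: bool) -> None:
        # Blocked transfer: add 1. Successful transfer: subtract 1.
        step = 1 if blocked else -1
        # Assumption: the counter saturates at 0 and max_count.
        self.count = min(self.max_count, max(0, self.count + step))


def clock_decision(in_count: int, out_count: int, threshold: int = 8) -> int:
    """Return +1 (increase frequency), -1 (decrease), or 0 (do nothing)."""
    many_in = in_count >= threshold
    many_out = out_count >= threshold
    if many_in and not many_out:
        return 1    # many input failures, few output: core too slow
    if many_out and not many_in:
        return -1   # many output failures, few input: core too fast
    return 0        # many or few of both: do nothing


# Example: the input port blocks ten times while the output port always succeeds.
inp, out = PortCounter(), PortCounter()
for _ in range(10):
    inp.record(blocked=True)
    out.record(blocked=False)
print(clock_decision(inp.count, out.count))
```

In this example the input count climbs past the assumed threshold while the output count stays at zero, so the decision is to increase the local clock.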

11

Power-Aware System: Local Clock Selection

• Derived from high-frequency global clock
• 8 possible values (Global Clock / $2^n$, $n = 0 \ldots 7$)
• Move one up or down each transition

[Figure: a counter driven by the rate measurement selects one of the divided global clocks (/128, /64, /32, /16, /8, /4, /2, /1) as the core's local clock.]
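A minimal sketch of the one-step-per-transition selection, assuming the decision signal is +1 (faster), -1 (slower), or 0 (hold). The names `DIVIDERS`, `step_divider`, and `local_clock_mhz`, the clamping at the ends of the range, and the 128 MHz global clock in the example are illustrative assumptions rather than details from the slides.

```python
# Divider settings from the slide, ordered from slowest (/128) to fastest (/1).
DIVIDERS = [128, 64, 32, 16, 8, 4, 2, 1]


def step_divider(index: int, decision: int) -> int:
    """Move one setting faster (+1) or slower (-1); clamp at the ends (assumed)."""
    return min(len(DIVIDERS) - 1, max(0, index + decision))


def local_clock_mhz(global_mhz: float, index: int) -> float:
    """Local clock is the global clock divided by the selected power of two."""
    return global_mhz / DIVIDERS[index]


# Example: a core running at /1 that is 16x too fast needs four "slower" steps.
index = len(DIVIDERS) - 1          # start at /1
for _ in range(4):
    index = step_divider(index, -1)
print(DIVIDERS[index], local_clock_mhz(128.0, index))  # 128 MHz is hypothetical
```

Because each transition moves only one setting, a core that starts a factor of $2^k$ away from its target rate needs at least $k$ transitions to get there.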

12

Power-Aware System: Voltage Selection System

• Choose one of 4 supply voltages
• Look-up table (LUT) used to match voltage to frequency setting for specific core
• Using cascading buffers, core Vdd can change within 30 ns (250 nm technology)

[Figure: a LUT driven by the clock selector chooses one of four supplies (V1, V2, V3, V4) as the core's local supply.]
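The per-core look-up table can be pictured as a simple mapping from clock setting to supply rail. The mapping below is hypothetical: the slides name four supplies (V1 to V4) and eight clock settings but do not say which setting uses which supply, so this sketch assumes V1 is the highest supply and that every setting slower than /8 shares the lowest one.

```python
# Hypothetical LUT for one core: clock divider -> supply rail name.
# Assumption: V1 is the highest supply, V4 the lowest.
VOLTAGE_LUT = {
    1: "V1",
    2: "V2",
    4: "V3",
    8: "V4",
    16: "V4",
    32: "V4",
    64: "V4",
    128: "V4",
}


def select_supply(divider: int) -> str:
    """Return the supply rail matched to the current clock divider."""
    return VOLTAGE_LUT[divider]


print(select_supply(1), select_supply(4), select_supply(128))
```

Keeping the table per core reflects the slide's point that the voltage needed at a given frequency depends on the specific core.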

13

Vdd Selection Criteria

• As Vdd decreases, delay increases exponentially
• Use curve to match available clock frequencies to voltages
• The voltage drop reduces power by 70%, 84%, and 89%
  • $P = C \, V_{dd}^{2} \, f$

[Figure: Normalized Core Critical Path Delay vs. Vdd. x-axis: Voltage (0.4 to 2); y-axis: Normalized Delay (0 to 12). Marked operating points: Max Speed, 1/2 Speed, 1/4 Speed, 1/8 Speed; marked voltages: 0.72 and 1.05.]
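The relation $P = C \, V_{dd}^{2} \, f$ makes it easy to estimate relative savings, because the capacitance term cancels when comparing two operating points of the same core. The sketch below is a worked illustration only: the function names are hypothetical, and the 1.8 V full-speed supply is an assumption (the slides do not state the nominal voltage). The 0.72 V value is one of the voltages marked on the plot.

```python
def relative_power(v: float, f: float, v_max: float, f_max: float) -> float:
    """P / P_max under P = C * Vdd^2 * f; C cancels in the ratio."""
    return (v / v_max) ** 2 * (f / f_max)


def voltage_only_reduction(v: float, v_max: float) -> float:
    """Fraction of power saved by the voltage drop alone (frequency held fixed)."""
    return 1.0 - (v / v_max) ** 2


V_MAX = 1.8  # assumed full-speed supply in volts; not stated in the slides

# Saving from lowering the supply to 0.72 V, ignoring the frequency change.
print(round(voltage_only_reduction(0.72, V_MAX), 2))

# Combined effect if that supply is paired with a quarter-rate clock (hypothetical pairing).
print(round(relative_power(0.72, 1.0, V_MAX, 4.0), 2))
```

Under the 1.8 V assumption, dropping to 0.72 V gives a voltage-only reduction of about 84%, which is in line with one of the reductions quoted above, although the slides do not state which voltage pairs with which figure.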

14

Power Savings

• Two-core system
  • ME chooses 3 different algorithms based on input data
  • DCT constant rate

[Figure: ME core feeding DCT core.]

                         Optimal Frequencies              Frequency and Voltage Scaling
Core: Mode               Frequency (MHz)   Power (mW)     Power (mW)
ME: Full Search          105               973            973
ME: Spiral               9.9               76             7.6
ME: Three Step Search    2.75              25             2.5
DCT                      9.6               54             5.4

Core power from Synopsys RTL simulation

15

Test System Results

• Simple test case
  • Core 1 starts 16x too fast
  • Core 2 starts 8x too slow

[Figure: Core1 feeding Core2; plot of relative clock frequency (y-axis, 0 to 18) for Core1 and Core2 vs. number of clock cycles (x-axis, 0 to 2500).]

16

Key Issues

• Count value required to control frequency shifting?
  • May be application and core dependent
• Core characterization
  • Not easy, data dependent
  • Some tools exist for StrongARM (JouleTrack, A. Sinha, MIT)
• Benchmark development
  • A bit tedious

17

Conclusions

• SoC: a good candidate platform for voltage scaling implementation
  • Convenient granularity
  • Low overhead
  • Easily measurable control mechanism
• Hardware
  • Preliminary results
  • Now test real benchmarks and data
