1
Power-Aware System on a Chip
A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson
University of Massachusetts Amherst
Boston Area Architecture Conference
30 Jan 2003
{alaffely, jliang, tessier, moritz, burleson}@ecs.umass.edu
This material is based upon work supported by the National Science Foundation under Grant No. 9988238. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
2
Motivation
• Problem:
  • Need low-power architectures for wireless DSP
  • How to support dynamic clock and voltage scaling in heterogeneous systems with data-dependent workloads (granularity, overhead, control)
• A Solution:
  • Use modularity of SoC; apply at IP core level
  • Apply discrete frequency and voltage scaling
  • Use interconnect utilization measures and data rate requirements to dynamically control scaling
3
Overview
• Adaptive System-on-a-Chip
• Implementation Approach
• Preliminary Results
• Conclusions and Challenges
4
Adaptive System-on-a-Chip
• Tiled architecture with mesh interconnect
  • Point-to-point communication pipeline
• Allows for heterogeneous cores
  • Differing sizes, clock rates, voltages
• Low-overhead core interface
  • On-chip bus substitute for streaming applications
• Based on static scheduling
  • Fast and predictable
[Figure: aSoC tiles with Proc, FPGA, and Multiplier cores; tile detail showing the Core, ctrl, and the Communication Interface with North, South, East, and West ports]
5
aSoC Implementation
• 0.18 µm technology, full custom
[Figure: layout with dimensions labeled 3000 and 2500]
6
Some Results
• 9- and 16-core systems tested for IIR, MPEG encoding, and image processing applications
  • ~2x the performance of the CoreConnect bus (burst and hierarchical)
  • ~1.5x the performance of an oblivious routing network [1] (dynamic routing)
  • Max speedup is 5x
[1] W. Dally and H. Aoki, "Deadlock-free Adaptive Routing in Multi-computer Networks Using Virtual Channels", IEEE Transactions on Parallel and Distributed Systems, April 1993
7
Dynamic Properties of Statically Routed System??
• Dynamically parameterizable cores proposed to save power
  • Motion Estimation core (by P. Jain, UMASS) changes from 256 cycles/pixel to 16 cycles/pixel based on input data
• Streams within the scheduled communication pipeline can be blocked and back up, or go unused
  • Inefficient to simply run at fastest rate
[Figure: ME core feeding DCT core over scheduled communications; ME latency changes with data]
8
Key Features for Dynamic Power Reduction
• SoC modularity
  • Sets a manageable granularity for voltage scaling
• Heterogeneous cores
  • Multiple on-chip clocks and voltages already supported
  • Core interface already handles synchronization and level conversion
• Statically scheduled
  • Interconnect traffic indicates system bottlenecks
9
Approach
• Stream-based cores
  • Limited buffering
• Core-ports
  • Single buffer for each stream to cross clock/voltage barrier between core and interface
  • Reading/writing success rates indicate core utilization
    • Input blocked: core too slow
    • Output blocked: core too fast
• Controller
  • Interprets core-port success rates to adjust local clock and voltage
[Figure: Core with input and output core-ports, each with a buffer, between the interconnect and the processing pipeline; Blocked signals from both core-ports feed the clock and supply controller, which sets the local Vdd and local clock]
10
Power-Aware System: Core Utilization Measurement
• Accumulate failures at each core-port to control clock change
  • Blocked: add 1
  • Success: subtract 1
• Threshold and compare input and output failure counts
  • Many input, few output: increase frequency
  • Many output, few input: decrease frequency
  • Many or few of both: do nothing
[Figure: Blocked signals from Core-port In and Core-port Out drive two counters; a compare-and-threshold stage increases or decreases the local clock of the core; data passes between the core and the interconnect interface]
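The counting and comparison rule on this slide can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware described in the talk: the names `PortCounter` and `clock_decision`, the clamping of the count to the range 0..255, and the single shared threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PortCounter:
    """Failure accumulator for one core-port (hypothetical model)."""
    count: int = 0
    max_count: int = 255  # assumed saturation limit, not given on the slide

    def update(self, blocked: bool) -> None:
        # Blocked transfer: add 1. Successful transfer: subtract 1.
        step = 1 if blocked else -1
        # Clamping to [0, max_count] is an assumption of this sketch.
        self.count = min(self.max_count, max(0, self.count + step))


def clock_decision(in_count: int, out_count: int, threshold: int) -> int:
    """Return +1 (increase frequency), -1 (decrease), or 0 (do nothing)."""
    many_in = in_count >= threshold
    many_out = out_count >= threshold
    if many_in and not many_out:
        return 1    # input blocked: core too slow
    if many_out and not many_in:
        return -1   # output blocked: core too fast
    return 0        # many or few of both: do nothing


if __name__ == "__main__":
    in_port, out_port = PortCounter(), PortCounter()
    for _ in range(20):
        in_port.update(blocked=True)    # input side keeps getting blocked
        out_port.update(blocked=False)  # output side always succeeds
    print(clock_decision(in_port.count, out_port.count, threshold=16))
```

In this model a core whose input port keeps blocking while its output port succeeds is asked to speed up; the threshold value itself is left open, as it is in the talk.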
11
Power-Aware System: Local Clock Selection
• Derived from high-frequency global clock
• 8 possible values (Global Clock / 2^n)
• Move one up or down each transition
[Figure: a counter driven by the rate measurement selects one of the divided global clocks (/128, /64, /32, /16, /8, /4, /2, /1) as the core local clock]
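As a rough illustration of the selection rule, the sketch below models the local clock as the global clock divided by 2^n with n in 0..7, moving one setting per transition. The function names and the convention that a decision of +1 means "increase frequency" are assumptions of this example, not details from the talk.

```python
NUM_SETTINGS = 8  # divide-by 1, 2, 4, ..., 128


def step_divider(n: int, decision: int) -> int:
    """Move the divider exponent n by one setting.

    n = 0 is the fastest clock (global / 1), n = 7 the slowest (global / 128).
    decision is +1 to increase frequency, -1 to decrease, 0 to hold.
    """
    if decision not in (-1, 0, 1):
        raise ValueError("decision must be -1, 0, or +1")
    # Increasing frequency means a smaller exponent; clamp at both ends.
    return min(NUM_SETTINGS - 1, max(0, n - decision))


def local_clock_hz(global_clock_hz: float, n: int) -> float:
    """Local clock frequency for divider exponent n."""
    return global_clock_hz / (2 ** n)


if __name__ == "__main__":
    n = 3
    n = step_divider(n, -1)          # slow down one setting
    print(n, local_clock_hz(128e6, n))
```

Clamping at the ends of the range is one possible way to handle a request to move past the fastest or slowest setting; the slides do not say how the hardware treats that case.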
12
Power-Aware System: Voltage Selection System
• Choose one of 4 supply voltages
• Look-up table (LUT) used to match voltage to frequency setting for specific core
• Using cascading buffers, core Vdd can change within 30 ns (250 nm technology)
[Figure: LUT driven from the clock selector chooses among supplies V1, V2, V3, V4 for the core local supply]
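A minimal sketch of the look-up step, assuming the clock selector exposes its setting as a divider exponent n in 0..7. The table contents below are purely hypothetical; the talk says only that each core has its own LUT mapping frequency settings onto one of four supplies, so the supplies are kept as the symbolic names V1 to V4.

```python
SUPPLIES = ("V1", "V2", "V3", "V4")  # V1 assumed to be the highest supply

# Hypothetical per-core LUT: divider exponent n (0..7) -> index into SUPPLIES.
# A real table would come from characterizing the core's critical path.
EXAMPLE_LUT = (0, 1, 2, 3, 3, 3, 3, 3)


def select_supply(n: int, lut: tuple = EXAMPLE_LUT) -> str:
    """Return the supply name matched to clock setting n for one core."""
    if not 0 <= n < len(lut):
        raise ValueError("clock setting out of range")
    return SUPPLIES[lut[n]]


if __name__ == "__main__":
    for n in range(len(EXAMPLE_LUT)):
        print(f"global/{2 ** n}: {select_supply(n)}")
```

With four supplies and eight clock settings, several slow settings necessarily share a supply; which ones share is a per-core choice.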
13
Vdd Selection Criteria
[Figure: Normalized Core Critical Path Delay vs. Vdd; x-axis Voltage (0.4 to 2), y-axis Normalized Delay (0 to 12); points marked Max Speed, 1/2 Speed, 1/4 Speed, and 1/8 Speed; voltages 0.72 and 1.05 labeled]
• As Vdd decreases, delay increases exponentially
• Use curve to match available clock frequencies to voltages
• The voltage drop reduces power by 70%, 84%, and 89%
  • P = C · Vdd^2 · f
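The power relation on this slide can be checked numerically. The sketch below evaluates P = C · Vdd^2 · f and the fraction of power removed by lowering Vdd at a fixed frequency. The nominal supply of 1.8 V is an assumption of the example (the slide does not state it); under that assumption the 0.72 V point labeled in the figure gives a reduction of about 84%.

```python
def dynamic_power(c_farads: float, vdd_volts: float, f_hz: float) -> float:
    """Dynamic power P = C * Vdd^2 * f."""
    return c_farads * vdd_volts ** 2 * f_hz


def voltage_only_reduction(vdd_volts: float, vdd_max_volts: float) -> float:
    """Fraction of power saved by lowering Vdd at an unchanged frequency."""
    return 1.0 - (vdd_volts / vdd_max_volts) ** 2


if __name__ == "__main__":
    VDD_MAX = 1.8  # assumed nominal supply, not stated on the slide
    saving = voltage_only_reduction(0.72, VDD_MAX)
    print(f"{saving:.0%}")
```

Because the frequency is lowered together with the voltage, the total saving relative to full speed is larger than the voltage-only figure: the Vdd^2 factor multiplies the frequency ratio.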
14
Power Savings
• Two-core system
  • ME chooses among 3 different algorithms based on input data
  • DCT constant rate
[Figure: ME core feeding DCT core]

Core: Mode               Optimal frequency (MHz)   Power at optimal frequency (mW)   Power with frequency and voltage scaling (mW)
ME: Full Search          105                       973                               973
ME: Spiral               9.9                       76                                7.6
ME: Three Step Search    2.75                      25                                2.5
DCT                      9.6                       54                                5.4

Core power from Synopsys RTL simulation
15
Test System Results
• Simple test case
  • Core 1 starts 16x too fast
  • Core 2 starts 8x too slow
[Figure: Core1 feeding Core2; plot of relative clock frequency (0 to 18) vs. number of clock cycles (0 to 2500) for Core1 and Core2]
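Since the local clock moves one power-of-two setting per transition, the minimum number of transitions needed to correct a starting mismatch is the base-2 logarithm of the mismatch factor. The helper below is a hypothetical illustration of that arithmetic only; it says nothing about how many clock cycles each transition takes in the test system.

```python
def transitions_needed(mismatch_factor: int) -> int:
    """Minimum one-step clock changes to correct a power-of-two mismatch."""
    is_power_of_two = (
        mismatch_factor >= 1 and mismatch_factor & (mismatch_factor - 1) == 0
    )
    if not is_power_of_two:
        raise ValueError("expected a power-of-two mismatch factor")
    # For a power of two 2^k, bit_length() is k + 1.
    return mismatch_factor.bit_length() - 1


if __name__ == "__main__":
    print(transitions_needed(16))  # Core 1: 16x too fast
    print(transitions_needed(8))   # Core 2: 8x too slow
```

For the test case on this slide, that is four downward steps for Core 1 and three upward steps for Core 2, assuming neither core overshoots.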
16
Key Issues
• Count value required to control frequency shifting?
  • May be application and core dependent
• Core characterization
  • Not easy, data dependent
  • Some tools exist for StrongARM (JouleTrack, A. Sinha, MIT)
• Benchmark development
  • A bit tedious
17
Conclusions
• SoC: a good candidate platform for voltage scaling implementation
  • Convenient granularity
  • Low overhead
  • Easily measurable control mechanism
• Hardware
  • Preliminary results
  • Now test real benchmarks and data