energy characterization of embedded processors for software energy optimization · 2014-05-19 ·...

Energy Characterization of Embedded Processors for

Software Energy Optimization

Tohru Ishihara

Kyoto University

1

Lovic Gauthier

Kyushu University

2011/07/05The 11th International Forum on MPSoC and Multicore

Agenda

2

Introduction Processor Energy Characterization Software Energy Optimization

Motivation

3

Portable systems execute power hungry applications Most functions are implemented by software Energy depends on the software running on

processor systems

www.princeton.edu/~wolf/

Software energy analysis is necessary for reducing the energy consumption of embedded systems

Energy Characterization

4

Microprocessors &Memory subsystems

• Calculate the energy dissipated for a hardware event using post-layout simulation of a target processor system

• The hardware events include data read, data write, cache miss, program branch execution and so on

Gate-level simulatorHW events EnergyData read 15 nJCache miss 32 nJBranch exec. 58 nJ… …

Energy Lookup Table

D. Lee, T. Ishihara, M. Muroyama, H. Yasuura, F. Fallha, “An Energy Characterization Framework for Software-Based Embedded Systems,” Proc. of ESTIMedia, pp.59-64, Oct., 2006.

Our Approach

5

• Run a number of training benches on ISS of a target processor• Run the same benches on a post-layout model of the processor• Fit a linear model through regression analysis

Gate-level Power Estimation(e.g. Verilog-XL, PowerCompiler)

ISS (GNU tool)

Post-layout model of CPU

N1: # instruction/data cache missesN2: # taken branches executed… …

Eestimated (1)= e1·N1(1)+e2·N2(1)+ … +en·Nn(1)

… …

EGL (1)Eestimated (2)= e1·N1(2)+e2·N2(2)+ … +en·Nn(2)EGL (2)

Eestimated (m)= e1·N1(m)+e2·N2(m)+ … +en·Nn(m)EGL (m)

Training benches

EGL(i): Energy estimated for training bench (i)

Software Energy Analysis

6

Software Simulator (e.g GNUPro sid )

• A linear expression for the software energy consumption

• The number of the hardware events should be counted by instruction set simulator (ISS)

# HW events

Energy of SW

HW events EnergyData read 15 nJCache miss 32 nJBranch exec. 58 nJ… …

Energy Lookup Table

eventsHW # : event,HW ofenergy : iiiiestimated NeNeE

Accuracy Evaluation

7

Target system

– Processors M32R-II, SH3-DSP (Renesus) MeP (Toshiba)

– 0.18μm CMOS library

I-CacheCPUcore D-Cache

SDRAM(Micron’s DDR2)

Processor Main Memory

Results for M32R-II & SH3-DSP

50

100

150

200

Ener

gy

Cons

umpt

ion

[μJ]

Instruction Frame Number

Post-layoutOur Approach < 1 min.

2.5 hours

Estimation error < 3%

70

90

110

130

Ene

rgy

Con

sum

ptio

n [μ

J]

Instruction Frame Number

Post-layoutOur Approach

Estimation error < 3%

Energy estimation for JPEG encoder executed on a SH3-DSP processor

Energy estimation for JPEG encoder executed on a M32R-II processor

8

Multi-Performance Processor

9

L1-cache

PE core

PE core

L2 cache

MPU cores

Based on MeP (Toshiba) PEs have the same ISA but differ in their clock speeds and

energy consumptions Intra MPU core: a single PE core runs alternatively Inter MPU cores: multiple MPU cores run concurrently

Cache associativity value can be

dynamically changed

T. Ishihara, “Real-Time Dynamic Voltage Hopping on MPSoCs,” MPSoC 2009, Savannah.

1.8V

1.0V

0

2000

4000

6000

8000

10000

12000

14000

16000

0

5000

10000

15000

20000

25000

Results for MPP

< 10 sec

9.5 hoursEstimation error = 3.2%

Estimation error = 2.3%

Energy estimation for JPEG encoder run on a MPP with 1.8V 4-w cache

Energy estimation for JPEG encoder run on a MPP with 1.0V 1-w cache

10

Applications

11

Software Energy AnalysisHelp finding energy bottleneck

Software Energy OptimizationCompiler optimizationOS-based power management

Software Energy Analysis (1/2)

12

0

5

10

15

20

25 D-main_mem D-SPM I-cache I-SPM I-main_mem cache miss CPU core Post Layout

Ener

gy di

ssipa

ted fo

r 100

0 ins

tructi

ons [J

]

The number of instructions executed

CPUInst. cache

accessCache miss

Off-chip data access

MPEG2 encoderEnergy for cache access is much larger than that

for cache misses

0

2

4

6

8

10

12

14

16

18

20

Software Energy Analysis (2/2)

13

Ener

gy di

ssipa

ted fo

r 100

0 ins

tructi

ons [J

]

The number of instructions executedCPUInst. cache access

Cache miss

Off-chip read word access

JPEG encoderMore than 50% is stack accesses

Software Energy Optimization

14

Memory consumes a large amount of energy Memory energy depends on program behavior Code optimization contributes to the total energy reduction

SPMCPU

Cache

Off-chip memory

43%10x of on-chipmemory energy

On-chopmemory

Processor Chip

DRAM Chip

Motivation:

Code and Data Allocation

Cacheable region

Scratchpad region

Non-cacheable region

SPMCPU

Cache

Off-chip memory

Size Latency Energy

SPM Small Small Small

Cache Small Small Large

Off-chip mem. Huge Huge Huge

Application program

Function AFunction B

Global variable CGlobal variable DConstant data E

Find the optimal locations of functions and data objects

in a memory address space

Our Compiler OptimizationTechniques

15

Memory address space

Our Approach

16Y. Ishitobi, T. Ishihara, H. Yasuura, “Code and Data Placement for Embedded Processors with Scratchpad and Cache Memories,” Signal Processing Systems 60(2), pp.211-224, August, 2010

Memory area Application programFunction AFunction B

Global variable CGlobal variable DConstant data E

Find code & data placements which • maximize # SPM accesses• minimize # cache misses• minimize # off-chip accessesDo not always minimize the energy

Previous work…

Scratchpad region

Minimize the total energy consumption estimated by our

model through an ILP

Our method…

Cacheable region

Non-cacheable region

Total processor energy reduced by 10%

Stack Allocation

17

SPM

MM

Frame 0

Frame 1

Frame 0

Frame 1a

Frame 1b

Frame 0

Frame 1a

Frame 1b

Frame 2

Store

Frame 0

Frame 1a

Frame 1bLoad

Frame 0

Frame 1

Optimization is done at compile time through an ILP Find the best locations of stack frames based on profiling Store/Load inserted before and after call instructions

Frames are dynamically moved between SPM and MM Stack frame is generated in SPM when the function is called Store frames into MM if there is no space left in the SPM

Stack Allocation Results

18

Circular: Evict oldest frame first if there is no space left in the SPM Static: Place frames in the SPM only if SPM does not overflow Ours: Placement is optimized by the Integer Linear Programming

L. Gauthier, T. Ishihara, “Optimal Stack Frame Placement and Transfer for Energy Reduction Targeting Embedded Processors with Scratch-Pad Memories,” Proc. of IEEE Workshop on Embedded Systems for Real-Time Multimedia, pp.116-125, Oct., 2009.

Multi-Task System

19

The SPM is shared among tasks Several previous techniques

(a) Spatial sharing✔ No management required✔ Very small part of the SPM for each task

(b) Temporal sharing✔ Totally of SPM space to each task✔ SPM update at context switches

(c) Hybrid✔ Best of both approaches✔ Compile time profile-based assignment to

SPM

SPM space

Timet3

t2t1

t0

SPM space

Time

t3t

2

t1

t0t2

t3

t1

t0

(a)

(b)

(c)

SPM space

Time

t3

t0

t2

t0 t1

t3

t2

t1

t1

t1

t0

t3

H. Takase, H. Tomiyama, and H. Takada. Partitioning and allocation of scratch-pad memory for priority-based preemptive multi-task systems. in DATE ’10

SPM Sharing for Multi-Task

20

At compile time: Assign an SPM space (i. e., block) to each task Find memory objects to place in each block Find an address for each block in SPM for minimizing

overlaps At run time: Copy only a part of overlapping with coming task

SPM space

Time

t3

t1

t0

t2

t0 t0

t1

t2

t3

t1

t3t0

t1

t3

t1

t0 t1 t2 t3 t2 t1 t0 t3

t1 t1SPM space (bytes)

Taskst0 t1

β0

β1

0

512

1K

SPM space (bytes)

Taskst0 t1

β0 β10

512

1K

Efficient sharingInefficient sharing

1 2 4 8 160

10

20

30

40

50

60

70

80

SPM size in Kb

Norm

alize

d ene

rgy c

onsu

mptio

n

Set 3 data

Multi-Task Results

21 L. Gauthier, et al. “Minimizing Inter-Task Interferences in Scratch-Pad Memory Usage for Reducing the Energy Consumption of Multi-Task Systems,” Proc. of CASES, pp.157-164, Oct., 2010.

1 2 4 8 160

10

20

30

40

50

60

SPM size in Kb

Set 2 data

1 2 4 8 160

1020304050607080

SPM size in KbNorm

alize

d ene

rgy c

onsu

mptio

n Set 0 data

Space Time Hybird Block

1 2 4 8 1626

28

30

32

34

36

38

SPM size in Kb

Set 1 data

1 2 4 8 160

10

20

30

40

50

60

70

80

SPM size in Kb

Set 4 data

1 2 4 8 160

10

20

30

40

50

60

70

SPM size in Kb

Set 5 data

22

Refine stack placement technique and target heap object (WIP)

A fast accurate model for SW energy consumption

The error of our approach is 5% on an average and 20% at the maximum case

Code and data placement techniques drastically reduce the SW energy

Future work

Summary

energy characterization of embedded processors for software energy optimization · 2014-05-19 ·...

Documents