energy characterization of embedded processors for software energy optimization · 2014-05-19 ·...
TRANSCRIPT
Energy Characterization of Embedded Processors for
Software Energy Optimization
Tohru Ishihara
Kyoto University
1
Lovic Gauthier
Kyushu University
2011/07/05The 11th International Forum on MPSoC and Multicore
Agenda
2
Introduction Processor Energy Characterization Software Energy Optimization
Motivation
3
Portable systems execute power hungry applications Most functions are implemented by software Energy depends on the software running on
processor systems
www.princeton.edu/~wolf/
Software energy analysis is necessary for reducing the energy consumption of embedded systems
Energy Characterization
4
Microprocessors &Memory subsystems
• Calculate the energy dissipated for a hardware event using post-layout simulation of a target processor system
• The hardware events include data read, data write, cache miss, program branch execution and so on
Gate-level simulatorHW events EnergyData read 15 nJCache miss 32 nJBranch exec. 58 nJ… …
Energy Lookup Table
D. Lee, T. Ishihara, M. Muroyama, H. Yasuura, F. Fallha, “An Energy Characterization Framework for Software-Based Embedded Systems,” Proc. of ESTIMedia, pp.59-64, Oct., 2006.
Our Approach
5
• Run a number of training benches on ISS of a target processor• Run the same benches on a post-layout model of the processor• Fit a linear model through regression analysis
Gate-level Power Estimation(e.g. Verilog-XL, PowerCompiler)
ISS (GNU tool)
Post-layout model of CPU
N1: # instruction/data cache missesN2: # taken branches executed… …
Eestimated (1)= e1·N1(1)+e2·N2(1)+ … +en·Nn(1)
… …
EGL (1)Eestimated (2)= e1·N1(2)+e2·N2(2)+ … +en·Nn(2)EGL (2)
Eestimated (m)= e1·N1(m)+e2·N2(m)+ … +en·Nn(m)EGL (m)
Training benches
EGL(i): Energy estimated for training bench (i)
Software Energy Analysis
6
Software Simulator (e.g GNUPro sid )
• A linear expression for the software energy consumption
• The number of the hardware events should be counted by instruction set simulator (ISS)
# HW events
Energy of SW
HW events EnergyData read 15 nJCache miss 32 nJBranch exec. 58 nJ… …
Energy Lookup Table
eventsHW # : event,HW ofenergy : iiiiestimated NeNeE
Accuracy Evaluation
7
Target system
– Processors M32R-II, SH3-DSP (Renesus) MeP (Toshiba)
– 0.18μm CMOS library
I-CacheCPUcore D-Cache
SDRAM(Micron’s DDR2)
Processor Main Memory
Results for M32R-II & SH3-DSP
50
100
150
200
Ener
gy
Cons
umpt
ion
[μJ]
Instruction Frame Number
Post-layoutOur Approach < 1 min.
2.5 hours
Estimation error < 3%
70
90
110
130
Ene
rgy
Con
sum
ptio
n [μ
J]
Instruction Frame Number
Post-layoutOur Approach
Estimation error < 3%
Energy estimation for JPEG encoder executed on a SH3-DSP processor
Energy estimation for JPEG encoder executed on a M32R-II processor
8
Multi-Performance Processor
9
L1-cache
PE core
PE core
L2 cache
MPU cores
Based on MeP (Toshiba) PEs have the same ISA but differ in their clock speeds and
energy consumptions Intra MPU core: a single PE core runs alternatively Inter MPU cores: multiple MPU cores run concurrently
Cache associativity value can be
dynamically changed
T. Ishihara, “Real-Time Dynamic Voltage Hopping on MPSoCs,” MPSoC 2009, Savannah.
1.8V
1.0V
0
2000
4000
6000
8000
10000
12000
14000
16000
0
5000
10000
15000
20000
25000
Results for MPP
< 10 sec
9.5 hoursEstimation error = 3.2%
Estimation error = 2.3%
Energy estimation for JPEG encoder run on a MPP with 1.8V 4-w cache
Energy estimation for JPEG encoder run on a MPP with 1.0V 1-w cache
10
Applications
11
Software Energy AnalysisHelp finding energy bottleneck
Software Energy OptimizationCompiler optimizationOS-based power management
Software Energy Analysis (1/2)
12
0
5
10
15
20
25 D-main_mem D-SPM I-cache I-SPM I-main_mem cache miss CPU core Post Layout
Ener
gy di
ssipa
ted fo
r 100
0 ins
tructi
ons [J
]
The number of instructions executed
CPUInst. cache
accessCache miss
Off-chip data access
MPEG2 encoderEnergy for cache access is much larger than that
for cache misses
0
2
4
6
8
10
12
14
16
18
20
Software Energy Analysis (2/2)
13
Ener
gy di
ssipa
ted fo
r 100
0 ins
tructi
ons [J
]
The number of instructions executedCPUInst. cache access
Cache miss
Off-chip read word access
JPEG encoderMore than 50% is stack accesses
Software Energy Optimization
14
Memory consumes a large amount of energy Memory energy depends on program behavior Code optimization contributes to the total energy reduction
SPMCPU
Cache
Off-chip memory
43%10x of on-chipmemory energy
On-chopmemory
Processor Chip
DRAM Chip
Motivation:
Code and Data Allocation
Cacheable region
Scratchpad region
Non-cacheable region
SPMCPU
Cache
Off-chip memory
Size Latency Energy
SPM Small Small Small
Cache Small Small Large
Off-chip mem. Huge Huge Huge
Application program
Function AFunction B
Global variable CGlobal variable DConstant data E
Find the optimal locations of functions and data objects
in a memory address space
Our Compiler OptimizationTechniques
15
Memory address space
Our Approach
16Y. Ishitobi, T. Ishihara, H. Yasuura, “Code and Data Placement for Embedded Processors with Scratchpad and Cache Memories,” Signal Processing Systems 60(2), pp.211-224, August, 2010
Memory area Application programFunction AFunction B
Global variable CGlobal variable DConstant data E
Find code & data placements which • maximize # SPM accesses• minimize # cache misses• minimize # off-chip accessesDo not always minimize the energy
Previous work…
Scratchpad region
Minimize the total energy consumption estimated by our
model through an ILP
Our method…
Cacheable region
Non-cacheable region
Total processor energy reduced by 10%
Stack Allocation
17
SPM
MM
Frame 0
Frame 1
Frame 0
Frame 1a
Frame 1b
Frame 0
Frame 1a
Frame 1b
Frame 2
Store
Frame 0
Frame 1a
Frame 1bLoad
Frame 0
Frame 1
Optimization is done at compile time through an ILP Find the best locations of stack frames based on profiling Store/Load inserted before and after call instructions
Frames are dynamically moved between SPM and MM Stack frame is generated in SPM when the function is called Store frames into MM if there is no space left in the SPM
Stack Allocation Results
18
Circular: Evict oldest frame first if there is no space left in the SPM Static: Place frames in the SPM only if SPM does not overflow Ours: Placement is optimized by the Integer Linear Programming
L. Gauthier, T. Ishihara, “Optimal Stack Frame Placement and Transfer for Energy Reduction Targeting Embedded Processors with Scratch-Pad Memories,” Proc. of IEEE Workshop on Embedded Systems for Real-Time Multimedia, pp.116-125, Oct., 2009.
Multi-Task System
19
The SPM is shared among tasks Several previous techniques
(a) Spatial sharing✔ No management required✔ Very small part of the SPM for each task
(b) Temporal sharing✔ Totally of SPM space to each task✔ SPM update at context switches
(c) Hybrid✔ Best of both approaches✔ Compile time profile-based assignment to
SPM
SPM space
Timet3
t2t1
t0
SPM space
Time
t3t
2
t1
t0t2
t3
t1
t0
(a)
(b)
(c)
SPM space
Time
t3
t0
t2
t0 t1
t3
t2
t1
t1
t1
t0
t3
H. Takase, H. Tomiyama, and H. Takada. Partitioning and allocation of scratch-pad memory for priority-based preemptive multi-task systems. in DATE ’10
SPM Sharing for Multi-Task
20
At compile time: Assign an SPM space (i. e., block) to each task Find memory objects to place in each block Find an address for each block in SPM for minimizing
overlaps At run time: Copy only a part of overlapping with coming task
SPM space
Time
t3
t1
t0
t2
t0 t0
t1
t2
t3
t1
t3t0
t1
t3
t1
t0 t1 t2 t3 t2 t1 t0 t3
t1 t1SPM space (bytes)
Taskst0 t1
β0
β1
0
512
1K
SPM space (bytes)
Taskst0 t1
β0 β10
512
1K
Efficient sharingInefficient sharing
1 2 4 8 160
10
20
30
40
50
60
70
80
SPM size in Kb
Norm
alize
d ene
rgy c
onsu
mptio
n
Set 3 data
Multi-Task Results
21 L. Gauthier, et al. “Minimizing Inter-Task Interferences in Scratch-Pad Memory Usage for Reducing the Energy Consumption of Multi-Task Systems,” Proc. of CASES, pp.157-164, Oct., 2010.
1 2 4 8 160
10
20
30
40
50
60
SPM size in Kb
Set 2 data
1 2 4 8 160
1020304050607080
SPM size in KbNorm
alize
d ene
rgy c
onsu
mptio
n Set 0 data
Space Time Hybird Block
1 2 4 8 1626
28
30
32
34
36
38
SPM size in Kb
Set 1 data
1 2 4 8 160
10
20
30
40
50
60
70
80
SPM size in Kb
Set 4 data
1 2 4 8 160
10
20
30
40
50
60
70
SPM size in Kb
Set 5 data
22
Refine stack placement technique and target heap object (WIP)
A fast accurate model for SW energy consumption
The error of our approach is 5% on an average and 20% at the maximum case
Code and data placement techniques drastically reduce the SW energy
Future work
Summary