Estimating the Worst-Case Energy Consumption of Embedded Software
DESCRIPTION
Estimating the Worst-Case Energy Consumption of Embedded Software. Ramkumar Jayaseelan, Tulika Mitra, Xianfeng Li. School of Computing, National University of Singapore. Motivation: conventional scheduling techniques give timing guarantees; processor cycles are the critical resource.
TRANSCRIPT
1
Estimating the Worst-Case Energy Consumption of Embedded Software
Ramkumar Jayaseelan, Tulika Mitra, Xianfeng Li
School of Computing, National University of Singapore
2
Motivation
Conventional scheduling techniques give timing guarantees
Processor cycles are the critical resource
The worst-case execution time (WCET) of each task is a required input
Battery life is equally important for mobile devices
Scheduling techniques have to give energy guarantees
The Worst-Case Energy Consumption (WCEC) of each task is a required input
3
Remotely Deployed Systems
Available energy is unevenly distributed among nodes
Spatio-temporal scheduling benefits from WCEC
[Figure: a sensor network and a local station]
4
Energy-Based Guarantees
Scheduling critical and non-critical tasks in a battery-operated system
Non-critical tasks can be run only if energy constraints for critical tasks are satisfied
Worst-case energy estimation is crucial
5
Reward-Based Scheduling
Energy consumption grows with supply voltage, while delay grows with 1 / Voltage
Reward-based scheduling attempts to satisfy constraints on both energy and timing
An energy guarantee is possible only if the worst-case energy consumption of the tasks is known
6
Outline
Background
Relation between WCET and worst-case energy consumption
Estimation technique: Simplified model
Instruction cache and speculation
Experimental results
Conclusion
7
Background
Power and energy are often used interchangeably
Power is the energy consumed per unit time
Energy consumed during program execution: E = P × t
This is an approximation, as P is also a function of time
8
In reality when a program executes
Energy is the area under the power curve: E = ∫P(t)dt
E = P × t is an approximation
[Figure: power plotted against time; the energy is the area under the curve]
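The gap between the integral and the product form can be illustrated numerically. The sketch below is only an illustration: the power trace is made up, and the trapezoidal rule stands in for the exact integral.

```python
def energy_trapezoid(times, powers):
    """Approximate E = integral of P(t) dt with the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        total += 0.5 * (powers[i] + powers[i - 1]) * dt
    return total


# Hypothetical power trace (watts) sampled at these times (seconds).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
powers = [2.0, 4.0, 4.0, 1.0, 1.0]

# Area under the sampled power curve.
e_integral = energy_trapezoid(times, powers)

# E = P x t with a single power value (here the peak) over the whole run.
e_peak_product = max(powers) * (times[-1] - times[0])

print(e_integral, e_peak_product)
```

With this trace the area under the curve is 10.5 J, while peak power times duration gives 16 J, so a single power figure can misstate the energy considerably.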
9
WCEC versus WCET
[Figure: total energy (nanojoules, axis from 13000 to 21000) and execution time (cycles, axis from 4500 to 5100) plotted against program inputs. Full input space expansion for a 5-element insertion sort program.]
10
Cannot Estimate WCEC from WCET
Benchmark   WCET × avg_power (µJ)   Observed (µJ)
isort                      489.92          525.88
fft                      12106.49        10260.86
fdct                       138.20          105.57
ludcmp                     131.76          119.33
matsum                     972.03         1154.31
minver                      93.61           80.80
bsearch                      3.84            3.07
des                        724.05          643.75
matmult                    178.12          166.88
qsort                       54.79           43.73
qurt                        23.80           17.65
Possible underestimation using WCEC = WCET × power
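The underestimation can be checked directly from the numbers on this slide. The short sketch below copies the eleven rows and lists the benchmarks for which WCET × avg_power falls below the observed energy; the variable names are chosen here for illustration.

```python
# (WCET x avg_power, observed) in microjoules, copied from the slide.
rows = {
    "isort": (489.92, 525.88),
    "fft": (12106.49, 10260.86),
    "fdct": (138.20, 105.57),
    "ludcmp": (131.76, 119.33),
    "matsum": (972.03, 1154.31),
    "minver": (93.61, 80.80),
    "bsearch": (3.84, 3.07),
    "des": (724.05, 643.75),
    "matmult": (178.12, 166.88),
    "qsort": (54.79, 43.73),
    "qurt": (23.80, 17.65),
}

# A bound is unsafe when the estimate is below an energy that was observed.
underestimated = [name for name, (est, obs) in rows.items() if est < obs]

print(underestimated)
```

For isort and matsum the product is lower than an observed value, so it cannot serve as a safe worst-case bound.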
11
WCEC versus WCET
WCEC path need not be the same as the WCET path
WCEC cannot be directly estimated from the WCET value
12
A closer look at Power
Dynamic power: power consumed due to the switching of transistors
Leakage power: power consumed independent of switching activity
Dynamic power forms the bulk of power consumption in today’s processors
13
Dynamic Power
$P = \frac{1}{2} \times A \times V^{2} \times C \times f$
V is the supply voltage
C is the capacitance of the circuit
f is the frequency
A is the activity factor
V, C, f are independent of program execution
Variation in P is due to the variation in A
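As a small illustration of the formula, the sketch below evaluates the dynamic power for two activity factors while V, C and f stay fixed. The numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def dynamic_power(activity, voltage, capacitance, frequency):
    """P = 1/2 * A * V^2 * C * f."""
    return 0.5 * activity * voltage ** 2 * capacitance * frequency


# Hypothetical operating point: 2 V supply, 1 nF switched capacitance, 100 MHz.
V, C, F = 2.0, 1e-9, 1e8

p_low = dynamic_power(0.25, V, C, F)   # a quarter of the circuit switching
p_high = dynamic_power(0.5, V, C, F)   # half of the circuit switching

print(p_low, p_high)
```

With V, C and f fixed, doubling A doubles P, which is why the analysis concentrates on how A varies during execution.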
14
Variation in Activity Factor (A)
Not all parts of the processor are used in every cycle, e.g., the data cache is used only for loads/stores
Clock gating disables unused components
Activity factor (A) varies during the execution of the program
Model variation in A through static analysis
15
Switch-off Energy
An inactive component cannot be fully switched off
A certain portion of the peak energy is consumed even in idle cycles
Switch-off energy is proportional to the number of idle cycles
16
Clock Energy and Leakage Energy
Clock power: power consumed in the clock distribution network
Leakage power: power consumed due to leakage in transistors
Clock energy and leakage energy are directly proportional to the execution time
17
Energy Components Summary
Dynamic Energy
  Switching of transistors during execution
  Independent of execution time
Switch-off Energy
  Energy consumed in unused components
  Depends on idle cycles
Clock and Leakage Energy
  Directly proportional to execution time
18
WCEC versus WCET
[Figure: total energy (nanojoules, axis from 13000 to 21000) and execution time (cycles, axis from 4500 to 5100) plotted against program inputs. Full input space expansion for a 5-element insertion sort program.]
19
Our Analysis: Overview
Operate on the control flow graph
Estimate the worst-case energy of basic blocks
Formulate estimation for the whole program as an integer linear programming (ILP) problem
20
ILP Formulation
Input: Control flow graph of the program
Objective function (maximized):
$\text{Worst-Case Energy} = \sum_{B} WCEC_{B} \times count_{B}$
Need to estimate the Worst-Case Energy Consumption (WCEC_B) for each basic block
21
Flow Constraints
E0,1 = B0 = 1
E2,3 + E1,3 = B3 = 1
E0,1 + E2,1 = E1,2 + E1,3 = B1
E1,2 = E2,3 + E2,1 = B2
Loop bound: E2,1 <= 100
[Figure: control flow graph with basic blocks B0, B1, B2 and B3]
Inflow = Basic Block Execution Count = Outflow
Bounds on maximum loop iterations
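To make the flow constraints concrete, the sketch below searches for the execution counts that maximize total energy on this four-block graph. It is not the ILP machinery itself: the per-block energies are hypothetical, and plain enumeration replaces an ILP solver because the example is tiny.

```python
# Hypothetical worst-case energy per basic block (arbitrary units).
WCEC = {"B0": 5, "B1": 10, "B2": 20, "B3": 5}
LOOP_BOUND = 100  # E2,1 <= 100


def solve():
    """Enumerate edge counts that satisfy the flow constraints; keep the best."""
    best = None
    e01 = 1  # E0,1 = B0 = 1
    for e21 in range(LOOP_BOUND + 1):
        for e13 in (0, 1):
            e23 = 1 - e13        # E2,3 + E1,3 = B3 = 1
            e12 = e23 + e21      # E1,2 = E2,3 + E2,1 = B2
            if e01 + e21 != e12 + e13:  # inflow of B1 must equal its outflow
                continue
            counts = {"B0": e01, "B1": e01 + e21, "B2": e12, "B3": e23 + e13}
            energy = sum(WCEC[b] * counts[b] for b in counts)
            if best is None or energy > best[0]:
                best = (energy, counts)
    return best


best_energy, best_counts = solve()
print(best_energy, best_counts)
```

Under these assumed energies the maximum is reached when the back edge E2,1 is taken 100 times and the program exits through B2, giving counts of 101 for B1 and B2.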
22
Worst-Case Energy of a Basic Block
Processor Model
Energy Components
  Instruction Specific Energy
  Pipeline Specific Energy
23
Processor Model
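[Figure: processor pipeline with stages IF, ID, ISSUE, EX, WB and CM, an instruction buffer (IBUF), a reorder buffer (ROB), and functional units ALU, MULT and FPU]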
24
Pipelined Execution of Instructions
ADD R1,R2,R3
MUL R4,R5,R6
SUB R7,R8,R9
CC    1   2   3   4   5   6   7   8
ADD   IF  ID  IS  EX  WB  CM
MUL       IF  ID  IS  EX  WB  CM
SUB           IF  ID  IS  EX  WB  CM
Difficult to statically predict the energy consumption in each cycle
25
Pipelined Execution of Instructions
ADD R1,R2,R3
MUL R4,R5,R6
SUB R7,R8,R9
[Figure: the same three instructions over clock cycles 1 to 8, now with two stall cycles. Within the window ADD completes all six stages (IF ID IS EX WB CM), MUL reaches WB, and SUB reaches EX.]
Difficult to statically predict the energy consumption in each cycle
26
Our Approach
Determine the maximum energy consumed on a component by component basis
Static analysis to determine the maximum energy consumed by a component in a specified interval
27
Execution of Instruction
[Figure: an instruction passing through the pipeline stages IF, ID, ISSUE, EX, WB and CM]
28
Instruction Specific Energy
Energy consumed due to the sub-tasks associated with execution of an instruction, e.g., register file access, ALU usage, etc.
Depends on the type of executed instruction
No correlation with execution time
29
Pipeline Specific Energy
During program execution, energy is consumed due to
  Switch-off power (idle cycles)
  Leakage power (every cycle)
  Clock network power (every cycle)
Cannot be attributed to any instruction
Energy is consumed even in idle cycles
30
Energy Components
Observation: Energy consumed can be separated out as
Instruction Specific energy
  Energy associated with the execution of a particular instruction
  Independent of execution time
Pipeline Specific energy
  Energy consumed in other components such as clock network, leakage etc.
  Related to execution time
31
Worst-case Energy of a Basic block
$energy_{BB} = dynamic_{BB} + switchoff_{BB} + leakage_{BB} + clock_{BB}$
dynamic_BB: Instruction-Specific Energy for BB
switchoff_BB, leakage_BB and clock_BB are the energy consumed in unused components, leakage and the clock network during WCET_BB
32
Instruction Specific Energy
Energy consumed due to switching activity generated by the instructions in BB
Sum of energy consumed by individual instructions in BB
$dynamic_{BB} = \sum_{instr \in BB} dynamic_{instr}$
33
Switch-off Energy
Unused units consume 10% of peak energy
Switch-off energy for a specific component (C)
$switchoff_{BB}(C) = Idle\_cycles_{BB}(C) \times energy(C) \times 0.1$
$switchoff_{BB}(C) = (WCET_{BB} - uses_{BB}(C)) \times energy(C) \times 0.1$
Switch-off energy for basic block BB
$switchoff_{BB} = \sum_{C \in components} switchoff_{BB}(C)$
34
Clock Energy and Leakage Energy
Clock Energy
$clock\_energy_{BB} = WCET_{BB} \times clock\_energy_{cycle}$
Leakage Energy
$leakage\_energy_{BB} = WCET_{BB} \times leakage\_energy_{cycle}$
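The four components of a basic block's worst-case energy can be combined as in the sketch below. This is only an illustration of the formulas on these slides: the function name, the component names and every numeric value are hypothetical.

```python
SWITCH_OFF_FRACTION = 0.1  # unused units consume 10% of peak energy


def block_energy(wcet, instr_energies, uses, peak_energy,
                 clock_per_cycle, leakage_per_cycle):
    """Return the energy components of one basic block and their sum."""
    # Instruction-specific energy: sum over the instructions in the block.
    dynamic = sum(instr_energies)
    # Switch-off energy: idle cycles of each component times 10% of its peak.
    switchoff = sum(
        (wcet - uses[c]) * peak_energy[c] * SWITCH_OFF_FRACTION
        for c in peak_energy
    )
    # Clock and leakage energy scale with the block's WCET.
    clock = wcet * clock_per_cycle
    leakage = wcet * leakage_per_cycle
    return {
        "dynamic": dynamic,
        "switchoff": switchoff,
        "clock": clock,
        "leakage": leakage,
        "total": dynamic + switchoff + clock + leakage,
    }


# Hypothetical block: WCET of 10 cycles, three instructions, two components.
result = block_energy(
    wcet=10,
    instr_energies=[4.0, 6.0, 5.0],
    uses={"alu": 3, "dcache": 1},          # cycles in which each unit is used
    peak_energy={"alu": 2.0, "dcache": 5.0},
    clock_per_cycle=1.0,
    leakage_per_cycle=0.5,
)
print(result)
```

In this example the dynamic part (15) does not depend on the WCET, whereas the switch-off (5.9), clock (10) and leakage (5) parts all grow with it.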
35
Overlap among basic blocks
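[Figure: timeline (t1 to t5) of basic block BB overlapping in the pipeline with neighbouring blocks B1, B2 and B3; the interval WCET_BB is marked]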
36
Switch-off Energy
Unused units consume 10% of peak energy
Switch-off energy for a specific component (C)
$switchoff_{BB}(C) = Idle\_cycles_{BB}(C) \times energy(C) \times 0.1$
$switchoff_{BB}(C) = (WCET_{BB} - uses_{BB}(C)) \times energy(C) \times 0.1$
Switch-off energy for basic block BB
$switchoff_{BB} = \sum_{C \in components} switchoff_{BB}(C)$
37
Instruction Cache Modeling
Context-based ILP formulation used in WCET analysis [Li et al RTSS 2004]
Basic block is divided into memory blocks
A context maps each of these memory blocks to hit/miss
Estimate the worst-case energy of each context, taking into account main memory access energy
38
Modeling Branch Misprediction
[Figure: timeline (t1 to t3) with basic blocks BB', BB and BX, illustrating execution around a mispredicted branch]
39
Objective function
$Energy = \sum_{i=1}^{N} \sum_{j \to i} \sum_{\omega \in C_{i}} \left[ energy_{j \to i}(c,\omega) \times count_{j \to i}(c,\omega) + energy_{j \to i}(m,\omega) \times count_{j \to i}(m,\omega) \right]$
count(c,ω) is the number of times the basic block Bi is executed in context ω, with the path coming from Bj and the branch predicted correctly
count(m,ω) is similarly defined for the case where the branch is mispredicted
energy(c,ω) and energy(m,ω) are defined in a similar manner
The ILP problem is solved to generate values for count, using constraints similar to those of WCET analysis
40
Results
Platform: SimpleScalar toolset
Modified WCET analysis tool [Li et al RTSS 2004] to estimate worst-case energy
Energy values for processor components derived from parameterized models in Wattch
ILP problem is solved using CPLEX
41
Results
Compare estimated WCEC against the observed values for eleven benchmarks
Observed values are obtained using the Wattch power simulator
The actual inputs producing the WCEC are unknown
Manually select inputs that might produce the WCEC
42
Styles of Clock Gating
Simple: Peak power is consumed if there is even one access to a specific component
Ideal: Power consumed is proportional to the number of ports accessed
Realistic: Same as ideal, but unused components consume switch-off power
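The three styles can be compared with a small per-cycle power function for a single component. The sketch below is a simplified reading of the definitions above rather than the simulator's actual model. It assumes that an unaccessed component draws nothing under simple and ideal gating, and it reuses the 10% switch-off figure quoted earlier in the talk for the realistic style.

```python
SWITCH_OFF_FRACTION = 0.1  # assumed: 10% of peak, as quoted for switch-off energy


def component_power(style, peak_power, ports_accessed, total_ports):
    """Per-cycle power of one component under a given clock gating style."""
    if style == "simple":
        # Any access costs the full peak power (assumed zero when unused).
        return peak_power if ports_accessed > 0 else 0.0
    # Ideal and realistic scale with the fraction of ports accessed.
    scaled = peak_power * ports_accessed / total_ports
    if style == "ideal":
        return scaled
    if style == "realistic":
        # An unused component still draws switch-off power.
        if ports_accessed == 0:
            return peak_power * SWITCH_OFF_FRACTION
        return scaled
    raise ValueError("unknown clock gating style: " + style)


# Hypothetical 4-port component with a peak power of 4.0 units.
for style in ("simple", "ideal", "realistic"):
    print(style,
          component_power(style, 4.0, 1, 4),   # one port accessed
          component_power(style, 4.0, 0, 4))   # component idle
```

With one of four ports in use, simple gating charges the full 4.0 units while ideal and realistic charge 1.0. In an idle cycle only the realistic style charges anything (0.4 units).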
43
Results
Results for ideal clock gating are more accurate than for simple clock gating because of the distribution of accesses
            Ideal Clock Gating              Simple Clock Gating
Benchmark   Est(µJ)    Obs(µJ)   Ratio      Est(µJ)    Obs(µJ)   Ratio
isort        468.85     422.76    1.11       524.95     455.94    1.15
fft         9600.99    8586.49    1.12     11057.50    9185.39    1.20
fdct          89.92      83.63    1.08        99.31      88.79    1.11
ludcmp        98.75      92.77    1.06       115.39     100.32    1.15
matsum      1012.83     929.94    1.09      1227.37     994.11    1.23
minver        63.66      59.61    1.07        74.91      64.15    1.17
bsearch        2.54       2.40    1.06         3.51       3.07    1.14
des          546.41     518.22    1.05       613.16     553.74    1.10
matmult      149.70     132.08    1.13       172.39     136.93    1.26
qsort         34.90      31.16    1.12        39.50      33.84    1.17
qurt          13.98      11.91    1.17        16.36      12.97    1.26
44
Results
Results for ideal clock gating are more accurate than for realistic clock gating because of conservative WCET estimation
            Realistic Clock Gating          Ideal Clock Gating
Benchmark   Est(µJ)    Obs(µJ)   Ratio      Est(µJ)    Obs(µJ)   Ratio
isort        596.93     525.88    1.14       468.85     422.76    1.11
fft        13631.21   10260.86    1.33      9600.99    8586.49    1.12
fdct         121.65     105.57    1.15        89.92      83.63    1.08
ludcmp       139.75     119.33    1.17        98.75      92.77    1.06
matsum      1397.72    1154.31    1.21      1012.83     929.94    1.09
minver        90.95      80.80    1.13        63.66      59.61    1.07
bsearch        3.81       3.07    1.24         2.54       2.40    1.06
des          715.58     643.75    1.11       546.41     518.22    1.05
matmult      212.94     166.88    1.28       149.70     132.08    1.13
qsort         49.84      43.73    1.14        34.90      31.16    1.12
qurt          21.95      17.65    1.24        13.98      11.91    1.17
45
Conclusion
Static worst-case energy estimation technique that takes into account pipelining, instruction cache and branch prediction
Future work
  Validation using commercial processors
  Explore the possibility of providing thermal guarantees
46
Execution of an Add Instruction
IF: I-Cache Access
ID: Instruction Decode + Rename Logic
ISSUE: Wakeup + Selection logic
EX: Register File Read + Add unit access
WB: Result Bus
CM: ROB-retire + Register file Update
47
Instruction Specific Energy
Each component is accessed once
Selection logic may be accessed multiple times
Instruction Specific Energy is
$dynamic_{BB} = selection\_power_{cycle} \times wcet_{BB} + \sum_{instr \in BB} dynamic_{instr}$