Profile-based Dynamic Voltage Scheduling with Program Checkpoints
The COPPER Team:
Ana Azevedo, Ilya Issenin, Radu Cornea, Rajesh Gupta, Nikil Dutt, Alex Nicolau, Alex Veidenbaum
2Paper 327 02CCECS, UC Irvine
The COPPER Context
Compiler-controlled Power-Performance Management
• Develop efficient architectural support and compiler techniques for power management
• continuously -- as an application runs
• targeted for high performance/VLIW machines
• Coordinated management of multiple techniques
• reduction in power with little or no loss of performance.
• Develop techniques for dynamic compilation to actively trade off performance and power consumption
• Develop a retargetable, ADL-based, power-aware system simulation capability.
Approach
• Compiler Strategies for Power Management
• Compiler-directed architectural “configuration”
–generate embedded “configuration code”
–code “adapts” to new architectural organization at runtime
• JIT vs multi-version compilation techniques
• dynamic, on-demand optimization
• Code annotation for dynamic compilation
–trade-off compilation overhead for quality of generated code
• Power-use Estimation for Compiler Control
–static analysis to select “optimal” configuration
–profile-based selection techniques
–static or dynamic prediction methods
Power/Performance “Knobs”
Memory hierarchy
Instruction issue logic & issue width for VLIW machines
Dynamic Register File Reconfiguration
Frequency and Voltage scaling
Timing Constraints
• We consider timing constraints as bounds on operation intervals
• upper and lower bounds
• (determination of optimum interval separation possible statically)
• Time constraints specified via checkpoints
• User-defined checkpoints are inserted in the source code and time constraints between checkpoints are defined.
• The problem addressed here:
• Given a profile of power availability and constraints on specified operation intervals, minimize total processor energy consumption while meeting timing and power profile constraints.
Constrained Dynamic F/V Scaling
• Power-performance profiling compiler
• Estimates max energy/cycle ratio and cycle count between checkpoints
• Compiler-inserted (frequency adjustment points) and user-inserted checkpoints (time constraints)
• Run-time scheduler
• Calculates the run-time frequency limit based on available power and the energy profile between the current checkpoint and all possible next checkpoints.
• Calculates the optimal target frequency based on both time constraints and the run-time frequency limit between the current checkpoint and all possible next checkpoints.
• The final target frequency is selected so that the code runs as slowly as possible within the imposed time constraints.
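The selection rule in the run-time scheduler bullets can be sketched as follows. This is a minimal illustration, not the COPPER scheduler itself: the type `next_chp`, the function `target_freq_mhz`, and the best-effort fallback when no level meets the deadlines are assumptions made here. The four frequency levels are the ones listed on the F/V scaling slide.

```c
#include <assert.h>
#include <stddef.h>

/* Frequency levels in MHz, ascending (the four versions listed on the
 * F/V scaling slide). */
static const double levels_mhz[] = {300.0, 400.0, 500.0, 600.0};
#define NUM_LEVELS (sizeof levels_mhz / sizeof levels_mhz[0])

/* Hypothetical description of one possible next checkpoint. */
typedef struct {
    double max_cycles;   /* profiled worst-case cycles to reach it */
    double time_left_us; /* time left before its deadline, in us   */
} next_chp;

/* Pick the slowest level that meets every deadline without exceeding
 * the power-derived limit. 1 MHz is 1 cycle/us, so cycles / us = MHz.
 * Assumption: if no level under the limit meets the deadlines, return
 * the fastest level under the limit; if even the lowest level exceeds
 * the limit, return the lowest level. */
double target_freq_mhz(const next_chp *next, size_t n, double limit_mhz)
{
    double required = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(next[i].time_left_us > 0.0);
        double f = next[i].max_cycles / next[i].time_left_us;
        if (f > required)
            required = f; /* must satisfy every possible next checkpoint */
    }

    double best = levels_mhz[0];
    for (size_t i = 0; i < NUM_LEVELS; i++) {
        if (levels_mhz[i] > limit_mhz)
            break;                 /* would exceed the available power */
        best = levels_mhz[i];
        if (levels_mhz[i] >= required)
            break;                 /* slowest level meeting deadlines */
    }
    return best;
}
```

Taking the maximum over all possible next checkpoints keeps every active deadline reachable, while stopping at the first sufficient level keeps the code running as slowly as the constraints allow.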
Program Checkpoints
Program checkpoints are generated at compile time and indicate places in the code where the processor speed/voltage should be recalculated; checkpoints also carry user-defined time constraints.
foo(){
  read(i);
  if (i > 5) {
    i = i - calc_new_i(i);
  } else {
    a++;
  }
  i = 36;
  for (j = 0; j < i; j++) {
    k = k*sin(j/100 + k/10);
  }
}
calc_new_i(int i){
  for (k = 0; k < limit; k++){
    i += new_i[k];
    show_value(i);
  }
}
(a) Original code.
Checkpoint Transition   Min Time (ms)   Max Time (ms)
1-2                     10              30
2-3                     20              20
3-3                     50              200
3-4                     200             200
(c) Checkpoint Database (CDB).
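The CDB can be held as a small table keyed by checkpoint transition. The following is a minimal sketch, not the COPPER data structure: the type `cdb_entry`, the array `cdb`, and the function `cdb_lookup` are hypothetical names introduced for illustration, and the values are those of table (c).

```c
#include <stddef.h>

/* One row of the checkpoint database (CDB): a time window, in ms, for
 * the transition from checkpoint `from` to checkpoint `to`.
 * All names here are illustrative assumptions. */
typedef struct {
    int from;
    int to;
    int min_ms;
    int max_ms;
} cdb_entry;

/* Values copied from table (c). */
static const cdb_entry cdb[] = {
    {1, 2,  10,  30},
    {2, 3,  20,  20},
    {3, 3,  50, 200},
    {3, 4, 200, 200},
};

/* Return the CDB row for the transition from -> to, or NULL if that
 * transition carries no time constraint. */
const cdb_entry *cdb_lookup(int from, int to)
{
    for (size_t i = 0; i < sizeof cdb / sizeof cdb[0]; i++) {
        if (cdb[i].from == from && cdb[i].to == to)
            return &cdb[i];
    }
    return NULL;
}
```

A scheduler reaching checkpoint 3 inside the loop would look up both 3-3 and 3-4, since either transition may come next.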
foo(){
  read(i);
  CHECKPOINT(1);
  if (i > 5) {
    i = i - calc_new_i(i);
  } else {
    a++;
  }
  i = 36;
  k = i + a;
  CHECKPOINT(2);
  for (j = 0; j < i; j++) {
    CHECKPOINT(3);
    k = k*sin(j/100 + k/10);
  }
  CHECKPOINT(4);
}
(b) Transformed foo code with checkpoints 1, 2, 3 and 4 carrying time constraints.
[Figure: timing diagram for Task 1 with Constraint 1, Constraint 2, deadline 1 and deadline 2.]
Basic Approach
• Compiling phase: Checkpoint profiling
• Estimate max energy/cycle ratio and cycle count between checkpoints
• Set time constraints
–e.g., device response time, WCET
• Scheduling phase
• At program checkpoints and power profile change points, dynamically adjust frequency and voltage
Example
[Figure: Frequency vs. Time between Checkpoint 3 and Checkpoint 4, showing the frequency limit and the optimal frequency.]
Calculating optimal frequency: the frequency limit (determined by the available power profile) is lower than the potential optimal frequency.
Exploiting Runtime Slack
CHECKPOINT(0);
read(i);
CHECKPOINT(1);
if (i > 5) {
  CHECKPOINT(2);
  i = i - calc_new_i(i);
} else {
  CHECKPOINT(3);
  a++;
}
CHECKPOINT(4);
i = 36;
k = i + a;
CHECKPOINT(5);
for (j = 0; j < i; j++) {
  CHECKPOINT(6);
  k = k*sin(j/100 + k/10);
  CHECKPOINT(7);
}
CHECKPOINT(8);
(a) Transformed code with checkpoints carrying time constraints (0, 1, 3, 8, 9 and 10) and extra checkpoints for exploiting run-time slack.
Checkpoint Transition   Max Time (ms)
0-3                     50
1-8                     300
9-10                    10
(c) Checkpoint Database (CDB).
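[Figure: hierarchical control flow graph over checkpoint nodes 0 to 10, with node 1 labeled "if", node 2 labeled "func", node 5 labeled "loop", and two "end" markers.]
(b) Hierarchical control flow graph.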
calc_new_i(i){
CHECKPOINT(9);
for (k = 0; k < limit; k++){
i += new_i[k];
show_value(i);
}
CHECKPOINT(10);
}
Slack-based Checkpointing
• Compiling phase
• Build a hierarchical CFG (HCFG) program representation
• Insert checkpoints at function calls, loops, if-statements
• Checkpoint profiling and removal
• Estimate max energy/cycle ratio and cycle count between checkpoints, maximum iteration number for loops
• Prune the HCFG removing unnecessary checkpoints
– Nodes with low maximum execution cycle count
– Nodes with small variation in the execution cycle count
• Annotate the HCFG with the profiling information
• Scheduling
• Determine active checkpoint transitions from precomputed information
• Estimate the number of cycles from the current node to the ends of active time constraints. This is the minimum of the statically computed longest path to the time constraint and the execution delay updated from the profiling information (if available)
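The pruning criteria in the compiling phase (low maximum cycle count, small variation in cycle count) can be sketched as a per-node test. This is only an illustration under assumptions: the type `node_profile`, the function `keep_checkpoint`, and both threshold parameters are hypothetical, and the slides do not state how the thresholds are chosen.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-node profiling record for an HCFG node. */
typedef struct {
    unsigned long min_cycles; /* shortest profiled execution */
    unsigned long max_cycles; /* longest profiled execution  */
} node_profile;

/* Decide whether a node keeps its checkpoint.
 * min_useful_cycles and min_rel_variation are assumed tuning knobs:
 *  - nodes whose longest run is short cannot amortize a frequency
 *    change, so their checkpoints are dropped;
 *  - nodes whose cycle count barely varies offer little run-time
 *    slack, so their checkpoints are dropped as well. */
bool keep_checkpoint(const node_profile *p,
                     unsigned long min_useful_cycles,
                     double min_rel_variation)
{
    assert(p->max_cycles >= p->min_cycles);
    if (p->max_cycles == 0 || p->max_cycles < min_useful_cycles)
        return false;
    double variation =
        (double)(p->max_cycles - p->min_cycles) / (double)p->max_cycles;
    return variation >= min_rel_variation;
}
```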
Our Approach: Slack Algorithm
• Algorithm at work
[Figure: control flow graph over checkpoints 1 to 10, with the current checkpoint (with I iterations left) marked and paths annotated with X1 cycles, X2 cycles and Y1 cycles.]
Calculating estimated cycles C
Method 1:
C(7-10) = Y1
C(7-9) = X2 + Y1 + I*cycle_per_iter
Method 2:
C(7-10) = cycle_per_iter - elapsed(6)
C(7-9) = X1 - elapsed(5)
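The two estimates for C(7-9) can be combined by taking the smaller one, as the scheduling bullet on the Slack-based Checkpointing slide describes. The sketch below is an illustration only: the struct `slack_profile` and the function names are hypothetical, X1, X2 and Y1 are used exactly as they appear in the formulas, and `elapsed_since_5` is assumed to mean the cycles executed since checkpoint 5.

```c
#include <assert.h>

/* Cycle annotations taken from the slide's figure (names assumed). */
typedef struct {
    double x1;             /* X1 cycles */
    double x2;             /* X2 cycles */
    double y1;             /* Y1 cycles */
    double cycle_per_iter; /* cycles per loop iteration */
} slack_profile;

/* Method 1: static longest path with I iterations left. */
double c_7_9_method1(const slack_profile *p, unsigned iters_left)
{
    return p->x2 + p->y1 + (double)iters_left * p->cycle_per_iter;
}

/* Method 2: profiled cycle count minus cycles already elapsed. */
double c_7_9_method2(const slack_profile *p, double elapsed_since_5)
{
    assert(elapsed_since_5 >= 0.0);
    return p->x1 - elapsed_since_5;
}

/* Use the tighter (smaller) of the two estimates. */
double c_7_9(const slack_profile *p, unsigned iters_left,
             double elapsed_since_5)
{
    double m1 = c_7_9_method1(p, iters_left);
    double m2 = c_7_9_method2(p, elapsed_since_5);
    return m1 < m2 ? m1 : m2;
}
```

A smaller cycle estimate translates into a lower required frequency for the same deadline, which is where the run-time slack is recovered.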
Time Constraints   Max Time
1-9                T1
6-10               T2
Checkpoint Database (CDB)
COPPER framework
• MIPS R10K-like processor, Wattch power models
[Figure: COPPER framework block diagram. Blocks: Compiler, Application, Code Versions, Chosen Code Version, Hardware Config, Cycle-Level Performance Simulator, Parameterizable Power Models, Power Simulator, Power Profiler, Power Scheduler. Labeled flows: Performance Estimate, Power Estimate, Cycle-by-Cycle Hardware Access Counts, Available Power, Time Constraints.]
Results
• Power consumption highlighting time constraints for paraffins (f = 600 MHz)
[Figure: Power vs. Time (microseconds) at 600 MHz, with checkpoints 4 and 7 marked along the trace.]
Results: Slack-based DVS for paraffins
• Calculated target frequencies satisfying time and power constraints using Formula 1 for paraffins
• Time constraint on checkpoint transition 4-7
[Figure, Frequency panel: Frequency (MHz) vs. Time (microseconds), showing the frequency limit and the selected frequency, with checkpoints 4 and 7 marked.]
[Figure, Power panel: Power vs. Time (microseconds), showing power consumption and the available power profile, with checkpoints 4 and 7 marked.]
52% energy savings
Results
• Calculated target frequencies satisfying time and power constraints using Formula 2 for paraffins
• Slack-based DVS for paraffins
[Figure, Frequency panel: Frequency (MHz) vs. Time (microseconds), showing the frequency limit and the selected frequency, with checkpoints 4 and 7 marked.]
[Figure, Power panel: Power vs. Time (microseconds), showing power consumption and the available power profile, with checkpoints 4 and 7 marked.]
82% energy savings
Summary
• While average power reduction is important, effective control of dynamic power consumption is essential
• especially for software management of power and performance
• The hard problem here is
• identification of effective architectural mechanisms and their deterministic control through software
• COPPER approach
• use architectural features common to a range of processor architectures
–memory hierarchy, register files, instruction issue.
• Coordinate with technology and OS strategies
–frequency and voltage scaling.
Our Approach: Base Algorithm
• Scheduling phase
• Create list of events
• Calculate frequency limit
• Calculate optimal frequency
–Case 1: One future checkpoint transition
–Case 2: Frequency limit lower than potential optimal frequency
–Case 3: Several possible future checkpoints
[Figure: Power vs. Time, showing the available power profile with Checkpoints 5, 6 and 7 marked.]
[Figure: Frequency vs. Time, showing the frequency limit with Checkpoints 5, 6 and 7 marked.]
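One way to read the "Calculate frequency limit" step is through the identity power = (energy per cycle) × (cycles per second). The sketch below applies it under a loud simplifying assumption: the profiled maximum energy/cycle ratio is treated as fixed, ignoring its dependence on supply voltage. The function name `freq_limit_mhz` and the units (watts, nJ per cycle) are choices made here, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on frequency so that worst-case power stays within the
 * available power.
 *   avail_power_w     available power, in watts
 *   max_nj_per_cycle  profiled max energy/cycle (nJ) toward each
 *                     possible next checkpoint
 * Since 1 W / 1 nJ = 1e9 cycles/s = 1000 MHz, the result in MHz is
 * 1000 * P / E. Assumption: energy per cycle does not change with the
 * chosen frequency/voltage. */
double freq_limit_mhz(double avail_power_w,
                      const double *max_nj_per_cycle, size_t n)
{
    assert(n > 0);
    double worst = max_nj_per_cycle[0];
    for (size_t i = 1; i < n; i++) {
        if (max_nj_per_cycle[i] > worst)
            worst = max_nj_per_cycle[i]; /* most power-hungry path */
    }
    assert(worst > 0.0);
    return 1000.0 * avail_power_w / worst;
}
```

Using the worst ratio over all possible next checkpoints keeps the limit safe whichever path the program takes; the limit would be recomputed at each change point of the available power profile.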
Our Approach: Base Algorithm
• Calculate optimal frequency (cont'd)
[Figure: Frequency vs. Time between Checkpoint 3 and Checkpoint 4, showing the frequency limit and the optimal frequency.]
(a) Calculating optimal frequency, Case 1. One future checkpoint transition
[Figure: Frequency vs. Time between Checkpoint 3 and Checkpoint 4, showing the frequency limit and the optimal frequency.]
(b) Calculating optimal frequency, Case 2. Frequency limit lower than potential optimal frequency
[Figure: Frequency vs. Time with Checkpoints 1, 2 and 3, showing the frequency limit, the optimal frequency ch1 - ch2, the optimal frequency ch1 - ch3, and the final frequency values.]
(c) Calculating optimal frequency, Case 3. Several possible future checkpoints
Baseline Architecture
• A MIPS R10K-like processor
• 4-wide issue, out-of-order (OOO) processor
–5-stage pipeline: fetch, dispatch, issue, writeback, commit
• 32b integers, 64b f.p. numbers
• register files: 32 integer and 32 FP registers
• 32K L1 instruction cache, 32K L1 data cache
–32B L1 line size
• 512K L2 unified cache
–64B L2 line size
• 2 int ALUs, 1 FP adder, 1 FP multiplier
• 512-entry BTB, 2K entry branch predictor
Power Management by F/V Scaling
• 4 available versions: 600 MHz at 2.2 V, 500 MHz at 2.0 V, 400 MHz at 1.8 V, 300 MHz at 1.6 V
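To see why running slower saves energy and not just power, recall that dynamic CMOS energy per cycle scales with the square of the supply voltage. The sketch below computes that ratio for the four versions listed above. It is a first-order illustration only: it ignores leakage and any component that is not voltage-scaled, and the names `fv_level`, `fv_levels` and `relative_energy_per_cycle` are hypothetical.

```c
#include <assert.h>

/* One frequency/voltage operating point. */
typedef struct {
    double mhz;
    double volts;
} fv_level;

/* The four versions listed on this slide. */
static const fv_level fv_levels[4] = {
    {600.0, 2.2}, {500.0, 2.0}, {400.0, 1.8}, {300.0, 1.6},
};

/* Dynamic energy per cycle relative to a reference level, using the
 * first-order model E_cycle proportional to V^2 (assumption: same
 * switched capacitance and activity at both levels). */
double relative_energy_per_cycle(const fv_level *lvl, const fv_level *ref)
{
    assert(ref->volts > 0.0);
    double r = lvl->volts / ref->volts;
    return r * r;
}
```

Under this model the 300 MHz, 1.6 V version spends roughly half the dynamic energy per cycle of the 600 MHz, 2.2 V version, so the same cycle count costs about half the energy when the deadline allows the slower setting.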
Related Work
• DVS Theoretical Studies and Simulations
• [Weiser94], [Govil95], [Yassura98], [Lee98], [Pering98], [Mosse00]
• Practical DVS Implementations
• Transmeta Crusoe, Intel XScale, lpARM
• Interval-based and inter-task DVS techniques under OS control
• [Weiser94], [Govil95], [Yao95], [Ishihara98], [Hong99], [Manzak00], [Sinha01], [Poulwelse01]
• Intra-task DVS techniques under compiler control
• [Shin01], [Hsu01], [Krshna00], [Lee00]