04/19/23 1
Power Management for Chip-level
Multiprocessing Processors
Kai Ma
Background
To get better performance:
1. Scale frequency (fast)
2. On-chip resource replication (parallel): Chip-MultiProcessing vs Simultaneous MultiThreading
SMT vs CMP
                SMT                               CMP
Technique       Duplicate resources on one core   Duplicate cores on one die
Target          Instruction-level parallelism     Thread-level parallelism
Implementation  Basically a redesign              Reuse a proven design
Area            Small transistor increase         Proportional to the number of cores
Other justification for CMP
Memory wall, ILP wall, power wall
Higher cache coherency circuitry rate
Signal integrity
Future: many cores (many specialized cores)
Power management for CMP
Reduce operating costs for energy and cooling
Prolong battery life for portable and embedded systems
Reduce cooling requirements
Meet scalable performance targets
Heat dissipation and hotspots
Outline
1. An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget
Canturk Isci*, Alper Buyuktosunoglu*, Chen-Yong Cher*, Pradip Bose* and Margaret Martonosi
*IBM T.J. Watson Research Center, Yorktown Heights
Department of Electrical Engineering, Princeton University
2. Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors
Radu Teodorescu and Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign
Outline
An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget
1. Contribution
2. Global Power Management
3. Global Power Management Policies: core modes, power and performance matrix
4. Experimental Result and Evaluation
5. Conclusion
6. Critique
Contribution
Introduce global power management
Develop a static power management analysis tool
Evaluate different policies for CMP power management
Global Power Management
Monitor the power of each core and set each core's working mode
Global Power Management Policies
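Priority: slow down the cores that run low-priority tasks
PullHiPushLo: speed up the low-power cores and slow down the high-power cores
MaxBIPS: predict the power and throughput of each power mode combination and choose the combination with the highest throughput (BIPS) that fits the power budget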
Core Power Modes
Underlying mechanism: DVFS
Overhead: on the order of microseconds
Performance degradation: elapsed execution time for the benchmark
Power and BIPS Matrices
Mode    Power       BIPS
Turbo   1           1*(500/507)
Eff1    1*0.95^3    1*0.95*(500/513)
Eff2    1*0.85^3    1*0.85*(500/520)
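The matrices above can be turned into a small predictor, and a MaxBIPS-style selection can then be sketched as an exhaustive search over mode combinations. This is only an illustrative sketch, not the paper's implementation: the scaling factors are copied verbatim from the slide (the 500/5xx terms are used as given, without assuming what they model), while the function names, the per-core Turbo measurements, and the budget value are hypothetical.

```python
from itertools import product

# Scaling factors relative to Turbo, copied from the slide.
MODES = {
    "Turbo": {"power": 1.0,       "bips": 1.0 * (500 / 507)},
    "Eff1":  {"power": 0.95 ** 3, "bips": 0.95 * (500 / 513)},
    "Eff2":  {"power": 0.85 ** 3, "bips": 0.85 * (500 / 520)},
}

def predict(turbo_power, turbo_bips):
    """Build per-core Power and BIPS matrices from Turbo-mode measurements."""
    power = [{m: p * s["power"] for m, s in MODES.items()} for p in turbo_power]
    bips = [{m: b * s["bips"] for m, s in MODES.items()} for b in turbo_bips]
    return power, bips

def max_bips(turbo_power, turbo_bips, budget):
    """Pick the mode combination with the highest total BIPS within the budget."""
    power, bips = predict(turbo_power, turbo_bips)
    best = None
    for combo in product(MODES, repeat=len(turbo_power)):
        total_power = sum(power[i][m] for i, m in enumerate(combo))
        if total_power > budget:
            continue  # this combination violates the power budget
        total_bips = sum(bips[i][m] for i, m in enumerate(combo))
        if best is None or total_bips > best[1]:
            best = (combo, total_bips, total_power)
    return best  # None if even the slowest modes exceed the budget

# Hypothetical Turbo-mode measurements for 4 cores (watts, BIPS).
choice = max_bips([10.0, 8.0, 12.0, 6.0], [2.0, 1.0, 1.5, 1.8], budget=30.0)
print(choice)
```

With m modes and n cores the search visits m**n combinations (3**8 = 6561 for eight cores and three modes), which is why the cost of this style of prediction grows quickly with core count.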
Experimental Methodology
SPEC CPU2000 benchmark
A trace-based CMP analysis tool is combined with IBM’s Turandot simulator
Mode switch (500ns) and statistics collection (50ns)
During a mode switch, no instructions execute but power is still consumed
Static vs Dynamic
Policy and Budget Curve
Power Saving
Power Management Result
Trends under CMP Scaling
The gap between MaxBIPS and the oracle shrinks as the number of cores increases
Increasing the number of cores has a smaller impact on MaxBIPS
CMP scaling favors static per-core management over chip-wide DVFS
Conclusion
Global management is preferred
Dynamic management is preferred
MaxBIPS is efficient
Critique
MaxBIPS: prediction cost grows superlinearly with the number of modes and cores
Power and performance estimation matrices: transition penalty
Temperature is not considered
Outline
Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors
1. Background
2. Contribution
3. Algorithm
4. System Implementation
5. Evaluation
6. Conclusion
7. Critique
Background
For CMPs, within-die process variation impacts:
Static power consumption
Maximum frequency
Contribution
Propose variation-aware algorithms for application scheduling
Complement these algorithms with variation-aware DVFS
CMP Configuration
High level frequency and DVFS policy
Algorithms
Linear Programming
A technique for optimization of a linear objective function, subject to linear equality and linear inequality constraints
c and b are known vectors, A is a known matrix, and x is the vector of variables
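For reference, the textbook canonical form of a linear program uses these symbols as follows. This is assumed to be the form the slide had in mind; the non-negativity constraint on x is part of that convention rather than something stated in the text.

```latex
\begin{aligned}
\text{maximize}\quad   & c^{\mathsf{T}} x \\
\text{subject to}\quad & A x \le b, \\
                       & x \ge 0
\end{aligned}
```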
Power Mode Selection: LinOpt
TP: average throughput
N: number of cores
i: core index, from 1 to N
a(i): constant that depends on the thread and core
v(i): core voltage
b(i) and c(i): constants introduced to approximate the power-voltage relation
Objective function:
Constraints:
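Reading the definitions above, one plausible form of the problem is: maximize TP = (1/N) * sum of a(i)*v(i), subject to the sum of (b(i)*v(i) + c(i)) staying at or below a chip power target, with each v(i) between a minimum and maximum voltage. That form, and the names p_target, v_min and v_max, are assumptions made for illustration. The sketch below also does not use a general LP solver: with a single budget constraint and box bounds, the LP reduces to a continuous knapsack, which a greedy pass solves exactly, so that simpler technique is used here.

```python
def linopt_sketch(a, b, c, p_target, v_min, v_max):
    """Maximize (1/N) * sum(a[i] * v[i]) subject to
    sum(b[i] * v[i] + c[i]) <= p_target and v_min <= v[i] <= v_max.
    Assumes a[i] >= 0 and b[i] > 0 for every core (hypothetical model)."""
    n = len(a)
    v = [v_min] * n
    # Power left over once every core runs at the minimum voltage.
    slack = p_target - sum(b[i] * v_min + c[i] for i in range(n))
    if slack < 0:
        return None  # infeasible even at the lowest voltage
    # Spend the remaining power on the cores with the best
    # throughput gain per watt first (continuous knapsack).
    for i in sorted(range(n), key=lambda k: a[k] / b[k], reverse=True):
        dv = max(0.0, min(v_max - v_min, slack / b[i]))
        v[i] += dv
        slack -= b[i] * dv
    tp = sum(a[i] * v[i] for i in range(n)) / n
    return v, tp

# Hypothetical coefficients for three cores.
voltages, tp = linopt_sketch(
    a=[3.0, 2.0, 1.0], b=[10.0, 10.0, 10.0], c=[1.0, 1.0, 1.0],
    p_target=30.0, v_min=0.6, v_max=1.0,
)
print(voltages, tp)
```

In this example the two cores with the largest a(i)/b(i) ratio are raised to the maximum voltage and the third core absorbs whatever budget is left.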
Power Mode Selection: SAnn
Use a simulated annealing algorithm to solve the power mode selection problem
SAnn searches the space of possible core voltage combinations
Compared to LinOpt: more accurate but more costly
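A simulated annealing search over discrete voltage levels might look like the following. This is a generic sketch of the technique, not the authors' SAnn: the linear throughput and power models, the linear cooling schedule, the single-core neighbor move, and all parameter names are assumptions chosen for illustration.

```python
import math
import random

def sann_sketch(a, b, c, p_target, levels, steps=5000, t0=1.0, seed=0):
    """Search per-core voltage levels to maximize sum(a[i] * v[i]) while
    keeping sum(b[i] * v[i] + c[i]) <= p_target (hypothetical models).
    `levels` is an ascending list of allowed voltages."""
    rng = random.Random(seed)
    n = len(a)

    def throughput(idx):
        return sum(a[i] * levels[idx[i]] for i in range(n))

    def power(idx):
        return sum(b[i] * levels[idx[i]] + c[i] for i in range(n))

    cur = [0] * n  # start with every core at the lowest voltage
    if power(cur) > p_target:
        return None  # no feasible assignment
    best = list(cur)
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling
        cand = list(cur)
        cand[rng.randrange(n)] = rng.randrange(len(levels))  # change one core
        if power(cand) > p_target:
            continue  # reject moves that break the power target
        delta = throughput(cand) - throughput(cur)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            cur = cand
            if throughput(cur) > throughput(best):
                best = list(cur)
    return [levels[k] for k in best], throughput(best)

result = sann_sketch([3.0, 2.0, 1.0], [10.0, 10.0, 10.0], [1.0, 1.0, 1.0],
                     30.0, [0.6, 0.7, 0.8, 0.9, 1.0])
print(result)
```

Each step evaluates one candidate, so the cost is set by the number of steps rather than by the size of the search space, at the price of no optimality guarantee.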
System Implementation
The algorithm runs on a core or on a power management unit
At each OS scheduling interval, the OS assigns threads to cores using VarF&AppIPC
Every 10ms, the LinOpt algorithm runs and sets each core to the appropriate power level
Profiling for Implementation
Evaluation Methodology
Variation: VARIUS model
Power: SESC + Wattch + HotLeakage
Temperature: HotSpot
Critical path model:
1. Calculation path delay: multiplier-like unit
2. Memory: SRAM
3. Interconnection: CACTI
4. Gate delay: alpha-power law
Workload
SPEC
Run different applications on different cores
12 billion instructions
Metrics
Total power
Average frequency of active cores
Throughput
Energy-delay-squared product (considers time-to-solution and energy consumption)
Weighted throughput: application’s IPC normalized to the application’s IPC at reference conditions
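Two of these metrics reduce to one-line formulas, sketched below. The function names are hypothetical, and summing the normalized IPCs across applications to get a chip-level weighted throughput is an assumption; the slide only defines the per-application normalization.

```python
def energy_delay_squared(energy_joules, delay_seconds):
    """ED^2: energy multiplied by the square of the time-to-solution."""
    return energy_joules * delay_seconds ** 2

def weighted_throughput(ipc, ipc_ref):
    """Sum of each application's IPC normalized to its reference-condition IPC."""
    return sum(x / r for x, r in zip(ipc, ipc_ref))

# Hypothetical values: 2 J over 3 s, and two applications running
# at half of their reference IPC.
print(energy_delay_squared(2.0, 3.0))
print(weighted_throughput([1.0, 0.5], [2.0, 1.0]))
```

Squaring the delay makes ED^2 penalize slowdowns more heavily than energy increases, which suits comparisons where performance matters more than energy alone.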
Evaluation
Power and frequency variation on one die
Uniform Frequency & No DVFS
As the number of threads increases, there are no longer less-used cores available for thread mapping
Non-Uniform Frequency & No DVFS
Different cores run at different frequencies; policies that select less-used cores may end up with lower-frequency ones.
NoUniFreq+DVFS
Throughput:
VarF&AppIPC+LinOpt is effective
Power:
throughput gains are high when power targets are low
LinOpt Granularity
Deviation between the power consumed and the power target decreases as the interval between LinOpt runs shortens
Conclusion
Within-die variation substantially impacts static power consumption and maximum frequency
Variation-aware algorithms are proposed and analyzed; LinOpt is efficient
Critique
How can thread mapping and power mode selection be decoupled?
Static and dynamic power consumption should be discussed separately
Thread mapping takes place only once; thread migration should be considered
Comparison
             MICRO                               UIUC
Objective    Manage CMP processor power          Manage CMP processor power
Core         8 homogeneous cores (POWER4-like)   20 inhomogeneous cores (Alpha 21264-like)
Policy       Global, dynamic                     Global, dynamic
Algorithm    MaxBIPS                             LinOpt
Methodology  Simulation (Turandot)               Simulation (SESC)
Benchmark    SPEC CPU2000                        SPEC
Controller   Ad hoc                              Ad hoc