TRANSCRIPT
1
PRiME: POWER-EFFICIENT, RELIABLE, MANY-CORE EMBEDDED SYSTEMS
Dr Geoff Merrett
International Symposium on Many-Core Computing: Hardware and Software
18 January 2018 | Southampton, UK
prime-project.org
2
THE PRiME PROJECT
“Enable the sustainability of many-core scaling by preventing the uncontrolled increase in energy consumption and unreliability through a step change in holistic design methods and cross-layer system optimisation.”
www.prime-project.org
3
THE PRiME PROJECT
4
THE PRiME PROJECT
www.prime-project.org
Runtime Management: a PRiME Perspective
Power/Energy, Reliability
6
PRiME’S APPROACH TO RTM
Maeda-Nunez, Luis Alfonso, Anup K. Das, Rishad A. Shafik, Geoff V. Merrett, and Bashir Al-Hashimi. "PoGo: an application-specific adaptive energy minimisation approach for embedded systems." HiPEAC Workshop on Energy Efficiency with Heterogeneous Computing, 2015.
[Figure: comparison of the ondemand Linux Governor and the PRiME Q-Learning RTM]
7
REINFORCEMENT LEARNING RTM
Application  Data set  Average temperature (Celsius)    Peak temperature (Celsius)
                       Linux   Ge et al.   Proposed     Linux   Ge et al.   Proposed
tachyon      set1      69.2    52.6        38.6         71.5    63          60
tachyon      set2      50.5    44.5        43.8         57.3    56.3        52
tachyon      set3      50.8    44.7        41.6         57.8    54.5        48.8
mpeg2_dec    clip1     36      34          34.2         42.7    41.3        39
mpeg2_dec    clip2     35.6    34.4        34.2         42.3    42          39.3
mpeg2_dec    clip3     34.3    34.4        34           43      39.7        44.3
Average MTTF improvements: 5x (thermal aging); 4x (thermal cycling)
Das, Anup, Al-Hashimi, Bashir and Merrett, Geoff (2015) Adaptive and hierarchical run-time manager for energy-aware thermal management of embedded systems. ACM Transactions on Embedded Computing Systems, 1-25.
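A model-free RTM of this kind can be sketched as tabular Q-learning over frequency levels. This is only an illustration of the general technique: the frequency list, the three workload buckets, the reward function and the random workload model are all assumptions made for the example, and none of it is taken from the PRiME implementation or the cited papers.

```python
import random

FREQS_MHZ = [600, 1000, 1400, 1800]    # assumed V-F levels (hypothetical)
N_STATES = 3                           # assumed workload buckets: low, medium, high
ALPHA, GAMMA, EPSILON = 0.2, 0.5, 0.1  # learning rate, discount, exploration rate


def reward(freq_mhz, demand_mhz):
    """Toy reward: penalise a missed requirement heavily, otherwise prefer low frequency."""
    if freq_mhz < demand_mhz:
        return -10.0                   # performance requirement missed
    return -freq_mhz / 1000.0          # crude stand-in for energy cost


def train(episodes=5000, seed=0):
    """Learn a Q-table mapping workload state to frequency level."""
    rng = random.Random(seed)
    demand = [500, 1200, 1700]         # assumed demand (MHz) for each workload state
    q = [[0.0] * len(FREQS_MHZ) for _ in range(N_STATES)]
    state = rng.randrange(N_STATES)
    for _ in range(episodes):
        # Epsilon-greedy action selection.
        if rng.random() < EPSILON:
            action = rng.randrange(len(FREQS_MHZ))
        else:
            action = max(range(len(FREQS_MHZ)), key=lambda a: q[state][a])
        r = reward(FREQS_MHZ[action], demand[state])
        next_state = rng.randrange(N_STATES)   # workload changes at random
        # Standard Q-learning update.
        q[state][action] += ALPHA * (r + GAMMA * max(q[next_state]) - q[state][action])
        state = next_state
    return q


def policy(q):
    """Greedy frequency choice for each workload state."""
    return [FREQS_MHZ[max(range(len(FREQS_MHZ)), key=lambda a: row[a])] for row in q]


learned = policy(train())
```

The learned policy should avoid frequencies that miss the assumed demand; a real RTM would derive state and reward from measured workload, power and temperature instead.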
8
MODEL-BASED RTM
Stereo Matching Application: http://github.com/PRiME-project/PRiMEStereoMatch
[Figure: stereo matching pipeline. Inputs: Left Image, Right Image. Stages: Grayscale & Gradient, Cost Volume Construction, Cost Volume Filtering, Disparity Selection, Post Processing. Output: Depth Map.]
• Processes still images, video or a camera feed
• OpenCL supported
• Includes test datasets
Leech, Charles, Vala, Charan Kumar, Acharyya, Amit, Yang, Sheng, Merrett, Geoffrey and Al-Hashimi, Bashir (2017) Run-time performance and power optimization of parallel disparity estimation on many-core platforms. ACM Transactions on Embedded Computing Systems.
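To make the cost volume and disparity selection stages concrete, here is a minimal sketch on a single scanline. It assumes an absolute-difference matching cost and winner-take-all selection, with no filtering or post-processing; the function names and the toy data are hypothetical and this is not the PRiMEStereoMatch code.

```python
def cost_volume(left, right, max_disp):
    """cost[d][x] = |left[x] - right[x - d]| for one scanline (absolute-difference cost)."""
    big = float("inf")                 # cost for positions that fall outside the image
    return [
        [abs(left[x] - right[x - d]) if x - d >= 0 else big for x in range(len(left))]
        for d in range(max_disp + 1)
    ]


def select_disparity(cost):
    """Winner-take-all: for each pixel pick the disparity with the lowest cost."""
    width = len(cost[0])
    return [min(range(len(cost)), key=lambda d: cost[d][x]) for x in range(width)]


# Toy scanline: the right view is the left view shifted by 2 pixels.
left = [10, 20, 30, 40, 50, 60, 70, 80]
right = left[2:] + [0, 0]
disparity = select_disparity(cost_volume(left, right, max_disp=3))
```

Each disparity level is independent of the others, which is what makes the cost volume stages natural candidates for parallel (for example OpenCL) execution.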
9
MODEL-BASED RTM: Model Building
10
MODEL-BASED RTM: Runtime Management
11
MODEL-BASED RTM: HETEROGENEITY
Heterogeneous Platforms
Run-time changes in:
• Performance requirements
• Application workload changes
[Figure: run-time management on a heterogeneous platform. Application tasks (Decode, Filter, Display) under Workload-A and Workload-B; hardware (CPU, DSP, FPGA); runtime functions: Modelling, Mapping, DVFS.]
Yang, Sheng, Shafik, Rishad Ahmed, Merrett, Geoff V., Stott, Edward, Levine, Joshua, Davis, James and Al-Hashimi, Bashir (2015) Adaptive energy minimization of embedded heterogeneous system using regression-based learning. PATMOS 2015, Salvador, BR, 01 - 04 Sep 2015. 8pp.
12
EXECUTING MULTIPLE APPLICATIONS
• Workload and performance variation due to:
– Changes within an application
– Changing applications (sequential execution)
• RTM: Change detection and Learning transfer
• Overlapping applications? (concurrent execution)
Shafik, Rishad, Das, Anup, Maeda-Nunez, Luis, Yang, Sheng, Merrett, Geoff and Al-Hashimi, Bashir (2015) Learning transfer-based adaptive energy minimization in embedded systems. IEEE TCAD.
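As an illustration of what workload change detection could look like, the sketch below flags a change when a sample deviates strongly from a running average. The EWMA detector, its parameters and the sample values are assumptions chosen for the example; the cited paper's detection and learning-transfer method is not reproduced here.

```python
def detect_changes(samples, alpha=0.3, threshold=0.5):
    """Return indices where a sample deviates from the running EWMA by more than
    `threshold` (as a fraction of the EWMA). The average restarts at each change."""
    changes = []
    ewma = samples[0]
    for i, value in enumerate(samples[1:], start=1):
        if abs(value - ewma) > threshold * ewma:
            changes.append(i)
            ewma = value               # restart the average for the new workload phase
        else:
            ewma = alpha * value + (1 - alpha) * ewma
    return changes


# Hypothetical per-frame workload (e.g. millions of cycles) with two phase changes.
cycles = [100, 105, 98, 102, 300, 310, 295, 305, 120, 115]
change_points = detect_changes(cycles)
```

On a detected change, an RTM could either restart learning or reuse what it learned for a similar earlier phase, which is the idea behind learning transfer.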
13
RTM FOR CONCURRENT EXECUTION
MRPI (Memory Reads Per Instruction)
• Supports concurrent execution of applications
• Inter-cluster Thread-to-core Mapping (ITM)
• MRPI informs DVFS control
Reddy, Basireddy Karunakar, Singh, Amit, Biswas, Dwaipayan, Merrett, Geoff and Al-Hashimi, Bashir (2017) Inter-cluster thread-to-core mapping and DVFS on heterogeneous multi-cores. IEEE Transactions on Multiscale Computing Systems, pp. 1-14.
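A minimal sketch of how MRPI might inform DVFS: compute MRPI from two counters and map it to a frequency level. The usual rationale is that memory-bound phases (high MRPI) gain little from a fast clock. The frequency list and MRPI thresholds are hypothetical values, not those of the cited work.

```python
def mrpi(memory_reads, instructions):
    """Memory Reads Per Instruction for one sampling interval."""
    return memory_reads / instructions if instructions else 0.0


def choose_frequency(mrpi_value, freqs_mhz, thresholds):
    """Map MRPI to a frequency: the more (ascending) thresholds are exceeded,
    the lower the chosen frequency."""
    level = sum(mrpi_value > t for t in thresholds)    # number of bounds exceeded
    ordered = sorted(freqs_mhz, reverse=True)          # fastest first
    return ordered[min(level, len(ordered) - 1)]


FREQS_MHZ = [800, 1200, 1600, 2000]    # hypothetical cluster frequencies
THRESHOLDS = [0.01, 0.03, 0.06]        # hypothetical MRPI bounds

compute_bound = choose_frequency(mrpi(5_000, 1_000_000), FREQS_MHZ, THRESHOLDS)
mixed = choose_frequency(mrpi(40_000, 1_000_000), FREQS_MHZ, THRESHOLDS)
memory_bound = choose_frequency(mrpi(100_000, 1_000_000), FREQS_MHZ, THRESHOLDS)
```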
14
MANAGING TEMPERATURE
MRPI (Memory Reads Per Instruction)
• Aims to avoid frequency throttling at temperature threshold
• Predicts temperature using a regression-based model
• Can achieve a 10% improvement in energy and performance
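[Figure: temperature-aware run-time manager. App(s) profiling (temp., freq., power) builds regression models; online, the run-time manager chooses the best regression model, uses it as a temperature predictor, and applies the mapping and DVFS setting. Plots: core frequency without prediction; core frequency with prediction.]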
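The prediction step can be sketched with a single linear regression of temperature on frequency, fitted from profiling data and then used to pick the highest frequency predicted to stay under the throttling threshold. The profiling numbers, the 72 degree limit and the one-variable model are assumptions for illustration; the slide describes choosing among several regression models, which this sketch does not do.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def pick_frequency(freqs_mhz, model, limit_c):
    """Highest frequency whose predicted temperature stays below the limit;
    falls back to the lowest frequency if none qualifies."""
    a, b = model
    safe = [f for f in freqs_mhz if a + b * f < limit_c]
    return max(safe) if safe else min(freqs_mhz)


# Hypothetical profiling data: steady-state temperature (deg C) per frequency (MHz).
profile_freqs = [600, 1000, 1400, 1800]
profile_temps = [45.0, 55.0, 65.0, 75.0]
model = fit_line(profile_freqs, profile_temps)
chosen = pick_frequency([600, 800, 1000, 1200, 1400, 1600, 1800, 2000], model, limit_c=72.0)
```

Choosing the frequency ahead of time keeps the core just under the threshold, rather than letting it overshoot and be throttled back.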
15
TUNING DPM/RTM PARAMETERS
• Tune governor parameters for the executing (interactive) workload
• Account for variability in access times and user input
• Prediction/detection dependent
• Energy saving/QoE improvement compared to ‘default’, e.g.
– 13% energy saving
– 27% QoE improvement
– 9% energy saving + 15% QoE improvement
Exynos-5422 A15/A7, Android 6.0
Google Chrome browser workloads
Touch input emulation
Network throttling (UL, DL, RTT latency)
Bantock, James Robert Benjamin, Tenentes, Vasileios, Al-Hashimi, Bashir and Merrett, Geoffrey (2017) Online tuning of Dynamic Power Management for efficient execution of interactive workloads. International Symposium on Low Power Electronics and Design. IEEE. 6 pp.
16
MEASURING/MODELLING POWER
www.powmon.ecs.soton.ac.uk
Why Power Estimation?
• Few platforms allow hardware measurement of power consumption
• RTMs need to make decisions based on real-time ‘measurements’
• Offline DPM strategy evaluation/design-space exploration
PMCs (Performance Monitoring Counters)
• On several platforms, low overhead, many events available
• …but only a small number (e.g. 4-6) can be monitored simultaneously
PowMon: A Stable, “Bottom-Up” Approach to Power Modelling
[Figure: PMC Exploration Results (Cluster and Correlation). Bar chart of each PMC event's correlation with power. x-axis: Events, labelled by name and event code (e.g. L1D_TLB_REFILL:0x05, CYCLE_COUNT:0x11, BUS_CYCLES:0x1D); y-axis: Correlation with Power, -0.1 to 0.9.]
M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
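Because only a few PMC events can be monitored at once, some selection step is needed. The sketch below ranks events by their Pearson correlation with measured power and keeps the top few. The event names are taken from the plot, but every number is made up, and ranking by raw correlation is a simplification that does not reproduce the selection procedure of the cited paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def rank_events(event_rates, power, limit):
    """Return the `limit` events most positively correlated with power."""
    scores = {name: pearson(rates, power) for name, rates in event_rates.items()}
    return sorted(scores, key=scores.get, reverse=True)[:limit]


# Hypothetical measurements over four workloads (power in watts, event rates in arbitrary units).
power = [1.0, 1.5, 2.0, 2.5]
event_rates = {
    "CYCLE_COUNT:0x11": [10, 15, 20, 25],
    "L1D_CACHE_ACCESS:0x04": [3, 5, 4, 8],
    "ISB_SPEC:0x7C": [2, 1, 2, 1],
}
selected = rank_events(event_rates, power, limit=2)
```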
17
STABLE vs UNSTABLE POWER MODELS
www.powmon.ecs.soton.ac.uk
Our stable approach achieves a low average error and narrow error distribution compared to existing techniques.
[Figure: Comparison Graph. x-axis: Power Model (a, b, c, d, e, P); y-axis: Percentage Error (%), 0 to 22; the mean error is marked for each model.]
Training: Small set of 20 workloads
Testing: Full set of 60 workloads
[a] M. Pricopi, T. S. Muthukaruppan, V. Venkataramani, T. Mitra, and S. Vishin, “Power-performance modeling on asymmetric multi-cores,” CASES ’13.
[b] M. Walker et al., “Run-time power estimation for mobile and embedded asymmetric multi-core cpus,” HIPEAC Workshop Energy Efficiency with Hetero. Comp. 2015.
[c] S. K. Rethinagiri et al., “System-level power estimation tool for embedded processor based platforms,” RAPIDO ’14. New York, 2014.
[d], [e] R. Rodrigues et al., “A study on the use of performance counters to estimate power in microprocessors,” IEEE TCAS II, vol. 60, no. 12, pp. 882–886, Dec 2013.
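The percentage error plotted in such a comparison is typically a mean absolute percentage error between measured and estimated power over the test workloads. A minimal sketch, assuming that definition and using made-up power values:

```python
def mape(measured, estimated):
    """Mean absolute percentage error (%) of estimates against measurements."""
    errors = [abs(e - m) / m for m, e in zip(measured, estimated)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical measured and model-estimated power (watts) for three test workloads.
measured_w = [2.0, 4.0, 5.0]
estimated_w = [2.2, 3.8, 5.0]
error_pct = mape(measured_w, estimated_w)
```

Evaluating on the full workload set after training on a small subset, as on the slide, shows whether a model keeps its accuracy on workloads it has not seen.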
18
IMPROVING USABILITY/EVALUATION
App-RTM interaction
• Communication of controls/monitors (e.g. register/deregister)
• Update controls/monitors (e.g. set/get)
RTM-Device interaction
• Communication of controls/monitors (e.g. register/deregister)
• Update controls/monitors
OpenCL Workload Management
• Tuneable parameters (e.g. thread level parallelism – kernels, device selection)
• Scheduling of kernels to compute devices
[Figure: PRiME RTM framework layers and API]
Application Layer: App 1, App 2, App N
  prime_app_reg()/dereg()
PRiME App Knobs: Parallelism (No. Kernels), Heterogeneity
  prime_app_knob_reg()/dereg(), prime_app_knob_get()
PRiME App Monitors: Performance (fps), Accuracy (error rate)
  prime_app_mon_reg()/dereg(), prime_app_mon_set(), prime_app_mon_weight()
Runtime Manager: Runtime Controller, Runtime Model
PRiME Dev Knobs: V-F, Core Affinity
  prime_dev_knob_reg()/dereg(), prime_dev_knob_type(), prime_dev_knob_set()
PRiME Dev Monitors: Power, Energy, Temperature
  prime_dev_mon_reg()/dereg(), prime_dev_mon_type(), prime_dev_mon_get()
Workload Scheduler: Workloads (OpenCL Kernels)
Hardware Layer: CPU, GPU, DSP, FPGA
Header files: prime_app.h, prime_dev.h
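The knob/monitor interaction can be illustrated with a small in-memory mock: the application registers a knob and a monitor, reports its monitor value, and reads back the knob value set by the RTM. The Python method names only mirror the function names on the slide (prime_app_knob_reg(), prime_app_knob_get(), prime_app_mon_reg(), prime_app_mon_set()); their signatures, the ToyRTM class and the controller rule are assumptions, since the slide does not give the real C interface.

```python
class ToyRTM:
    """In-memory stand-in for the RTM side of the app-level interface (hypothetical)."""

    def __init__(self):
        self.knobs = {}       # knob name -> {"min", "max", "val"}
        self.monitors = {}    # monitor name -> {"target", "val"}

    # Analogues of prime_app_knob_reg()/dereg()/get(); signatures are assumed.
    def app_knob_reg(self, name, lo, hi, initial):
        self.knobs[name] = {"min": lo, "max": hi, "val": initial}

    def app_knob_dereg(self, name):
        del self.knobs[name]

    def app_knob_get(self, name):
        return self.knobs[name]["val"]

    # Analogues of prime_app_mon_reg()/set(); signatures are assumed.
    def app_mon_reg(self, name, target):
        self.monitors[name] = {"target": target, "val": None}

    def app_mon_set(self, name, value):
        self.monitors[name]["val"] = value

    def step(self, knob, monitor):
        """Toy controller: raise the knob while the monitor is below target, else lower it."""
        k, m = self.knobs[knob], self.monitors[monitor]
        if m["val"] is None:
            return                      # nothing reported yet
        delta = 1 if m["val"] < m["target"] else -1
        k["val"] = max(k["min"], min(k["max"], k["val"] + delta))


rtm = ToyRTM()
rtm.app_knob_reg("kernels", lo=1, hi=8, initial=2)
rtm.app_mon_reg("fps", target=30.0)
for _ in range(3):
    kernels = rtm.app_knob_get("kernels")      # application reads its knob
    rtm.app_mon_set("fps", 8.0 * kernels)      # assumed: fps scales with kernel count
    rtm.step("kernels", "fps")                 # RTM updates the knob
```

Keeping knobs and monitors behind register/get/set calls is what lets the same application run unchanged under different RTM algorithms.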
19
THE PRiME RTM FRAMEWORK
Experimental Evaluation
Application-RTM interaction
• 3 Applications
RTM-Device interaction
• 2 Platforms
20
SUMMARY
PRiME’s Approach to RTM
• Online vs offline+online approaches
• Model-free vs model-based approaches
• Single > multiple > concurrent applications
• Homogeneous vs Heterogeneous platforms
Tools and Support: www.prime-project.org
• PowMon power estimation: www.powmon.ecs.soton.ac.uk
• PRiME RTM Framework (available soon)
• PRiMEStereoMatch application
• + more...
Any Questions?
Dr Geoff V Merrett
Associate Professor
Electronics and Computer Science
Tel: +44 (0)23 8059 2775
Email: [email protected] | www.geoffmerrett.co.uk
Highfield Campus, Southampton, SO17 1BJ UK