![Page 1: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/1.jpg)
International Symposium on Low Power Electronics and Design
Energy-Efficient Non-Minimal Path On-chip Interconnection Network
for Heterogeneous Systems
Jieming Yin, Pingqiang Zhou, Anup Holey, Sachin S. Sapatnekar, and Antonia Zhai
University of Minnesota – Twin Cities
![Page 2: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/2.jpg)
Network-on-Chips
• Scalable
• Provides high bandwidth
• Leads to latency
• Leads to energy consumption
[Figure: grid of eight tiles, each pairing a Core with a router (R)]
![Page 3: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/3.jpg)
Heterogeneous System
[Figure: chip with four Data Parallel cores and four Super-scalar cores]
Only some routers are fully utilized
![Page 4: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/4.jpg)
DVFS for Reducing NoC Energy
Dynamic Voltage and Frequency Scaling (DVFS)
• Router energy dominates
• DVFS reduces router energy, but adds delay
• Previous work applies DVFS conservatively
We need more aggressive DVFS
![Page 5: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/5.jpg)
Limitations of Aggressive DVFS
• DVFS (Dynamic Voltage and Frequency Scaling) reduces energy
• Limitations of aggressive DVFS
- Increases latency
- Reduces throughput
• Works only for limited traffic patterns
[Figure: workload quadrant chart. Axes: Latency (Sensitive / Insensitive) and Throughput (High / Low). Labels: Contention, Our Previous Work *, This Work]
* Zhou et al., NoC Frequency Scaling with Flexible-Pipeline Routers, ISLPED-2011
![Page 6: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/6.jpg)
Flexible-Pipeline Routers
[Figure: two timing diagrams of router pipeline stages 1 2 3 4, each labeled Frequency = 0.5F, with time intervals marked T]
Flexible pipeline reduces router pipeline delay
![Page 7: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/7.jpg)
Exploiting DVFS Opportunity
[Figure: two mesh diagrams with routers shaded by High / Mid / Low utilization. (a) Minimal path routing: path 1 from Src1 to Dest1. (b) Non-minimal path routing: path 1’ from Src1 to Dest1]
![Page 8: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/8.jpg)
Exploiting DVFS Opportunity (cont.)
• Dynamic Energy: $E_{Dyn} \propto V_{dd}^2$
• Static Energy: $E_{Sta} \propto V_{dd}$
• Clock Energy: $E_{Clk} \propto Freq \cdot V_{dd}^2$

DVFS Parameters

| Router Speed | Freq (GHz) | Vdd (V) | Normalized Energy |
|---|---|---|---|
| High | 1.5 | 1.2 | 1.0 |
| Mid | 0.75 | 1.0 | 0.67 |
| Low | 0.375 | 0.8 | 0.49 |

Operating at Mid-frequency gets most benefit
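The proportionalities and the operating points on this slide can be explored with a short sketch. It only computes how each energy component scales relative to the High setting, plus the saving of each step down taken from the table's Normalized Energy column. The slide does not say how the three components are weighted to obtain that column, so the sketch does not try to reproduce it; the function and variable names are illustrative assumptions.

```python
# Operating points copied from the slide's table:
# level -> (frequency in GHz, Vdd in V, normalized energy).
OPERATING_POINTS = {
    "High": (1.5, 1.2, 1.0),
    "Mid": (0.75, 1.0, 0.67),
    "Low": (0.375, 0.8, 0.49),
}


def component_scaling(level, reference="High"):
    """Scale factor of each energy component relative to `reference`.

    Uses only the slide's proportionalities:
    E_dyn ~ Vdd^2, E_sta ~ Vdd, E_clk ~ Freq * Vdd^2.
    """
    freq, vdd, _ = OPERATING_POINTS[level]
    ref_freq, ref_vdd, _ = OPERATING_POINTS[reference]
    v_ratio = vdd / ref_vdd
    return {
        "dynamic": v_ratio ** 2,
        "static": v_ratio,
        "clock": (freq / ref_freq) * v_ratio ** 2,
    }


def marginal_savings():
    """Energy saved by each successive step down, from the table's totals."""
    energy = {level: point[2] for level, point in OPERATING_POINTS.items()}
    return {
        "High->Mid": energy["High"] - energy["Mid"],
        "Mid->Low": energy["Mid"] - energy["Low"],
    }


if __name__ == "__main__":
    for level in OPERATING_POINTS:
        print(level, component_scaling(level))
    print(marginal_savings())
```

Going by the table, the first step down (High to Mid) saves roughly twice as much normalized energy as the second (Mid to Low), which is consistent with the slide's remark about the Mid setting.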
![Page 9: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/9.jpg)
Exploiting DVFS Opportunity (cont.)
[Figure: two mesh diagrams with routers shaded by 100% / 50% / 25% frequency. (a) Minimal path routing: path 1 from Src1 to Dest1. (b) Non-minimal path routing: path 1’ from Src1 to Dest1]
1. Performance
2. Dynamic Energy
3. Static Energy
More benefit with bigger network
![Page 10: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/10.jpg)
Outline
• Introduction
• Non-minimal path selection
- Issue
- Solution
- Challenges
• Infrastructure (CPU+GPU)
• Results
• Conclusion
![Page 11: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/11.jpg)
Non-minimal Path Routing
[Figure: two mesh diagrams with routers shaded by High / Mid / Low utilization, each showing a path from Src to Dest. (a) Minimal path routing. (b) Non-minimal path routing]
![Page 12: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/12.jpg)
Too Close !
[Figure: two mesh diagrams with routers shaded by High / Mid / Low utilization, each showing a path from Src to Dest. (a) Minimal path routing. (b) Non-minimal path routing. Annotations: Performance, Static Energy, Dynamic Energy]
![Page 13: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/13.jpg)
Too Aggressive !
[Figure: non-minimal path routing from Src1 to Dest1 on a mesh with routers shaded by High / Mid / Low utilization. Annotations: Static Energy, Dynamic Energy]
![Page 14: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/14.jpg)
Dynamic Network Tuning
[Figure: flowchart in two parts.
Router: Initial State, Utilization Monitor, V/F Scaling, Busy information propagation.
Packet: Input, decision "Slack == 1", decision "Dx>=3 || Dy>=3", Y/N branches leading to "Min. path port" or "Least busy port", action "Slack = 0", Output]
How to determine Slack?
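One plausible reading of the packet-side flowchart is sketched below: a packet is steered to the least busy port only if it still has slack and is at least three hops from its destination in X or Y; otherwise it takes the minimal-path port. The branch wiring is reconstructed from the extracted labels and the "Too Close" and "Too Aggressive" slides, so treat it as an assumption. In particular, clearing the slack bit after a detour is a guess at where the "Slack = 0" box belongs. The `Packet` class, `select_output_port`, and the port names are hypothetical.

```python
from dataclasses import dataclass

MIN_DISTANCE = 3  # threshold from the flowchart condition "Dx>=3 || Dy>=3"


@dataclass
class Packet:
    slack: int  # 1 = packet can tolerate a detour, 0 = it cannot
    dx: int     # remaining hops to the destination in X (absolute value)
    dy: int     # remaining hops to the destination in Y (absolute value)


def select_output_port(packet, min_path_port, port_busy):
    """Pick an output port for `packet`.

    port_busy maps each candidate port name to a busy metric (lower = less busy).
    Deadlock and livelock avoidance are deliberately left out of this sketch.
    """
    far_enough = packet.dx >= MIN_DISTANCE or packet.dy >= MIN_DISTANCE
    if packet.slack == 1 and far_enough:
        # Assumed placement of "Slack = 0": a detour uses up the packet's
        # slack, so it is not detoured again at later routers.
        packet.slack = 0
        return min(port_busy, key=port_busy.get)
    return min_path_port


if __name__ == "__main__":
    busy = {"E": 0.9, "N": 0.2, "S": 0.5}
    p = Packet(slack=1, dx=4, dy=0)
    print(select_output_port(p, "E", busy))  # detour candidate
    print(select_output_port(p, "E", busy))  # slack already used
```

Under this reading, the distance check addresses the "Too Close" case and the single-use slack bit addresses the "Too Aggressive" case.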
![Page 15: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/15.jpg)
Busy Information Propagation
• Busy Metrics
- Buffer utilization
- Crossbar utilization
- Router utilization
• Propagation
- Regional congestion awareness [Grot et al. HPCA08]
![Page 16: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/16.jpg)
Regional Congestion Awareness
• Local data collection
• Propagation to neighboring routers
• Aggregation of local & non-local data
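The three steps above can be illustrated with a minimal sketch along a single row of routers. Each router combines its own busy metric with the value received from its neighbor and advertises the result onward. The equal weighting of local and non-local data, the right-to-left direction, and the function names are assumptions made for illustration; the slides do not specify them.

```python
def aggregate(local, non_local, local_weight=0.5):
    """Combine a router's own busy metric with the value it received."""
    return local_weight * local + (1.0 - local_weight) * non_local


def advertised_values(local_busy, local_weight=0.5):
    """Values each router in a row advertises to its left-hand neighbor.

    local_busy[i] is the locally collected busy metric of router i.
    Information flows from the rightmost router toward index 0, so each
    advertised value blends local data with data from routers further right.
    """
    advertised = [0.0] * len(local_busy)
    incoming = None  # the rightmost router receives nothing
    for i in range(len(local_busy) - 1, -1, -1):
        if incoming is None:
            advertised[i] = local_busy[i]
        else:
            advertised[i] = aggregate(local_busy[i], incoming, local_weight)
        incoming = advertised[i]
    return advertised


if __name__ == "__main__":
    # A hot spot at the far right is still visible, attenuated, two hops away.
    print(advertised_values([0.0, 0.0, 1.0]))
```

With equal weights, the influence of a distant router halves with every hop, so nearby congestion dominates the routing decision while remote congestion still registers.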
![Page 17: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/17.jpg)
Slack in Applications
Slack of a packet: the number of cycles the packet can be delayed without affecting the overall execution time
[Figure: thread scheduling timeline (Thread 0, Thread 1, Thread 2, ..., Thread n, Thread 0) with events: Thread 0 read miss, Thread 0 ready, Thread 0 schedule]
• CPU: Not necessarily, but assume NO slack
• GPU: Based on # of threads
![Page 18: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/18.jpg)
Tile-Based Multicore System
[Figure: tile-based chip layout. Each tile holds a CPU Core (C), GPU SM (G), L2 Cache (L2), or MC (M) together with a router (R); four MEM blocks sit at the chip edges]
![Page 19: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/19.jpg)
Benchmark
• Benchmarks
- CPU: afi, ammp, art, equake, kmeans, scalparc
- GPU: blackscholes, lps, lib, nn, bfs
• Evaluate ALL 30 CPU+GPU combinations
• For presentation purposes, classify
- CPU (based on L1 cache miss rate): 1) Memory-bound 2) Computation-bound
- GPU (based on slack cycles): 1) Latency-tolerant 2) Latency-intolerant
![Page 20: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/20.jpg)
Benchmark Categorization
[Figure: workload quadrant chart. Axes: Latency (Sensitive / Insensitive) and Throughput (High / Low)]
(I) memory-bound CPU + latency-tolerant GPU
(II) computation-bound CPU + latency-tolerant GPU
(III) memory-bound CPU + latency-intolerant GPU
(IV) computation-bound CPU + latency-intolerant GPU
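As a quick check of the workload space, the sketch below enumerates the CPU+GPU pairings from the benchmark lists (6 CPU x 5 GPU = 30) and maps a pair of classifications to categories (I) through (IV). Which benchmark falls into which class is not stated on the slides, so the classifier takes the two properties as inputs instead of guessing them; the function names are illustrative.

```python
from itertools import product

# Benchmark names as listed on the Benchmark slide.
CPU_BENCHMARKS = ["afi", "ammp", "art", "equake", "kmeans", "scalparc"]
GPU_BENCHMARKS = ["blackscholes", "lps", "lib", "nn", "bfs"]


def workload_mixes():
    """All CPU+GPU benchmark pairings."""
    return list(product(CPU_BENCHMARKS, GPU_BENCHMARKS))


def category(cpu_memory_bound, gpu_latency_tolerant):
    """Map the two classifications to the slide's category label."""
    if gpu_latency_tolerant:
        return "I" if cpu_memory_bound else "II"
    return "III" if cpu_memory_bound else "IV"


if __name__ == "__main__":
    print(len(workload_mixes()))
    print(category(cpu_memory_bound=True, gpu_latency_tolerant=False))
```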
![Page 21: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/21.jpg)
Network Energy Saving
[Figure: bar chart of Network Energy (y-axis from 0.6 to 1) for Category I to Category IV. Series: Baseline, DVFS, DVFS+Non-min]
(I) memory-bound CPU + latency-tolerant GPU
(II) computation-bound CPU + latency-tolerant GPU
(III) memory-bound CPU + latency-intolerant GPU
(IV) computation-bound CPU + latency-intolerant GPU
Energy saving is significant on certain workloads
![Page 22: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/22.jpg)
Performance Impact (CPU)
[Figure: bar chart of Normalized IPC (y-axis from 0.6 to 1) for Category I to Category IV. Series: Baseline, DVFS, DVFS+Non-min]
(I) memory-bound CPU + latency-tolerant GPU
(II) computation-bound CPU + latency-tolerant GPU
(III) memory-bound CPU + latency-intolerant GPU
(IV) computation-bound CPU + latency-intolerant GPU
[Figure: bar chart of Normalized IPC (y-axis from 0.9 to 1) for equake+LPS, art+NN, ammp+LIB. Series: Baseline, DVFS, DVFS+Non-min]
![Page 23: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/23.jpg)
Performance Impact (GPU)
[Figure: bar chart of Normalized IPC (y-axis from 0.6 to 1) for Category I to Category IV. Series: Baseline, DVFS, DVFS+Non-min]
(I) memory-bound CPU + latency-tolerant GPU
(II) computation-bound CPU + latency-tolerant GPU
(III) memory-bound CPU + latency-intolerant GPU
(IV) computation-bound CPU + latency-intolerant GPU
Performance penalty is minimal compared to DVFS
![Page 24: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/24.jpg)
Conclusion
Non-minimal Path NoC
+ Balance on-chip workloads
+ Reduce NoC energy
Workload Mix
• High throughput
• Latency insensitive
[Figure: workload quadrant chart. Axes: Latency (Sensitive / Insensitive) and Throughput (High / Low)]
Given the diverse traffic patterns in heterogeneous systems, non-minimal routing should be deployed judiciously
![Page 25: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/25.jpg)
Thank You!
![Page 26: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/26.jpg)
Exploiting Slack in GPU
[Figure: System SpeedUp (y-axis from 0 to 1.2) versus Delay of Scheduling in cycles (0, 5, 10, 15, 20, 25, 50, 100). Series: BlackScholes, LPS, LIB, NN, BFS, RAY, MUM]
![Page 27: International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems](https://reader038.vdocuments.mx/reader038/viewer/2022110319/56649c785503460f9492dce6/html5/thumbnails/27.jpg)
Exploiting Slack in GPU
Predict slack based on # of available warps
[Figure: Avail Warps (y-axis from 0 to 25) versus Tolerable Delay Cycles (x-axis from 0 to 30). Series: BlackScholes, LPS, LIB, NN, BFS, RAY, MUM]
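A minimal sketch of warp-based slack prediction follows, assuming the binary slack flag from the Dynamic Network Tuning flowchart. CPU packets get no slack, as the Slack in Applications slide assumes; GPU packets get slack when enough other warps are available to hide a delayed reply. The threshold value and the function names are hypothetical; the slides give no concrete cutoff.

```python
WARP_THRESHOLD = 8  # hypothetical cutoff, not taken from the slides


def predict_gpu_slack(available_warps, threshold=WARP_THRESHOLD):
    """Return 1 if enough warps are available to tolerate a delay, else 0."""
    return 1 if available_warps >= threshold else 0


def packet_slack(source_kind, available_warps=0):
    """Slack bit for a packet injected by a "cpu" or "gpu" tile."""
    if source_kind == "cpu":
        return 0  # the slides assume CPU packets have no slack
    if source_kind == "gpu":
        return predict_gpu_slack(available_warps)
    raise ValueError(f"unknown source kind: {source_kind!r}")


if __name__ == "__main__":
    print(packet_slack("cpu"))
    print(packet_slack("gpu", available_warps=12))
    print(packet_slack("gpu", available_warps=2))
```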