International Symposium on Low Power Electronics and Design
NoC Frequency Scaling with Flexible-Pipeline Routers
Pingqiang Zhou, Jieming Yin, Antonia Zhai, and Sachin S. Sapatnekar
University of Minnesota – Twin Cities
NoC dissipates substantial system energy
[Figure: Tile-Based Multicore System; labels: MEM, C, L1, L2, R. A second panel is labeled Superscalar Machine.]
RAW – 36%; Intel 80-tile – 28% [Vangal et al. 2008]
VFS and Its Limitations
• NoC is
– Potential performance bottleneck
– Source of energy consumption
Designed for diverse traffic patterns
• VFS to reduce energy
• Limitations of Aggressive VFS
– Reduces throughput
– Increases latency
Works only for limited traffic patterns
Can we make VFS work for other important traffic patterns?
[Figure: traffic patterns plotted by Latency (Sensitive / Insensitive) and Throughput (High / Low)]
Frequency Scaling
[Figure: pipeline stages 1 2 3 4, label T, shown at Frequency = F and at Frequency = 0.5F]
[Figure: Network Energy Breakdown (Static, Clock, Dynamic) for ammp, art, blackscholes, equake, fkmeans, kmeans and Avg]
Frequency scaling harms performance
Reconfigure Pipeline
[Figure: pipeline stages 1 2 3 4 at Frequency = 0.5F, shown with and without pipeline reconfiguration; label T]
Flexible pipeline can reduce router pipeline delay
Flexible Pipeline Routers
+ Reduce NoC energy
+ Negligible performance degradation
Reduce frequency without increasing router latency
[Figure: Latency (Sensitive / Insensitive) vs. Throughput (High / Low)]
Target Application
• Low throughput
• Latency sensitive
Outline
• Background/Motivation
• Router Design
• Experimental Results
• Related work
• Conclusion
Baseline Router Architecture
[Figure: router block diagram with Input ports, Input Controller (BW/RC) with buffers MC 1, VC 1 through MC n, VC 1, Route Computation, VC Allocator (VA), Switch Allocator (SA), Crossbar Switch (ST) and Output ports]
Head flit: BW/RC, VA, SA, ST
Body/tail flit: BW, SA, ST
How to reconfigure pipeline?
Pipeline Stage Delay
Stage    BW+RC    VA        SA        ST
Delay    100 τ    65.5 τ    77.7 τ    45 τ
Delay of 4-stage pipeline: Tclk = 72.1 τ
Time-borrowing
• Boost pipeline frequency
• Average out stage delays
τ: inverter delay
The router delay model is presented in [Peh et al., HPCA 2001].
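The 72.1 τ clock period is consistent with time borrowing averaging the four stage delays. A minimal sketch of that arithmetic, assuming that ideal time borrowing makes the clock period equal to the mean stage delay (the function names are illustrative and do not come from the slides):

```python
# Stage delays from the slide, in units of the inverter delay tau.
STAGE_DELAYS = {"BW+RC": 100.0, "VA": 65.5, "SA": 77.7, "ST": 45.0}


def clock_period_without_borrowing(delays):
    """Without time borrowing, the slowest stage sets the clock period."""
    return max(delays.values())


def clock_period_with_borrowing(delays):
    """Assumption: ideal time borrowing averages out the stage delays."""
    return sum(delays.values()) / len(delays)


t_max = clock_period_without_borrowing(STAGE_DELAYS)
t_avg = clock_period_with_borrowing(STAGE_DELAYS)
print(f"no borrowing: {t_max:.1f} tau, with borrowing: {t_avg:.2f} tau")
```

The mean works out to 72.05 τ, which matches the slide's 72.1 τ after rounding, while the slowest stage alone would force a 100 τ period.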
Pipeline Reconfiguration
• Flex Router: pipeline reconfiguration
4-stage pipeline, Vdd = 1.2 V: BW+RC (100 τ4), VA (65.5 τ4), SA (77.7 τ4), ST (45 τ4); Tclk = 72.1 τ4
3-stage pipeline, Vdd = 1.0 V: BW+RC (100 τ3), VA (65.5 τ3), SA+ST (113.7 τ3); Tclk = 93.1 τ3 = 102.1 τ4
2-stage pipeline, Vdd = 1.0 V: BW+RC (100 τ2), VA+SA+ST (170.2 τ2); Tclk = 135.1 τ2 = 148.7 τ4
1-stage pipeline, Vdd = 0.8 V: BW+RC+VA+SA+ST (270.2 τ1); Tclk = 270.2 τ1 = 337.7 τ4
How much hardware overhead?
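The Tclk values on this slide can be turned into a rough per-router latency comparison. The sketch below assumes that a flit spends one clock period in each pipeline stage and that τ4 (the inverter delay at the 4-stage supply voltage) is the common time unit. The helper names are hypothetical, and the slides do not state which configuration is paired with which slowdown factor.

```python
# Minimum clock period of each configuration, in units of tau4
# (inverter delay at the 4-stage supply voltage), as listed on the slide.
TCLK_TAU4 = {4: 72.1, 3: 102.1, 2: 148.7, 1: 337.7}

BASE_PERIOD = TCLK_TAU4[4]


def slowdown_factor(stages):
    """Clock period of a configuration relative to the 4-stage baseline."""
    return TCLK_TAU4[stages] / BASE_PERIOD


def router_latency(stages, period):
    """Assumption: a flit spends one clock period in every pipeline stage."""
    return stages * period


for stages, period in TCLK_TAU4.items():
    flexible = router_latency(stages, period)
    fixed = router_latency(4, period)  # same clock, pipeline left at 4 stages
    print(f"{stages}-stage: slowdown {slowdown_factor(stages):.2f}x, "
          f"latency {flexible:.1f} tau4 (fixed 4-stage: {fixed:.1f} tau4)")
```

Under these assumptions the 2-stage configuration runs at roughly half the baseline frequency, yet its per-router latency (297.4 τ4) stays close to the 288.4 τ4 of the 4-stage baseline. Keeping four stages at that slower clock would take 594.8 τ4.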
Architecture Support
[Figure: router datapath from Flits in to Flits out, with Input Controller (with buffers), Route Computation, VC Allocator (VA), Switch Allocator (SA), BW/RC and ST; pipeline registers (R) and MUXes sit between the stages]
4-stage pipeline: BW+RC | R | VA | R | SA | R | ST
3-stage pipeline: BW+RC | R | VA | R | SA | MUX | ST
2-stage pipeline: BW+RC | R | VA | MUX | SA | MUX | ST
1-stage pipeline: BW+RC | MUX | VA | MUX | SA | MUX | ST
Less than 2% overhead in router area
+ Control Logics
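One way to read the register and MUX arrangement on this slide is that each MUX can bypass the pipeline register between two neighbouring stages, which merges them into a single clock cycle. A small sketch of that reading follows; the `bypass` encoding and the function name are hypothetical, not taken from the slides.

```python
STAGES = ["BW+RC", "VA", "SA", "ST"]


def group_stages(bypass):
    """Merge neighbouring stages whose separating register is bypassed.

    bypass[i] is True when the MUX skips the register between
    STAGES[i] and STAGES[i + 1] (hypothetical encoding).
    """
    if len(bypass) != len(STAGES) - 1:
        raise ValueError("need one bypass flag per pipeline register")
    groups = [[STAGES[0]]]
    for stage, skipped in zip(STAGES[1:], bypass):
        if skipped:
            groups[-1].append(stage)   # same clock cycle as previous stage
        else:
            groups.append([stage])     # register kept: new pipeline stage
    return ["+".join(group) for group in groups]


print(group_stages((False, False, False)))  # 4-stage pipeline
print(group_stages((False, False, True)))   # 3-stage pipeline
print(group_stages((False, True, True)))    # 2-stage pipeline
print(group_stages((True, True, True)))     # 1-stage pipeline
```

With this encoding the four settings reproduce the stage groupings named in the deck: BW+RC, VA, SA, ST; then SA+ST merged; then VA+SA+ST merged; then everything in one stage.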
Outline
• Background/Motivation
• Router Design
• Experimental Results
• Related work
• Conclusion
Experimental Platform
• Simulator
– Full system simulator: GEMS
– Power module: Wattch & Orion2.0
– Infrastructure: 8 Core, 1 issue in-order
• Benchmarks
– From SPEC OMP2001, NU-Mine and PARSEC
[Figure: tile diagram; labels: MEM, C, L1, L2, R, 1.5 GHz]
Efficacy in Network Energy Saving
Base: Baseline Router
Base-2: VFS, Slowdown Factor of 2
Flex-2: VFS + Flexible-Pipeline Router
[Figure: Network Energy (Dynamic, Clock, Static) of Base, Base-2 and Flex-2 for ammp, art, blackscholes, equake, fkmeans, kmeans and Avg; annotated with 41% and 2%]
Dynamic energy decreases quadratically as voltage goes down
Clock energy reduction is significant (65%)
Changes in static energy are minimal
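The quadratic dependence can be illustrated with the supply voltages listed on the Pipeline Reconfiguration slide. This is a back-of-the-envelope sketch that assumes switched capacitance and activity stay constant, so only the voltage ratio matters. It models the dynamic component only, not the measured totals in the chart.

```python
# Supply voltages listed for the pipeline configurations on the slides.
VDD = {"4-stage": 1.2, "3-stage": 1.0, "2-stage": 1.0, "1-stage": 0.8}


def relative_dynamic_energy(vdd, vdd_ref=1.2):
    """Dynamic switching energy scales as C * Vdd^2; C and activity are
    assumed unchanged, so only the voltage ratio matters."""
    return (vdd / vdd_ref) ** 2


for name, vdd in VDD.items():
    print(f"{name}: Vdd = {vdd} V, relative dynamic energy "
          f"{relative_dynamic_energy(vdd):.2f}")
```

Dropping from 1.2 V to 1.0 V leaves about 69% of the per-operation dynamic energy, and 0.8 V leaves about 44%.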
Efficacy in Execution Time
[Figure: Latency (Sensitive / Insensitive) vs. Throughput (High / Low)]
Base: Baseline Router
Base-2: VFS
Flex-2: VFS + Flexible-Pipeline Router
[Figure: Normalized Execution Time of Base, Base-2 and Flex-2 for ammp, art, blackscholes, equake, fkmeans, kmeans and G.M.; annotated with 1.5%]
Workload        L1 data cache (misses/K instructions)    L2 cache (misses/K instructions)
ammp            13.7                                     4.4
art             40.8                                     18.1
blackscholes    8.1                                      0.9
equake          2.8                                      2.6
fkmeans         1.9                                      1.7
kmeans          2.4                                      1.9
Average system performance degradation is reduced
System-Level Evaluation
• System-level ED² Product
– Cores, caches and the interconnection networks
– E: System Energy
– D: System Delay
[Figure: System Energy vs. System Delay]
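Because the metric squares the delay term, a small slowdown has to be paid for with a proportionally larger energy saving. A minimal sketch of the ED² arithmetic follows; the sample numbers are hypothetical and are not results from the slides.

```python
def ed2(energy, delay):
    """Energy-delay-squared product; lower is better."""
    return energy * delay ** 2


def normalized_ed2(energy, delay, base_energy=1.0, base_delay=1.0):
    """ED^2 relative to a baseline configuration."""
    return ed2(energy, delay) / ed2(base_energy, base_delay)


def break_even_energy(delay_ratio):
    """Largest energy ratio that still leaves ED^2 no worse than baseline."""
    return 1.0 / delay_ratio ** 2


# Hypothetical numbers, NOT taken from the slides: 5% less system energy,
# 1.5% longer execution time.
print(normalized_ed2(0.95, 1.015))
print(break_even_energy(1.015))
```

A normalized value below 1 means the configuration improves on the baseline. `break_even_energy` gives the energy ratio at which a given slowdown stops paying off.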
[Figure: Network Energy vs. Network Delay Tradeoff]
Efficacy in System ED² Product
Base: Baseline Router
Base-2: VFS
Flex-2: VFS + Flexible-Pipeline Router
[Figure: System ED² of Base, Base-2 and Flex-2 for ammp, art, blackscholes, equake, fkmeans, kmeans and G.M.; annotated with "ED² increase"]
Frequency tuning should be based on workloads
More Aggressive VFS: Network Energy Saving
Base: Baseline Router
Flex-2: Flexible-Pipeline Router + Slowdown Factor of 2
Flex-4: Flexible-Pipeline Router + Slowdown Factor of 4
[Figure: Network Energy (Dynamic, Clock, Static) of Base, Flex-2 and Flex-4 for ammp, art, blackscholes, equake, fkmeans, kmeans and Avg; annotated with 43% and 39%]
Flexible-Pipeline Router is scalable in reducing network energy
More Aggressive VFS: Execution Time
Base: Baseline Router
Flex-2: Flexible-Pipeline Router + Slowdown Factor of 2
Flex-4: Flexible-Pipeline Router + Slowdown Factor of 4
[Figure: Normalized Execution Time of Base, Flex-2 and Flex-4 for ammp, art, blackscholes, equake, fkmeans, kmeans and G.M.]
Performance degradation is increasing
Limits of VFS: System ED² Product
Base: Baseline Router
Flex-2: Flexible-Pipeline Router + Slowdown Factor of 2
Flex-4: Flexible-Pipeline Router + Slowdown Factor of 4
[Figure: System ED² of Base, Flex-2 and Flex-4 for ammp, art, blackscholes, equake, fkmeans, kmeans and G.M.]
Diminishing returns when pushing the frequency scaling limit
Workload-dependent
Related Works
• “A case for dynamic frequency tuning in on-chip networks” [Mishra `09]
Dynamic router VFS for reducing network power consumption
– Flexible-pipeline routers enable more drastic scaling
• “A variable-pipeline on-chip router optimized to traffic pattern” [Hirata `10]
Dynamic router VFS + variable-pipeline routers
– Flexible-pipeline routers have lower hardware overhead
– Our work presents system-level evaluation
Conclusions
[Figure: Network Energy vs. Performance]
• Flexible-Pipeline Router
– Minimal hardware overhead
– Enable aggressive VFS
• System Level Implications
– Considerable energy saving
– Negligible performance degradation
Thank you!
Q & A
Router Delay Model*
• Router stage delay: $T_{stage} = t_i + h$
[Figure: baseline router block diagram with Input ports, Input Controller (BW/RC), Route Computation, VC Allocator (VA), Switch Allocator (SA), Crossbar Switch (ST) and Output ports]
Stage    ti            h
BW/RC    constant      0
VA       f(p, v)       9 τ
SA       f(p, c, v)    9 τ
ST       f(p, ω)       0
p: # of input/output ports
c: # of message classes
v: # of VCs/message class
ω: flit size in bits
ti: sequential logic latency
h: setup delay
τ: inverter delay
*This model is presented in [Peh et al., HPCA 2001].
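The merged-stage delays quoted in the deck (113.7 τ for SA+ST, 170.2 τ for VA+SA+ST, 270.2 τ for the single-stage router) can be reproduced from this model under one assumption: a merged stage keeps every ti but pays the setup delay h only once, for its last stage. That rule is an inference from the numbers, not something the slides state, and the ti values below are obtained by subtracting h from the per-stage delays given earlier.

```python
# (logic latency t_i, setup delay h) per stage, in units of tau. The t_i
# values are inferred by subtracting h from the per-stage delays on the
# "Pipeline Stage Delay" slide (100, 65.5, 77.7, 45).
STAGE_MODEL = {
    "BW+RC": (100.0, 0.0),
    "VA": (56.5, 9.0),
    "SA": (68.7, 9.0),
    "ST": (45.0, 0.0),
}


def merged_stage_delay(stages):
    """Assumption: merging stages keeps every t_i but pays the setup
    delay h only once, for the last stage in the merged group."""
    logic = sum(STAGE_MODEL[name][0] for name in stages)
    setup = STAGE_MODEL[stages[-1]][1]
    return logic + setup


for group in (["VA"], ["SA", "ST"], ["VA", "SA", "ST"],
              ["BW+RC", "VA", "SA", "ST"]):
    print("+".join(group), f"{merged_stage_delay(group):.1f} tau")
```

Each merge removes one 9 τ setup delay, which is why the combined stages are shorter than the sum of their separate delays.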
System Energy Breakdown
[Figure: System Energy split into Network and Core+Cache for Base, Base-2, Flex-2 and Flex-4 on ammp, art, blackscholes, equake, fkmeans and kmeans]