a read-write aware dram scheduling for power … · a read-write aware dram scheduling for power...
TRANSCRIPT
A Read-Write Aware DRAM Scheduling for
Power Reduction in Multi-Core Systems
2014/01/23
Affiliations: * National Chiao Tung University, Taiwan
^ National Central University, Taiwan
Chih-Yen Lai*, Gung-Yu Pan*, Hsien-Kai Guo*, Jing-Yang Jou*^
Authors:
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Outline
Introduction
DRAM power management
DRAM architecture
Related Works and Our Motivation
Proposed Techniques
Experimental Results
Conclusions and Future Works
2
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
DRAM Power Management
Demand of low power systems
DRAM consumes 25%~60% of the system power [2][3][4]
DRAM power management is important
The memory wall
Limits the system performance
Makes DRAM power management harder
3
Memory
54%CPU 21%
Drives 4%
NIC 3%
HBAs 2%
Others 16%
Power consumption of FBDIMM server
*Source: A speech by Samsung’s S. Kadivar at Denali MemCon, 2009 *Source: A speech by Samsung’s K. Han, 2012
DRAM
vs.
CPU
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
DRAM Architecture
Dual Inline Memory Module
(DIMM)
Composed of multiple DRAM chips
Rank
A set of DRAM chips in a DIMM
Bank
Spreads across chips in a rank
Memory controller
Reorder Queue (RQ)
Command Queue (CQ)
Scheduling
Power mode switching4
DIMMs
Rank
DRAM chip
Rank
Bank
Memory
Controller
Core Core Core Core
Cache
Last Level Cache
SchedulerRQ
CQ1 CQ2 CQ3 CQ4
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Outline
Introduction
Related Works and Our Motivation
Related works
Motivation
Proposed Techniques
Experimental Results
Conclusions and Future Works
5
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Related Works
Hybrid main memory
PDRAM [13][14], cached DRAM [15]
Re-designing the physical structure of the DRAM
Array rearrangement [2], Mini-rank [16]
Adding hardware component to the DRAM
Intelligent refresh [17], automatic data migration [18]
Power management policies on the memory
controller
Power-down policy
Scheduler
Throttling mechanism
6
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Power Management Policies
Power-down policy
Determine when to turn off idle ranks
Granularity: minimum number of chips to be switched
Overhead: transition delays
Time-out power-down [19]
Queue-aware power-down [21]
7
ri becomes idle
DRAM clock
𝑡𝑇𝑂
ri is turned off
ri receives command
ri is turned on
𝑡𝑃𝐷𝑁 𝑡𝑃𝑈𝑃
*ri: rank i
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Power Management Policies
Scheduler
Schedules the order of request commands in the memory
controller
Performance driven scheduler
Power-aware scheduler [21]
8
W1 r1
R2 r3
W3 r3
W4 r1
W5 r2
First
Last
W1 r1
W4 r1
R2 r3
W3 r3
W5 r2
W1 r1
R2 r3
W5 r2
W4 r1
W3 r3
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Power Management Policies
Throttling mechanism
Blocks commands until throttle delay (𝑡𝑇𝐷) is reached
Ranks stay in the low power mode for longer periods
9
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Our Motivation
Reduce power with minimal hardware overhead
Criticality difference between reads and writes
Knobs
Throttling mechanism
Reorder request commands
Power mode control
We propose two techniques
Read-write aware throttling
Rank level read-write reordering
10
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Outline
Introduction
Related Works and Our Motivation
Proposed Techniques
Overview
Read-write aware throttling
Rank level read-write reordering
A complete example
Experimental results
Conclusions and Future Works
11
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Overview
Greedy power-down memory controller
12
Turn off ri
Proceed to next cycle
RQ receives commands from
last level cache
Turn on ri
CQi is
empty?
RQ pops a command to the CQ
Yes
No
Per rank
Greedy power-down memory controller
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Overview
Previous work [21]
13
Turn off ri
Proceed to next cycle
RQ receives commands from
last level cache
Turn on ri
CQi is empty?
RQ pops an allowed command to the CQ
Yes
No
Per rank
𝑡𝑇𝐷 is
reached?
Yes No
Si contains
request?
Cluster commands into command sets
S1…Sn by rank in the RQ
Allow commands in Si to be sent to CQi
Yes
No
Per rank
Previous work [21]
Greedy power-down memory controller
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Overview
Our work
14
𝑡𝑇𝐷 is
reached?
Turn off ri
Proceed to next cycle
Yes No
RQ receives commands from
last level cache
Turn on ri
CQi is empty?
Cluster commands into command sets
S1…Sn by rank in the RQ
Allow commands in Si to be sent to CQi
RQ pops an allowed command to the CQ
Yes
No
Per rank
Per rank
Rank level read-write reordering in the RQSi contains
read
request?
Yes
No
Our work
Previous work [21]
Greedy power-down memory controller
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Read-Write Aware Throttling
Based on basic throttling [21]
Blocks commands in the RQ
Clusters commands into
command sets 𝑆1, 𝑆2, …, 𝑆𝑛
Read-write aware throttling
Read requests in each
command set?
Urgent rank vs Trivial rank
Only allows commands
targeting urgent ranks to be
sent to the corresponding
CQs
15
cluster
W1 r1
R2 r3
W3 r3
W4 r1
W5 r2
W6 r1
W7 r1
R8 r1
RQFront
End
RQ
W1 r1
W4 r1
W6 r1
W7 r1
R8 r1
S1
R2 r3
W3 r3
S3
W5 r2S2
S4
r1 on
r3 on
r2 on?
r4 off
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Rank Level Read-Write Reordering
Reorder commands within a command set
Lets the read requests be processed as early as
possible
Groups commands based on target addresses
At most one read request in a command group
FIFO order for commands in a command group
FIFO order for read requests across command groups
16
Front
End
S1
W1 addr1
W4 addr2
W6 addr3
W7 addr2
R8 addr2
R9 addr3
read-write
reorder
W4 addr2
W7 addr2
R8 addr2
W6 addr3
R9 addr3
W1 addr1
S1
cmd group
cmd group
cmd group
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
A Complete Example
17
W1 r1
R2 r3
W3 r3
W4 r1
W5 r2
W6 r1
W7 r1
R8 r1
RQFront
End
cluster commands
RQ
W1 r1
W4 r1
W6 r1
W7 r1
R8 r1
S1
R2 r3
W3 r3
S3
W5 r2 S2
r1 is urgent
r3 is urgent
r2 is trivial
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
A Complete Example
18
Cmd group
W4 addr2
W6 addr3
W7 addr2
R8 addr2
W4 addr2
W7 addr2
R8 addr2
W1 addr1
Cycle k
Front
End
Proceed to next cycle
Read-write reordering
W1 addr1
Cycle k+3
r1 is off r1 is on r1 is off
Both W1 and W6 finish
Cycle
k+4
W1 addr1
W6 addr3
W6 addr3
W6 addr1
Throttle delay is reached
Commands from last throttle period are finished
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Outline
Introduction
Related Works and Our Motivation
Proposed Techniques
Experimental Results
Analysis on different techniques
Power and performance trade-off
Conclusions and Future Works
19
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Simulation Environment
Simulators: Multi2Sim + DRAMSim2
System: ARM Cortex A9 + DDR2 SDRAM
Workloads: SPEC CPU2006 (multi-program),
SPLASH-2 (multi-threaded)
Policies
Previous work [21]
Basic throttling, Power-aware scheduling, Queue-aware
power-down
Oracle policy
Maximum power reduction at zero performance degradation
Our work
Read-write aware throttling, Rank level read-write reordering
20
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Analysis on Different Techniques
At 𝑡𝑇𝐷 = 400 CPU cycles
21
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Power and Performance Trade-Off
Trade-off characteristic for SPEC CPU2006
~75% of power reduction is close to the upper bound
Short throttle delays work well for our work
22
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Power and Performance Trade-Off
Trade-off characteristic for SPLASH-2
Small difference in power reduction
SPLASH-2 benchmarks are less memory intensive
23
Benchmark
(combination)
Main memory requests
per million cycles
SPEC
CPU2006
fp1 511391.80
fp2 818090.00
fp3 31697.60
fp4 442664.80
SPLASH-2
cholesky 894.36
fft 2196.41
fmm 271.93
radix 801.45
barnes 38.60
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Outline
Introduction
Related Works and Our Motivation
Proposed Techniques
Experimental Results
Conclusions and Future Works
24
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
Conclusions and Future Works
Conclusions
We propose two techniques
Read-write aware throttling
Rank level read-write reordering
Our work improves the power reduction by 10%~15% on
average comparing to the previous work [21]
Our work achieves ~75% power reduction at 1%~3%
system performance degradation
Future works
Run-time throttle delay controller
Associate with automatic data migration, write
combining…
Utilizing deeper power mode25
A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System
26