Row Buffer Locality Aware Caching Policies for Hybrid Memories
HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael Harding, Onur Mutlu
Executive Summary
• Different memory technologies have different strengths
• A hybrid memory system (DRAM-PCM) aims for the best of both
• Problem: How to place data between these heterogeneous memory devices?
• Observation: PCM array access latency is higher than DRAM’s
– But peripheral circuit (row buffer) access latencies are similar
• Key Idea: Use row buffer locality (RBL) as a key criterion for data placement
• Solution: Cache to DRAM rows with low RBL and high reuse
• Improves both performance and energy efficiency over state-of-the-art caching policies
Demand for Memory Capacity
1. Increasing cores and thread contexts
– Intel Sandy Bridge: 8 cores (16 threads)
– AMD Abu Dhabi: 16 cores
– IBM POWER7: 8 cores (32 threads)
– Sun T4: 8 cores (64 threads)
2. Modern data-intensive applications operate on increasingly larger datasets
– Graph, database, scientific workloads
Emerging High Density Memory
• DRAM density scaling is becoming costly
• Promising: Phase change memory (PCM)
+ Projected 3−12× denser than DRAM [Mohan HPTS’09]
+ Non-volatile data storage
• However, cannot simply replace DRAM
− Higher access latency (4−12× DRAM) [Lee+ ISCA’09]
− Higher dynamic energy (2−40× DRAM) [Lee+ ISCA’09]
− Limited write endurance (10^8 writes) [Lee+ ISCA’09]
→ Employ both DRAM and PCM
Hybrid Memory
• Benefits from both DRAM and PCM
– DRAM: Low latency, low dynamic energy, high endurance
– PCM: High capacity, low static energy (no refresh)
[Diagram: CPU with two memory controllers (MC), one attached to DRAM and one to PCM]
Hybrid Memory
• Design direction: DRAM as a cache to PCM [Qureshi+ ISCA’09]
– Need to avoid excessive data movement
– Need to efficiently utilize the DRAM cache
Hybrid Memory
• Key question: How to place data between the heterogeneous memory devices?
Outline
• Background: Hybrid Memory Systems
• Motivation: Row Buffers and Implications on Data Placement
• Mechanisms: Row Buffer Locality-Aware Caching Policies
• Evaluation and Results
• Conclusion
Hybrid Memory: A Closer Look
[Diagram: CPU connected through memory controllers (MC) and memory channels to DRAM (small capacity cache) and PCM (large capacity store); each device is divided into banks, each with a row buffer; example accesses LOAD X and LOAD X+1]
• Row (buffer) hit: Access data from row buffer → fast
• Row (buffer) miss: Access data from cell array → slow
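Row Buffers and Latency
[Diagram: a bank with its cell array and row buffer; a row address selects a row in the cell array, and the row data is read into the row buffer. Annotations: "Row buffer miss!", "Row buffer hit!"]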
Key Observation
• Row buffers exist in both DRAM and PCM
– Row hit latency similar in DRAM & PCM [Lee+ ISCA’09]
– Row miss latency small in DRAM, large in PCM
• Place data in DRAM which
– is likely to miss in the row buffer (low row buffer locality) → miss penalty is smaller in DRAM
AND
– is reused many times → cache only the data worth the movement cost and DRAM space
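To make the observation concrete, the following is a minimal sketch of the average access latency of a row as a function of its row buffer hit rate. The latency values come from the simulator parameters listed in the appendix (40 ns hit; 80 ns DRAM miss; 128 ns is the smaller of the two PCM miss values). The function names and the simple linear model are assumptions made for illustration, not part of the presented mechanism.

```python
# Latencies in nanoseconds, taken from the appendix's simulator parameters.
ROW_HIT_NS = 40      # row buffer hit, same for DRAM and PCM
DRAM_MISS_NS = 80    # DRAM row buffer miss
PCM_MISS_NS = 128    # PCM row buffer miss (smaller of the two listed values)


def avg_latency_ns(hit_rate: float, miss_ns: float) -> float:
    """Average access latency for a row with the given row buffer hit rate."""
    return hit_rate * ROW_HIT_NS + (1.0 - hit_rate) * miss_ns


def saving_per_access_ns(hit_rate: float) -> float:
    """Latency saved per access by serving the row from DRAM instead of PCM."""
    return avg_latency_ns(hit_rate, PCM_MISS_NS) - avg_latency_ns(hit_rate, DRAM_MISS_NS)


if __name__ == "__main__":
    for hit_rate in (0.0, 0.5, 1.0):
        print(f"hit rate {hit_rate:.1f}: saves {saving_per_access_ns(hit_rate):.1f} ns per access")
```

Under this model the saving shrinks linearly as the hit rate rises and vanishes for a row that always hits, which is why rows with low row buffer locality are the ones worth caching in DRAM.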
RBL-Awareness: An Example
Let’s say a processor accesses four rows with different row buffer localities (RBL)
• Row A, Row B: Low RBL (frequently miss in row buffer)
• Row C, Row D: High RBL (frequently hit in row buffer)
Case 1: RBL-Unaware Policy (state-of-the-art)
Case 2: RBL-Aware Policy (RBLA)
Case 1: RBL-Unaware Policy
A row buffer locality-unaware policy could place these rows in the following manner
• DRAM (High RBL): Row C, Row D
• PCM (Low RBL): Row A, Row B
Access pattern to main memory: A (oldest), B, C, C, C, A, B, D, D, D, A, B (youngest)
[Timeline: DRAM serves C, C, C, D, D, D; PCM serves A, B, A, B, A, B]
RBL-Unaware: Stall time is 6 PCM device accesses
Case 2: RBL-Aware Policy (RBLA)
A row buffer locality-aware policy would place these rows in the opposite manner
• DRAM (Low RBL): Row A, Row B → Access data at lower row buffer miss latency of DRAM
• PCM (High RBL): Row C, Row D → Access data at low row buffer hit latency of PCM
Access pattern to main memory: A (oldest), B, C, C, C, A, B, D, D, D, A, B (youngest)
[Timeline: DRAM serves A, B, A, B, A, B; PCM serves C, C, C, D, D, D; cycles are saved relative to Case 1]
RBL-Unaware: Stall time is 6 PCM device accesses
RBL-Aware: Stall time is 6 DRAM device accesses
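The two cases can be replayed with a small sketch that counts row buffer misses per device for the access pattern on the slide. It assumes a single bank with one initially empty row buffer per device and an open-row policy; the function and variable names are hypothetical.

```python
from collections import Counter

# Access pattern from the slide, oldest to youngest.
ACCESSES = list("ABCCCABDDDAB")


def count_row_misses(accesses, placement):
    """Count row buffer misses per device.

    Assumes one bank with a single open row buffer per device,
    and that every row buffer starts out empty.
    """
    open_row = {}        # device -> row currently held in its row buffer
    misses = Counter()
    for row in accesses:
        device = placement[row]
        if open_row.get(device) != row:
            misses[device] += 1      # must fetch the row from the cell array
            open_row[device] = row
    return misses


rbl_unaware = {"A": "PCM", "B": "PCM", "C": "DRAM", "D": "DRAM"}
rbl_aware = {"A": "DRAM", "B": "DRAM", "C": "PCM", "D": "PCM"}

if __name__ == "__main__":
    print("RBL-unaware:", dict(count_row_misses(ACCESSES, rbl_unaware)))
    print("RBL-aware:  ", dict(count_row_misses(ACCESSES, rbl_aware)))
```

In both placements rows A and B account for six misses; the placement only decides whether those six misses pay the PCM or the DRAM array latency, which matches the stall-time figures on the slide. The sketch also reports two misses for the first accesses to rows C and D, which the slide's stall-time count does not include.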
Our Mechanism: RBLA
1. For recently used rows in PCM:
– Count row buffer misses as indicator of row buffer locality (RBL)
2. Cache to DRAM rows with misses ≥ threshold
– Row buffer miss counts are periodically reset (only cache rows with high reuse)
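The two steps above can be sketched as a small tracker. This is an illustration under stated assumptions, not the hardware design: the class name `RBLATracker`, the use of "≥" for the threshold comparison, and a reset interval measured in PCM accesses are all choices made for the example.

```python
from collections import defaultdict


class RBLATracker:
    """Counts PCM row buffer misses per row and flags rows to cache in DRAM."""

    def __init__(self, miss_threshold: int, reset_interval: int):
        self.miss_threshold = miss_threshold
        self.reset_interval = reset_interval   # in PCM accesses (assumed unit)
        self.miss_counts = defaultdict(int)
        self.accesses_since_reset = 0

    def on_pcm_access(self, row: int, row_buffer_hit: bool) -> bool:
        """Record one PCM access; return True if the row should move to DRAM."""
        self.accesses_since_reset += 1
        if self.accesses_since_reset > self.reset_interval:
            # Periodic reset: only rows that miss often within one interval
            # (rows with high reuse) can reach the threshold.
            self.miss_counts.clear()
            self.accesses_since_reset = 1
        if row_buffer_hit:
            return False
        self.miss_counts[row] += 1
        if self.miss_counts[row] >= self.miss_threshold:
            del self.miss_counts[row]          # the row is now cached in DRAM
            return True
        return False
```

For example, with `RBLATracker(miss_threshold=2, reset_interval=100)`, a row is flagged for migration on its second row buffer miss within the interval, while row buffer hits to it never count toward the threshold.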
Our Mechanism: RBLA-Dyn
1. For recently used rows in PCM:
– Count row buffer misses as indicator of row buffer locality (RBL)
2. Cache to DRAM rows with misses ≥ threshold
– Row buffer miss counts are periodically reset (only cache rows with high reuse)
3. Dynamically adjust threshold to adapt to workload/system characteristics
– Interval-based cost-benefit analysis
Implementation: “Statistics Store”
• Goal: To keep count of row buffer misses to recently used rows in PCM
• Hardware structure in memory controller
– Operation is similar to a cache
  • Input: row address
  • Output: row buffer miss count
– 128-set 16-way statistics store (9.25KB) achieves system performance within 0.3% of an unlimited-sized statistics store
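A software model of such a structure might look like the sketch below. The 128-set, 16-way organization is from the slide; indexing by the row address modulo the number of sets and LRU replacement are assumptions made for illustration, as are the class and method names.

```python
from collections import OrderedDict


class StatisticsStore:
    """Set-associative table mapping a PCM row address to its miss count."""

    def __init__(self, num_sets: int = 128, num_ways: int = 16):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def record_miss(self, row_addr: int) -> int:
        """Increment and return the miss count for row_addr, allocating if needed."""
        entries = self.sets[row_addr % self.num_sets]   # set index (assumed mapping)
        count = entries.pop(row_addr, 0) + 1
        if len(entries) >= self.num_ways:
            entries.popitem(last=False)                 # evict the LRU entry (assumed policy)
        entries[row_addr] = count                       # most recently used goes last
        return count

    def lookup(self, row_addr: int) -> int:
        """Return the current miss count, or 0 if the row is not tracked."""
        return self.sets[row_addr % self.num_sets].get(row_addr, 0)
```

Because the store is finite, a row that is evicted loses its count and starts again from zero, so only recently used rows are tracked.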
Evaluation Methodology
• Cycle-level x86 CPU-memory simulator
– CPU: 16 out-of-order cores, 32KB private L1 per core, 512KB shared L2 per core
– Memory: 1GB DRAM (8 banks), 16GB PCM (8 banks), 4KB migration granularity
• 36 multi-programmed server, cloud workloads
– Server: TPC-C (OLTP), TPC-H (Decision Support)
– Cloud: Apache (Webserv.), H.264 (Video), TPC-C/H
• Metrics: Weighted speedup (perf.), perf./Watt (energy eff.), Maximum slowdown (fairness)
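The slides name the metrics without defining them. The sketch below uses the commonly used definitions of weighted speedup and maximum slowdown for multi-programmed workloads, which is an assumption about what the evaluation computes; the function names and the IPC values in the example are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over applications of IPC when sharing the system / IPC when running alone."""
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def maximum_slowdown(ipc_shared, ipc_alone):
    """Largest per-application slowdown: IPC alone / IPC when sharing the system."""
    return max(alone / shared for shared, alone in zip(ipc_shared, ipc_alone))


if __name__ == "__main__":
    # Hypothetical two-application workload.
    alone = [2.0, 1.0]
    shared = [1.0, 0.5]
    print("weighted speedup:", weighted_speedup(shared, alone))
    print("maximum slowdown:", maximum_slowdown(shared, alone))
```

Higher weighted speedup indicates better system performance, while lower maximum slowdown indicates better fairness.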
Comparison Points
• Conventional LRU Caching
• FREQ: Access-frequency-based caching
– Places “hot data” in cache [Jiang+ HPCA’10]
– Cache to DRAM rows with accesses ≥ threshold
– Row buffer locality-unaware
• FREQ-Dyn: Adaptive Freq.-based caching
– FREQ + our dynamic threshold adjustment
– Row buffer locality-unaware
• RBLA: Row buffer locality-aware caching
• RBLA-Dyn: Adaptive RBL-aware caching
System Performance
[Figure: Normalized Weighted Speedup for Server, Cloud, and Avg workloads; bars for FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn; annotated values: 10%, 14%, 17%]
Benefit 1: Increased row buffer locality (RBL) in PCM by moving low RBL data to DRAM
Benefit 2: Reduced memory bandwidth consumption due to stricter caching criteria
Benefit 3: Balanced memory request load between DRAM and PCM
Average Memory Latency
[Figure: Normalized Avg Memory Latency for Server, Cloud, and Avg workloads; bars for FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn; annotated values: 14%, 9%, 12%]
Memory Energy Efficiency
[Figure: Normalized Perf. per Watt for Server, Cloud, and Avg workloads; bars for FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn; annotated values: 7%, 10%, 13%]
Increased performance & reduced data movement between DRAM and PCM
Thread Fairness
[Figure: Normalized Maximum Slowdown for Server, Cloud, and Avg workloads; bars for FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn; annotated values: 7.6%, 4.8%, 6.2%]
Compared to All-PCM/DRAM
[Figure: Normalized Weighted Speedup, Max. Slowdown, and Perf. per Watt for 16GB PCM, RBLA-Dyn, and 16GB DRAM; annotated values: 31%, 29%]
Our mechanism achieves 31% better performance than all PCM, within 29% of all DRAM performance
Other Results in Paper
• RBLA-Dyn increases the fraction of PCM accesses that hit in the row buffer by 6.6 times
• RBLA-Dyn has the effect of balancing memory request load between DRAM and PCM
– PCM channel utilization increases by 60%
Summary
• Different memory technologies have different strengths
• A hybrid memory system (DRAM-PCM) aims for the best of both
• Problem: How to place data between these heterogeneous memory devices?
• Observation: PCM array access latency is higher than DRAM’s
– But peripheral circuit (row buffer) access latencies are similar
• Key Idea: Use row buffer locality (RBL) as a key criterion for data placement
• Solution: Cache to DRAM rows with low RBL and high reuse
• Improves both performance and energy efficiency over state-of-the-art caching policies
Thank you! Questions?
Appendix
Cost-Benefit Analysis (1/2)
• Each quantum, we measure the first-order costs and benefits under the current threshold
– Cost = cycles expended for data movement
– Benefit = cycles saved servicing requests in DRAM versus PCM
• $\text{Cost} = \text{Migrations} \times t_{\text{migration}}$
• $\text{Benefit} = \text{Reads}_{\text{DRAM}} \times (t_{\text{read,PCM}} - t_{\text{read,DRAM}}) + \text{Writes}_{\text{DRAM}} \times (t_{\text{write,PCM}} - t_{\text{write,DRAM}})$
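The two formulas translate directly into code. The sketch below is illustrative: the function and parameter names are hypothetical, and the numbers in the example are made-up per-quantum values, not measurements from the evaluation.

```python
def migration_cost(migrations: int, t_migration: int) -> int:
    """Cycles expended moving rows from PCM to DRAM during one quantum."""
    return migrations * t_migration


def caching_benefit(reads_dram: int, writes_dram: int,
                    t_read_pcm: int, t_read_dram: int,
                    t_write_pcm: int, t_write_dram: int) -> int:
    """Cycles saved by servicing requests in DRAM instead of PCM."""
    return (reads_dram * (t_read_pcm - t_read_dram)
            + writes_dram * (t_write_pcm - t_write_dram))


if __name__ == "__main__":
    # Hypothetical counts and latencies for one quantum.
    cost = migration_cost(migrations=10, t_migration=1000)
    benefit = caching_benefit(reads_dram=500, writes_dram=100,
                              t_read_pcm=128, t_read_dram=80,
                              t_write_pcm=368, t_write_dram=80)
    print("cost:", cost, "benefit:", benefit, "net benefit:", benefit - cost)
```

A positive net benefit means that, for this quantum, the requests served from DRAM saved more cycles than the migrations consumed.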
Cost-Benefit Analysis (2/2)
• Dynamic Threshold Adjustment Algorithm
    NetBenefit = Benefit - Cost
    if (NetBenefit < 0)
        MissThresh++
    else if (NetBenefit > PreviousNetBenefit)
        if (MissThresh was previously incremented)
            MissThresh++
        else
            MissThresh--
    else
        if (MissThresh was previously incremented)
            MissThresh--
        else
            MissThresh++
    PreviousNetBenefit = NetBenefit
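The adjustment algorithm can be written as a runnable sketch. The branch structure follows the pseudocode on the slide; the `ThresholdState` container, its initial values, and the lower bound `min_thresh` on the threshold are assumptions added so the example is self-contained.

```python
from dataclasses import dataclass


@dataclass
class ThresholdState:
    miss_thresh: int = 2                  # starting threshold (assumed)
    previous_net_benefit: float = 0.0
    previously_incremented: bool = True   # direction of the last adjustment


def adjust_threshold(state: ThresholdState, benefit: float, cost: float,
                     min_thresh: int = 1) -> int:
    """Apply one interval's adjustment and return the new miss threshold."""
    net_benefit = benefit - cost
    if net_benefit < 0:
        increment = True                              # caching costs more than it saves
    elif net_benefit > state.previous_net_benefit:
        increment = state.previously_incremented      # improving: keep direction
    else:
        increment = not state.previously_incremented  # not improving: reverse
    if increment:
        state.miss_thresh += 1
    else:
        # Lower bound is an assumption; the slide does not state one.
        state.miss_thresh = max(min_thresh, state.miss_thresh - 1)
    state.previously_incremented = increment
    state.previous_net_benefit = net_benefit
    return state.miss_thresh
```

In effect this is a hill-climbing search: the threshold keeps moving in the same direction while the net benefit improves, reverses when it stops improving, and is always raised when caching is a net loss.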
Simulator Parameters
• Core model
– 3-wide issue with 128-entry instruction window
– Private 32 KB per core L1 cache
– Shared 512 KB per core L2 cache
• Memory model
– 1 GB DRAM (1 rank), 16 GB PCM (1 rank)
– Separate memory controllers, 8 banks per device
– Row buffer hit: 40 ns
– Row buffer miss: 80 ns (DRAM); 128, 368 ns (PCM)
– Migrate data at 4 KB granularity
Row Buffer Locality
[Figure: Normalized Memory Accesses for Server, Cloud, and Avg workloads under FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn, broken down into DRAM row hit, DRAM row miss, PCM row hit, and PCM row miss]
PCM Channel Utilization
[Figure: PCM Channel Utilization for Server, Cloud, and Avg workloads; bars for FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn]
DRAM Channel Utilization
[Figure: DRAM Channel Utilization for Server, Cloud, and Avg workloads; bars for FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn]
Compared to All-PCM/DRAM
[Figure: Normalized Weighted Speedup, Max. Slowdown, and Perf. per Watt for 16GB PCM, RBLA-Dyn, and 16GB DRAM]
Memory Lifetime
[Figure: Memory Lifetime (years) for Server, Cloud, and Avg workloads; bars for 16GB PCM, FREQ-Dyn, and RBLA-Dyn]
DRAM Cache Hit Rate
[Figure: DRAM Cache Hit Rate for Server, Cloud, and Avg workloads; bars for FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn]