virtualized and flexible ecc for main memory...0 4 ea: 0x0540 0 5 wr: 0x0540 ecc address translation...
TRANSCRIPT
Virtualized and Flexible ECC for Main Memory
Doe Hyun Yoon and Mattan Erez
Dept. Electrical and Computer Engineering
The University of Texas at Austin
1ASPLOS 2010
Memory Error Protection
• Applying ECC uniformly – ECC DIMMs– Simple and transparent to programmers
• Error protection level– Fixed, design-time decision
• Chipkill-correct used in high-end servers– Constrain memory module design space
• Allow only x4 DRAMs• Lower energy efficiency than x8 DRAMs
• Virtualized ECC – objectives– To provide flexible memory error protection– To relax design constraints of chipkill
2
Virtualized ECC
• Two-tiered error protection
• Tier-1 Error Code (T1EC)
– Simple error code for detection or light-weight correction
• Tier-2 Error Code (T2EC)
– Strong error correcting code
• Store T2EC within the memory namespace itself
– OS manages T2EC
• Flexible memory error protection
– Different T2EC for different data pages
– Stronger protection for more important data
3
Virtualized ECC – Example
4
Physical Memory
Data T1EC
Virtual Address space
Low
High
Virtual Page to Physical Frame mapping
Physical Frame to ECC Page mapping
T2EC for Chipkill
T2EC for DoubleChipkill
Page frame – i
Page frame – j
Page frame – k
ECC page – j
ECC page – k
Virtual page – i
Virtual page – j
Virtual page – k
Error Protection
Level
VIRTUALIZED ECC
5
Observations on Memory Errors
• Per-system error rate is still low– Most of time, we try to detect errors
finding no error
• To detect errors is a common case operation– Need a low latency, low complexity
error detection mechanism T1EC
• To correct errors is an uncommon case operation– Correction can be complex, take a long time– But, still need to manage
error correction info somewhere Virtualized T2EC
6
Uniform ECC
7
Physical Memory
Data ECC
VPN
Virtual Memory
VA offset
PFN offsetPA
Page Frame
PA
Virtualized ECC
8
Physical Memory
Data T1EC
VPNVA offset
PFN offsetPA
Scale according to T2EC size
offsetECC Address
OS managesPFN to EPNtranslation
ECC page number
T2EC
ECC Page
PA
EA
Virtual Memory
Page Frame
Data T1EC Data T1EC
LLC
DRAM Rank 0 Rank 1
ECC Address Translation Unit
T2EC for Rank 1 data
T2EC for Rank 0 data
0000
0080
0100
0180
0200
0280
0300
0380
0400
0480
0500
0580
0040
00c0
0140
01c0
0240
02c0
0340
03c0
0440
04c0
0540
05c0
PA: 0x02003
Wr: 0x02002
B0
B0
Rd: 0x00c01
A
A
B1
B2
B3
1 2 3
1 2 3
EA: 0x054040
0
Wr: 0x05405
Virtualized ECC operationRead: fetch data and T1ECDon’t need T2EC in most casesWrite: update data, T1EC, and T2ECECC Address Translation Unit: fast PA to EA translationT2ECs of consecutive data lines map to a T2EC lineT2EC lines can be partially validUpdate only valid T2EC to DRAM
Penalty with V-ECC
• Increased data miss rate
– T2EC lines in LLC reduce effective LLC size
• Increased traffic due to T2EC write-back
– One-way write-back traffic
• Not in a critical-path
10
CHIPKILL-CORRECT
11
Chipkill-correct
• Single Device-error CorrectDouble Device-error Detect
– Can tolerate a DRAM failure
– Can detect a second DRAM failure
• Chipkill requires x4 DRAMs
• x8 chipkill is impractical
– But, x8 DRAM is more energy efficient
12
Baseline x4 Chipkill• Two x4 ECC DIMMs
– 128bit data + 16bit ECC (redundancy overhead: 12.5%)– 4 check symbol error code using 4-bit symbol
• Access granularity – 64B in DDR2 (min. burst 4 x 128 bit)– 128B in DDR3 (min. burst 8 x 128 bit)
13
x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4
x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4
144-bit wide data bus
x8 Chipkill• x8 chipkill with the same access granularity
– 152-bit wide data path
• 128-bit data + 24-bit ECC
• Redundancy overhead: 18.75%
– Need a custom-designed DIMM
• Increase the system cost a lot
14
152-bit wide data bus
x8 x8 x8 x8 x8 x8 x8 x8 x8
x8 x8 x8 x8 x8 x8 x8 x8 x8
x8
x8 Chipkill /w Standard DIMMs
• Increase access granularity– 128B in DDR2 (min. burst 4 x 256 bit)– 256B in DDR3 (min. burst 8 x 256 bit)
15
x8 x8 x8 x8 x8 x8 x8 x8 x8
x8 x8 x8 x8 x8 x8 x8 x8
x8 x8 x8 x8 x8 x8 x8 x8 x8
x8 x8 x8 x8 x8 x8 x8 x8 x8
280-bit wide data bus
V-ECC for Chipkill
• Use 3 check symbol error codes– Single Symbol-error Correct and
Double Symbol-error Detect
• T1EC– 2 check symbols
– Detect up to 2 symbol error
• T2EC– 3rd check symbol
– Combined T1EC/T2EC provides Chipkill16
V-ECC: ECC x4 configuration
• Use 8-bit symbol error code
– 2 bursts out of a x4 DRAM form an 8bit-symbol
• Modern DRAMs have minimum burst of 4 or 8
• 1 x4 ECC DIMM + 1 x4 Non-ECC DIMM
• Each DRAM access in DDR2 (burst 4)
– 64B data, 4B T1EC
– 2B T2EC is virtualized within memory namespace
• 32 T2ECs per 64B cache line
17
Virtualized within memory
x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4
x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4 x4
136-bit wide data bus
Data
Data
T1EC
T2EC
144-bit wide data bus
x8 x8 x8 x8 x8 x8 x8 x8 x8
x8 x8 x8 x8 x8 x8 x8 x8 x8
V-ECC: ECC x8 configuration
• Use 8-bit symbol error code
• 2 x8 ECC DIMMs
• Each DRAM access in DDR2 (burst 4)
– 64B data, 8B T1EC
– 4B T2EC is virtualized
• 16 T2ECs per 64B cache line
18
Data
Data
T1EC
T1EC
T2EC
Virtualized within memory
Flexible Error Protection
• Single HW with V-ECC can provide– Chipkill-detect, Chipkill-correct, and
Double chipkill-correct
– Use different T2EC for different pages
• Reliability – Performance tradeoff
• Maximize performance/power efficiency with Chipkill-Detect
• Stronger protection at the cost of additional T2EC access
19
Chipkill-Detect
Chipkill-Correct
Double Chipkill-Correct
ECC x4 0B 2B 4B
ECC x8 0B 4B 8B
EVALUATION
20
Simulator/Workload
• GEMS + DRAMsim– An out-of-order SPARC V9 core– Exclusive two-level cache hierarchy– DDR2 800MHz – 12.8GB/s (128-bit wide data path)
• 1 channel 4 ranks
• Power model– WATTCH for processor power – scaled to 45nm– CACTI for cache power – cacti 45nm– Micron model for DRAM power – commodity DRAMs
• Workloads– 12 data intensive applications
from SPEC CPU 2006 and PARSEC– Microbenchmarks: STREAM and GUPS
21
0.94
0.96
0.98
1.00
1.02
1.04
1.06
1.08
1.10
bzip2 hmmer mcf libq omnet milc lbm sphinx3canneal dedup fluid freq avg
SPEC 2006 PARSEC
Baseline x4 ECC x4 ECC x8
0.94
0.96
0.98
1.00
1.02
1.04
1.06
1.08
1.10
ST
RE
AM
GU
PS
Normalized Execution Time
• Less than 1% penalty on average
• Performance penalty
– Spatial locality
– Write-back traffic
0.94
0.96
0.98
1.00
1.02
1.04
1.06
1.08
1.10
bzip2 hmmer mcf libq omnet milc lbm sphinx3canneal dedup fluid freq avg
SPEC 2006 PARSEC
Baseline x4 ECC x4 ECC x8
0.94
0.96
0.98
1.00
1.02
1.04
1.06
1.08
1.10
ST
RE
AM
GU
PS
0.94
0.96
0.98
1.00
1.02
1.04
1.06
1.08
1.10
bzip2 hmmer mcf libq omnet milc lbm sphinx3canneal dedup fluid freq avg
SPEC 2006 PARSEC
Baseline x4 ECC x4 ECC x8
0.94
0.96
0.98
1.00
1.02
1.04
1.06
1.08
1.10
ST
RE
AM
GU
PS
0.60
0.65
0.70
0.75
0.80
0.85
0.90
0.95
1.00
1.05
1.10
bzip2 hmmer mcf libq omnet milc lbm sphinx3canneal dedup fluid freq avg
SPEC 2006 PARSEC
Baseline x4 ECC x4 ECC x8
0.60
0.65
0.70
0.75
0.80
0.85
0.90
0.95
1.00
1.05
1.10
ST
RE
AM
GU
PS
System Energy Efficiency
• Energy Delay Product (EDP) gain
– ECC x4: 1.1% on average
– ECC x8: 12.0% on average
0.60
0.65
0.70
0.75
0.80
0.85
0.90
0.95
1.00
1.05
1.10
bzip2 hmmer mcf libq omnet milc lbm sphinx3canneal dedup fluid freq avg
SPEC 2006 PARSEC
Baseline x4 ECC x4 ECC x8
0.60
0.65
0.70
0.75
0.80
0.85
0.90
0.95
1.00
1.05
1.10
ST
RE
AM
GU
PS
1.23
0.60
0.65
0.70
0.75
0.80
0.85
0.90
0.95
1.00
1.05
1.10
ST
RE
AM
GU
PS
0.60
0.65
0.70
0.75
0.80
0.85
0.90
0.95
1.00
1.05
1.10
bzip2 hmmer mcf libq omnet milc lbm sphinx3canneal dedup fluid freq avg
SPEC 2006 PARSEC
Baseline x4 ECC x4 ECC x8
20%17%
10%12%
0.96
1.00
1.04
1.08
1.12
1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49
Normalized Execution Time
Flexible Error Protection
0.96
1.00
1.04
1.08
1.12
1234567
0.60
0.70
0.80
0.90
1.00
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
bzip2 hmmer mcf libq omnet milc lbm sphinx3 canneal dedup fluid freq avg
SPEC 2006 PARSEC
Normalized EDP
0.60
0.70
0.80
0.90
1.00
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
Chip
kill
Dete
ct
Chip
kill
Corr
ect
2 C
hip
kill
Corr
ect
STREAM GUPS
Chipkill-Detect
Chipkill-Correct
Double Chipkill-Correct
Conclusion
• Virtualized ECC– Two-tiered error protection, virtualized T2EC
• Improved system energy efficiency with chipkill– Reduce DRAM power consumption by 27%– Improve system EDP by 12%
• Performance penalty – 1% on average
• Error protection even for Non-ECC DIMMs– Can be used for GPU memory error protection
• Flexibility in error protection– Adaptive error protection level by user/system demand– Cost of error protection is proportional to protection level
25
Virtualized and Flexible ECC for Main Memory
Doe Hyun Yoon and Mattan Erez
Dept. Electrical and Computer Engineering
The University of Texas at Austin
26