![Page 1: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/1.jpg)
Joint Architecture and Circuit Techniques to Address Process and Voltage Variability
Gu-Yeon Wei & David Brooks
School of Engineering and Applied Sciences
Harvard University
![Page 2: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/2.jpg)
The Great Wall of Collaboration
Wei & Brooks IEEE Denver Chapter Technical Seminar (11/1/07) 2
Architect Circuit Designer
![Page 5: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/5.jpg)
Architecture & Circuits Groups
Wonyoung
Gu
Alex
David
Meeta
Hayun
Mike
Ankur
Hillery (guest from IBM)
Dongwan
Kevin
Mark
VJ
Not shown:
• Andrew
• Krishna
• Ruwan
• Ben
![Page 6: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/6.jpg)
Collaborative Projects
• SW+Arch+HW for efficient power delivery
  – Understanding Voltage Variations in CMPs Using a Distributed Power Delivery Network (DATE ’07)
  – Toward a SW Approach to Mitigate Voltage Emergencies (ISLPED ’07)
  – DeCoR: A Delayed-Commit and Rollback Mechanism for Handling Inductive Noise in Processors (HPCA ’08)
  – System-Level Analysis of Fast, Per-Core DVFS using On-Chip Switching Regulators (ASGI ’07, HPCA ’08)
• SW+Arch+HW to combat process variations
  – Mitigating the Impact of Process Variation on CPU RF and Execution Units (MICRO ’06)
  – Process Variation Tolerant 3T1D-Based Cache Architectures (ASGI ’07, MICRO ’07)
  – A Process Variation Tolerant FPU with Voltage Interpolation and Variable Latency (ISSCC ’08)
![Page 7: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/7.jpg)
Today’s Topics
• System-Level Analysis of Fast, Per-Core DVFS using On-Chip Switching Regulators
  – Wonyoung Kim, Meeta Gupta, Wei and Brooks
  – To be presented at HPCA in Feb. 2008
• Process Variation Tolerant 3T1D-Based Cache Architectures
  – Xiaoyao (Alex) Liang, Ramon Canal (UPC Barcelona), Gu-Yeon Wei and David Brooks
  – To be presented at MICRO in Dec. 2007
![Page 8: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/8.jpg)
SYSTEM-LEVEL ANALYSIS OF FAST, PER-CORE DVFS USING ON-CHIP SWITCHING REGULATORS
Seminar Part 1
![Page 9: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/9.jpg)
Voltage Variability Movie
3 cores running bzip, 1 core idle
1 core running bzip, 3 cores idle
![Page 10: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/10.jpg)
Motivating Example
• Can we move the off-chip regulator onto the processors?
• If yes, WHY?
![Page 11: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/11.jpg)
Supply Noise Comparison
[Figure: power delivery path (power supply, PCB, package, package-to-chip connection, embedded processor, with PCB, package, and processor de-cap) drawn twice: once with the power regulator off-chip and once with the power regulator on-chip. Plot annotations: resonance; BW limitation of on-chip regulator.]
![Page 12: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/12.jpg)
Fast DVFS
• Off-chip regulators limited to microsecond-scale transitions
• On-chip regulators enable nanosecond-scale voltage transitions
  – Can we leverage this fast switching?
![Page 13: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/13.jpg)
Outline
• Motivation
• Offline DVFS
• On-chip regulator design
• Simulation analysis
• Summary & future work
![Page 14: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/14.jpg)
Fast DVFS w/ On-Chip Regulators
Questions to answer:
1. Does fast DVFS offer power savings?
2. For CMPs, do we want one global supply or per-core voltage control?
3. What does an on-chip regulator cost us?
4. How can architecture help regulator design?
5. How does this all add up?
![Page 15: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/15.jpg)
DVFS Overview
• Minimize energy consumption w/ bounded performance loss
  – Exploit CPU slack from asynchronous memory events (e.g., L2 miss) to reduce frequency (F) and voltage (V)
• Offline DVFS control
  – Formulate as integer linear programming (ILP) optimization problem
  – Oracle uses memory vs. CPU boundedness to set V/F across different windowed intervals
    • 4 V/F settings assumed
  – Compare different intervals (100ns to 100μs)
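A toy version of the offline oracle described above, for illustration only. Brute-force enumeration stands in for the ILP solver, and the V/F operating points, the energy model (dynamic energy proportional to V² per CPU cycle, leakage ignored), the 5% slowdown bound, and all function names are assumptions rather than values from the talk.

```python
from itertools import product

# Hypothetical (voltage in V, frequency in GHz) operating points.
# The slides only state that 4 V/F settings are assumed.
SETTINGS = [(1.0, 1.0), (0.9, 0.8), (0.8, 0.6), (0.7, 0.4)]


def interval_time(cpu_cycles, mem_ns, f_ghz):
    # CPU work stretches as frequency drops; memory stall time does not.
    return cpu_cycles / f_ghz + mem_ns


def interval_energy(cpu_cycles, v):
    # Normalized dynamic energy: V^2 per CPU cycle (assumed model).
    return cpu_cycles * v * v


def oracle_schedule(intervals, max_slowdown=0.05):
    """Pick one V/F setting per interval, minimizing energy under a time budget.

    intervals: list of (cpu_cycles, mem_stall_ns) pairs, one per DVFS window.
    Returns (energy, tuple_of_setting_indices).
    """
    f_max = SETTINGS[0][1]
    base_time = sum(interval_time(c, m, f_max) for c, m in intervals)
    budget = base_time * (1.0 + max_slowdown)

    best = None
    for choice in product(range(len(SETTINGS)), repeat=len(intervals)):
        total_time = sum(
            interval_time(c, m, SETTINGS[k][1])
            for (c, m), k in zip(intervals, choice)
        )
        if total_time > budget:
            continue  # violates the bounded performance loss
        energy = sum(
            interval_energy(c, SETTINGS[k][0])
            for (c, _), k in zip(intervals, choice)
        )
        if best is None or energy < best[0]:
            best = (energy, choice)
    return best


# One CPU-bound window followed by one memory-bound window.
energy, choice = oracle_schedule([(100, 0), (10, 200)])
print(choice, energy)
```

In this sketch the memory-bound window absorbs the whole slowdown budget at the lowest setting while the CPU-bound window stays at full speed, which is the kind of slack the oracle is meant to exploit.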
![Page 16: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/16.jpg)
DVFS Architecture Study
• Processor model
  – 4 simple XScale-like in-order cores
  – Private L1, shared L2
• Simulation framework
  – SESC multi-core simulator
  – Wattch power modeling
  – Cacti cache simulator
  – Orion
  – MESI-based cache coherence
  – Multithreaded and multi-programming benchmarks
![Page 17: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/17.jpg)
Ocean’s DVFS Opportunities
• Multithreaded ocean running on all 4 cores exhibits variable activity between cores
• Per-core voltage again offers more DVFS opportunities
![Page 18: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/18.jpg)
fft’s DVFS Opportunities
• Multithreaded fft running on all 4 cores exhibits variable activity between cores
• Per-core voltage offers more DVFS opportunities
![Page 19: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/19.jpg)
Benefits of Fine-Grained DVFS
• Off-chip regulator: 100μs to static (app-level) intervals
  – OS-level DVFS control
• On-chip regulator: 100ns to 1μs intervals
  – Needs online DVFS control
[Figure panels: mcf, fft]
![Page 20: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/20.jpg)
Global vs. Per-Core DVFS (multithreaded applications)
• DVFS interval = 100ns
• Per-core DVFS offers more savings
• Savings vs. benchmark trend tracks “variability”
[Figure panels: Global, Per-Core]
![Page 21: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/21.jpg)
Global vs. Per-Core DVFS (multi-programming applications)
• DVFS interval = 100ns
• mcf = memory-bound app; applu = CPU-bound app
• Power savings for mix of memory- and CPU-bound apps
[Figure panels: Global, Per-Core]
![Page 22: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/22.jpg)
Regulator Design
• $P_{\text{delivered}} = \tfrac{1}{2} L I^{2} F_{\text{switching}}$
• On-chip multiphase buck converter
  – Higher Fswitching
  – Smaller L & C
  – Lower Vripple and/or smaller filter C
[Figure panels: Conventional buck converter w/ hysteretic control; Multi-phase buck converter]
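The relation above implies that, for a fixed delivered power and inductor current, the required inductance shrinks in proportion to the switching frequency. A small sketch of that arithmetic follows; the power, current, and frequency values are made-up illustrations, and `required_inductance` is a hypothetical helper name.

```python
def required_inductance(p_delivered_w, i_peak_a, f_switching_hz):
    # Rearranged from P_delivered = 1/2 * L * I^2 * F_switching.
    return 2.0 * p_delivered_w / (i_peak_a ** 2 * f_switching_hz)


# Illustrative numbers only: 1 W delivered at 1 A.
l_slow = required_inductance(1.0, 1.0, 1e6)    # 1 MHz switching
l_fast = required_inductance(1.0, 1.0, 100e6)  # 100 MHz switching
print(l_slow, l_fast, l_slow / l_fast)
```

Raising the switching frequency by 100x cuts the inductance by the same factor in this model, which is why a higher Fswitching is listed alongside smaller L and C.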
![Page 23: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/23.jpg)
Power Delivery Options
• Can we leverage architecture to reduce the droop?
(x4 for per-core DVFS)
![Page 24: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/24.jpg)
Current Staggering
• Burn power to reduce voltage droop
Voltage margins
![Page 25: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/25.jpg)
Voltage Transition Overhead
• Scale up voltage before increasing frequency
• Drop frequency before decreasing voltage
• Power overhead = area between curves
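The ordering rule above can be sketched as a small helper that returns the steps of a V/F transition in a safe order. The function name and the step labels are hypothetical, and the sketch assumes voltage and frequency always move in the same direction.

```python
def transition_steps(f_old_ghz, v_new, f_new_ghz):
    """Return the ordered steps for moving to a new V/F operating point."""
    if f_new_ghz > f_old_ghz:
        # Speeding up: raise voltage first so the faster clock is always safe.
        return [("set_voltage", v_new), ("set_frequency", f_new_ghz)]
    # Slowing down: drop frequency first, then lower the voltage.
    return [("set_frequency", f_new_ghz), ("set_voltage", v_new)]


print(transition_steps(0.6, 1.0, 1.0))  # scaling up
print(transition_steps(1.0, 0.8, 0.6))  # scaling down
```

In both directions the core briefly runs at a higher voltage than its frequency needs, which is the power overhead the slide describes as the area between the curves.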
![Page 26: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/26.jpg)
Regulator Specifications
• Optimized Fswitching with respect to losses
  – Balance DVFS overhead with regulator loss
![Page 27: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/27.jpg)
Energy Breakdown Comparison
(100μs DVFS interval) (100ns DVFS interval) (100ns DVFS interval)
![Page 28: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/28.jpg)
Relative Energy Savings
![Page 29: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/29.jpg)
Putting It All Together
• Energy savings with fast DVFS offset by
  – On-chip regulator loss
  – Voltage transition power overhead
  – Current staggering overhead
• Per-core DVFS attractive for CMP systems
  – Must consider scalability of on-chip regulators
• Next steps:
  – Meeta is investigating fast DVFS scaling algorithms to leverage fast, fine-grained voltage switching
  – Wonyoung is designing the regulator
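The accounting in the first bullet can be written out as a one-line energy budget. The sketch below uses placeholder fractions of baseline energy purely to show the bookkeeping; none of the numbers come from the talk, and `net_savings` is a hypothetical name.

```python
def net_savings(dvfs_savings, regulator_loss, transition_overhead, staggering_overhead):
    # All arguments are fractions of baseline energy.
    return dvfs_savings - (regulator_loss + transition_overhead + staggering_overhead)


# Placeholder values only, chosen to illustrate the bookkeeping.
print(net_savings(0.20, 0.08, 0.02, 0.01))
```

A positive result means fast DVFS still wins after the three overheads are charged against it.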
![Page 30: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/30.jpg)
PROCESS VARIATION TOLERANT 3T1D-BASED CACHE ARCHITECTURES
Seminar Part 2
![Page 31: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/31.jpg)
Process Variation
• As Moore’s Law continues and on-chip dimensions get smaller, imperfections in the fabrication process affect device performance more and more…
• Past: Worried about wafer-to-wafer, chip-to-chip variations
• Now: Worry about within-die, transistor-to-transistor variations
(Source: K. Bernstein, IBM J. R&D ’06) (Source: Friedberg, SPIE ’06)
![Page 32: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/32.jpg)
Variability Trends
• In the past…
  – wafer to wafer
  – chip to chip
• Now…
  – core to core
  – block to block
  – array to array
  – transistor to transistor
![Page 33: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/33.jpg)
On-Chip Memory
• On-chip memory is a huge fraction of die area
[Die photos: Intel Core2Duo, AMD Barcelona]
![Page 34: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/34.jpg)
SRAM scaling: A Tale of Two Conferences?
[Figure: SQRT(cell area) (μm, 0 to 2.5) vs. technology (nm, 0 to 200), with one data set from ISSCC and one from IEDM]
• Is SRAM scaling slowing down?
• Plots include circuit techniques to improve reliability (e.g., dual voltage, boosted WL, etc.)
(http://www.chipworks.com/blogs.aspx?id=2706)
![Page 35: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/35.jpg)
Problems with 6T
• Susceptibility to process variations (PV)
• Performance variations (Read/Write delay variations)
• Bit flips due to voltage noise and leakage
• Stuck-at faults b/c too much mismatch
![Page 36: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/36.jpg)
Dealing with variability in memories
• Microarchitectural techniques
  – Traditional ideas to deal with soft errors
    • Parity or ECC
    • Cache scrubbing
  – PVT-induced soft errors much more frequent than radiation-induced soft errors
• Must understand the system-level issues
• What’s the problem?
  – Fighting or feedback
    • Sensitive to mismatch
    • Boosted array or wordline voltage?
  – Bitline leakage
    • Large variations in leakage currents
    • Shorter bitlines?
![Page 37: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/37.jpg)
Data Usage in L1
[Figure: percentage of references (0 to 100) vs. period in cycles (1 to 20001) for Applu, Crafty, Fma3d, Gzip, Mcf, Mesa, Twolf, and AVERAGE]
• On average, 90% of data accessed in first 6K cycles
![Page 38: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/38.jpg)
Proposed Solution
• Use 3T1D dynamic cells to replace 6T cells
  – W. K. Luk et al., “A 3-transistor DRAM cell with gated diode for enhanced speed and retention time,” Symp. on VLSI Circuits, June 2006.
• Why?
  – Higher immunity to process variations
  – Absorb delay variation into cell “retention time”
  – No inherent fighting, so no bit flips
  – Lower power (leakage and dynamic)
  – Higher density possible
• But what about refresh?
  – Use architectural insights and techniques to deal with dynamic data storage
• Where?
  – Analyzed register files (RF) and L1 data caches
  – eDRAMs being considered for L2 caches and above…
![Page 39: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/39.jpg)
What is a 3T1D cell?
• Gated diode selectively boosts stored data (“1”) during reads
• Non-destructive reads allow for column multiplexing
[Figure: 3T1D cell schematic showing transistors T1, T2, and T3, gated diode D1 (transistor connected as gated diode), the storage node (nodeS), wordlines WLread and WLwrite, and bitlines BLread and BLwrite]
![Page 40: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/40.jpg)
Retention Time vs. Access Time
• What retention time is “good enough”?
[Figure: access time (ps, 140 to 230) vs. time (μs, 0 to 7) passed after a "1" is written to the storage cell. Curves: nominal 3T1D cell; 3T1D cell with longer gate length (+sigma); 3T1D cell with shorter gate length (-sigma); 6T SRAM cell. 256x256 memory array simulation (32nm PTM).]
![Page 41: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/41.jpg)
Simulation Setup
• Baseline: 4-wide out-of-order machine
  – 20 FO4 pipelines
  – 80-entry RF
  – 64KB, 4-way set-associative I- and D-caches
• sim-alpha simulator used to calculate instructions per cycle (IPC)
• 8 SPEC2000 benchmarks
![Page 42: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/42.jpg)
Variation Model
• Monte Carlo analysis of process variation impact on memory cell delay and power
  – 32nm PTM, Vdd = 1V
  – Considered typical and extreme PV scenarios
  – Correlations based on Friedberg’s chip measurements

| Parameter | Typical | Severe |
|---|---|---|
| σL/Lnominal (WID) | 5% | 7% |
| σVth/Vth (WID) | 10% | 15% |
| σL/Lnominal (D2D) | 5% | 5% |
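A minimal Monte Carlo sampler for the sigmas listed above, as a sketch only. It draws independent Gaussian within-die (WID) offsets per cell plus one die-to-die (D2D) gate-length shift per chip, and it ignores the spatial correlations the talk takes from Friedberg's measurements. The nominal threshold voltage of 0.3 V and all names are assumptions.

```python
import random
import statistics

# Relative sigmas from the variation-model table.
SCENARIOS = {
    "typical": {"sigma_l_wid": 0.05, "sigma_vth_wid": 0.10, "sigma_l_d2d": 0.05},
    "severe": {"sigma_l_wid": 0.07, "sigma_vth_wid": 0.15, "sigma_l_d2d": 0.05},
}


def sample_chip(scenario, n_cells, l_nominal_nm=32.0, vth_nominal_v=0.3, rng=None):
    """Return a list of (gate_length_nm, vth_v) samples for one chip."""
    rng = rng or random.Random()
    s = SCENARIOS[scenario]
    # One die-to-die gate-length shift shared by every cell on the chip.
    d2d_shift = rng.gauss(0.0, s["sigma_l_d2d"])
    cells = []
    for _ in range(n_cells):
        # Independent within-die offsets per cell (no spatial correlation).
        length = l_nominal_nm * (1.0 + d2d_shift + rng.gauss(0.0, s["sigma_l_wid"]))
        vth = vth_nominal_v * (1.0 + rng.gauss(0.0, s["sigma_vth_wid"]))
        cells.append((length, vth))
    return cells


cells = sample_chip("typical", 20000, rng=random.Random(0))
print(statistics.pstdev(v for _, v in cells) / 0.3)
```

Each sampled (L, Vth) pair would then feed a cell-level delay and power model, which this sketch leaves out.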
![Page 43: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/43.jpg)
Cache Configuration
• 64KB cache
  – 4-way set associative, 512b cache lines
  – 2 read/1 write ports
  – 8 256x256 subarrays
  – 64 sense amps per subarray
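As a quick consistency check on the configuration above, the subarray geometry can be multiplied out. The constant names are hypothetical, and the last figure assumes a line is sensed 64 bits at a time, which is one reading of the sense-amp count rather than something the slide states.

```python
SUBARRAYS = 8
ROWS = 256
COLS = 256
LINE_BITS = 512
WAYS = 4
SENSE_AMPS = 64

total_bits = SUBARRAYS * ROWS * COLS
total_kib = total_bits // 8 // 1024          # capacity in KB
num_lines = total_bits // LINE_BITS          # cache lines in the data array
num_sets = num_lines // WAYS                 # sets for a 4-way cache
passes_per_line = LINE_BITS // SENSE_AMPS    # assumed: 64 bits sensed per pass

print(total_kib, num_lines, num_sets, passes_per_line)
```

Eight 256x256 subarrays do add up to the stated 64KB, giving 1024 lines of 512 bits.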
![Page 44: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/44.jpg)
Cache Data Array Floorplan
[Figure: four groups, each pairing two 256x256 subarrays (256 columns each) with a column mux and 64 sense amps; row and column decode blocks; a way mux (late select from tag) driven by way select from the tag array; 64-bit data out]
![Page 45: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/45.jpg)
Global Refresh Scheme
• 8 cycles to refresh one cache line (SA-limited)
• 2K cycles to refresh entire cache (476ns @ 4.3GHz)
• ~6µs retention time (no variations)
• Refresh takes 8% of cache bandwidth
• IPC hit < 1%
[Figure: global refresh block diagram with labels: chip clock; refresh pulse generation (refresh rate = 476.3n/retention time); refresh pulse; cache refresh ID generation; insert refresh operation to cache array; L1 data cache array; block signal (block one rd/wr port); to processor; to scheduler]
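The timing figures above can be reproduced with two divisions. The sketch assumes "2K cycles" means 2048, which matches the 476.3 ns shown in the diagram; the function name is hypothetical.

```python
def refresh_overhead(cycles_per_pass=2048, clock_hz=4.3e9, retention_s=6e-6):
    """Return (time for one full refresh pass in seconds, fraction of time spent refreshing)."""
    pass_time_s = cycles_per_pass / clock_hz
    # One full pass is needed per retention period.
    return pass_time_s, pass_time_s / retention_s


pass_time_s, fraction = refresh_overhead()
print(f"{pass_time_s * 1e9:.1f} ns per pass, {fraction:.1%} of cache bandwidth")
```

One pass takes about 476 ns, and repeating it every 6 µs occupies roughly 8% of the time, in line with the bandwidth figure on the slide.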
![Page 46: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/46.jpg)
6T Performance under typical variations
![Page 47: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/47.jpg)
3T Performance under typical variations
![Page 48: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/48.jpg)
Three chips under severe variations
![Page 49: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/49.jpg)
Line-Level Schemes: Refresh Policies
• Refresh Policies
  – Full-refresh: Per-line counter forces refresh when needed
  – No-refresh: Rely on L2 inclusion properties
  – Partial-refresh: Threshold counter chooses one of the two policies
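A rough sketch of how the three policies above might behave for a single line. It is one plausible reading, not the authors' implementation: under partial-refresh, lines whose retention meets a threshold are left to expire (and would be refetched from L2 on a later access), while shorter-lived lines are refreshed by their counter. The 6000-cycle threshold and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    retention_cycles: int  # how long this line holds data without refresh
    age: int = 0           # per-line counter: cycles since last write/refresh
    valid: bool = True
    refreshes: int = 0


def choose_policy(line, threshold_cycles=6000):
    # Partial-refresh: the threshold picks full-refresh or no-refresh per line.
    if line.retention_cycles >= threshold_cycles:
        return "no-refresh"
    return "full-refresh"


def tick(line, policy, cycles=1):
    """Advance the line's counter, refreshing or expiring it as the policy dictates."""
    for _ in range(cycles):
        if not line.valid:
            return
        line.age += 1
        if line.age >= line.retention_cycles:
            if policy == "full-refresh":
                line.age = 0
                line.refreshes += 1
            else:
                # No-refresh: data is lost; a later access misses and refetches from L2.
                line.valid = False


weak = CacheLine(retention_cycles=2000)
strong = CacheLine(retention_cycles=8000)
tick(weak, choose_policy(weak), cycles=6000)
tick(strong, choose_policy(strong), cycles=6000)
print(weak.refreshes, weak.valid, strong.valid)
```

Here the weak line is refreshed three times to survive 6000 cycles, while the strong line needs no refresh over the same window.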
![Page 50: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/50.jpg)
Line-Level Schemes: Replacement Policies
• Replacement Policies
  – Dead-sensitive Placement
    • Avoid using “dead” lines when performing placement
  – Retention-sensitive placement (RSP-FIFO)
    • Order lines in descending retention time
    • New lines are assigned the longest retention time line (and old ones reshuffle)
  – Retention-sensitive placement (RSP-LRU)
    • MRU block is assigned the longest retention time
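The RSP-FIFO bullets can be sketched for one cache set as follows. This is an illustrative model with hypothetical names and retention values: ways are ranked by retention time, a new block takes the longest-retention way, and resident blocks each shift to the next-longest way, pushing the oldest block out.

```python
def rank_ways(retention_by_way):
    # Order way names by descending retention time.
    return sorted(retention_by_way, key=retention_by_way.get, reverse=True)


def rsp_fifo_insert(blocks, new_block):
    """blocks[i] lives in the way with the i-th longest retention.

    Returns (new block ordering, evicted block).
    """
    evicted = blocks[-1]
    # New block goes to the longest-retention way; the rest shift down by one.
    return [new_block] + blocks[:-1], evicted


# Hypothetical per-way retention times in cycles for a 4-way set.
retention = {"w0": 3000, "w1": 9000, "w2": 1000, "w3": 6000}
order = rank_ways(retention)
blocks, evicted = rsp_fifo_insert(["A", "B", "C", "D"], "E")
print(order, dict(zip(order, blocks)), evicted)
```

The newest data, which the usage plot suggests is the most likely to be referenced soon, ends up in the cell with the most retention headroom.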
![Page 51: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/51.jpg)
Evaluating Policies
![Page 52: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/52.jpg)
Pushing policies to the limits
[Figure: six surface plots of performance (0.7 to 1.1) vs. μ (0k to 30k) and σ/μ (0.05 to 0.35), one per scenario]
1. 65nm, typical, 1.1V
2. 45nm, typical, 1.1V
3. 32nm, typical, 1.1V
4. 32nm, severe, 1.1V
5. 32nm, typical, 0.9V
6. 32nm, severe, 0.9V
![Page 53: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/53.jpg)
Power Analysis (Dynamic)
[Figure: normalized performance (0.9 to 1) and normalized dyn. power (1 to 1.8) vs. chip ID (1 to 100) for no-refresh/LRU, partial-refresh/DSP, and RSP-FIFO]
• Refresh power is small (~10% overhead for better schemes)
![Page 54: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/54.jpg)
Power Analysis (Leakage)
[Figure: 3T1D cell schematic (T1, T2, T3, D1 transistor connected as gated diode, storage node (nodeS), WLread, WLwrite, BLwrite) highlighting the weak leakage path]
• Substantial leakage savings
![Page 55: Joint Architecture and Circuit Techniques to Address ...ewh.ieee.org/r5/denver/sscs/Presentations/2007_11_Wei_Brooks.pdf · Joint Architecture and Circuit Techniques to Address Process](https://reader034.vdocuments.mx/reader034/viewer/2022042023/5e7b63e057b2a376ec6c2c03/html5/thumbnails/55.jpg)
Reliable Memory Summary
• Transient nature of data in L1 cache allows for architecturally-simple refresh schemes for 3T1D memories
• Provides PV-tolerant on-chip memories
  – Comparable performance to “ideal” 6T
  – Lower leakage power
  – Low HW overhead
• Similar results observed for 3T1D register files and instruction caches
• Test chip planned for fab in Spring 2008