Exploiting 3D-Stacked Memory Devices
Rajeev Balasubramonian
School of Computing, University of Utah
Oct 2012
Power Contributions
[Figure: percentage of total server power, processor vs. memory]
Example IBM Server
Source: P. Bose, WETI Workshop, 2012
Reasons for Memory Power Increase
• Innovations for the processor, but not for memory
• Harder to get to memory (buffer chips)
• New workloads that demand more memory: SAP HANA in-memory databases; SAS in-memory analytics
The Cost of Data Movement
• 64-bit double-precision FP MAC: 50 pJ (NSF CPOM Workshop report)
• 1 instruction on an ARM Cortex A5: 80 pJ (ARM datasheets)
• Fetching 256-bit block from a distant cache bank: 1.2 nJ (NSF CPOM Workshop report)
• Fetching 256-bit block from an HMC device: 2.68 nJ; from a DDR3 device: 16.6 nJ (Jeddeloh and Keeth, 2012 Symp. on VLSI Technology)
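The figures above can be put on a common scale with a few lines of arithmetic. The sketch below only restates the quoted numbers; the helper names `pj_per_bit` and `macs_per_fetch` are illustrative and not from the talk.

```python
# Energy figures quoted on the slide, converted to picojoules.
FP_MAC_PJ = 50.0         # 64-bit double-precision FP MAC
HMC_FETCH_PJ = 2680.0    # 256-bit block from an HMC device (2.68 nJ)
DDR3_FETCH_PJ = 16600.0  # 256-bit block from a DDR3 device (16.6 nJ)
BLOCK_BITS = 256


def pj_per_bit(fetch_pj, bits=BLOCK_BITS):
    """Energy per transferred bit for one block fetch."""
    return fetch_pj / bits


def macs_per_fetch(fetch_pj, mac_pj=FP_MAC_PJ):
    """How many FP MACs cost the same energy as one block fetch."""
    return fetch_pj / mac_pj


hmc_per_bit = pj_per_bit(HMC_FETCH_PJ)         # roughly 10.5 pJ/bit
ddr3_per_bit = pj_per_bit(DDR3_FETCH_PJ)       # roughly 64.8 pJ/bit
ddr3_in_macs = macs_per_fetch(DDR3_FETCH_PJ)   # 332 MACs per DDR3 fetch
ddr3_vs_hmc = DDR3_FETCH_PJ / HMC_FETCH_PJ     # roughly 6.2x

print(f"HMC:  {hmc_per_bit:.2f} pJ/bit")
print(f"DDR3: {ddr3_per_bit:.2f} pJ/bit ({ddr3_in_macs:.0f} FP MACs per fetch)")
```

On these numbers, one DDR3 block fetch costs as much energy as a few hundred floating-point operations, which is the motivation for the rest of the talk.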
Memory Basics
[Diagram: host multi-core processor with four memory controllers (MCs)]
FB-DIMM
[Diagram: host multi-core processor with four memory controllers (MCs)]
SMB/SMI
[Diagram: host multi-core processor with four memory controllers (MCs)]
Micron Hybrid Memory Cube Device
HMC Architecture
[Diagram: host multi-core processor with four memory controllers (MCs)]
Key Points
• HMC allows logic layer to easily reach DRAM chips
• Open question: new functionalities on the logic chip – cores, routing, refresh, scheduling
• Data transfer out of the HMC is just as expensive as before
Near Data Computing … to cut off-HMC movement
Intelligent Network-of-Memories … to reduce hops
Near Data Computing (NDC)
Timely Innovation
• A low-cost way to achieve NDC
• Workloads that are embarrassingly parallel
• Workloads that are increasingly memory bound
• Mature frameworks (MapReduce) in place
Open Questions
• What workloads will benefit from this?
• What causes the benefit?
Workloads
• Initial focus on MapReduce, but any workload with localized data access patterns will be a good fit
• Map phase in MapReduce: the dataset is partitioned and each Mapper works on its own “split”; embarrassingly parallel, localized data access, often the bottleneck; e.g., count word occurrences in each individual document
• Reduce phase in MapReduce: aggregates the results of many mappers; requires random access of data; but deals with less data than Mappers; e.g., summing up the occurrences for each word
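The word-count example from the two bullets above can be sketched in a few lines. This is a toy, single-process illustration, not the framework evaluated in the talk; the function names and the sample splits are made up for the example.

```python
from collections import Counter


def map_phase(split):
    """Mapper: count word occurrences in the documents of one split.

    Touches only its own split, so mappers can run independently.
    """
    counts = Counter()
    for document in split:
        counts.update(document.lower().split())
    return counts


def reduce_phase(partial_counts):
    """Reducer: sum the per-mapper counts for each word."""
    totals = Counter()
    for counts in partial_counts:
        totals.update(counts)
    return totals


# Hypothetical dataset, already partitioned into two splits.
splits = [
    ["the cat sat", "the dog"],
    ["a cat", "the end"],
]

partials = [map_phase(split) for split in splits]  # embarrassingly parallel
totals = reduce_phase(partials)                    # aggregates across mappers
print(totals.most_common(2))
```

The mapper reads every word of its split, while the reducer only sees one small counter per mapper, which mirrors the point that reducers deal with less data.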
Baseline Architecture
[Diagram: four memory controllers (MCs)]
• Mappers and Reducers both execute on the host processor
• Many simple cores are better than a few complex cores
• 2 sockets, 256 GB memory, processing power budget 260 W, 512 ARM cores (EE-Cores) per socket, each core at 876 MHz
NDC Architecture
[Diagram: four memory controllers (MCs)]
• Mappers execute on ND Cores; Reducers execute on the host processor
• 32 cores per HMC; 2048 total ND Cores and 1024 total EE-Cores; 260 W total processing power budget
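The core counts quoted for the baseline and NDC configurations can be cross-checked with simple arithmetic. The per-HMC capacity below rests on an assumption the slides do not state: that the NDC system keeps the baseline's 256 GB and spreads it evenly across the HMCs.

```python
# Numbers quoted on the baseline and NDC architecture slides.
SOCKETS = 2
EE_CORES_PER_SOCKET = 512
ND_CORES_TOTAL = 2048
ND_CORES_PER_HMC = 32

# Assumption (not stated in the talk): the NDC system has the same
# 256 GB as the baseline, divided evenly across all HMCs.
MEMORY_GB = 256

ee_cores_total = SOCKETS * EE_CORES_PER_SOCKET   # matches the quoted 1024
hmc_count = ND_CORES_TOTAL // ND_CORES_PER_HMC   # number of HMC devices implied
gb_per_hmc = MEMORY_GB / hmc_count               # under the assumption above

print(ee_cores_total, hmc_count, gb_per_hmc)
```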
NDC Memory Hierarchy
[Diagram: four memory controllers (MCs)]
• Memory latency excludes delay for link queuing and traversal
• Many row buffer hits
• L1 I and D caches per ND Core
• The vault has space reserved for intermediate outputs, and Mapper/Runtime code/data
Methodology
• Three workloads: Range-Aggregate (count occurrences of something); Group-By (count occurrences of everything); Equi-Join (for two databases, count the pairs that have similar attributes)
• Dataset: 1998 World Cup web server logs
• Simulations of individual mappers and reducers on EE-Cores using the TRAX simulator
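As a rough illustration of the three workloads, the sketch below runs them over a handful of made-up log records. The record fields (`client`, `object`, `time`), the range predicate on timestamps, and the choice of exact key equality for the join are all assumptions for the example; the talk does not specify the queries.

```python
from collections import Counter

# Hypothetical web-log records; field names are illustrative only.
logs_a = [
    {"client": "c1", "object": "/a", "time": 10},
    {"client": "c2", "object": "/b", "time": 20},
    {"client": "c1", "object": "/a", "time": 30},
    {"client": "c3", "object": "/c", "time": 40},
]
logs_b = [
    {"client": "c1", "object": "/b", "time": 15},
    {"client": "c3", "object": "/c", "time": 25},
    {"client": "c3", "object": "/a", "time": 35},
]


def range_aggregate(records, lo, hi):
    """Count the records whose timestamp lies in [lo, hi]."""
    return sum(1 for r in records if lo <= r["time"] <= hi)


def group_by(records, key):
    """Count occurrences of every distinct value of `key`."""
    return Counter(r[key] for r in records)


def equi_join_count(left, right, key):
    """Count (left, right) record pairs that agree on `key`."""
    right_counts = Counter(r[key] for r in right)
    return sum(right_counts[l[key]] for l in left)


in_range = range_aggregate(logs_a, 15, 35)
per_client = group_by(logs_a, "client")
join_pairs = equi_join_count(logs_a, logs_b, "client")
print(in_range, dict(per_client), join_pairs)
```

All three are single scans over the input with small intermediate state, which fits the localized access pattern the Map phase relies on.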
Single Thread Performance
Effect of Bandwidth
Exec Time vs. Frequency
Maximizing the Power Budget
Scaling the Core Count
Energy Reduction
Results Summary
• Execution time reductions of 7%-89%
• NDC performance scales better with core count
• Energy reduction of 26%-91%
No bandwidth limitation; lower memory access latency; lower bit transport energy
Intelligent Network of Memories
• How should several HMCs be connected to the processor?
• How should data be placed in these HMCs?
Contributions
• Evaluation of different network topologies; route adaptivity does help
• Page placement to bring popular data to nearby HMCs; percolate-down based on page access counts
• Use of router bypassing under low load
• Use of deep sleep modes for distant HMCs
Topologies
[Figure: network topology diagrams; surviving panel labels: (d) F-Tree, (e) T-Tree]
Network Properties
• Supports 44-64 HMC devices with 2-4 rings
• Adaptive routing (deadlock avoidance based on timers)
• An entire page resides in one ring, but cache lines are striped across the channels
Percolate-Down Page Placement
• New pages are placed in nearest ring
• Periodically, inactive pages are demoted to the next ring; thresholds matter because of queuing delays
• Activity is tracked with the multi-queue algorithm: hierarchical queues; each entry has a timer and an access count; demotion to a lower queue if the timer expires; promotion to a higher queue if the access count is high
• Page migration is off the critical path and striped across many channels; distant links are under-utilized
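The multi-queue tracking described above can be sketched as a small class. This is a toy model under stated assumptions, not the mechanism evaluated in the talk: the number of queues, the timer length, the promotion threshold, and the rule that a page expiring in the lowest queue becomes a candidate for demotion to the next ring are all illustrative choices.

```python
class MultiQueueTracker:
    """Toy multi-queue page-activity tracker (all parameters are assumptions)."""

    def __init__(self, num_queues=4, lifetime=100, promote_at=4):
        self.num_queues = num_queues
        self.lifetime = lifetime      # timer length, in arbitrary time units
        self.promote_at = promote_at  # accesses needed to move up one queue
        self.pages = {}               # page -> [queue_level, access_count, expiry]

    def access(self, page, now):
        """Record an access: bump the count, reset the timer, maybe promote."""
        level, count, _ = self.pages.get(page, [0, 0, 0])
        count += 1
        if count >= self.promote_at and level < self.num_queues - 1:
            level, count = level + 1, 0
        self.pages[page] = [level, count, now + self.lifetime]

    def tick(self, now):
        """Demote entries whose timer expired.

        Returns the pages that expired while already in the lowest queue;
        in this sketch those are the candidates for moving to the next ring.
        """
        cold = []
        for page, entry in self.pages.items():
            level, _, expiry = entry
            if now >= expiry:
                if level == 0:
                    cold.append(page)
                else:
                    entry[0] = level - 1
                    entry[1] = 0
                entry[2] = now + self.lifetime
        return cold


tracker = MultiQueueTracker(num_queues=3, lifetime=10, promote_at=2)
tracker.access("hot", now=0)
tracker.access("hot", now=1)    # second access promotes "hot" to queue 1
tracker.access("cold", now=1)
candidates = tracker.tick(now=11)
print(candidates)
```

A real implementation would run this bookkeeping in the OS or memory controller and perform the actual migrations in the background, consistent with migration being off the critical path.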
Router Bypassing
• Topologies with more links and adaptive routing (T-Tree) are better… but distant links experience relatively low load
• While a complex router is required for the T-Tree, the router can often be bypassed
Power-Down Modes
• Activity shifts to nearby rings, leaving distant HMCs under-utilized
• Can power off the DRAM layers (PD-0) and the SerDes circuits (PD-1)
• 26% energy saving for a 5% performance penalty
Methodology
• 128-thread traces of NAS parallel benchmarks (capacity requirements of nearly 211 GB)
• Detailed simulations with 1 billion memory access traces, confirmatory page-access simulations for the entire application
• Power breakdown: 3.7 pJ/bit for DRAM access, 6.8 pJ/bit for HMC logic layer, 3.9 pJ/bit for a 5x5 router
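To get a feel for the per-bit figures above, the sketch below combines them into a rough per-access energy estimate. The additive model is an assumption made for illustration: it charges the DRAM and logic-layer energy once per access and the router energy once per router traversed, and it ignores link energy. The 512-bit (64-byte) line size is also an assumption, not a number from the talk.

```python
# Per-bit energies quoted on the slide.
DRAM_PJ_PER_BIT = 3.7
LOGIC_PJ_PER_BIT = 6.8
ROUTER_PJ_PER_BIT = 3.9


def access_energy_pj(bits, router_traversals):
    """Rough energy of one memory access under a simple additive model.

    Assumption: DRAM and logic-layer energy are paid once, router energy
    once per traversed router; link energy is not modeled.
    """
    per_bit = (DRAM_PJ_PER_BIT + LOGIC_PJ_PER_BIT
               + router_traversals * ROUTER_PJ_PER_BIT)
    return bits * per_bit


LINE_BITS = 512  # assumed 64-byte cache line

near = access_energy_pj(LINE_BITS, router_traversals=0)
far = access_energy_pj(LINE_BITS, router_traversals=2)
print(f"near: {near:.1f} pJ, far: {far:.1f} pJ")
```

Under this model each extra router adds a fixed cost per bit, which is why keeping popular pages in nearby HMCs and bypassing routers both reduce energy.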
Results – Normalized Exec Time
• T-Tree P-Down reduces exec time by 50%
• 86% of flits bypass the router
• 88% of requests serviced by Ring-0
Results – Energy
Summary
• Must reduce data movement on off-chip memory links
• NDC reduces energy, improves performance by overcoming the bandwidth wall
• More work required to analyze workloads, build software frameworks, analyze thermals, etc.
• iNoM uses OS page placement to minimize hops for popular data and increase power-down opportunities
• Path diversity is useful, router overhead is small
Acknowledgements
• Co-authors: Kshitij Sudan, Seth Pugsley, Manju Shevgoor, Jeff Jestes, Al Davis, Feifei Li
• Group funded by: NSF, HP, Samsung, IBM