Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers
Manu Awasthi, David Nellans, Kshitij Sudan, Rajeev Balasubramonian, Al Davis
University of Utah

Page 1: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Manu Awasthi, David Nellans, Kshitij Sudan, Rajeev Balasubramonian, Al Davis

University of Utah

Page 2: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Takeaway

• Multiple on-chip MCs will be common in future CMPs, with multiple cores sharing one MC
  – NUMA memory hierarchies across multiple sockets
  – Intelligent data mapping required to reduce average memory access delay
• A hardware-software co-design approach is required for efficient data placement
  – Minimal software involvement
• Data placement needs to be aware of system parameters
  – Row-buffer hit rates, queuing delays, physical proximity, etc.

Page 3: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

NUMA - Today

[Diagram: conceptual representation of a four-socket Nehalem machine. Each socket holds four cores and an on-chip memory controller (MC) attached to three DIMMs over a memory channel; sockets are connected by QPI links.]

Page 4: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

NUMA - Future

[Diagram: a future CMP with multiple on-chip MCs. Sixteen cores, each with a private L2 cache, share an on-chip interconnect; four memory controllers (MC1-MC4) sit at the chip edges, each driving its own DIMM over a memory channel.]

Page 5: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Local Memory Access

• Accessing local memory is fast!

[Diagram: the four-socket machine from before. A core issues an address (ADDR) to its own socket's MC and the data (DATA) returns over the local memory channel.]

Page 6: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Problem 1 - Remote Memory Access

• Data for Core N can be anywhere!

[Diagram: the requesting core's address (ADDR) must travel across the QPI interconnect to a remote socket's MC.]

Page 7: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Problem 1 - Remote Memory Access

• Data for Core N can be anywhere!

[Diagram: the data (DATA) returns from the remote socket's DIMMs, again crossing the socket boundary.]

Page 8: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Memory Access Stream – Single Core

[Diagram: memory controller request queue for one core; nearly every entry belongs to the same program (Prog 1 on CPU 0, with an occasional Prog 2).]

• A single core executes a handful of context-switched programs.
• Spatio-temporal locality can be exploited!

Page 9: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Problem 2 - Memory Access Stream - CMPs

[Diagram: memory controller request queue for a CMP; entries from different programs on different cores (Prog 0/CPU 0, Prog 1/CPU 1, ..., Prog 6/CPU 6) are interleaved.]

• Memory accesses from different cores get interleaved, leading to loss of spatio-temporal locality.
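The locality loss can be illustrated with a toy open-page model. This is a sketch I am adding, not part of the talk: the request streams and the single-bank setup are synthetic, chosen only to show how interleaving destroys row-buffer hits.

```python
def row_buffer_hit_rate(requests):
    """Fraction of requests that hit the currently open row in their bank.

    `requests` is a list of (bank, row) pairs in arrival order; each bank
    keeps exactly one open row (open-page policy).
    """
    open_row = {}          # bank -> currently open row
    hits = 0
    for bank, row in requests:
        if open_row.get(bank) == row:
            hits += 1      # row-buffer hit: the row is already open
        open_row[bank] = row
    return hits / len(requests)

# One core streaming through memory: 8 consecutive accesses per row.
single = [(0, addr // 8) for addr in range(64)]

# Four cores' streams interleaved at the controller: each core's open row
# is evicted by the other cores between its consecutive accesses.
streams = [[(0, 100 * c + addr // 8) for addr in range(64)] for c in range(4)]
interleaved = [req for group in zip(*streams) for req in group]

print(row_buffer_hit_rate(single))       # high: locality preserved
print(row_buffer_hit_rate(interleaved))  # low: locality destroyed
```

With 8 accesses per row, the single-core stream misses only on the first access to each row, while the interleaved stream never finds its row still open.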

Page 10: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Problem 3 – Increased Overheads for Memory Accesses

[Chart: queuing delays grow sharply when moving from 1 core/1 thread to 16 cores/16 threads.]

Page 11: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Problem 4 – Pin Limitations

[Diagram: two 16-core chips, one with 8 memory controllers (MC1-MC8) and one with 16 (MC1-MC16), illustrating that additional MCs compete for a fixed pin budget.]

• Pin bandwidth is limited: the number of MCs cannot keep growing with core counts

• A small number of MCs will have to handle all traffic

Page 12: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Problems Summary - I

• Pin limitations imply an increase in queuing delay
  – Almost 8x increase in queuing delays from 1 core/1 thread to 16 cores/16 threads
• Multi-core implies an increase in row-buffer interference
  – Increasingly randomized memory access stream
  – Row-buffer hit rates bound to go down
• Longer on- and off-chip wire delays imply an increase in the NUMA factor
  – The NUMA factor is already about 1.5 today

Page 13: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Problems Summary - II

• DRAM access time in systems with multiple on-chip MCs is governed by
  – Distance between requesting core and responding MC
  – Load on the on-chip interconnect
  – Average queuing delay at the responding MC
  – Bank and rank contention at the target DIMM
  – Row-buffer hit rate at the responding MC

Bottom line: intelligent management of data is required

Page 14: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Adaptive First Touch Policy

• Basic idea: assign each new virtual page to a DRAM (physical) page belonging to the MC j that minimizes the cost function

  cost_j = α × load_j + β × rowhits_j + λ × distance_j

  where load_j is a measure of queuing delay, rowhits_j a measure of locality at the DRAM, and distance_j a measure of physical proximity.

• The constants α, β and λ can be made programmable.
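The first-touch assignment can be sketched as an argmin over the per-MC cost. This is a minimal illustration I am adding, not the paper's implementation: the per-MC statistics (`mc_stats` and its fields) are hypothetical placeholders for counters a real design would sample in hardware; the weights are the ones listed on the methodology slide.

```python
# Weights as given on the methodology slide: alpha=10, beta=20, lambda=100.
ALPHA, BETA, LAMBDA = 10, 20, 100

def aft_cost(load, row_hits, distance):
    """cost_j = alpha*load_j + beta*rowhits_j + lambda*distance_j (lower is better)."""
    return ALPHA * load + BETA * row_hits + LAMBDA * distance

def assign_page(mc_stats):
    """Pick the MC that minimizes the cost function for a newly touched page.

    `mc_stats` maps an MC id to (queue_load, row_buffer_hits, hop_distance);
    these are made-up sample values, not measured data.
    """
    return min(mc_stats, key=lambda j: aft_cost(*mc_stats[j]))

stats = {1: (40, 5, 1), 2: (10, 2, 3), 3: (80, 9, 2), 4: (25, 4, 4)}
print(assign_page(stats))  # the MC with the lowest weighted cost
```

Making α, β and λ programmable, as the slide suggests, would simply mean exposing the three constants as registers instead of literals.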

Page 15: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Dynamic Page Migration Policy

• Programs change phases!
  – Can completely stop touching new pages
  – Can change the frequency of access to a subset of pages
• This leads to imbalance in MC accesses
  – For long-running programs with varying working sets, AFT can leave some MCs overloaded

Solution: dynamically migrate pages between MCs at runtime to decrease the imbalance

Page 16: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Dynamic Page Migration Policy

[Diagram: the 16-core CMP. MC3 is heavily loaded (the donor); MC1, MC2 and MC4 are lightly loaded candidates to receive its pages.]

Page 17: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Dynamic Page Migration Policy

[Diagram: the migration steps on the 16-core CMP: (1) select N pages at the donor MC, (2) select a recipient MC, (3) copy the N pages from donor to recipient.]

Page 18: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Dynamic Page Migration Policy - Challenges

• Selecting the recipient MC
  – Move pages to the MC with the least value of the cost function

    cost_k = Λ × distance_k + Γ × rowhits_k

  – The distance term favors a physically proximal MC; the row-hits term minimizes interference at the recipient
• Selecting N, the number of pages to migrate
  – Empirically select the best possible value
  – Can also be made programmable
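A migration step could be sketched as below. This is my own illustration under stated assumptions: the per-MC statistics and per-page access counts (`mc_stats`, `page_heat`) are hypothetical inputs, and picking the donor's hottest pages is one plausible heuristic for "select N pages", not necessarily the paper's.

```python
# Weights as given on the methodology slide: Lambda = Gamma = 100.
BIG_LAMBDA, GAMMA = 100, 100

def recipient_cost(distance, row_hits):
    """cost_k = Lambda*distance_k + Gamma*rowhits_k: prefer a physically
    proximal MC whose row-buffer locality we disturb the least."""
    return BIG_LAMBDA * distance + GAMMA * row_hits

def plan_migration(donor, mc_stats, page_heat, n):
    """Return (recipient, pages) for one migration epoch.

    `mc_stats` maps MC id -> (hop_distance, row_buffer_hits);
    `page_heat` maps the donor's page numbers -> access counts.
    """
    # Recipient: the non-donor MC with the smallest cost_k.
    recipient = min((k for k in mc_stats if k != donor),
                    key=lambda k: recipient_cost(*mc_stats[k]))
    # Pages: the donor's n most frequently accessed pages.
    pages = sorted(page_heat, key=page_heat.get, reverse=True)[:n]
    return recipient, pages

stats = {1: (2, 30), 2: (1, 10), 3: (3, 5), 4: (1, 50)}
heat = {0xA: 900, 0xB: 120, 0xC: 640, 0xD: 75}
print(plan_migration(donor=3, mc_stats=stats, page_heat=heat, n=2))
```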

Page 19: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Dynamic Page Migration Policy - Overheads

• Pages are physically copied to new addresses
  – The original address mapping has to be invalidated
  – Cache lines belonging to the copied pages must be invalidated
• Copying pages can block resources, leading to unnecessary stalls
• Immediate TLB invalidates could cause misses to memory even when the data is present
• Solution: lazy copying
  – Essentially, a delayed write-back

Page 20: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Issues with TLB Invalidates

[Timeline: the donor MC begins copying pages A and B to the recipient MC; TLB invalidates are broadcast to cores 1, 3, 5 and 12. A read of the remapped address A' arriving before the copy completes must stall in the OS.]

Page 21: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Lazy Copying

[Timeline: the donor MC starts copying pages A and B while cores 1, 3, 5 and 12 mark the pages read-only and the OS flushes dirty cache lines. Reads of A' during the copy still resolve to the old page A. Once the copy completes, TLB updates are broadcast and subsequent reads of A' go to the new copy, with no stall.]
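The handshake above can be modeled as a small state machine. A toy sketch I am adding, not the paper's mechanism: the state names, addresses and `translate` helper are all hypothetical, chosen to show why reads never stall during the copy.

```python
from enum import Enum, auto

class PageState(Enum):
    NORMAL = auto()
    COPYING = auto()    # read-only at the old address while the copy proceeds
    MIGRATED = auto()   # TLBs updated; the new address is live

class LazyCopy:
    """Toy model of lazy copying: reads keep resolving to the old (donor)
    address until the copy completes, instead of stalling on an
    immediate TLB shootdown."""

    def __init__(self, old_addr, new_addr):
        self.old, self.new = old_addr, new_addr
        self.state = PageState.NORMAL

    def start_copy(self):
        # Mark the page read-only and flush dirty cache lines, then copy.
        self.state = PageState.COPYING

    def translate(self, page_offset):
        # While copying, reads still go to the donor copy (no stall).
        base = self.new if self.state is PageState.MIGRATED else self.old
        return base + page_offset

    def complete_copy(self):
        # Copy done: broadcast TLB updates; later reads hit the recipient.
        self.state = PageState.MIGRATED

pg = LazyCopy(old_addr=0x1000, new_addr=0x8000)
pg.start_copy()
print(hex(pg.translate(0x20)))  # still the donor address during the copy
pg.complete_copy()
print(hex(pg.translate(0x20)))  # now the recipient address
```

The "delayed write-back" framing on the previous slide maps onto the COPYING state: the old copy stays readable until the new one is consistent.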

Page 22: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Methodology

• Simics-based simulation platform
• DRAMSim-based DRAM timing
• DRAM energy figures from CACTI 6.5
• Baseline: assign pages to the closest MC

CPU                   | 16-core out-of-order CMP, 3 GHz
L1 inst. and data $   | Private, 32 KB/2-way, 1-cycle access
L2 unified cache      | Shared, 2 MB/8-way, 4x4 S-NUCA, 3-cycle bank access
Total DRAM capacity   | 4 GB
DIMM configuration    | 8 DIMMs, 1 rank/DIMM, 64-bit channel, 8 devices/DIMM
α, β, λ, Λ, Γ         | 10, 20, 100, 100, 100

Page 23: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Results - Throughput

[Chart: throughput improvements over baseline. AFT: 17.1%; Dynamic Page Migration: 34.8%.]

Page 24: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Results – DRAM Locality

[Chart: row-buffer hit-rate improvements. AFT: 16.6%; Dynamic Page Migration: 22.7%. Standard deviation across MCs is down, i.e. increased fairness.]

Page 25: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers


Results – Reasons for Benefits

Page 26: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Sensitivity Studies

• Lazy copying helps a little
  – Average 3.2% improvement over migration without lazy copying
• Terms/variables in the cost function
  – Very sensitive to load and row-buffer hit rates, not as much to distance
• Cost of TLB shootdowns
  – Negligible, since fairly uncommon
• Physical placement of MCs: center or peripheral
  – Most workloads are agnostic to physical placement

Page 27: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Summary

• Multiple on-chip MCs will be common in future CMPs, with multiple cores sharing one MC
  – Intelligent data mapping will be needed to reduce average memory access delay
• Adaptive First Touch policy
  – Increases performance by 17.1%
  – Decreases DRAM energy consumption by 14.1%
• Dynamic page migration improves on AFT
  – A further 17.7% over AFT; 34.8% over baseline
  – Increases energy consumption by 5.2%

Page 28: Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers


Thank You

http://www.cs.utah.edu/arch-research