QoS-Aware Memory Systems - Carnegie Mellon
TRANSCRIPT
Trend: Many Cores on Chip
n Simpler and lower power than a single large core
n Large-scale parallelism on chip
IBM Cell BE 8+1 cores
Intel Core i7 8 cores
Tilera TILE Gx 100 cores, networked
IBM POWER7 8 cores
Intel SCC 48 cores, networked
Nvidia Fermi 448 “cores”
AMD Barcelona 4 cores
Sun Niagara II 8 cores
Many Cores on Chip
n What we want: q N times the system performance with N times the cores
n What do we get today?
Unfair Slowdowns due to Interference

[Figure: matlab (Core 1), a memory performance hog, and gcc (Core 2) running on two cores with low/high priorities]

Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service in multi-core systems,” USENIX Security 2007.
Uncontrolled Interference: An Example

[Diagram: a multi-core chip with two cores running matlab and gcc, each with a private L2 cache, connected by an interconnect to a DRAM memory controller and four shared DRAM banks (Banks 0–3); uncontrolled sharing of the DRAM memory system leads to unfairness]
Memory System is the Major Shared Resource

[Diagram: threads’ requests interfere throughout the shared memory system]
Much More of a Shared Resource in Future
Inter-Thread/Application Interference
n Problem: Threads share the memory system, but memory system does not distinguish between threads’ requests
n Existing memory systems q Free-for-all, shared based on demand q Control algorithms thread-unaware and thread-unfair q Aggressive threads can deny service to others q Do not try to reduce or control inter-thread interference
Uncontrolled Interference: An Example

[Diagram: the same multi-core chip, now running the stream and random applications on the two cores; the shared DRAM memory system again produces unfairness]
random:
// initialize large arrays A, B
for (j = 0; j < N; j++) {
    index = rand();
    A[index] = B[index];
    …
}
A Memory Performance Hog

STREAM (streaming)
- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive

// initialize large arrays A, B
for (j = 0; j < N; j++) {
    index = j * linesize;
    A[index] = B[index];
    …
}

RANDOM (random)
- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive
(inner loop as above, but with index = rand())

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.
What Does the Memory Hog Do?

[Figure: a DRAM bank with row decoder, row buffer, and column mux. The memory request buffer holds T0 (STREAM) requests, all to Row 0, interleaved with T1 (RANDOM) requests to Rows 16, 111, and 5. Row-hit-first scheduling keeps Row 0 open and services all of T0’s requests before T1’s]

Row size: 8KB, cache block size: 64B → 128 (8KB/64B) requests of T0 serviced before T1

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.
DRAM Controllers
n A row-conflict memory access takes significantly longer than a row-hit access
n Current controllers take advantage of the row buffer
n Commonly used scheduling policy (FR-FCFS) [Rixner 2000]*
(1) Row-hit first: Service row-hit memory accesses first (2) Oldest-first: Then service older accesses first
n This scheduling policy aims to maximize DRAM throughput n But, it is unfair when multiple threads share the DRAM system
*Rixner et al., “Memory Access Scheduling,” ISCA 2000. *Zuravleff and Robinson, “Controller for a synchronous DRAM …,” US Patent 5,630,096, May 1997.
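To make the two FR-FCFS rules concrete, here is a minimal arbiter sketch in C. This is illustrative only: the request fields and the single-open-row simplification are assumptions, not the controllers’ actual hardware.

// Minimal FR-FCFS arbiter sketch (illustrative; fields and the
// single-open-row simplification are assumptions, not real hardware).
typedef struct {
    int  valid;     // entry holds a request
    int  row;       // DRAM row targeted by the request
    long arrival;   // arrival timestamp; smaller = older
} Request;

// Returns the buffer index of the request to service next, or -1.
int fr_fcfs_pick(const Request buf[], int n, int open_row) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!buf[i].valid) continue;
        int hit      = (buf[i].row == open_row);
        int best_hit = (best >= 0) && (buf[best].row == open_row);
        if (best < 0 ||
            (hit && !best_hit) ||                       // (1) row-hit first
            (hit == best_hit &&
             buf[i].arrival < buf[best].arrival))       // (2) oldest first
            best = i;
    }
    return best;
}

Because rule (1) always wins over rule (2), a thread that keeps hitting the open row (like STREAM above) can starve another thread’s row-conflict requests — exactly the unfairness the next slides quantify.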
Effect of the Memory Performance Hog

[Bar chart: slowdown of each application when STREAM and RANDOM run together — STREAM: 1.18X slowdown, RANDOM: 2.82X slowdown]

Results on Intel Pentium D running Windows XP (similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)
[Additional bar charts: STREAM running with gcc, and STREAM running with Virtual PC, show similarly disproportionate slowdowns]
Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.
Greater Problem with More Cores
n Vulnerable to denial of service (DoS) [Usenix Security’07]
n Unable to enforce priorities or SLAs [MICRO’07,’10,’11, ISCA’08’11’12, ASPLOS’10]
n Low system performance [IEEE Micro Top Picks ’09,’11a,’11b,’12]
Uncontrollable, unpredictable system
Distributed DoS in Networked Multi-Core Systems

[Figure: a 64-core chip with cores connected via packet-switched routers on chip; attackers on Cores 1–8 inflict a ~5000X latency increase on a stock option pricing application running on Cores 9–64]

Grot, Hestness, Keckler, and Mutlu, “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009.
How Do We Solve The Problem?
n Inter-thread interference is uncontrolled in all memory resources q Memory controller q Interconnect q Caches
n We need to control it q i.e., design an interference-aware (QoS-aware) memory system
QoS-Aware Memory Systems: Challenges
n How do we reduce inter-thread interference? q Improve system performance and core utilization q Reduce request serialization and core starvation
n How do we control inter-thread interference? q Provide mechanisms to enable system software to enforce QoS policies q While providing high system performance
n How do we make the memory system configurable/flexible? q Enable flexible mechanisms that can achieve many goals
n Provide fairness or throughput when needed n Satisfy performance guarantees when needed
Designing QoS-Aware Memory Systems: Approaches
n Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism q QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix Security’07]
[Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11] [Ebrahimi+ ISCA’11, MICRO’11] [Ausavarungnirun+, ISCA’12][Subramanian+, HPCA’13]
q QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks ’11] [Grot+ MICRO’09, ISCA’11, Top Picks ’12]
q QoS-aware caches
n Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping q Source throttling to control access to memory system [Ebrahimi+ ASPLOS’10,
ISCA’11, TOCS’12] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10] [Nychis+ SIGCOMM’12]
q QoS-aware data mapping to memory controllers [Muralidhara+ MICRO’11]
q QoS-aware thread scheduling to cores [Das+ HPCA’13]
QoS-Aware Memory Scheduling
n How to schedule requests to provide q High system performance q High fairness to applications q Configurability to system software
n Memory controller needs to be aware of threads
[Diagram: four cores share a memory controller and memory; the memory controller resolves memory contention by scheduling requests]
QoS-Aware Memory Scheduling: Evolution n Stall-time fair memory scheduling [Mutlu+ MICRO’07]
q Idea: Estimate and balance thread slowdowns
q Takeaway: Proportional thread progress improves performance, especially when threads are “heavy” (memory intensive)
n Parallelism-aware batch scheduling [Mutlu+ ISCA’08, Top Picks’09]
q Idea: Rank threads and service in rank order (to preserve bank parallelism); batch requests to prevent starvation
n ATLAS memory scheduler [Kim+ HPCA’10]
Within-Thread Bank Parallelism

[Timeline figure: threads A and B each have one request to Bank 0 and one to Bank 1. Without ranking, the banks interleave the two threads’ requests, so each thread’s own accesses are serialized and both threads WAIT for roughly two bank access latencies. With ranking (thread A serviced before thread B in both banks), each thread’s requests are serviced back to back in different banks, overlapping their latencies — SAVED CYCLES]

Key Idea: Rank threads and service each thread’s requests back to back, so that a thread’s accesses to different banks proceed in parallel.
Parallelism-Aware Batch Scheduling [ISCA’08]
n Principle 1: Schedule requests from a thread back to back q Preserves each thread’s bank parallelism q But, this can cause starvation…
n Principle 2: Group a fixed number of oldest requests from each thread into a “batch” q Service the batch before all other requests q Form a new batch when the current batch is done q Eliminates starvation, provides fairness
[Figure: per-bank request queues holding requests from threads T0–T3; the oldest requests from each thread are grouped into a batch and serviced before all other requests]
QoS-Aware Memory Scheduling: Evolution n Stall-time fair memory scheduling [Mutlu+ MICRO’07]
q Idea: Estimate and balance thread slowdowns
q Takeaway: Proportional thread progress improves performance, especially when threads are “heavy” (memory intensive)
n Parallelism-aware batch scheduling [Mutlu+ ISCA’08, Top Picks’09]
q Idea: Rank threads and service in rank order (to preserve bank parallelism); batch requests to prevent starvation
q Takeaway: Preserving within-thread bank-parallelism improves performance; request batching improves fairness
n ATLAS memory scheduler [Kim+ HPCA’10]
q Idea: Prioritize threads that have attained the least service from the memory scheduler
q Takeaway: Prioritizing “light” threads improves performance
QoS-Aware Memory Scheduling: Evolution
n Thread cluster memory scheduling [Kim+ MICRO’10]
q Idea: Cluster threads into two groups (latency vs. bandwidth sensitive); prioritize the latency-sensitive ones; employ a fairness policy in the bandwidth sensitive group
q Takeaway: Heterogeneous scheduling policy that is different based on thread behavior maximizes both performance and fairness
n Integrated Memory Channel Partitioning and Scheduling [Muralidhara+ MICRO’11]
n Idea: Only prioritize very latency-sensitive threads in the scheduler; mitigate all other applications’ interference via channel partitioning
n Takeaway: Intelligently combining application-aware channel partitioning and memory scheduling provides better performance than either
QoS-Aware Memory Scheduling: Evolution
n Parallel application memory scheduling [Ebrahimi+ MICRO’11]
q Idea: Identify and prioritize limiter threads of a multithreaded application in the memory scheduler; provide fast and fair progress to non-limiter threads
q Takeaway: Carefully prioritizing between limiter and non-limiter threads of a parallel application improves performance
n Staged memory scheduling [Ausavarungnirun+ ISCA’12]
n Idea: Divide the functional tasks of an application-aware memory scheduler into multiple distinct stages, where each stage is significantly simpler than a monolithic scheduler
n Takeaway: Staging enables the design of a scalable and relatively simpler application-aware memory scheduler that works on very large request buffers
QoS-Aware Memory Scheduling: Evolution
n MISE [Subramanian+ HPCA’13]
n Idea: Estimate the performance of a thread by estimating its change in memory request service rate when run alone vs. shared → use this simple model to estimate slowdown, in order to design a scheduling policy that provides predictable performance or fairness
n Takeaway: Request service rate of a thread is a good proxy for its performance; alone request service rate can be estimated by giving high priority to the thread in memory scheduling for a while
QoS-Aware Memory Scheduling: Evolution
n Prefetch-aware shared resource management [Ebrahimi+ ISCA’12] [Ebrahimi+ MICRO’09] [Lee+ MICRO’08]
q Idea: Prioritize prefetches depending on how they affect system performance; even accurate prefetches can degrade performance of the system
q Takeaway: Carefully controlling and prioritizing prefetch requests improves performance and fairness
n DRAM-Aware last-level cache policies [Lee+ HPS Tech Report’10] q Idea: Design cache eviction and replacement policies such that they proactively exploit the state of the memory controller and DRAM (e.g., proactively evict data from the cache that hit in open rows)
q Takeaway: Coordination of last-level cache and DRAM policies improves performance and fairness
Stall-Time Fair Memory Scheduling
Onur Mutlu and Thomas Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors"
40th International Symposium on Microarchitecture (MICRO), pages 146-158, Chicago, IL, December 2007. Slides (ppt)
STFM Micro 2007 Talk
The Problem: Unfairness
n Vulnerable to denial of service (DoS) [Usenix Security’07]
n Unable to enforce priorities or SLAs [MICRO’07,’10,’11, ISCA’08’11’12, ASPLOS’10]
n Low system performance [IEEE Micro Top Picks ’09,’11a,’11b,’12]
Uncontrollable, unpredictable system
How Do We Solve the Problem? n Stall-time fair memory scheduling [Mutlu+ MICRO’07]
n Goal: Threads sharing main memory should experience similar slowdowns compared to when they are run alone → fair scheduling
n Also improves overall system performance by ensuring cores make “proportional” progress
n Idea: Memory controller estimates each thread’s slowdown due to interference and schedules requests in a way to balance the slowdowns
n Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007.
Stall-Time Fairness in Shared DRAM Systems
n A DRAM system is fair if it equalizes the slowdown of equal-priority threads relative to when each thread is run alone on the same system
n DRAM-related stall-time: The time a thread spends waiting for DRAM memory
n STshared: DRAM-related stall-time when the thread runs with other threads n STalone: DRAM-related stall-time when the thread runs alone
n Memory-slowdown = STshared/STalone q Relative increase in stall-time
n Stall-Time Fair Memory scheduler (STFM) aims to equalize Memory-slowdown for interfering threads, without sacrificing performance q Considers inherent DRAM performance of each thread q Aims to allow proportional progress of threads
STFM Scheduling Algorithm [MICRO’07]

n For each thread, the DRAM controller q Tracks STshared q Estimates STalone
n Each cycle, the DRAM controller q Computes Slowdown = STshared/STalone for threads with legal requests q Computes unfairness = MAX Slowdown / MIN Slowdown
n If unfairness < α q Use DRAM-throughput-oriented scheduling policy
n If unfairness ≥ α q Use fairness-oriented scheduling policy n (1) requests from thread with MAX Slowdown first n (2) row-hit first, (3) oldest-first
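A minimal sketch of the per-cycle unfairness check just described, in C. The structure and field names are assumptions; real STFM hardware estimates STalone with dedicated interference-tracking logic rather than a software loop.

// Sketch of STFM's per-cycle policy selection (structure assumed).
typedef struct {
    double st_shared;      // measured DRAM-related stall time so far
    double st_alone_est;   // estimated stall time if run alone
    int    has_request;    // thread has a legal request this cycle
} StfmThread;

// Returns 1 if the fairness-oriented policy should be used.
int stfm_use_fairness_policy(const StfmThread t[], int n, double alpha) {
    double max_sd = 0.0, min_sd = 1e30;
    for (int i = 0; i < n; i++) {
        if (!t[i].has_request) continue;
        double sd = t[i].st_shared / t[i].st_alone_est;  // Memory-slowdown
        if (sd > max_sd) max_sd = sd;
        if (sd < min_sd) min_sd = sd;
    }
    if (max_sd == 0.0) return 0;            // no active threads this cycle
    return (max_sd / min_sd) >= alpha;      // unfairness vs. threshold
}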
How Does STFM Prevent Unfairness?

[Animation: the same DRAM bank and request buffer as before, with T0’s Row 0 requests and T1’s requests to Rows 16, 111, and 5. STFM tracks T0’s and T1’s slowdowns (both starting at 1.00) and their ratio as requests are serviced; with α = 1.05, once the unfairness estimate exceeds α (e.g., slowdowns of 1.03 vs. 1.10), STFM switches to the fairness-oriented policy and services T1’s row-conflict requests (Row 16, Row 111) ahead of T0’s row hits, pulling the slowdowns back toward each other]
STFM Pros and Cons

n Upsides:
q Identifies fairness as an issue in multi-core memory scheduling q Good at providing fairness q Being fair improves performance
n Downsides: q Does not handle all types of interference q Somewhat complex to implement q Slowdown estimations can be incorrect
Parallelism-Aware Batch Scheduling
Onur Mutlu and Thomas Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems”
35th International Symposium on Computer Architecture (ISCA), pages 63-74, Beijing, China, June 2008. Slides (ppt)
PAR-BS ISCA 2008 Talk
Another Problem due to Interference
n Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests q Memory-Level Parallelism (MLP) q Out-of-order execution, non-blocking caches, runahead execution
n Effective only if the DRAM controller actually services the multiple requests in parallel in DRAM banks
n Multiple threads share the DRAM controller
n DRAM controllers are not aware of a thread’s MLP q Can service each thread’s outstanding requests serially, not in parallel
Bank Parallelism of a Thread

[Timeline figure, single thread: Thread A computes, then issues 2 DRAM requests — Bank 0, Row 1 and Bank 1, Row 1. The two bank access latencies are overlapped, so the thread stalls for ~ONE bank access latency before computing again]
Bank Parallelism Interference in DRAM

[Timeline figure, baseline scheduler: Thread A issues requests to Bank 0, Row 1 and Bank 1, Row 1; Thread B issues requests to Bank 0, Row 99 and Bank 1, Row 99. The scheduler interleaves the two threads in each bank, so each thread’s bank access latencies are serialized and each thread stalls for ~TWO bank access latencies]
Parallelism-Aware Scheduler

[Timeline figure: with the same four requests, the parallelism-aware scheduler services each thread’s two requests back to back in different banks. One thread stalls for ~one bank access latency and the other for ~two, so the average stall time drops to ~1.5 bank access latencies — saved cycles relative to the baseline scheduler’s ~two]
Parallelism-Aware Batch Scheduling (PAR-BS)

n Principle 1: Parallelism-awareness q Schedule requests from a thread (to different banks) back to back q Preserves each thread’s bank parallelism q But, this can cause starvation…
n Principle 2: Request Batching q Group a fixed number of oldest requests from each thread into a “batch” q Service the batch before all other requests q Form a new batch when the current one is done q Eliminates starvation, provides fairness q Allows parallelism-awareness within a batch
[Figure: per-bank request queues for threads T0–T3; the oldest requests from each thread form the batch that is serviced first]
Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling,” ISCA 2008.
PAR-BS Components
n Request batching
n Within-batch scheduling q Parallelism aware
Request Batching
n Each memory request has a bit (marked) associated with it
n Batch formation: q Mark up to Marking-Cap oldest requests per bank for each thread q Marked requests constitute the batch q Form a new batch when no marked requests are left (see the sketch below)
n Marked requests are prioritized over unmarked ones q No reordering of requests across batches: no starvation, high fairness
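A sketch of the batch-formation step over a simple array-based request buffer. The MARKING_CAP value and the field names are illustrative assumptions; hardware would do this with per-bank, per-thread counters rather than nested scans.

// Sketch of PAR-BS batch formation (MARKING_CAP value is an example).
#define MARKING_CAP 5

typedef struct { int valid, marked, thread, bank; long arrival; } BReq;

// Called when no marked requests remain: start a new batch by marking
// up to MARKING_CAP oldest requests per bank for each thread.
void form_batch(BReq buf[], int n, int num_threads, int num_banks) {
    for (int t = 0; t < num_threads; t++)
        for (int b = 0; b < num_banks; b++)
            for (int m = 0; m < MARKING_CAP; m++) {
                int oldest = -1;
                for (int i = 0; i < n; i++)
                    if (buf[i].valid && !buf[i].marked &&
                        buf[i].thread == t && buf[i].bank == b &&
                        (oldest < 0 || buf[i].arrival < buf[oldest].arrival))
                        oldest = i;
                if (oldest < 0) break;      // no more requests from (t, b)
                buf[oldest].marked = 1;     // request joins the batch
            }
}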
n How to prioritize requests within a batch?
Within-Batch Scheduling

n Can use any existing DRAM scheduling policy q FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality
n But, we also want to preserve intra-thread bank parallelism q Service each thread’s requests back to back
n HOW? The scheduler computes a ranking of threads when the batch is formed q Higher-ranked threads are prioritized over lower-ranked ones q Improves the likelihood that requests from a thread are serviced in parallel by different banks q Different threads are prioritized in the same order across ALL banks
How to Rank Threads within a Batch

n The ranking scheme affects system throughput and fairness
n Maximize system throughput q Minimize the average stall-time of threads within the batch
n Minimize unfairness (equalize the slowdown of threads) q Service threads with inherently low stall-time early in the batch q Insight: delaying memory non-intensive threads results in high slowdown
n Shortest stall-time first (shortest job first) ranking q Provides optimal system throughput [Smith, 1956]* q The controller estimates each thread’s stall-time within the batch q Ranks threads with shorter stall-time higher

* W.E. Smith, “Various optimizers for single stage production,” Naval Research Logistics Quarterly, 1956.
n Maximum number of marked requests to any bank (max-bank-load) q Rank thread with lower max-bank-load higher (~ low stall-time)
n Total number of marked requests (total-load) q Breaks ties: rank thread with lower total-load higher
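A sketch of this ranking rule as a comparator in C (struct layout assumed; the per-thread loads would be computed from the marked requests when the batch is formed):

// Sketch of shortest stall-time first ranking via qsort.
#include <stdlib.h>

typedef struct { int id, max_bank_load, total_load; } ThreadLoad;

// Lower max-bank-load ranks higher (~ lower stall time);
// lower total-load breaks ties.
static int rank_cmp(const void *a, const void *b) {
    const ThreadLoad *x = a, *y = b;
    if (x->max_bank_load != y->max_bank_load)
        return x->max_bank_load - y->max_bank_load;
    return x->total_load - y->total_load;
}

// qsort(threads, n, sizeof(ThreadLoad), rank_cmp) leaves the
// highest-ranked thread first — T0 > T1 > T2 > T3 in the example below.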
Shortest Stall-Time First Ranking

[Figure: marked requests from T0–T3 queued at Banks 0–3; T3 loads one bank with 5 requests, while T0 has at most 1 request per bank]

Thread | max-bank-load | total-load
T0 | 1 | 3
T1 | 2 | 4
T2 | 2 | 6
T3 | 5 | 9

Ranking: T0 > T1 > T2 > T3
Example Within-Batch Scheduling Order

[Figure: the same per-bank queues serviced over time under two orders. Baseline scheduling order (arrival order) — stall times T0: 4, T1: 4, T2: 5, T3: 7 bank access latencies, AVG: 5. PAR-BS scheduling order with ranking T0 > T1 > T2 > T3 — stall times T0: 1, T1: 2, T2: 4, T3: 7, AVG: 3.5 bank access latencies]
Putting It Together: PAR-BS Scheduling Policy

n PAR-BS scheduling policy:
(1) Marked requests first
(2) Row-hit requests first
(3) Higher-rank thread first (shortest stall-time first)
(4) Oldest first
(Rule 1 implements batching; rules 2–4 implement parallelism-aware within-batch scheduling)

n Three properties: q Exploits row-buffer locality and intra-thread bank parallelism q Work-conserving n Services unmarked requests to banks without marked requests
q Marking-Cap is important n Too small a cap: destroys row-buffer locality n Too large a cap: penalizes memory non-intensive threads
n Many more trade-offs are analyzed in the paper
Hardware Cost
n <1.5KB storage cost for q 8-core system with 128-entry memory request buffer
n No complex operations (e.g., divisions)
n Not on the critical path q Scheduler makes a decision only every DRAM cycle
Unfairness on 4-, 8-, 16-core Systems

Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]

[Bar chart: unfairness (lower is better) for FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems; PAR-BS reduces unfairness by 1.11X, 1.08X, and 1.11X, respectively (annotations from the slide)]
System Performance (Hmean-speedup)

[Bar chart: normalized Hmean speedup for FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems; PAR-BS improves performance by 8.3%, 6.1%, and 5.1%, respectively (annotations from the slide)]
PAR-BS Pros and Cons

n Upsides: q Identifies the problem of bank parallelism destruction across multiple threads q Simple mechanism
n Downsides: q Does not always prioritize the latency-sensitive applications → lower overall throughput q Implementation in multiple controllers needs coordination for best performance → coordination becomes too frequent, since batching is done frequently
ATLAS Memory Scheduler
Yoongu Kim, Dongsu Han, Onur Mutlu, and Mor Harchol-Balter, "ATLAS: A Scalable and High-Performance
Scheduling Algorithm for Multiple Memory Controllers" 16th International Symposium on High-Performance Computer Architecture (HPCA),
Bangalore, India, January 2010. Slides (pptx)
ATLAS HPCA 2010 Talk
Rethinking Memory Scheduling

A thread alternates between two states (episodes):
§ Compute episode: zero outstanding memory requests → high IPC
§ Memory episode: non-zero outstanding memory requests → low IPC

Goal: Minimize time spent in memory episodes

[Figure: outstanding memory requests over time, alternating between memory episodes and compute episodes]
How to Minimize Memory Episode Time

Prioritize the thread whose memory episode will end the soonest:
§ Minimizes time spent in memory episodes across all threads
§ Supported by queueing theory: Shortest-Remaining-Processing-Time scheduling is optimal in a single-server queue

But what is the remaining length of a memory episode?

[Figure: outstanding memory requests over time — “How much longer?”]
Predicting Memory Episode Lengths

We discovered: the past is an excellent predictor of the future.
Large attained service → large expected remaining service
Q: Why? A: Memory episode lengths are Pareto distributed…

[Figure: outstanding memory requests over time, divided into attained service (PAST) and remaining service (FUTURE)]
Pareto Distribution of Memory Episode Lengths

[Plot: Pr{Mem. episode > x} vs. x (cycles) for the memory episode lengths of SPEC benchmarks (e.g., 401.bzip2) follows a Pareto distribution]

Attained service correlates with remaining service: the longer an episode has lasted → the longer it will last.
Favoring the least-attained-service memory episode = favoring the memory episode that will end the soonest.
Least Attained Service (LAS) Memory Scheduling

Queueing theory: prioritize the job with the shortest remaining processing time — provably optimal. For memory episodes, remaining service correlates with attained service, and attained service is easily tracked by a per-thread counter.

Our approach: prioritize the memory episode with the least attained service. Least-attained-service (LAS) scheduling minimizes memory episode time. However, LAS does not consider long-term thread behavior.
Long-Term Thread Behavior

[Figure: Thread 1 alternates short memory episodes with compute episodes; Thread 2 has a long memory episode followed by long compute episodes. Short-term behavior alone would give Thread 1 higher priority, but prioritizing Thread 2 is more beneficial: it results in very long stretches of compute episodes]
Quantum-Based Attained Service of a Thread

We divide time into large, fixed-length intervals: quanta (millions of cycles).

[Figure: attained service measured over short-term thread behavior vs. over a full quantum of long-term thread behavior]
LAS Thread Ranking

During a quantum: each thread’s attained service (AS) is tracked by the MCs; ASi = a thread’s AS during only the i-th quantum.

End of a quantum: each thread’s TotalAS is computed as TotalASi = α · TotalASi-1 + (1 − α) · ASi, where higher α → more bias towards history. Threads are ranked, favoring threads with lower TotalAS.

Next quantum: threads are serviced according to their ranking.
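The end-of-quantum update is a simple exponentially weighted moving average; a minimal sketch in C (field names assumed):

// Sketch of ATLAS's end-of-quantum ranking update (names assumed;
// each MC tracks attained service with one counter per thread).
typedef struct { double total_as, as_this_quantum; } AtlasThread;

void atlas_end_of_quantum(AtlasThread t[], int n, double alpha) {
    for (int i = 0; i < n; i++) {
        // TotalAS_i = alpha * TotalAS_{i-1} + (1 - alpha) * AS_i;
        // higher alpha biases the ranking toward history.
        t[i].total_as = alpha * t[i].total_as
                      + (1.0 - alpha) * t[i].as_this_quantum;
        t[i].as_this_quantum = 0.0;   // restart the per-quantum counter
    }
    // Threads are then ranked favoring lower total_as (least attained
    // service) and serviced in that order during the next quantum.
}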
ATLAS Scheduling Algorithm

ATLAS: Adaptive per-Thread Least Attained Service. Request prioritization order:
1. Prevent starvation: over-threshold request
2. Maximize performance: higher LAS rank
3. Exploit locality: row-hit request
4. Tie-breaker: oldest request

How do we coordinate MCs to agree upon a consistent ranking?
System Throughput: 24-Core System

System throughput = ∑ speedup

[Plot: system throughput vs. number of memory controllers (1, 2, 4, 8, 16) for FCFS, FR-FCFS, STFM, PAR-BS, and ATLAS; ATLAS’s improvement annotations range from 3.5% to 17.0% (17.0%, 9.8%, 8.4%, 5.9%, 3.5%)]

ATLAS consistently provides higher system throughput than all previous scheduling algorithms.
System Throughput: 4-MC System

[Plot: system throughput vs. number of cores (4, 8, 16, 24, 32) for PAR-BS and ATLAS; ATLAS’s benefit annotations grow from 1.1% to 10.8% (1.1%, 3.5%, 4.0%, 8.4%, 10.8%)]

As the number of cores increases → the ATLAS performance benefit increases.
Properties of ATLAS

Goals → Properties:
§ Maximize system performance → LAS ranking, bank-level parallelism, row-buffer locality
§ Scalable to a large number of controllers → very infrequent coordination
§ Configurable by system software → scale attained service with thread weight (in paper)

§ Low complexity: attained service requires a single counter per thread in each MC
ATLAS Pros and Cons

n Upsides: q Good at improving overall throughput (compute-intensive threads are prioritized) q Low complexity q Coordination among controllers happens infrequently
n Downsides: q Lowest/medium-ranked threads get delayed significantly → high unfairness
TCM: Thread Cluster Memory Scheduling
Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter, "Thread Cluster Memory Scheduling:
Exploiting Differences in Memory Access Behavior" 43rd International Symposium on Microarchitecture (MICRO), pages 65-76, Atlanta, GA, December 2010. Slides (pptx) (pdf)
TCM Micro 2010 Talk
Previous Scheduling Algorithms are Biased

No previous memory scheduling algorithm provides both the best fairness and system throughput.

[Scatter plot: maximum slowdown (lower = better fairness) vs. weighted speedup (higher = better system throughput) for FCFS, FRFCFS, STFM, PAR-BS, and ATLAS; 24 cores, 4 memory controllers, 96 workloads. The algorithms cluster toward either a system throughput bias or a fairness bias]
Throughput vs. Fairness

Throughput-biased approach: prioritize less memory-intensive threads (lower intensity → higher priority). Good for throughput, but the deprioritized intensive thread can starve → unfairness.

Fairness-biased approach: threads take turns accessing memory. No thread starves, but less memory-intensive threads are not prioritized → reduced throughput.

A single policy for all threads is insufficient.
Achieving the Best of Both Worlds

For throughput: prioritize memory-non-intensive threads.

For fairness: unfairness is caused by memory-intensive threads being prioritized over each other → shuffle the thread ranking. Moreover, memory-intensive threads have different vulnerability to interference → shuffle asymmetrically.
Thread Cluster Memory Scheduling [Kim+ MICRO’10]

1. Group threads into two clusters
2. Prioritize the non-intensive cluster
3. Use different policies for each cluster

[Figure: threads in the system are split into a memory-non-intensive cluster (prioritized over the intensive cluster, ranked for throughput) and a memory-intensive cluster (priorities shuffled for fairness)]
Clustering Threads

Step 1: Sort threads by MPKI (misses per kilo-instruction).
Step 2: Memory bandwidth usage αT divides the clusters, where T = total memory bandwidth usage and α < 10% (ClusterThreshold): the lowest-MPKI threads whose summed bandwidth usage fits within αT form the non-intensive cluster; the remaining threads form the intensive cluster. A sketch of this step appears below.
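A minimal sketch of the two clustering steps in C, assuming the structure described on the slide (field names are illustrative):

// Sketch of TCM's clustering step.
#include <stdlib.h>

typedef struct { double mpki, bw_usage; int intensive; } TcmThread;

static int by_mpki(const void *a, const void *b) {
    double d = ((const TcmThread *)a)->mpki - ((const TcmThread *)b)->mpki;
    return (d > 0) - (d < 0);
}

// Step 1: sort by MPKI. Step 2: the lowest-MPKI threads whose summed
// bandwidth usage fits within alpha * T form the non-intensive cluster.
void tcm_cluster(TcmThread t[], int n, double alpha) {
    double total = 0.0, sum = 0.0;
    for (int i = 0; i < n; i++) total += t[i].bw_usage;   // T
    qsort(t, n, sizeof(TcmThread), by_mpki);
    for (int i = 0; i < n; i++) {
        sum += t[i].bw_usage;
        t[i].intensive = (sum > alpha * total);
    }
}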
Prioritization Between Clusters

Prioritize the non-intensive cluster (> priority over the intensive cluster):
• Increases system throughput – non-intensive threads have greater potential for making progress
• Does not degrade fairness – non-intensive threads are “light” and rarely interfere with intensive threads

Non-Intensive Cluster

Within the cluster, prioritize threads according to MPKI (lowest MPKI → highest priority):
• Increases system throughput – the least intensive thread has the greatest potential for making progress in the processor
Intensive Cluster

Within the cluster, periodically shuffle the priority of threads:
• Increases fairness – each thread is in turn the most prioritized
• But is treating all threads equally good enough? Equal turns ≠ same slowdown
Case Study: A Tale of Two Threads

Two intensive threads contending: 1. random-access, 2. streaming. Which is slowed down more easily?

[Bar charts: slowdowns under “prioritize random-access” vs. “prioritize streaming”; the prioritized thread stays at 1x while the other slows down 7x and 11x, respectively. The random-access thread is more easily slowed down]
Why are Threads Different?

[Figure: memory rows across Banks 1–4]
• random-access: all requests proceed in parallel to different banks → high bank-level parallelism; if even one request gets stuck behind another thread’s requests, the whole thread stalls → vulnerable to interference
• streaming: all requests → the same (activated) row → high row-buffer locality → causes interference
Niceness: How to Quantify the Difference Between Threads?

Niceness increases with vulnerability to interference (bank-level parallelism) and decreases with caused interference (row-buffer locality): high bank-level parallelism → high niceness; high row-buffer locality → low niceness.
Shuffling: Round-Robin vs. Niceness-Aware

1. Round-Robin shuffling ← What can go wrong?
[Figure: the priority order of threads A–D (least nice thread D at the top, nice thread at the bottom) rotates each ShuffleInterval: D, A, B, C, D, …]
GOOD: Each thread is prioritized once.
BAD: Nice threads receive lots of interference.

2. Niceness-Aware shuffling
[Figure: the shuffle sequence (e.g., D, C, B, A, D, …) keeps the least nice thread deprioritized most of the time]
GOOD: Each thread is prioritized once.
GOOD: The least nice thread stays mostly deprioritized.
TCM Outline

1. Clustering
2. Between Clusters
3. Non-Intensive Cluster (throughput)
4. Intensive Cluster (fairness)
TCM: Quantum-Based Operation

[Timeline: previous quantum (~1M cycles) → current quantum (~1M cycles), subdivided into shuffle intervals (~1K cycles)]

During a quantum: monitor thread behavior — 1. memory intensity, 2. bank-level parallelism, 3. row-buffer locality.
At the beginning of a quantum: perform clustering; compute the niceness of intensive threads.
TCM: Scheduling Algorithm

1. Highest-rank: requests from higher-ranked threads are prioritized
• Non-intensive cluster > intensive cluster
• Non-intensive cluster: lower intensity → higher rank
• Intensive cluster: rank shuffling
2. Row-hit: row-buffer hit requests are prioritized
3. Oldest: older requests are prioritized
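This priority order can be written as a request comparator; a minimal sketch in C (the per-thread rank is assumed to be precomputed at quantum boundaries, and the field names are illustrative):

// Sketch of TCM's request comparison.
typedef struct { int thread_rank, row_hit; long arrival; } TcmReq;

// Returns 1 if request a should be scheduled before request b.
int tcm_before(const TcmReq *a, const TcmReq *b) {
    if (a->thread_rank != b->thread_rank)      // 1. highest-rank first
        return a->thread_rank > b->thread_rank;
    if (a->row_hit != b->row_hit)              // 2. row-hit first
        return a->row_hit > b->row_hit;
    return a->arrival < b->arrival;            // 3. oldest first
}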
TCM: Implementation Cost

Required storage at the memory controller (24 cores); no computation is on the critical path:

Thread memory behavior | Storage
MPKI | ~0.2 Kbit
Bank-level parallelism | ~0.6 Kbit
Row-buffer locality | ~2.9 Kbit
Total | < 4 Kbits
Previous Work

FRFCFS [Rixner et al., ISCA’00]: prioritizes row-buffer hits – thread-oblivious → low throughput & low fairness
STFM [Mutlu et al., MICRO’07]: equalizes thread slowdowns – non-intensive threads not prioritized → low throughput
PAR-BS [Mutlu et al., ISCA’08]: prioritizes the oldest batch of requests while preserving bank-level parallelism – non-intensive threads not always prioritized → low throughput
ATLAS [Kim et al., HPCA’10]: prioritizes threads with less memory service – most intensive thread starves → low fairness
TCM: Throughput and Fairness

[Scatter plot: maximum slowdown (lower = better fairness) vs. weighted speedup (higher = better system throughput) for FRFCFS, STFM, PAR-BS, ATLAS, and TCM; 24 cores, 4 memory controllers, 96 workloads]

TCM, a heterogeneous scheduling policy, provides the best fairness and system throughput.
TCM: Fairness-Throughput Tradeoff

[Plot: when the configuration parameter is varied, adjusting ClusterThreshold traces a TCM curve that dominates FRFCFS, STFM, PAR-BS, and ATLAS in the maximum slowdown vs. weighted speedup space]

TCM allows a robust fairness-throughput tradeoff.
Operating System Support • ClusterThreshold is a tunable knob
– OS can trade off between fairness and throughput
• Enforcing thread weights – OS assigns weights to threads – TCM enforces thread weights within each cluster
Conclusion

• No previous memory scheduling algorithm provides both high system throughput and fairness – Problem: they use a single policy for all threads
• TCM groups threads into two clusters: 1. Prioritize the non-intensive cluster → throughput; 2. Shuffle priorities in the intensive cluster → fairness; 3. Shuffling should favor nice threads → fairness
• TCM provides the best system throughput and fairness
TCM Pros and Cons n Upsides:
q Provides both high fairness and high performance
n Downsides: q Scalability to large buffer sizes? q Effectiveness in a heterogeneous system?
Staged Memory Scheduling
Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel Loh, and Onur Mutlu, "Staged Memory Scheduling: Achieving High Performance
and Scalability in Heterogeneous Systems” 39th International Symposium on Computer Architecture (ISCA),
Portland, OR, June 2012.
SMS ISCA 2012 Talk
Executive Summary

n Observation: Heterogeneous CPU-GPU systems require memory schedulers with large request buffers
n Problem: Existing monolithic application-aware memory scheduler designs are hard to scale to large request buffer sizes
n Solution: Staged Memory Scheduling (SMS) decomposes the memory controller into three simple stages: 1) Batch formation: maintains row-buffer locality; 2) Batch scheduler: reduces interference between applications; 3) DRAM command scheduler: issues requests to DRAM
n Compared to state-of-the-art memory schedulers: q SMS is significantly simpler and more scalable q SMS provides higher performance and fairness
Stronger Memory Service Guarantees
Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu, "MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems"
Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)
Strong Memory Service Guarantees

n Goal: Satisfy performance bounds/requirements in the presence of shared main memory, prefetchers, heterogeneous agents, and hybrid memory
n Approach: q Develop techniques/models to accurately estimate the performance of an application/agent in the presence of resource sharing q Develop mechanisms (hardware and software) to enable the resource partitioning/prioritization needed to achieve the required performance levels for all applications q All the while providing high system performance
MISE: Providing Performance Predictability
in Shared Main Memory Systems
Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, Onur Mutlu
Unpredictable Application Slowdowns

[Bar charts: slowdowns of leslie3d (core 0) and gcc (core 1) when run together, vs. slowdowns of leslie3d (core 0) and mcf (core 1); leslie3d’s slowdown changes markedly depending on its co-runner]

An application’s performance depends on which application it is running with.
Need for Predictable Performance

n There is a need for predictable performance q When multiple applications share resources q Especially if some applications require performance guarantees
n Example 1: In mobile systems q Interactive applications run with non-interactive applications q Need to guarantee performance for interactive applications
n Example 2: In server systems q Different users’ jobs are consolidated onto the same server q Need to provide bounded slowdowns to critical jobs
Our Goal: Predictable performance
in the presence of memory interference
Outline
1. Estimate Slowdown q Key Observations q Implementation q MISE Model: Putting it All Together q Evaluating the Model
2. Control Slowdown q Providing Soft Slowdown Guarantees q Minimizing Maximum Slowdown
Slowdown: Definition

Slowdown = Performance_Alone / Performance_Shared
Key Observation 1

For a memory-bound application, Performance ∝ memory request service rate.

[Plot: normalized performance vs. normalized request service rate for omnetpp, mcf, and astar; Intel Core i7, 4 cores, memory bandwidth 8.5 GB/s]

So instead of Slowdown = Performance_Alone / Performance_Shared (harder to measure), we can use Slowdown = RequestServiceRate_Alone / RequestServiceRate_Shared (easier).
Key Observation 2

Request Service Rate Alone (RSRAlone) of an application can be estimated by giving the application the highest priority in accessing memory.
Highest priority → little interference (almost as if the application were run alone).
Key Observation 2

[Figure: request buffer states and main-memory service orders for three cases — 1. the application runs alone; 2. it runs with another application; 3. it runs with another application but is given the highest priority. With the highest priority, its requests are serviced in nearly the same number of time units as when run alone]
Memory Interference-induced Slowdown Estimation (MISE) model for memory-bound applications:

Slowdown = RSRAlone / RSRShared
Key Observation 3

n Memory-bound application

[Timeline figure: with no interference, requests in the memory phase complete back to back; with interference, each request takes longer. The memory-phase slowdown dominates the overall slowdown]
Key Observation 3

n Non-memory-bound application

[Timeline figure: the application spends a fraction α of its time in the memory phase and 1 − α in the compute phase; only the memory fraction (α) slows down with interference]

Memory Interference-induced Slowdown Estimation (MISE) model for non-memory-bound applications:

Slowdown = (1 − α) + α · RSRAlone / RSRShared
Measuring RSRShared and α

n Request Service Rate Shared (RSRShared) q Per-core counter tracks the number of requests serviced q At the end of each interval: RSRShared = (Number of Requests Serviced) / (Interval Length)
n Memory Phase Fraction (α) q Count the number of stall cycles at the core q Compute the fraction of cycles stalled for memory
Estimating Request Service Rate Alone (RSRAlone)

Goal: estimate RSRAlone. How: periodically give each application the highest priority in accessing memory.

n Divide each interval into shorter epochs
n At the beginning of each epoch, the memory controller randomly picks an application as the highest-priority application
n At the end of an interval, for each application, estimate:

RSRAlone = (Number of Requests During High Priority Epochs) / (Number of Cycles Application Given High Priority)
Inaccuracy in Estimating RSRAlone

n Even when an application has the highest priority, it still experiences some interference

[Figure: service orders with and without high priority; a high-priority request can still wait behind another application’s request that was issued previously, adding interference cycles]
Accounting for Interference in RSRAlone Estimation

n Solution: determine and remove interference cycles from the RSRAlone calculation
n A cycle is an interference cycle if q a request from the highest-priority application is waiting in the request buffer and q another application’s request was issued previously

RSRAlone = (Number of Requests During High Priority Epochs) / (Number of Cycles Application Given High Priority − Interference Cycles)
Outline
1. Estimate Slowdown q Key Observations q Implementation q MISE Model: Putting it All Together q Evaluating the Model
2. Control Slowdown q Providing Soft Slowdown Guarantees q Minimizing Maximum Slowdown
MISE Model: Putting it All Together

[Timeline: time is divided into intervals. During each interval: measure RSRShared and α, and estimate RSRAlone. At the end of each interval: estimate slowdown, then repeat]
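Putting the three observations together, the end-of-interval computation is small; a minimal sketch in C (the counter names are assumptions about what the per-core hardware would track):

// Sketch of the end-of-interval MISE slowdown estimate.
typedef struct {
    long   reqs_serviced;     // requests serviced this interval
    long   reqs_hp;           // requests serviced in high-priority epochs
    long   cycles_hp;         // cycles this app had highest priority
    long   interference_cyc;  // interference cycles while highest priority
    double alpha;             // fraction of cycles stalled for memory
} MiseCounters;

double mise_slowdown(const MiseCounters *c, long interval_len) {
    double rsr_shared = (double)c->reqs_serviced / (double)interval_len;
    double rsr_alone  = (double)c->reqs_hp /
                        (double)(c->cycles_hp - c->interference_cyc);
    // Non-memory-bound model; reduces to RSRAlone/RSRShared when alpha = 1.
    return (1.0 - c->alpha) + c->alpha * (rsr_alone / rsr_shared);
}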
Previous Work on Slowdown Estimation

n Previous work on slowdown estimation q STFM (Stall Time Fair Memory) Scheduling [Mutlu+, MICRO ’07] q FST (Fairness via Source Throttling) [Ebrahimi+, ASPLOS ’10] q Per-thread Cycle Accounting [Du Bois+, HiPEAC ’13]
n Basic idea: Slowdown = StallTime_Shared / StallTime_Alone
n Measuring StallTime_Shared is easy; estimating StallTime_Alone is hard — it requires counting the number of cycles the application receives interference
Two Major Advantages of MISE Over STFM

n Advantage 1: q STFM estimates alone performance while an application is receiving interference → hard q MISE estimates alone performance while giving an application the highest priority → easier
n Advantage 2: q STFM does not take into account the compute phase for non-memory-bound applications q MISE accounts for the compute phase → better accuracy
Methodology n Configuration of our simulated system
q 4 cores q 1 channel, 8 banks/channel q DDR3 1066 DRAM q 512 KB private cache/core
n Workloads
q SPEC CPU2006 q 300 multiprogrammed workloads
Quantitative Comparison

[Plot: slowdown over time (million cycles) of the SPEC CPU2006 application leslie3d — actual slowdown vs. the STFM and MISE estimates]
Comparison to STFM

[Plots: actual vs. estimated slowdown over time for cactusADM, GemsFDTD, soplex, wrf, calculix, and povray]

Average error of MISE: 8.2%; average error of STFM: 29.4% (across 300 workloads)
Providing “Soft” Slowdown Guarantees

n Goals: 1. Ensure QoS-critical applications meet a prescribed slowdown bound; 2. Maximize system performance for the other applications
n Basic Idea: q Allocate just enough bandwidth to the QoS-critical application q Assign the remaining bandwidth to the other applications
MISE-QoS: Mechanism to Provide Soft QoS

n Assign an initial bandwidth allocation to the QoS-critical application
n Estimate the slowdown of the QoS-critical application using the MISE model
n After every N intervals: q If slowdown > bound B + ε, increase the bandwidth allocation q If slowdown < bound B − ε, decrease the bandwidth allocation
n When the slowdown bound is not met for N intervals: q Notify the OS so it can migrate/de-schedule jobs
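A minimal sketch of this interval-based control loop in C. The 10% step size and the notify_os() hook are assumptions for illustration; B, ε, and N come from the slide.

// Sketch of the MISE-QoS control loop, run every interval.
extern void notify_os(void);   // hypothetical hook: migrate/de-schedule

void mise_qos_interval(double slowdown, double B, double eps, int N,
                       double *bw_alloc, int *missed) {
    if (slowdown > B + eps)
        *bw_alloc *= 1.10;     // QoS-critical app needs more bandwidth
    else if (slowdown < B - eps)
        *bw_alloc *= 0.90;     // reclaim bandwidth for other apps
    if (slowdown > B) {
        if (++(*missed) >= N) notify_os();   // bound missed for N intervals
    } else {
        *missed = 0;
    }
}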
Methodology

n Each application (25 applications in total) is considered as the QoS-critical application and run with 12 sets of co-runners of different memory intensities: 300 multiprogrammed workloads in total, each run with 10 slowdown bound values
n Baseline memory scheduling mechanism: q Always prioritize the QoS-critical application [Iyer+, SIGMETRICS 2007] q Other applications’ requests scheduled in FRFCFS order [Zuravleff+, US Patent 1997; Rixner+, ISCA 2000]
A Look at One Workload

[Bar chart: slowdowns of leslie3d (QoS-critical) and hmmer, lbm, and omnetpp (non-QoS-critical) under AlwaysPrioritize and MISE-QoS-10/1, 10/3, 10/5, 10/7, and 10/9 (slowdown bounds of 10, 3.33, 2, …)]

MISE is effective in 1. meeting the slowdown bound for the QoS-critical application and 2. improving the performance of the non-QoS-critical applications.
Effectiveness of MISE in Enforcing QoS

Across 3000 data points:

| Predicted Met | Predicted Not Met
QoS Bound Met | 78.8% | 2.1%
QoS Bound Not Met | 2.2% | 16.9%

MISE-QoS meets the bound for 80.9% of workloads (AlwaysPrioritize meets it for 83%) and correctly predicts whether or not the bound is met for 95.7% of workloads.
Performance of Non-QoS-Critical Applications

[Bar chart: harmonic speedup of the non-QoS-critical applications vs. the number of memory-intensive applications (0–3 and average) for AlwaysPrioritize and MISE-QoS-10/1 through 10/9]

Performance is higher when the bound is loose; when the slowdown bound is 10/3, MISE-QoS improves system performance by 10%.
Other Results in the Paper n Sensitivity to model parameters
q Robust across different values of model parameters
n Comparison of STFM and MISE models in enforcing soft slowdown guarantees q MISE significantly more effective in enforcing guarantees
n Minimizing maximum slowdown
q MISE improves fairness across several system configurations
Summary

n Uncontrolled memory interference slows down applications unpredictably
n Goal: Estimate and control slowdowns
n Key contribution: q MISE: an accurate slowdown estimation model q Average error of MISE: 8.2%
n Key idea: q Request service rate is a proxy for performance q Request service rate alone is estimated by giving an application the highest priority in accessing memory
n Leverage slowdown estimates to control slowdowns: q Providing soft slowdown guarantees q Minimizing maximum slowdown
Memory Scheduling for Parallel Applications
Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo Lee, Onur Mutlu, and Yale N. Patt,
"Parallel Application Memory Scheduling" Proceedings of the 44th International Symposium on Microarchitecture (MICRO),
Porto Alegre, Brazil, December 2011. Slides (pptx)
Handling Interference in Parallel Applications

n Threads in a multithreaded application are inter-dependent
n Some threads can be on the critical path of execution due to synchronization; some threads are not
n How do we schedule requests of inter-dependent threads to maximize multithreaded application performance?
n Idea: Estimate limiter threads likely to be on the critical path and prioritize their requests; shuffle priorities of non-limiter threads to reduce memory interference among them [Ebrahimi+, MICRO’11]
n Hardware/software cooperative limiter thread estimation: q Thread executing the most contended critical section q Thread that is falling behind the most in a parallel for loop

PAMS Micro 2011 Talk
Aside: Self-Optimizing Memory Controllers
Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana, "Self Optimizing Memory Controllers: A Reinforcement Learning Approach" Proceedings of the 35th International Symposium on Computer Architecture (ISCA),
pages 39-50, Beijing, China, June 2008. Slides (pptx)
Why are DRAM Controllers Difficult to Design?

n Need to obey DRAM timing constraints for correctness q There are many (50+) timing constraints in DRAM q tWTR: minimum number of cycles to wait before issuing a read command after a write command is issued q tRC: minimum number of cycles between the issuing of two consecutive activate commands to the same bank q … (see the timing-check sketch after this list)
n Need to keep track of many resources to prevent conflicts q Channels, banks, ranks, data bus, address bus, row buffers
n Need to handle DRAM refresh
n Need to optimize for performance (in the presence of constraints) q Reordering is not simple q Predicting the future?
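To give a flavor of the per-bank bookkeeping these constraints imply, here is a minimal sketch of checks for the two constraints named above. The struct and the cycle values are assumptions; real tWTR/tRC values come from the DRAM datasheet.

// Sketch of per-bank timing checks for tWTR and tRC.
typedef struct { long last_write, last_activate; } BankState;

// tWTR: minimum cycles between a write command and a subsequent read.
int can_issue_read(const BankState *b, long now, int tWTR) {
    return now - b->last_write >= tWTR;
}

// tRC: minimum cycles between two consecutive activates to the same bank.
int can_issue_activate(const BankState *b, long now, int tRC) {
    return now - b->last_activate >= tRC;
}

A real controller multiplies this by 50+ constraints across channels, banks, ranks, and buses — which is why hand-designing a good scheduling policy is hard.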
Many DRAM Timing Constraints
n From Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems,” HPS Technical Report, April 2010.
More on DRAM Operation and Constraints
n Kim et al., “A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM,” ISCA 2012.
n Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.
Self-Optimizing DRAM Controllers

n Problem: DRAM controllers are difficult to design → it is difficult for human designers to devise a policy that can adapt itself very well to different workloads and different system conditions
n Idea: Design a memory controller that adapts its scheduling policy decisions to workload behavior and system conditions using machine learning.
n Observation: Reinforcement learning maps nicely to memory control.
n Design: The memory controller is a reinforcement learning agent that dynamically and continuously learns and employs the best scheduling policy.
Self-Optimizing DRAM Controllers n Engin Ipek, Onur Mutlu, José F. Martínez, and Rich
Caruana, "Self Optimizing Memory Controllers: A Reinforcement Learning Approach" Proceedings of the 35th International Symposium on Computer Architecture (ISCA), pages 39-50, Beijing, China, June 2008.
Performance Results

[Figure: performance results of the self-optimizing (reinforcement learning) memory controller]
QoS-Aware Memory Systems: The Dumb Resources Approach
Designing QoS-Aware Memory Systems: Approaches
n Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism q QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix Security’07]
[Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11] [Ebrahimi+ ISCA’11, MICRO’11] [Ausavarungnirun+, ISCA’12] [Subramanian+, HPCA’13]
q QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks ’11] [Grot+ MICRO’09, ISCA’11, Top Picks ’12]
q QoS-aware caches
n Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping q Source throttling to control access to memory system [Ebrahimi+ ASPLOS’10,
ISCA’11, TOCS’12] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10]
q QoS-aware data mapping to memory controllers [Muralidhara+ MICRO’11]
q QoS-aware thread scheduling to cores [Das+ HPCA’13]
Fairness via Source Throttling
Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt, "Fairness via Source Throttling: A Configurable and High-Performance
Fairness Substrate for Multi-Core Memory Systems" 15th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS),
pages 335-346, Pittsburgh, PA, March 2010. Slides (pdf)
FST ASPLOS 2010 Talk
Many Shared Resources

[Diagram: Cores 0 through N share an on-chip cache and a memory controller; beyond the chip boundary, the controller connects to off-chip DRAM Banks 0 through K — the shared memory resources]
The Problem with “Smart Resources”
n Independent interference control mechanisms in caches, interconnect, and memory can contradict each other
n Explicitly coordinating mechanisms for different resources requires complex implementation
n How do we enable fair sharing of the entire memory system by controlling interference in a coordinated manner?
An Alternative Approach: Source Throttling

n Manage inter-thread interference at the cores, not at the shared resources
n Dynamically estimate unfairness in the memory system, feed back this information into a controller, and throttle cores’ memory access rates accordingly q Whom to throttle and by how much depends on the performance target (throughput, fairness, per-thread QoS, etc.) q E.g., if unfairness > system-software-specified target, then throttle down the core causing unfairness and throttle up the core that was unfairly treated
n Ebrahimi et al., “Fairness via Source Throttling,” ASPLOS’10, TOCS’12.
[Figure: Unmanaged interference — intensive application A generates many requests (A1–A4) that sit ahead of B1 in the queue of requests to the shared memory resources (request generation order: A1, A2, A3, A4, B1). Core A computes while stalling briefly on each of A1–A4; core B stalls on B1 for almost the whole time, waiting for the shared resources. Fair source throttling — application A’s interference with application B is dynamically detected and A is throttled down; A’s later requests are held back so B1 is serviced right after A1, saving cycles for core B at the cost of a few extra cycles for core A]
Fairness via Source Throttling (FST)
n Two components (interval-based)
n Run-time unfairness evaluation (in hardware) q Dynamically estimates the unfairness in the memory system q Estimates which application is slowing down which other
n Dynamic request throttling (hardware or software) q Adjusts how aggressively each core makes requests to the shared resources q Throttles down request rates of cores causing unfairness
n Limit miss buffers, limit injection rate
Fairness via Source Throttling (FST)

[Diagram, repeated every interval: Runtime Unfairness Evaluation — 1. estimate system unfairness; 2. find the app with the highest slowdown (App-slowest); 3. find the app causing the most interference for App-slowest (App-interfering) — feeds the Unfairness Estimate, App-slowest, and App-interfering into Dynamic Request Throttling: if (Unfairness Estimate > Target) { 1. throttle down App-interfering; 2. throttle up App-slowest }. Slowdown estimation repeats over intervals 1, 2, 3, …]
Estimating System Unfairness

n Unfairness = Max{Slowdown_i} / Min{Slowdown_i} over all applications i
n Slowdown of application i = Ti_Shared / Ti_Alone
n How can Ti_Alone be estimated in shared mode?
n Ti_Excess is the number of extra cycles it takes application i to execute due to interference
n Ti_Alone = Ti_Shared − Ti_Excess
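A minimal sketch of this accounting in C (names assumed; the hardware that decides the interfered[] bits is described on the next slides):

// Sketch of FST's per-core cycle accounting.
typedef struct { long t_shared, t_excess; } CoreCycles;

// Every cycle, cores whose interference bit is set accumulate
// excess cycles.
void fst_tick(CoreCycles c[], const int interfered[], int n) {
    for (int i = 0; i < n; i++) {
        c[i].t_shared++;                      // shared-mode execution time
        if (interfered[i]) c[i].t_excess++;   // delayed by interference
    }
}

// Slowdown_i = Ti_Shared / Ti_Alone, with Ti_Alone = Ti_Shared - Ti_Excess.
double fst_slowdown(const CoreCycles *c) {
    return (double)c->t_shared / (double)(c->t_shared - c->t_excess);
}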
Tracking Inter-Core Interference

[Diagram: Cores 0–3 share a cache and a memory controller with Banks 0–7; the FST hardware keeps an interference-per-core bit vector (one bit per core). Three interference sources are tracked: 1. shared cache, 2. DRAM bus and bank, 3. DRAM row buffers]
Tracking DRAM Row-Buffer Interference

[Diagram: FST keeps a Shadow Row Address Register (SRAR) per core, tracking the row that would be open had the core run alone. Example at Bank 2: Core 1’s request to Row B would have been a row hit (Core 1’s SRAR holds Row B), but Core 0 opened Row A, turning it into a row conflict; this interference-induced row conflict sets Core 1’s bit in the interference-per-core bit vector]
Tracking Inter-Core Interference

[Animation: per-core Excess Cycles counters sit alongside the interference-per-core bit vector. Each cycle (T, T+1, T+2, T+3, …) in which a core’s request is delayed due to interference, that core’s excess cycles counter increments. The accumulated count gives Ti_Excess, and Ti_Alone = Ti_Shared − Ti_Excess]
Tracking Inter-Core Interference

n To identify App-interfering, for each core i, FST separately tracks the interference caused by each core j (j ≠ i)

[Diagram: a pairwise interference bit matrix and a pairwise excess cycles matrix hold one entry Cnt j,i per (interfering core j, interfered-with core i) pair, with the diagonal unused. For example, when core 2 interferes with core 1, Cnt 2,1 is incremented. If App-slowest = 2, the row with the largest count in App-slowest’s column determines App-interfering]
Dynamic Request Throttling

n Goal: Adjust how aggressively each core makes requests to the shared memory system
n Mechanisms: q Miss Status Holding Register (MSHR) quota — controls the number of concurrent requests accessing shared resources from each application q Request injection frequency — controls how often memory requests are issued to the last-level cache from the MSHRs
Dynamic Request Throttling

n Throttling level assigned to each core determines both MSHR quota and request injection rate

Throttling level | MSHR quota | Request Injection Rate
100%             | 128        | Every cycle
50%              | 64         | Every other cycle
25%              | 32         | Once every 4 cycles
10%              | 12         | Once every 10 cycles
5%               | 6          | Once every 20 cycles
4%               | 5          | Once every 25 cycles
3%               | 3          | Once every 30 cycles
2%               | 2          | Once every 50 cycles

Total # of MSHRs: 128

158
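The table above maps directly to a small lookup structure; a minimal sketch (struct and field names assumed):

// Sketch: throttling level -> MSHR quota and injection interval, per the table above.
struct throttle { int level_pct; int mshr_quota; int inject_every_n_cycles; };

static const struct throttle levels[] = {
    {100, 128,  1}, {50, 64,  2}, {25, 32,  4}, {10, 12, 10},
    {  5,   6, 20}, { 4,  5, 25}, { 3,  3, 30}, { 2,  2, 50},
};

// A core at levels[k] may keep at most mshr_quota requests outstanding and may
// issue a request from its MSHRs to the last-level cache only once every
// inject_every_n_cycles cycles (128 MSHRs total).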
FST at Work

[Diagram: Timeline across Interval i, Interval i+1, and Interval i+2, with Slowdown Estimation running during each interval and Runtime Unfairness Evaluation feeding Dynamic Request Throttling at each boundary. System software fairness goal: 1.4. At the end of Interval i, the Unfairness Estimate is 3 with App-slowest = Core 2 and App-interfering = Core 0, so FST throttles Core 0 down and Core 2 up. At the end of Interval i+1, the estimate is 2.5 with App-slowest = Core 2 and App-interfering = Core 1, so Core 1 is throttled down and Core 2 up.

Throttling levels (Core 0 / Core 1 / Core 2 / Core 3):
Interval i:   50% / 100% / 10% / 100%
Interval i+1: 25% / 100% / 25% / 100%
Interval i+2: 25% / 50% / 50% / 100%]

159
System Software Support

n Different fairness objectives can be configured by system software
q Keep maximum slowdown in check
n Estimated Max Slowdown < Target Max Slowdown
q Keep slowdown of particular applications in check to achieve a particular performance target
n Estimated Slowdown(i) < Target Slowdown(i)
n Support for thread priorities
q Weighted Slowdown(i) = Estimated Slowdown(i) x Weight(i)

160
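A minimal sketch of the interval-end checks for these configurable objectives; the parameter names and the exact check structure are assumptions, not from the slides:

// Sketch: objective 1, keep maximum slowdown in check.
int max_slowdown_ok(const double est_sd[], int n, double target_max) {
    double max_sd = 0.0;
    for (int i = 0; i < n; i++)
        if (est_sd[i] > max_sd) max_sd = est_sd[i];
    return max_sd < target_max;               // Estimated Max Slowdown < Target Max Slowdown
}

// Sketch: objective 2, per-application slowdown targets.
int per_app_slowdown_ok(const double est_sd[], const double target_sd[], int n) {
    for (int i = 0; i < n; i++)
        if (est_sd[i] >= target_sd[i]) return 0; // Estimated Slowdown(i) < Target Slowdown(i)
    return 1;
}

// Sketch: thread-priority support via weighted slowdowns.
double weighted_slowdown(double est_sd_i, double weight_i) {
    return est_sd_i * weight_i;               // Weighted Slowdown(i) = Estimated Slowdown(i) x Weight(i)
}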
FST Hardware Cost
n Total storage cost required for 4 cores is ~12KB
n FST does not require any structures or logic that are on the processor’s critical path
161
FST Evaluation Methodology

n x86 cycle-accurate simulator
n Baseline processor configuration
q Per-core: 4-wide issue, out-of-order, 256-entry ROB
q Shared (4-core system): 128 MSHRs; 2 MB, 16-way L2 cache
q Main memory: DDR3 1333 MHz; latency of 15ns per command (tRP, tRCD, CL); 8B-wide core-to-memory bus

162
FST: System Unfairness Results

[Chart: FST's unfairness reduction across workloads; key annotations: 44.4% and 36%.]

163

FST: System Performance Results

[Chart: FST's system performance improvement across workloads; key annotations: 25.6% and 14%.]

164
Source Throttling Results: Takeaways
n Source throttling alone provides better performance than a combination of “smart” memory scheduling and fair caching q Decisions made at the memory scheduler and the cache
sometimes contradict each other
n Neither source throttling alone nor “smart resources” alone provides the best performance
n Combined approaches are even more powerful q Source throttling and resource-based interference control
165
FST ASPLOS 2010 Talk
Designing QoS-Aware Memory Systems: Approaches
n Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism q QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix Security’07]
[Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11] [Ebrahimi+ ISCA’11, MICRO’11] [Ausavarungnirun+, ISCA’12] [Subramanian+, HPCA’13]
q QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks ’11] [Grot+ MICRO’09, ISCA’11, Top Picks ’12]
q QoS-aware caches
n Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping q Source throttling to control access to memory system [Ebrahimi+ ASPLOS’10,
ISCA’11, TOCS’12] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10] [Nychis+ SIGCOMM’12]
q QoS-aware data mapping to memory controllers [Muralidhara+ MICRO’11]
q QoS-aware thread scheduling to cores [Das+ HPCA’13]
166
Memory Channel Partitioning
Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via
Application-Aware Memory Channel Partitioning” 44th International Symposium on Microarchitecture (MICRO),
Porto Alegre, Brazil, December 2011. Slides (pptx)
MCP Micro 2011 Talk
Outline
168
Goal: Mitigate
Inter-Application Interference
Previous Approach: Application-Aware Memory
Request Scheduling
Our First Approach: Application-Aware Memory
Channel Partitioning
Our Second Approach: Integrated Memory
Partitioning and Scheduling
Application-Aware Memory Request Scheduling
n Monitor application memory access characteristics
n Rank applications based on memory access
characteristics n Prioritize requests at the memory controller,
based on ranking
169
An Example: Thread Cluster Memory Scheduling

[Figure (Kim et al., MICRO 2010): Threads in the system are grouped into a memory-non-intensive cluster and a memory-intensive cluster. The non-intensive cluster is prioritized over the intensive cluster (for throughput); within the intensive cluster, thread priorities are shuffled (for fairness).]

170
Application-Aware Memory Request Scheduling
171
Advantages n Reduces interference between applications by
request reordering n Improves system performance
Disadvantages n Requires modifications to memory scheduling logic for
q Ranking q Prioritization
n Cannot completely eliminate interference by request reordering
Our Approach

Previous Approach: Application-Aware Memory Request Scheduling
Our First Approach: Application-Aware Memory Channel Partitioning
Our Second Approach: Integrated Memory Partitioning and Scheduling

Goal: Mitigate Inter-Application Interference

172
Observation: Modern Systems Have Multiple Channels

A new degree of freedom: mapping data across multiple channels

[Figure: Two cores (Red App, Blue App) connect to two memory controllers, each driving its own channel (Channel 0, Channel 1) to memory.]

173
Data Mapping in Current Systems

[Figure: Pages of both the Red App and the Blue App are interleaved across Channel 0 and Channel 1.]

Causes interference between applications' requests

174
Partitioning Channels Between Applications

[Figure: The Red App's pages are mapped only to Channel 0 and the Blue App's pages only to Channel 1.]

Eliminates interference between applications' requests

175
Overview: Memory Channel Partitioning (MCP)
n Goal q Eliminate harmful interference between applications
n Basic Idea
q Map the data of badly-interfering applications to different channels
n Key Principles
q Separate low and high memory-intensity applications q Separate low and high row-buffer locality applications
176
Key Insight 1: Separate by Memory Intensity

High memory-intensity applications interfere with low memory-intensity applications in shared memory channels

[Figure: Conventional page mapping vs. channel partitioning over time units 1-5, with two channels of two banks each. Under conventional mapping, the low-intensity (Red) app's requests wait behind the high-intensity (Blue) app's requests in the shared channels. Under channel partitioning, each app's data sits in its own channel, and the Red app finishes earlier (Saved Cycles).]

Map data of low and high memory-intensity applications to different channels

177
Key Insight 2: Separate by Row-Buffer Locality

High row-buffer locality applications interfere with low row-buffer locality applications in shared memory channels

[Figure: Request buffer state and service order (time units 1-6) for requests R0-R4 across two channels of two banks each. Under conventional page mapping, a low-locality request causes a row conflict that delays the high-locality app's row hits. Under channel partitioning, the two apps' requests are serviced in separate channels, saving cycles.]

Map data of low and high row-buffer locality applications to different channels

178
Memory Channel Partitioning (MCP) Mechanism

1. Profile applications (Hardware)
2. Classify applications into groups (System Software)
3. Partition channels between application groups (System Software)
4. Assign a preferred channel to each application (System Software)
5. Allocate application pages to preferred channel (System Software)

179
1. Profile Applications
180
n Hardware counters collect application memory access characteristics
n Memory access characteristics q Memory intensity: Last level cache Misses Per Kilo Instruction (MPKI)
q Row-buffer locality: Row-buffer Hit Rate (RBH) - percentage of accesses that hit in the row buffer
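Both profiled characteristics are simple ratios over hardware counters; a minimal sketch (counter names assumed):

// Sketch: memory intensity as last-level cache Misses Per Kilo Instruction.
double mpki(unsigned long llc_misses, unsigned long instructions) {
    return 1000.0 * (double)llc_misses / (double)instructions;
}

// Sketch: row-buffer locality as the fraction of accesses hitting the row buffer.
double rbh(unsigned long row_hits, unsigned long row_accesses) {
    return (double)row_hits / (double)row_accesses;
}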
2. Classify Applications

Test MPKI:
q Low → Low Intensity
q High → Test RBH:
  - Low → High Intensity, Low Row-Buffer Locality
  - High → High Intensity, High Row-Buffer Locality

181
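The classification is a two-level decision tree; a minimal sketch, where the threshold values are placeholders I chose for illustration (the slides do not give the actual thresholds):

enum group { LOW_INTENSITY, HIGH_INT_LOW_RBL, HIGH_INT_HIGH_RBL };

// Sketch: MCP's application classification (threshold values assumed).
enum group classify(double mpki, double rbh) {
    const double MPKI_T = 10.0;  // intensity threshold (assumed value)
    const double RBH_T  = 0.5;   // row-buffer locality threshold (assumed value)
    if (mpki < MPKI_T) return LOW_INTENSITY;
    return (rbh < RBH_T) ? HIGH_INT_LOW_RBL : HIGH_INT_HIGH_RBL;
}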
3. Partition Channels Among Groups: Step 1

Assign number of channels proportional to number of applications in group

[Figure: Channels 1, 2, 3, ..., N-1, N are divided between the Low Intensity group and the High Intensity groups (Low and High Row-Buffer Locality).]

182
3. Partition Channels Among Groups: Step 2

Assign number of channels proportional to bandwidth demand of group

[Figure: Within the high-intensity partition, channels are further divided between the Low Row-Buffer Locality and High Row-Buffer Locality groups in proportion to each group's bandwidth demand.]

183
4. Assign Preferred Channel to Application

n Assign each application a preferred channel from its group's allocated channels
n Distribute applications to channels such that group's bandwidth demand is balanced across its channels

[Figure: Example: three Low Intensity applications with MPKI 1, 3, and 4 are assigned across the group's two channels so that bandwidth demand is balanced (e.g., MPKI 4 on one channel, MPKI 1 + 3 on the other).]

184
5. Allocate Page to Preferred Channel
n Enforce channel preferences computed in the previous step
n On a page fault, the operating system
q allocates page to preferred channel if free page available in preferred channel
q if free page not available, replacement policy tries to allocate page to preferred channel
q if it fails, allocate page to another channel
185
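A minimal sketch of this OS page-fault path; every helper here is hypothetical and stands in for the corresponding OS mechanism described above:

// Hypothetical helpers (not from the paper):
int   preferred_channel(int app);      // preference computed in step 4
void *take_free_page(int channel);     // free page from the given channel, or NULL
void *evict_page_in(int channel);      // replacement policy within the channel, or NULL
void *take_free_page_any(void);        // fall back to any other channel

// Sketch: allocate a page for 'app' on a page fault, enforcing channel preference.
void *alloc_page_for(int app) {
    int pref = preferred_channel(app);
    void *page = take_free_page(pref);       // preferred channel has a free page?
    if (!page) page = evict_page_in(pref);   // else replacement tries preferred channel
    if (!page) page = take_free_page_any();  // else allocate in another channel
    return page;
}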
Interval Based Operation

[Timeline: During the current interval: 1. Profile applications. At the interval boundary: 2. Classify applications into groups; 3. Partition channels between groups; 4. Assign preferred channel to applications. During the next interval: 5. Enforce channel preferences.]

186
Integrating Partitioning and Scheduling
187
Previous Approach: Application-Aware Memory
Request Scheduling
Our First Approach: Application-Aware Memory
Channel Partitioning
Our Second Approach: Integrated Memory
Partitioning and Scheduling
Goal: Mitigate
Inter-Application Interference
Observations

n Applications with very low memory-intensity rarely access memory → Dedicating channels to them wastes precious memory bandwidth
n They have the most potential to keep their cores busy → We would really like to prioritize them
n They interfere minimally with other applications → Prioritizing them does not hurt others

188
Integrated Memory Partitioning and Scheduling (IMPS)
n Always prioritize very low memory-intensity applications in the memory scheduler
n Use memory channel partitioning to mitigate
interference between other applications
189
Hardware Cost n Memory Channel Partitioning (MCP)
q Only profiling counters in hardware q No modifications to memory scheduling logic q 1.5 KB storage cost for a 24-core, 4-channel system
n Integrated Memory Partitioning and Scheduling (IMPS) q A single bit per request q Scheduler prioritizes based on this single bit
190
Methodology n Simulation Model
q 24 cores, 4 channels, 4 banks/channel q Core Model
n Out-of-order, 128-entry instruction window n 512 KB L2 cache/core
q Memory Model – DDR2
n Workloads q 240 SPEC CPU 2006 multiprogrammed workloads
(categorized based on memory intensity)
n Metrics q System Performance
191
Weighted Speedup = Σ over all applications i of (IPC i shared / IPC i alone)
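For clarity, the metric as a minimal sketch (array names assumed):

// Sketch: weighted speedup over n applications, per the formula above.
double weighted_speedup(const double ipc_shared[], const double ipc_alone[], int n) {
    double ws = 0.0;
    for (int i = 0; i < n; i++)
        ws += ipc_shared[i] / ipc_alone[i]; // each app's IPC normalized to its alone IPC
    return ws;
}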
Previous Work on Memory Scheduling n FR-FCFS [Zuravleff et al., US Patent 1997, Rixner et al., ISCA 2000]
q Prioritizes row-buffer hits and older requests q Application-unaware
n ATLAS [Kim et al., HPCA 2010]
q Prioritizes applications with low memory-intensity
n TCM [Kim et al., MICRO 2010]
q Always prioritizes low memory-intensity applications q Shuffles request priorities of high memory-intensity applications
192
Comparison to Previous Scheduling Policies

[Chart: Normalized system performance (averaged over 240 workloads) of FR-FCFS, ATLAS, TCM, MCP, and IMPS; y-axis 0.9 to 1.15. Annotations: MCP and IMPS improve performance by 7% and 11% over the FR-FCFS baseline, and by 1% and 5% over the best previous scheduler.]

Significant performance improvement over baseline FR-FCFS. Better system performance than the best previous scheduler at lower hardware cost.

193
Interaction with Memory Scheduling

[Chart: Normalized system performance (y-axis 0.94 to 1.12, averaged over 240 workloads) of FR-FCFS, ATLAS, and TCM, each without IMPS and with IMPS.]

IMPS improves performance regardless of scheduling policy. Highest improvement is over FR-FCFS, as IMPS is designed for FR-FCFS.

194
MCP Summary n Uncontrolled inter-application interference in main memory
degrades system performance
n Application-aware memory channel partitioning (MCP) q Separates the data of badly-interfering applications
to different channels, eliminating interference
n Integrated memory partitioning and scheduling (IMPS) q Prioritizes very low memory-intensity applications in scheduler q Handles other applications’ interference by partitioning
n MCP/IMPS provide better performance than application-aware memory request scheduling at lower hardware cost
195
Designing QoS-Aware Memory Systems: Approaches
n Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism q QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix Security’07]
[Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11] [Ebrahimi+ ISCA’11, MICRO’11] [Ausavarungnirun+, ISCA’12] [Subramanian+, HPCA’13]
q QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks ’11] [Grot+ MICRO’09, ISCA’11, Top Picks ’12]
q QoS-aware caches
n Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping q Source throttling to control access to memory system [Ebrahimi+ ASPLOS’10,
ISCA’11, TOCS’12] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10] [Nychis+ SIGCOMM’12]
q QoS-aware data mapping to memory controllers [Muralidhara+ MICRO’11]
q QoS-aware thread scheduling to cores [Das+ HPCA’13]
196
Summary: Memory QoS
n Technology, application, architecture trends dictate new needs from memory system
n A fresh look at (re-designing) the memory hierarchy q Scalability: DRAM-System Codesign and New Technologies q QoS: Reducing and controlling main memory interference:
QoS-aware memory system design q Efficiency: Customizability, minimal waste, new technologies
n QoS-unaware memory: uncontrollable and unpredictable n Providing QoS awareness improves performance,
predictability, fairness, and utilization of the memory system
197
Summary: Memory QoS Approaches and Techniques
n Approaches: Smart vs. dumb resources q Smart resources: QoS-aware memory scheduling q Dumb resources: Source throttling; channel partitioning q Both approaches are effective in reducing interference q No single best approach for all workloads
n Techniques: Request/thread scheduling, source throttling, memory partitioning q All techniques are effective in reducing interference q Can be applied at different levels: hardware vs. software q No single best technique for all workloads
n Combined approaches and techniques are the most powerful q Integrated Memory Channel Partitioning and Scheduling [MICRO’11]
198 MCP Micro 2011 Talk
Additional Material
200
Two Works
n Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi, "Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems" Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)
n Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo Lee, Onur Mutlu, and Yale N. Patt, "Parallel Application Memory Scheduling" Proceedings of the 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. Slides (pptx)
201
Staged Memory Scheduling
Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel Loh, and Onur Mutlu, "Staged Memory Scheduling: Achieving High Performance
and Scalability in Heterogeneous Systems” 39th International Symposium on Computer Architecture (ISCA),
Portland, OR, June 2012.
SMS ISCA 2012 Talk
Executive Summary n Observation: Heterogeneous CPU-GPU systems require
memory schedulers with large request buffers
n Problem: Existing monolithic application-aware memory scheduler designs are hard to scale to large request buffer sizes
n Solution: Staged Memory Scheduling (SMS) decomposes the memory controller into three simple stages: 1) Batch formation: maintains row buffer locality 2) Batch scheduler: reduces interference between applications 3) DRAM command scheduler: issues requests to DRAM
n Compared to state-of-the-art memory schedulers: q SMS is significantly simpler and more scalable q SMS provides higher performance and fairness
203
Main Memory is a Bottleneck

n All cores contend for limited off-chip bandwidth
q Inter-application interference degrades system performance
q The memory scheduler can help mitigate the problem
n How does the memory scheduler deliver good performance and fairness?

[Figure: Cores 1-4 send requests (Req) into a shared Memory Request Buffer; the Memory Scheduler picks among them to issue to DRAM (currently open row B), and data flows back to the cores.]

204
Three Principles of Memory Scheduling

n Prioritize row-buffer-hit requests [Rixner+, ISCA’00]
q To maximize memory bandwidth
n Prioritize latency-sensitive applications [Kim+, HPCA’10]
q To maximize system throughput
n Ensure that no application is starved [Mutlu and Moscibroda, MICRO’07]
q To minimize unfairness

[Figure: Example request queue, older to newer: Req 1 (Row A), Req 2 (Row B), Req 3 (Row C), Req 4 (Row A), Req 5 (Row B); application memory intensities (MPKI): App 1: 5, App 2: 1, App 3: 2, App 4: 10.]

205
Memory Scheduling for CPU-GPU Systems n Current and future systems integrate a GPU along with
multiple cores
n GPU shares the main memory with the CPU cores
n GPU is much more (4x-20x) memory-intensive than CPU
n How should memory scheduling be done when GPU is integrated on-chip?
206
Introducing the GPU into the System

n GPU occupies a significant portion of the request buffers
q Limits the MC's visibility of the CPU applications' differing memory behavior → can lead to a poor scheduling decision

[Figure: Cores 1-4 plus a GPU share the request buffer; GPU requests fill most of the entries in front of the Memory Scheduler.]

207
Naïve Solution: Large Monolithic Buffer

[Figure: A much larger request buffer holds the requests of Cores 1-4 and the GPU in front of a single Memory Scheduler.]

208
Problems with Large Monolithic Buffer

n A large buffer requires more complicated logic to:
q Analyze memory requests (e.g., determine row buffer hits)
q Analyze application characteristics
q Assign and enforce priorities
n This leads to high complexity, high power, large die area

[Figure: A large, disorganized request buffer feeding a more complex memory scheduler.]

209
n Design a new memory scheduler that is: q Scalable to accommodate a large number of requests q Easy to implement q Application-aware q Able to provide high performance and fairness, especially in
heterogeneous CPU-GPU systems
Our Goal
210
Key Functions of a Memory Controller n Memory controller must consider three different things
concurrently when choosing the next request: 1) Maximize row buffer hits
q Maximize memory bandwidth
2) Manage contention between applications q Maximize system throughput and fairness
3) Satisfy DRAM timing constraints n Current systems use a centralized memory controller
design to accomplish these functions q Complex, especially with large request buffers
211
Key Idea: Decouple Tasks into Stages n Idea: Decouple the functional tasks of the memory controller
q Partition tasks across several simpler HW structures (stages)
1) Maximize row buffer hits q Stage 1: Batch formation q Within each application, groups requests to the same row into
batches
2) Manage contention between applications q Stage 2: Batch scheduler q Schedules batches from different applications
3) Satisfy DRAM timing constraints q Stage 3: DRAM command scheduler q Issues requests from the already-scheduled order to each bank
212
SMS: Staged Memory Scheduling

[Figure: The monolithic scheduler is decomposed into three stages. Stage 1 (Batch Formation) groups each source's (Cores 1-4, GPU) requests into batches. Stage 2 (Batch Scheduler) chooses among the sources' batches. Stage 3 (DRAM Command Scheduler) issues the scheduled requests through per-bank queues (Banks 1-4) to DRAM.]

213
SMS: Staged Memory Scheduling

[Figure: The three-stage organization, highlighting Stage 1 (Batch Formation).]

214
Stage 1: Batch Formation n Goal: Maximize row buffer hits n At each core, we want to batch requests that access the
same row within a limited time window
n A batch is ready to be scheduled under two conditions 1) When the next request accesses a different row 2) When the time window for batch formation expires
n Keep this stage simple by using per-core FIFOs
215
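A minimal sketch of the per-core batching condition; the struct layout, function name, and window length are assumptions for illustration:

#define WINDOW 200  // batch formation time window in cycles (assumed value)

struct req { unsigned row; };

struct batcher {
    unsigned cur_row;      // row of the batch being formed
    int      cur_len;      // requests in the batch so far
    long     window_start; // cycle at which this batch started forming
};

// Sketch: returns 1 if the pending batch should be sent to Stage 2 before
// accepting the incoming request (per-core FIFO; names assumed).
int batch_ready(const struct batcher *b, const struct req *incoming, long now) {
    if (b->cur_len == 0) return 0;
    if (incoming && incoming->row != b->cur_row) return 1; // next request: different row
    if (now - b->window_start >= WINDOW) return 1;         // time window expired
    return 0;
}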
Stage 1: Batch Formation Example

[Figure: Per-core FIFOs for Cores 1-4 fill with requests to rows (e.g., Row A, Row B, Row C, Row D, Row E, Row F). A batch boundary is drawn when the next request goes to a different row or when the time window expires; completed batches move on to Stage 2 (Batch Scheduling).]

216
SMS: Staged Memory Scheduling

[Figure: The three-stage organization, highlighting Stage 2 (Batch Scheduler).]

217
Stage 2: Batch Scheduler n Goal: Minimize interference between applications
n Stage 1 forms batches within each application n Stage 2 schedules batches from different applications
q Schedules the oldest batch from each application
n Question: Which application’s batch should be scheduled next?
n Goal: Maximize system performance and fairness q To achieve this goal, the batch scheduler chooses between
two different policies
218
Stage 2: Two Batch Scheduling Algorithms n Shortest Job First (SJF)
q Prioritize the applications with the fewest outstanding memory requests because they make fast forward progress
q Pro: Good system performance and fairness q Con: GPU and memory-intensive applications get deprioritized
n Round-Robin (RR) q Prioritize the applications in a round-robin manner to ensure
that memory-intensive applications can make progress q Pro: GPU and memory-intensive applications are treated fairly q Con: GPU and memory-intensive applications significantly
slow down others
219
Stage 2: Batch Scheduling Policy

n The importance of the GPU varies between systems and over time → Scheduling policy needs to adapt to this
n Solution: Hybrid Policy
n At every cycle:
q With probability p: Shortest Job First → Benefits the CPU
q With probability 1-p: Round-Robin → Benefits the GPU
n System software can configure p based on the importance/weight of the GPU
q Higher GPU importance → Lower p value

220
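A minimal sketch of the hybrid policy; the slides only specify the probabilities, so the rand()-based coin flip and the function names are assumptions:

#include <stdlib.h>

enum policy { SJF, RR };

// Sketch: per-cycle policy choice with probability p for SJF.
enum policy pick_policy(double p) {
    double r = (double)rand() / RAND_MAX;
    return (r < p) ? SJF : RR;  // with probability p: SJF (CPU); else RR (GPU)
}

// Sketch: SJF picks the source with the fewest outstanding memory requests,
// since such applications make fast forward progress.
int sjf_pick(const int outstanding[], int nsrc) {
    int best = 0;
    for (int s = 1; s < nsrc; s++)
        if (outstanding[s] < outstanding[best]) best = s;
    return best;
}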
SMS: Staged Memory Scheduling

[Figure: The three-stage organization, highlighting Stage 3 (DRAM Command Scheduler).]

221
Stage 3: DRAM Command Scheduler n High level policy decisions have already been made by:
q Stage 1: Maintains row buffer locality q Stage 2: Minimizes inter-application interference
n Stage 3: No need for further scheduling n Only goal: service requests while satisfying DRAM
timing constraints
n Implemented as simple per-bank FIFO queues
222
Putting Everything Together

[Figure: Cores 1-4 and the GPU feed Stage 1 (Batch Formation). Stage 2's Batch Scheduler alternates its current batch scheduling policy between SJF and RR. Stage 3's DRAM Command Scheduler issues to Banks 1-4.]

223
Complexity

n Compared to a row hit first scheduler, SMS consumes*
q 66% less area
q 46% less static power
n Reduction comes from:
q Monolithic scheduler → stages of simpler schedulers
q Each stage has a simpler scheduler (considers fewer properties at a time to make the scheduling decision)
q Each stage has simpler buffers (FIFO instead of out-of-order)
q Each stage has a portion of the total buffer size (buffering is distributed across stages)

224 * Based on a Verilog model using 180nm library
Methodology

n Simulation parameters
q 16 OoO CPU cores, 1 GPU modeling AMD Radeon™ 5870
q DDR3-1600 DRAM, 4 channels, 1 rank/channel, 8 banks/channel
n Workloads
q CPU: SPEC CPU 2006
q GPU: Recent games and GPU benchmarks
q 7 workload categories based on the memory-intensity of CPU applications: Low memory-intensity (L), Medium memory-intensity (M), High memory-intensity (H)

225
Comparison to Previous Scheduling Algorithms

n FR-FCFS [Rixner+, ISCA’00]
q Prioritizes row buffer hits
q Maximizes DRAM throughput
q Low multi-core performance ← Application unaware
n ATLAS [Kim+, HPCA’10]
q Prioritizes latency-sensitive applications
q Good multi-core performance
q Low fairness ← Deprioritizes memory-intensive applications
n TCM [Kim+, MICRO’10]
q Clusters low and high-intensity applications and treats each separately
q Good multi-core performance and fairness
q Not robust ← Misclassifies latency-sensitive applications

226
Evaluation Metrics n CPU performance metric: Weighted speedup
n GPU performance metric: Frame rate speedup
n CPU-GPU system performance: CPU-GPU weighted speedup
227
Evaluated System Scenario: CPU Focused

n GPU has low weight (weight = 1)
n Configure SMS such that p, the SJF probability, is set to 0.9
q Mostly uses SJF batch scheduling → prioritizes latency-sensitive applications (mainly CPU)

228
Performance: CPU-Focused System

n SJF batch scheduling policy allows latency-sensitive applications to get serviced as fast as possible

[Chart: CPU-GPU weighted speedup (CGWS, 0-12) of FR-FCFS, ATLAS, TCM, and SMS_0.9 (p = 0.9) across workload categories L, ML, M, HL, HML, HM, H, and Avg. SMS achieves +17.2% over ATLAS.]

SMS is much less complex than previous schedulers.

229
Evaluated System Scenario: GPU Focused

n GPU has high weight (weight = 1000)
n Configure SMS such that p, the SJF probability, is set to 0
q Always uses round-robin batch scheduling → prioritizes memory-intensive applications (GPU)

230
Performance: GPU-Focused System

n Round-robin batch scheduling policy schedules GPU requests more frequently

[Chart: CPU-GPU weighted speedup (0-1000) of FR-FCFS, ATLAS, TCM, and SMS_0 (p = 0) across workload categories L, ML, M, HL, HML, HM, H, and Avg. SMS achieves +1.6% over FR-FCFS.]

SMS is much less complex than previous schedulers.

231
Performance at Different GPU Weights

[Chart: System performance vs. GPU weight (0.001 to 1000, log scale). A "Best Previous Scheduler" curve traces whichever of ATLAS, TCM, and FR-FCFS performs best at each weight.]

232
Performance at Different GPU Weights

n At every GPU weight, SMS outperforms the best previous scheduling algorithm for that weight

[Chart: The same plot with the SMS curve added; SMS lies above the Best Previous Scheduler curve across all GPU weights.]

233
Additional Results in the Paper n Fairness evaluation
q 47.6% improvement over the best previous algorithms
n Individual CPU and GPU performance breakdowns
n CPU-only scenarios q Competitive performance with previous algorithms
n Scalability results q SMS’ performance and fairness scales better than previous
algorithms as the number of cores and memory channels increases
n Analysis of SMS design parameters
234
Conclusion n Observation: Heterogeneous CPU-GPU systems require
memory schedulers with large request buffers
n Problem: Existing monolithic application-aware memory scheduler designs are hard to scale to large request buffer size
n Solution: Staged Memory Scheduling (SMS) decomposes the memory controller into three simple stages: 1) Batch formation: maintains row buffer locality 2) Batch scheduler: reduces interference between applications 3) DRAM command scheduler: issues requests to DRAM
n Compared to state-of-the-art memory schedulers: q SMS is significantly simpler and more scalable q SMS provides higher performance and fairness
235