TAP: A TLP-Aware Cache Management Policy for a CPU-GPU Heterogeneous Architecture
Jaekyu Lee and Hyesoon Kim
TLP-Aware Cache Management Policy (HPCA-18)

Outline

- Introduction
- Background
- TAP (TLP-Aware Cache Management Policy)
  - Core sampling
  - Cache block lifetime normalization
  - TAP-UCP and TAP-RRIP
- Evaluation Methodology
- Evaluation Results
- Conclusion

CPU-GPU Heterogeneous Architecture

- Combining GPU cores with conventional CMPs is a growing trend, as in Intel's Sandy Bridge, AMD's Fusion, and the Denver project.
- Various resources are shared between CPU and GPU cores: the LLC, the on-chip interconnect, the memory controllers, and DRAM.
- The shared last-level cache is one of the most important of these resources.

Cache Sharing Problem

- Researchers have proposed many mechanisms for managing shared caches:
  - Dynamic cache partitioning: Suh+ [HPCA'02], Kim+ [PACT'04], Qureshi+ [MICRO'06]
  - Dynamic cache insertion policies: Qureshi+ [ISCA'07], Jaleel+ [PACT'08, ISCA'10], Wu+ [MICRO'11]
  - Many other mechanisms
- All of these mechanisms target CMPs.
- They may not be directly applicable to CPU-GPU heterogeneous architectures, because CPU and GPU cores have different characteristics.

CPU vs. GPU Cores (1)

- GPU cores differ from CPU cores in many ways: SIMD execution, massive multithreading, lack of speculative execution, and more.
- GPU cores run an order of magnitude more threads: CPUs support 1-4-way SMT, while a GPU core keeps tens of threads active.
- GPU cores therefore have much higher TLP (thread-level parallelism) than CPU cores.
- TLP has a significant impact on how caching affects application performance.

TLP Effect on Caching

[Figure: MPKI and CPI as functions of cache size for three application types. Compute-intensive or thrashing: both MPKI and CPI stay flat as the cache grows. Cache-friendly (low TLP): MPKI and CPI both fall as the cache grows. TLP-dominant (high TLP): MPKI falls, but CPI stays flat; this type is hardly found in CPU applications.]

TLP Effect on Caching (cont.)

- Cache-oriented metrics cannot differentiate the TLP-dominant and cache-friendly types; they are unable to recognize the effect of TLP.
- We need to monitor the performance effect of caching directly.

[Figure: The TLP-dominant and cache-friendly types show identical MPKI curves as cache size grows, but different CPI curves.]

Core Sampling

- Samples GPU cores with different cache policies.

[Figure: CPU and GPU cores with private L1 caches share the last-level cache and DRAM. Among the GPU cores, the two sampled cores (POL1 and POL2) run different policies: one bypasses the LLC entirely (no L3), while the other uses the MRU insertion policy in the LLC. The remaining GPU cores are followers.]

Core Sampling (cont.)

- Measures the performance difference between the sampled cores.

[Figure: The core sampling controller collects performance samples IPC1 and IPC2 from the two sampled cores (one bypassing the LLC, the other using MRU insertion), calculates the performance delta ∆(IPC1, IPC2), and makes a decision: if ∆ > threshold, the application is cache-friendly (caching improves performance); otherwise it is not cache-friendly (caching does not affect performance).]
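The controller's decision step can be sketched as follows. This is a minimal illustration, not the paper's hardware: the name `is_cache_friendly` and the use of a relative IPC delta with an example threshold of 5% are assumptions.

```python
# Sketch of the core-sampling decision. THRESHOLD and the relative-delta
# formulation are illustrative assumptions, not values from the paper.

THRESHOLD = 0.05  # assumed: minimum relative IPC delta to call an app cache-friendly

def is_cache_friendly(ipc_pol1: float, ipc_pol2: float,
                      threshold: float = THRESHOLD) -> bool:
    """Compare the two sampled cores (one with MRU insertion, one
    bypassing the LLC). If caching changes performance by more than the
    threshold, classify the running GPGPU application as cache-friendly."""
    delta = abs(ipc_pol1 - ipc_pol2) / max(ipc_pol1, ipc_pol2)
    return delta > threshold

# A cache-friendly app: bypassing the LLC hurts its IPC noticeably.
print(is_cache_friendly(1.0, 0.8))   # True
# A TLP-dominant app: IPC is nearly identical under both policies.
print(is_cache_friendly(1.0, 0.99))  # False
```

Because all follower cores run the same SPMD program as the sampled cores, a single pair of samples is enough to characterize the whole application.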

Core Sampling Example

[Figure: For a TLP-dominant application, the LLC-bypassing and MRU-insertion sample cores show nearly identical CPI at every cache size, so ∆ < threshold: not cache-friendly. For a cache-friendly application, the two policies produce clearly different CPI, so ∆ > threshold: cache-friendly.]

Core Sampling (cont.)

- Cores run different LLC policies so that the effect of the last-level cache can be identified.
- Main goal: finding cache-friendly GPGPU applications.
- Why core sampling is viable: under the SPMD (Single Program, Multiple Data) model, each GPU core runs the same program. GPGPU applications usually behave symmetrically across their GPU cores, so the performance variance between GPU cores is very small.

CPU vs. GPU Cores (2)

- GPU cores have higher TLP (thread-level parallelism) than CPU cores.
- GPU cores generate an order of magnitude more cache accesses.
- GPUs have a higher tolerance for cache misses thanks to TLP: they keep generating cache accesses from other threads without stalling.
- With SIMD execution, one SIMD instruction can generate multiple memory requests.

More Frequent Accesses by GPU Cores

[Figure: A timeline contrasts GPU threads with a CPU thread: on a cache miss the CPU thread stalls the processor and issues fewer cache accesses, while GPU threads keep issuing accesses from other threads without stalls. Bar charts of LLC requests per 1000 cycles across namd, sjeng, gamess, h264ref, gobmk, wrf, sphinx3, bwaves, milc, leslie3d, and the average show fewer than 100 requests for the 1-core CPU, but more than 500 (up to several thousand) for the 6-core GPU.]

More Frequent Accesses by GPU Cores (cont.)

- Why are the much more frequent accesses from GPGPU applications problematic? They cause severe interference with CPU applications (e.g., under the base LRU replacement policy).
- The performance impact of cache hits also differs between applications: is the performance penalty of a cache miss the same for a CPU application as for a GPU application?
- We have to account for the different degrees of cache accesses.
- We propose Cache Block Lifetime Normalization.

Cache Block Lifetime Normalization

- A simple monitoring mechanism: monitor the difference in cache access rates between CPU and GPGPU applications, and periodically calculate the ratio.
- The result is stored in the XSRATIO register, which gives the proposed TAP mechanisms a hint about access rate differences.

[Figure: GPU and CPU cache access counters feed a ratio calculation; XSRATIO is set depending on whether the measured ratio exceeds a threshold.]
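The periodic update can be sketched as below. Since the slide elides the assigned values, the behavior here is an assumption: XSRATIO takes the measured ratio when it exceeds the threshold and falls back to 1 otherwise; the function name and the threshold value are likewise illustrative.

```python
# Sketch of cache block lifetime normalization. The fallback to 1 below
# the threshold, the threshold value, and the names are assumptions.

def update_xsratio(gpu_accesses: int, cpu_accesses: int,
                   threshold: float = 2.0) -> float:
    """Periodically compute the GPU:CPU LLC access-rate ratio.

    When GPU cores access the cache far more often than CPU cores,
    record the ratio as a hint for the TAP mechanisms; otherwise
    apply no scaling (ratio of 1)."""
    if cpu_accesses == 0:
        return 1.0  # no CPU traffic this period; nothing to normalize against
    ratio = gpu_accesses / cpu_accesses
    return ratio if ratio > threshold else 1.0

print(update_xsratio(gpu_accesses=3000, cpu_accesses=100))  # 30.0
print(update_xsratio(gpu_accesses=150, cpu_accesses=100))   # 1.0
```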

TLP-Aware Cache Management Policy

TAP combines two components:

- Core sampling, to find cache-friendly applications.
- Cache block lifetime normalization, to account for the different degrees of cache accesses.

Applied to existing policies, these yield two mechanisms:

- TAP-UCP, built on UCP (Utility-based Cache Partitioning): covered in this talk.
- TAP-RRIP, built on RRIP (Re-Reference Interval Prediction): covered in the paper.

TAP-UCP

UCP [Qureshi and Patt, MICRO-2006] maintains, per application, an ATD (an auxiliary tag directory holding an LRU stack) and per-way hit counters, and runs a partitioning algorithm over those counters to divide the LLC ways among applications.

TAP adds two components on top of UCP:

- Cache block lifetime normalization: the hit counters are divided by the XSRATIO register value to balance cache space between the CPU and GPU applications.
- Core sampling: the UCP-Mask register is set to 1 when the GPGPU application is not cache-friendly; in that case the partitioning algorithm assigns only 1 way to the GPGPU application.

TAP-UCP Case 1: Non-Cache-Friendly

CPU hit counters (MRU to LRU): 16, 3, 8, 20, 5, 8, 3, 2. GPU hit counters: 32, 6, 16, 40, 10, 16, 6, 4.

UCP allocates ways by marginal utility: how many more hits are expected if N more ways are given to an application (e.g., 3 hits for 1 way, 3+8 for 2 ways, 3+8+20 for 3 ways). Because the GPU's hit counters are uniformly larger, UCP's final partition gives 1 way to the CPU and 7 ways to the GPU.

With TAP-UCP, core sampling reports ∆ < threshold: the GPGPU application is not cache-friendly, so caching has little effect on its performance. TAP-UCP therefore assigns only 1 way to the GPGPU application, yielding a final partition of 7 CPU ways and 1 GPU way.

[Figure: performance across partitions from 1 CPU : 7 GPU through 4 : 4 to 7 CPU : 1 GPU; for this non-cache-friendly workload, more CPU ways yield higher performance, so TAP-UCP's 7 : 1 split outperforms UCP's 1 : 7.]

TAP-UCP Case 2: Cache-Friendly

Same hit counters as before: CPU 16, 3, 8, 20, 5, 8, 3, 2; GPU 32, 6, 16, 40, 10, 16, 6, 4. This time core sampling reports ∆ > threshold: the GPGPU application is cache-friendly. Plain UCP still gives 1 CPU way and 7 GPU ways, because the GPU's raw hit counts dominate.

TAP-UCP instead divides the GPU hit counters by XSRATIO (here XSRATIO = 2, reflecting the GPU's higher access rate), which makes the two applications' counters identical. The partitioning algorithm then produces a balanced final partition of 4 CPU ways and 4 GPU ways.

[Figure: performance across partitions from 1 CPU : 7 GPU to 7 CPU : 1 GPU peaks near the balanced 4 : 4 partition chosen by TAP-UCP.]
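Both cases can be reproduced with a short sketch of UCP's greedy lookahead partitioning plus TAP's adjustments. The function names are illustrative, and the tie-breaking choice (ties go to the CPU application) is an assumption; with the slide's counters it reproduces the 1 : 7, 4 : 4, and 7 : 1 partitions above.

```python
# Sketch of UCP-style lookahead way partitioning with TAP's adjustments.
# Hit counters are the slide's example; names and tie-breaking are assumptions.

def max_marginal_utility(hits, start, max_ways):
    """Best hits-per-way obtainable by granting 1..max_ways more ways,
    reading the per-way hit counters from the current allocation point."""
    best_mu, best_k, gained = -1.0, 1, 0
    for k in range(1, max_ways + 1):
        gained += hits[start + k - 1]
        if gained / k > best_mu:
            best_mu, best_k = gained / k, k
    return best_mu, best_k

def lookahead_partition(cpu_hits, gpu_hits, total_ways=8):
    """Repeatedly grant a block of ways to whichever application has the
    higher marginal utility (ties go to the CPU, an assumed tie-break)."""
    c = g = 0
    while c + g < total_ways:
        left = total_ways - c - g
        mu_c, k_c = max_marginal_utility(cpu_hits, c, min(left, len(cpu_hits) - c))
        mu_g, k_g = max_marginal_utility(gpu_hits, g, min(left, len(gpu_hits) - g))
        if mu_c >= mu_g:
            c += k_c
        else:
            g += k_g
    return c, g

CPU = [16, 3, 8, 20, 5, 8, 3, 2]      # per-way hit counters, MRU to LRU
GPU = [32, 6, 16, 40, 10, 16, 6, 4]   # GPU accesses the cache twice as often

# Plain UCP: the GPU's raw hit counts dominate -> 1 CPU way, 7 GPU ways.
print(lookahead_partition(CPU, GPU))                         # (1, 7)

# Case 2 (cache-friendly): divide GPU counters by XSRATIO = 2 -> balanced 4 : 4.
XSRATIO = 2
print(lookahead_partition(CPU, [h / XSRATIO for h in GPU]))  # (4, 4)

# Case 1 (not cache-friendly, UCP-Mask == 1): give the GPU a single way.
print((len(CPU) - 1, 1))                                     # (7, 1)
```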

Outline

- Introduction
- Background
- TAP (TLP-Aware Cache Management Policy): core sampling, cache block lifetime normalization
- TAP-UCP
- Evaluation Methodology
- Evaluation Results
- Conclusion

Evaluation Methodology

- Simulator: MacSim (http://code.google.com/p/macsim), a trace-driven timing simulator from Georgia Tech that models x86 and PTX instructions.
- Workloads: CPU applications from SPEC 2006; GPGPU applications from the CUDA SDK, Parboil, Rodinia, and ERCBench. Four workload groups: 1-CPU (1 CPU + 1 GPU, 152 combinations), 2-CPU (2 CPUs + 1 GPU, 150), 4-CPU (4 CPUs + 1 GPU, 75), and Stream-CPU (streaming CPU + 1 GPU, 25).
- Configuration: 1-4 out-of-order 4-wide CPU cores with private L1/L2 caches; 6 GPU cores with 16-wide SIMD and private L1 caches; a 32-way 8MB shared LLC (base policy: LRU); DDR3-1333 DRAM with 41.6 GB/s bandwidth and FR-FCFS scheduling.

Results

[Figure: speedup over LRU for UCP vs. TAP-UCP and for RRIP vs. TAP-RRIP across workload categories (compute intensive, thrashing, cache-friendly, TLP dominant, thrashing (TLP), and average). On average, TAP-UCP achieves an 11% speedup over LRU and TAP-RRIP 12%.]

- UCP is effective with thrashing workloads.
- It is less effective with cache-sensitive GPGPU applications.
- RRIP is generally less effective on heterogeneous workloads.

Case Study: Sphinx3 + Stencil

- Stencil is TLP dominant.
- MPKI: the CPU's decreases significantly, the GPGPU's increases considerably, and the overall MPKI increases.
- Performance: the CPU improves hugely, the GPU is unchanged, and overall performance improves hugely.

[Figure: normalized MPKI (CPU, GPU, overall) and speedup over LRU (CPU, GPU, overall) for previous policies vs. TAP.]

Results: Multiple CPU Applications

- The TAP mechanisms show higher benefits as more CPU applications share the cache.

[Figure: speedup over LRU for UCP, TAP-UCP, RRIP, and TAP-RRIP with 1, 2, and 4 CPU applications plus 1 GPGPU application. TAP-UCP improves from 11% to 12.5% to 17.5%, and TAP-RRIP from 12% to 14% to 24%.]

Conclusion

- CPU-GPU heterogeneous architectures are a popular trend, which makes the resource sharing problem more significant.
- We propose TAP, the first cache management proposal to consider resource sharing in CPU-GPU heterogeneous architectures.
- We introduce a core sampling technique that samples GPU cores with different policies to identify cache-friendliness.
- The two TAP mechanisms improve system performance significantly: TAP-UCP by 11% over LRU and 5% over UCP, and TAP-RRIP by 12% over LRU and 9% over RRIP.