
HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

JASON POWER*, ARKAPRAVA BASU*, JUNLI GU†, SOORAJ PUTHOOR†,

BRADFORD M BECKMANN†, MARK D HILL*†, STEVEN K REINHARDT†, DAVID A WOOD*†

*University of Wisconsin-Madison

†Advanced Micro Devices, Inc.

| HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46 2

Powerpoint version available on:

http://pages.cs.wisc.edu/~powerjg/


ABSTRACT

Hardware coherence can increase the utility of heterogeneous systems

Major bottlenecks in current coherence implementations

‒High bandwidth difficult to support at directory

‒Extreme resource requirements

We propose Heterogeneous System Coherence

‒Leverages spatial locality and region coherence

‒Reduces bandwidth by 94%

‒Reduces resource requirements by 95%

PHYSICAL INTEGRATION

[Figure: physical integration of CPU cores, GPU, and stacked high-bandwidth DRAM. Credit: IBM]

LOGICAL INTEGRATION

General-purpose GPU computing

‒OpenCL

‒CUDA

Heterogeneous Uniform Memory Access (hUMA)

‒Shared virtual address space

‒Cache coherence

Allows new heterogeneous apps


OUTLINE

Motivation

Background

‒System overview

‒Cache architecture reminder

Heterogeneous System Bottlenecks

Heterogeneous System Coherence Details

Results

Conclusions

SYSTEM OVERVIEW: SYSTEM LEVEL

[Figure: an Accelerated Processing Unit (APU) connected to DRAM channels through a high-bandwidth interconnect]

SYSTEM OVERVIEW: APU

[Figure: APU block diagram with a CPU cluster, a GPU cluster, a directory, and a path to DRAM. Arrow thickness indicates bandwidth.]

Direct-access bus (used for graphics)

Invalidation traffic

GPU compute accesses must stay coherent

SYSTEM OVERVIEW: GPU

[Figure: GPU cluster. Each compute unit (CU) has its own L1 cache, and the L1s share the GPU L2 cache. Inside a CU: I-Fetch / Decode, register file, execution units (Ex), local scratchpad memory, and a coalescer that leads to the L1.]

Very high bandwidth: L2 has high miss rate

SYSTEM OVERVIEW: CPU

[Figure: CPU cluster. Each CPU core has its own L1 cache, and the cores share an L2 cache that connects to the directory.]

Low bandwidth: Low L2 miss rate

CACHE ARCHITECTURE REMINDER: CPU/GPU L2 CACHE

[Figure: L2 cache block diagram with MSHRs, cache tag arrays, and a coherent network interface. Inputs are demand requests and probe requests; outputs are core data responses, miss requests, and data responses.]

Demand requests arrive from the L1 cache

Allocates an MSHR entry

Searches cache tags for a tag match

On a hit, return data to the L1

On a miss, send request to directory

On a directory probe, check MSHRs and tags

Tag hit on probe: send data to other core
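The request flow listed above can be sketched as a small software model. This is only an illustration under stated assumptions, not the simulated hardware: the names `L2Cache`, `demand_request`, `fill`, and `probe`, the dictionary used as a tag array, releasing the MSHR entry on a hit, and merging a request into an outstanding miss are all choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class L2Cache:
    """Toy model of the L2 demand/probe flow (all names are hypothetical)."""
    num_mshrs: int = 4
    tags: dict = field(default_factory=dict)   # block address -> data
    mshrs: set = field(default_factory=set)    # block addresses in flight

    def demand_request(self, addr):
        """Handle a demand request arriving from an L1 cache."""
        if addr in self.mshrs:
            return ("merged", None)            # assumption: join the outstanding miss
        if len(self.mshrs) >= self.num_mshrs:
            return ("stall", None)             # no free MSHR entry: back-pressure
        self.mshrs.add(addr)                   # allocate an MSHR entry
        if addr in self.tags:                  # search the cache tags
            self.mshrs.discard(addr)           # assumption: entry released on a hit
            return ("hit", self.tags[addr])    # return data to the L1
        return ("miss_to_directory", None)     # send the request to the directory

    def fill(self, addr, data):
        """Data response arrives: free the MSHR entry and install the block."""
        self.mshrs.discard(addr)
        self.tags[addr] = data

    def probe(self, addr):
        """Directory probe: check the tags and the MSHRs."""
        if addr in self.tags:
            return ("probe_hit", self.tags[addr])   # send data to the other core
        if addr in self.mshrs:
            return ("probe_pending", None)
        return ("probe_miss", None)
```

Whether a probed block is invalidated or downgraded depends on the coherence protocol, which the sketch does not model.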

DIRECTORY ARCHITECTURE REMINDER: DIRECTORY

[Figure: directory block diagram with MSHRs, a block directory tag array, and a probe request RAM holding PR entries. Inputs are coherent block requests; outputs are block probe requests/responses and requests to DRAM.]

Demand requests arrive from the L2 cache

Allocates an MSHR entry

Searches cache tags for a tag match

Allocate and send probes to L2 caches

On a miss, the data comes from DRAM
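The directory steps above can be sketched in the same style. This is a rough model under assumptions: `BlockDirectory`, `request`, and `complete` are hypothetical names, sharers are tracked as a plain set per block, and a second request to a block that is already in flight is simply stalled.

```python
from dataclasses import dataclass, field


@dataclass
class BlockDirectory:
    """Toy block directory (hypothetical names, one sharer set per block)."""
    num_mshrs: int = 4
    tags: dict = field(default_factory=dict)    # block address -> set of sharer ids
    mshrs: dict = field(default_factory=dict)   # block address -> requester id

    def request(self, addr, requester):
        """Handle a demand request from an L2 cache; return the actions taken."""
        if addr in self.mshrs or len(self.mshrs) >= self.num_mshrs:
            return ["stall"]                    # block busy or MSHR array full
        self.mshrs[addr] = requester            # entry is held until complete()
        others = self.tags.get(addr, set()) - {requester}
        if others:                              # another cache holds the block
            return [f"probe L2 {s}" for s in sorted(others)]
        return ["read DRAM"]                    # no other holder: data from DRAM

    def complete(self, addr):
        """Request finished: record the requester as a sharer, free the MSHR."""
        requester = self.mshrs.pop(addr)
        self.tags.setdefault(addr, set()).add(requester)
```

Because an entry stays allocated from `request` until `complete`, a small `num_mshrs` quickly turns into stalls, which is the resource pressure examined in the bottleneck slides.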


BACKGROUND SUMMARY

System under investigation

‒Heterogeneous CPU-GPU on chip

‒High-bandwidth DRAM

Directory pipeline complex

‒MSHR array is associative

‒Difficult to pipeline with more than 1 request per cycle

‒ Important resources: MSHR entries


OUTLINE

Motivation

Background

Heterogeneous System Bottlenecks

‒Simulation overview

‒Directory bandwidth

‒MSHRs

‒Performance is significantly affected

Heterogeneous System Coherence Details

Results

Conclusions


SIMULATION DETAILS

gem5 simulator

‒Simple CPU

‒GPU simulator based on AMD GCN

‒All memory requests through gem5

CPU Clock            2 GHz
CPU Cores            2
CPU Shared L2        2 MB (16-way banked)
GPU Clock            1 GHz
Compute Units        32
GPU Shared L2        4 MB (64-way banked)
L3 (Memory-side)     16 MB (16-way banked)
DRAM                 DDR3, 16 channels
Peak Bandwidth       700 GB/s
Baseline Directory   256k entries (8-way banked)

Workloads

‒Modified to use hUMA

‒Rodinia & AMD APP SDK

GPGPU BENCHMARKS

Rodinia benchmarks

‒bp: trains the connection weights on a neural network

‒bfs: breadth-first search

‒hs: performs a transient 2D thermal simulation (5-point stencil)

‒lud: matrix decomposition

‒nw: performs a global optimization for DNA sequence alignment

‒km: does k-means clustering

‒sd: speckle-reducing anisotropic diffusion

AMD SDK

‒bn: bitonic sort

‒dct: discrete cosine transform

‒hg: histogram

‒mm: matrix multiplication


SYSTEM BOTTLENECKS

Difficult to scale directory bandwidth

‒Difficult to multi-port

‒Complicated pipeline

High resource usage

‒Must allocate MSHR for entire duration of request

‒MSHR array difficult to scale

[Figure: APU diagram (CPU cluster, GPU cluster, directory, path to DRAM) carrying the labels "High bandwidth" and "Designed to support CPU bandwidth"]

DIRECTORY TRAFFIC

[Figure: chart of directory accesses per GPU cycle (y-axis 0 to 4.5) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Difficult to support >1 request per cycle

RESOURCE USAGE

[Figure: chart of maximum MSHRs (log scale, 1 to 100,000) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Causes significant back-pressure on L2s

Steady state at 700 GB/s

Very difficult to scale MSHR array
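One way to see why MSHR demand grows so large at this bandwidth is Little's law: the number of entries in flight equals the request rate times the time each entry is held. The sketch below applies it to the 700 GB/s peak bandwidth and 64-B blocks of the simulated system. The hold times are hypothetical values chosen for illustration (the slides do not state a latency), and `mshrs_in_flight` is a made-up helper name.

```python
def mshrs_in_flight(bandwidth_bytes_per_s, block_bytes, hold_time_s):
    """Little's law: entries in flight = request rate * time each entry is held."""
    requests_per_s = bandwidth_bytes_per_s / block_bytes
    return requests_per_s * hold_time_s


PEAK_BANDWIDTH = 700e9   # 700 GB/s peak bandwidth of the simulated system
BLOCK_BYTES = 64         # 64-B cache blocks

for hold_ns in (100, 500, 1000):   # hypothetical hold times, not measured values
    entries = mshrs_in_flight(PEAK_BANDWIDTH, BLOCK_BYTES, hold_ns * 1e-9)
    print(f"{hold_ns:5d} ns held -> about {entries:,.0f} MSHR entries")
```

The estimate scales linearly with the hold time, so any queueing delay at the directory directly raises the number of entries required.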

PERFORMANCE OF BASELINE COMPARED TO UNCONSTRAINED RESOURCES

[Figure: chart of slowdown (y-axis 0 to 5) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Back-pressure from limited MSHRs and bandwidth


BOTTLENECKS SUMMARY

Directory bandwidth

‒Must support up to 4 requests per cycle

‒Difficult to construct pipeline

Resource usage

‒MSHRs are a constraining resource

‒Need more than 10,000

‒Without resource constraints, up to 4x better performance


OUTLINE

Motivation

Background

Heterogeneous System Bottlenecks

Heterogeneous System Coherence Details

‒Overall system design

‒Region buffer design

‒Region directory design

‒Example

‒Hardware complexity

Results

Conclusions

BASELINE DIRECTORY COHERENCE

[Figure: APU diagram (CPU cluster, GPU cluster, directory, path to DRAM) annotated with the labels Initialization, Kernel Launch, and Read result]

HETEROGENEOUS SYSTEM COHERENCE (HSC)

[Figure: the APU diagram redrawn for HSC. The directory becomes a region directory, a region buffer is added at the CPU cluster and at the GPU cluster, and a direct-access bus is shown for each cluster.]

Region buffers coordinate with region directory

HSC: EXAMPLE MEMORY REQUEST

[Figure: example memory request in the HSC APU, highlighting the GPU L2 cache, the GPU region buffer, and the region directory]

HSC: L2 CACHE & REGION BUFFER

[Figure: the L2 cache block diagram (cache tag arrays, MSHRs, coherent network interface) extended with a region buffer and a direct-access bus interface]

Region tags and permissions

Interface for direct-access bus

Only region-level permission traffic

HSC: REGION DIRECTORY

[Figure: the directory block diagram with the block directory tag array replaced by a region directory tag array. Inputs are region permission requests; MSHRs, the probe request RAM with its PR entries, and block probe requests/responses remain.]

Region tags, sharers, and permissions


HSC: HARDWARE COMPLEXITY

Region protocols reduce directory size

‒Region directory: 8x fewer entries

Region buffers

‒At each L2 cache

‒1-KB region (16 64-B blocks)

‒16-K region entries

‒Overprovisioned for low-locality workloads

(a) Region Directory Entry

Region Tag (18 bits) | State (2 bits) | CPU, GPU (1 valid bit per cluster)

(b) Region Buffer Entry

Region Tag (18 bits) | State (2 bits) | B0 B1 B2 ... B15 (1 valid bit per block in the region)
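The entry fields listed above translate into a few lines of arithmetic. The sketch below is an illustration only: the helper names (`split_address`, `mark_block_valid`, `storage_kib`) are hypothetical, "16-K" is assumed to mean 16 × 1024 entries, and the storage estimate counts just the listed fields, ignoring any other per-entry overhead.

```python
BLOCK_BYTES = 64                                   # 64-B cache blocks
BLOCKS_PER_REGION = 16                             # 16 blocks per region
REGION_BYTES = BLOCK_BYTES * BLOCKS_PER_REGION     # 1-KB region

TAG_BITS = 18
STATE_BITS = 2
CLUSTERS = 2                                       # CPU and GPU

# Bits per entry, counting only the fields shown in the entry layouts.
REGION_DIRECTORY_ENTRY_BITS = TAG_BITS + STATE_BITS + CLUSTERS
REGION_BUFFER_ENTRY_BITS = TAG_BITS + STATE_BITS + BLOCKS_PER_REGION


def split_address(addr):
    """Return (region number, block index within the region) for a byte address."""
    region = addr // REGION_BYTES
    block = (addr % REGION_BYTES) // BLOCK_BYTES
    return region, block


def mark_block_valid(valid_bits, addr):
    """Set the per-block valid bit (B0..B15) for the block containing addr."""
    _, block = split_address(addr)
    return valid_bits | (1 << block)


def storage_kib(entries, bits_per_entry):
    """Raw storage in KiB for a table of fixed-width entries."""
    return entries * bits_per_entry / 8 / 1024


# Assumption: "16-K region entries" means 16 * 1024 entries per region buffer.
print(storage_kib(16 * 1024, REGION_BUFFER_ENTRY_BITS), "KiB per region buffer")
```

Keeping one valid bit per block lets a region buffer entry record which blocks of the region are present without storing a separate tag for each block.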


HSC SUMMARY

Key insight

‒GPU-CPU applications exhibit high spatial locality

‒Use direct-access bus present in systems

‒Offload bandwidth onto direct-access bus

Use coherence network only for permission

Add region buffer to track region information

‒At each L2 cache

‒Bypass coherence network and directory

Replace directory with region directory

‒Significantly reduces total size needed
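The division of labor summarized above can be sketched as a toy model: the region directory is consulted only for region permission, and block data then moves over the direct-access bus. Everything here is an assumption made for illustration. `RegionDirectory`, `RegionBuffer`, `acquire`, and `l2_miss` are hypothetical names, permission is always granted, and probes, evictions, and conflicts between clusters are not modelled.

```python
REGION_BYTES = 1024   # 1-KB regions
BLOCK_BYTES = 64      # 64-B blocks


class RegionDirectory:
    """Toy region directory: region number -> clusters holding permission."""

    def __init__(self):
        self.sharers = {}
        self.requests = 0                  # permission requests received

    def acquire(self, region, cluster):
        self.requests += 1
        self.sharers.setdefault(region, set()).add(cluster)


class RegionBuffer:
    """Toy region buffer placed next to one cluster's L2 cache."""

    def __init__(self, cluster, directory):
        self.cluster = cluster
        self.directory = directory
        self.regions = set()               # regions this cluster holds permission for
        self.direct_accesses = 0           # misses served over the direct-access bus

    def l2_miss(self, addr):
        region = addr // REGION_BYTES
        if region not in self.regions:
            # Region miss: obtain permission over the coherence network first.
            self.directory.acquire(region, self.cluster)
            self.regions.add(region)
        # Permission held: fetch the block over the direct-access bus.
        self.direct_accesses += 1


directory = RegionDirectory()
gpu = RegionBuffer("gpu", directory)
for i in range(64):                        # stream through 64 consecutive blocks
    gpu.l2_miss(i * BLOCK_BYTES)
print(directory.requests, "permission requests for", gpu.direct_accesses, "block misses")
```

In this streaming pattern the 64 block misses span four regions, so the directory sees four requests while the remaining traffic bypasses it.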


OUTLINE

Motivation

Background

Heterogeneous System Bottlenecks

Heterogeneous System Coherence Details

Results

‒Speed-up

‒Latency of loads

‒Bandwidth

‒MSHR usage

Conclusions


THREE CACHE-COHERENCE PROTOCOLS

Broadcast: Null-directory that broadcasts on all requests

Baseline: Block-based, mostly inclusive, directory

HSC: Region-based directory with 1-KB region size

HSC PERFORMANCE

[Figure: chart of normalized speed-up (y-axis 0 to 5) for Broadcast, Baseline, and HSC across bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Largest slowdowns from constrained resources

DIRECTORY TRAFFIC REDUCTION

[Figure: chart of normalized directory bandwidth (y-axis 0 to 1.2) for broadcast, baseline, and HSC across bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Average bandwidth significantly reduced

Theoretical reduction from 16 block regions
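The theoretical figure can be worked out directly, under the assumption (not spelled out on the slide) that one region-permission request covers all 16 blocks of a 1-KB region and that every block in the region is then used:

```latex
\[
\frac{\text{directory requests with 16-block regions}}
     {\text{directory requests with per-block tracking}}
  = \frac{1}{16} = 0.0625,
\qquad
1 - \frac{1}{16} = 0.9375 \approx 94\%
\]
```

Workloads with less spatial locality touch fewer blocks per region, so their reduction would fall short of this bound.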

HSC RESOURCE USAGE

[Figure: chart of normalized directory MSHRs required (y-axis 0 to 0.25) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Maximum MSHRs significantly reduced


RESULTS SUMMARY

Used a detailed timing simulator for CPU and GPU

HSC significantly improves performance

‒Reduces the average load latency

‒Decreases bandwidth requirement of directory

HSC reduces the required MSHRs at the directory


RELATED WORK

Coarse-grained coherence

‒Region coherence

‒Applied to snooping systems [Cantin, ISCA 2005] [Moshovos, ISCA 2005] [Zebchuk, MICRO 2007]

‒ Extended to directories [Fang, PACT 2013] [Zebchuk, MICRO 2013]

‒Spatiotemporal coherence [Alisafaee, MICRO 2012]

‒Dual-grain directory coherence [Basu, UW-TR 2013]

‒Primarily focused on directory size

GPU coherence [Singh et al. HPCA 2013]

‒ Intra-GPU coherence


CONCLUSIONS

Hardware coherence can increase the utility of heterogeneous systems

Major bottlenecks in current coherence implementations

‒High bandwidth difficult to support at directory

‒Extreme resource requirements

We propose Heterogeneous System Coherence

‒Leverages spatial locality and region coherence

‒Reduces bandwidth by 94%

‒Reduces resource requirements by 95%


Questions? Contact: [email protected]


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

BACKUP SLIDES

LOAD LATENCY

[Figure: chart of normalized load latency (y-axis 0 to 4.5) for broadcast, baseline, and HSC across bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Average load time significantly reduced

EXECUTION TIME BREAKDOWN

[Figure: chart of execution time (%) split into GPU and CPU portions for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]