Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures

George L. Yuan, Ali Bakhoda and Tor M. Aamodt, Electrical and Computer Engineering, University of British Columbia. December 14th, 2009 (MICRO 2009)


Page 1: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

George L. Yuan, Ali Bakhoda and Tor M. Aamodt

Electrical and Computer Engineering, University of British Columbia

December 14th, 2009 (MICRO 2009)

Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures

Page 2: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

The Trend: DRAM Access Locality in Many-Core

Inside the interconnect, interleaving of memory request streams reduces the DRAM access locality seen by the memory controller.

Figure: DRAM access locality (axis annotated from "Bad" to "Good") versus number of cores (8, 16, 32, 64), comparing pre-interconnect access locality with post-interconnect access locality.

Page 3: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Today’s Solution: Out-of-Order Scheduling

Figure: a request queue (oldest to youngest) holding requests to Row A and Row B in front of a DRAM whose opened row is A; servicing a Row B request means switching the opened row to B.

Queue size needs to increase as the number of cores increases
Requires fully-associative logic
Circuit issues:
o Cycle time
o Area
o Power

Out-of-order (OoO) scheduling is OK for single-core, OK for multi-core, but for many-core...?
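To make the contrast between in-order and out-of-order scheduling concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Request` record, the function names and the example request stream are all hypothetical, and a single bank with one open row is assumed. The loop in `frfcfs_pick` compares every queue entry against the open row, which is the software analogue of the fully-associative comparison the slide refers to.

```python
from collections import namedtuple

# Hypothetical request record; list position encodes age (index 0 is oldest).
Request = namedtuple("Request", ["core", "row"])

def fifo_pick(queue, open_row):
    """In-order scheduling: always service the oldest request."""
    return 0

def frfcfs_pick(queue, open_row):
    """Out-of-order scheduling: prefer the oldest request that hits the
    open row, falling back to the oldest request overall."""
    for i, req in enumerate(queue):
        if req.row == open_row:
            return i
    return 0

def count_row_switches(requests, pick):
    """Drain the queue with the given policy and count row switches.
    The very first activation is not counted as a switch."""
    queue = list(requests)
    open_row = None
    switches = 0
    while queue:
        req = queue.pop(pick(queue, open_row))
        if req.row != open_row:
            if open_row is not None:
                switches += 1
            open_row = req.row
    return switches

# Two cores whose streams were interleaved on the way to the controller.
interleaved = [Request(0, "A"), Request(1, "B"), Request(0, "A"),
               Request(1, "B"), Request(0, "A")]

fifo_switches = count_row_switches(interleaved, fifo_pick)
frfcfs_switches = count_row_switches(interleaved, frfcfs_pick)
```

In this toy stream the in-order policy switches rows on almost every request, while the out-of-order policy groups the Row A requests before moving to Row B.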

Page 4: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

No prior work on memory access scheduling for 10,000+ threads

Related Work
Rixner, Dally, et al.
o First-Ready First-Come First-Serve (FRFCFS)
Patents by Intel, Nvidia, etc.
Mutlu & Moscibroda
o Stall-time Fair Memory
o Parallelism-Aware Batch Scheduling

Page 5: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Our Contributions
Show request stream interleaving in interconnect
First paper that considers the problem of DRAM scheduling for tens of thousands of threads
Integration of DRAM scheduling in interconnect, allowing for a more complexity-effective design
Achieves 91% of the performance of out-of-order scheduling with in-order scheduling for memory-limited applications

Page 6: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Outline
Introduction
Background on DRAM
The Request Interleaving Problem
Hold-Grant Interconnect Arbitration
Experimental Results
Conclusion

Page 7: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Example of many-core accelerator? GPUs

High FLOP capacity for high-resolution graphics
Nvidia’s GTX285: 30 8-wide multiprocessors
10,000’s of concurrent threads
Demand on memory system extremely high

Page 8: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Background: DRAM

Figure: a memory controller attached to DRAM; the diagram labels the memory array, row decoder, column decoder and row buffer (repeated for several banks).

Row Access: Activate a row of a DRAM bank and load it into the row buffer (slow)
Column Access: Read and write data in the row buffer (fast)
Precharge: Write row buffer data back into the row (slow)

Page 9: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Background: DRAM Row Access Locality

Definition: Number of accesses to a row between row switches

tRC = row cycle time
tRP = row precharge time
tRCD = row activate time

Figure: bank timeline showing column accesses to Row A (RA) and Row B (RB) separated by precharge (tRP) and activate (tRCD) operations; the precharge-plus-activate gap is labeled “row switch”, and tRC is marked across the timeline.

Higher row access locality => higher achievable DRAM bandwidth => higher performance
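The definition of row access locality can be turned into a small measurement. The sketch below is illustrative only: it assumes a single bank, takes the sequence of rows accessed in service order, and reports the average streak length (averaging over streaks is an assumption of this sketch; the slide only gives the per-streak definition). The function name and the two example streams are hypothetical.

```python
from itertools import groupby

def row_access_locality(rows):
    """Average number of consecutive accesses to a row between row switches."""
    streaks = [len(list(group)) for _, group in groupby(rows)]
    return sum(streaks) / len(streaks) if streaks else 0.0

# The same eight requests, before and after being interleaved.
before_interconnect = ["A", "A", "A", "A", "B", "B", "B", "B"]
after_interconnect = ["A", "B", "A", "B", "A", "B", "A", "B"]

locality_before = row_access_locality(before_interconnect)
locality_after = row_access_locality(after_interconnect)
```

With these toy inputs the first stream has two streaks of four accesses each, while the interleaved stream switches rows on every access.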

Page 10: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

The Request Interleaving Problem

Page 11: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

FR-FCFS vs FIFO

FRFCFS vs FIFO: Almost 2x Speedup

Figure: IPC (0 to 200) per benchmark (fwt, lib, mum, neu, nn, ray, red, sp, wp, HM) for FIFO and FR-FCFS.

Page 12: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Alternative Solution: Banked FIFO for Bank-level Parallelism

Figure: a single FIFO compared with a banked FIFO that has one FIFO per DRAM bank (banks 0 to 3).

~23% speedup over FIFO
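A rough way to see why a banked FIFO helps is to model head-of-line blocking. The following Python sketch is a toy model under stated assumptions, not the simulated design: each request is just a bank index, a bank stays busy for a fixed number of cycles per request, at most one request issues per cycle, and the banked variant scans banks in index order rather than by request age. The function name and all parameter values are hypothetical.

```python
from collections import deque

def drain_cycles(requests, num_banks, busy, banked):
    """Cycles needed to issue all requests.

    requests: list of bank indices in arrival order.
    busy: cycles a bank is occupied after a request issues to it.
    banked: one FIFO per bank if True, a single shared FIFO otherwise.
    """
    if banked:
        queues = [deque() for _ in range(num_banks)]
        for bank in requests:
            queues[bank].append(bank)
    else:
        queues = [deque(requests)]
    free_at = [0] * num_banks  # cycle at which each bank becomes idle
    cycle = 0
    while any(queues):
        for q in queues:
            # Only the head of each FIFO may issue (in-order per queue).
            if q and free_at[q[0]] <= cycle:
                bank = q.popleft()
                free_at[bank] = cycle + busy
                break
        cycle += 1
    return cycle

stream = [0, 0, 1, 1]  # two requests to bank 0, then two to bank 1
single_fifo_cycles = drain_cycles(stream, num_banks=2, busy=4, banked=False)
banked_fifo_cycles = drain_cycles(stream, num_banks=2, busy=4, banked=True)
```

In the single FIFO, the second bank-0 request blocks the bank-1 requests behind it; with per-bank FIFOs the two banks overlap their busy periods.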

Page 13: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Our Solution: Hold-grant interconnection arbitration policies

“Hold Grant” (HG): Previously granted input has highest priority

“Row-Matching Hold Grant” (RMHG): Previously granted input has highest priority if requested row matches previously requested row
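The two policies can be sketched as a small arbiter that merges per-input request queues into one output stream. This is a simplified illustration, not the router design from the talk: requests are reduced to row names, a single output port is modeled, and round-robin is assumed as the fallback whenever the hold condition does not apply. The function name `arbitrate` and the policy strings are hypothetical.

```python
from collections import deque

def arbitrate(inputs, policy):
    """Merge per-input request queues into one output stream.

    policy: "rr"   plain round-robin,
            "hg"   hold grant (previously granted input keeps priority),
            "rmhg" row-matching hold grant (keeps priority only while the
                   requested row matches the previously requested row).
    """
    queues = [deque(q) for q in inputs]
    n = len(queues)
    out = []
    last = None       # previously granted input
    last_row = None   # row of the previously granted request
    rr_next = 0       # round-robin pointer
    while any(queues):
        hold = policy in ("hg", "rmhg") and last is not None and bool(queues[last])
        if hold and policy == "rmhg":
            hold = queues[last][0] == last_row
        if hold:
            grant = last
        else:
            # Fallback (assumed): next non-empty input in round-robin order.
            order = ((rr_next + k) % n for k in range(n))
            grant = next(i for i in order if queues[i])
        last_row = queues[grant].popleft()
        out.append(last_row)
        last = grant
        rr_next = (grant + 1) % n
    return out

inputs = [["A", "A", "B"], ["X", "X", "Y"]]
rr_stream = arbitrate(inputs, "rr")
hg_stream = arbitrate(inputs, "hg")
rmhg_stream = arbitrate(inputs, "rmhg")
```

Round-robin interleaves the two inputs request by request, hold grant drains one input before moving on, and row-matching hold grant releases the output as soon as the held input changes rows.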

Page 14: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Interconnect Arbitration Policy: Round-Robin

Figure: a router (ports N, W, E, S) arbitrating among request streams to Rows A, B, C, X and Y destined for Memory Controller 0 and Memory Controller 1, under round-robin arbitration.

Page 15: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Interconnect Arbitration Policy: HG

Figure: a router (ports N, W, E, S) arbitrating among request streams to Rows A, B, C, X and Y destined for Memory Controller 0 and Memory Controller 1, under Hold Grant arbitration.

Page 16: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Interconnect Arbitration Policy: RMHG

Figure: a router (ports N, W, E, S) arbitrating among request streams to Rows A, B, C, X and Y destined for Memory Controller 0 and Memory Controller 1, under Row-Matching Hold Grant arbitration.

Page 17: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Complexity Comparison

Scheme               Complexity
FRFCFS               3584 bits compared
BFIFO+HG (XBAR)      224 bits stored and compared
BFIFO+RMHG (XBAR)    608 bits stored, 320 bits compared
BFIFO+HMHG4 (XBAR)   320 bits stored, 320 bits compared

For 32-entry queues: 15x reduction in bit comparisons, reduction from 32-way associative to direct mapped

Page 18: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Methodology: Microarchitecture Parameters

Parameter                                              Value
Shader cores                                           28
Threads per shader core                                1024
Maximum supported in-flight requests per shader core   64
Number of DRAM controllers                             8
DRAM controller scheduler                              FIFO, Banked FIFO, First-Ready First-Come First-Serve (FRFCFS)
GDDR3 memory timing                                    tCL=9, tRP=13, tRC=34, tRAS=21, tRCD=12, tRRD=8
Topologies swept                                       Crossbar, Mesh, Ring
Queue sizes swept                                      8, 16, 32, 64
Number of virtual channels swept                       1, 2, 4

Page 19: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Methodology: Simulator

GPGPU-Sim: A massively multithreaded architecture performance simulator (www.gpgpu-sim.org)
Supports NVIDIA’s Compute Unified Device Architecture (CUDA) framework
Simulates Parallel Thread Execution (PTX) instructions

Page 20: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Results: IPC Normalized to FR-FCFS

Crossbar network, 28 shader cores, 8 DRAM controllers, 8-entry DRAM queues:
BFIFO: 14% speedup over regular FIFO
BFIFO+HG: 18% speedup over BFIFO, reaching 91% of FRFCFS performance

Figure: IPC normalized to FR-FCFS (0% to 100%) per benchmark (fwt, lib, mum, neu, nn, ray, red, sp, wp, HM) for FIFO, BFIFO, BFIFO+HG, BFIFO+HMHG4 and FR-FCFS.

Page 21: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Row Streak Breakers

Figure: a memory controller queue (oldest to youngest) in front of the DRAM, holding requests from Core 1 and Core 2 to Row A, Row A, Row B, Row C and Row A. The diagram labels a “row streak” of Row A requests, the “row streak breakers” that interrupt it, and a “stranded request”.

Page 22: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Row Streak Breakers

Figure: row streak breaker classification (0% to 100%) per benchmark (fwt, lib, mum, neu, nn, ray, red, sp, wp), split into breakers from the same core and from a different core, with “good” and “bad” annotations.

B = banked FIFO; H = banked FIFO + Hold Grant

Arithmetic mean average reduction: 73%
Harmonic mean average reduction: 96%

Page 23: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Conclusion
Show request stream interleaving in interconnect
o Effect gets worse as the number of cores increases
First paper that considers the problem of DRAM scheduling for tens of thousands of threads
o No prior work on memory scheduling for many-core
Integration of DRAM scheduling in interconnect, allowing for a more complexity-effective design
o Should allow for faster clock speeds, power/area savings
Achieves 91% of the performance of out-of-order scheduling with in-order scheduling for memory-limited applications

Page 24: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Future Work

Improve upon our memory scheduler design
Evaluate performance of graphics applications
Design a hold-grant scheme that works in conjunction with multiple-virtual-channel deadlock avoidance schemes for torus networks
Synthesize, lay out, and use SPICE to determine actual power/area overheads and cycle time

Page 25: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Thank you

Page 26: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Methodology: Microarchitecture Parameters

Page 27: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

FR-FCFS vs FIFO

George Yuan, Supervisor: Dr. Tor Aamodt, University of British Columbia

Need out-of-order scheduling inside the DRAM controller to improve row access locality of requests to DRAM chips

FIFO vs FRFCFS: 46.8% Slowdown

Figure: speedup (0% to 100%) per benchmark (fwt, lib, mum, neu, nn, ray, red, sp, wp, HM) for the XBAR, MESH and RING topologies.
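The 46.8% slowdown quoted on this slide and the "almost 2x speedup" quoted on the earlier FR-FCFS vs FIFO slide can be related with a short calculation. It assumes, since the slides do not state the definition, that slowdown is the relative IPC loss of FIFO with respect to FR-FCFS:

```latex
\[
\text{slowdown} = 1 - \frac{\mathrm{IPC}_{\text{FIFO}}}{\mathrm{IPC}_{\text{FR-FCFS}}} = 0.468
\quad\Longrightarrow\quad
\frac{\mathrm{IPC}_{\text{FR-FCFS}}}{\mathrm{IPC}_{\text{FIFO}}} = \frac{1}{1 - 0.468} \approx 1.88
\]
```

Under that assumption the two figures describe the same gap from opposite directions.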

Page 28: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Varying Topology

Ring networks require multiple virtual channels for deadlock avoidance
Multiple virtual channels = path diversity
Path diversity => requests arrive out of order = interleaving

Page 29: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Multiple Virtual Channels: Dynamic Virtual Channel Allocation

Figure: requests to Row A, Row A, Row B and Row X travel from source to destination through a router with two virtual channels (VC0, VC1) under congestion, with virtual channels allocated dynamically.

Page 30: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Multiple Virtual Channels: Static Virtual Channel Allocation

Figure: requests to Row A, Row A, Row B and Row X travel from source to destination through a router with two virtual channels (VC0, VC1) under congestion, with virtual channels allocated statically.

Page 31: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

SVCA vs DVCA (static vs dynamic virtual channel allocation)

Harmonic mean IPC for different virtual channel configurations

SVCA achieves up to 18.5% speedup over DVCA

Page 32: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Benchmarks

Abr.   Benchmark
FWT    Fast Walsh Transform
LIB    LIBOR Monte Carlo
MUM    MUMmerGPU
NEU    Neural Network Digit Recognition
NN     Nearest Neighbor
RAY    Ray Tracing
RED    Reduction
WP     Weather Prediction

Page 33: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Sensitivity Analysis

Varying DRAM Controller Queue Size
Varying Topology

Page 34: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

More Results

Memory Latency: 33.9% reduction for HG and 35.3% reduction for HMHG4 compared to BFIFO
DRAM Efficiency: 15.1% improvement for HG and HMHG4 over BFIFO

Page 35: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

Row Access Locality Reduction After Interconnect

44% for Crossbar, 48% for Mesh, 52% for Ring

Page 36: Complexity-Effective Memory Access Scheduling for  Many-Core Accelerator Architectures

DRAM Parameters