Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures

George L. Yuan, Ali Bakhoda and Tor M. Aamodt

Electrical and Computer Engineering, University of British Columbia

December 14th, 2009 (MICRO 2009)


The Trend: DRAM Access Locality in Many-Core



Inside the interconnect, interleaving of memory request streams reduces the DRAM access locality seen by the memory controller

[Figure: DRAM access locality (from good to bad) versus number of cores (8, 16, 32, 64), comparing pre-interconnect access locality with post-interconnect access locality.]

Today’s Solution: Out-of-Order Scheduling

[Figure: a request queue, ordered oldest to youngest, holding requests to Row A and Row B in front of the DRAM. The opened row starts as Row A and is switched to Row B.]

Queue size needs to increase as the number of cores increases

Requires fully-associative logic

Circuit issues:
o Cycle time
o Area
o Power

Out-of-order (OoO) scheduling is OK for single-core and OK for multi-core, but for many-core..?



No prior work for memory access scheduling for 10,000+ threads

Related Work
Rixner, Dally, et al.
o First-Ready First-Come First-Serve (FRFCFS)
Patents by Intel, Nvidia, etc.
Mutlu & Moscibroda
o Stall-time Fair Memory
o Parallelism-Aware Batch Scheduling




Our Contributions
Show request stream interleaving in interconnect
First paper that considers problem of DRAM scheduling for tens of thousands of threads
Integration of DRAM scheduling in interconnect, allowing for more complexity-effective design
Achieves 91% of performance of out-of-order scheduling with in-order scheduling for memory-limited applications




Outline
Introduction
Background on DRAM
The Request Interleaving Problem
Hold-Grant Interconnect Arbitration
Experimental Results
Conclusion


Example of many-core accelerator? GPUs

High FLOP capacity for high resolution graphics
Nvidia’s GTX285: 30 8-wide multiprocessors
10,000s of concurrent threads
Demand on memory system extremely high



Background: DRAM

[Figure: DRAM organization: a memory controller connected to DRAM banks, each with a memory array, row decoder, column decoder, and row buffer.]

Row Access: Activate a row of DRAM bank and load into row buffer (slow)

Column Access: Read and write data in row buffer (fast)

Precharge: Write row buffer data back into row (slow)
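
The three operations above can be sketched as a toy model of a single bank. This is only an illustration under stated assumptions: the `Bank` class and `total_cost` helper are hypothetical names, the precharge and activate costs reuse the tRP=13 and tRCD=12 values from the GDDR3 timing listed later in these slides, and the column access cost of 1 is an arbitrary stand-in for "fast".

```python
from dataclasses import dataclass
from typing import Optional

# Relative costs, for illustration only (assumptions, not measured values).
PRECHARGE_COST = 13  # write the row buffer back into the row (slow)
ACTIVATE_COST = 12   # load a row into the row buffer (slow)
COLUMN_COST = 1      # read or write data already in the row buffer (fast)


@dataclass
class Bank:
    open_row: Optional[int] = None  # row currently held in the row buffer

    def access(self, row: int) -> int:
        """Return the cost of one access to `row`, updating the row buffer."""
        cost = 0
        if self.open_row != row:
            if self.open_row is not None:
                cost += PRECHARGE_COST  # close the currently open row
            cost += ACTIVATE_COST       # open the requested row
            self.open_row = row
        return cost + COLUMN_COST


def total_cost(rows):
    """Total cost of serving `rows` in order on one initially idle bank."""
    bank = Bank()
    return sum(bank.access(r) for r in rows)


print(total_cost([0, 0, 1, 1]))  # grouped by row: 41
print(total_cost([0, 1, 0, 1]))  # alternating rows: 91
```

The same four requests cost more than twice as much in this model when they alternate between rows, which is the effect the rest of the talk is concerned with.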

Background: DRAM Row Access Locality

Definition: Number of accesses to a row between row switches

tRC = row cycle time
tRP = row precharge time
tRCD = row activate time

[Figure: timing diagram for one bank: column accesses to Row A (RA), then precharge Row A (tRP) and activate Row B (tRCD), marked as a “row switch”, followed by column accesses to Row B (RB); tRC spans a full row cycle.]

Higher row access locality => higher achievable DRAM bandwidth => higher performance
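
As a rough illustration of the definition above, the metric can be computed over the sequence of rows a single bank sees. The function name `row_access_locality` is hypothetical, and averaging over all row streaks is an assumption about how the per-streak counts are summarized.

```python
def row_access_locality(rows):
    """Average number of accesses to a row between row switches.

    `rows` is the sequence of row identifiers seen by one DRAM bank,
    in the order the requests are serviced.
    """
    if not rows:
        return 0.0
    streaks = 1  # the first access always starts a streak
    for prev, cur in zip(rows, rows[1:]):
        if cur != prev:
            streaks += 1  # a row switch starts a new streak
    return len(rows) / streaks


print(row_access_locality(["A", "A", "A", "A", "B", "B"]))  # 3.0
print(row_access_locality(["A", "B", "A", "B", "A", "A"]))  # 1.2
```

Both sequences contain the same six requests; only their order differs.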



The Request Interleaving Problem




FR-FCFS vs FIFO

FRFCFS vs FIFO: Almost 2x Speedup


[Figure: IPC (0 to 200) of FIFO and FR-FCFS for benchmarks fwt, lib, mum, neu, nn, ray, red, sp, wp, and their harmonic mean (HM).]
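
To make the difference between the two schedulers concrete, here is a minimal sketch of the service order each would produce for one bank. It is a simplification under several assumptions: all requests are already in the queue, the queue is unbounded, timing is ignored, and the names `frfcfs_order` and `row_switches` are hypothetical.

```python
def frfcfs_order(queue, open_row=None):
    """Service order under a simplified first-ready, first-come first-serve rule.

    `queue` lists requested rows from oldest to youngest. The oldest request
    that hits the currently open row goes first; if none hits, the oldest
    request overall is served and its row becomes the open row.
    """
    pending = list(queue)
    served = []
    while pending:
        hit = next((i for i, row in enumerate(pending) if row == open_row), None)
        idx = hit if hit is not None else 0  # fall back to the oldest request
        open_row = pending.pop(idx)
        served.append(open_row)
    return served


def row_switches(order):
    """Count how many times consecutive requests target different rows."""
    return sum(1 for a, b in zip(order, order[1:]) if a != b)


queue = ["A", "B", "A", "B", "A"]          # interleaved request stream
print(row_switches(queue))                 # FIFO serves in arrival order: 4
print(frfcfs_order(queue))                 # ['A', 'A', 'A', 'B', 'B']
print(row_switches(frfcfs_order(queue)))   # 1
```

Finding the oldest row hit requires comparing every queue entry against the open row, which is the fully-associative logic the earlier slide flags as a cost.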


Alternative Solution: Banked FIFO for Bank-level Parallelism


[Figure: a single FIFO compared with a banked FIFO that has one FIFO per DRAM bank (Banks 0 to 3).]

~23% speedup over FIFO
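
The benefit of banking can be sketched with a small cycle-counting model. Everything here is an assumption made for illustration: each request occupies its bank for a fixed `BANK_BUSY` cycles, at most one request issues per cycle, and the function names are hypothetical.

```python
from collections import deque

BANK_BUSY = 4  # hypothetical number of cycles a bank is occupied per request


def single_fifo_cycles(bank_ids, num_banks=4):
    """One shared FIFO: only the head request may issue, so a busy bank
    blocks every request queued behind it."""
    queue = deque(bank_ids)
    free_at = [0] * num_banks
    cycle = 0
    while queue:
        bank = queue[0]
        if free_at[bank] <= cycle:
            queue.popleft()
            free_at[bank] = cycle + BANK_BUSY
        cycle += 1
    return max(free_at)


def banked_fifo_cycles(bank_ids, num_banks=4):
    """One FIFO per bank: the head of any FIFO whose bank is free may issue."""
    fifos = [deque() for _ in range(num_banks)]
    for bank in bank_ids:
        fifos[bank].append(bank)
    free_at = [0] * num_banks
    cycle = 0
    while any(fifos):
        for bank in range(num_banks):
            if fifos[bank] and free_at[bank] <= cycle:
                fifos[bank].popleft()
                free_at[bank] = cycle + BANK_BUSY
                break  # at most one request issues per cycle
        cycle += 1
    return max(free_at)


requests = [0, 0, 1, 1]  # bank index of each request, oldest first
print(single_fifo_cycles(requests))  # 13
print(banked_fifo_cycles(requests))  # 9
```

Each per-bank queue is still strictly in order, so no associative search is needed; the gain comes only from overlapping work across banks.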


Our Solution

Hold-grant interconnection arbitration policies

“Hold Grant” (HG): Previously granted input has highest priority

“Row-Matching Hold Grant” (RMHG): Previously granted input has highest priority if requested row matches previously requested row
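
The two policies can be contrasted with round-robin in a short sketch of one router output port. This is an illustrative model, not the design evaluated in the talk: the function `arbitrate`, its arguments, and the choice to fall back to round-robin whenever the grant is not held are all assumptions.

```python
from collections import deque


def arbitrate(inputs, policy="RR"):
    """Return the order in which requests leave one router output port.

    inputs: list of per-input request lists; each request is a row label.
    policy: "RR" (round-robin), "HG" (hold grant), or
            "RMHG" (row-matching hold grant).
    """
    queues = [deque(q) for q in inputs]
    n = len(queues)
    out = []
    last_input, last_row = None, None
    next_rr = 0  # round-robin pointer
    while any(queues):
        grant = None
        if last_input is not None and queues[last_input]:
            head = queues[last_input][0]
            if policy == "HG" or (policy == "RMHG" and head == last_row):
                grant = last_input  # previously granted input keeps priority
        if grant is None:
            # Otherwise pick the next non-empty input in round-robin order.
            for k in range(n):
                cand = (next_rr + k) % n
                if queues[cand]:
                    grant = cand
                    break
        last_row = queues[grant].popleft()
        last_input = grant
        next_rr = (grant + 1) % n
        out.append(last_row)
    return out


streams = [["A", "A", "B"], ["X", "X", "Y"]]
print(arbitrate(streams, "RR"))    # ['A', 'X', 'A', 'X', 'B', 'Y']
print(arbitrate(streams, "HG"))    # ['A', 'A', 'B', 'X', 'X', 'Y']
print(arbitrate(streams, "RMHG"))  # ['A', 'A', 'X', 'X', 'B', 'Y']
```

In this model both hold-grant variants keep same-row requests from one input adjacent, whereas round-robin interleaves the two streams.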



[Figure: a router with N, W, E, and S ports.]

Interconnect Arbitration Policy: Round-Robin


[Figure: round-robin arbitration example: requests to RowA, RowB, RowC, RowX, and RowY pass through the interconnect on their way to Memory Controller 0 and Memory Controller 1.]

Interconnect Arbitration Policy: HG


[Figure: Hold Grant (HG) arbitration example at a router with N, W, E, and S ports: requests to RowA, RowB, RowC, RowX, and RowY on their way to Memory Controller 1.]

Interconnect Arbitration Policy: RMHG


[Figure: Row-Matching Hold Grant (RMHG) arbitration example at a router with N, W, E, and S ports: requests to RowA, RowB, RowC, RowX, and RowY on their way to Memory Controller 1.]

Complexity Comparison

Scheme               Complexity
FRFCFS               3584 bits compared
BFIFO+HG (XBAR)      224 bits stored and compared
BFIFO+RMHG (XBAR)    608 bits stored, 320 bits compared
BFIFO+HMHG4 (XBAR)   320 bits stored, 320 bits compared



For 32 entry queues: 15x reduction in bit comparisons, reduction from 32-way associative to direct mapped

Methodology: Microarchitecture Parameters

Shader cores                                           28
Threads per shader core                                1024
Maximum supported in-flight requests per shader core   64
Number of DRAM controllers                             8
DRAM controller scheduler                              FIFO, Banked FIFO, First-Ready First-Come First-Serve (FRFCFS)
GDDR3 memory timing                                    tCL=9, tRP=13, tRC=34, tRAS=21, tRCD=12, tRRD=8
Topologies swept                                       Crossbar, Mesh, Ring
Queue sizes swept                                      8, 16, 32, 64
Number of virtual channels swept                       1, 2, 4
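
A quick sanity check on the GDDR3 timing values can be written as a few lines of arithmetic. The unit (memory clock cycles) is an assumption, since the slide does not state it, and the variable names are hypothetical.

```python
# GDDR3 timing parameters as listed in the parameter table.
# Assumed to be in memory clock cycles; the slide does not give the unit.
timing = {"tCL": 9, "tRP": 13, "tRC": 34, "tRAS": 21, "tRCD": 12, "tRRD": 8}

# Standard DRAM relation: row cycle time = row active time + precharge time.
assert timing["tRC"] == timing["tRAS"] + timing["tRP"]

# Cycles spent closing one row and opening another before the first
# column access to the new row can begin.
row_switch_overhead = timing["tRP"] + timing["tRCD"]
print(row_switch_overhead)  # 25
```

The 25-cycle gap per row switch is what row access locality amortizes: the more column accesses each opened row serves, the smaller the share of time spent on precharge and activate.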



Methodology: Simulator

GPGPU-Sim: A massively multithreaded architecture performance simulator (www.gpgpu-sim.org)

Supports NVIDIA’s Compute Unified Device Architecture (CUDA) framework

Simulates Parallel Thread Execution (PTX) instructions



Results – IPC Normalized to FR-FCFS

Crossbar network, 28 shader cores, 8 DRAM controllers, 8-entry DRAM queues:
BFIFO: 14% speedup over regular FIFO
BFIFO+HG: 18% speedup over BFIFO, reaching 91% of FRFCFS performance


[Figure: IPC normalized to FR-FCFS (0% to 100%) for FIFO, BFIFO, BFIFO+HG, BFIFO+HMHG4, and FR-FCFS across fwt, lib, mum, neu, nn, ray, red, sp, wp, and HM.]

Row Streak Breakers



[Figure: requests from Core 1 and Core 2 to RowA, RowA, RowB, RowC, and RowA in the memory controller queue (oldest to youngest) in front of the DRAM, with labels for a “row streak”, row streak breakers, and a stranded request.]

Row Streak Breakers

[Figure: row streak breaker classification (0% to 100%), split into same core and different core, for fwt, lib, mum, neu, nn, ray, red, sp, and wp, with regions marked “bad” and “good”.]

B = banked FIFO; H = banked FIFO + Hold Grant

Arithmetic mean average reduction: 73%
Harmonic mean average reduction: 96%
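
A same-core versus different-core classification of this kind could be computed along the following lines. The definition used here is an assumption, since the slides show the classification only as a chart: a request counts as a row streak breaker when its row differs from the row of the request just before it in the memory controller queue, and it is classified by whether both requests come from the same core. The function name `classify_breakers` is hypothetical.

```python
def classify_breakers(requests):
    """Count row streak breakers by origin.

    `requests` is a list of (core, row) pairs in the order they reach one
    memory controller queue. Assumed definition: a request is a breaker if
    its row differs from the previous request's row.
    """
    counts = {"same_core": 0, "different_core": 0}
    for (prev_core, prev_row), (core, row) in zip(requests, requests[1:]):
        if row != prev_row:  # this request ends the current row streak
            key = "same_core" if core == prev_core else "different_core"
            counts[key] += 1
    return counts


stream = [(1, "A"), (1, "A"), (2, "B"), (2, "C"), (1, "A")]
print(classify_breakers(stream))  # {'same_core': 1, 'different_core': 2}
```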



Conclusion
Show request stream interleaving in interconnect
o Effect gets worse as number of cores increases
First paper that considers problem of DRAM scheduling for tens of thousands of threads
o No prior work on memory scheduling for many-core
Integration of DRAM scheduling in interconnect, allowing for more complexity-effective design
o Should allow for faster clock speeds, power/area savings
Achieves 91% of performance of out-of-order scheduling with in-order scheduling for memory-limited applications



Future Work

Improve upon our memory scheduler design
Evaluate performance of graphics applications
Design a hold-grant scheme that works in conjunction with multiple virtual channel deadlock avoidance schemes for torus networks
Synthesize, lay out, and use SPICE to determine actual power/area overheads and cycle time




Thank you







FR-FCFS vs FIFO

Need out-of-order scheduling inside DRAM controller to improve row access locality of requests to DRAM chips

FIFO vs FRFCFS: 46.8% Slowdown



[Figure: speedup (0% to 100%) on crossbar (XBAR), mesh, and ring topologies across fwt, lib, mum, neu, nn, ray, red, sp, wp, and HM.]

Varying Topology




Ring networks require multiple virtual channels for deadlock avoidance

Multiple virtual channels = path diversity

Path diversity => requests arrive out of order = interleaving

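Multiple Virtual Channels: Dynamic Virtual Channel Allocation

[Figure: a router with virtual channels VC0 and VC1 on a congested path from source to destination, carrying requests to Row A, Row A, Row B, and Row X under dynamic virtual channel allocation.]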

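Multiple Virtual Channels: Static Virtual Channel Allocation

[Figure: a router with virtual channels VC0 and VC1 on a congested path from source to destination, carrying requests to Row A, Row A, Row B, and Row X under static virtual channel allocation.]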

SVCA vs DVCA (static vs dynamic virtual channel allocation)

Harmonic mean IPC for different virtual channel configurations

SVCA achieves up to 18.5% speedup over DVCA




Benchmarks

Abr.   Benchmark
FWT    Fast Walsh Transform
LIB    LIBOR Monte Carlo
MUM    MUMmerGPU
NEU    Neural Network Digit Recognition
NN     Nearest Neighbor
RAY    Ray Tracing
RED    Reduction
SP     Scalar Product
WP     Weather Prediction

Sensitivity Analysis



Varying DRAM Controller Queue Size

Varying Topology



More Results

Memory Latency: 33.9% reduction for HG and 35.3% reduction for HMHG4 compared to BFIFO

DRAM Efficiency: 15.1% improvement for HG and HMHG4 over BFIFO





Row Access Locality Reduction After Interconnect

44% for Crossbar, 48% for Mesh, 52% for Ring




DRAM Parameters


