Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures
DESCRIPTION
Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures. George L. Yuan, Ali Bakhoda and Tor M. Aamodt, Electrical and Computer Engineering, University of British Columbia. December 14th, 2009 (MICRO 2009). The Trend: DRAM Access Locality in Many-Core.
TRANSCRIPT
George L. Yuan, Ali Bakhoda and Tor M. Aamodt
Electrical and Computer Engineering, University of British Columbia
December 14th, 2009 (MICRO 2009)
Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures
The Trend: DRAM Access Locality in Many-Core
Inside the interconnect, interleaving of memory request streams reduces the DRAM access locality seen by the memory controller
[Figure: DRAM access locality (good to bad) versus number of cores (8, 16, 32, 64), before and after the interconnect; series: pre-interconnect access locality, post-interconnect access locality]
Today’s Solution: Out-of-Order Scheduling
[Figure: request queue, oldest to youngest, holding requests to Row A and Row B in front of the DRAM; the opened row switches from A to B]
Queue size needs to increase as the number of cores increases
Requires fully-associative logic
Circuit issues:
o Cycle time
o Area
o Power
OoO is OK for single-core and OK for multi-core, but for many-core...?
Related Work
Rixner, Dally, et al.
o First-Ready First-Come First-Serve (FRFCFS)
Patents by Intel, Nvidia, etc.
Mutlu & Moscibroda
o Stall-time Fair Memory
o Parallelism-Aware Batch Scheduling
No prior work on memory access scheduling for 10,000+ threads
Our Contributions
Show request stream interleaving in interconnect
First paper that considers problem of DRAM scheduling for tens of thousands of threads
Integration of DRAM scheduling in interconnect, allowing for more complexity-effective design
Achieves 91% of performance of out-of-order scheduling with in-order scheduling for memory-limited applications
Outline
Introduction
Background on DRAM
The Request Interleaving Problem
Hold-Grant Interconnect Arbitration
Experimental Results
Conclusion
Example of many-core accelerator? GPUs
High FLOP capacity for high-resolution graphics
Nvidia’s GTX285: 30 8-wide multiprocessors
10,000s of concurrent threads
Demand on memory system extremely high
Background: DRAM
[Figure: memory controller connected to DRAM banks, each with a row decoder, memory array, row buffer, and column decoder]
Row Access: Activate a row of DRAM bank and load into row buffer (slow)
Column Access: Read and write data in row buffer (fast)
Precharge: Write row buffer data back into row (slow)
Background: DRAM Row Access Locality
Definition: Number of accesses to a row between row switches
tRC = row cycle time
tRP = row precharge time
tRCD = row activate time
[Figure: bank timeline showing accesses to Row A, precharge of Row A (tRP), activation of Row B (tRCD), accesses to Row B, and precharge of Row B, with tRC and the “row switch” marked]
Higher row access locality => higher achievable DRAM bandwidth => higher performance
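The definition above can be made concrete with a small sketch. This is only an illustration, not the authors' measurement code: the function names, the two example streams, and the one-request-at-a-time round-robin merge are all assumptions. It computes the average number of accesses per opened row for a single core's stream and for the same streams after they are interleaved.

```python
from itertools import chain, zip_longest

def row_access_locality(rows):
    """Average number of accesses to a row between row switches."""
    if not rows:
        return 0.0
    activations = 1  # the first access always opens a row
    for prev, cur in zip(rows, rows[1:]):
        if cur != prev:
            activations += 1
    return len(rows) / activations

def round_robin_interleave(*streams):
    """Merge per-core streams one request at a time (assumed arbiter model)."""
    sentinel = object()
    merged = chain.from_iterable(zip_longest(*streams, fillvalue=sentinel))
    return [r for r in merged if r is not sentinel]

core0 = ["A"] * 4  # hypothetical stream: four accesses to Row A
core1 = ["B"] * 4  # hypothetical stream: four accesses to Row B
merged = round_robin_interleave(core0, core1)

print(row_access_locality(core0))   # 4.0
print(row_access_locality(merged))  # 1.0
```

In this toy case each core's own stream has a locality of 4, while the interleaved stream seen after the merge drops to 1.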
The Request Interleaving Problem
FR-FCFS vs FIFO
FRFCFS vs FIFO: Almost 2x Speedup
[Figure: IPC (0 to 200) of FIFO and FR-FCFS for benchmarks fwt, lib, mum, neu, nn, ray, red, sp, wp, and HM]
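A rough sketch of why the two schedulers differ, under simplifying assumptions: one bank, all requests already queued, and no timing model. The function names and the example queue are hypothetical. FIFO serves requests strictly in arrival order, while the simplified FR-FCFS below compares every queued request against the open row and serves the oldest hit first.

```python
def fifo_order(queue):
    """In-order scheduling: serve requests strictly oldest first."""
    return list(queue)

def frfcfs_order(queue, open_row=None):
    """First-ready first-come first-serve for one bank (simplified)."""
    pending = list(queue)  # index 0 is the oldest request
    served = []
    while pending:
        # Compare every queued request against the open row and take the
        # oldest hit; if nothing hits, fall back to the oldest request.
        idx = next((i for i, row in enumerate(pending) if row == open_row), 0)
        open_row = pending.pop(idx)
        served.append(open_row)
    return served

def row_activations(order):
    """Count how many times a new row has to be opened."""
    count, open_row = 0, None
    for row in order:
        if row != open_row:
            count += 1
            open_row = row
    return count

queue = ["A", "B", "A", "B", "A", "C"]  # hypothetical interleaved requests
print(row_activations(fifo_order(queue)))    # 6
print(row_activations(frfcfs_order(queue)))  # 3
```

The scan over all pending entries stands in for the fully-associative comparison that out-of-order scheduling requires in hardware.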
Alternative Solution: Banked FIFO for Bank-level Parallelism
[Figure: a single FIFO compared with a banked FIFO that has one FIFO per DRAM bank (banks 0 to 3)]
~23% speedup over FIFO
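A minimal sketch of the banked FIFO idea, assuming one in-order queue per DRAM bank. The class name, the `issue` interface, and the notion of a caller-supplied list of ready banks are illustrative assumptions rather than details from the slides.

```python
from collections import deque

class BankedFifo:
    """One in-order queue per DRAM bank (illustrative model)."""

    def __init__(self, num_banks):
        self.queues = [deque() for _ in range(num_banks)]

    def enqueue(self, bank, row):
        self.queues[bank].append(row)

    def issue(self, ready_banks):
        """Pop the head request of each non-empty bank that is ready."""
        issued = []
        for bank in ready_banks:
            if self.queues[bank]:
                issued.append((bank, self.queues[bank].popleft()))
        return issued

bq = BankedFifo(num_banks=4)
bq.enqueue(0, "A")
bq.enqueue(0, "B")
bq.enqueue(1, "X")

# Suppose bank 0 is busy switching rows and only bank 1 is ready.
print(bq.issue(ready_banks=[1]))  # [(1, 'X')]
```

With a single FIFO, the request to bank 1 would wait behind both requests to bank 0; with per-bank queues it can proceed, although requests within a bank are still served in order.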
Our Solution
Hold-grant interconnection arbitration policies
“Hold Grant” (HG): Previously granted input has highest priority
“Row-Matching Hold Grant” (RMHG): Previously granted input has highest priority if requested row matches previously requested row
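The two policies can be sketched against a round-robin baseline. This is a simplified model under stated assumptions: a single router output, one whole request granted per cycle, and no fairness or starvation limit. The function `arbitrate` and its policy strings are hypothetical names, not taken from the paper.

```python
from collections import deque

def arbitrate(inputs, policy="rr"):
    """Drain per-input request queues toward one output, one grant per cycle.

    inputs: list of lists of row ids, one list per router input.
    policy: "rr" (round-robin), "hg" (hold grant),
            or "rmhg" (row-matching hold grant).
    Returns the order in which rows reach the memory controller.
    """
    queues = [deque(q) for q in inputs]
    n = len(queues)
    delivered = []
    last_input, last_row = None, None
    rr_next = 0
    while any(queues):
        # Hold grant: the previously granted input keeps highest priority.
        hold = policy in ("hg", "rmhg")
        hold = hold and last_input is not None and bool(queues[last_input])
        if hold and policy == "rmhg":
            # Row-matching: hold only if the next request targets the same row.
            hold = queues[last_input][0] == last_row
        if hold:
            grant = last_input
        else:
            grant = next((rr_next + k) % n for k in range(n)
                         if queues[(rr_next + k) % n])
        row = queues[grant].popleft()
        delivered.append(row)
        last_input, last_row = grant, row
        rr_next = (grant + 1) % n
    return delivered

inputs = [["A", "A", "B"], ["C", "C", "D"]]  # hypothetical per-core streams
print(arbitrate(inputs, "rr"))    # ['A', 'C', 'A', 'C', 'B', 'D']
print(arbitrate(inputs, "hg"))    # ['A', 'A', 'B', 'C', 'C', 'D']
print(arbitrate(inputs, "rmhg"))  # ['A', 'A', 'C', 'C', 'B', 'D']
```

In this example both hold-grant variants keep requests to the same row adjacent, whereas round-robin interleaves them; RMHG releases the grant as soon as the held input moves on to a different row.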
Interconnect Arbitration Policy: Round-Robin
[Figure: requests to Rows A, B, C, X, and Y passing through a router to Memory Controller 0 and Memory Controller 1 under round-robin arbitration]
Interconnect Arbitration Policy: HG
[Figure: requests to Rows A, B, C, X, and Y passing through a router to Memory Controller 0 and Memory Controller 1 under Hold Grant (HG) arbitration]
Interconnect Arbitration Policy: RMHG
[Figure: requests to Rows A, B, C, X, and Y passing through a router to Memory Controller 0 and Memory Controller 1 under Row-Matching Hold Grant (RMHG) arbitration]
Complexity Comparison
Scheme               Complexity
FRFCFS               3584 bits compared
BFIFO+HG (XBAR)      224 bits stored and compared
BFIFO+RMHG (XBAR)    608 bits stored, 320 bits compared
BFIFO+HMHG4 (XBAR)   320 bits stored, 320 bits compared
For 32 entry queues: 15x reduction in bit comparisons, reduction from 32-way associative to direct mapped
Methodology: Microarchitecture Parameters
Shader cores                                           28
Threads per shader core                                1024
Maximum supported in-flight requests per shader core   64
Number of DRAM controllers                             8
DRAM controller scheduler                              FIFO, Banked FIFO, First-Ready First-Come First-Serve (FRFCFS)
GDDR3 memory timing                                    tCL=9, tRP=13, tRC=34, tRAS=21, tRCD=12, tRRD=8
Topologies swept                                       Crossbar, Mesh, Ring
Queue sizes swept                                      8, 16, 32, 64
Number of virtual channels swept                       1, 2, 4
Methodology: Simulator
GPGPU-Sim: A massively multithreaded architecture performance simulator (www.gpgpu-sim.org)
Supports NVIDIA’s Compute Unified Device Architecture (CUDA) framework
Simulates Parallel Thread Execution (PTX) instructions
Results – IPC Normalized to FR-FCFS
Crossbar network, 28 shader cores, 8 DRAM controllers, 8-entry DRAM queues:
BFIFO: 14% speedup over regular FIFO
BFIFO+HG: 18% speedup over BFIFO, achieving 91% of FRFCFS performance
[Figure: IPC normalized to FR-FCFS (0% to 100%) for FIFO, BFIFO, BFIFO+HG, BFIFO+HMHG4, and FR-FCFS on benchmarks fwt, lib, mum, neu, nn, ray, red, sp, wp, and HM]
Row Streak Breakers
[Figure: requests from Core 1 and Core 2 to Rows A, B, and C in the memory controller queue, oldest to youngest, with Row A open in DRAM; labels mark the “row streak”, the row streak breakers, and a stranded request]
Row Streak Breakers
[Figure: row streak breaker classification (0% to 100%) into same core and different core, for benchmarks fwt, lib, mum, neu, nn, ray, red, sp, wp; regions labeled “bad” and “good”]
B = banked FIFO; H = banked FIFO + Hold Grant
Arithmetic mean average reduction: 73%
Harmonic mean average reduction: 96%
Conclusion
Show request stream interleaving in interconnect
o Effect gets worse as the number of cores increases
First paper that considers problem of DRAM scheduling for tens of thousands of threads
o No prior work on memory scheduling for many-core
Integration of DRAM scheduling in interconnect, allowing for more complexity-effective design
o Should allow for faster clock speeds, power/area savings
Achieves 91% of performance of out-of-order scheduling with in-order scheduling for memory-limited applications
Future Work
Improve upon our memory scheduler design
Evaluate performance of graphics applications
Design a hold-grant scheme that works in conjunction with multiple virtual channel deadlock avoidance schemes for torus networks
Synthesize, lay out, and use SPICE to determine actual power/area overheads, cycle time
Thank you
Methodology: Microarchitecture Parameters
FR-FCFS vs FIFO
Need out-of-order scheduling inside DRAM controller to improve row access locality of requests to DRAM chips
FIFO vs FRFCFS: 46.8% Slowdown
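As a quick consistency check, the slowdown figure can be converted into a speedup. This assumes "slowdown" means FIFO delivers (1 - 0.468) of the FR-FCFS IPC; the slide does not spell out the definition, so treat the variable names and the conversion as an illustration.

```python
slowdown = 0.468                       # FIFO relative to FR-FCFS, from the slide
fifo_relative_ipc = 1.0 - slowdown     # assumed meaning: FIFO IPC / FR-FCFS IPC
frfcfs_speedup = 1.0 / fifo_relative_ipc

print(round(frfcfs_speedup, 2))  # 1.88
```

Under that assumption the two statements agree: a 46.8% slowdown for FIFO corresponds to roughly a 1.88x speedup for FR-FCFS, which matches "almost 2x".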
George Yuan (Supervisor: Dr. Tor Aamodt), University of British Columbia
Varying Topology
[Figure: speedup (0% to 100%) on crossbar (XBAR), mesh (MESH), and ring (RING) networks for benchmarks fwt, lib, mum, neu, nn, ray, red, sp, wp, and HM]
Ring networks require multiple virtual channels for deadlock avoidance
Multiple virtual channels = path diversity
Path diversity => requests arrive out of order = interleaving
Multiple Virtual Channels: Dynamic Virtual Channel Allocation
[Figure: requests to Rows B, A, A, and X travelling from source to destination through a router with virtual channels VC0 and VC1, with congestion on the path]
Multiple Virtual Channels: Static Virtual Channel Allocation
[Figure: requests to Rows B, A, A, and X travelling from source to destination through a router with virtual channels VC0 and VC1, with congestion on the path]
SVCA vs DVCA
Harmonic mean IPC for different virtual channel configurations
SVCA achieves up to 18.5% speedup over DVCA
Benchmarks
Abr.   Benchmark
FWT    Fast Walsh Transform
LIB    LIBOR Monte Carlo
MUM    MUMmerGPU
NEU    Neural Network Digit Recognition
NN     Nearest Neighbor
RAY    Ray Tracing
RED    Reduction
SP     Scalar Product
WP     Weather Prediction
Sensitivity Analysis
Varying DRAM Controller Queue Size
Varying Topology
More Results
Memory Latency: 33.9% reduction for HG and 35.3% reduction for HMHG4 compared to BFIFO
DRAM Efficiency: 15.1% improvement for HG and HMHG4 over BFIFO
Row Access Locality Reduction After Interconnect
44% for Crossbar, 48% for Mesh, 52% for Ring
DRAM Parameters