
Page 1: Multi-GPU MapReduce on GPU Clusters

Dongguk University

Jaegwang Lim

CSE7098-01

Page 2: Multi-GPU MapReduce on GPU Clusters

1. Intro

• GPMR (GPU MapReduce)

– Based on Google's MapReduce model

– Pronounced "G-Primer"

– Works on a stand-alone machine

– Uses multiple GPU devices

• Existing GPU MapReduce work targets only single GPUs

– No network I/O

Page 3: Multi-GPU MapReduce on GPU Clusters

2. Background - GPU

• What is a GPU?

• NVIDIA 10-Series architecture

– 240 thread processors execute kernel threads

– 30 multiprocessors, each containing 8 thread processors and shared memory for thread cooperation

Page 4: Multi-GPU MapReduce on GPU Clusters

2. Background - GPU

• “Local” memory resides in device DRAM

– Use registers and shared memory to minimize local-memory use

• The host can read and write global memory, but not shared memory
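
A minimal CUDA sketch (my example, not from the slides) of this boundary: the host moves data in and out of global memory with cudaMemcpy, while shared memory exists only inside a kernel and is never addressable from the host.

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

// Kernel: stages each element through shared memory, scales it, and
// writes it back to global memory. Shared memory is visible only to
// the threads of one block, never to the host.
__global__ void scale(float *data, float factor, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i];            // global -> shared
        data[i] = tile[threadIdx.x] * factor;   // shared -> global
    }
}

int main() {
    const int n = 1024;
    float host[1024];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    // The host writes global memory...
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<n / 256, 256>>>(dev, 2.0f, n);
    // ...and reads it back; it has no way to address shared memory.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("host[1] = %.1f\n", host[1]);        // prints 2.0
    return 0;
}
```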

Page 5: Multi-GPU MapReduce on GPU Clusters

2. Background - GPU

• A kernel launches as a grid of thread blocks

– Threads within a block cooperate via shared memory

– Threads within a block can synchronize

– Threads in different blocks cannot cooperate
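
A small sketch of block-level cooperation, assuming 256-thread blocks (my example, not the slides'): each block sums its inputs in shared memory, meeting at __syncthreads() barriers that span only that block.

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

// Each block cooperatively sums 256 inputs through shared memory.
__global__ void blockSum(const int *in, int *out) {
    __shared__ int buf[256];
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                        // wait for the whole block

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                    // block-wide barrier each step
    }
    // One partial sum per block; combining them needs a second kernel
    // or the host, because different blocks cannot cooperate.
    if (tid == 0) out[blockIdx.x] = buf[0];
}

int main() {
    const int blocks = 4, threads = 256, n = blocks * threads;
    int h_in[1024], h_out[4];
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    blockSum<<<blocks, threads>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d %d %d %d\n", h_out[0], h_out[1], h_out[2], h_out[3]); // 256 each
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```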

Page 6: Multi-GPU MapReduce on GPU Clusters

3. Implementation

• CPU to GPU

[Figure: the scheduler takes data arriving over the network and streams input chunks (Chunk0–Chunk3) across the PCIe bus to the GPU.]
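
A hedged sketch of that transfer pattern (names and chunk sizes are assumptions, not GPMR's API): the host pins a buffer and pushes each chunk over PCIe with an asynchronous copy in its own stream, so transfers can overlap with kernels working on earlier chunks.

```cpp
#include <cuda_runtime.h>

// Hypothetical sizes (assumptions, not GPMR's actual values).
const int CHUNK   = 1 << 20;   // elements per chunk
const int NCHUNKS = 4;         // Chunk0 .. Chunk3

int main() {
    float *host, *dev;
    // Pinned host memory so PCIe copies can run asynchronously.
    cudaMallocHost(&host, NCHUNKS * CHUNK * sizeof(float));
    cudaMalloc(&dev, NCHUNKS * CHUNK * sizeof(float));

    cudaStream_t streams[NCHUNKS];
    for (int c = 0; c < NCHUNKS; ++c) cudaStreamCreate(&streams[c]);

    for (int c = 0; c < NCHUNKS; ++c) {
        // Each chunk travels in its own stream, so its copy can overlap
        // with map kernels already running on earlier chunks (omitted).
        cudaMemcpyAsync(dev + c * CHUNK, host + c * CHUNK,
                        CHUNK * sizeof(float), cudaMemcpyHostToDevice,
                        streams[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < NCHUNKS; ++c) cudaStreamDestroy(streams[c]);
    cudaFreeHost(host);
    cudaFree(dev);
    return 0;
}
```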

Page 7: Multi-GPU MapReduce on GPU Clusters

3. Implementation

• Map

[Figure: Map. The scheduler places Chunk0 in global memory; thread blocks 0–2 each stage a sub-chunk (Chunk00–Chunk02) into their shared memory and run the map function there.]

Page 8: Multi-GPU MapReduce on GPU Clusters

3. Implementation

• Map

[Figure: Map with combiner. Same layout, with global memory (4 GB and up) holding Chunk0; each block's combiner folds its mapped key-value pairs before writing them to a bin in global memory.]
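
A hypothetical map-plus-combiner kernel (all names and the toy key function are my assumptions, not GPMR's code): each block stages its sub-chunk in shared memory, maps every element to a key-value pair, and combines duplicate keys locally before emitting one slot per key into its bin, which shrinks the intermediate data.

```cpp
#include <cuda_runtime.h>

#define NKEYS 16  // hypothetical key space for this toy example

// Map: each thread turns one element into a (key, 1) pair in shared
// memory. Combine: the block folds duplicate keys before writing one
// slot per key into this block's bin in global memory.
__global__ void mapCombine(const int *chunk, int *binKeys, int *binVals, int n) {
    __shared__ int keys[256];
    __shared__ int vals[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Map step: the key function (element mod NKEYS) is an assumption.
    keys[tid] = (i < n) ? chunk[i] % NKEYS : -1;
    vals[tid] = (i < n) ? 1 : 0;
    __syncthreads();

    // Combine step: serial for clarity; a real combiner would reduce in
    // parallel. It cuts 256 pairs down to NKEYS per block.
    if (tid == 0) {
        int local[NKEYS] = {0};
        for (int t = 0; t < blockDim.x; ++t)
            if (keys[t] >= 0) local[keys[t]] += vals[t];
        for (int k = 0; k < NKEYS; ++k) {
            binKeys[blockIdx.x * NKEYS + k] = k;
            binVals[blockIdx.x * NKEYS + k] = local[k];
        }
    }
}

int main() {
    const int blocks = 2, threads = 256, n = blocks * threads;
    int h_chunk[512];
    for (int i = 0; i < n; ++i) h_chunk[i] = i;

    int *d_chunk, *d_keys, *d_vals;
    cudaMalloc(&d_chunk, n * sizeof(int));
    cudaMalloc(&d_keys, blocks * NKEYS * sizeof(int));
    cudaMalloc(&d_vals, blocks * NKEYS * sizeof(int));
    cudaMemcpy(d_chunk, h_chunk, n * sizeof(int), cudaMemcpyHostToDevice);
    mapCombine<<<blocks, threads>>>(d_chunk, d_keys, d_vals, n);
    cudaDeviceSynchronize();
    cudaFree(d_chunk); cudaFree(d_keys); cudaFree(d_vals);
    return 0;
}
```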

Page 9: Multi-GPU MapReduce on GPU Clusters

3. Implementation

• Reduce

[Figure: Reduce. Binned key-value pairs in global memory are sorted; the scheduler streams the sorted runs through the blocks' shared memory to the reducer, which writes the final output.]
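
A minimal sketch of the sort-then-reduce idea using Thrust (a library choice assumed for illustration; the slides don't name one): sorting groups equal keys together so a single reduce pass yields one value per key.

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main() {
    // Toy binned pairs: keys with duplicates, each carrying a count of 1.
    int h_keys[] = {3, 1, 3, 2, 1, 3};
    int h_vals[] = {1, 1, 1, 1, 1, 1};
    thrust::device_vector<int> keys(h_keys, h_keys + 6);
    thrust::device_vector<int> vals(h_vals, h_vals + 6);

    // Sort pairs by key so equal keys become contiguous:
    // keys -> 1 1 2 3 3 3, vals follow their keys.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // Reduce: one output pair per distinct key.
    // outKeys -> {1, 2, 3}, outVals -> {2, 1, 3}
    thrust::device_vector<int> outKeys(3), outVals(3);
    thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                          outKeys.begin(), outVals.begin());
    return 0;
}
```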

Page 10: Multi-GPU MapReduce on GPU Clusters

3. Implementation

• Overall Local

[Figure: Overall, local view. Inside one node, a CPU-side scheduler & bin feed GPU 0 through GPU 3 (and more), each running Map & Reduce; the node connects to the rest of the cluster over the network.]
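
A hedged sketch of the local scheduler (assumed structure, not GPMR's code): one host thread per GPU selects its device with cudaSetDevice and drives that GPU's Map & Reduce over its share of the chunks.

```cpp
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Worker for one GPU: selects the device, then runs Map & Reduce over
// the chunks assigned to it (kernel launches omitted; see the Map and
// Reduce sketches above).
void runMapReduce(int gpu, int nGpus, int nChunks) {
    cudaSetDevice(gpu);   // CUDA calls on this thread now target this GPU
    for (int c = gpu; c < nChunks; c += nGpus) {
        // copy chunk c in, map, combine, bin, reduce ...
    }
    cudaDeviceSynchronize();
}

int main() {
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);   // e.g., 4 GPUs in one node
    std::vector<std::thread> workers;
    for (int g = 0; g < nGpus; ++g)
        workers.emplace_back(runMapReduce, g, nGpus, 16);
    for (auto &w : workers) w.join();   // scheduler waits for every GPU
    return 0;
}
```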

Page 11: Multi-GPU MapReduce on GPU Clusters

3. Implementation

• Overall Global

[Figure: Overall, global view. Four identical nodes, each with a CPU-side scheduler & bin driving GPU 0 through GPU 3 (and more) running Map & Reduce, exchange intermediate key-value pairs over the network.]
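
A sketch of the cross-node exchange under the assumption that nodes communicate with MPI (the slides only show a network link): after the local map and combine, an all-to-all exchange routes every key's pairs to the node that will reduce them.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // One contribution per destination node, produced by the local bins
    // (dummy values here; real code would pack key-value pairs per node).
    std::vector<int> sendBuf(size, rank), recvBuf(size);
    MPI_Alltoall(sendBuf.data(), 1, MPI_INT,
                 recvBuf.data(), 1, MPI_INT, MPI_COMM_WORLD);

    // recvBuf now holds one piece from every node; the local GPUs sort
    // and reduce these as in the Reduce sketch above.
    MPI_Finalize();
    return 0;
}
```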

Page 12: Multi-GPU MapReduce on GPU Clusters

4. Benchmark Result

Page 13: Multi-GPU MapReduce on GPU Clusters

4. Benchmark Result

Page 14: Multi-GPU MapReduce on GPU Clusters

4. Benchmark Result

• GPMR vs. Phoenix (a shared-memory CPU MapReduce)

• GPMR vs. Mars (a single-GPU MapReduce)

Page 15: Multi-GPU MapReduce on GPU Clusters

5. Conclusion

• High performance

• New capability

• New scalability

• Limitations

– Low GPU memory (512 MB)

– GPUs cannot perform network I/O directly; data must be staged through the CPU