daniel cederman and philippas tsigas distributed computing and systems chalmers university of...

A Practical Quicksort Algorithmfor Graphics Processors

Daniel Cederman and Philippas Tsigas

Distributed Computing and SystemsChalmers University of Technology

System Model The Algorithm Experiments

Overview

5 8 3 2

System Model

5 8 3 2

CPU vs. GPU

Normal processors Graphics processors

Multi-core Large cache Few threads Speculative execution

Many-core Small cache Wide and fast memory

bus Thousands of threads

hides memory latency

Designed for general purpose computation Extensions to C/C++

System Model – Global Memory

Global Memory

System Model – Shared Memory

Global Memory

Multiprocessor 0 Multiprocessor 1

SharedMemory

System Model – Thread Blocks

Global Memory

Multiprocessor 0 Multiprocessor 1

SharedMemory

Thread Block 0

Thread Block 1

Thread Block x

Thread Block y

Minimal, manual cache 32-word SIMD instruction Coalesced memory access No block synchronization Expensive synchronization primitives

Challenges

The Algorithm

5 8 3 2

Quicksort

5 8 3 2

Quicksort

5 8 3 2

Data Parallel!

Quicksort

5 8 3 2

NOTData

Parallel!

Quicksort

5 8 3 2

Phase One• Sequences divided•Block synchronization

on the CPU

Thread Block 1 Thread Block 2

Quicksort

5 8 3 2

Phase Two• Each block is

assigned own sequence

• Run entirely on GPU

Thread Block 1 Thread Block 2

Quicksort – Phase One

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Assign Thread Blocks

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Uses an auxiliary array In-place too expensive

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Fetch-And-Addon

LeftPointer

Quicksort – Synchronization2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

LeftPointer = 0

Thread Block ONE

Thread BlockTWO

Fetch-And-Addon

LeftPointer

LeftPointer = 2

Thread Block ONE

Thread BlockTWO

Fetch-And-Addon

LeftPointer

LeftPointer = 4

Thread Block ONE

Thread BlockTWO

3 2 2 3

LeftPointer = 10

3 2 2 3 3 0 2 3 97 7 8 78 5 7 6 9 80 2

Three way partition

3 2 2 3 3 0 2 3 94 7 7 8 78 5 7 6 9 84 40 2

Two Pass Partitioning

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Recurse on smallest sequence◦ Max depth log(n)

Explicit stack Alternative sorting

◦ Bitonic sort

Second Phase

The Algorithm

5 8 3 2

GPU-Quicksort Cederman and Tsigas, ESA08 GPUSort Govindaraju et. al.,

SIGMOD05 Radix-Merge Harris et. al., GPU Gems 3 ’07 Global radix Sengupta et. al., GH07 Hybrid Sintorn and Assarsson,

GPGPU07

Introsort David Musser, Software: Practice and Experience

Set of Algorithms

Uniform Gaussian Bucket Sorted Zero Stanford

Models

Distributions

Models

Distributions

Models

Distributions

Models

Distributions

Models

Distributions

Models

Distributions

First Phase◦ Average of maximum and minimum element

Second Phase◦ Median of first, middle and last element

Pivot selection

8800GTX◦ 16 multiprocessors (128 cores)◦ 86.4 GB/s bandwidth

8600GTS◦ 4 multiprocessors (32 cores)◦ 32 GB/s bandwidth

CPU-Reference◦ AMD Dual-Core Opteron 265 / 1.8 GHz

Hardware

8800GTX – Uniform Distribution

1M 2M 4M 8M 16M0

GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTL

8800GTX – Uniform Distribution

1M 2M 4M 8M 16M0

8800GTX – Uniform – 16M

GPU-Quicksort

Global Radix GPUSort Radix-Merge STL0

GPU-Quicksort

CPU-Reference~4s

GPU-Quicksort

GPUSort andRadix-Merge

GPU-Quicksort

GPU-Quicksort~380ms

Global Radix~510ms

8800GTX – Sorted Distribution

1M 2M 4M 8M0

8800GTX – Sorted Distribution

1M 2M 4M 8M0

8800GTX – Sorted – 8M

GPU-Quicksort

Reference is faster than

GPUSort and Radix-Merge

GPU-Quicksort

GPU-Quicksorttakes less than

8800GTX – Zero Distribution

1M 2M 4M 8M 16M0

8800GTX – Zero Distribution

1M 2M 4M 8M 16M0

8800GTX – Zero – 16M

GPU-Quicksort

Bitonic is unaffected

by distribution

GPU-Quicksort

Three-way partition

8600GTS – Uniform Distribution

1M 2M 4M 8M0

GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTLHybrid

8600GTS – Uniform Distribution

1M 2M 4M 8M0

GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTLHybrid

8600GTS – Uniform – 8M

GPU-Quicksort

Global Radix GPUSort Radix-Merge STL Hybrid0

GPU-Quicksort

Reference equal to

Radix-Merge

GPU-Quicksort

Quicksort and hybrid ~4 times

faster

8600GTS – Sorted Distribution

1M 2M 4M 8M0

GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTLHybridHybrid (Random)

8600GTS – Sorted Distribution

1M 2M 4M 8M0

GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTLHybridHybrid (Random)

8600GTS – Sorted – 8M

GPU-Quicksort

Global Radix GPUSort Radix-Merge STL Hybrid Hybrid (Random)

GPU-Quicksort

Hybrid needs randomization

GPU-Quicksort

Reference better

Manuscript (2.2M) Dragon (3.6M) Statuette (5M)0

100200300400500600700800900

GPU-Quicksort Global Radix GPUSortRadix/Merge STL

s)Visibility Ordering – 8800GTX

100200300400500600700800900

Similar to uniform

distribution

100200300400500600700800900

Sequence needs to be a power of

two in length

Minimal, manual cache◦ Used only for prefix sum and bitonic

32-word SIMD instruction◦ Main part executes same instructions

Coalesced memory access◦ All reads coalesced

No block synchronization◦ Only required in first phase

Expensive synchronization primitives◦ Two passes amortizes cost

Conclusions

Quicksort is a viable sorting method for graphics processors and can be implemented in a data parallel way

It is competitive

Conclusions

Thank you!

www.cs.chalmers.se/~cederman

daniel cederman and philippas tsigas distributed computing and systems chalmers university of...

sharedmemorysharedmemory

cc slide

experience slide

global memory slide

gpu thread block

memory latency slide

cpu thread block

assign thread blocks

Documents

(c) ph. tsigas 2003-2004© ph. tsigas 2003- 2004 algorithm...

distributed computing and systems philippas tsigas...

a consistency framework for iteration operations in...

concurrent data structures in architectures with limited...

dynamic load balancing using work-stealing 35 -...

fast and lock-free concurrent priority queues for...

by thomas w. hertel and marinos e. tsigas i. … · by...

by thomas w. hertel and marinos e. tsigas i. …...

1 konfliktforschung i: kriegsursachen im historischen...

understanding performance of concurrent data structures on...

distributed systems ii a polynomial local solution to mutual...

reactive multi-word synchronization for multiprocessors ·...

programming multi-core and many …tsigas/papers/lock-free...

parmarksplit: a parallel mark- split garbage collector based...

parallel programming philippas tsigas chalmers university of...

cederman, wimmer, and min, final - columbia...

distributed systems ii agreement - commit (2-3 phase commit)...

towards a software transactional memory for graphics...

on dynamic load balancing on graphics processors daniel...

distributed systems ii replication cnt. prof philippas...