A Practical Quicksort Algorithm for Graphics Processors Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology


Page 1: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology

A Practical Quicksort Algorithm for Graphics Processors

Daniel Cederman and Philippas Tsigas

Distributed Computing and Systems, Chalmers University of Technology

Page 2:

Overview

• System Model
• The Algorithm
• Experiments

Page 3:

System Model

Page 4:

CPU vs. GPU

Normal processors:
• Multi-core
• Large cache
• Few threads
• Speculative execution

Graphics processors:
• Many-core
• Small cache
• Wide and fast memory bus
• Thousands of threads (hides memory latency)

Page 5:

CUDA

• Designed for general purpose computation
• Extensions to C/C++

Page 6:

System Model – Global Memory

Global Memory

Page 7:

System Model – Shared Memory

[diagram: Global Memory shared by Multiprocessor 0 and Multiprocessor 1, each with its own Shared Memory]

Page 8:

System Model – Thread Blocks

[diagram: as above, with Thread Blocks 0, 1, …, x, y scheduled onto the multiprocessors]

Page 9:

Challenges

• Minimal, manual cache
• 32-word SIMD instruction
• Coalesced memory access
• No block synchronization
• Expensive synchronization primitives

Page 10:

The Algorithm

Page 11:

Quicksort

[diagram: 5 8 3 2 is partitioned into 3 2 and 5 8, which are then sorted to 2 3 and 5 8]

Page 12:

Quicksort

[diagram: partitioning a sequence, e.g. 5 8 3 2 into 3 2 and 5 8 — Data Parallel!]

Page 13:

Quicksort

[diagram: the recursion over separate sequences — NOT Data Parallel!]

Page 14:

Quicksort

Phase One
• Sequences divided among thread blocks (Thread Block 1, Thread Block 2)
• Block synchronization on the CPU

Page 15:

Quicksort

Phase Two
• Each block is assigned its own sequence
• Runs entirely on the GPU

Page 16:

Quicksort – Phase One

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Page 17:

Quicksort – Phase One

Assign Thread Blocks

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Page 18:

Quicksort – Phase One

• Uses an auxiliary array
• In-place too expensive

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Page 19:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

LeftPointer = 0

Thread Block ONE and Thread Block TWO each perform a Fetch-And-Add on LeftPointer to claim a unique output slot in the auxiliary array.

Page 20:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

LeftPointer = 2

Fetch-And-Add on LeftPointer claims the next slots; auxiliary array so far: 3 2

Page 21:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

LeftPointer = 4

Fetch-And-Add on LeftPointer claims the next slots; auxiliary array so far: 3 2 2 3

Page 22:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

LeftPointer = 10

3 2 2 3 3 0 2 3 0 2 | 9 7 7 8 7 8 5 7 6 9 8

Page 23:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Three-way partition:

3 2 2 3 3 0 2 3 0 2 | 4 4 4 | 9 7 7 8 7 8 5 7 6 9 8

Page 24:

Two Pass Partitioning

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
< < < > = > < < < > > = > > < > < > > < = < > >

(each element is classified as <, = or > the pivot in a first counting pass, then written to its final position in a second pass)

Page 25:

Second Phase

• Recurse on the smallest sequence first ◦ max depth log(n)
• Explicit stack
• Alternative sorting for short sequences ◦ Bitonic sort

Page 26:

Experiments

Page 27:

Set of Algorithms

• GPU-Quicksort – Cederman and Tsigas, ESA '08
• GPUSort – Govindaraju et al., SIGMOD '05
• Radix-Merge – Harris et al., GPU Gems 3, '07
• Global radix – Sengupta et al., GH '07
• Hybrid – Sintorn and Assarsson, GPGPU '07
• Introsort – David Musser, Software: Practice and Experience

Page 28:

Distributions

• Uniform
• Gaussian
• Bucket
• Sorted
• Zero
• Stanford Models


Page 34:

Pivot selection

• First Phase: average of maximum and minimum element
• Second Phase: median of first, middle and last element

Page 35:

Hardware

• 8800GTX: 16 multiprocessors (128 cores), 86.4 GB/s bandwidth
• 8600GTS: 4 multiprocessors (32 cores), 32 GB/s bandwidth
• CPU reference: AMD Dual-Core Opteron 265, 1.8 GHz

Page 36:

8800GTX – Uniform Distribution

[line chart: Time (ms), 0–4500, vs. size 1M–16M; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]


Page 38:

8800GTX – Uniform – 16M

[bar chart: Time (ms), 0–4500, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]

Page 39:

8800GTX – Uniform – 16M

CPU-Reference: ~4 s

Page 40:

8800GTX – Uniform – 16M

GPUSort and Radix-Merge: ~1.5 s

Page 41:

8800GTX – Uniform – 16M

GPU-Quicksort: ~380 ms; Global Radix: ~510 ms

Page 42:

8800GTX – Sorted Distribution

[line chart: Time (ms), 0–800, vs. size 1M–8M; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]


Page 44:

8800GTX – Sorted – 8M

[bar chart: Time (ms), 0–800, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]

Page 45:

8800GTX – Sorted – 8M

Reference is faster than GPUSort and Radix-Merge

Page 46:

8800GTX – Sorted – 8M

GPU-Quicksort takes less than 200 ms

Page 47:

8800GTX – Zero Distribution

[line chart: Time (ms), 0–1800, vs. size 1M–16M; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]


Page 49:

8800GTX – Zero – 16M

[bar chart: Time (ms), 0–1800, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]

Page 50:

8800GTX – Zero – 16M

Bitonic is unaffected by distribution

Page 51:

8800GTX – Zero – 16M

Three-way partition

Page 52:

8600GTS – Uniform Distribution

[line chart: Time (ms), 0–2500, vs. size 1M–8M; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid]


Page 54:

8600GTS – Uniform – 8M

[bar chart: Time (ms), 0–2500, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid]

Page 55:

8600GTS – Uniform – 8M

Reference equal to Radix-Merge

Page 56:

8600GTS – Uniform – 8M

Quicksort and Hybrid ~4 times faster

Page 57:

8600GTS – Sorted Distribution

[line chart: Time (ms), 0–4500, vs. size 1M–8M; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid, Hybrid (Random)]


Page 59:

8600GTS – Sorted – 8M

[bar chart: Time (ms), 0–4500, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid, Hybrid (Random)]

Page 60:

8600GTS – Sorted – 8M

Hybrid needs randomization

Page 61:

8600GTS – Sorted – 8M

Reference is better here

Page 62:

Visibility Ordering – 8800GTX

[bar chart: Time (ms), 0–1000, for Manuscript (2.2M), Dragon (3.6M), Statuette (5M); series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]

Page 63:

Visibility Ordering – 8800GTX

Similar to uniform distribution

Page 64:

Visibility Ordering – 8800GTX

Sequence needs to be a power of two in length

Page 65:

Conclusions

• Minimal, manual cache ◦ used only for prefix sum and bitonic sort
• 32-word SIMD instruction ◦ main part executes the same instructions
• Coalesced memory access ◦ all reads are coalesced
• No block synchronization ◦ only required in the first phase
• Expensive synchronization primitives ◦ two passes amortize the cost

Page 66:

Conclusions

• Quicksort is a viable sorting method for graphics processors and can be implemented in a data parallel way
• It is competitive with existing GPU sorting algorithms

Page 67:

Thank you!

www.cs.chalmers.se/~cederman