A Practical Quicksort Algorithm for Graphics Processors

Daniel Cederman and Philippas Tsigas

Distributed Computing and Systems
Chalmers University of Technology

Overview
• System Model
• The Algorithm
• Experiments

[Figure: quicksort example with the rows 5 8 3 2; 3 2 and 5 8; 2 3 and 5 8]

System Model

CPU vs. GPU

Normal processors
• Multi-core
• Large cache
• Few threads
• Speculative execution

Graphics processors
• Many-core
• Small cache
• Wide and fast memory bus
• Thousands of threads
  ◦ Hides memory latency

CUDA
• Designed for general purpose computation
• Extensions to C/C++

System Model – Global Memory

[Diagram: Global Memory]

System Model – Shared Memory

[Diagram: Global Memory; Multiprocessor 0 and Multiprocessor 1, each with its own Shared Memory]

System Model – Thread Blocks

[Diagram: Global Memory; Multiprocessor 0 and Multiprocessor 1, each with its own Shared Memory; Thread Block 0, Thread Block 1, Thread Block x, Thread Block y]

Challenges
• Minimal, manual cache
• 32-word SIMD instruction
• Coalesced memory access
• No block synchronization
• Expensive synchronization primitives

The Algorithm

Quicksort

[Figure: the quicksort example (5 8 3 2; 3 2 and 5 8; 2 3 and 5 8), with one part annotated "Data Parallel!" and another annotated "NOT Data Parallel!"]

Quicksort

Phase One
• Sequences divided
• Block synchronization on the CPU

Phase Two
• Each block is assigned its own sequence
• Run entirely on GPU

[Figure: the quicksort example with Thread Block 1 and Thread Block 2]

Quicksort – Phase One

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

• Assign Thread Blocks
• Uses an auxiliary array
  ◦ In-place too expensive

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

• Thread Block ONE and Thread Block TWO each use Fetch-And-Add on LeftPointer
• LeftPointer = 0
• LeftPointer = 2: auxiliary array starts with 3 2
• LeftPointer = 4: auxiliary array starts with 3 2 2 3
• LeftPointer = 10: the ten elements smaller than the pivot (4) are on the left, the eleven larger elements on the right
• Three way partition: the three positions left in between are filled with the pivot value 4
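The synchronization steps above can be sketched sequentially in Python. This is only an illustration under stated assumptions: `fetch_and_add` and `partition_with_blocks` are hypothetical names, the blocks run one after another instead of concurrently, a matching right-hand pointer is assumed alongside LeftPointer, and the order of elements inside each side will differ from the slide.

```python
def fetch_and_add(counter, amount):
    """Sequential stand-in for an atomic fetch-and-add: return the old value, then add."""
    old = counter[0]
    counter[0] += amount
    return old


def partition_with_blocks(data, pivot, num_blocks):
    """Partition `data` around `pivot` into a new auxiliary list, one chunk per block."""
    n = len(data)
    aux = [None] * n
    left = [0]   # next free slot on the left side (grows upwards)
    right = [n]  # one past the last free slot on the right side (grows downwards)
    chunk = (n + num_blocks - 1) // num_blocks

    for block in range(num_blocks):  # on a GPU these iterations would run concurrently
        part = data[block * chunk:(block + 1) * chunk]
        smaller = [x for x in part if x < pivot]
        larger = [x for x in part if x > pivot]

        # One fetch-and-add per side reserves a private range for this block.
        lo = fetch_and_add(left, len(smaller))
        hi = fetch_and_add(right, -len(larger))
        aux[lo:lo + len(smaller)] = smaller
        aux[hi - len(larger):hi] = larger

    # Three-way partition: the gap between the two pointers is filled with the pivot.
    for i in range(left[0], right[0]):
        aux[i] = pivot
    return aux, left[0], right[0]


data = [2, 2, 0, 7, 4, 9, 3, 3, 3, 7, 8, 4, 8, 7, 2, 5, 0, 7, 6, 3, 4, 2, 9, 8]
aux, left, right = partition_with_blocks(data, pivot=4, num_blocks=2)
print(left, right)
print(aux)
```

Because each block reserves its range with a single fetch-and-add per side, blocks never write to overlapping positions, which is why no further coordination is needed while they copy their elements.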

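Two Pass Partitioning

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
< < < > = > < < < > > = > > < > < > > < = < > >

[Figure: each element is marked <, = or > relative to the pivot 4; the three = positions are shown as a separate group]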
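A minimal sketch of the two-pass idea for a single thread block, assuming the first pass only counts and the second pass writes. The names `two_pass_partition` and `exclusive_prefix_sum` are hypothetical, the threads are simulated by a loop, and the strided assignment of elements to threads is an assumption chosen to mimic coalesced reads.

```python
from itertools import accumulate


def exclusive_prefix_sum(counts):
    """[c0, c1, c2] -> [0, c0, c0 + c1]"""
    return [0] + list(accumulate(counts))[:-1]


def two_pass_partition(block, pivot, num_threads):
    """Partition one block's elements with a counting pass and a writing pass."""
    # Thread t handles elements t, t + num_threads, t + 2 * num_threads, ...
    slices = [block[t::num_threads] for t in range(num_threads)]

    # Pass 1: every thread only counts.
    less = [sum(1 for x in s if x < pivot) for s in slices]
    greater = [sum(1 for x in s if x > pivot) for s in slices]

    # Prefix sums turn the counts into a private write offset per thread.
    less_off = exclusive_prefix_sum(less)
    greater_off = exclusive_prefix_sum(greater)

    # Pass 2: every thread re-reads its elements and writes them to its own range.
    smaller_out = [None] * sum(less)
    larger_out = [None] * sum(greater)
    for t, s in enumerate(slices):
        i, j = less_off[t], greater_off[t]
        for x in s:
            if x < pivot:
                smaller_out[i] = x
                i += 1
            elif x > pivot:
                larger_out[j] = x
                j += 1
    return smaller_out, larger_out


data = [2, 2, 0, 7, 4, 9, 3, 3, 3, 7, 8, 4, 8, 7, 2, 5, 0, 7, 6, 3, 4, 2, 9, 8]
smaller, larger = two_pass_partition(data, pivot=4, num_threads=4)
print(len(smaller), len(larger))
```

Since the block totals are known after the first pass, the whole block needs only one reservation per side in the shared output, instead of one per element.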

Second Phase
• Recurse on smallest sequence
  ◦ Max depth log(n)
• Explicit stack
• Alternative sorting
  ◦ Bitonic sort
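The second-phase bullets can be illustrated with a small sequential sketch. The function name, the `small` threshold and the use of Python's built-in `sorted` as a stand-in for the alternative sort are assumptions made for illustration; the sketch only shows why continuing with the smaller part keeps the explicit stack logarithmic.

```python
def quicksort_explicit_stack(values, small=4):
    """Sort a list in place with an explicit stack; return the deepest nesting seen."""
    stack = [(0, len(values))]  # half-open ranges [lo, hi) still to be sorted
    max_depth = 1
    while stack:
        lo, hi = stack.pop()
        while hi - lo > small:
            # Median of first, middle and last element as pivot.
            pivot = sorted((values[lo], values[(lo + hi) // 2], values[hi - 1]))[1]
            part = values[lo:hi]
            smaller = [x for x in part if x < pivot]
            larger = [x for x in part if x > pivot]
            equal = len(part) - len(smaller) - len(larger)
            values[lo:hi] = smaller + [pivot] * equal + larger
            left = (lo, lo + len(smaller))
            right = (hi - len(larger), hi)
            # Continue with the smaller side and defer the larger one.
            if left[1] - left[0] <= right[1] - right[0]:
                stack.append(right)
                lo, hi = left
            else:
                stack.append(left)
                lo, hi = right
            max_depth = max(max_depth, len(stack) + 1)
        # Short ranges go to the alternative sort (stand-in: built-in sort).
        values[lo:hi] = sorted(values[lo:hi])
    return max_depth


data = [2, 2, 0, 7, 4, 9, 3, 3, 3, 7, 8, 4, 8, 7, 2, 5, 0, 7, 6, 3, 4, 2, 9, 8]
depth = quicksort_explicit_stack(data)
print(data, depth)
```

The range that is processed next is at most half the size of the range it came from, so the number of deferred ranges cannot exceed roughly log2(n).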

Experiments

Set of Algorithms
• GPU-Quicksort: Cederman and Tsigas, ESA08
• GPUSort: Govindaraju et al., SIGMOD05
• Radix-Merge: Harris et al., GPU Gems 3 '07
• Global radix: Sengupta et al., GH07
• Hybrid: Sintorn and Assarsson, GPGPU07
• Introsort: David Musser, Software: Practice and Experience

Distributions
• Uniform
• Gaussian
• Bucket
• Sorted
• Zero
• Stanford Models

[Figure: a point P and the points x1, x2, x3, x4]
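A rough sketch of how some of the synthetic inputs listed above might be generated. The slides do not define the distributions, so every definition here is an assumption: "gaussian" is approximated by averaging four uniform values, "zero" is taken to mean one constant value repeated, and the bucket distribution and Stanford models are left out. `make_input` is a hypothetical helper.

```python
import random


def make_input(kind, n, seed=0, max_value=2**31):
    """Return a list of n integers for one of the assumed distributions."""
    rng = random.Random(seed)
    if kind == "uniform":
        return [rng.randrange(max_value) for _ in range(n)]
    if kind == "gaussian":
        # Assumption: bell-shaped values from the average of four uniform draws.
        return [sum(rng.randrange(max_value) for _ in range(4)) // 4 for _ in range(n)]
    if kind == "sorted":
        return sorted(rng.randrange(max_value) for _ in range(n))
    if kind == "zero":
        # Assumption: a single constant value, repeated n times.
        return [rng.randrange(max_value)] * n
    raise ValueError(f"unknown distribution: {kind}")


for kind in ("uniform", "gaussian", "sorted", "zero"):
    sample = make_input(kind, 8)
    print(kind, sample[:3])
```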

Pivot selection
• First Phase
  ◦ Average of maximum and minimum element
• Second Phase
  ◦ Median of first, middle and last element
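The two pivot rules can be written down directly. This is a minimal sketch; the function names and the use of integer division for the average are assumptions.

```python
def first_phase_pivot(seq):
    """Average of the maximum and minimum element (integer division assumed)."""
    return (min(seq) + max(seq)) // 2


def second_phase_pivot(seq):
    """Median of the first, middle and last element."""
    return sorted((seq[0], seq[len(seq) // 2], seq[-1]))[1]


data = [2, 2, 0, 7, 4, 9, 3, 3, 3, 7, 8, 4, 8, 7, 2, 5, 0, 7, 6, 3, 4, 2, 9, 8]
print(first_phase_pivot(data))   # (0 + 9) // 2
print(second_phase_pivot(data))  # median of 2, 8 and 8
```

Note that the first-phase pivot is computed from the value range and need not be an element of the sequence, while the second-phase pivot always is.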

Hardware
• 8800GTX
  ◦ 16 multiprocessors (128 cores)
  ◦ 86.4 GB/s bandwidth
• 8600GTS
  ◦ 4 multiprocessors (32 cores)
  ◦ 32 GB/s bandwidth
• CPU-Reference
  ◦ AMD Dual-Core Opteron 265 / 1.8 GHz

8800GTX – Uniform Distribution

[Chart: time (ms), 0 to 4500, against size (1M, 2M, 4M, 8M, 16M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]

8800GTX – Uniform – 16M

[Chart: time (ms), 0 to 4500, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]

• CPU-Reference ~4s
• GPUSort and Radix-Merge ~1.5s
• GPU-Quicksort ~380ms
• Global Radix ~510ms

8800GTX – Sorted Distribution

[Chart: time (ms), 0 to 800, against size (1M, 2M, 4M, 8M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]

8800GTX – Sorted – 8M

[Chart: time (ms), 0 to 800, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]

• Reference is faster than GPUSort and Radix-Merge
• GPU-Quicksort takes less than 200ms

8800GTX – Zero Distribution

[Chart: time (ms), 0 to 1800, against size (1M, 2M, 4M, 8M, 16M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]

8800GTX – Zero – 16M

[Chart: time (ms), 0 to 1800, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]

• Bitonic is unaffected by distribution
• Three-way partition

8600GTS – Uniform Distribution

[Chart: time (ms), 0 to 2500, against size (1M, 2M, 4M, 8M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL and Hybrid]

8600GTS – Uniform – 8M

[Chart: time (ms), 0 to 2500, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL and Hybrid]

• Reference equal to Radix-Merge
• Quicksort and hybrid ~4 times faster

8600GTS – Sorted Distribution

[Chart: time (ms), 0 to 4500, against size (1M, 2M, 4M, 8M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid and Hybrid (Random)]

8600GTS – Sorted – 8M

[Chart: time (ms), 0 to 4500, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid and Hybrid (Random)]

• Hybrid needs randomization
• Reference better

Visibility Ordering – 8800GTX

[Chart: time (ms), 0 to 1000, for the models Manuscript (2.2M), Dragon (3.6M) and Statuette (5M), comparing GPU-Quicksort, Global Radix, GPUSort, Radix/Merge and STL]

• Similar to uniform distribution
• Sequence needs to be a power of two in length
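A power-of-two length requirement is a familiar property of bitonic sorting networks. The sketch below assumes the remark refers to a bitonic-based sorter and shows a plain sequential bitonic sort; `bitonic_sort` and `sort_any_length` are hypothetical names, and the padding workaround is an illustration that does not come from the slides.

```python
def bitonic_sort(values):
    """Return a sorted copy; the length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("bitonic sort needs a power-of-two length")
    a = list(values)
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def sort_any_length(values):
    """Assumed workaround: pad with +infinity up to the next power of two."""
    n = len(values)
    if n == 0:
        return []
    size = 1 << (n - 1).bit_length()
    padded = list(values) + [float("inf")] * (size - n)
    return bitonic_sort(padded)[:n]


print(bitonic_sort([5, 8, 3, 2]))
print(sort_any_length([5, 8, 3, 2, 7]))
```

The sequence of comparisons depends only on the length and never on the values, which fits the earlier observation that bitonic sorting is unaffected by the input distribution.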

Conclusions
• Minimal, manual cache
  ◦ Used only for prefix sum and bitonic
• 32-word SIMD instruction
  ◦ Main part executes same instructions
• Coalesced memory access
  ◦ All reads coalesced
• No block synchronization
  ◦ Only required in first phase
• Expensive synchronization primitives
  ◦ Two passes amortize cost

Conclusions
• Quicksort is a viable sorting method for graphics processors and can be implemented in a data parallel way
• It is competitive

Thank you!

www.cs.chalmers.se/~cederman
