A Practical Quicksort Algorithm for Graphics Processors Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology


Page 1: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology

A Practical Quicksort Algorithm for Graphics Processors

Daniel Cederman and Philippas Tsigas

Distributed Computing and Systems, Chalmers University of Technology

Page 2:

Overview

• System Model
• The Algorithm
• Experiments

Page 3:

System Model

Page 4:

CPU vs. GPU

Normal processors:
• Multi-core
• Large cache
• Few threads
• Speculative execution

Graphics processors:
• Many-core
• Small cache
• Wide and fast memory bus
• Thousands of threads (hides memory latency)

Page 5:

CUDA

• Designed for general purpose computation
• Extensions to C/C++

Page 6:

System Model – Global Memory

Global Memory

Page 7:

System Model – Shared Memory

[diagram: Global Memory shared by Multiprocessor 0 and Multiprocessor 1, each with its own Shared Memory]

Page 8:

System Model – Thread Blocks

[diagram: as above, with Thread Blocks 0, 1, …, x, y scheduled onto the multiprocessors]

Page 9:

Challenges

• Minimal, manual cache
• 32-word SIMD instruction
• Coalesced memory access
• No block synchronization
• Expensive synchronization primitives

Page 10:

The Algorithm

Page 11:

Quicksort

[diagram: 5 8 3 2 is partitioned into 3 2 and 5 8, which are then sorted to 2 3 and 5 8]

Page 12:

Quicksort

[diagram: partitioning a sequence, e.g. 5 8 3 2 into 3 2 and 5 8 — Data Parallel!]

Page 13:

Quicksort

[diagram: the recursion over separate sequences — NOT Data Parallel!]

Page 14:

Quicksort

Phase One
• Sequences divided among thread blocks (Thread Block 1, Thread Block 2)
• Block synchronization on the CPU

Page 15:

Quicksort

Phase Two
• Each block is assigned its own sequence
• Runs entirely on the GPU

Page 16:

Quicksort – Phase One

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Page 17:

Quicksort – Phase One

Assign Thread Blocks

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Page 18:

Quicksort – Phase One

• Uses an auxiliary array
• In-place too expensive

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Page 19:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

LeftPointer = 0

Thread Block ONE and Thread Block TWO each perform a Fetch-And-Add on LeftPointer to claim a unique output slot in the auxiliary array.

Page 20:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

LeftPointer = 2

Fetch-And-Add on LeftPointer claims the next slots; auxiliary array so far: 3 2

Page 21:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

LeftPointer = 4

Fetch-And-Add on LeftPointer claims the next slots; auxiliary array so far: 3 2 2 3

Page 22:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

LeftPointer = 10

3 2 2 3 3 0 2 3 0 2 | 9 7 7 8 7 8 5 7 6 9 8

Page 23:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Three-way partition:

3 2 2 3 3 0 2 3 0 2 | 4 4 4 | 9 7 7 8 7 8 5 7 6 9 8

Page 24:

Two Pass Partitioning

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
< < < > = > < < < > > = > > < > < > > < = < > >

(each element is classified as <, = or > the pivot in a first counting pass, then written to its final position in a second pass)

Page 25:

Second Phase

• Recurse on the smallest sequence first ◦ max depth log(n)
• Explicit stack
• Alternative sorting for short sequences ◦ Bitonic sort

Page 26:

Experiments

Page 27:

Set of Algorithms

• GPU-Quicksort – Cederman and Tsigas, ESA '08
• GPUSort – Govindaraju et al., SIGMOD '05
• Radix-Merge – Harris et al., GPU Gems 3, '07
• Global radix – Sengupta et al., GH '07
• Hybrid – Sintorn and Assarsson, GPGPU '07
• Introsort – David Musser, Software: Practice and Experience

Page 28:

Distributions

• Uniform
• Gaussian
• Bucket
• Sorted
• Zero
• Stanford Models


Page 34:

Pivot selection

• First Phase: average of maximum and minimum element
• Second Phase: median of first, middle and last element

Page 35:

Hardware

• 8800GTX: 16 multiprocessors (128 cores), 86.4 GB/s bandwidth
• 8600GTS: 4 multiprocessors (32 cores), 32 GB/s bandwidth
• CPU reference: AMD Dual-Core Opteron 265, 1.8 GHz

Page 36:

8800GTX – Uniform Distribution

[line chart: Time (ms), 0–4500, vs. size 1M–16M; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]


Page 38:

8800GTX – Uniform – 16M

[bar chart: Time (ms), 0–4500, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]

Page 39:

8800GTX – Uniform – 16M

CPU-Reference: ~4 s

Page 40:

8800GTX – Uniform – 16M

GPUSort and Radix-Merge: ~1.5 s

Page 41:

8800GTX – Uniform – 16M

GPU-Quicksort: ~380 ms; Global Radix: ~510 ms

Page 42:

8800GTX – Sorted Distribution

[line chart: Time (ms), 0–800, vs. size 1M–8M; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]


Page 44:

8800GTX – Sorted – 8M

[bar chart: Time (ms), 0–800, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]

Page 45:

8800GTX – Sorted – 8M

Reference is faster than GPUSort and Radix-Merge

Page 46:

8800GTX – Sorted – 8M

GPU-Quicksort takes less than 200 ms

Page 47:

8800GTX – Zero Distribution

[line chart: Time (ms), 0–1800, vs. size 1M–16M; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]


Page 49:

8800GTX – Zero – 16M

[bar chart: Time (ms), 0–1800, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]

Page 50:

8800GTX – Zero – 16M

Bitonic is unaffected by distribution

Page 51:

8800GTX – Zero – 16M

Three-way partition

Page 52:

8600GTS – Uniform Distribution

[line chart: Time (ms), 0–2500, vs. size 1M–8M; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid]


Page 54:

8600GTS – Uniform – 8M

[bar chart: Time (ms), 0–2500, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid]

Page 55:

8600GTS – Uniform – 8M

Reference equal to Radix-Merge

Page 56:

8600GTS – Uniform – 8M

Quicksort and Hybrid ~4 times faster

Page 57:

8600GTS – Sorted Distribution

[line chart: Time (ms), 0–4500, vs. size 1M–8M; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid, Hybrid (Random)]


Page 59:

8600GTS – Sorted – 8M

[bar chart: Time (ms), 0–4500, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid, Hybrid (Random)]

Page 60:

8600GTS – Sorted – 8M

Hybrid needs randomization

Page 61:

8600GTS – Sorted – 8M

Reference is better here

Page 62:

Visibility Ordering – 8800GTX

[bar chart: Time (ms), 0–1000, for Manuscript (2.2M), Dragon (3.6M), Statuette (5M); series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]

Page 63:

Visibility Ordering – 8800GTX

Similar to uniform distribution

Page 64:

Visibility Ordering – 8800GTX

Sequence needs to be a power of two in length

Page 65:

Conclusions

• Minimal, manual cache ◦ used only for prefix sum and bitonic sort
• 32-word SIMD instruction ◦ main part executes the same instructions
• Coalesced memory access ◦ all reads are coalesced
• No block synchronization ◦ only required in the first phase
• Expensive synchronization primitives ◦ two passes amortize the cost

Page 66:

Conclusions

• Quicksort is a viable sorting method for graphics processors and can be implemented in a data parallel way
• It is competitive with existing GPU sorting algorithms

Page 67:

Thank you!

www.cs.chalmers.se/~cederman