parallel algorithms continued patrick cozzi university of pennsylvania cis 565 - spring 2012
TRANSCRIPT
![Page 1: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/1.jpg)
Parallel Algorithms Continued
Patrick CozziUniversity of PennsylvaniaCIS 565 - Spring 2012
![Page 2: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/2.jpg)
Announcements
Presentation topics due tomorrow, 02/07
Homework 2 due next Monday, 02/13
![Page 3: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/3.jpg)
Agenda
Parallel AlgorithmsReview Stream CompressionRadix Sort
CUDA Performance
![Page 4: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/4.jpg)
Radix Sort
Efficient for small sort keysk-bit keys require k passes
![Page 5: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/5.jpg)
Radix Sort
Each radix sort pass partitions its input based on one bit
First pass starts with the least significant bit (LSB). Subsequent passes move towards the most significant bit (MSB)
010LSBMSB
![Page 6: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/6.jpg)
Radix Sort
100
Example from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
111 010 110 011 101 001 000
Example input:
![Page 7: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/7.jpg)
Radix Sort
100 111 010 110 011 101 001 000
100 010 110 000 111 011 101 001
First pass: partition based on LSB
LSB == 0 LSB == 1
![Page 8: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/8.jpg)
Radix Sort
100 111 010 110 011 101 001 000
100 010 110 000 111 011 101 001
Second pass: partition based on middle bit
bit == 0 bit == 1
100 010 110000 111 011101 001
![Page 9: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/9.jpg)
Radix Sort
100 111 010 110 011 101 001 000
100 010 110 000 111 011 101 001
Final pass: partition based on MSB
MSB == 0 MSB == 1
100 010 110000 111 011101 001
000 100 101001 110 111010 011
![Page 10: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/10.jpg)
Radix Sort
100 111 010 110 011 101 001 000
100 010 110 000 111 011 101 001
Completed:
100 010 110000 111 011101 001
000 100 101001 110 111010 011
![Page 11: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/11.jpg)
Radix Sort
4 7 2 6 3 5 1 0
4 2 6 0 7 3 5 1
Completed:
4 2 60 7 35 1
0 4 51 6 72 3
![Page 12: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/12.jpg)
Parallel Radix Sort
Where is the parallelism?
![Page 13: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/13.jpg)
Parallel Radix Sort
1. Break input arrays into tilesEach tile fits into shared memory for an SM
2. Sort tiles in parallel with radix sort
3. Merge pairs of tiles using a parallel bitonic merge until all tiles are merged.
Our focus is on Step 2
![Page 14: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/14.jpg)
Parallel Radix Sort
Where is the parallelism?Each tile is sorted in parallelWhere is the parallelism within a tile?
![Page 15: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/15.jpg)
Parallel Radix Sort
Where is the parallelism?Each tile is sorted in parallelWhere is the parallelism within a tile?
Each pass is done in sequence after the previous pass. No parallelism
Can we parallelize an individual pass? How?
Merge also has parallelism
![Page 16: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/16.jpg)
Parallel Radix Sort
Implement spilt. Given:Array, i, at pass b:
Array, b, which is true/false for bit b:
Output array with false keys before true keys:
100 111 010 110 011 101 001 000
0 1 0 0 1 1 1 0
100 010 110 000 111 011 101 001
![Page 17: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/17.jpg)
Parallel Radix Sort
100 111 010 110 011 101 001 000
0 1 0 0 1 1 1 0
i array
b array
Step 1: Compute e array
1 0 1 1 0 0 0 1 e array
![Page 18: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/18.jpg)
Parallel Radix Sort
100 111 010 110 011 101 001 000
0 1 0 0 1 1 1 0 b array
Step 2: Exclusive Scan e
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
i array
![Page 19: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/19.jpg)
Parallel Radix Sort
100 111 010 110 011 101 001 000
0 1 0 0 1 1 1 0
i array
b array
Step 3: Compute totalFalses
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
totalFalses = e[n – 1] + f[n – 1]totalFalses = 1 + 3totalFalses = 4
![Page 20: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/20.jpg)
Parallel Radix Sort
100 111 010 110 011 101 001 000
0 1 0 0 1 1 1 0
i array
b array
Step 4: Compute t array
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
t array
t[i] = i – f[i] + totalFalses
totalFalses = 4
![Page 21: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/21.jpg)
Parallel Radix Sort
0 1 0 0 1 1 1 0
i array
b array
Step 4: Compute t array
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
4 t array
t[0] = 0 – f[0] + totalFalsest[0] = 0 – 0 + 4t[0] = 4 totalFalses = 4
100 111 010 110 011 101 001 000
![Page 22: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/22.jpg)
Parallel Radix Sort
0 1 0 0 1 1 1 0
i array
b array
Step 4: Compute t array
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
4 4 t array
t[1] = 1 – f[1] + totalFalsest[1] = 1 – 1 + 4t[1] = 4 totalFalses = 4
100 111 010 110 011 101 001 000
![Page 23: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/23.jpg)
Parallel Radix Sort
0 1 0 0 1 1 1 0
i array
b array
Step 4: Compute t array
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
4 4 5 t array
t[2] = 2 – f[2] + totalFalsest[2] = 2 – 1 + 4t[2] = 5 totalFalses = 4
100 111 010 110 011 101 001 000
![Page 24: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/24.jpg)
Parallel Radix Sort
0 1 0 0 1 1 1 0
i array
b array
Step 4: Compute t array
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
4 4 5 5 5 6 7 8 t array
totalFalses = 4
t[i] = i – f[i] + totalFalses
100 111 010 110 011 101 001 000
![Page 25: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/25.jpg)
Parallel Radix Sort
0 1 0 0 1 1 1 0
i array
b array
Step 5: Scatter based on address d
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
4 4 5 5 5 6 7 8 t array
0 d[i] = b[i] ? t[i] : f[i]
100 111 010 110 011 101 001 000
![Page 26: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/26.jpg)
Parallel Radix Sort
0 1 0 0 1 1 1 0
i array
b array
Step 5: Scatter based on address d
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
4 4 5 5 5 6 7 8 t array
0 4 d[i] = b[i] ? t[i] : f[i]
100 111 010 110 011 101 001 000
![Page 27: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/27.jpg)
Parallel Radix Sort
0 1 0 0 1 1 1 0
i array
b array
Step 5: Scatter based on address d
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
4 4 5 5 5 6 7 8 t array
0 4 1 d[i] = b[i] ? t[i] : f[i]
100 111 010 110 011 101 001 000
![Page 28: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/28.jpg)
Parallel Radix Sort
0 1 0 0 1 1 1 0
i array
b array
Step 5: Scatter based on address d
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
4 4 5 5 5 6 7 8 t array
d[i] = b[i] ? t[i] : f[i]
100 111 010 110 011 101 001 000
0 4 1 2 5 6 7 3
![Page 29: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/29.jpg)
Parallel Radix Sort
i array
Step 5: Scatter based on address d
0 4 1 2 5 6 7 3 d
100 111 010 110 011 101 001 000
output
![Page 30: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/30.jpg)
Parallel Radix Sort
i array
Step 5: Scatter based on address d
0 4 1 2 5 6 7 3 d
100 111 010 110 011 101 001 000
100 010 110 000 111 011 101 001 output
![Page 31: Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697c00d1a28abf838cc92fd/html5/thumbnails/31.jpg)
Parallel Radix Sort
Given k-bit keys, how do we sort using our new split function?
Once each tile is sorted, how do we merge tiles to provide the final sorted array?