aa-sort with sse4.1
![Page 1: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/1.jpg)
AA-sort with SSE4.1
Cybozu Labs
2012/6/16 MITSUNARI Shigeo(@herumi)
x86/x64 optimization seminar 4(#x86opti)
![Page 2: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/2.jpg)
Agenda
- Introduction of AA-sort
- classic combsort
- vectorized combsort
- vectorized merge
- benchmark
![Page 3: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/3.jpg)
AA-sort
- Aligned-Access sort, proposed by Hiroshi Inoue et al. in "A high-performance sorting algorithm for multicore single-instruction multiple-data processors," 2011
  - http://www.research.ibm.com/trl/people/inouehrs/SPE_SIMDsort.htm
  - http://www.research.ibm.com/trl/people/inouehrs/pact2007.htm
- For SIMD: fewer conditional branches, no unaligned data access
- For multicore processors: they implemented it for PowerPC and Cell BE
- O(n log n) complexity
- I tried it on Intel CPUs (not complete)
  - https://github.com/herumi/opti/blob/master/intsort.hpp
  - the current version uses only one processor
![Page 4: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/4.jpg)
AA-sort
- vectorized combsort for each block (<= L2 cache?)
- vectorized merge of the sorted blocks

    input array
    block 0   block 1   block 2   block 3   ...
     sort      sort      sort      sort
       \       /           \       /
         merge               merge      ...
            \               /
                  merge
![Page 5: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/5.jpg)
AA-sort algorithm
- sort each block: O(n log n)
- merge sorted blocks: O(n)
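The two phases can be sketched in plain C++. This is only an outline of the control flow, not the SIMD code: `std::sort` and `std::inplace_merge` stand in for the vectorized combsort and the vectorized merge, and the name `blockSortThenMerge` and its `blockSize` parameter are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort each block on its own, then merge neighbouring sorted runs,
// doubling the run width on every pass.
// std::sort / std::inplace_merge are stand-ins for the vectorized
// combsort and the vectorized merge (assumption of this sketch).
void blockSortThenMerge(std::vector<uint32_t>& a, size_t blockSize) {
    const size_t n = a.size();
    if (blockSize == 0) blockSize = 1;
    uint32_t* p = a.data();

    // phase 1: sort every block independently
    for (size_t pos = 0; pos < n; pos += blockSize) {
        const size_t end = std::min(pos + blockSize, n);
        std::sort(p + pos, p + end);
    }

    // phase 2: merge pairs of sorted runs until one run remains
    for (size_t width = blockSize; width < n; width *= 2) {
        for (size_t pos = 0; pos + width < n; pos += 2 * width) {
            const size_t mid = pos + width;
            const size_t end = std::min(pos + 2 * width, n);
            std::inplace_merge(p + pos, p + mid, p + end);
        }
    }
}
```

Each merge pass touches every element once, and there are about log2(n / blockSize) passes.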
![Page 6: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/6.jpg)
classic combsort (1/2)
- improved bubble sort
- unstable
- O(n log n) on average
- compares two elements separated by a gap (>= 1)
- the gap is divided by a shrink factor (about 1.3) after each pass

    size_t nextGap(size_t N) { return (N * 10) / 13; }

    void combsort(uint32_t *a, size_t N) {
        size_t gap = nextGap(N);
        while (gap > 1) {
            for (size_t i = 0; i < N - gap; i++) {
                if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
            }
            gap = nextGap(gap);
        }
        …
![Page 7: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/7.jpg)
classic combsort (2/2)
- gap = 1 means bubble sort
- loop until the array is fully sorted

        …
        for (;;) {
            bool isSwapped = false;
            // i + 1 < N instead of i < N - 1: no size_t underflow when N == 0
            for (size_t i = 0; i + 1 < N; i++) {
                if (a[i] > a[i + 1]) {
                    std::swap(a[i], a[i + 1]);
                    isSwapped = true;
                }
            }
            if (!isSwapped) return;
        }
    }
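Putting the two halves together gives a complete scalar combsort. This is a minimal sketch that joins the code from the two slides; the only changes are the loop bounds written as `i + gap < n` and `i + 1 < n`, so that `n == 0` does not underflow.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

size_t nextGap(size_t n) { return (n * 10) / 13; }

void combsort(uint32_t* a, size_t n) {
    // phase 1: compare elements `gap` apart, shrinking the gap by about 1.3
    for (size_t gap = nextGap(n); gap > 1; gap = nextGap(gap)) {
        for (size_t i = 0; i + gap < n; i++) {
            if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
        }
    }
    // phase 2: gap == 1 is bubble sort; repeat until a pass swaps nothing
    for (;;) {
        bool isSwapped = false;
        for (size_t i = 0; i + 1 < n; i++) {
            if (a[i] > a[i + 1]) {
                std::swap(a[i], a[i + 1]);
                isSwapped = true;
            }
        }
        if (!isSwapped) return;
    }
}
```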
![Page 8: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/8.jpg)
gap function
- Combsort11: ending the gap sequence with [11, 8, 6, 4, 3, 2, 1] seems good, according to http://cs.clackamas.cc.or.us/molatore/cs260Spr03/combsort.htm
- a little faster if line (*) is added

    size_t nextGap(size_t n) {
        n = (n * 10) / 13;
        if (n == 9 || n == 10) return 11; // (*)
        return n;
    }
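To see the effect of line (*), the gap sequence can be listed for a given array size. `gapSequence` is a hypothetical helper written for this illustration; `nextGap` is the function from the slide.

```cpp
#include <cstddef>
#include <vector>

size_t nextGap(size_t n) {
    n = (n * 10) / 13;
    if (n == 9 || n == 10) return 11; // (*)
    return n;
}

// Hypothetical helper: every gap combsort would use for an array of n
// elements, from the first gap down to 1.
std::vector<size_t> gapSequence(size_t n) {
    std::vector<size_t> gaps;
    // nextGap(1) == 0, so the loop ends right after gap 1 is recorded
    for (size_t gap = nextGap(n); gap >= 1; gap = nextGap(gap)) {
        gaps.push_back(gap);
    }
    return gaps;
}
```

For n = 100 this gives 76, 58, 44, 33, 25, 19, 14, 11, 8, 6, 4, 3, 2, 1: the gap after 14 would be 10, and line (*) turns it into 11, so the sequence ends with the Combsort11 pattern.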
![Page 9: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/9.jpg)
vectorized combsort
- step1: sort values within each vector (32 bit x 4)
- step2: SIMD version of combsort
- step3: reorder data

[Figure: the input array is loaded as vectors v0, v1, v2, v3, ... with lanes +0 to +3. step1 sorts the values within each vector. step2 sorts across the vectors, so that each lane holds one ascending run (0 1 3 ... 101 in lane +0, 102 104 105 ... 380 in lane +1, 389 391 392 ... 502 in lane +2, 511 515 612 ... 973 in lane +3). step3 reorders the data into the final sorted array.]
![Page 10: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/10.jpg)
step1
- step1.1: sort [v[i][j] | i <- [0..3]] for j = 0, 1, 2, 3
- step1.2: transpose

    input                after step1.1 (sort)   after step1.2 (transpose)
        v0 v1 v2 v3          v0 v1 v2 v3            v0 v1 v2 v3
    +0   3  5  0  8           0  3  5  8             0  1  4  9
    +1   2  7  1  2           1  2  2  7             3  2  8 14
    +2   8 12  4 13           4  8 12 13             5  2 12 15
    +3   9 14 20 15           9 14 15 20             8  7 13 20
![Page 11: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/11.jpg)
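sort of 4 items
- use pmaxud, pminud for uint32_t x 4

[Figure: sorting network for v0, v1, v2, v3, built from the compare-exchange (a, b) -> (min(a,b), max(a,b))]

    min01 = min(v0, v1)    max01 = max(v0, v1)
    min23 = min(v2, v3)    max23 = max(v2, v3)
    min0123 = min(min01, min23)
    max0123 = max(max01, max23)
    s = max(min01, min23)
    t = min(max01, max23)
    sorted: min0123 <= min(s,t) <= max(s,t) <= max0123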
![Page 12: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/12.jpg)
source of step1.1
- V128 is a type of 32-bit integer x 4
- pminud(a, b): min(a_i, b_i) for i = 0, 1, 2, 3

    void sort_step1_vec(V128 x[4])
    {
        V128 min01 = pminud(x[0], x[1]);
        V128 max01 = pmaxud(x[0], x[1]);
        V128 min23 = pminud(x[2], x[3]);
        V128 max23 = pmaxud(x[2], x[3]);
        x[0] = pminud(min01, min23);
        x[3] = pmaxud(max01, max23);
        V128 s = pmaxud(min01, min23);
        V128 t = pminud(max01, max23);
        x[1] = pminud(s, t);
        x[2] = pmaxud(s, t);
    }
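The network can be tried without SSE4.1 hardware by modelling a register as four scalars. In this sketch `V128` is a `std::array` and `pminud` / `pmaxud` are scalar stand-ins (assumptions of this illustration, not the original wrappers); with intrinsics they would correspond to `_mm_min_epu32` and `_mm_max_epu32`. `sort_step1_vec` itself is unchanged from the slide.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Scalar stand-in for a 128-bit register holding four uint32_t lanes.
using V128 = std::array<uint32_t, 4>;

// Lane-wise unsigned minimum (models pminud).
V128 pminud(const V128& a, const V128& b) {
    V128 r;
    for (int i = 0; i < 4; i++) r[i] = std::min(a[i], b[i]);
    return r;
}

// Lane-wise unsigned maximum (models pmaxud).
V128 pmaxud(const V128& a, const V128& b) {
    V128 r;
    for (int i = 0; i < 4; i++) r[i] = std::max(a[i], b[i]);
    return r;
}

// After the call, x[0][j] <= x[1][j] <= x[2][j] <= x[3][j] for every lane j.
void sort_step1_vec(V128 x[4]) {
    V128 min01 = pminud(x[0], x[1]);
    V128 max01 = pmaxud(x[0], x[1]);
    V128 min23 = pminud(x[2], x[3]);
    V128 max23 = pmaxud(x[2], x[3]);
    x[0] = pminud(min01, min23);   // smallest of the four
    x[3] = pmaxud(max01, max23);   // largest of the four
    V128 s = pmaxud(min01, min23); // the two middle candidates
    V128 t = pminud(max01, max23);
    x[1] = pminud(s, t);
    x[2] = pmaxud(s, t);
}
```

With the step1 example data (v0 = 3 2 8 9, v1 = 5 7 12 14, v2 = 0 1 4 20, v3 = 8 2 13 15) the result is v0 = 0 1 4 9, v1 = 3 2 8 14, v2 = 5 2 12 15, v3 = 8 7 13 20, which matches the "after step1.1" picture.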
![Page 13: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/13.jpg)
transpose of 4x4 matrix
- use unpcklps and unpckhps

        x0 x1 x2 x3
    +0   3  5  0  8
    +1   2  7  1  2
    +2   8 12  4 13
    +3   9 14 20 15

    t0=unpcklps(x0,x2)  t1=unpcklps(x1,x3)  t2=unpckhps(x0,x2)  t3=unpckhps(x1,x3)

        t0 t1 t2 t3
    +0   3  5  8 12
    +1   0  8  4 13
    +2   2  7  9 14
    +3   1  2 20 15

    x0=unpcklps(t0,t1)  x1=unpckhps(t0,t1)  x2=unpcklps(t2,t3)  x3=unpckhps(t2,t3)

        x0 x1 x2 x3
    +0   3  2  8  9
    +1   5  7 12 14
    +2   0  1  4 20
    +3   8  2 13 15
![Page 14: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/14.jpg)
source of transpose and step1

    void transpose(V128 x[4])
    {
        V128 x0 = x[0];
        V128 x1 = x[1];
        V128 x2 = x[2];
        V128 x3 = x[3];
        V128 t0 = unpcklps(x0, x2);
        V128 t1 = unpcklps(x1, x3);
        V128 t2 = unpckhps(x0, x2);
        V128 t3 = unpckhps(x1, x3);
        x[0] = unpcklps(t0, t1);
        x[1] = unpckhps(t0, t1);
        x[2] = unpcklps(t2, t3);
        x[3] = unpckhps(t2, t3);
    }

    void sort_step1(V128 *va, size_t N)
    {
        for (size_t i = 0; i < N; i += 4) {
            sort_step1_vec(&va[i]);
            transpose(&va[i]);
        }
    }
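The transpose can be checked with a scalar model of the two unpack instructions. `V128` as a `std::array` and the scalar `unpcklps` / `unpckhps` below are stand-ins assumed for this sketch (the intrinsics would be `_mm_unpacklo_ps` and `_mm_unpackhi_ps`); `transpose` follows the slide.

```cpp
#include <array>
#include <cstdint>

// Scalar stand-in for a 128-bit register holding four 32-bit lanes.
using V128 = std::array<uint32_t, 4>;

// Interleave the low halves: (a0, b0, a1, b1).
V128 unpcklps(const V128& a, const V128& b) {
    V128 r = {a[0], b[0], a[1], b[1]};
    return r;
}

// Interleave the high halves: (a2, b2, a3, b3).
V128 unpckhps(const V128& a, const V128& b) {
    V128 r = {a[2], b[2], a[3], b[3]};
    return r;
}

// 4x4 transpose in two rounds of unpacks, as on the slide.
void transpose(V128 x[4]) {
    V128 t0 = unpcklps(x[0], x[2]);
    V128 t1 = unpcklps(x[1], x[3]);
    V128 t2 = unpckhps(x[0], x[2]);
    V128 t3 = unpckhps(x[1], x[3]);
    x[0] = unpcklps(t0, t1);
    x[1] = unpckhps(t0, t1);
    x[2] = unpcklps(t2, t3);
    x[3] = unpckhps(t2, t3);
}
```

Starting from x0 = 3 2 8 9, x1 = 5 7 12 14, x2 = 0 1 4 20, x3 = 8 2 13 15, the call produces x0 = 3 5 0 8, x1 = 2 7 1 2, x2 = 8 12 4 13, x3 = 9 14 20 15, and a second call restores the input.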
![Page 15: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/15.jpg)
SIMD version combsort
- first half of the code
- uses vector_cmpswap and vector_cmpswap_skew

    bool sort_step2(V128 *va, size_t N) {
        size_t gap = nextGap(N);
        while (gap > 1) {
            for (size_t i = 0; i < N - gap; i++) {
                vector_cmpswap(va[i], va[i + gap]);
            }
            for (size_t i = N - gap; i < N; i++) {
                vector_cmpswap_skew(va[i], va[i + gap - N]);
            }
            gap = nextGap(gap);
        }
        ...
![Page 16: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/16.jpg)
vector_cmpswap
- no conditional branch

    void vector_cmpswap(V128& a, V128& b)
    {
        V128 t = pmaxud(a, b);
        a = pminud(a, b);
        b = t;
    }

- vectorised form of

    if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);

[Figure: (a, b) -> (min(a,b), max(a,b))]
![Page 17: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/17.jpg)
vector_cmpswap_skew
- for the boundary of the array

    a  = [a3 : a2 : a1 : a0]
    b  = [b3 : b2 : b1 : b0]

    a' = [a3 : min(a2,b3) : min(a1,b2) : min(a0,b1)]
    b' = [max(a2,b3) : max(a1,b2) : max(a0,b1) : b0]

    (a', b') = vector_cmpswap_skew(a, b)
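The slides give no source for `vector_cmpswap_skew`, so the following is only a scalar model of the lane relations drawn above, next to the same model of `vector_cmpswap`. `V128` as a `std::array` is an assumption of this sketch, and `a` and `b` are assumed to be different objects.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Scalar stand-in for a 128-bit register; index i is lane i (a0 is a[0]).
using V128 = std::array<uint32_t, 4>;

// Lane i of a is compared with lane i of b:
// a gets the minima, b gets the maxima.
void vector_cmpswap(V128& a, V128& b) {
    for (int i = 0; i < 4; i++) {
        const uint32_t lo = std::min(a[i], b[i]);
        const uint32_t hi = std::max(a[i], b[i]);
        a[i] = lo;
        b[i] = hi;
    }
}

// Lane i of a is compared with lane i + 1 of b (i = 0, 1, 2).
// a[3] and b[0] have no partner and stay unchanged.
void vector_cmpswap_skew(V128& a, V128& b) {
    for (int i = 0; i < 3; i++) {
        const uint32_t lo = std::min(a[i], b[i + 1]);
        const uint32_t hi = std::max(a[i], b[i + 1]);
        a[i] = lo;
        b[i + 1] = hi;
    }
}
```

The one-lane shift is what connects the end of one lane's run to the start of the next lane's run when the comparison wraps around the end of the array.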
![Page 18: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/18.jpg)
isSortedVec
- check whether the array is sorted
- ptest_zf(a, b) is true if (a & b) == 0
- a <= b  <=>  max(a,b) == b  <=>  c := max(a,b) - b == 0
- pcmpgtd is for int32_t, so we can't use it

    bool isSortedVec(const V128 *va, size_t N) {
        // i + 1 < N instead of i < N - 1: no size_t underflow when N == 0
        for (size_t i = 0; i + 1 < N; i++) {
            V128 a = va[i];
            V128 b = va[i + 1];
            V128 c = pmaxud(a, b);
            c = psubd(c, b);
            if (!ptest_zf(c, c)) {
                return false;
            }
        }
        return true;
    }
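The max-and-subtract test, and the reason a signed compare gives the wrong answer, can be shown on a scalar model. `V128` as a `std::array` and the helpers `lanesLessEqual` and `signedGreater` are hypothetical names for this sketch; OR-ing the lanes together plays the role of the single `ptest`.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar stand-in for a 128-bit register holding four uint32_t lanes.
using V128 = std::array<uint32_t, 4>;

// True if a[i] <= b[i] (unsigned) in every lane.
bool lanesLessEqual(const V128& a, const V128& b) {
    uint32_t acc = 0;
    for (int i = 0; i < 4; i++) {
        // models pmaxud followed by psubd: zero exactly when a[i] <= b[i]
        acc |= std::max(a[i], b[i]) - b[i];
    }
    return acc == 0; // models ptest: are all bits zero?
}

bool isSortedVec(const V128* va, size_t n) {
    for (size_t i = 0; i + 1 < n; i++) {
        if (!lanesLessEqual(va[i], va[i + 1])) return false;
    }
    return true;
}

// What a signed comparison such as pcmpgtd concludes for one lane.
// The cast assumes two's complement, which holds on x86.
bool signedGreater(uint32_t a, uint32_t b) {
    return static_cast<int32_t>(a) > static_cast<int32_t>(b);
}
```

For a = 1 and b = 0x80000000 the unsigned order is a < b, but `signedGreater(a, b)` is true because 0x80000000 is negative as an int32_t.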
![Page 19: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/19.jpg)
loop for gap == 1
- vectorised bubble sort for gap == 1
- give up if the loop count reaches maxLoop
  - fall back to std::sort (rarely happens)

        const int maxLoop = 10;
        for (int loop = 0; loop < maxLoop; loop++) {
            for (size_t i = 0; i < N - 1; i++) {
                vector_cmpswap(va[i], va[i + 1]);
            }
            vector_cmpswap_skew(va[N - 1], va[0]);
            if (isSortedVec(va, N)) return true;
        }
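Both halves of step2 can be run together on a scalar model. This is a sketch under assumptions: `V128` is a `std::array`, the helpers are scalar stand-ins for the SIMD versions, the final `return false` (meaning the caller should fall back) is inferred from the bullet above rather than taken from the slides, and inputs with fewer than two vectors are not handled.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

using V128 = std::array<uint32_t, 4>; // scalar stand-in, index = lane

size_t nextGap(size_t n) {
    n = (n * 10) / 13;
    if (n == 9 || n == 10) return 11;
    return n;
}

void vector_cmpswap(V128& a, V128& b) {
    for (int i = 0; i < 4; i++) {
        const uint32_t lo = std::min(a[i], b[i]);
        const uint32_t hi = std::max(a[i], b[i]);
        a[i] = lo;
        b[i] = hi;
    }
}

// lane i of a against lane i + 1 of b
void vector_cmpswap_skew(V128& a, V128& b) {
    for (int i = 0; i < 3; i++) {
        const uint32_t lo = std::min(a[i], b[i + 1]);
        const uint32_t hi = std::max(a[i], b[i + 1]);
        a[i] = lo;
        b[i + 1] = hi;
    }
}

// plain comparisons here; the slides use pmaxud / psubd / ptest
bool isSortedVec(const V128* va, size_t n) {
    for (size_t i = 0; i + 1 < n; i++) {
        for (int j = 0; j < 4; j++) {
            if (va[i][j] > va[i + 1][j]) return false;
        }
    }
    return true;
}

// Returns true when va is sorted lane by lane: all of lane 0 across
// va[0..n-1], then all of lane 1, and so on.
bool sort_step2(V128* va, size_t n) {
    if (n < 2) return false; // not handled by this sketch
    for (size_t gap = nextGap(n); gap > 1; gap = nextGap(gap)) {
        for (size_t i = 0; i + gap < n; i++) {
            vector_cmpswap(va[i], va[i + gap]);
        }
        for (size_t i = n - gap; i < n; i++) {
            vector_cmpswap_skew(va[i], va[i + gap - n]); // wraps around
        }
    }
    const int maxLoop = 10;
    for (int loop = 0; loop < maxLoop; loop++) {
        for (size_t i = 0; i + 1 < n; i++) {
            vector_cmpswap(va[i], va[i + 1]);
        }
        vector_cmpswap_skew(va[n - 1], va[0]);
        if (isSortedVec(va, n)) return true;
    }
    return false; // assumed: the caller falls back, e.g. to std::sort
}
```

When `true` is returned, the skewed compare-and-swap was the last operation before the check, so the end of each lane's run is also no larger than the start of the next lane's run.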
![Page 20: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/20.jpg)
AA-sort algorithm
- sort each block: O(n log n)
- merge sorted blocks: O(n)
![Page 21: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/21.jpg)
merge two sorted vectors
- a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are sorted
- c = [b:a] = merge and sort (a, b)

[Figure: the sorted vectors a = (a0, a1, a2, a3) and b = (b0, b1, b2, b3) go in; [b:a] = vector_merge(a,b) comes out as one sorted sequence c spread over the two vectors]
![Page 22: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/22.jpg)
data flow of merge

[Figure: data flow of the merge network. The sorted inputs a0 a1 a2 a3 and b0 b1 b2 b3 are first compared pairwise, giving min00 max00 min11 max11 min22 max22 min33 max33; further compare-exchange stages follow.]
![Page 23: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/23.jpg)
source of vector_merge
- too complex; any good idea?

    void vector_merge(V128& a, V128& b) {
        V128 m = pminud(a, b);
        V128 M = pmaxud(a, b);
        V128 s0 = punpckhqdq(m, m);
        V128 s1 = pminud(s0, M);
        V128 s2 = pmaxud(s0, M);
        V128 s3 = punpcklqdq(s1, punpckhqdq(M, M));
        V128 s4 = punpcklqdq(s2, m);
        s4 = pshufd<PACK(2, 1, 0, 3)>(s4);
        V128 s5 = pminud(s3, s4);
        V128 s6 = pmaxud(s3, s4);
        V128 s7 = pinsrd<2>(s5, movd(s6));
        V128 s8 = pinsrd<0>(s6, pextrd<2>(s5));
        a = pshufd<PACK(1, 2, 0, 3)>(s7);
        b = pshufd<PACK(3, 2, 0, 1)>(s8);
    }
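A scalar reference is handy for checking a shuffle-based merge like the one above. The sketch below only states what `vector_merge` is expected to compute, based on the description [b:a] = merge and sort (a, b); it is not the SIMD network. `V128` as a `std::array` and the name `vector_merge_ref` are assumptions of this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Scalar stand-in for a 128-bit register; index i is lane i.
using V128 = std::array<uint32_t, 4>;

// Reference semantics: a and b are each sorted ascending on entry.
// On return a holds the four smallest of the eight values and b the
// four largest, both still ascending.
void vector_merge_ref(V128& a, V128& b) {
    std::array<uint32_t, 8> c;
    std::merge(a.begin(), a.end(), b.begin(), b.end(), c.begin());
    std::copy(c.begin(), c.begin() + 4, a.begin());
    std::copy(c.begin() + 4, c.end(), b.begin());
}
```

For a = (1, 4, 6, 9) and b = (2, 3, 7, 10) the result is a = (1, 2, 3, 4) and b = (6, 7, 9, 10).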
![Page 24: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/24.jpg)
std::merge()
- merge [begin1, end1) and [begin2, end2)

    template <class In1, class In2, class Out>
    Out merge(In1 begin1, In1 end1, In2 begin2, In2 end2, Out out)
    {
        // simplified: assumes both input ranges are non-empty
        for (;;) {
            *out++ = *begin2 < *begin1 ? *begin2++ : *begin1++;
            if (begin1 == end1) return std::copy(begin2, end2, out);
            if (begin2 == end2) return std::copy(begin1, end1, out);
        }
    }
![Page 25: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/25.jpg)
vectorised merge
- merge arrays with vector_merge()

    void merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t bN)
    {
        uint32_t aPos = 0, bPos = 0, outPos = 0;
        V128 vMin = va[aPos++];
        V128 vMax = vb[bPos++];
        for (;;) {
            vector_merge(vMin, vMax);
            vo[outPos++] = vMin;
            if (aPos < aN) {
                if (bPos < bN) {
                    V128 ta = va[aPos];
                    V128 tb = vb[bPos];
                    if (movd(ta) <= movd(tb)) { // compare ta0 with tb0
                        vMin = ta;
                        aPos++;
                    } else {
                        vMin = tb;
                        bPos++;
                    }
                    …
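The code on the slide is cut off, so here is one way the loop can be completed, written on a scalar model. It is a sketch, not the original source: the handling of an exhausted input and the final flush of `vMax` are assumptions, `merge_vectors` is a hypothetical name, and `vector_merge_ref` is a scalar stand-in for `vector_merge`.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

using V128 = std::array<uint32_t, 4>; // scalar stand-in, index = lane

// a <- four smallest, b <- four largest of the eight values (both sorted)
void vector_merge_ref(V128& a, V128& b) {
    std::array<uint32_t, 8> c;
    std::merge(a.begin(), a.end(), b.begin(), b.end(), c.begin());
    std::copy(c.begin(), c.begin() + 4, a.begin());
    std::copy(c.begin() + 4, c.end(), b.begin());
}

// va (aN vectors) and vb (bN vectors) are sorted arrays with aN, bN >= 1.
// vo must have room for aN + bN vectors.
void merge_vectors(V128* vo, const V128* va, size_t aN,
                   const V128* vb, size_t bN) {
    size_t aPos = 0, bPos = 0, outPos = 0;
    V128 vMin = va[aPos++];
    V128 vMax = vb[bPos++];
    for (;;) {
        vector_merge_ref(vMin, vMax);
        vo[outPos++] = vMin; // the four smallest values are final
        // load the next vector from the input whose head is smaller
        if (aPos < aN && (bPos >= bN || va[aPos][0] <= vb[bPos][0])) {
            vMin = va[aPos++];
        } else if (bPos < bN) {
            vMin = vb[bPos++];
        } else {
            vo[outPos++] = vMax; // both inputs consumed: flush the rest
            return;
        }
    }
}
```

Only the first elements of the two candidate vectors are compared, so there is one branch per four output elements. Everything still held in `vMax` is no larger than either remaining head, which is why the four smallest values after each `vector_merge_ref` can be written out immediately.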
![Page 26: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/26.jpg)
block size and rate of sort
- What is a good block size for the vectorised sort?
  - half the size of L2 is recommended for PowerPC 970MP
  - L2 = 1MiB => 512KiB => block size = 128Ki elements (uint32_t)
  - BS = 32Ki seems good for Xeon, Core i7
- profile of sort and merge

[Figure: share of time spent in sort (%) and merge (%), on a 0 to 100 scale, for array sizes 64Ki, 128Ki, 256Ki, 512Ki, 1Mi, 2Mi, 4Mi, 8Mi]
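The block-size arithmetic for the half-of-L2 rule is short enough to write down. `blockElements` is a hypothetical helper for this illustration, and the rule itself is the PowerPC 970MP recommendation quoted above, not a measured optimum for x86.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of uint32_t elements in a block that
// occupies half of an L2 cache of l2Bytes bytes.
constexpr size_t blockElements(size_t l2Bytes) {
    return (l2Bytes / 2) / sizeof(uint32_t);
}

// 1 MiB of L2 -> 512 KiB per block -> 128Ki elements
static_assert(blockElements(1024 * 1024) == 128 * 1024,
              "half of 1 MiB holds 128Ki uint32_t");
```

For comparison, a block of 32Ki uint32_t elements occupies 128 KiB.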
![Page 27: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/27.jpg)
Benchmark (1/3)
- AA-sort vs std::sort for random data
- Xeon X5650 + gcc-4.6.3
- 4 times faster for # < 64Ki; 2.85 times faster for # = 4Mi

[Figure: clock cycles (log scale, 1 to 10000000) against # of uint32_t (16 to 8Mi); the legend includes std::sort; an arrow marks the "fast" direction]
![Page 28: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/28.jpg)
Benchmark (2/3)
- sort 64Ki uint on Xeon + gcc-4.6.3
- AA-sort speed does not strongly depend on the input pattern

[Figure: bar chart (scale 0 to 25000) of std::sort and AA-sort for the patterns random, 16bit random, 8bit random, all zero, almost sorted, sorted, reversed; an arrow marks the "fast" direction]
![Page 29: AA-sort with SSE4.1](https://reader034.vdocuments.mx/reader034/viewer/2022052622/5591817b1a28ab0c538b4682/html5/thumbnails/29.jpg)
Benchmark (3/3)
- sort 64Ki uint on Core i7 + gcc-4.6.3 / VC11

[Figure: bar chart (scale 0 to 16000) of std::sort(gcc), AA-sort(gcc), std::sort(VC), AA-sort(VC) for the patterns random, 16bit random, 8bit random, all zero, almost sorted, sorted, reversed; an arrow marks the "fast" direction]