A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting

OMS 2007
Arne Maus and Stein Gjessing, Dept. of Informatics, University of Oslo, Norway


TRANSCRIPT

Page 1: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting

Arne Maus and Stein Gjessing
Dept. of Informatics, University of Oslo, Norway

Page 2: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


Overview

- Motivation: CPU versus memory speed
  - Caches
- A cache test
- A simple model for the execution times of algorithms
  - Do theoretical cache tests carry over to real programs?
- A real example: three Radix sorting algorithms compared
- The number of instructions executed is no longer a good measure for the performance of an algorithm

Page 3: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


The need for caches, the CPU-Memory performance gap

from: John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Francisco, CA, 2003

Page 4: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


A cache test: random vs. sequential access in large arrays

Both a and b are of length n (n = 100, 200, 400, ..., 97m).

Two test runs, with the same number of instructions performed:
- Random access: set b[i] = random(0..n-1). We then get 15 random accesses in b, 1 sequential access in b (the innermost b[i]), and 1 access in a.
- Sequential access: set b[i] = i. Then b[b[.....b[i]....]] = i, and we get 16 sequential accesses in b and 1 in a.

for (int i = 0; i < n; i++)
    a[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[i]]]]]]]]]]]]]]]]] = i;
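The slide shows only the inner loop; a minimal, self-contained sketch of how such a test could be driven (the timing harness, the doubling schedule of n and the heap-size note below are assumptions for illustration, not the authors' original benchmark) might look like this:

import java.util.Random;

public class CacheTest {
    // One timed run of the 16-level indirection loop from the slide.
    static long run(int[] a, int[] b) {
        long t = System.nanoTime();
        for (int i = 0; i < a.length; i++)
            a[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[i]]]]]]]]]]]]]]]]] = i;
        return System.nanoTime() - t;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        // Doubling n; the largest sizes need a big heap (e.g. -Xmx2g).
        for (int n = 100; n <= 50_000_000; n *= 2) {
            int[] a = new int[n], b = new int[n];
            for (int i = 0; i < n; i++) b[i] = i;              // sequential variant: b[i] = i
            long seq = run(a, b);
            for (int i = 0; i < n; i++) b[i] = rnd.nextInt(n); // random variant: b[i] = random(0..n-1)
            long ran = run(a, b);
            System.out.printf("n=%,d  random/sequential = %.1f%n", n, (double) ran / seq);
        }
    }
}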

Page 5: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


[Figure: Random vs. sequential read in arrays [0:n-1]. Y-axis: how many times slower the random reads are (0 to 70); x-axis: n from 10 to 100 000 000 (log scale). Curves: AMD Opteron 254 2.8 GHz, Intel Xeon 2.8 GHz, Intel Core Duo U2500 1.16 GHz, UltraSparc III 1.1 GHz. Annotations mark the start of cache misses from L1 to L2 and from L2 to memory.]

Random vs. sequential access times with the same number of instructions performed: cache misses slow random access down by a factor of 50-60 (all 4 CPUs).

Page 6: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


Why a slowdown of 50-60 and not a factor of 400?

Patterson and Hennessy suggest a slowdown factor of 400; the test shows 50 to 60. Why?

Answer: every array access in Java is checked against the lower and upper array limits, roughly:
- load array index
- compare with zero (lower limit)
- load upper limit
- compare index and upper limit
- load array base address
- load/store array element (= possible cache miss)

That is 5 operations that hit in the cache plus one possible cache miss, so the average cost is (5 + 400)/6 ≈ 67.
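As a quick check of that average (the 5 bounds-check operations and the 400-cycle miss penalty are the numbers from the slide; the weighting itself is just arithmetic):

public class EffectiveSlowdown {
    public static void main(String[] args) {
        int hitOps = 5;        // bounds-check operations that hit in the cache, cost ~1 each
        int missPenalty = 400; // suggested cost of one access that goes all the way to memory
        double avg = (hitOps * 1.0 + missPenalty) / (hitOps + 1);
        System.out.println(avg); // ~67, of the same order as the measured 50-60
    }
}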

Page 7: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


From the figure for random access we see an asymptotic slowdown factor of:
- 1 if n < L1
- 4 if L1 < n < L2
- 50 if L2 < n

The access time TR for one random read or write is then:
    TR = 1 * Pr(access in L1) + 4 * Pr(access in L2) + 50 * Pr(access in memory)
       ( = 1 * L1/n + 4 * L2/n + 50 * (n - L2)/n, when n > L2 )

The cost of a sequential read or write is set to 1, and we can then estimate the total execution time as the weighted sum over all loop accesses.

A simple model for the execution time of a program: for every loop in the program,
- count the number of sequential references, and
- count the number of random accesses and the number of places n in which the randomly accessed object (array) is used.
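A minimal sketch of this access-cost estimate in code; here L1 and L2 are the number of array elements that fit in each cache level, the concrete values in main are placeholders rather than measured capacities, and the middle term uses (L2 - L1)/n, essentially the slide's L2/n since L1 << L2:

public class RandomAccessCost {
    // Expected cost of one random access into an array of n elements,
    // using the slide's weights: 1 in L1, 4 in L2, 50 in main memory.
    static double tr(double n, double l1, double l2) {
        if (n <= l1) return 1.0;
        if (n <= l2) return (1.0 * l1 + 4.0 * (n - l1)) / n;
        return (1.0 * l1 + 4.0 * (l2 - l1) + 50.0 * (n - l2)) / n;
    }

    public static void main(String[] args) {
        double l1 = 16_000, l2 = 250_000; // placeholder cache sizes, in array elements
        for (double n : new double[]{1_000, 100_000, 52_000_000})
            System.out.printf("n = %.0f  TR = %.1f%n", n, tr(n, l1, l2));
    }
}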

[Figure repeated from Page 5: Random vs. sequential read in arrays [0:n-1], same four CPUs, with the L1 and L2 cache sizes marked on the n axis.]

Page 8: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


Applying the model to Radix sorting: the test

Three Radix algorithms:
- radix1, sorting the array in one pass with one 'large' digit
- radix2, sorting the array in two passes with two half-sized digits
- radix3, sorting the array in three passes with three 'small' digits

radix3 performs almost three times as many instructions as radix1. Should it then be almost 3 times as slow as radix1?
radix2 performs almost twice as many instructions as radix1. Should it then be almost 2 times as slow as radix1?

Page 9: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


Base: Right Radix sorting algorithm, one pass over array a with one sorting digit of width maskLen (shifted shift bits up):

static void radixSort(int[] a, int[] b, int left, int right, int maskLen, int shift) {
    int acumVal = 0, j, n = right - left + 1;
    int mask = (1 << maskLen) - 1;
    int[] count = new int[mask + 1];
    // a) count = the frequency of each radix value in a
    for (int i = left; i <= right; i++) count[(a[i] >> shift) & mask]++;
    // b) add up in 'count' - accumulated values
    for (int i = 0; i <= mask; i++) { j = count[i]; count[i] = acumVal; acumVal += j; }
    // c) move numbers in sorted order from a to b
    for (int i = 0; i < n; i++) b[count[(a[i + left] >> shift) & mask]++] = a[i + left];
    // d) copy b back to a
    for (int i = 0; i < n; i++) a[i + left] = b[i];
}

Page 10: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


Radix sort with 1, 2 and 3 digits = 1, 2 and 3 passes

static void radix1(int[] a, int left, int right) {
    // 1-digit radixSort: a[left..right]
    int max = 0, numBit = 1, n = right - left + 1;
    for (int i = left; i <= right; i++) if (a[i] > max) max = a[i];
    while (max >= (1 << numBit)) numBit++;
    int[] b = new int[n];
    radixSort(a, b, left, right, numBit, 0);
}

static void radix3(int[] a, int left, int right) {
    // 3-digit radixSort: a[left..right]
    int max = 0, numBit = 3, n = right - left + 1;
    for (int i = left; i <= right; i++) if (a[i] > max) max = a[i];
    while (max >= (1 << numBit)) numBit++;
    int bit1 = numBit / 3, bit2 = bit1, bit3 = numBit - (bit1 + bit2);
    int[] b = new int[n];
    radixSort(a, b, left, right, bit1, 0);
    radixSort(a, b, left, right, bit2, bit1);
    radixSort(a, b, left, right, bit3, bit1 + bit2);
}
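radix2 (two passes with two half-sized digits, as described on the previous slide) is not reproduced in this transcript; a minimal sketch following the same pattern as radix1 and radix3 and calling the radixSort above, offered as an assumed reconstruction rather than the authors' original code:

static void radix2(int[] a, int left, int right) {
    // 2-digit radixSort: a[left..right] (assumed reconstruction, two half-sized digits)
    int max = 0, numBit = 2, n = right - left + 1;
    for (int i = left; i <= right; i++) if (a[i] > max) max = a[i];
    while (max >= (1 << numBit)) numBit++;
    int bit1 = numBit / 2, bit2 = numBit - bit1;
    int[] b = new int[n];
    radixSort(a, b, left, right, bit1, 0);
    radixSort(a, b, left, right, bit2, bit1);
}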


Page 11: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


Random/sequential test (AMD Opteron): Radix 1, 2 and 3 compared with Quicksort and Flashsort

[Figure: Opteron 254, 2.8 GHz, uniform U(0:n-1) distribution. Y-axis: performance relative to Quicksort (= 1), 0 to 1.8; x-axis: n from 10 to 100 000 000 (log scale). Curves: Java Arrays.sort (Quick), RRadix 1 pass, RRadix 2 pass, RRadix 3 pass, FlashSort 1 pass. Annotations: radix1 slowed down by a factor of 7; radix3 shows no slowdown; radix2's slowdown has started.]

Page 12: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


Random/sequential test (Intel Xeon): Radix 1, 2 and 3 compared with Quicksort and Flashsort

[Figure: Xeon 2.8 GHz, uniform U(0:n-1) distribution. Y-axis: performance relative to Quicksort (= 1), 0 to 3; x-axis: n from 10 to 100 000 000 (log scale). Curves: Java Arrays.sort (Quick), RRadix 1 pass, RRadix 2 pass, RRadix 3 pass, FlashSort 1 pass.]

Page 13: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


The model: careful counting of loops in radix1, 2 and 3

Let Ek denote the number of the different operations for a k-pass radix algorithm (k = 1, 2, ...), S a sequential read or write, and Rk a random read or write into m different places in an array, where

    m = 2^((log2 n)/k)

i.e. the count array of a k-pass radix sort has about n^(1/k) entries. After some simplification this gives expressions for E1, E2 and E3, which are evaluated against measurements on the next slide.
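The access counts implied by the evaluation on the next slide (10 sequential + 3 random accesses per element for radix1, 15 + 6 for radix2, and 31 accesses in total for radix3, whose small count array stays in L1) can be plugged straight into the model; this small sketch just reproduces that arithmetic for the two regimes in the table:

public class RadixModel {
    // E_k = (#sequential accesses) * S + (#random accesses) * R_k, per sorted element.
    static double e(int seq, int rnd, double s, double rk) { return seq * s + rnd * rk; }

    public static void main(String[] args) {
        // Small n (n < L1): every access costs the same, R1 = R2 = R3 = S = 1.
        System.out.printf("small n: E2/E1 = %.2f  E3/E1 = %.2f%n",
                e(15, 6, 1, 1) / e(10, 3, 1, 1), 31.0 / e(10, 3, 1, 1));
        // Large n (n >> L2): R1 = 50, R2 = 10, R3 = S = 1, as in the table.
        System.out.printf("large n: E2/E1 = %.2f  E3/E1 = %.2f%n",
                e(15, 6, 1, 10) / e(10, 3, 1, 50), 31.0 / e(10, 3, 1, 50));
    }
}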

Page 14: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


Model vs. test results (Opteron and Xeon)

Test, Opteron:
               n = 100    n = 52m
    R2 / R1     1.44       0.31
    R3 / R1     1.77       0.25

Model:
    n small (n < L1):  R1 = R2 = R3 = S
    n large (n >> L2): R1 = 50S, R2 = 10S, R3 = S

               n small             n large
    E2 / E1    21/13 = 1.61        (15 + 6*10)/(10*1 + 3*50) = 0.46
    E3 / E1    31/13 = 2.38        31/(10*1 + 3*50) = 0.19

Test, Xeon:
               n = 100    n = 52m
    R2 / R1     1.28       0.43
    R3 / R1     1.74       0.18

Page 15: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


Conclusions

The effects of cache misses are real and show up in ordinary user algorithms when doing random access in large arrays.

We have demonstrated that radix3, which performs almost 3 times as many instructions as radix1, is 4-5 times as fast as radix1 for large n; that is, radix1 experiences a slowdown by a factor of 7-10 because of cache misses.

1. The number of instructions executed is no longer a good measure for the performance of an algorithm.

2. Algorithms should be rewritten so that random access in large data structures is removed.