A Memory-hierarchy Conscious and Self-tunable Sorting Library

To appear in 2004 International Symposium on Code Generation and Optimization (CGO’04)

Xiaoming Li, María Jesús Garzarán, and David Padua

University of Illinois at Urbana-Champaign

2

Motivation

Sorting
– Core operation in many applications, such as databases
– Well-understood symbolic computing problem

Library generators such as ATLAS and SPIRAL have used empirical search to adapt to
– Architectural features of the target machine
– Size of the input data

But the performance of sorting also depends on the distribution of the values to be sorted

3

Motivation: main difficulties in building a sorting library

1. Theoretical complexity is not sufficient to measure quality
• Cache effects, instructions executed

2. Performance depends on the characteristics of the input
• Amount and distribution of data to sort
• A single algorithm is not optimal for all possible input sets

4

Contributions

1. Identify the architectural and runtime factors that affect the performance of the sorting algorithms.

2. Use empirical search to identify the best shape and parameter values of a sorting algorithm.

3. Use machine learning and runtime adaptation to select the best sorting algorithm for a specific input set.

5

Contributions

[Figure: IBM Power3, sorting 12 M keys (32-bit integers). Execution time (cycles) versus standard deviation of the inputs.]

6

Outline
– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions

7

Sorting Algorithms

Our sorting library contains
– Quicksort
– CC-Radix
– Multiway Merge
– Insertion Sort (for small partitions)
– Sorting Networks (for small partitions)

8

Quicksort

Divide-and-conquer in-place sorting algorithm

Our implementation includes Sedgewick’s optimizations:
– Set guardians at both ends of the input array.
– Eliminate recursion.
– Correctly select the pivot.
– Use insertion sort for small partitions.
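As a rough illustration of these optimizations, the sketch below combines an explicit stack (no recursion), a median-of-three pivot, and an insertion-sort cutoff for small partitions. The names `SMALL`, `insertion_sort`, and `quicksort` are hypothetical, the cutoff value is an arbitrary placeholder, and median-of-three is only one assumed way to "correctly select the pivot"; this is not the library's actual code.

```python
SMALL = 16  # hypothetical cutoff; a tuned library would pick this value empirically


def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] (inclusive bounds) in place."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def quicksort(a):
    """Iterative quicksort with median-of-three pivot and a small-partition cutoff."""
    stack = [(0, len(a) - 1)]  # explicit stack replaces recursion
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= SMALL:
            insertion_sort(a, lo, hi)
            continue
        mid = (lo + hi) // 2
        # Median of three: afterwards a[lo] <= a[mid] <= a[hi], so the two
        # end elements also bound the scans below.
        if a[mid] < a[lo]:
            a[lo], a[mid] = a[mid], a[lo]
        if a[hi] < a[lo]:
            a[lo], a[hi] = a[hi], a[lo]
        if a[hi] < a[mid]:
            a[mid], a[hi] = a[hi], a[mid]
        pivot = a[mid]
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        stack.append((lo, j))
        stack.append((i, hi))
    return a
```

The list is sorted in place and also returned, so `quicksort(list(data))` leaves `data` untouched.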

9

Radix sort

Non-comparison algorithm

[Figure: one counting pass of radix sort, showing the vector to sort, the counter array, the accumulated counts (accum.), and the destination vector.]
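The counter, accumulation, and destination-vector steps can be sketched as a stable counting pass that is repeated once per digit. This is a textbook least-significant-digit radix sort for non-negative integers; the function names and the 8-bit digit width are assumptions for illustration.

```python
def counting_pass(src, shift, bits):
    """Stable counting sort of non-negative ints on one digit of `bits` bits."""
    radix = 1 << bits
    mask = radix - 1

    # 1. Counter: histogram of the digit values.
    counter = [0] * radix
    for key in src:
        counter[(key >> shift) & mask] += 1

    # 2. Accum.: exclusive prefix sum gives each digit's first output slot.
    accum = [0] * radix
    for d in range(1, radix):
        accum[d] = accum[d - 1] + counter[d - 1]

    # 3. Dest. vector: scatter the keys, preserving their relative order.
    dest = [0] * len(src)
    for key in src:
        d = (key >> shift) & mask
        dest[accum[d]] = key
        accum[d] += 1
    return dest


def radix_sort(keys, key_bits=32, bits=8):
    """LSD radix sort: one counting pass per digit, lowest digit first."""
    for shift in range(0, key_bits, bits):
        keys = counting_pass(keys, shift, bits)
    return keys
```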

10

CC-radix (Cache Conscious Radix Sort)

Tries to exploit data locality in caches
Based on radix sort (Jimenez and Larriba – UPC)

CC-radix(bucket)
  if fits in cache (bucket) then
    radix sort (bucket)
  else
    sub-buckets = Reverse sorting(bucket)
    for each sub-bucket in sub-buckets
      CC-radix(sub-bucket)
    endfor
  endif
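A minimal runnable sketch of this recursion follows. It assumes that "reverse sorting" means partitioning on the most significant remaining digit, and it models "fits in cache" with a hypothetical key-count limit `CACHE_KEYS`; both are assumptions, and all names are illustrative. Keys are taken to be non-negative integers below 2^32.

```python
CACHE_KEYS = 1024  # hypothetical: number of keys assumed to fit in cache


def lsd_radix_sort(keys, top_bit, bits=8):
    """Plain LSD radix sort over the low `top_bit` bits of each key."""
    mask = (1 << bits) - 1
    for shift in range(0, top_bit, bits):
        buckets = [[] for _ in range(1 << bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]
    return keys


def cc_radix(bucket, top_bit=32, bits=8):
    """Sort `bucket`, whose keys may differ only in their low `top_bit` bits."""
    if len(bucket) <= CACHE_KEYS or top_bit <= 0:
        return lsd_radix_sort(bucket, top_bit, bits)

    # Partition on the most significant remaining digit (assumed meaning of
    # "reverse sorting"), then handle each sub-bucket independently.
    shift = max(top_bit - bits, 0)
    width = top_bit - shift
    sub_buckets = [[] for _ in range(1 << width)]
    for k in bucket:
        sub_buckets[(k >> shift) & ((1 << width) - 1)].append(k)

    out = []
    for sub in sub_buckets:
        out.extend(cc_radix(sub, shift, bits))
    return out
```

Because the sub-buckets are visited in increasing digit order, concatenating their sorted contents yields the fully sorted output.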

11

Multiway Merge Sort

[Figure: p sorted subsets feeding a heap with 2*p - 1 nodes.]

This algorithm exploits data locality very efficiently
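The merge step can be sketched with Python's standard `heapq` module. Note that this is a plain binary heap holding one head element per subset, not the 2*p - 1 node structure shown on the slide, and sorting each subset with the built-in `sorted` is a stand-in; the function names and the default `p` are assumptions.

```python
import heapq


def multiway_merge(sorted_subsets):
    """Merge p sorted lists using a heap holding one head element per subset."""
    heap = [(s[0], idx, 0) for idx, s in enumerate(sorted_subsets) if s]
    heapq.heapify(heap)
    out = []
    while heap:
        value, idx, pos = heapq.heappop(heap)
        out.append(value)
        pos += 1
        if pos < len(sorted_subsets[idx]):
            # Refill the heap from the subset that just supplied the minimum.
            heapq.heappush(heap, (sorted_subsets[idx][pos], idx, pos))
    return out


def multiway_merge_sort(keys, p=4):
    """Split into about p subsets, sort each one, then merge them through the heap."""
    if not keys:
        return []
    size = -(-len(keys) // p)  # ceiling division
    subsets = [sorted(keys[i:i + size]) for i in range(0, len(keys), size)]
    return multiway_merge(subsets)
```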

12

Sorting algorithms for small partitions

Insertion sort
– Exploits locality in the cache line

Sorting networks
– Register blocking

13

Performance Comparison

[Figure: Pentium III Xeon, 16 M keys (float). Execution time (cycles) versus standard deviation (100 to 10,000,000) for Intel MKL and Quicksort.]

14

Outline
– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions

15

Factors that determine performance

Architectural Factors Considered
– Cache / TLB size
– Number of Registers
– Cache Line Size

Runtime Factors Considered
– Amount of data to Sort
– Distribution of the data

16

Architectural: Cache Size/TLB Size

Tiling: Partition the data into subsets that fit in the cache
– Quicksort
  • Using multiple pivots to tile
– CC-radix
  • Fit each partition into cache
  • The number of active partitions < TLB size
– Multiway Merge Sort
  • Fit the heap into cache
  • Fit sorted subsets into cache

17

Architectural: Number of Registers

For small partitions, sort in place using the processor registers
Optimizations like unrolling and scheduling can be applied

cmp&swap(r0,r1)
cmp&swap(r2,r3)
cmp&swap(r1,r2)
cmp&swap(r0,r3)
cmp&swap(r4,r5)
…..

cmp&swap(r0,r1)
cmp&swap(r2,r3)
cmp&swap(r4,r5)
cmp&swap(r1,r2)
cmp&swap(r0,r3)
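To show what a fixed compare-and-swap sequence looks like when it is complete, here is a sketch of the standard five-comparator sorting network for four values. It is not the exact sequence from the slide; the local variables stand in for registers, and `cmp_swap` and `sort4` are hypothetical names.

```python
def cmp_swap(x, y):
    """Return the pair in non-decreasing order."""
    return (x, y) if x <= y else (y, x)


def sort4(r0, r1, r2, r3):
    """Standard 5-comparator sorting network for four inputs."""
    # The first two comparators are independent of each other, as are the
    # next two, so a scheduler is free to interleave them.
    r0, r1 = cmp_swap(r0, r1)
    r2, r3 = cmp_swap(r2, r3)
    r0, r2 = cmp_swap(r0, r2)
    r1, r3 = cmp_swap(r1, r3)
    r1, r2 = cmp_swap(r1, r2)
    return r0, r1, r2, r3
```

The sequence of comparisons does not depend on the data, which is what makes unrolling and scheduling straightforward.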

18

Architectural: Cache Line Size

Fanout = Cache Line Size
Increase cache line utilization when accessing children nodes

[Figure: children nodes stored together in one cache line.]
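One way to read "Fanout = Cache Line Size" is that each heap node gets as many children as there are keys in a cache line, so that one line fetch brings in all of a node's children. The sketch below shows the index arithmetic for such an implicit d-ary heap stored in an array. This interpretation, the function names, and the 64-byte line in the usage note are all assumptions, and the sketch ignores the alignment padding a real layout would need.

```python
def fanout(cache_line_bytes, key_bytes):
    """Number of keys that fit in one cache line, used as the heap fanout."""
    return cache_line_bytes // key_bytes


def children(i, d):
    """Array indices of the d children of node i in an implicit d-ary heap."""
    first = d * i + 1
    return range(first, first + d)


def parent(i, d):
    """Array index of the parent of node i (for i > 0)."""
    return (i - 1) // d
```

For instance, a hypothetical 64-byte line with 4-byte keys gives `fanout(64, 4) == 16`, so the children of node 0 occupy indices 1 through 16.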

19

Runtime: Amount and Distribution Shape

[Figure: execution time (cycles) versus number of keys (millions).]

20

Runtime: Amount and Distribution Shape

[Figure: execution time (cycles) versus number of keys (millions).]

21

Runtime: Standard Deviation

[Figure: Pentium III Xeon, 16 M keys. Execution time (cycles) versus standard deviation of the keys.]

22


23

Library adaptation

Architectural Factors: handled by empirical search
– Cache / TLB size
– Number of Registers
– Cache Line Size

Runtime Factors: handled by machine learning and runtime adaptation
– Distribution shape of the data (does not matter)
– Amount of data to Sort
– Standard Deviation

24

The Library

Building the library (installation time)
– Empirical Search
– Learning Procedure
  • Use of training data

Running the library (runtime)
– Runtime Procedure

The learning procedure and the runtime procedure together form the runtime adaptation.

25

Runtime Adaptation: Learning Procedure

Goal function:
f: (N, E) → {Multiway Merge Sort, Quicksort, CC-radix}
N: amount of input data
E: the entropy vector

– Use N to choose between Multiway Merge or Quicksort
– Use the entropy and the Winnow algorithm to learn the best algorithm
  • Output: weight vector (w) and threshold (Ө)
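For readers unfamiliar with Winnow, the sketch below is the textbook Winnow2 update on 0/1 features: weights of active features are multiplied by a factor on a missed positive and divided by it on a false positive. The slides do not say how the library feeds entropy values to Winnow, so the binary features, the promotion factor `alpha`, the threshold choice n/2, and all names here are assumptions for illustration only.

```python
def winnow_train(samples, labels, alpha=2.0, epochs=10):
    """Textbook Winnow2 on 0/1 feature vectors; returns (weights, threshold)."""
    n = len(samples[0])
    w = [1.0] * n
    theta = n / 2.0  # a common choice of threshold; kept fixed during training
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            s = sum(wi * xi for wi, xi in zip(w, x))
            predicted = 1 if s >= theta else 0
            if predicted == 0 and y == 1:
                # Promotion: boost the weights of the active features.
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
            elif predicted == 1 and y == 0:
                # Demotion: shrink the weights of the active features.
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w, theta


def winnow_predict(w, theta, x):
    """Return 1 if the weighted sum reaches the threshold, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
```

The returned pair plays the role of the weight vector and threshold that the learning procedure outputs.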

26

Runtime Adaptation: Runtime Procedure

Sample the input array
Compute the entropy vector
Compute S = ∑_i w_i * entropy_i

If S ≥ Ө
  choose CC-radix
else
  choose others
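The runtime procedure can be sketched as follows. The slides do not define the entropy vector, so this sketch assumes one Shannon entropy value per 8-bit digit of the sampled keys; the sample size, the fixed random seed, and the function names are likewise assumptions, and the weights and threshold are expected to come from a prior learning step.

```python
import math
import random
from collections import Counter


def entropy_vector(sample, key_bits=32, bits=8):
    """Shannon entropy (in bits) of each `bits`-wide digit across the sample."""
    mask = (1 << bits) - 1
    n = len(sample)
    vec = []
    for shift in range(0, key_bits, bits):
        counts = Counter((k >> shift) & mask for k in sample)
        vec.append(-sum(c / n * math.log2(c / n) for c in counts.values()))
    return vec


def choose_algorithm(keys, w, theta, sample_size=1024, seed=0):
    """Sample the input, compute S = sum(w_i * entropy_i), compare with theta."""
    rng = random.Random(seed)
    sample = keys if len(keys) <= sample_size else rng.sample(keys, sample_size)
    e = entropy_vector(sample)
    s = sum(wi * ei for wi, ei in zip(w, e))
    return "CC-radix" if s >= theta else "others"
```

Only the sample is inspected, so the cost of the decision stays small relative to sorting the full array.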

27


28

Experimental Setup

Test Platforms:
– SGI R12000: 300 MHz; L1I/D=32KB; L2 = 4MB
– UltraSparcIII: 750 MHz; L1I/D=32KB, 64KB; L2 = 8MB
– PentiumIII Xeon: 550 MHz; L1I/D=16KB; L2 = 512KB
– IBM Power3: 375 MHz; L1I/D=64KB; L2 = 8MB

29

Sun UltraSparcIII: 12 M keys

[Figure: execution time (cycles per key).]


30

IBM Power3: 12 M Keys

[Figure: execution time (cycles per key).]


31

Conclusions

Identify the architectural and runtime factors

Use empirical search to find the best parameter values

Our machine learning techniques prove to be quite effective:
– Always select the best algorithm.
– The wrong decision introduces a 37% average performance degradation
– Overhead (average 5%, worst case 7%)

32

Future Work

1. Search in the space of sorting algorithms using high-level primitives

2. Extend sorting to include more data types

3. Include other comparison strategies (for example, less than to sort vectors, graphs, …)

4. Parallel algorithms

5. Explore other database operations, such as join.

33

Empirical Search

Adaptation to the architecture of the machine
– Quicksort and CC-radix
  • The best configuration does not change significantly with the characteristics of the input data set.
  • Quicksort, CC-Radix:
    - Use of insertion sort/sorting networks for small partitions
    - Threshold to use them
  • CC-radix
    - Size of the radix
– Multiway Merge Sort
  • The best configuration changes with the amount and the distribution of the input data.
  • The best values will be searched during the learning procedure.
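In its simplest form, empirical search means measuring each candidate parameter value on the target machine and keeping the cheapest one. The sketch below shows that loop with a pluggable cost function; `empirical_search`, `timed`, and the candidate values in the usage note are hypothetical, and a real search would repeat each measurement and may prune the candidate space.

```python
import time


def empirical_search(candidates, measure):
    """Return (best_candidate, best_cost) for the lowest measured cost."""
    best, best_cost = None, float("inf")
    for c in candidates:
        cost = measure(c)
        if cost < best_cost:
            best, best_cost = c, cost
    return best, best_cost


def timed(sort_with_param, data):
    """Build a cost function that times one run of a parameterized sort."""
    def measure(param):
        work = list(data)  # fresh copy so each run sees the same input
        start = time.perf_counter()
        sort_with_param(work, param)
        return time.perf_counter() - start
    return measure
```

A call such as `empirical_search([4, 8, 16, 32], timed(my_sort, data))`, where `my_sort(array, threshold)` is a hypothetical routine, would return the threshold with the shortest measured run.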

35

Multiway Merge Sort

[Figure: worked example of four sorted runs being merged through a heap.]

36

Empirical Search

Example: Multiway Merge
• Search for the heap size that obtains the best performance:
  - Different amounts of data and standard deviations
