
Sorting and Searching

In the introductory example, we used a linear search to find a particular item in an array. Now let's start utilising what you learnt in CSCI103 – but with some revision of the material as well. Let's start by looking at three simple ways of sorting data.

1. Three Simple Sorts

Suppose we have an array of n integer values stored in the array X. Let's assume we want this array ordered from smallest to largest value.

1.1 Selection Sort

We start by scanning the array to find the smallest value. If this value is not already at position X[0], swap it with the contents of X[0]. We now have the smallest value in the correct position. Then the remainder of the array, i.e. X[1..n-1], is searched for the next smallest, which is swapped into X[1]. Proceed through the array.

After each pass through the array, one item is guaranteed to be in its correct position. So, after n-1 passes, each through a smaller portion of the array, we can guarantee the entire array is sorted. Here is an implementation.

    for (i = 0; i < n-1; i++)
    {
        smallest = i;                     // location of smallest so far
        for (j = i+1; j < n; j++)
            if (X[smallest] > X[j])
                smallest = j;             // new location of smallest
        if (smallest != i)                // swap if not already in position
        {
            temp = X[i];
            X[i] = X[smallest];
            X[smallest] = temp;
        }
    }

When we looked at the linear search earlier, we indicated that the number of checks needed indicated the efficiency of the method. Note here how there are two loops involved, one nested inside the other. The outer loop is performed n-1 times. The inner loop is performed a lesser number of times for each step of the outer loop: n-1 the first time (i=0), n-2 the next, then n-3, n-4, . . . down to once when i=n-2. So the comparison inside the inner loop is performed a total of

    (n-1) + (n-2) + (n-3) + . . . + 1 = n(n-1)/2 times.

We'll see in a later section how we use this value to compare efficiency. The number of exchanges needed is less than n.

1.2 Bubble Sort

This involves the same number of passes through the array as Selection Sort, but we move more values. On each pass, consecutive values are compared and swapped if in the wrong order. So the big values will 'bubble' up towards the end. (We are sorting into increasing order again.) In fact, after each pass, one extra value will be guaranteed to be in its correct position towards the end of the array.

So,

    for (i = n-1; i > 0; i--)
    {
        for (j = 0; j < i; j++)
            if (X[j] > X[j+1])
            {                              // swap X[j] and X[j+1]
                temp = X[j];
                X[j] = X[j+1];
                X[j+1] = temp;
            }
    }

This approach involves the same number of comparisons as Selection Sort but we do a lot more swaps. Note that, at each pass through the outer loop, the variable i is the last position in the array that is yet to be sorted. Let's modify the outer loop to a while loop, changing the variable name to reflect its meaning.

    LastUnsorted = n-1;
    while (LastUnsorted > 0)
    {
        for (j = 0; j < LastUnsorted; j++)
            if (X[j] > X[j+1])
            {                              // swap X[j] and X[j+1]
                temp = X[j];
                X[j] = X[j+1];
                X[j+1] = temp;
            }
        LastUnsorted--;
    }

We can further improve on this. If, during the inner loop, no swaps were needed after, say, LastSwapIndex, then the remaining entries from X[LastSwapIndex] to X[LastUnsorted-1], and hence to X[n-1], must already be ordered.

So,

    LastUnsorted = n-1;
    while (LastUnsorted > 0)
    {
        LastSwapIndex = 0;
        for (j = 0; j < LastUnsorted; j++)
        {
            if (X[j] > X[j+1])
            {                              // swap X[j] and X[j+1]
                temp = X[j];
                X[j] = X[j+1];
                X[j+1] = temp;
                LastSwapIndex = j;
            }
        }
        LastUnsorted = LastSwapIndex;
    }

This means Bubble Sort will (most likely) require a smaller number of comparisons than Selection Sort. But how does it perform as n increases? We'll see later.

1.3 Insertion Sort

This algorithm is essentially how a person might sort a set of values when given the items to sort one at a time. We start with one value (sorted). We then look at the next item to be added to the order, finding where it should be placed in relation to the number(s) that are already sorted. Let us again assume we want increasing order.

So, for each item to be added to the sorted set there are two steps:
• Find where the item goes.
• Put it there.

Given the array X[0..i-1] of already sorted items, how do we find where the value Item fits?

    Pos = 0;
    while (Pos < i && Item >= X[Pos])
        Pos++;

Upon completion, Pos holds the index at which the value Item must go: either at Pos = i (at the end) or in front of the current X[Pos].

If the former, then

    X[Pos] = Item;
    i++;

whereas, in the latter case, the array elements must be moved to accommodate the extra value

    for (j = i-1; j >= Pos; j--)
        X[j+1] = X[j];


and then

    X[Pos] = Item;
    i++;

We can actually combine the two steps by searching for the insertion point from the upper end, moving the array elements up as we go. Finally we have

    Pos = i-1;
    while (Pos >= 0 && Item < X[Pos])
    {
        X[Pos+1] = X[Pos];
        Pos--;
    }
    X[Pos+1] = Item;
    i++;

This algorithm can be extended to sorting an already occupied but unsorted array X, by starting with a one-long sorted subarray containing X[0]. Now we find where Item = X[1] fits in this sorted array. Then where Item = X[2] fits into the now sorted array of length 2. And so on. Thus, at any stage in the process, all entries so far inserted are in the correct order.
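Putting the fragments together, a complete insertion sort of an already-filled array might look like the following sketch (the function name InsertionSort is ours; the body is just the combined step above applied for i = 1, 2, . . ., n-1):

    void InsertionSort(int X[], int n)
    {
        for (int i = 1; i < n; i++)        // X[0..i-1] is already sorted
        {
            int Item = X[i];               // the next value to insert
            int Pos = i - 1;
            while (Pos >= 0 && Item < X[Pos])
            {
                X[Pos+1] = X[Pos];         // move larger values up one place
                Pos--;
            }
            X[Pos+1] = Item;               // drop the new value into the gap
        }
    }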

As for the number of comparisons required, on average, the value being inserted will be placed in the middle of the values already sorted. Thus, Insertion Sort requires half the comparisons of Selection Sort. So n(n-1)/4 comparisons will be needed – on average.

Thus we now have three sorting methods: Selection, Bubble and Insertion.

Which is the best? How can we compare their efficiency?

2. Efficiency and Proof

In the last section we talked about guarantees. That is, we had to be sure that, if we followed the procedure described, by the completion of the process the data would be sorted. Whenever an algorithm is designed, we must be certain that the algorithm works. Just because some test runs of a particular implementation produce the correct result is not a proof of correctness. The advantage of having a knowledge of standard algorithms – tried-and-tested methods – is that we can be assured that the procedure will work. This is the first criterion for a good algorithm.

The second criterion for judging an algorithm is efficiency. We want methods that perform well on our data. In particular, the efficiency of algorithms that are designed for manipulating data sets of a certain size can be described by how they perform as the size of the data set increases. When it comes to sorting algorithms, the number of comparisons which have to be made is a good indication of efficiency (although the number of exchanges of data items is also of importance). Let's look at Selection Sort first.

We stated above that Selection Sort involves n(n-1)/2 comparisons. For those with a mathematical mind, we say that such an expression is of the order of n², as the expression 0.5n² − 0.5n behaves like a constant times n² when n is large. (In actuality, we are saying that the number of comparisons is bounded above by a constant times n² for values of n beyond some particular integer.) The linear term is overpowered by the square, and is ignored. We write this as O(n²) – called big-O notation. In fact all three of the algorithms above are O(n²) sorts. But that doesn't mean that they are all equal. They differ in the constant multiplier – and whether this is an average or a constant performance. It does, however, mean that doubling the data set size will quadruple the time it takes to do the sort – on average. So how do we decide which of the above three is best?

We look at best and worst case performance. That is, are there initial arrangements of the data that make the algorithms perform better or worse than O(n²)?


For Selection Sort, it doesn't matter. Because the loops are of fixed lengths, the sort will always take exactly the same time. Obviously, if the data is already sorted, there will be no exchanges, but the comparisons will all take place.

For Bubble Sort, an already sorted data set will involve one pass through the data, with no exchanges taking place. The algorithm will detect that no further passes are necessary and so only n-1 comparisons are needed – O(n). If the data is completely reversed, the full number of comparisons will be needed, as each pass through the data will move one item from the front of the set to the back with no change in the relative positions of other entries. Thus the only thing we can say is that the worst case is O(n²), and that on average fewer than n(n-1)/2 comparisons will be needed.

For Insertion Sort, the best case is sorted data, where we need only compare each new entry to the end of the ordered part of the data to indicate that the new item goes where it already is. The worst case is completely reversed data, where we have to compare each new entry to all already-sorted data. For the average case, we previously stated that about half the already sorted values would need to be checked, so n(n-1)/4 comparisons would be needed.

Thus, Selection Sort is worst, with Bubble and Insertion Sort somewhat similar. The advantage of Insertion Sort is that extra data items can be inserted into an already sorted set efficiently.

Can we improve on O(n²)?

3. Binary Search

We indicated earlier that the searching of a database could be improved by having the data sorted by the key used to search. This would enable us to shorten the search for key values not present in the data. It also enables us to find a more efficient way of locating key values that are present in the data. Now that we have the concept of order, we can say that linear search is an O(n) process, as the number of comparisons needed behaves like a constant times the size of the set.

An improved method of locating a value in a set of values is binary search. Suppose we have a sorted set of n values in an array X[0..n-1]. We start by assuming that the value we are looking for, called key, is somewhere in the range 0 to n-1. We can reduce the size of the set that we are searching by inspecting the value in the middle of the set, namely X[(n-1)/2]. If n is odd, this is the middle value. If n is even, we truncate, leading to the data item just below centre.

If key is equal to this value, we've found it. If not, then we can eliminate (more than) half the set as a possible location. So, if key < X[(n-1)/2], we know key must be located in the lower half of the set, otherwise it is in the upper half.

Let us suppose that we know the value key should be between entry number L and entry number U. We check entry number M = (L+U)/2. If that entry is not key, then key < X[M] means key is in the range from L to M-1, so make U = M-1. Otherwise key is in the range L = M+1 to U.

How do we stop if key is not in the set? At some point we will set L or U so that L > U.
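As a sketch, the description translates into something like the following (BinarySearch is our name; it returns the index of key, or -1 when key is not present):

    int BinarySearch(const int X[], int n, int key)
    {
        int L = 0, U = n - 1;
        while (L <= U)                 // the range L..U is not yet empty
        {
            int M = (L + U) / 2;       // middle of the current range
            if (key == X[M])
                return M;              // found it
            else if (key < X[M])
                U = M - 1;             // key must be in the lower half
            else
                L = M + 1;             // key must be in the upper half
        }
        return -1;                     // L > U : key is not in the array
    }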

We know that the method will work (proof of correctness) as we reduce the search set by a factor of 2 at each step. Ultimately the size will be 1.

How efficient is it? We need to know how many comparisons will be needed. How many times can we halve a set of n values? That is, for what value of M is n = 2^M (approximately)? The answer is log₂n.

We say that binary search is O(log n). (What base the log is taken to doesn't matter (in discussing order) as we can show log_a(n) = c·log_b(n), and the constant is ignored in order notation. In terms of actual counts, log to base 2 is used.)

Thus we could improve Insertion Sort by using binary search to determine where in the already-sorted set the new entry should go. This would make the number of comparisons O(n log n). But the number of moves would not change. This indicates that we should not necessarily be swayed by just the number of comparisons.


If we could get the moves down to O(n log n) then we would improve the sort. Well, we can. But first let's consider a similar problem.

4. Reversing the contents of an array

Given an array X[0..n-1] in ascending order, how can we reverse the array so that it is in descending order?

Here's an implementation.

    Left = 0;
    Right = n-1;
    while (Left < Right)
    {
        // swap X[Left] and X[Right]
        temp = X[Left];
        X[Left] = X[Right];
        X[Right] = temp;
        Left++;
        Right--;
    }

So, for example

1 2 3 4 5 6    (Left at 1, Right at 6)

6 2 3 4 5 1    (Left at 2, Right at 5)

6 5 3 4 2 1    (Left at 3, Right at 4)

6 5 4 3 2 1    (Left and Right have now crossed, so we stop)

5. Quicksort

The aim of this method, developed by C.A.R. Hoare, is the partitioning of the array of values relative to a value called the pivot, so that the array looks like

    [ values ≤ pivot | pivot | values ≥ pivot ]

The pivot value is now in its correct position. Then, the same procedure is performed on the two sub-arrays. Continuing this process, we ultimately reach single-entry sub-arrays. As each step of the algorithm fixes the pivot, we guarantee the success of the process in at most n steps.

For example, consider the set of values 7 9 6 8 4 3 5

For the moment, let us choose the first value as the pivot, namely 7.

So, after partitioning we have 6 4 3 5 7 9 8

If we have values equal to the pivot, they can go into either sub-array.

Now we consider the left part 6 4 3 5

Partitioning using the 6 as pivot, we get 4 3 5 6, where the right part of this partitioning is null (empty). A further partitioning step on the left part of this partition would be required. The right part of the first partitioning would also need further partitioning.

How can we perform the partitioning? Again, let's choose the first entry as the pivot. Then we scan the rest of the array from both ends, looking for values which are at the wrong end. This is similar to the procedure we used to reverse an array above. Values equal to the pivot value will be considered to be at the wrong end. When two such elements are found, they are swapped, and the scan continued. When the two scans meet, the partitioning is complete.


Consider this larger example

54 13 56 95 21 94 31 69 46 19 55 65 72 81 73

For the pivot value, use the first entry, 54. Scanning from the second entry from the left for a value which should be in the right partition, namely a value ≥ 54, we reach the 56; then, scanning from the right for a value ≤ 54, we reach 19, so

54 13 56 95 21 94 31 69 46 19 55 65 72 81 73    (Left at 56, Right at 19)

We now swap the values indicated.

54 13 19 95 21 94 31 69 46 56 55 65 72 81 73    (Left at 19, Right at 56)

Continuing the right and left scans we get to

54 13 19 95 21 94 31 69 46 56 55 65 72 81 73    (Left at 95, Right at 46)

and, after swapping,

54 13 19 46 21 94 31 69 95 56 55 65 72 81 73    (Left at 46, Right at 95)

and then

54 13 19 46 21 94 31 69 95 56 55 65 72 81 73    (Left at 94, Right at 31)

which becomes, after swapping,

54 13 19 46 21 31 94 69 95 56 55 65 72 81 73    (Left at 31, Right at 94)

The next scan results in

54 13 19 46 21 31 94 69 95 56 55 65 72 81 73    (Left at 94, Right at 31)

when the Right value is no longer to the right of Left. Here we stop. Right now points at the rightmost value which is less than or equal to the pivot.

By exchanging this value with the pivot value, we get

31 13 19 46 21 54 94 69 95 56 55 65 72 81 73    (Right at 54)

Now all values to the left of Right are less than or equal to the pivot, while all values to the right of Right are greater than or equal to the pivot, and the pivot value is in its final position in the sorted list.

The two sub-arrays 31 13 19 46 21

and 94 69 95 56 55 65 72 81 73

can then be sorted using the same approach.

We aim to halve the array at each step, so that the number of comparisons is halved. In this way, comparisons approach O(n log n), just as in binary search. But we would expect, on average, only half the values to be on the wrong side of the pivot, so the number of exchanges would also be halved.

Let's develop the partitioner. First, a rough function.

    void Partition(int X[], int Left, int Right, int& Pivot)
    {
        // partition the array X[Left..Right] into
        // [Left..Pivot-1] and [Pivot+1..Right] where all values
        // X[Left..Pivot-1] <= X[Pivot] and
        // X[Pivot+1..Right] >= X[Pivot]
    }

Now let's refine the content.


    Select the pivot value
    WHILE the scans don't overlap DO
        Scan the array X[Left+1..Right] from
        both ends, exchanging values as needed
    END
    Position the pivot element

Selecting the pivot value entails

PValue = X[Left];

How would we do the scan? Suppose our last leftmost swap was at LIndex. Then

    do
    {
        LIndex++;
    } while (X[LIndex] < PValue);

or, using a while we get

    LIndex++;
    while (X[LIndex] < PValue)
        LIndex++;

Similarly, the right scan would be

    RIndex--;
    while (PValue < X[RIndex])
        RIndex--;

The scans will overlap when RIndex <= LIndex, and the scans would start at Left and Right+1, as we increment LIndex and decrement RIndex first.

Thus the outer code would look like this:

    LIndex = Left;
    RIndex = Right+1;
    while (LIndex < RIndex)
    {
        // scan left then right
        if (LIndex < RIndex)
            swap X[LIndex] and X[RIndex]
    }

After the scans meet, the point at which RIndex finishes is the rightmost value <= PValue, so we set the pivot to this point and exchange this value for the pivot value X[Left].

    Pivot = RIndex;
    swap X[Left] and X[RIndex]

So we end with the following function.


    void Partition(int X[], int Left, int Right, int& Pivot)
    {
        int LIndex, RIndex;
        int PValue, temp;

        PValue = X[Left];
        LIndex = Left;
        RIndex = Right+1;
        while (LIndex < RIndex)
        {
            LIndex++;
            while (X[LIndex] < PValue)        // Left Scan
                LIndex++;
            RIndex--;
            while (X[RIndex] > PValue)        // Right Scan
                RIndex--;
            if (LIndex < RIndex)
            {                                 // swap X[LIndex] and X[RIndex]
                temp = X[LIndex];
                X[LIndex] = X[RIndex];
                X[RIndex] = temp;
            }
        }
        Pivot = RIndex;
        temp = X[Left];                       // swap the pivot into position
        X[Left] = X[RIndex];
        X[RIndex] = temp;
    }

There is one problem with the above program:

What if all X[Left+1..Right] are less than PValue ?

Then the

    while (X[LIndex] < PValue)
        LIndex++;

will not terminate, unless we ensure LIndex never exceeds Right (or RIndex), or that there is a value X[Right+1] which is bigger than all the contents of the array. The former adds an extra comparison to the while loop, while the second requires an extra array entry, complicated by requiring a knowledge of the contents of the array. We'll use the latter.
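For completeness, the first alternative (guarding the scan) would look something like this sketch, at the cost of an extra test on every step:

    LIndex++;
    while (LIndex <= Right && X[LIndex] < PValue)   // stop at the end of the sub-array
        LIndex++;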

Once this partitioning has been done, the sub-arrays X[Left..Pivot-1] and X[Pivot+1..Right] need to be partitioned, if they are not single values (or, in fact, empty). As you have probably realised, this successive partitioning is ideally suited to recursion.

Let's now look at the Quicksort function.

    void Quicksort(int X[], int Left, int Right)
    {
        int Pivot;

        Partition(X,Left,Right,Pivot);
        Quicksort(X,Left,Pivot-1);
        Quicksort(X,Pivot+1,Right);
    }

This function is initially called as Quicksort(X,0,n-1).

Obviously, we have to account for sub-arrays of length one or empty.


So we finish with

    void Quicksort(int X[], int Left, int Right)
    {
        int Pivot;

        Partition(X,Left,Right,Pivot);
        if (Pivot-1 > Left)
            Quicksort(X,Left,Pivot-1);
        if (Right > Pivot+1)
            Quicksort(X,Pivot+1,Right);
    }

Now let's go back to the beginning. What is the purpose of the pivot? To divide the array into two approximately equal parts. Is the first element the most appropriate? Not exactly.

Suppose the array is already sorted:

1 2 3 4 5 6 7

Using the first element as the pivot, we end with

1 2 3 4 5 6 7    (Left at 2, Right at 1)

So the array is partitioned into a null array and 2 3 4 5 6 7.

The right partition is then partitioned into a null part and 3 4 5 6 7, and so on.

Thus the size of the partition is reduced by one on every pass, the same as the O(n²) sorts. This is the worst case for Quicksort.

A more carefully chosen pivot value can improve this behaviour. For already sorted data, the middle value X[Middle], where Middle = (Left+Right)/2, is the best, since this ensures the set is divided approximately equally. Thus, we first swap the middle value and the leftmost value, and then proceed as before.

If the data is random, the first element, last element, or even a random element is as good as the middle value.

We aim to partition the array into two equal parts each time. The value which does this is called the median. But to find the median, we need the data already sorted – catch 22. So we use a median estimate. A good estimate is the median of the three array elements X[Left], X[Middle] and X[Right].
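A sketch of the median-of-three selection (MedianOfThree is our name; it returns the index of the chosen element, which would then be swapped into X[Left] before partitioning as described above):

    int MedianOfThree(const int X[], int Left, int Right)
    {
        int Middle = (Left + Right) / 2;
        // return the index of the middle-valued one of X[Left], X[Middle], X[Right]
        if (X[Left] < X[Middle])
        {
            if (X[Middle] < X[Right]) return Middle;
            return (X[Left] < X[Right]) ? Right : Left;
        }
        else
        {
            if (X[Left] < X[Right]) return Left;
            return (X[Middle] < X[Right]) ? Right : Middle;
        }
    }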

For example, consider

54 13 56 95 21 94 31 69 46 19 55 65 72 81 73

There are 15 values, so the three values are the first, eighth and fifteenth, namely 54, 69 and 73. Thus the pivot value is 69.

So, after swapping the 69 for the leftmost value and starting the scans, we get

69 13 56 95 21 94 31 54 46 19 55 65 72 81 73    (Left at 95, Right at 65)

and then

69 13 56 65 21 94 31 54 46 19 55 95 72 81 73    (Left at 94, Right at 55)

and

69 13 56 65 21 55 31 54 46 19 94 95 72 81 73    (Right at 19, Left at 94 – the scans have crossed)


and finally

19 13 56 65 21 55 31 54 46 69 94 95 72 81 73    (Right at 69)

with a partition into 19 13 56 65 21 55 31 54 46 and 94 95 72 81 73.

So now we can have a look at the relative efficiency of the four sorting algorithms we have discussed.

6. A Final Sorting Comparison of Efficiency

Number of Comparisons

                 Average          Worst         Best
    Selection    n(n-1)/2         n(n-1)/2      n(n-1)/2
    Bubble       < n(n-1)/2       n(n-1)/2      n-1
    Insertion    n(n-1)/4         n(n-1)/2      n-1
    Quicksort    1.4n log₂n       n(n-1)/2      0.5n log₂n

For relatively short arrays, say n < 12, Insertion Sort is best. Often Quicksort is accelerated by changing to insertion when the length of a subarray is smaller than some number.
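That hybrid is often expressed along these lines (a sketch only: the cutoff of 12 is just the figure mentioned above, and InsertionSortRange is a hypothetical helper that insertion-sorts X[Left..Right] in place):

    void Partition(int X[], int Left, int Right, int& Pivot);   // as developed above
    void InsertionSortRange(int X[], int Left, int Right);      // hypothetical helper

    void HybridSort(int X[], int Left, int Right)
    {
        if (Right - Left < 12)                  // short sub-array: use insertion sort
            InsertionSortRange(X, Left, Right);
        else
        {
            int Pivot;
            Partition(X, Left, Right, Pivot);
            if (Pivot-1 > Left)
                HybridSort(X, Left, Pivot-1);
            if (Right > Pivot+1)
                HybridSort(X, Pivot+1, Right);
        }
    }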

7. A programmer's approach to efficiency analysis

OK. You've worked out your own algorithm to perform some task on a data set of size n. How can you determine its performance without looking at it theoretically? By testing.

Suppose we want to find the number of comparisons involved in a sort technique. All we need to do is replace all the comparisons of data items in the code with a call to a function to perform the comparison.

For example, consider the code for Quicksort. Here is where the comparisons take place:

    while (X[LIndex] < PValue)      // Left Scan
        LIndex++;
    RIndex--;
    while (X[RIndex] > PValue)      // Right Scan
        RIndex--;

We write a couple of functions

    static int NoComp = 0;    // keeps track of comparisons

    bool CompareData(int V1, int V2)
    {
        NoComp++;
        return (V1 < V2);
    }

    int HowMany()             // reports the total
    {
        return NoComp;
    }


and rewrite Quicksort so that

    while (CompareData(X[LIndex],PValue))    // Left Scan
        LIndex++;
    RIndex--;
    while (CompareData(PValue,X[RIndex]))    // Right Scan
        RIndex--;

We can then generate multiple random data sets of assorted sizes (n = 1000, 2000, 4000, ...) and report the number of comparisons each requires. We then check the ratios of the successive counts and compare them to the ratios for assorted order functions: when n doubles, an O(n) count doubles, an O(n²) count quadruples, and an O(n log n) count behaves as follows.

    n          125     250     500     1000    2000    4000
    n log₂n    871     1991    4483    9966    21932   47863
    (ratios)           2.29    2.25    2.22    2.20    2.18
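A minimal test driver along those lines might look like this sketch (the sizes, the use of rand() for the data, and the assumption that the instrumented Quicksort() and HowMany() above are compiled in another file are all just for illustration):

    #include <iostream>
    #include <cstdlib>

    void Quicksort(int X[], int Left, int Right);   // the instrumented sort (separate file)
    int  HowMany();                                 // total comparisons counted so far

    int main()
    {
        static int X[64000];
        int previous = 0;
        for (int n = 1000; n <= 64000; n *= 2)
        {
            for (int i = 0; i < n; i++)
                X[i] = std::rand();                 // a fresh random data set
            int before = HowMany();
            Quicksort(X, 0, n-1);
            int count = HowMany() - before;         // comparisons for this run only
            std::cout << n << "  " << count;
            if (previous > 0)
                std::cout << "  ratio " << static_cast<double>(count) / previous;
            std::cout << '\n';
            previous = count;
        }
        return 0;
    }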

The same checks can be made on exchanges or moves by replacing the swaps by a function call.

    static int NoSwaps = 0;

    void ExchangeData(int& V1, int& V2)
    {
        int temp = V1;

        NoSwaps++;
        V1 = V2;
        V2 = temp;
    }

    int ReportSwaps()
    {
        return NoSwaps;
    }

8. But What If?

8.1 We don't want our data in increasing order

Suppose the required order of the data in our array is decreasing, so that X[0] is the largest. We could then just replace all the comparisons between data items containing < with >, and vice versa.

However, we've already provided a simpler approach. When analysing our algorithms we created a function CompareData() which returned true if the first argument's value was less than the second. We can reinterpret the function so that the test is "Does the first argument's value appear before the second in the required order?" In that way, no changes need be made in the other functions.
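For example, to obtain decreasing order only the body of CompareData() changes (a minimal sketch, leaving out the comparison counter of the previous section):

    bool CompareData(int V1, int V2)
    {
        return (V1 > V2);   // "V1 comes before V2" now means V1 is the larger value
    }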

8.2 We don't want to sort integers?

All our discussions and implementations of sorting (and searching) algorithms involved integers. Not every database has an int as its key value. No worries. Provided we can compare them, any values can be sorted. All we have to do is replace the comparisons and swaps with those for the specific data type. Most basic C++ data types can be sorted. What about our own data types? structs?

Suppose we have an array of items of type DataType. All we need to be able to do is answer the question posed for the function CompareData() above. Thus the function can then return true by testing a particular key (or keys). For example, a database of student records may contain fields for family name and given names. When sorting the database into alphabetic order of name, we could compare family names first; if two persons had the same family name, they could be correctly ordered by then comparing given names. So Smith, John would come after Smith, Jane. This is called a multi-key sort.
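As a sketch, with a hypothetical StudentRecord type holding C-string fields, that multi-key comparison might look like this (the field names and sizes are only illustrative):

    #include <cstring>

    struct StudentRecord
    {
        char FamilyName[32];
        char GivenNames[32];
        // ... other fields
    };

    bool CompareData(StudentRecord V1, StudentRecord V2)
    {
        int cmp = std::strcmp(V1.FamilyName, V2.FamilyName);
        if (cmp != 0)
            return (cmp < 0);                                     // family names differ
        return (std::strcmp(V1.GivenNames, V2.GivenNames) < 0);   // tie-break on given names
    }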


We thus have two functions

    bool CompareData(DataType V1, DataType V2);

and

    void ExchangeData(DataType& V1, DataType& V2);

Although the contents of CompareData() may be quite complex to achieve the decision of whether the first argument precedes the second, the big problem comes in the second function. When sorting integers it is simple to exchange the values of two integer variables. But suppose the DataType being sorted occupies 200 bytes. Performing three assignments takes time. An alternative approach involves a form of indirect reference to the data items.

Suppose we have the data in an array

DataType X[SIZE];

We create a second array

int Index[SIZE];

which we set initially to the values 0, 1, 2, . . ., n-1. Now we can return to the original functions with integer arguments.

    bool CompareData(int IV1, int IV2)
    {
        // the arguments are two elements of the Index array and
        // this function compares X[IV1] to X[IV2], e.g.
        return (X[IV1] < X[IV2]);      // assuming the key supports <
    }

    void ExchangeData(int& IV1, int& IV2)
    {
        // exchanges IV1 and IV2, which are now two elements of the Index array
        int temp = IV1;
        IV1 = IV2;
        IV2 = temp;
    }

The comparison function must know about the data array X, but neither function needs the Index array itself. At the completion of the sort the original array has not been touched. The first value in the sorted order is found by indirect reference, X[Index[0]]. The last value is found at X[Index[n-1]].

Note now that, no matter what we are sorting, the references in the sort routines to the data are always as integers – the indexes in the array Index.

Let's now see how all these come together in a Quicksort implementation.

    bool CompareData(int,int);      // prototype for comparison
    void ExchangeData(int&,int&);   // prototype for swap

    void Quicksort(int Index[], int Left, int Right)
    {
        int Pivot;

        Partition(Index,Left,Right,Pivot);
        if (Pivot-1 > Left)
            Quicksort(Index,Left,Pivot-1);
        if (Right > Pivot+1)
            Quicksort(Index,Pivot+1,Right);
    }

    void Partition(int Index[], int Left, int Right, int& Pivot)
    {
        int LIndex, RIndex;
        int PIndex;

        PIndex = Index[Left];
        LIndex = Left;
        RIndex = Right+1;
        while (LIndex < RIndex)
        {
            LIndex++;
            while (CompareData(Index[LIndex],PIndex))   // Left Scan
                LIndex++;
            RIndex--;
            while (CompareData(PIndex,Index[RIndex]))   // Right Scan
                RIndex--;
            if (LIndex < RIndex)
                ExchangeData(Index[LIndex], Index[RIndex]);
        }
        Pivot = RIndex;
        ExchangeData(Index[Left], Index[RIndex]);
    }

These can be placed in one file. A second file would contain the exchange and comparison functions for the particular DataType we wish to sort.

There is a more general way of performing the indirect referencing that the Index array provides – the pointer. Unfortunately, because pointers to different data types are different types, we cannot write generic code as above – unless there was a generic pointer. We'll see this later.

8.3 Our program involves more than one sort

Many times you will find that your program will need to sort a particular data set into more than one different order. We might want to sort a set of student records by name, and by degree, for example. How can we use the same sorting function to sort different ways?

Our method of implementation described above requires a separate file containing the Comparison and Exchange routines. If our program required differing comparisons we could incorporate some form of switch to decide on the method of comparison.

    static int CompType = 0;

    void SetCompType(int type)
    {
        CompType = type;
    }

    bool CompareData(int IV1, int IV2)
    {
        switch (CompType)
        {
            case 0:     // first comparison method
                // compares X[IV1] to X[IV2] on the first method
                . . .
            case 1:     // second comparison method
                // compares X[IV1] to X[IV2] on the second method
                . . .
        }
    }

We could even keep more than one sorted set at the same time by using different Index arrays for each sort. The actual data set would remain in the same original order.

9. Another Search Technique: Hashing

All the techniques discussed so far involved comparing the given key to the keys in the data structure until the key matches a data key, or we convince ourselves that it isn't there.

We are also aware that if the items we are searching are, for example, consecutive letters of the alphabet, the easiest search is to place the keys into prescribed positions in an array, and index them using some numerical coding of the search key. In the case of the letters 'a'..'z', for example, we would use Key-'a' as the index into the array, which locates the key directly.


In a similar case, to create an inverted University phone book – look up a number and find who is at that number – we could use the fact that all the numbers are consecutive, or nearly so, to index the numbers directly.

But suppose we wanted to do the same for staff members' home phones. Here the numbers are merely, say, 2000 numbers spread over a range of hundreds of thousands.

We would then turn to our previous methods.

But there is an approach to searching which utilises this coding idea and results in better performance figures than binary search. In fact, this method can find a key successfully by examining only one or two data keys. The technique is called hashing.

Suppose we have n keys in our data structure. Hashing involves finding a function f(K) which converts the keys to some numeric value which can be used as the index into an array. We would like to have such a function yield the integers 0..n-1 with each K having a unique f(K). Unfortunately, such functions are practically impossible to find – even if they exist!

So we have to reduce our expectations for such a hashing function.

Let's suppose we can find a function f(K) which, for each K, yields a unique integer in the range 0..m-1 where m > n. Provided m is not too much larger than n, this would result in some wasted space (m-n locations) but would still yield the result of one comparison per search!

The ratio n/m is called the occupancy rate.

What if the search key did not exist?

Then either the hash function would yield a value not in the range 0..m-1, or the location indexed would either be empty or the key located there would not match – still only one comparison!

Unfortunately, such hashing functions are yet again not easy to find.

We finally must accept a function f(K) which yields an integer in the range 0..m-1 but not necessarily unique.

When the data is being inserted into this table and we find the value f(K) results in an already occupied location, we have a collision. One solution is to just place the data into the next available empty location. Then, when searching for a particular K, if f(K) yields a location which is not empty but contains a non-matching key, a sequential search of higher locations is made to find the required key.
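A sketch of that search with linear probing (Table, the EMPTY marker and the hash function f() are placeholders for whatever representation is actually used; the loop relies on the table having at least one empty location):

    const int EMPTY = -1;              // assumed marker for an unused location

    int f(int K);                      // the hash function, yielding 0..m-1 (assumed)

    int HashSearch(const int Table[], int m, int key)
    {
        int loc = f(key);
        while (Table[loc] != EMPTY && Table[loc] != key)
            loc = (loc + 1) % m;       // try the next location, wrapping around
        return (Table[loc] == key) ? loc : -1;   // -1 signals an unsuccessful search
    }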

But what if our search key does not exist, but f(K) generates a legitimate value?

If the location is empty, we have an unsuccessful search as required.

If it is occupied, we will find its key does not match the search key. This could mean two things.

• If the search key and the found key both have the same f(K), we must continue with the sequential search until we encounter an empty location.

• If the search key and the found key differ in f(K), then the found key may be the result of a collision lower in the array, and so the sequential search still has to take place.

Thus, we need empty locations to terminate unsuccessful searches.

An occupancy of more than 80% leads to a poor termination of unsuccessful searches.

Let's look at an example, and how we might find an f(K). Consider the 15 most frequent words in the English language. These are

THE, OF, AND, TO, A, IN, THAT, IS, I, IT, FOR, AS, WITH, WAS, HIS.


Let's take the letters of each word, and add their ordinal values together. We get

225, 149, 211, 163, 65, 151, 305, 156, 73, 157, 231, 148, 316, 235, 228

a unique set ranging from 65..316 – an occupancy rate of less than 6% – too low.

If we wanted an occupancy of 50% we would have 30 locations. So, suppose we reduce the numbers by dividing by 10 and subtracting 6. We then get

16, 8, 15, 10, 0, 9, 24, 9, 1, 9, 17, 8, 25, 17, 16

Note that we now have quite a few collisions, so this is not a good hash function. How about reducing the sums modulo 30?

15, 29, 1, 13, 5, 1, 5, 6, 13, 7, 21, 28, 16, 25, 18

Now we have only three collisions. So let's use this hash function:

    f(K) = (sum of ASCII codes) modulo 30
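In C++ that function might be sketched as follows (HashWord is our name; it assumes the words are stored as C strings of upper-case letters):

    int HashWord(const char* word, int tableSize)
    {
        int sum = 0;
        for (int i = 0; word[i] != '\0'; i++)
            sum += word[i];            // add up the character codes
        return sum % tableSize;        // e.g. tableSize = 30 here
    }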

This results in the following array:

          0     1     2     3     4     5     6     7     8     9
     0    -    AND   IN     -     -     A    IS    IT   THAT    -
    10    -     -     -    TO     I    THE  WITH    -   HIS     -
    20    -    FOR    -     -     -    WAS    -     -    AS    OF

The collisions were put in last, so that IN (f = 1) ends up in 2, THAT (f = 5) ends up in 8, and I (f = 13) ends up in 14.

Now only one comparison is needed for 12 of the keys, while the other three need 2, 4 and 2. This means an average of 1.33 comparisons!

But what about unsuccessful searches? Consider the next most frequent word, HE. f(HE) = (72+69) modulo 30 = 21. This is the same as f(FOR), so we would need to search further. The next location is empty, so the search is complete in 2 checks.

An alternative is to insert the words in order of frequency:

          0     1     2     3     4     5     6     7     8     9
     0    -    AND   IN     -     -     A   THAT   IS    IT     -
    10    -     -     -    TO     I    THE  WITH    -   HIS     -
    20    -    FOR    -     -     -    WAS    -     -    AS    OF

This time THAT occupies 6 because 5 is occupied. So when IS should go into 6 it ends up in 7, and IT goes into 8 when it should have gone into 7. Here the comparisons needed average the same value. But if searches occur according to the frequency of the words (which has not been specified here) we get better results for the latter insertion.


Any attempt to get a higher occupancy ratio would yield more collisions. But let's try for, say, 75%. That is, 20 locations. Use modulo 20. We get

5, 9, 11, 3, 5, 11, 5, 16, 13, 17, 11, 8, 16, 15, 8

yielding a table of 20 locations in which, once again, each colliding word is placed in the next available free location.

Now the average search is about 1.7 comparisons. The search for HE this time takes only 1 check: f(HE) = 141 modulo 20 = 1, and that location is empty. When searching for the next available empty space upon a collision, there is a wraparound from the end of the array to the beginning.

On average, the number of locations checked for a successful search is

    (1/2)(1 + 1/(1 − α))

and, for an unsuccessful search,

    (1/2)(1 + 1/(1 − α)²)

where α is the occupancy rate. That is, fewer than 3 for a successful search and 13 for an unsuccessful search at 80% occupancy.

It is considered that hashing as a search method is O(1).