data structures - lecture 8 - study notes




CHAPTER 4 Algorithm Analysis

Algorithms are designed to solve problems, but a given problem can have many different solutions. How then are we to determine which solution is the most efficient for a given problem? One approach is to measure the execution time. We can implement the solution by constructing a computer program, using a given programming language. We then execute the program and time it using a wall clock or the computer’s internal clock.

The execution time is dependent on several factors. First, the amount of data that must be processed directly affects the execution time. As the data set size increases, so does the execution time. Second, the execution times can vary depending on the type of hardware and the time of day a computer is used. If we use a multi-process, multi-user system to execute the program, the execution of other programs on the same machine can directly affect the execution time of our program. Finally, the choice of programming language and compiler used to implement an algorithm can also influence the execution time. Some compilers are better optimizers than others and some languages produce better optimized code than others. Thus, we need a method to analyze an algorithm’s efficiency independent of the implementation details.

4.1 Complexity Analysis

To determine the efficiency of an algorithm, we can examine the solution itself and measure those aspects of the algorithm that most critically affect its execution time. For example, we can count the number of logical comparisons, data interchanges, or arithmetic operations. Consider the following algorithm for computing the sum of each row of an n × n matrix and an overall sum of the entire matrix:

totalSum = 0                      # Version 1
for i in range( n ) :
    rowSum[i] = 0
    for j in range( n ) :
        rowSum[i] = rowSum[i] + matrix[i,j]
        totalSum = totalSum + matrix[i,j]



Suppose we want to analyze the algorithm based on the number of additions performed. In this example, there are only two addition operations, making this a simple task. The algorithm contains two loops, one nested inside the other. The inner loop is executed n times and since it contains the two addition operations, there are a total of 2n additions performed by the inner loop for each iteration of the outer loop. The outer loop is also performed n times, for a total of 2n² additions.

Can we improve upon this algorithm to reduce the total number of addition operations performed? Consider a new version of the algorithm in which the second addition is moved out of the inner loop and modified to sum the entries in the rowSum array instead of individual elements of the matrix.

totalSum = 0                      # Version 2
for i in range( n ) :
    rowSum[i] = 0
    for j in range( n ) :
        rowSum[i] = rowSum[i] + matrix[i,j]
    totalSum = totalSum + rowSum[i]

In this version, the inner loop is again executed n times, but this time, it only contains one addition operation. That gives a total of n additions for each iteration of the outer loop, but the outer loop now contains an addition operator of its own. To calculate the total number of additions for this version, we take the n additions performed by the inner loop and add one for the addition performed at the bottom of the outer loop. This gives n + 1 additions for each iteration of the outer loop, which is performed n times for a total of n² + n additions.

If we compare the two results, it’s obvious the number of additions in the second version is less than the first for any n greater than 1. Thus, the second version will execute faster than the first, but the difference in execution times will not be significant. The reason is that both algorithms execute on the same order of magnitude, namely n². Thus, as the size of n increases, both algorithms increase at approximately the same rate (though one is slightly better), as illustrated numerically in Table 4.1 and graphically in Figure 4.1.
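These counts are easy to confirm by experiment. The sketch below (not from the text) instruments both versions with an addition counter; it models matrix as a Python dict keyed by (i, j) tuples so that the book’s matrix[i,j] subscripting works unchanged.

```python
def version1_additions(n):
    # Version 1: two additions inside the inner loop.
    matrix = {(i, j): 1 for i in range(n) for j in range(n)}
    rowSum = [0] * n
    totalSum = 0
    adds = 0
    for i in range(n):
        rowSum[i] = 0
        for j in range(n):
            rowSum[i] = rowSum[i] + matrix[i, j]
            totalSum = totalSum + matrix[i, j]
            adds += 2
    return adds

def version2_additions(n):
    # Version 2: one addition inside the inner loop,
    # one at the bottom of the outer loop.
    matrix = {(i, j): 1 for i in range(n) for j in range(n)}
    rowSum = [0] * n
    totalSum = 0
    adds = 0
    for i in range(n):
        rowSum[i] = 0
        for j in range(n):
            rowSum[i] = rowSum[i] + matrix[i, j]
            adds += 1
        totalSum = totalSum + rowSum[i]
        adds += 1
    return adds

# For any n, the counts match 2n² and n² + n respectively.
assert version1_additions(10) == 200
assert version2_additions(10) == 110
```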

    n         2n²             n² + n
    10        200             110
    100       20,000          10,100
    1000      2,000,000       1,001,000
    10000     200,000,000     100,010,000
    100000    20,000,000,000  10,000,100,000

Table 4.1: Growth rate comparisons for different input sizes.

Figure 4.1: Graphical comparison of the growth rates from Table 4.1.

4.1.1 Big-O Notation

Instead of counting the precise number of operations or steps, computer scientists are more interested in classifying an algorithm based on the order of magnitude as applied to execution time or space requirements. This classification approximates the actual number of required steps for execution or the actual storage requirements in terms of variable-sized data sets. The term big-O, which is derived from the expression “on the order of,” is used to specify an algorithm’s classification.

Defining Big-O

Assume we have a function T(n) that represents the approximate number of steps required by an algorithm for an input of size n. For the second version of our algorithm in the previous section, this would be written as

T₂(n) = n² + n

Now, suppose there exists a function f(n) defined for the integers n ≥ 0, such that for some constant c, and some constant m,

T(n) ≤ c·f(n)

for all sufficiently large values of n ≥ m. Then, such an algorithm is said to have a time-complexity of, or executes on the order of, f(n) relative to the number of operations it requires. In other words, there is a positive integer m and a constant c (constant of proportionality) such that for all n ≥ m, T(n) ≤ c·f(n). The function f(n) indicates the rate of growth at which the run time of an algorithm increases as the input size, n, increases. To specify the time-complexity of an algorithm, which runs on the order of f(n), we use the notation

O( f(n) )

Consider the two versions of our algorithm from earlier. For version one, the time was computed to be T₁(n) = 2n². If we let c = 2, then

2n² ≤ 2n²

for a result of O(n²). For version two, we computed a time of T₂(n) = n² + n. Again, if we let c = 2, then

n² + n ≤ 2n²

for a result of O(n²). In this case, the choice of c comes from the observation that when n ≥ 1, we have n ≤ n² and n² + n ≤ n² + n², which satisfies the equation in the definition of big-O.

The function f(n) = n² is not the only choice for satisfying the condition T(n) ≤ c·f(n). We could have said the algorithms had a run time of O(n³) or O(n⁴) since 2n² ≤ n³ and 2n² ≤ n⁴ when n > 1. The objective, however, is to find a function f(·) that provides the tightest (lowest) upper bound or limit for the run time of an algorithm. The big-O notation is intended to indicate an algorithm’s efficiency for large values of n. There is usually little difference in the execution times of algorithms when n is small.
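The definition can be exercised numerically. A small illustrative check (not from the text) that c = 2 and m = 1 witness the bounds just described:

```python
def T2(n):
    # Step count for version two of the matrix-summing algorithm.
    return n**2 + n

c, m = 2, 1
# T2(n) <= c * n**2 for every n >= m, so f(n) = n**2 satisfies the definition.
assert all(T2(n) <= c * n**2 for n in range(m, 1000))

# Larger functions such as n**3 also bound T2 when n > 1,
# but they are not the tightest choice.
assert all(T2(n) <= n**3 for n in range(2, 1000))
```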

Constant of Proportionality

The constant of proportionality is only crucial when two algorithms have the same f(n). It usually makes no difference when comparing algorithms whose growth rates are of different magnitudes. Suppose we have two algorithms, L1 and L2, with run times equal to n² and 2n respectively. L1 has a time-complexity of O(n²) with c = 1 and L2 has a time of O(n) with c = 2. Even though L1 has a smaller constant of proportionality, L1 is still slower and, in fact, an order of magnitude slower, for large values of n. Thus, f(n) dominates the expression cf(n) and the run time performance of the algorithm. The differences between the run times of these two algorithms are shown numerically in Table 4.2 and graphically in Figure 4.2.
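A quick numeric check of this claim (the function names mirror the text’s L1 and L2):

```python
# Run times from the text: L1 takes n**2 steps (c = 1), L2 takes 2n steps (c = 2).
def L1(n): return n * n
def L2(n): return 2 * n

assert L1(2) == L2(2) == 4                            # crossover point
assert all(L1(n) > L2(n) for n in range(3, 10_000))   # n**2 dominates beyond it
```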

    n         n²              2n
    10        100             20
    100       10,000          200
    1000      1,000,000       2,000
    10000     100,000,000     20,000
    100000    10,000,000,000  200,000

Table 4.2: Numerical comparison of two sample algorithms.

Figure 4.2: Graphical comparison of the data from Table 4.2.

Constructing T(n)

Instead of counting the number of logical comparisons or arithmetic operations, we evaluate an algorithm by considering every operation. For simplicity, we assume that each basic operation or statement, at the abstract level, takes the same amount of time and, thus, each is assumed to cost constant time. The total number of operations required by an algorithm can be computed as a sum of the times required to perform each step:

T(n) = f₁(n) + f₂(n) + ... + fₖ(n)

The steps requiring constant time are generally omitted since they eventually become part of the constant of proportionality. Consider Figure 4.3(a), which shows a markup of version one of the algorithm from earlier. The basic operations are marked with a constant time while the loops are marked with the appropriate total number of iterations. Figure 4.3(b) shows the same algorithm but with the constant steps omitted since these operations are independent of the data set size.

totalSum = 0                                   # 1
for i in range( n ) :                          # n
    rowSum[i] = 0                              # 1
    for j in range( n ) :                      # n
        rowSum[i] = rowSum[i] + matrix[i,j]    # 1
        totalSum = totalSum + matrix[i,j]      # 1

(a)

for i in range( n ) : ...                      # n
    for j in range( n ) : ...                  # n

(b)

Figure 4.3: Markup for version one of the matrix summing algorithm: (a) shows all operations marked with the appropriate time and (b) shows only the non-constant time steps.

Choosing the Function

The function f(n) used to categorize a particular algorithm is chosen to be the dominant term within T(n). That is, the term that is so large for big values of n that we can ignore the other terms when computing a big-O value. For example, in the expression

n² + log₂ n + 3n

the term n² dominates the other terms since for n ≥ 3, we have

n² + log₂ n + 3n ≤ n² + n² + n²
n² + log₂ n + 3n ≤ 3n²

which leads to a time-complexity of O(n²). Now, consider the function T(n) = 2n² + 15n + 500 and assume it is the polynomial that represents the exact number of instructions required to execute some algorithm. For small values of n (less than 16), the constant value 500 dominates the function, but what happens as n gets larger, say 100,000? The term n² becomes the dominant term, with the other two becoming less significant in computing the final result.
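This behavior is easy to observe directly; the following snippet (illustrative only) evaluates the text’s T(n) = 2n² + 15n + 500 at small and large n:

```python
def T(n):
    # Exact instruction count from the text.
    return 2 * n**2 + 15 * n + 500

# For small n, the constant 500 still exceeds the quadratic term ...
assert 500 > 2 * 15**2      # 450 at n = 15
assert 500 < 2 * 16**2      # 512 at n = 16, where the quadratic term takes over

# ... while at n = 100,000 the n**2 term accounts for almost all of T(n).
n = 100_000
assert 2 * n**2 / T(n) > 0.999
```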

Classes of Algorithms

We will work with many different algorithms in this text, but most will have a time-complexity selected from among a common set of functions, which are listed in Table 4.3 and illustrated graphically in Figure 4.4.

Algorithms can be classified based on their big-O function. The various classes are commonly named based upon the dominant term.

    f(·)       Common Name
    1          constant
    log n      logarithmic
    n          linear
    n log n    log linear
    n²         quadratic
    n³         cubic
    aⁿ         exponential

Table 4.3: Common big-O functions listed from smallest to largest order of magnitude.

Figure 4.4: Growth rates of the common time-complexity functions.

A logarithmic algorithm is any algorithm whose time-complexity is O(logₐ n). These algorithms are generally very efficient since logₐ n will increase more slowly than n. For many problems encountered in computer science a will typically equal 2 and thus we use the notation log n to imply log₂ n. Logarithms of other bases will be explicitly stated. Polynomial algorithms with an efficiency expressed as a polynomial of the form

aₘnᵐ + aₘ₋₁nᵐ⁻¹ + ... + a₂n² + a₁n + a₀


are characterized by a time-complexity of O(nᵐ) since the dominant term is the highest power of n. The most common polynomial algorithms are linear (m = 1), quadratic (m = 2), and cubic (m = 3). An algorithm whose efficiency is characterized by a dominant term in the form aⁿ is called exponential. Exponential algorithms are among the worst algorithms in terms of time-complexity.

4.1.2 Evaluating Python Code

As indicated earlier, when evaluating the time complexity of an algorithm or code segment, we assume that basic operations only require constant time. But what exactly is a basic operation? The basic operations include statements and function calls whose execution time does not depend on the specific values of the data that is used or manipulated by the given instruction. For example, the assignment statement

x = 5

is a basic instruction since the time required to assign a reference to the given variable is independent of the value or type of object specified on the righthand side of the = sign. The evaluation of arithmetic and logical expressions

y = x
z = x + y * 6
done = x > 0 and x < 100

are basic instructions, again since they require the same number of steps to perform the given operations regardless of the values of their operands. The subscript operator, when used with Python’s sequence types (strings, tuples, and lists), is also a basic instruction.

Linear Time Examples

Now, consider the following assignment statement:

y = ex1(n)

An assignment statement only requires constant time, but that is the time required to perform the actual assignment and does not include the time required to execute any function calls used on the righthand side of the assignment statement.

To determine the run time of the previous statement, we must know the cost of the function call ex1(n). The time required by a function call is the time it takes to execute the given function. For example, consider the ex1() function, which computes the sum of the integer values in the range [0 ... n):

def ex1( n ):
    total = 0
    for i in range( n ) :
        total += i

    return total

NOTE: Efficiency of String Operations. Most of the string operations have a time-complexity that is proportional to the length of the string. For most problems that do not involve string processing, string operations seldom have an impact on the run time of an algorithm. Thus, in the text, we assume the string operations, including the use of the print() function, only require constant time, unless explicitly stated otherwise.

The time required to execute a loop depends on the number of iterations performed and the time needed to execute the loop body during each iteration. In this case, the loop will be executed n times and the loop body only requires constant time since it contains a single basic instruction. (Note that the underlying mechanism of the for loop and the range() function are both O(1).) We can compute the time required by the loop as T(n) = n · 1 for a result of O(n).

But what about the other statements in the function? The first line of the function and the return statement only require constant time. Remember, it’s common to omit the steps that only require constant time and instead focus on the critical operations, those that contribute to the overall time. In most instances, this means we can limit our evaluation to repetition and selection statements and function and method calls since those have the greatest impact on the overall time of an algorithm. Since the loop is the only non-constant step, the function ex1() has a run time of O(n). That means the statement y = ex1(n) from earlier requires linear time. Next, consider the following function, which includes two for loops:

def ex2( n ):
    count = 0
    for i in range( n ) :
        count += 1

    for j in range( n ) :
        count += 1

    return count

To evaluate the function, we have to determine the time required by each loop. The two loops each require O(n) time as they are just like the loop in function ex1() earlier. If we combine the times, it yields T(n) = n + n for a result of O(n).

Quadratic Time Examples

When presented with nested loops, such as in the following, the time required by the inner loop impacts the time of the outer loop.

def ex3( n ):
    count = 0
    for i in range( n ) :
        for j in range( n ) :
            count += 1

    return count


Both loops will be executed n times, but since the inner loop is nested inside the outer loop, the total time required by the outer loop will be T(n) = n · n, resulting in a time of O(n²) for the ex3() function. Not all nested loops result in a quadratic time. Consider the following function:

def ex4( n ):
    count = 0
    for i in range( n ) :
        for j in range( 25 ) :
            count += 1

    return count

which has a time-complexity of O(n). The function contains a nested loop, but the inner loop executes independently of the size variable n. Since the inner loop executes a constant number of times, it is a constant time operation. The outer loop executes n times, resulting in a linear run time. The next example presents a special case of nested loops:

def ex5( n ):
    count = 0
    for i in range( n ) :
        for j in range( i+1 ) :
            count += 1

    return count

How many times does the inner loop execute? It depends on the current iteration of the outer loop. On the first iteration of the outer loop, the inner loop will execute one time; on the second iteration, it executes two times; on the third iteration, it executes three times, and so on until the last iteration when the inner loop will execute n times. The time required to execute the outer loop will be the number of times the increment statement count += 1 is executed. Since the inner loop varies from 1 to n iterations by increments of 1, the total number of times the increment statement will be executed is equal to the sum of the first n positive integers:


T(n) = n(n + 1)/2 = (n² + n)/2

which results in a quadratic time of O(n²).
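The closed form can be checked against the function itself:

```python
def ex5(n):
    # ex5 from the text: the inner loop runs i+1 times on iteration i.
    count = 0
    for i in range(n):
        for j in range(i + 1):
            count += 1
    return count

# The increment runs 1 + 2 + ... + n = n(n + 1)/2 times.
for n in (1, 10, 100):
    assert ex5(n) == n * (n + 1) // 2
```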

Logarithmic Time Examples

The next example contains a single loop, but notice the change to the modification step. Instead of incrementing (or decrementing) by one, it cuts the loop variable in half each time through the loop.

def ex6( n ):
    count = 0
    i = n
    while i >= 1 :
        count += 1
        i = i // 2

    return count

To determine the run time of this function, we have to determine the number of loop iterations just like we did with the earlier examples. Since the loop variable is cut in half each time, this will be less than n. For example, if n equals 16, variable i will contain the following five values during subsequent iterations: 16, 8, 4, 2, 1.

Given a small number, it’s easy to determine the number of loop iterations. But how do we compute the number of iterations for any given value of n? When the size of the input is reduced by half in each subsequent iteration, the number of iterations required to reach a size of one will be equal to

⌊log₂ n⌋ + 1

or the largest integer less than or equal to log₂ n, plus 1. In our example of n = 16, there are log₂ 16 + 1, or five, iterations. The logarithm to base a of a number n, which is normally written as y = logₐ n, is the power to which a must be raised to equal n, that is, n = aʸ. Thus, function ex6() requires O(log n) time. Since many problems in computer science that repeatedly reduce the input size do so by half, it’s not uncommon to use log n to imply log₂ n when specifying the run time of an algorithm.
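The iteration count can be verified against the formula using Python’s math module:

```python
import math

def ex6(n):
    # ex6 from the text: i is halved on every pass through the loop.
    count = 0
    i = n
    while i >= 1:
        count += 1
        i = i // 2
    return count

# Halving i each pass means the loop runs floor(log2 n) + 1 times.
for n in (1, 16, 100, 1024, 10**6):
    assert ex6(n) == math.floor(math.log2(n)) + 1
```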

Finally, consider the following definition of function ex7(), which calls ex6() from within a loop. Since the loop is executed n times and function ex6() requires logarithmic time, ex7() will have a run time of O(n log n).

def ex7( n ):
    count = 0
    for i in range( n ) :
        count += ex6( n )

    return count

Different Cases

Some algorithms can have run times that are different orders of magnitude for different sets of inputs of the same size. These algorithms can be evaluated for their best, worst, and average cases. Algorithms that have different cases can typically be identified by the inclusion of an event-controlled loop or a conditional statement. Consider the following example, which traverses a list containing integer values to find the position of the first negative value. Note that for this problem, the input is the collection of n values contained in the list.

def findNeg( intList ):
    n = len(intList)
    for i in range( n ) :
        if intList[i] < 0 :
            return i
    return None


At first glance, it appears the loop will execute n times, where n is the size of the list. But notice the return statement inside the loop, which can cause it to terminate early. If the list does not contain a negative value,

L = [ 72, 4, 90, 56, 12, 67, 43, 17, 2, 86, 33 ]
p = findNeg( L )

the return statement inside the loop will not be executed and the loop will terminate in the normal fashion from having traversed all n elements. In this case, the function requires O(n) time. This is known as the worst case since the function must examine every value in the list, requiring the greatest number of steps. Now consider the case where the list contains a negative value in the first element:

L = [ -12, 50, 4, 67, 39, 22, 43, 2, 17, 28 ]
p = findNeg( L )

There will only be one iteration of the loop since the test of the condition by the if statement will be true the first time through and the return statement inside the loop will be executed. In this case, the findNeg() function only requires O(1) time. This is known as the best case since the function only has to examine the first value in the list, requiring the least number of steps.

The average case is evaluated for an expected data set or how we expect the algorithm to perform on average. For the findNeg() function, we would expect the search to iterate halfway through the list before finding the first negative value, which on average requires n/2 iterations. The average case is more difficult to evaluate because it’s not always readily apparent what constitutes the average case for a particular problem.

In general, we are more interested in the worst case time-complexity of an algorithm as it provides an upper bound over all possible inputs. In addition, we can compare the worst case run times of different implementations of an algorithm to determine which is the most efficient for any input.
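The best and worst cases can be observed by instrumenting findNeg() with a comparison counter (the counter and the tuple return value are illustrative additions, not part of the text’s function):

```python
def findNeg_counted(intList):
    # findNeg from the text, instrumented to count comparisons.
    comparisons = 0
    for i in range(len(intList)):
        comparisons += 1
        if intList[i] < 0:
            return i, comparisons
    return None, comparisons

# Worst case: no negative value, so all n elements are examined.
L = [72, 4, 90, 56, 12, 67, 43, 17, 2, 86, 33]
assert findNeg_counted(L) == (None, len(L))

# Best case: a negative value in the first position, one comparison.
L = [-12, 50, 4, 67, 39, 22, 43, 2, 17, 28]
assert findNeg_counted(L) == (0, 1)
```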

4.2 Evaluating the Python List

We defined several abstract data types for storing and using collections of data in the previous chapters. The next logical step is to analyze the operations of the various ADTs to determine their efficiency. The result of this analysis depends on the efficiency of the Python list since it was the primary data structure used to implement many of the earlier abstract data types.

The implementation details of the list were discussed in Chapter 2. In this section, we use those details and evaluate the efficiency of some of the more common operations. A summary of the worst case run times is shown in Table 4.4.


P1: JzG0521670152c04 CUNY656/McMillan Printer: cupusbw 0 521 67015 2 February 17, 2007 21:10

CHAPTER 4

Basic Searching Algorithms

Searching for data is a fundamental computer programming task and one that has been studied for many years. This chapter looks at just one aspect of the search problem—searching for a given value in a list (array).

There are two fundamental ways to search for data in a list: the sequential search and the binary search. Sequential search is used when the items in the list are in random order; binary search is used when the items are sorted in the list.

SEQUENTIAL SEARCHING

The most obvious type of search is to begin at the beginning of a set of records and move through each record until you find the record you are looking for or you come to the end of the records. This is called a sequential search.

A sequential search (also called a linear search) is very easy to implement. Start at the beginning of the array and compare each accessed array element to the value you’re searching for. If you find a match, the search is over. If you get to the end of the array without generating a match, then the value is not in the array.



Here is a function that performs a sequential search:

bool SeqSearch(int[] arr, int sValue) {
    for (int index = 0; index < arr.Length; index++)
        if (arr[index] == sValue)
            return true;

    return false;
}

If a match is found, the function immediately returns True and exits. If the end of the array is reached without the function returning True, then the value being searched for is not in the array and the function returns False.

Here is a program to test our implementation of a sequential search:

using System;
using System.IO;

public class Chapter4 {

    static void Main() {
        int[] numbers = new int[100];
        StreamReader numFile = File.OpenText("c:\\numbers.txt");
        for (int i = 0; i < numbers.Length; i++)
            numbers[i] = Convert.ToInt32(numFile.ReadLine(), 10);
        int searchNumber;
        Console.Write("Enter a number to search for: ");
        searchNumber = Convert.ToInt32(Console.ReadLine(), 10);
        bool found;
        found = SeqSearch(numbers, searchNumber);
        if (found)
            Console.WriteLine(searchNumber + " is in the array.");
        else
            Console.WriteLine(searchNumber + " is not in the array.");
    }

    static bool SeqSearch(int[] arr, int sValue) {
        for (int index = 0; index < arr.Length; index++)
            if (arr[index] == sValue)
                return true;

        return false;
    }
}

The program works by first reading in a set of data from a text file. The data consists of the first 100 integers, stored in the file in a partially random order. The program then prompts the user to enter a number to search for and calls the SeqSearch function to perform the search.

You can also write the sequential search function so that the function returns the position in the array where the searched-for value is found or a -1 if the value cannot be found. First, let’s look at the new function:

static int SeqSearch(int[] arr, int sValue) {
    for (int index = 0; index < arr.Length; index++)
        if (arr[index] == sValue)
            return index;

    return -1;
}

The following program uses this function:

using System;
using System.IO;

public class Chapter4 {

    static void Main() {
        int[] numbers = new int[100];
        StreamReader numFile = File.OpenText("c:\\numbers.txt");
        for (int i = 0; i < numbers.Length; i++)
            numbers[i] = Convert.ToInt32(numFile.ReadLine(), 10);
        int searchNumber;
        Console.Write("Enter a number to search for: ");
        searchNumber = Convert.ToInt32(Console.ReadLine(), 10);
        int foundAt;
        foundAt = SeqSearch(numbers, searchNumber);
        if (foundAt >= 0)
            Console.WriteLine(searchNumber + " is in the array at position " + foundAt);
        else
            Console.WriteLine(searchNumber + " is not in the array.");
    }

    static int SeqSearch(int[] arr, int sValue) {
        for (int index = 0; index < arr.Length; index++)
            if (arr[index] == sValue)
                return index;

        return -1;
    }
}

Searching for Minimum and Maximum Values

Computer programs are often asked to search an array (or other data structure) for minimum and maximum values. In an ordered array, searching for these values is a trivial task. Searching an unordered array, however, is a little more challenging.

Let’s start by looking at how to find the minimum value in an array. The algorithm is:

1. Assign the first element of the array to a variable as the minimum value.
2. Begin looping through the array, comparing each successive array element with the minimum value variable.
3. If the currently accessed array element is less than the minimum value, assign this element to the minimum value variable.
4. Continue until the last array element is accessed.
5. The minimum value is stored in the variable.


Let’s look at a function, FindMin, which implements this algorithm:

static int FindMin(int[] arr) {
    int min = arr[0];
    for (int i = 1; i < arr.Length; i++)
        if (arr[i] < min)
            min = arr[i];

    return min;
}

Notice that the array search starts at position 1 and not at position 0. The 0th position is assigned as the minimum value before the loop starts, so we can start making comparisons at position 1.

The algorithm for finding the maximum value in an array works in the same way. We assign the first array element to a variable that holds the maximum amount. Next we loop through the array, comparing each array element with the value stored in the variable, replacing the current value if the accessed value is greater. Here’s the code:

static int FindMax(int[] arr) {
    int max = arr[0];
    for (int i = 1; i < arr.Length; i++)
        if (arr[i] > max)
            max = arr[i];

    return max;
}

An alternative version of these two functions could return the position of the maximum or minimum value in the array rather than the actual value.

Making Sequential Search Faster: Self-Organizing Data

The fastest successful sequential searches occur when the data element being searched for is at the beginning of the data set. You can ensure that a successfully located data item is at the beginning of the data set by moving it there after it has been found.

The concept behind this strategy is that we can minimize search times by putting frequently searched-for items at the beginning of the data set.


Eventually, all the most frequently searched-for data items will be located at the beginning of the data set. This is an example of self-organization, in that the data set is organized not by the programmer before the program runs, but by the program while the program is running.

It makes sense to allow your data to organize in this way since the data being searched probably follows the "80–20" rule, meaning that 80% of the searches conducted on your data set are searching for 20% of the data in the data set. Self-organization will eventually put that 20% at the beginning of the data set, where a sequential search will find them quickly.

Probability distributions such as this are called Pareto distributions, named for Vilfredo Pareto, who discovered these distributions studying the spread of income and wealth in the late nineteenth century. See Knuth (1998, pp. 399–401) for more on probability distributions in data sets.

We can modify our SeqSearch method quite easily to include self-organization. Here's a first stab at the method:

static bool SeqSearch(int sValue) {
    for(int index = 0; index < arr.Length; index++)
        if (arr[index] == sValue) {
            swap(index, 0);   // move the found item to the front
            return true;
        }
    return false;
}

If the search is successful, the item found is swapped with the element at the beginning of the array using a swap function, shown as follows:

static void swap(int item1, int item2) {
    // item1 and item2 are positions in arr
    int temp = arr[item1];
    arr[item1] = arr[item2];
    arr[item2] = temp;
}

The problem with the SeqSearch method as we've modified it is that frequently accessed items might be moved around quite a bit during the course of many searches. We want to keep items that are moved to the front of the data set there, not moved farther back when a subsequent item farther down in the set is successfully located.

There are two ways we can achieve this goal. First, we can only swap found items if they are located away from the beginning of the data set. We only have to determine what is considered to be far enough back in the data set to warrant swapping. Following the "80–20" rule again, we can make a rule that a data item is relocated to the beginning of the data set only if its location is outside the first 20% of the items in the data set. Here's the code for this first rewrite:

static int SeqSearch(int sValue) {
    for(int index = 0; index < arr.Length; index++)
        if (arr[index] == sValue && index > (arr.Length * 0.2)) {
            swap(index, 0);   // relocate to the beginning of the data set
            return index;
        }
        else if (arr[index] == sValue)
            return index;
    return -1;
}

The If–Then statement is short-circuited because if the item isn't found in the data set, there's no reason to test to see where the index is in the data set.

The other way we can rewrite the SeqSearch method is to swap a found item with the element that precedes it in the data set. Using this method, which is similar to how data is sorted using the Bubble sort, the most frequently accessed items will eventually work their way up to the front of the data set. This technique also guarantees that if an item is already at the beginning of the data set, it won't move back down.

The code for this new version of SeqSearch is shown as follows:

static int SeqSearch(int sValue) {
    for(int index = 0; index < arr.Length; index++)
        if (arr[index] == sValue) {
            if (index > 0)
                swap(index, index-1);   // move the found item up one position
            return index;
        }
    return -1;
}


Either of these solutions will help your searches when, for whatever reason, you must keep your data set in an unordered sequence. In the next section, we will discuss a search algorithm that is more efficient than any of the sequential algorithms mentioned, but that only works on ordered data: the binary search.

Binary Search

When the records you are searching through are sorted into order, you can perform a more efficient search than the sequential search to find a value. This search is called a binary search.

To understand how a binary search works, imagine you are trying to guess a number between 1 and 100 chosen by a friend. For every guess you make, the friend tells you if you guessed the correct number, or if your guess is too high, or if your guess is too low. The best strategy then is to choose 50 as the first guess. If that guess is too high, you should then guess 25. If 50 is too low, you should guess 75. Each time you guess, you select a new midpoint by adjusting the lower range or the upper range of the numbers (depending on whether your guess is too high or too low), which becomes your next guess. As long as you follow that strategy, you will eventually guess the correct number. Figure 4.1 demonstrates how this works if the number to be chosen is 82.

We can implement this strategy as an algorithm, the binary search algorithm. To use this algorithm, we first need our data stored in order (ascending, preferably) in an array (though other data structures will work as well). The first steps in the algorithm are to set the lower and upper bounds of the search. At the beginning of the search, this means the lower and upper bounds of the array. Then, we calculate the midpoint of the array by adding the lower and upper bounds together and dividing by 2. The array element stored at this position is compared to the searched-for value. If they are the same, the value has been found and the algorithm stops. If the searched-for value is less than the midpoint value, a new upper bound is calculated by subtracting 1 from the midpoint. Otherwise, if the searched-for value is greater than the midpoint value, a new lower bound is calculated by adding 1 to the midpoint. The algorithm iterates until the lower bound exceeds the upper bound, which indicates the array has been completely searched. If this occurs, a -1 is returned, indicating that no element in the array holds the value being searched for.


Guessing game: the secret number is 82, chosen from the range 1 to 100.

First Guess: 50 (range 1–100). Answer: Too low.
Second Guess: 75 (range 51–100). Answer: Too low.
Third Guess: 88 (range 76–100). Answer: Too high.
Fourth Guess: 81 (range 76–87). Answer: Too low.
Fifth Guess: 84 (range 82–87). Answer: Too high.
Sixth Guess: 82 (range 82–83; the midpoint is 82.5, which is rounded down to 82). Answer: Correct.

FIGURE 4.1. A Binary Search Analogy.

Here’s the algorithm written as a C# function:

static int binSearch(int value) {
    int upperBound, lowerBound, mid;
    upperBound = arr.Length-1;
    lowerBound = 0;
    while(lowerBound <= upperBound) {
        mid = (upperBound + lowerBound) / 2;
        if (arr[mid] == value)
            return mid;
        else if (value < arr[mid])
            upperBound = mid - 1;
        else
            lowerBound = mid + 1;
    }
    return -1;
}

Here’s a program that uses the binary search method to search an array:

static void Main(string[] args) {
    Random random = new Random();
    CArray mynums = new CArray(9);
    for(int i = 0; i <= 9; i++)
        mynums.Insert(random.Next(100));
    mynums.SortArr();
    mynums.showArray();
    int position = mynums.binSearch(77);
    if (position >= 0) {
        Console.WriteLine("found item");
        mynums.showArray();
    }
    else
        Console.WriteLine("Not in the array");
    Console.Read();
}

A Recursive Binary Search Algorithm

Although the version of the binary search algorithm developed in the previous section is correct, it's not really a natural solution to the problem. The binary search algorithm is really a recursive algorithm because, by constantly subdividing the array until we find the item we're looking for (or run out of room in the array), each subdivision expresses the problem as a smaller version of the original problem. Viewing the problem this way leads us to discover a recursive algorithm for performing a binary search.

In order for a recursive binary search algorithm to work, we have to make some changes to the code. Let's take a look at the code first and then we'll discuss the changes we've made:

public int RbinSearch(int value, int lower, int upper) {
    if (lower > upper)
        return -1;
    else {
        int mid;
        mid = (upper + lower) / 2;
        if (value == arr[mid])
            return mid;
        else if (value < arr[mid])
            return RbinSearch(value, lower, mid-1);
        else
            return RbinSearch(value, mid+1, upper);
    }
}

The main problem with the recursive binary search algorithm, as compared to the iterative algorithm, is its efficiency. When a 1,000-element array is searched using both algorithms, the recursive algorithm is consistently about 10 times slower than the iterative algorithm.

Of course, recursive algorithms are often chosen for reasons other than efficiency, but you should keep in mind that anytime you implement a recursive algorithm, you should also look for an iterative solution so that you can compare the efficiency of the two algorithms.
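Such a comparison might be sketched with a small harness like the one below (the class, method names, data, and repetition count are our own, not from the chapter; absolute timings will vary by machine):

```csharp
using System;
using System.Diagnostics;

class SearchTiming {
    // Iterative binary search over a sorted array, as in the chapter,
    // but taking the array as a parameter so the methods are self-contained.
    internal static int BinSearchIter(int[] arr, int value) {
        int lower = 0, upper = arr.Length - 1;
        while (lower <= upper) {
            int mid = (upper + lower) / 2;
            if (arr[mid] == value) return mid;
            else if (value < arr[mid]) upper = mid - 1;
            else lower = mid + 1;
        }
        return -1;
    }

    // Recursive binary search over the same array.
    internal static int BinSearchRec(int[] arr, int value, int lower, int upper) {
        if (lower > upper) return -1;
        int mid = (upper + lower) / 2;
        if (arr[mid] == value) return mid;
        else if (value < arr[mid]) return BinSearchRec(arr, value, lower, mid - 1);
        else return BinSearchRec(arr, value, mid + 1, upper);
    }

    static void Main() {
        int[] arr = new int[1000];
        for (int i = 0; i < arr.Length; i++) arr[i] = i * 2;   // sorted data

        var sw = Stopwatch.StartNew();
        for (int rep = 0; rep < 100000; rep++) BinSearchIter(arr, 1468);
        sw.Stop();
        Console.WriteLine("iterative: " + sw.Elapsed);

        sw = Stopwatch.StartNew();
        for (int rep = 0; rep < 100000; rep++) BinSearchRec(arr, 1468, 0, arr.Length - 1);
        sw.Stop();
        Console.WriteLine("recursive: " + sw.Elapsed);
    }
}
```

Repeating each search many times, as here, makes the difference measurable; a single search completes too quickly for the Stopwatch to show a meaningful gap.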

Finally, before we leave the subject of binary search, we should mention that the Array class has a built-in binary search method. It takes two arguments, an array name and an item to search for, and it returns the position of the item in the array, or a negative number if the item can't be found.

To demonstrate how the method works, we've written yet another binary search method for our demonstration class. Here's the code:

public int Bsearch(int value) {
    return Array.BinarySearch(arr, value);
}

When the built-in binary search method is compared with our custom-built method, it consistently performs about 10 times faster than the custom-built method, which should not be surprising. A built-in data structure or algorithm should always be chosen over one that is custom-built, if the two can be used in exactly the same ways.
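A small standalone sketch of the built-in method's behavior (the array and search values are our own). Note that when the item is missing, Array.BinarySearch does not simply return -1: it returns the bitwise complement of the index where the item would be inserted, which is always negative.

```csharp
using System;

class BuiltInSearchDemo {
    static void Main() {
        int[] sorted = { 10, 20, 30, 40, 50 };

        // A value that is present: Array.BinarySearch returns its index.
        Console.WriteLine(Array.BinarySearch(sorted, 30));   // prints 2

        // A value that is absent: the return is negative -- the bitwise
        // complement of the index where the value would be inserted.
        int pos = Array.BinarySearch(sorted, 35);
        Console.WriteLine(pos);    // prints -4
        Console.WriteLine(~pos);   // prints 3, the insertion point
    }
}
```

Because any negative return means "not found," a caller should test `pos >= 0` rather than `pos == -1`.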

SUMMARY

Searching a data set for a value is a ubiquitous computational operation. The simplest method of searching a data set is to start at the beginning and search for the item until either the item is found or the end of the data set is reached. This searching method works best when the data set is relatively small and unordered.

If the data set is ordered, the binary search algorithm is a better choice. Binary search works by continually subdividing the data set until the item being searched for is found. You can write the binary search algorithm using either iterative or recursive code. The Array class in C# includes a built-in binary search method, which should be used whenever a binary search is called for.

EXERCISES

1. The sequential search algorithm will always find the first occurrence of an item in a data set. Create a new sequential search method that takes a second integer argument indicating which occurrence of an item you want to search for.

2. Write a sequential search method that finds the last occurrence of an item.

3. Run the binary search method on a set of unordered data. What happens?


4. Using the CArray class with the SeqSearch method and the BinSearch method, create an array of 1,000 random integers. Add a new private Integer data member named compCount that is initialized to 0. In each of the search algorithms, add a line of code right after the critical comparison is made that increments compCount by 1. Run both methods, searching for the same number, say 734, with each method. Compare the values of compCount after running both methods. What is the value of compCount for each method? Which method makes the fewest comparisons?