A FORTRAN 77 program for computing percentiles of large data sets


Computers & Geosciences Vol. 9, No. 3, pp. 281-295, 1983. 0098-3004/83 $3.00 + .00. Printed in Great Britain. Pergamon Press Ltd.

A FORTRAN 77 PROGRAM FOR COMPUTING PERCENTILES OF LARGE DATA SETS†

J. A. HOWELL Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

(Received 20 January 1982; revised 28 October 1982)

Abstract--An algorithm for computing percentiles of large sets of data is described. The algorithm first samples the data and obtains an estimate for the percentile. Then, using the estimate, a subset of the original data is extracted through which the algorithm searches for the true percentile.

Key Words: Percentile, Large data set.

INTRODUCTION

Hardware technological advances of the past few years have led to easy, inexpensive, and efficient storage of large sets of data. The data are stored on a wide variety of media including disks, tapes, and cartridges. Accompanying these developments has been a data explosion that has resulted in the accumulation of large quantities of geological, astronomical, meteorological, and other data. Analysis techniques for these large quantities of data have lagged far behind the hardware storage developments. Although computers have powerful computing capacities, they still are limited to a finite amount of storage and a finite number of computations per second. Software written specifically for large quantities of data is not usually available. Indeed, the cost for performing such computations with existing software may be prohibitively large.

In locating percentiles, computations may be too expensive or too time consuming to perform on large data sets. One way to locate the Tth largest number in a set of numbers is to sort the data and count down to the Tth element. Sorting, even in its most efficient form, still takes time proportional to n log(n). If it takes one second to sort a list of 1000 numbers, then sorting a list of 1,000,000 numbers would take half an hour, and 100,000,000 numbers would take more than three days. This is assuming, of course, that the computer has enough memory and auxiliary storage to perform the sort.
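The timing figures above follow from scaling the one-second reference by n log(n). A minimal sketch of that arithmetic, in Python rather than the paper's FORTRAN 77, is given here; the function name `scaled_sort_time` and its arguments are illustrative assumptions, not part of the published program.

```python
import math

def scaled_sort_time(n, n_ref=1000, t_ref=1.0):
    """Estimated sort time in seconds, assuming time grows as n*log(n).

    n_ref and t_ref are the reference case from the text:
    1000 numbers sorted in one second.
    """
    return t_ref * (n * math.log(n)) / (n_ref * math.log(n_ref))

for n in (1_000_000, 100_000_000):
    seconds = scaled_sort_time(n)
    print(f"n = {n:,}: {seconds / 60:.1f} minutes, {seconds / 86400:.2f} days")
```

Under this assumption one million numbers take about 2000 seconds (roughly half an hour), and one hundred million take a little over three days, in agreement with the text.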

One approach to this problem is through an efficient sorting device that is designed to overlap input/output time with computation time (Chen, Lum, and Tung, 1978). Another approach is to use a conventional sorting algorithm with a virtual memory operating system. This allows the user to run programs that are larger than the actual memory available but with some overhead. If one does not have a computer with a virtual memory, then conventional storage becomes the limiting factor. Still another approach is to obtain an approximation to the percentile (Weide, 1978). Such approximations may be obtained efficiently and cheaply, but are not satisfactory if one must have an exact answer.

†Work performed under the auspices of the U.S. Department of Energy.

The approach to the percentile problem described here, which does not involve sorting the entire set of numbers, has been used successfully on the aerial radiometric data collected by the U.S. Department of Energy (DOE) in its National Uranium Resource Evaluation (NURE) Project. A computer program to compute percentiles has been developed, and the algorithm will be described as it is implemented in this program.

THE PERCENTILE ALGORITHM

A. Sampling for an estimate

One obvious method for computing percentiles in large data sets is to locate percentiles in a random sample from the data. This method produces an estimate of the true percentile. The method described here produces the true percentile. It locates the Tth largest number in a set of N numbers, where N may be very large.

A large data set is defined as one with enough points that analysis by conventional methods is either very difficult or impossible. The number of points may range from thousands to millions or more. Although the algorithm described here works for a smaller number of points, when the number is less than about 5000, it is more efficient to sort the entire set. This cutoff depends on the memory size involved.

There are several parameters (Table 1) involved in the algorithm that the user may vary to meet the particular needs of the problem. They are described in more detail in the text and are illustrated in Figure 1.

Figure 1. Outline of the algorithm.

1. Sample the data. [Diagram: sampled points, marked X, spaced along the data from position 1 to position N.]
2. Sort the sample. [Diagram: the sorted sample from smallest (position 1) to largest (position KSAMP), with the Pth percentile bracketed by CUTOFL and CUTOFH.]
3. Locate the Pth percentile in the sample.
4. Locate cutoffs at -(100*PCNTL)% and +(100*PCNTH)% in the sample.
5. Use CUTOFL and CUTOFH to extract data from the N numbers. Extract all points x such that CUTOFL <= x <= CUTOFH. [Diagram: the data between CUTOFL and CUTOFH are extracted for the search; I points lie above CUTOFH.]
6. Search the extracted data for the tth largest point using the Blum algorithm.

Table 1. Algorithm parameters

Parameter   Description
N           An integer; the number of input data points.
T           An integer; we compute the Tth largest number from the large data set.
P           P = 100*(N - T)/N; P is a percentile.
Y           An array of length KSAMP of data points that have been sampled from the large data set.
KSAMP       An integer; the length of the array Y.
PCNTL       A number from 0.00 to 1.00; used to construct a window in the large data set.
PCNTH       A number from 0.00 to 1.00; used to construct a window in the large data set.
INDEXL      An integer index into the array Y.
INDEXH      An integer index into the array Y.
CUTOFL      Y(INDEXL); defines the lower bound for the data to be extracted.
CUTOFH      Y(INDEXH); defines the upper bound for the data to be extracted.
I           An integer; the number of data points larger than CUTOFH.
ICTR        An integer; the number of data points extracted.
KBUF        An integer; the size of the buffer area used to move the large data set in and out of memory (typically 5000 to 10,000).

The algorithm that is recommended here for use on large data sets is a composite algorithm consisting of a variation of one by Floyd and Rivest (1975) and an algorithm by Blum and others (1973). The first step is to predict the Tth largest number or the Pth percentile, where P = 100*(N - T)/N. The prediction is made by taking a random sample of size KSAMP from the original N data points. For convenience, this could be a set of equally spaced points. The Pth percentile in this sample is used as an estimate to determine the Pth percentile in the original data. KSAMP should be small enough to be able to sort and obtain the Pth percentile in the sample, and it should be large enough so that the Pth percentile of the sample approximates the Pth percentile of the large data set. Obviously, the size of KSAMP depends on the data. A sample of size 1000 works for many problems where N is approximately 150,000. It is easy to imagine a data set of such a character that one must pick a sample of size 5000 to get a reasonable representation.
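The sampling step can be illustrated with a short Python sketch (Python is used here for brevity; the published program is in FORTRAN 77). The names `equally_spaced_sample` and `sample_percentile`, and the 0-based index convention, are assumptions made for this illustration and do not reproduce the subroutine SAMP.

```python
def equally_spaced_sample(data, ksamp):
    """Return about ksamp points taken at a fixed stride from data."""
    inc = max(len(data) // ksamp, 1)   # stride between sampled points
    return data[::inc][:ksamp]

def sample_percentile(sample, p):
    """Estimate the p-th percentile (0 <= p < 100) from a sample."""
    y = sorted(sample)
    idx = min(int(p / 100 * len(y)), len(y) - 1)   # 0-based index into y
    return y[idx]

# Usage: estimate the median of 100,000 points from a 1000-point sample.
data = list(range(100_000))
sample = equally_spaced_sample(data, 1000)
estimate = sample_percentile(sample, 50)
```

The value obtained this way is only an estimate; the remainder of the algorithm uses it to bracket the true percentile.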

In the program described here, the array containing the sample is called Y. Assuming that Y is sorted from smallest to largest, the indices INDEXL and INDEXH are computed as

INDEXL = max((N - T)/N*KSAMP - PCNTL*KSAMP, 1)

and

INDEXH = min((N - T)/N*KSAMP + PCNTH*KSAMP, KSAMP).

These indices mark positions (100*PCNTL)% below and (100*PCNTH)% above the estimated percentile.

For example, if one selects a sample of 1000 points from a large set of 100,000 data points, and wants to find the 50,000th largest number (the 50th percentile), then

N = 100000,

T = 50000,

KSAMP = 1000, and

P = 50.

Let PCNTL = PCNTH = 0.05; then INDEXL and INDEXH are, respectively, 450 and 550. The max and min are included to prevent the indices from pointing outside of the array Y, that is, to exclude nonpositive indices or indices larger than KSAMP.

Next, cutoff values are computed using the indices:

CUTOFL = Y(INDEXL)

and

CUTOFH = Y(INDEXH).

These values are used as cutoffs on the original large data set to extract some data that is then searched for the true percentile.

In this example, the number Y(500) is the estimate for the 50th percentile for the large data set. The upper and lower cutoffs are Y(550) and Y(450), which we use to extract data from the large data set. Section B gives details of this extraction and location of the true percentile.
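The index arithmetic of this example can be checked with a few lines of Python. This is a sketch under the assumption that fractional positions are truncated toward zero, as IFIX does in the appendix listing; the function name `window_indices` is hypothetical.

```python
def window_indices(n, t, ksamp, pcntl, pcnth):
    """Return 1-based (INDEXL, INDEXH) into the sorted sample Y."""
    centre = (n - t) / n * ksamp                       # estimated position of the percentile
    indexl = max(int(centre - pcntl * ksamp), 1)       # never below the first element
    indexh = min(int(centre + pcnth * ksamp), ksamp)   # never past the last element
    return indexl, indexh

print(window_indices(100_000, 50_000, 1000, 0.05, 0.05))   # prints (450, 550)
```

With the sorted sample held in a 0-based Python list `y`, the cutoffs would then be `y[indexl - 1]` and `y[indexh - 1]`.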

In order to be reasonably sure that the Pth percentile for the large data set lies in the interval (CUTOFL, CUTOFH), some care must be exercised in the selection of PCNTL and PCNTH. These values can be chosen by constructing a distribution-free confidence interval for the Pth percentile in the sample (David, 1970). If PR is the probability that the percentile P of a set of size N lies in the interval (CUTOFL, CUTOFH), then

\[ PR = \sum_{i=\mathrm{INDEXL}}^{\mathrm{INDEXH}-1} \binom{N}{i} P^{i} (1 - P)^{N-i}. \]

Using the approximation,

\[ PR \approx \frac{1}{\sqrt{2\pi}} \int_{a}^{b} e^{-t^{2}/2}\, dt, \]

where

a = (INDEXL - N*P)/SQRT(N*P*(1 - P))

and

b = (INDEXH - 1 - N*P)/SQRT(N*P*(1 - P)),

a table of values (Table 2) is constructed to use for PCNTL and PCNTH for the case KSAMP = 1000.

Floyd and Rivest (1975) describe a recursive procedure for selection of PCNTL and PCNTH. However, the simpler method described here seems preferable.
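The normal approximation can be evaluated directly. The Python sketch below assumes that N in the formulas for a and b is the size of the sorted sample (KSAMP) and that P is expressed as a fraction; under that reading it reproduces the PR column of Table 2 to two decimals. The function names and the `TABLE2` list are illustrative, not part of the published program.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def coverage_probability(ksamp, p, indexl, indexh):
    """Normal approximation to PR for a window (INDEXL, INDEXH).

    Assumption: the set size in the formulas is the sample size ksamp
    and p is a fraction between 0 and 1.
    """
    sd = math.sqrt(ksamp * p * (1.0 - p))
    a = (indexl - ksamp * p) / sd
    b = (indexh - 1 - ksamp * p) / sd
    return normal_cdf(b) - normal_cdf(a)

# (P, INDEXL, INDEXH, PR) rows taken from Table 2, KSAMP = 1000.
TABLE2 = [
    (0.50, 468, 532, 0.95), (0.75, 720, 775, 0.95), (0.90, 881, 919, 0.95),
    (0.95, 936, 964, 0.95), (0.98, 970, 989, 0.95), (0.99, 984, 998, 0.96),
]
for p, lo, hi, pr in TABLE2:
    print(p, lo, hi, round(coverage_probability(1000, p, lo, hi), 3), pr)
```

A user with a different KSAMP could widen or narrow the window until the computed probability reaches the desired level.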

B. Locating the true percentile

From the original data, all data points that are greater than or equal to CUTOFL and also less than or equal to CUTOFH are extracted. There are considerably fewer data points in this extracted data set than in the original set. If the sample of size KSAMP closely resembles the original large data set in distribution, then the extracted data probably will contain the true percentile. That is, the interval (CUTOFL, CUTOFH) contains the Pth percentile of a large data set. One then can use this smaller set to begin a search for the true percentile.

The number T is adjusted by the number of points in the original data set that were larger than CUTOFH. That is, if I is the number of points larger than CUTOFH, then let t = T - I and search for the tth largest number in the extracted data set. If the extracted data set is still too large to fit in memory, the process of extraction can be applied again until the number of data points becomes manageable, say, 5000-10000. Figure 1 outlines the algorithm to this point.
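The extraction and the adjustment of T can be sketched in Python as follows. This is an in-memory illustration only; the published program streams the data through scratch files in KBUF-word blocks. The function name `extract_window` is hypothetical, and the test `larger >= t` is an assumption of this sketch, chosen so that the adjusted rank t is always at least 1.

```python
def extract_window(data, t, cutofl, cutofh):
    """Return (extracted, t_adjusted) for the t-th largest element of data.

    extracted holds every x with cutofl <= x <= cutofh, and
    t_adjusted = t - I, where I counts the points larger than cutofh.
    """
    extracted = [x for x in data if cutofl <= x <= cutofh]
    larger = sum(1 for x in data if x > cutofh)   # I in the text
    if larger >= t:
        raise ValueError("t-th largest lies above CUTOFH; increase PCNTH")
    if larger + len(extracted) < t:
        raise ValueError("t-th largest lies below CUTOFL; increase PCNTL")
    return extracted, t - larger

# Usage: the 10th largest of 1..100 is 91; the window [80, 95] contains it.
subset, t_adj = extract_window(list(range(1, 101)), 10, 80, 95)
answer = sorted(subset, reverse=True)[t_adj - 1]
```

Because only the extracted subset is searched, the final selection works on far fewer points than the original N.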

The Blum algorithm as applied to the extracted data is summarized briefly here. More complete discussions of this algorithm are given in Aho, Hopcroft, and Ullman (1974) and Knuth (1973).

First, arrange the data into 2q + 1 rows of 7 elements each where q is some integer. After sorting and determining the median of each row, find m, the median of the 2q + 1 medians. If the rows are rearranged so that the medians are sorted, one can see that the data has been partitioned into three groups as shown in Figure 2. Group A consists of data known to be less than m, whereas Group B contains data known to be greater than m. The remaining data are unknown in their relation to m. By making 4q additional comparisons, one can determine which of these unknown elements are to be grouped with A or B.

The data set now is partitioned into two sets, one of which contains the t th element. One simply must repeat the process on one of these two sets. Because this set is determined easily, the problem can be replaced with a new, smaller one.
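The repeated partitioning can be sketched in Python as a median-of-medians selection with rows of 7. This is a simplified illustration, not a transcription of subroutine BLUM: it partitions every element against m instead of exploiting Groups A and B, it falls back to sorting below an arbitrary size `small`, and all names are assumptions of the sketch.

```python
def select_kth_largest(data, t, group=7, small=50):
    """Return the t-th largest element of data (t = 1 is the maximum)."""
    items = list(data)
    while True:
        if len(items) <= small:
            # Small sets are simply sorted, as the program does below LT points.
            return sorted(items, reverse=True)[t - 1]
        # Median of each row of `group` elements (the last row may be short).
        medians = []
        for i in range(0, len(items), group):
            row = sorted(items[i:i + group])
            medians.append(row[len(row) // 2])
        # m is the median of the row medians.
        m = select_kth_largest(medians, (len(medians) + 1) // 2, group, small)
        greater = [x for x in items if x > m]
        equal = sum(1 for x in items if x == m)
        if t <= len(greater):
            items = greater                      # answer is among the larger values
        elif t <= len(greater) + equal:
            return m                             # m itself is the t-th largest
        else:
            t -= len(greater) + equal            # discard m and everything above it
            items = [x for x in items if x < m]
```

Each pass discards m and at least the elements on one side of it, so the problem shrinks until the remaining set can be sorted directly.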

Table 2. Suggested window values

KSAMP   P      INDEXL   INDEXH   PR     PCNTL   PCNTH
1000    0.50   468      532      0.95   0.032   0.032
1000    0.75   720      775      0.95   0.030   0.025
1000    0.90   881      919      0.95   0.019   0.019
1000    0.95   936      964      0.95   0.014   0.014
1000    0.98   970      989      0.95   0.010   0.009
1000    0.99   984      998      0.96   0.006   0.008

Figure 2. The Blum algorithm. [Diagram: rows and medians ordered from small to large, with Group A among the small values and Group B among the large values.]

LIMITATIONS OF THE ALGORITHM

A. The character of the data

If the data are known to be periodic, care must be taken to choose KSAMP so that the interval between the sampled points does not coincide with the period of the data. A good example of periodic data is the aerial radiometric data collected as part of the NURE project. NURE is a project of the DOE's Grand Junction Office to acquire and compile geologic and other information with which to assess the magnitude and distribution of uranium resources and to determine areas favorable for the occurrence of uranium in the United States. Under this project, parts of the coterminous United States and Alaska have been surveyed. The projects have been flown on the basis of the National Topographic Map Series (NTMS) 1:250,000-scale quadrangles; flight lines and tie lines in east-west or north-south directions only, at a nominal altitude of 400 ft, the flight line spacings ranging from 1 to 12 miles. Data are recorded every one second and consist of multichannel observations in the γ-ray portion of the spectrum from which the contribution of uranium, thorium, and potassium to the total activity can be estimated. The length of a flight line within a quadrangle is such that there are often 2600 to 3000 data points per line and 20-30 flight lines per quadrangle.

If KSAMP were chosen so that these data are sampled every 3000 points, the sampled data would lie in a nearly straight line on the map and, therefore, would not give a good representation of the data on the rest of the map.

One easily obtained piece of information is a histogram of the sample. This would give some indication of the character of the data in the sample that one can compare with knowledge one might have of the character of the large data set.

B. Error conditions

Several error conditions are signaled in the program by an error message. One of these is a missed true percentile in the extracted data. That is, I > T or I + ICTR < T. At this point, the user can increase KSAMP in the hope of getting a more representative sample or increase PCNTL or PCNTH. It is easy to determine by how much the percentile has been missed, and in which direction it was missed. The program included here gives this information.

EXAMPLE

In this section, results obtained using the program are presented. The underlined portions are those parts typed by the user, and the rest is output from the program. The data used in this example were collected by Geometrics, Inc. (1978) over that portion of the Rawlins 1:250,000 NTMS quadrangle surveyed with rotary wing aircraft. This sample interactive terminal session shows the computation of the 95th percentile for 214Bi.

INPUT FORMAT FOR READING DATA
(I6, 40X, I4, 18X, F9.1)
N = 56039
INPUT T
2802
INPUT PCNTL AND PCNTH
0.014, 0.014
NO. OF DATA POINTS IN SEARCH GROUP = 1313
THE 2802 TH LARGEST NUMBER IS 0.3530E+02
THERE ARE 2787 DATA POINTS LARGER THAN T-TH
AND 27 DATA POINTS EQUAL TO T-TH

As an example of how one can display percentiles, Figure 3 shows those areas in which the 214Bi values are greater than or equal to the 95th percentile. Figure 4 displays the same information, but as a surface plot. All values less than the 95th percentile have been set to 0.0, with the remaining values displayed as peaks.

The following example shows the computation of the 99th percentile. Figures 5 and 6 display values above this percentile and are similar to Figures 3 and 4.

INPUT FORMAT FOR READING DATA
(I6, 40X, I4, 18X, F9.1)
N = 56039
INPUT T
560
INPUT PCNTL AND PCNTH
0.006, 0.006
NO. OF DATA POINTS IN SEARCH GROUP = 1004
THE 560 TH LARGEST NUMBER IS 0.5600E+02
THERE ARE 559 DATA POINTS LARGER THAN T-TH
AND 3 DATA POINTS EQUAL TO T-TH
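The two counts printed at the end of each session allow the answer to be verified: a value is the Tth largest exactly when fewer than T points are larger than it and at least T points are larger than or equal to it. A minimal Python sketch of that check follows; the function name `check_rank` is an assumption for illustration.

```python
def check_rank(data, t, candidate):
    """Return (larger, equal, ok) for a candidate t-th largest value."""
    larger = sum(1 for x in data if x > candidate)
    equal = sum(1 for x in data if x == candidate)
    # The candidate occupies ranks larger+1 .. larger+equal.
    ok = larger < t <= larger + equal
    return larger, equal, ok

# The counts reported in the sessions above satisfy the same condition.
assert 2787 < 2802 <= 2787 + 27
assert 559 < 560 <= 559 + 3
```

In the first session, for example, the value 35.30 occupies ranks 2788 through 2814, which includes T = 2802.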

Figure 3. Spatial distribution of bismuth data above the 95th percentile, Rawlins quadrangle. [Map of plotted points; horizontal axis: longitude, 107.74 to 105.99.]

Figure 4. Surface plot of bismuth data above the 95th percentile, Rawlins quadrangle.

Figure 5. Spatial distribution of bismuth data above the 99th percentile, Rawlins quadrangle. [Map of plotted points; horizontal axis: longitude, 107.74 to 105.99.]

Figure 6. Surface plot of bismuth data above the 99th percentile, Rawlins quadrangle.

Acknowledgments--I wish to thank M. E. Johnson for his comments and F. L. Pirkle for his encouragement and interaction.

REFERENCES

Aho, A. V., Hopcroft, J. E., and Ullman, J. D., 1974, The design and analysis of computer algorithms: Addison-Wesley, Reading, MA, p. 97-99.

Blum, M., Floyd, R. W., Pratt, V., Rivest, R. L., and Tarjan, R. E., 1973, Time bounds for selection: Jour. Computer and System Sci., v. 7, p. 448-461.

Chen, T. C., Lum, V. Y., and Tung, C., 1978, The rebound sorter: an efficient sort engine for large files: Proc. of Fourth International Conference on Very Large Data Bases, p. 312-318.

David, H. A., 1970, Order statistics: John Wiley, New York, p. 13-15.

Dobkin, D., and Munro, J. I., 1981, Optimal time minimal space selection algorithms: Jour. ACM, v. 28, p. 454-461.

Floyd, R. W., and Rivest, R. L., 1975, Expected time bounds for selection: Comm. ACM, v. 18, p. 165-173.

Geometrics, Inc., 1978, Aerial gamma ray and magnetic survey Rock Springs, Rawlins, and Cheyenne quadrangles, Wyoming, and the Greeley quadrangle, Colorado: US Department of Energy, GJBX-17(79), Open-File Report, v. 1.

Knuth, D. E., 1973, The art of computer programming, vol. III: Addison-Wesley, Reading, MA, p. 216-217.

Nicholson, W. L., 1979, Workshop III, analysis of large data sets: Proc. of 1979 DOE Stat. Symposium, Oak Ridge National Laboratory report CONF-791016, p. 197-199.

Weide, B. W., 1978, Space-efficient on-line selection algorithms, in Gallant, R. A., and Grieg, T. M., eds., Proceedings of Computer Science and Statistics 11th Annual Symposium on the Interface: N. Carolina State Univ., p. 308-311.

APPENDIX

The program contained here was written in FORTRAN 77. Most of the new features of this version of FORTRAN have been avoided in the interest of portability. The user may wish to change the manner in which the data are entered. This code is contained in the subroutine READAT and also in the first few lines of the program. The beginning code contains a free-format read which may not be available in some versions of FORTRAN.

The program's DIMENSION statements need not be changed for different size applications since the data are buffered into temporary storage. Some of the arrays may seem to have unusual sizes (e.g., 10003, 8001). This is because of the need to have arrays whose lengths are a multiple of 7 for the Blum algorithm.

Computer program

      PROGRAM PCT
      DIMENSION X(10003)
      COMMON /BUF/ Y(24000)
      DIMENSION LL(8000), JJ(8000), IFMT(8)
      EQUIVALENCE (JJ(1),Y(8001)), (LL(1),Y(16001))
      INTEGER T
      DATA KBLUM,LT,ACC,KBUF,KSAMP / 10000,1000,1.E+7,10000,1000 /
C
C     THIS PROGRAM COMPUTES PERCENTILES FOR A SET OF DATA.
C
C     FILES USED:
C        FILE 1      - THE LARGE DATA SET
C        FILES 2,4,7 - BINARY SCRATCH FILES
C        FILE 5      - INPUT FILE (TERMINAL)
C        FILE 6      - OUTPUT FILE (TERMINAL)
C
C     CONSTANTS:
C        KBLUM - MINIMUM NUMBER OF POINTS REQUIRED TO SWITCH TO
C                BLUM ALGORITHM; MUST BE <= 10000.
C        LT    - CUTOFF FOR SWITCHING TO SORTING.
C        ACC   - A DATA DEPENDENT CONSTANT; LARGER THAN THE
C                ABSOLUTE VALUE OF ANY DATA POINT.
C        KBUF  - BUFFER SIZE FOR DATA STORED ON SCRATCH FILES;
C                MUST BE <= 10000.
C        KSAMP - SAMPLE SIZE; MUST BE <= 1000 UNLESS THE ARRAYS
C                OF SIZE 1000 IN SAMP ARE CHANGED.
C
      WRITE (6,230)
      READ (5,240) (IFMT(I),I=1,8)
C     READ INPUT DATA FROM FILE 1 TO FILE 2.
      CALL READAT (1,2,N,KBUF,IFMT)
      WRITE (6,150) N
      WRITE (6,160)
      READ *, T
      ISAVET=T
      ISAVEN=N
      WRITE (6,210)
      READ *, PCNTL,PCNTH
   10 IUNIT=2
      IF (N.GT.KBLUM) GO TO 20
      ICTR=N
      GO TO 40
C     PICK A KSAMP-WORD SAMPLE AND COMPUTE UPPER AND
C     LOWER CUTOFFS.
   20 JUNIT=2
   30 CALL SAMP (JUNIT,N,T,PCNTL,PCNTH,KSAMP,IR,IQ,CUTOFL,CUTOFH,KBUF,
     +ACC)
C     EXTRACT DATA WHICH IS BETWEEN LOWER AND UPPER
C     CUTOFFS.  THIS DATA IS TO BE SEARCHED FOR THE
C     PERCENTILE.
      CALL XTRACT (JUNIT,4,CUTOFL,CUTOFH,N,IR,IQ,ICTR,ISPOT,KBUF)
      N=ICTR
      IF (ISPOT.GT.T) GO TO 120
      IF (ISPOT+ICTR.LT.T) GO TO 130
      T=T-ISPOT
      WRITE (6,170) N
      IUNIT=4
      JUNIT=7
      IF (N.LE.KBLUM) GO TO 40
      CALL SAMP (4,N,T,PCNTL,PCNTH,KSAMP,IR,IQ,CUTOFL,CUTOFH,KBUF,ACC)
      CALL XTRACT (4,JUNIT,CUTOFL,CUTOFH,N,IR,IQ,ICTR,ISPOT,KBUF)
      N=ICTR
      IF (ISPOT.GT.T) GO TO 120
      IF (ISPOT+N.LT.T) GO TO 130
      T=T-ISPOT
      WRITE (6,170) N
      IUNIT=7
      IF (N.LE.KBLUM) GO TO 40
      GO TO 30
   40 READ (IUNIT) (X(I),I=1,ICTR)
C     DATA IS NOW UNDER 10000 LONG.
      CALL BLUM (ICTR,X,T,AT,LT,KBUF,ACC)
      T=ISAVET
      N=ISAVEN
      WRITE (6,180) T,AT
      IR=MOD(N,KBUF)
      IQ=(N-IR)/KBUF
      REWIND 2
      ICTR=0
      JCTR=0
      IF (IQ.EQ.0) GO TO 80
C     CHECK ANSWER.
      DO 70 K=1,IQ
         READ (2) (X(J),J=1,KBUF)
         DO 60 I=1,KBUF
            IF (X(I).LE.AT) GO TO 50
            ICTR=ICTR+1
            GO TO 60
   50       IF (X(I).LT.AT) GO TO 60
            JCTR=JCTR+1
   60    CONTINUE
   70 CONTINUE
      IF (IR.EQ.0) GO TO 110
   80 READ (2) (X(J),J=1,IR)
      DO 100 I=1,IR
         IF (X(I).LE.AT) GO TO 90
         ICTR=ICTR+1
         GO TO 100
   90    IF (X(I).LT.AT) GO TO 100
         JCTR=JCTR+1
  100 CONTINUE
  110 WRITE (6,190) ICTR,JCTR
      REWIND 2
      GO TO 140
  120 WRITE (6,200)
      GO TO 140
  130 WRITE (6,220)
  140 CONTINUE
C
  150 FORMAT (' N=',I6)
  160 FORMAT (' INPUT T')
  170 FORMAT (' NO. OF DATA POINTS IN SEARCH GROUP =',I5)
  180 FORMAT (' THE ',I7,'TH LARGEST NUMBER IS ',E10.4)
  190 FORMAT (' THERE ARE ',I7,' DATA POINTS LARGER THAN T-TH'/
     1' AND ',I7,' DATA POINTS EQUAL TO T-TH')
  200 FORMAT (' T-TH ELEMENT NOT IN EXTRACTED DATA, INCREASE PCNTH')
  210 FORMAT (' INPUT PCNTL AND PCNTH')
  220 FORMAT (' T-TH ELEMENT NOT IN EXTRACTED DATA, INCREASE PCNTL')
  230 FORMAT (' INPUT FORMAT FOR READING DATA')
  240 FORMAT (8A4)
      END
C

      SUBROUTINE READAT (INTAPE,OUTAPE,N,KBUF,IFMT)
      COMMON /BUF/ X(20000)
      INTEGER OUTAPE, QUAL, IFMT(8)
C
C     THIS ROUTINE READS DATA FROM FILE INTAPE TO
C     FILE OUTAPE.  OUTAPE IS WRITTEN USING KBUF WORD
C     BINARY WRITES.  THIS ROUTINE IS DATA DEPENDENT
C     AND SHOULD BE REWRITTEN TO SUIT THE
C     USERS DATA FORMAT, BUT MAINTAINING THE
C     RESULT OF KBUF WORD BINARY WRITES.
C
      REWIND INTAPE
      REWIND OUTAPE
      N=0
      K=0
      READ (INTAPE,30) IS
   10 READ (INTAPE,IFMT,END=20) IFL,QUAL,B
      IF (IFL.EQ.999999) GO TO 10
      IF (QUAL.NE.0) GO TO 10
      K=K+1
      X(K)=B
      IF (K.LT.KBUF) GO TO 10
C     WRITE OUT DATA KBUF AT A TIME IN BINARY FORMAT
      WRITE (OUTAPE) (X(I),I=1,KBUF)
      K=0
      N=N+KBUF
      GO TO 10
   20 IF (K.NE.0) WRITE (OUTAPE) (X(I),I=1,K)
C     RETURN N=NUMBER OF DATA POINTS
      N=N+K
      REWIND INTAPE
      REWIND OUTAPE
      RETURN
C
   30 FORMAT (27X,I3)
      END
C

      SUBROUTINE SAMP (ITAPE,N,T,PCNTL,PCNTH,KSAMP,IR,IQ,CUTOFL,CUTOFH,
     +KBUF,ACC)
      COMMON /BUF/ X(10000), Y(1000), JJ(1000), LL(1000)
      INTEGER T
C
C     THIS SUBROUTINE SELECTS A KSAMP-WORD SAMPLE OF
C     EQUALLY SPACED DATA.  THE LOCATION OF THE T-TH
C     LARGEST ELEMENT IS GUESSED FROM THIS SAMPLE.
C     THEN, CUTOFFS OF 100*PCNTH PERCENT ABOVE
C     AND 100*PCNTL BELOW THIS GUESS ARE
C     COMPUTED AND RETURNED AS CUTOFH AND CUTOFL.
C
      REWIND ITAPE
      IR=MOD(N,KBUF)
      IQ=(N-IR)/KBUF
      INC=N/KSAMP
      ICTR=1
      L=0
      IF (IQ.EQ.0) GO TO 30
      DO 20 I=1,IQ
         READ (ITAPE) (X(J),J=1,KBUF)
   10    L=L+1
         Y(L)=X(ICTR)
         ICTR=ICTR+INC
         IF (ICTR.LE.KBUF) GO TO 10
         ICTR=ICTR-KBUF
   20 CONTINUE
      IF (IR.EQ.0) GO TO 50
   30 READ (ITAPE) (X(J),J=1,IR)
   40 IF (L.GE.KSAMP) GO TO 50
      L=L+1
      Y(L)=X(ICTR)
      ICTR=ICTR+INC
      IF (ICTR.LE.IR) GO TO 40
   50 CALL QQSORT (L,Y,JJ,LL,LL)
      SAMPL=KSAMP
      INDEXL=MAX0(IFIX(FLOAT(N-T)/FLOAT(N)*SAMPL-PCNTL*SAMPL),1)
      INDEXH=MIN0(IFIX(FLOAT(N-T)/FLOAT(N)*SAMPL+PCNTH*SAMPL),KSAMP)
      CUTOFL=Y(INDEXL)
      CUTOFH=Y(INDEXH)
      IF (INDEXH.EQ.KSAMP) CUTOFH=ACC
      IF (INDEXL.EQ.1) CUTOFL=-ACC
      REWIND ITAPE
      RETURN
      END

      SUBROUTINE XTRACT (INTAPE,OUTAPE,CUTOFL,CUTOFH,N,IR,IQ,ICTR,ISPOT,
     +KBUF)
      COMMON /BUF/ X(10000), Y(10000)
      INTEGER OUTAPE
C
C     THIS SUBROUTINE EXTRACTS A SUBSET OF DATA FROM
C     FILE INTAPE AND WRITES IT TO FILE OUTAPE, IN BINARY
C     KBUF-WORD WRITES.  DATA EXTRACTED IS BETWEEN
C     CUTOFL AND CUTOFH.
C
      REWIND INTAPE
      REWIND OUTAPE
      ICTR=0
      L=0
      ISPOT=0
      IF (IQ.EQ.0) GO TO 50
      DO 40 I=1,IQ
         READ (INTAPE) (X(J),J=1,KBUF)
         DO 30 K=1,KBUF
            IF (X(K).LT.CUTOFL.OR.X(K).GT.CUTOFH) GO TO 20
            IF (L.LT.KBUF) GO TO 10
            L=0
            ICTR=ICTR+KBUF
            WRITE (OUTAPE) (Y(J),J=1,KBUF)
   10       L=L+1
            Y(L)=X(K)
   20       IF (X(K).GT.CUTOFH) ISPOT=ISPOT+1
   30    CONTINUE
   40 CONTINUE
   50 IF (IR.EQ.0) GO TO 90
      READ (INTAPE) (X(J),J=1,IR)
      DO 80 K=1,IR
         IF (X(K).LT.CUTOFL.OR.X(K).GT.CUTOFH) GO TO 70
         IF (L.LT.KBUF) GO TO 60
         L=0
         ICTR=ICTR+KBUF
         WRITE (OUTAPE) (Y(J),J=1,KBUF)
   60    L=L+1
         Y(L)=X(K)
   70    IF (X(K).GT.CUTOFH) ISPOT=ISPOT+1
   80 CONTINUE
   90 WRITE (OUTAPE) (Y(J),J=1,L)
      ICTR=ICTR+L
      REWIND INTAPE
      REWIND OUTAPE
      RETURN
      END

      SUBROUTINE BLUM (N,X,T,AT,LT,KBUF,ACC)
      COMMON /BUF/ Y(2860), JJ(10000), LL(10000), ZL(10000), ZG(10000)
      DIMENSION X(10003)
      INTEGER T, OLDIGB, OLDILB
C
C     THIS SUBROUTINE FINDS THE T-TH LARGEST NUMBER
C     OF A SET USING THE BLUM ALGORITHM
C     (J. COMPUT. AND SYS. SCI., 7, 1973, P. 448-461).
C
      OLDIGB=0
      OLDILB=0
      NN=N
      IF (NN.LE.LT) GO TO 100
C     SORT GROUPS OF 7
   10 IR=MOD(NN,7)
      IQ=(NN-IR)/7
      DO 20 I=1,IQ
         J=7*I-6
         CALL SORT7 (X(J))
   20 CONTINUE
      IF (IR.EQ.0) GO TO 40
      INEXT=IR+1
      DO 30 I=INEXT,7
         INDEX=NN+I-IR
         X(INDEX)=-ACC
   30 CONTINUE
      NN=NN+7-IR
      CALL SORT7 (X(NN-6))
C     PUT MEDIANS INTO Y-ARRAY
   40 NMED=NN/7
      DO 50 I=1,NMED
         INDEX=I*7-3
         Y(I)=X(INDEX)
   50 CONTINUE
C     SORT THE MEDIANS TO FIND THEIR MEDIAN
      CALL QQSORT (NMED,Y,JJ,LL,LL)
      INDEX=(NMED+1)/2
      AMED=Y(INDEX)
C     DISTRIBUTE INTO 2 SETS, ONE >= AMED AND ONE < AMED
      CALL TWOPOT (ILBIN,IGBIN,NMED,AMED,X)
      IF (T.NE.IGBIN) GO TO 60
      AT=AMED
      RETURN
   60 IF (T.GE.IGBIN) GO TO 80
      IF (OLDIGB.EQ.IGBIN) GO TO 100
      NN=IGBIN
      DO 70 I=1,IGBIN
         X(I)=ZG(I)
   70 CONTINUE
      OLDIGB=IGBIN
      OLDILB=ILBIN
      IF (NN.GT.LT) GO TO 10
      GO TO 100
   80 IF (OLDILB.EQ.ILBIN) GO TO 100
      NN=ILBIN
      T=T-IGBIN
      DO 90 I=1,ILBIN
         X(I)=ZL(I)
   90 CONTINUE
      OLDILB=ILBIN
      OLDIGB=IGBIN
      IF (NN.GT.LT) GO TO 10
  100 IF (NN.GT.KBUF) GO TO 110
      CALL QQSORT (NN,X,JJ,LL,LL)
      INDEX=NN-T+1
      AT=X(INDEX)
      RETURN
  110 WRITE (6,120)
  120 FORMAT (' TOO MANY LIKE VALUES')
      STOP
      END
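The control flow of BLUM can be hard to follow through its GO TO statements. The Python sketch below is a compact illustration of the same idea: take the median of the medians of groups of seven, partition around it, and discard the part that cannot contain the T-th largest value. It is not a transcription. The name `tth_largest` and the threshold `small` (standing in for LT) are hypothetical, the last group is left short rather than padded with -ACC, and the partition is done by direct comparison rather than by TWOPOT.

```python
def tth_largest(values, t, small=50):
    """Return the t-th largest element of values (t = 1 is the maximum)."""
    x = list(values)
    while len(x) > small:
        # Sort groups of 7 and collect their medians.
        groups = [sorted(x[i:i + 7]) for i in range(0, len(x), 7)]
        medians = sorted(g[len(g) // 2] for g in groups)
        pivot = medians[(len(medians) - 1) // 2]
        greater_eq = [v for v in x if v >= pivot]
        less = [v for v in x if v < pivot]
        if t == len(greater_eq):
            # The pivot is the smallest member of greater_eq.
            return pivot
        if t < len(greater_eq):
            if len(greater_eq) == len(x):
                break  # no progress (many equal values): fall back to sorting
            x = greater_eq
        else:
            t -= len(greater_eq)
            x = less
    return sorted(x)[len(x) - t]
```

As in the listing, the search switches to a plain sort once the working set is small or a pass fails to shrink it. The listing stops with an error message if the stalled set is still larger than KBUF, whereas this sketch simply sorts it.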


      SUBROUTINE SORT7 (X)
      DIMENSION X(7), Y(7)
C
C     THIS SUBROUTINE SORTS A SEQUENCE OF 7 NUMBERS
C     USING NO MORE THAN 13 COMPARISONS
C
C     STAGE 1
      IF (X(1).LE.X(2)) GO TO 10
      TEMP=X(1)
      X(1)=X(2)
      X(2)=TEMP
   10 IF (X(3).LE.X(4)) GO TO 20
      TEMP=X(3)
      X(3)=X(4)
      X(4)=TEMP
   20 IF (X(5).LE.X(6)) GO TO 30
      TEMP=X(5)
      X(5)=X(6)
      X(6)=TEMP
C     STAGE 2
   30 IF (X(1).LE.X(3)) GO TO 40
      Y(1)=X(3)
      Y(3)=X(1)
      GO TO 50
   40 Y(1)=X(1)
      Y(3)=X(3)
   50 IF (X(2).LE.X(4)) GO TO 60
      Y(2)=X(4)
      Y(4)=X(2)
      GO TO 70
   60 Y(2)=X(2)
      Y(4)=X(4)
   70 IF (Y(2).LE.Y(3)) GO TO 80
      TEMP=Y(2)
      Y(2)=Y(3)
      Y(3)=TEMP
   80 IF (X(5).GT.X(7)) GO TO 100
      Y(5)=X(5)
      IF (X(6).GT.X(7)) GO TO 90
      Y(6)=X(6)
      Y(7)=X(7)
      GO TO 110
   90 Y(6)=X(7)
      Y(7)=X(6)
      GO TO 110
  100 Y(5)=X(7)
      Y(6)=X(5)
      Y(7)=X(6)
C     STAGE 3
  110 I=1
      J=5
      IPTR=1
  120 IF (Y(I).LE.Y(J)) GO TO 140
      X(IPTR)=Y(J)
      IPTR=IPTR+1
      J=J+1
      IF (J.LE.7) GO TO 120
      ITEMP=I-IPTR
      DO 130 K=IPTR,7
         INDEX=ITEMP+K
         X(K)=Y(INDEX)
  130 CONTINUE
      RETURN
  140 X(IPTR)=Y(I)
      IPTR=IPTR+1
      I=I+1
      IF (I.LE.4) GO TO 120
      ITEMP=J-IPTR
      DO 150 K=IPTR,7
         INDEX=ITEMP+K
         X(K)=Y(INDEX)
  150 CONTINUE
      RETURN
      END
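The three stages of SORT7 are: order three pairs, build a sorted run of four and a sorted run of three, then merge the two runs. The Python sketch below follows the same stages so that the structure can be checked exhaustively. The function name `sort7` and the local names are illustrative, and the sketch returns a new list rather than sorting in place.

```python
from itertools import permutations


def sort7(x):
    """Return the seven values of x in ascending order."""
    a = list(x)
    # Stage 1: order the pairs (1,2), (3,4), (5,6).
    for i in (0, 2, 4):
        if a[i] > a[i + 1]:
            a[i], a[i + 1] = a[i + 1], a[i]
    # Stage 2: sorted run of four from the first two pairs ...
    four = [min(a[0], a[2]), min(a[1], a[3]), max(a[0], a[2]), max(a[1], a[3])]
    if four[1] > four[2]:
        four[1], four[2] = four[2], four[1]
    # ... and sorted run of three by inserting the seventh value.
    if a[4] > a[6]:
        three = [a[6], a[4], a[5]]
    elif a[5] > a[6]:
        three = [a[4], a[6], a[5]]
    else:
        three = [a[4], a[5], a[6]]
    # Stage 3: merge the two runs.
    out = []
    i = j = 0
    while i < 4 and j < 3:
        if four[i] <= three[j]:
            out.append(four[i])
            i += 1
        else:
            out.append(three[j])
            j += 1
    return out + four[i:] + three[j:]


# Exhaustive check over all 5040 orderings of seven distinct values.
assert all(sort7(p) == list(range(7)) for p in permutations(range(7)))
```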


      SUBROUTINE TWOPOT (ILBIN,IGBIN,NMED,AMED,X)
      COMMON /BUF/ Y(2860), JJ(10000), LL(10000), ZL(10000), ZG(10000)
      DIMENSION X(10003)
C
C     THIS ROUTINE ASSUMES INPUT DATA IS IN ARRAY X.
C     EACH ROW OF 7 HAS BEEN SORTED. THIS PARTIALLY
C     SORTED DATA IS DIVIDED INTO TWO SETS. ZL
C     CONTAINS DATA LESS THAN AMED AND ZG CONTAINS DATA
C     GREATER THAN OR EQUAL TO AMED. ILBIN IS
C     THE LENGTH OF ZL AND IGBIN IS THE LENGTH OF ZG.
C
      ILBIN=0
      IGBIN=0
      JCT=-7
      DO 130 I=1,NMED
         JCT=JCT+7
         IF (X(JCT+1).LT.AMED) GO TO 20
         DO 10 J=1,7
            IGBIN=IGBIN+1
            INDEX=JCT+J
            ZG(IGBIN)=X(INDEX)
   10    CONTINUE
         GO TO 130
   20    IF (X(JCT+7).GE.AMED) GO TO 40
         DO 30 J=1,7
            ILBIN=ILBIN+1
            INDEX=JCT+J
            ZL(ILBIN)=X(INDEX)
   30    CONTINUE
         GO TO 130
   40    IF (X(JCT+4).GE.AMED) GO TO 50
         KCT=JCT
         LCT=JCT
         GO TO 70
   50    KCT=JCT-4
         LCT=JCT+4
         DO 60 J=1,4
            IGBIN=IGBIN+1
            INDEX=JCT+J+3
            ZG(IGBIN)=X(INDEX)
   60    CONTINUE
         GO TO 90
   70    DO 80 J=1,4
            ILBIN=ILBIN+1
            INDEX=LCT+J
            ZL(ILBIN)=X(INDEX)
   80    CONTINUE
   90    IF (X(KCT+6).GE.AMED) GO TO 110
         ZL(ILBIN+1)=X(KCT+5)
         ZL(ILBIN+2)=X(KCT+6)
         ILBIN=ILBIN+2
         IF (X(KCT+7).GE.AMED) GO TO 100
         ILBIN=ILBIN+1
         ZL(ILBIN)=X(KCT+7)
         GO TO 130
  100    IGBIN=IGBIN+1
         ZG(IGBIN)=X(KCT+7)
         GO TO 130
  110    ZG(IGBIN+1)=X(KCT+6)
         ZG(IGBIN+2)=X(KCT+7)
         IGBIN=IGBIN+2
         IF (X(KCT+5).GE.AMED) GO TO 120
         ILBIN=ILBIN+1
         ZL(ILBIN)=X(KCT+5)
         GO TO 130
  120    IGBIN=IGBIN+1
         ZG(IGBIN)=X(KCT+5)
  130 CONTINUE
      RETURN
      END
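TWOPOT exploits the fact that each row of seven is already sorted, so a few comparisons against AMED classify a whole row. It tests the smallest, the largest, and the middle element, and then resolves the remaining three values by testing the middle one of them first. The Python sketch below illustrates that decision order; the names `split_rows` and `_split_three` are hypothetical, and rows are passed as separate lists rather than as consecutive slices of one array.

```python
def _split_three(tri, amed, less, greater_eq):
    """Classify an ascending triple by testing its middle element first."""
    if tri[1] >= amed:
        greater_eq.extend(tri[1:])
        (greater_eq if tri[0] >= amed else less).append(tri[0])
    else:
        less.extend(tri[:2])
        (greater_eq if tri[2] >= amed else less).append(tri[2])


def split_rows(rows, amed):
    """Split ascending rows of 7 into (less, greater_eq) around amed."""
    less, greater_eq = [], []
    for row in rows:
        if row[0] >= amed:        # smallest is high: the whole row is high
            greater_eq.extend(row)
        elif row[6] < amed:       # largest is low: the whole row is low
            less.extend(row)
        elif row[3] >= amed:      # median is high: the top four are high
            greater_eq.extend(row[3:])
            _split_three(row[0:3], amed, less, greater_eq)
        else:                     # median is low: the bottom four are low
            less.extend(row[0:4])
            _split_three(row[4:7], amed, less, greater_eq)
    return less, greater_eq
```

A mixed row is settled with five comparisons (first, last, median, and two for the remaining triple), and a row that lies entirely on one side of AMED needs only one or two.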

      SUBROUTINE QQSORT (N,X,J,L,Y)
      DIMENSION X(1), J(1), L(1), Y(1)


C
C     THIS ROUTINE SORTS N REAL NUMBERS INTO ASCENDING ORDER.
C     J IS TEMPORARY STORAGE ARRAY OF DIMENSION N.
C     L IS TEMPORARY STORAGE ARRAY OF DIMENSION N.
C     Y IS THE SAME ARRAY AS L BUT IS A REAL VARIABLE.
C     THIS ROUTINE IS PART OF THE LOS ALAMOS NATIONAL
C     LABORATORY PROGRAM LIBRARY.
C
      I1=1
      I2=N
C     MAIN LOOP.
   10 XM=X(I1)
      XN=XM
      L(I1)=0
      IN=I1+1
C     DETERMINE MAXIMUM AND MINIMUM.
      DO 20 I=IN,I2
         L(I)=0
         XM=AMAX1(XM,X(I))
         XN=AMIN1(XN,X(I))
   20 CONTINUE
      IF (XN.EQ.XM) GO TO 160
C     DETERMINE PROJECTOR TABLE J AND SCORE L.
      XI=(I2-I1)/(XM-XN)
      XM=I1+.00001
      DO 30 I=I1,I2
         K=(X(I)-XN)*XI+XM
         J(I)=K
         L(K)=L(K)+1
   30 CONTINUE
      IS=L(I1)
      L(I1)=I1-1
      DO 40 I=IN,I2
         IT=L(I)
         L(I)=L(I-1)+IS
         IS=IT
   40 CONTINUE
C     DETERMINE NEW J TABLE.
      DO 50 IT=I1,I2
         I=J(IT)
         L(I)=L(I)+1
         J(IT)=L(I)
   50 CONTINUE
C     SORT X TABLE.
      DO 60 IT=I1,I2
         I=J(IT)
         J(IT)=0
         Y(I)=X(IT)
   60 CONTINUE
      DO 70 I=I1,I2
         X(I)=Y(I)
   70 CONTINUE
C     RESET L.
      DO 80 I=I1,I2
         K=(X(I)-XN)*XI+XM
         J(K)=J(K)+1
   80 CONTINUE
      I=I1
      DO 90 IT=I1,I2
         L(I)=J(IT)
         I=I+J(IT)
   90 CONTINUE
C     CONTROL LOOP FOR FINDING POCKETS OF UNORDERED SETS.
  100 IS=I1
  110 IF (L(IS)-2) 150,140,120
  120 IF (L(IS).EQ.3) GO TO 130
C     RETURN TO MAIN LOOP FOR SETS LARGER THAN THREE.
      I1=IS
      I2=I1+L(IS)-1
      GO TO 10
C     SETS OF THREE.


  130 XN=AMIN1(X(IS),X(IS+1))
      XI=AMAX1(X(IS),X(IS+1))
      XM=AMIN1(XI,X(IS+2))
      X(IS+2)=AMAX1(XI,X(IS+2))
      X(IS)=AMIN1(XN,XM)
      X(IS+1)=AMAX1(XN,XM)
      IS=IS+2
      GO TO 150
C     SETS OF TWO.
  140 XI=X(IS)
      X(IS)=AMIN1(XI,X(IS+1))
      X(IS+1)=AMAX1(XI,X(IS+1))
      IS=IS+1
  150 IS=IS+1
      IF (IS.LT.N) GO TO 110
      RETURN
C     SKIP SETS OF EQUAL NUMBERS.
  160 I1=I2+1
      IF (I1.LT.N) GO TO 100
      RETURN
      END
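Read from its comments and arithmetic, QQSORT appears to be a distribution sort. Each value is projected linearly onto one of N slots between the minimum and the maximum, the values are placed by slot, and any slot holding more than three values is sorted again by the same procedure, while sets of two and three are ordered directly. The Python sketch below illustrates that reading with explicit bucket lists and recursion instead of the in-place index tables of the listing. The name `distribution_sort` is hypothetical, and the small offset `1e-5` mirrors the constant `.00001` in the listing.

```python
def distribution_sort(x):
    """Return the values of x in ascending order using linear bucketing."""
    n = len(x)
    if n < 2:
        return list(x)
    lo, hi = min(x), max(x)
    if lo == hi:
        return list(x)  # all values equal: nothing to order
    scale = (n - 1) / (hi - lo)
    buckets = [[] for _ in range(n)]
    for v in x:
        # Linear projection onto slots 0 .. n-1 (monotone in v).
        buckets[int((v - lo) * scale + 1e-5)].append(v)
    out = []
    for b in buckets:
        # Small pockets are ordered directly; larger ones are bucketed again.
        out.extend(distribution_sort(b) if len(b) > 3 else sorted(b))
    return out
```

Because the minimum and the maximum always land in different slots, every pocket is strictly smaller than the set it came from, so the repeated bucketing terminates. For roughly uniform data most slots hold very few values and the work is close to linear in N.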
