Using the Gamma-Poisson Model to Predict Library Circulations

Quentin L. Burrell

Statistical Laboratory, Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom

Recent work has questioned the appropriateness of the gamma mixture of Poisson processes to model the circulation of books in a library. The purpose of this article is to argue that, for all its perceived defects, the model can be used to make predictions regarding future circulations of a quality adequate for general management requirements. The precise mathematical form of the model allows the consideration of any number of possible future developments. The use of the model is extensively illustrated with data from the University of Saskatchewan, Canada, and the University of Sussex, England.

Introduction

The idea that library book circulation (i.e., external or checked-out borrowing) might be modeled by an appropriate mixture of Poisson processes has an extensive history. An elementary presentation may be found in Burrell (1980), while the paper of Burrell and Cane (1982) is followed by a wide-ranging discussion of the model. When the mixing distribution is taken to be of gamma form, the resulting Gamma-Poisson (GP) process is well known in the statistical literature as a flexible model for many phenomena and is particularly easy to handle, the resulting distribution over any given time period being of negative binomial (NB) form. Such an NB distribution has been considered by many authors in the bibliometric context and we merely mention as a well-known example the work of Ravichandra Rao (1980).
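The step from a gamma mixing distribution to the NB form can be written out in a few lines. The following is a sketch under an assumed parameterization (gamma shape $\nu$ and rate $\beta$, with $p = \beta/(1+\beta)$), chosen so that it agrees with the NB index and parameter used in the appendix; the symbols $\lambda$ and $\beta$ are introduced here for illustration only.

```latex
% Assumption: an item's borrowing rate \lambda over the period is
% Gamma(shape \nu, rate \beta); given \lambda, the loan count Y is Poisson(\lambda).
\begin{aligned}
P(Y = k)
  &= \int_0^\infty \frac{e^{-\lambda}\lambda^{k}}{k!}\,
     \frac{\beta^{\nu}\lambda^{\nu-1}e^{-\beta\lambda}}{\Gamma(\nu)}\,d\lambda \\
  &= \frac{\beta^{\nu}}{k!\,\Gamma(\nu)}\,
     \frac{\Gamma(k+\nu)}{(1+\beta)^{k+\nu}} \\
  &= \binom{k+\nu-1}{k}\,p^{\nu}(1-p)^{k},
  \qquad p = \frac{\beta}{1+\beta},\quad k = 0, 1, 2, \ldots
\end{aligned}
```

Under this parameterization the mean is $\nu(1-p)/p$ and the probability of no loans is $p^{\nu}$.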

More recently Burrell has extended the model to incorporate the notion of aging of material and has investigated consequences and suggested possible uses in a series of papers (Burrell, 1985; 1986; 1987). In the first of these (p. 107) he noted a systematic deviation of the reported frequency-of-circulation (FOC) distributions, based on data from the library of the University of Sussex, from the theoretical NB form and was obliged to concede that “we can never expect an empirical distribution of this type to conform exactly to an assumed theoretical form. . . .” In seeking to assess the validity of this GP model with aging, Tague and Ajiferuke (1987) made use of an 11-year database compiled at the University of Saskatchewan and were forced to conclude that, at least so far as statistical goodness-of-fit tests were concerned, the predicted NB form of the FOC distribution did not fit the data. At about the same time Gelman and Sichel (1987) opined that no mixture of Poisson processes was appropriate for the modeling of book circulation data and instead advocated a beta mixture of binomial distributions which certainly achieved a much improved fit, in terms of χ² values, for a variety of reported FOC distributions.

Received January 26, 1988; revised March 14, 1988; accepted March 18, 1988.

© 1990 by John Wiley & Sons, Inc.

At first sight it may seem, therefore, that the GP model, with or without aging, is somewhat discredited and that its use is inadvisable. In this article we make some observations on these critical works and seek to demonstrate that if the purpose of bibliometric modeling is to provide useful information to assist the library manager in determining future strategy, then the GP model remains a very powerful and flexible tool.

Drawbacks of the GP Model

As previously noted, the GP model leads to a negative binomial form for the borrowing distribution over a given period of time and it is this distributional form which has been questioned. Tague and Ajiferuke (1987) used the χ² goodness-of-fit criterion to show that the NB form did not adequately model the presented data from the University of Saskatchewan. Although one can get better fits to these data by using different estimation techniques for the NB parameters rather than the method of moments employed by the authors, it must be conceded that even then the χ² test is failed lamentably.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 41(3):164-170, 1990 CCC 0002-8231/90/030164-07$04.00


In his original presentation of the GP model with aging, Burrell (1985) noted that the FOC distributions for the University of Sussex data consistently tailed off more rapidly than would be expected using an NB distribution. He suggested that a possible cause in this case was that many of the most heavily used texts had been transferred to a “short loan” collection and hence were not included in the reported FOC data.

In their article, Gelman and Sichel (1987) observe that this rapid tailing off is a feature of most, if not all, empirical FOC distributions and point out that it cannot be modeled by any mixture of Poisson processes. The explanation suggested in Gelman and Sichel (1987) for this tailing-off is that because any loan results in a finite loan period during which the item cannot again be borrowed, there is necessarily an effective maximum number of times that any item can be borrowed during the period of study. Although this maximum number of possible borrowings, denoted S, is related to loan policy and perhaps many other factors, it is by no means clear how it is to be determined a priori, and in Gelman and Sichel (1987) an ad hoc method requiring inspection of the data is adopted. This interpretation of S as the maximum possible number of loans in a period, although intuitively appealing, is not altogether straightforward, however. For instance, in the University of Pittsburgh data given by Gelman and Sichel the value of S for the 1974 data is 22, while for the 1969-1975 data it is 60. In other words, the maximum possible number of loans of an item is suggested to be 22 in a single year but only 60 in six years. In a similar way, the University of Saskatchewan data in Tague and Ajiferuke (1987) reveal that the maximum number of loans observed in each year for the collection under study gradually declines over the 11-year period from 19 in 1968-1969 to just 6 in 1977-1978. During this time there was apparently no change in loan policy, so other factors must affect the observed maximum number of loans and hence the perceived maximum possible number, S. Clearly a much greater understanding of the nature of S is required before it can be built into a prediction model.

Predictions Using the GP Model with Aging

One of the most remarkable features of the Tague and Ajiferuke article, at least to this author, was the quality of the “predictions” achieved with the GP model with aging. Based on the FOC data for 1968-1969, together with a measure of the observed overall decline in borrowings from that year to 1969-1970, the authors estimated the required model parameters and were able to produce expected FOC distributions for each of the subsequent years of the study. Their results for 1977-1978, together with the baseline data for 1968-1969, are given in Table 1. If one considers the general features rather than the fine detail, note that Tague and Ajiferuke’s calculations allow one to “predict,” for instance, that

(1) about 63,000 items will not circulate at all,
(2) no items will circulate more than six times,
(3) the collection as a whole will only generate about 7,000 loans.

TABLE 1. Frequency-of-circulation distributions, University of Saskatchewan Library (Tague and Ajiferuke, 1987).

Number of              1968-1969                1977-1978
circulations r    Observed  Estimated^a    Observed  Estimated

      0             51992      51090         63251      63110
      1              7674       8728          3976       4450
      2              3576       3808           997        802
      3              2087       1999           260        174
      4              1250       1138            67         41
      5               748        678            34         10
      6               485        416             5          3
      7               320        260
      8               181        166
      9               115        106
     10                73         69
     11                39         45
     12                23         30
     13                12         20
     14                 5         13
     15                 6          9
     16                 3          6
   17-19                1          9

   Total            68590      68590         68590      68590

^a Both estimated columns from Tague and Ajiferuke (1987).


Given the assessed inadequacy of the model, the accuracy of these statements is notable. That they were “made” eight years ahead of events is surely worthy of comment.
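The three summary statements can be checked directly against the 1977-1978 columns of Table 1. The short sketch below does this arithmetic; the function names are illustrative and not part of the original study.

```python
# Table 1, 1977-1978 columns: list index = number of circulations r (0 to 6).
observed_1977 = [63251, 3976, 997, 260, 67, 34, 5]
estimated_1977 = [63110, 4450, 802, 174, 41, 10, 3]


def total_loans(freqs):
    """Total loans implied by an FOC distribution: the sum of r * f(r)."""
    return sum(r * count for r, count in enumerate(freqs))


def summarize(freqs):
    """Return (items not circulating, largest r with any items, total loans)."""
    max_r = max(r for r, count in enumerate(freqs) if count > 0)
    return freqs[0], max_r, total_loans(freqs)


print("estimated:", summarize(estimated_1977))
print("observed: ", summarize(observed_1977))
```

The estimated column implies 6,808 loans with a maximum of six circulations, against 7,218 loans and the same maximum in the observed column.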

In both Burrell (1985) and Tague and Ajiferuke (1987) the approach to prediction is to look at the development of the GP process with aging for particular parameter values determined from the first year’s FOC distribution and an estimate of the aging factor. In particular, note that the initial data are used only to estimate the model parameters, so that any initial data leading to the same estimates would lead to the same predicted development. In the following our predictions are made using the method given in the appendix, which is based on conditional distributions given the initial data and hence incorporates any peculiarities of these data, including departures from the theoretical NB form.

In Table 2 we give the observed and predicted circulation frequencies for the University of Sussex long-loan library using data reported in Burrell (1985). This covers a collection of 242,075 items available for loan throughout the period of four academic years 1976-1980. These predictions are slightly better than those originally given by Burrell (1985), reflecting the fact that the observed data for the initial year 1976-1977 feature in more than just the parameter-estimation stage, but may still be felt to be not too impressive.

Similarly we may calculate predicted FOC distributions based on the observed circulation data reported by Tague and Ajiferuke. These relate to a collection of 68,590 items each of which had been borrowed at least once during 1967-1968 and which were then tracked over the years 1968-1969 to 1977-1978. One must admit that these predicted distributions would undoubtedly fail to “fit” the data according to χ² or other standard statistical criteria. This is almost to be expected when working with such large data sets. Any theoretical model can only be regarded as an approximation to reality, to the extent that any differences between the model and the reality will inevitably be revealed by, e.g., a χ² goodness-of-fit test given a sufficiently large sample, and our sample sizes here are very large. On the other hand, it is not really our aim to seek out an “optimal” model but rather one that catches the essential features of the data and provides useful information for management purposes. This is discussed in the next section.
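The dependence of the χ² statistic on sample size can be illustrated with the 1977-1978 columns of Table 1. The sketch below computes the plain Pearson statistic and then repeats the calculation for a hypothetical collection with the same proportions but one percent of the items; the scaled-down collection is an invention for illustration, and a formal test would normally also pool the cells with small expected counts.

```python
def pearson_chi2(observed, expected):
    """Pearson statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Table 1, 1977-1978: observed and estimated counts for r = 0..6.
observed = [63251, 3976, 997, 260, 67, 34, 5]
estimated = [63110, 4450, 802, 174, 41, 10, 3]

full = pearson_chi2(observed, estimated)

# Hypothetical smaller collection: identical proportions, 1% of the items.
scale = 0.01
small = pearson_chi2([scale * o for o in observed],
                     [scale * e for e in estimated])

print(f"chi-square at full size: {full:.1f}")
print(f"chi-square at 1% size:   {small:.2f}")
```

With these figures the statistic is about 216 at full size and about 2.2 for the scaled-down collection: for fixed proportions it grows linearly with the number of items, so the same relative discrepancies that would pass unnoticed in a small sample are decisive in a large one.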

Using the Model in Reality

That there is nothing so difficult to predict as the future is a truism and any forecasts regarding future behavior of a system incorporating randomness must be regarded as being to some extent speculative. One would therefore wish to have a certain robustness in the forecasting procedure so that not only are the predictions sufficiently good in an ideal situation but that similar sorts of forecasts result even if external circumstances change or if initial assumptions prove to be not quite correct. In this section we look further at the University of Saskatchewan data previously discussed by Tague and Ajiferuke.

TABLE 2. Observed and predicted frequency-of-circulation distributions, University of Sussex Library (Burrell, 1985). (Base year = 1976-1977, p = 0.4116, ν = 0.4596, θ = 0.9167.)

Number of         1977-1978              1978-1979              1979-1980
circs. r     Observed  Predicted    Observed  Predicted    Observed  Predicted

     0        165533     164352      170120     168070      175982     171724
     1         39979      43066       39326      42397       36705      41622
     2         18725      17995       16226      17042       15283      16063
     3          9364       8410        8113       7653        7113       6915
     4          4782       4109        4304       3587        3694       3104
     5          2237       2047        2171       1713        1881       1418
     6           897       1029        1128        825         866        653
     7           316        520         425        399         357        303
     8           126        263         148        194         114        141
     9            57        134          60         95          47         66
    10            19         69          29         47          16         32
    11             8         36          10         24          12         16
    12             6         19           5         12           1          8
    13             2         10           0          7           1          4
    14             4          6           2          4           0          2
    15             4          3           1          2           1          1
   ≥16            16          7           7          4           2          3

The quality of any model predictions will only be as good as the quality of the assumptions on which the model is based and the parameter estimates used to drive the model. In the case of the gamma-Poisson model with aging, the essential assumptions are that (1) library loans follow a mixed (nonhomogeneous) Poisson process, (2) aging occurs at the same exponential rate for all items. So far as (1) is concerned, we have already discussed in the first two sections reservations on this point. Regarding (2), Burrell (1985) referred to earlier studies that had detected a gradual aging of any given body of library material and had supposed this aging to occur at an exponential rate, so that the aging factor, denoted θ, is constant. Whether this is a reasonable assumption for the case of Saskatchewan we can assess by reference to Table 3, which gives the observed aging factors year-by-year throughout the period of study.

Clearly one would hesitate to describe this set of figures as being approximately constant, even if one ignores the rather unusual borrowing behavior in 1970-1971, which greatly influences the second and third figures. Indeed it would be rather unwise to base one’s assessment of θ on a single year-by-year change in borrowing. In the case of the Saskatchewan data an important reason for the general success of the predictions given in Table 1 is the fortuitous similarity between the aging factor based on the first two years of study, θ = 0.8230, and the average over the full 10 years, θ = 0.8284. One cannot of course rely on such good fortune so that, as recommended in Burrell (1985), it is as well to make use of past records over several years to make an estimate of θ. The smoothing effect of including more than one year is illustrated in Table 3 where, because of the availability of data, we are obliged to use “future” rather than past records.
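The smoothing can be reproduced from the year-by-year factors alone. The sketch below assumes that the factor assessed over an increasing period is the geometric mean of the annual factors it spans; this rule is not stated in the text, but it agrees with the second part of Table 3 to about three decimal places.

```python
from math import prod

# Year-by-year aging factors from Table 3, starting with 1968-69/1969-70.
annual = [0.8230, 0.6155, 1.0970, 0.7946, 0.7926,
          0.9166, 0.9956, 0.8733, 0.6589]


def cumulative_factors(factors):
    """Geometric mean of the first n annual factors, for n = 1, 2, ..."""
    return [prod(factors[:n]) ** (1.0 / n)
            for n in range(1, len(factors) + 1)]


for n, theta in enumerate(cumulative_factors(annual), start=1):
    print(f"over {n} year-pair(s): theta = {theta:.4f}")
```

The single-year factors range from about 0.62 to 1.10, whereas from the third year-pair onward the cumulative values all lie between about 0.81 and 0.85.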

In view of the variability in the aging factor year-by-year, it is important to know how crucial the particular value used in the calculations is. One of the chief attractions of a parametric model such as the gamma-Poisson with aging is that it is relatively straightforward to follow through the calculations with a range of values of possible interest and this is illustrated in Table 4.

We have still to take account of the importance of the assumed values of the other parameters, in this case the estimates of the NB parameters p and ν for the initial year’s data. Tague and Ajiferuke used the method of moments to find p = 0.298 and ν = 0.243. If instead we use the mean together with the proportion of zeroes, a method often advocated for highly skewed data as being more efficient than the method of moments (Johnson and Kotz, 1969, p. 133), we find p = 0.270 and ν = 0.211. Other methods, such as maximum likelihood or minimum χ², will lead to yet other values. Once again we can take advantage of the parametric form of the model to carry out the calculations for a range of “reasonable” values of both p and ν, the results of some of which are also included in Table 4.
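The two estimation methods mentioned here can be sketched as follows. The NB is taken to have mean ν(1 − p)/p, variance ν(1 − p)/p² and zero probability p^ν. The function names, the use of the population (divide-by-N) variance, and the bisection solver are choices made for this illustration. The data are the 1968-1969 observed column of Table 1, with the r = 1 count taken as 7674 so that the column sums to the stated 68,590 items, and with the single item in the 17-19 class placed at r = 17.

```python
from math import log


def sample_mean(freqs):
    """Mean circulations per item; freqs[r] items circulated r times."""
    return sum(r * f for r, f in enumerate(freqs)) / sum(freqs)


def fit_moments(freqs):
    """Method of moments: match the sample mean and variance.

    Requires variance > mean (overdispersed data).
    """
    n = sum(freqs)
    mean = sample_mean(freqs)
    var = sum(f * (r - mean) ** 2 for r, f in enumerate(freqs)) / n
    p = mean / var                 # NB variance = mean / p
    nu = mean * p / (1.0 - p)      # NB mean = nu * (1 - p) / p
    return p, nu


def fit_mean_zeros(freqs, tol=1e-12):
    """Match the sample mean and the proportion of zeros: solve p**nu = P0.

    With nu = mean * p / (1 - p), the gap below decreases in p, so
    bisection applies. Requires P0 > exp(-mean).
    """
    mean = sample_mean(freqs)
    log_p0 = log(freqs[0] / sum(freqs))

    def gap(p):
        return (mean * p / (1.0 - p)) * log(p) - log_p0

    lo, hi = 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    p = 0.5 * (lo + hi)
    return p, mean * p / (1.0 - p)


# 1968-1969 observed frequencies, r = 0..17 (assumptions noted above).
base = [51992, 7674, 3576, 2087, 1250, 748, 485, 320, 181, 115,
        73, 39, 23, 12, 5, 6, 3, 1]

print("moments:        p = %.3f, nu = %.3f" % fit_moments(base))
print("mean and zeros: p = %.3f, nu = %.3f" % fit_mean_zeros(base))
```

On these data both fits come out close to the values quoted above, roughly p = 0.298, ν = 0.243 by moments and p = 0.270, ν = 0.211 by the mean-and-zeros method.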

The first point to note is that, whichever of the assumed aging factors is used, the sets of predictions based on each of the three NB parameter pairs are very similar, so that the model is not heavily dependent on the estimation procedure adopted.

The second major point to be made from Table 4 is to note how the differences between the predicted FOC distributions resulting from different aging factors are most pronounced in the tails of the distributions, with smaller values of θ corresponding to more rapid shortening of the tail.

Concluding Remarks

The library manager, faced with the observed FOC data for 1968-1969 presented in Table 1, may well have been dismayed by the fact that 75.8% of the collection did not circulate at all, but perhaps consoled by those few items circulating a dozen or more times. The “predictions” for 1977-1978 given in Table 1 could have been dismissed as being unduly pessimistic. However, if the various analyses leading to Table 4 had been presented, surely some cause for concern would have been noted. According to these, even if θ could be increased (i.e., even if aging could be slowed down by some means), we could still expect 88% or more to be uncirculated in 1977-1978 and nothing to be circulated more than eight or nine times. Faced with such prospects it may well be considered the duty of management to make some positive reaction, whether by seeking means to eliminate aging or, perhaps more urgently, by asking whether such large quantities of noncirculating material can justifiably be given valuable space on the open stacks.

For all its deficiencies and theoretical drawbacks, the gamma-Poisson model can give the library manager useful guidance in decision making. It may not be the correct model or even the best, but in general terms it works!

Acknowledgement

Most of this work was carried out while the author was visiting the School of Library and Information Science at the University of Western Ontario, Canada. He is most grateful to Professor Jean Tague and her colleagues for their kind hospitality and to the British Council and the Universities of Manchester and Western Ontario for financial assistance.

TABLE 3. Aging factor: University of Saskatchewan data (Tague and Ajiferuke, 1987).

θ assessed year-by-year.

Year pair          θ
68-69/69-70     0.8230
69-70/70-71     0.6155
70-71/71-72     1.0970
71-72/72-73     0.7946
72-73/73-74     0.7926
73-74/74-75     0.9166
74-75/75-76     0.9956
75-76/76-77     0.8733
76-77/77-78     0.6589

θ assessed over increasing period.

Years              θ
68-69/69-70     0.8230
68-69/70-71     0.7117
68-69/71-72     0.8221
68-69/72-73     0.8152
68-69/73-74     0.8106
68-69/74-75     0.8274
68-69/75-76     0.8495
68-69/76-77     0.8525
68-69/77-78     0.8284

TABLE 4. Predicted FOC distributions for the University of Saskatchewan Library using various parameter values. (Base year = year 1 = 1968-1969.)

Number of
circulations  p = 0.25885  p = 0.27755  p = 0.29533     Observed
    r          ν = 0.20     ν = 0.22     ν = 0.24        number

A. Aging factor = θ = 0.773

Year 4 = 1971-1972
    0        57765        57603        57451        58073
    1         6730         6927         7114         5305
    2         2333         2345         2356         2467
    3          970          956          942         1242
    4          430          417          404          644
    5          196          187          179          394
    6           90           85           80          214
    7           41           38           36          111
    8           19           17           16           62
    9            9            8            7           47
   10            4            3            3           16
   11            2            2            1           10
  ≥12            1            1            1            5

Year 7 = 1974-1975
    0        62261        62205        62152        60565
    1         4835         4920         5000         5177
    2         1093         1080         1068         1753
    3          290          280          271          688
    4           80           76           72          263
    5           23           21           19           90
    6            6            6            5           36
    7            2            2            1           13
   ≥8            1            1            1            5

Year 10 = 1977-1978
    0        65232        65216        65202        63251
    1         2921         2949         2974         3976
    2          372          364          356          997
    3           55           52           50          260
    4            8            8            7           67
    5            1            1            1           34
   ≥6            1            0            0            5

B. Aging factor = θ = 0.823 = observed value

Year 4 = 1971-1972
    0        56465        56263        56072        58073
    1         7091         7321         7540         5305
    2         2660         2686         2710         2467
    3         1203         1192         1181         1242
    4          582          569          555          644
    5          291          280          269          394
    6          147          140          133          214
    7           75           70           66          111
    8           40           35           32           62
    9           19           18           16           47
   10           10            9            8           16
   11            5            4            4           10
  ≥12            4            3            3            5

Year 7 = 1974-1975
    0        60275        60161        60070        60565
    1         5814         5941         6073         5177
    2         1652         1645         1639         1753
    3          556          542          529          688
    4          198          189          181          263
    5           72           67           64           90
    6           26           24           22           36
    7            9            9            8           13
   ≥8            5            5            4            5

Year 10 = 1977-1978
    0        63219        63170        63132        63251
    1         4286         4350         4411         3976
    2          840          828          816          997
    3          192          185          178          260
    4           46           43           41           67
    5           11           10            9           34
   ≥6            3            3            3            5

C. Aging factor = θ = 0.873

Year 4 = 1971-1972
    0        55184        54939        54707        58073
    1         7378         7639         7888         5305
    2         2955         2998         3037         2467
    3         1432         1427         1422         1242
    4          747          733          720          644
    5          403          390          378          394
    6          221          211          202          214
    7          122          115          109          111
    8           67           63           59           62
    9           37           34           32           47
   10           20           19           17           16
   11           11           10            9           10
  ≥12           13           12           10            5

Year 7 = 1974-1975
    0        58049        57896        57751        60565
    1         6641         6830         7011         5177
    2         2258         2268         2276         1753
    3          920          906          892          688
    4          399          387          374          263
    5          178          169          162           90
    6           80           75           71           36
    7           36           33           31           13
   ≥8           26           26           23            5

Year 10 = 1977-1978
    0        60565        60475        60391        63251
    1         5679         5804         5923         3976
    2         1565         1557         1549          997
    3          511          497          484          260
    4          176          168          161           67
    5           62           58           54           34
   ≥6           33           30           28            5

Appendix

We adopt the notation of Burrell (1985, 1986, 1987) and make use of the results established there. It is assumed that frequency-of-circulation (FOC) data is collected yearly for a fixed collection of books all of which are potentially borrowable (i.e., there are no ‘dead’ items in the collection) and that years are numbered sequentially 1, 2, 3, . . . Year 1 is the base year from which we are to make predictions regarding the FOC distributions for future years. According to the gamma-Poisson model with aging: (1) individual items are borrowed as nonhomogeneous Poisson processes having the same exponential decrease in their rates, (2) the initial borrowing rates vary according to a gamma distribution. With these assumptions it is shown that the FOC distribution in any year is of negative binomial (NB) form with constant index but changing parameter. Let us write

$Y_n$ = number of borrowings of an item in year $n$, $n = 1, 2, \ldots$,
$\nu$ = NB index for the $Y$'s,
$p$ = NB parameter for $Y_1$,
$\theta$ = aging factor = $m_{n+1}/m_n$, $n = 1, 2, \ldots$


It can then be shown that the NB parameter for the (unconditional) distribution of $Y_n$ is

$$p_n = \frac{1}{1 + \phi\,\theta^{n-1}}, \qquad \text{where } \phi = E[Y_1]/\nu = (1 - p)/p.$$

This is the distribution used in Burrell (1985) and Tague and Ajiferuke (1987) to predict future circulations.
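As an illustration of this unconditional approach, the sketch below evaluates the NB distribution with index ν and parameter p_n = 1/(1 + φθ^(n−1)) for year 10, using the rounded Saskatchewan values quoted in the text (p = 0.298, ν = 0.243, θ = 0.823, 68,590 items). The function names are illustrative, and the NB probabilities are computed through log-gamma functions because the index is not an integer.

```python
from math import exp, lgamma, log


def nb_pmf(k, index, p):
    """P(Y = k) for a negative binomial with real-valued index and parameter p."""
    log_coeff = lgamma(k + index) - lgamma(k + 1) - lgamma(index)
    return exp(log_coeff + index * log(p) + k * log(1.0 - p))


def unconditional_foc(n_items, p, nu, theta, year, max_k):
    """Expected FOC counts for k = 0..max_k in the given year (year 1 = base)."""
    phi = (1.0 - p) / p
    p_n = 1.0 / (1.0 + phi * theta ** (year - 1))
    return [n_items * nb_pmf(k, nu, p_n) for k in range(max_k + 1)]


# Rounded parameter values quoted in the text; year 10 = 1977-1978.
expected = unconditional_foc(68590, p=0.298, nu=0.243, theta=0.823,
                             year=10, max_k=6)
for k, count in enumerate(expected):
    print(k, round(count))
```

With these rounded inputs the first three classes come out close to the 63,110, 4,450 and 802 given in the 1977-1978 “Estimated” column of Table 1; small differences are to be expected from rounding of the parameters.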

However, once we have the base year’s data we know, for each item in the collection, the value of $Y_1$ and we wish to predict future circulations given this value. Hence for an item which has circulated $r$ times in the base year, the probability distribution of its number of circulations in year $n$ is given not by $P(Y_n = k)$ but by $P(Y_n = k \mid Y_1 = r)$.

If we denote the observed circulation frequencies in the base year by $f(r)$ = number of items borrowed $r$ times in base year, $r = 0, 1, 2, \ldots$, then the expected number of these circulating $k$ times in year $n$ is given by $f(r)\,P(Y_n = k \mid Y_1 = r)$ and hence our predicted (i.e., expected) FOC distribution for year $n$ is given by:

Predicted number of items circulating $k$ times in year $n$

$$= \sum_{r} f(r)\,P(Y_n = k \mid Y_1 = r), \qquad k = 0, 1, 2, \ldots, \tag{A1}$$

where the summation in each case is over all $r$. Using the results in the appendices of Burrell (1986, 1987) we know that the conditional distribution of $Y_n$ given that $Y_1 = r$ is NB of index $\nu + r$ and parameter

$$p(n \mid r) = \frac{1 + \phi}{1 + \phi + \phi\,\theta^{n-1}}$$

so that

$$P(Y_n = k \mid Y_1 = r) = \binom{k + \nu + r - 1}{k}\, p(n \mid r)^{\nu + r}\,\bigl(1 - p(n \mid r)\bigr)^{k}.$$

Given values of $p$, $\nu$, and $\theta$, together with the observed circulation frequencies $f(0), f(1), f(2), \ldots$, it requires just a simple program to evaluate the summations giving the predicted numbers as defined by the above expression (A1).
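One possible form of such a program is sketched below; it is an illustration, not the author's original code. The function names are invented here. The base-year frequencies are the 1968-1969 observed column of Table 1, with the r = 1 count taken as 7674 so that the column sums to the stated 68,590 items, and with the single item in the 17-19 class placed at r = 17.

```python
from math import exp, lgamma, log


def nb_pmf(k, index, p):
    """P(Y = k) for a negative binomial with real-valued index and parameter p."""
    log_coeff = lgamma(k + index) - lgamma(k + 1) - lgamma(index)
    return exp(log_coeff + index * log(p) + k * log(1.0 - p))


def predict_foc(base_freqs, p, nu, theta, year, max_k):
    """Expression (A1): predicted counts for k = 0..max_k in the given year.

    base_freqs[r] is f(r), the number of items borrowed r times in year 1.
    """
    phi = (1.0 - p) / p
    # Conditional NB parameter p(n | r); in this formula it does not vary
    # with r, which enters only through the index nu + r.
    p_cond = (1.0 + phi) / (1.0 + phi + phi * theta ** (year - 1))
    predicted = [0.0] * (max_k + 1)
    for r, count in enumerate(base_freqs):
        if count == 0:
            continue
        for k in range(max_k + 1):
            predicted[k] += count * nb_pmf(k, nu + r, p_cond)
    return predicted


# 1968-1969 observed frequencies, r = 0..17 (assumptions noted above).
base = [51992, 7674, 3576, 2087, 1250, 748, 485, 320, 181, 115,
        73, 39, 23, 12, 5, 6, 3, 1]

# Year 10 = 1977-1978, using one of the parameter sets of Table 4.
pred = predict_foc(base, p=0.27755, nu=0.22, theta=0.823, year=10, max_k=6)
for k, count in enumerate(pred):
    print(k, round(count))
```

With p = 0.27755, ν = 0.22 and θ = 0.823 this should give roughly 63,170 non-circulating items for 1977-1978, in line with the corresponding entry of Table 4. If the base-year frequencies happened to follow the fitted NB exactly, the prediction would coincide with the unconditional one; the two approaches differ only to the extent that the observed base-year data depart from that form.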

References

Burrell, Q. L. (1980). A simple stochastic model for library loans. Journal of Documentation, 36, 115-132.

Burrell, Q. L. (1985). A note on ageing in a library circulation model. Journal of Documentation, 41, 100-115.

Burrell, Q. L. (1986). A second note on ageing in a library circulation model: the correlation structure. Journal of Documentation, 42, 114-128.

Burrell, Q. L. (1987). A third note on ageing in a library circulation model: applications to future use and relegation. Journal of Documentation, 43, 24-45.

Burrell, Q. L., & Cane, V. R. (1982). The analysis of library data. (With discussion). Journal of the Royal Statistical Society, Series A, 145, 431-471.

Gelman, E., & Sichel, H. S. (1987). Library book circulation and the beta-binomial distribution. Journal of the American Society for Information Science, 38, 4-12.

Johnson, N. L., & Kotz, S. (1969). Distributions in Statistics: Discrete Distributions. Boston: Houghton Mifflin.

Rao, I. K. R. (1980). The distribution of scientific productivity and social change. Journal of the American Society for Information Science, 31, 111-122.

Tague, J., & Ajiferuke, I. (1987). The Markov and the mixed-Poisson models of library circulation compared. Journal of Documentation, 43, 212-231.
