An Empirical Examination of Lotka’s Law

Miranda Lee Pao Matthew A. Baxter School of Information & Library Science, Case Western Reserve University, Cleveland, OH 44106

Forty-eight sets of author productivity data were tested against Lotka’s law, $x^n \cdot y = c$. Overwhelming conformity to the law was found. However, only seven data sets fitted the inverse square law. For future tests, representative coverage and good sampling techniques should be adhered to in data compilation. A method is suggested to compute the values of n and c from the data.

Empirical confirmations of Lotka’s law using author productivity data from various subjects have been criticized [1-5]. As in the case of Bradford’s law, there is no generally accepted approach to test this law [6]. A variety of methods have been used to collect data, to calculate the two constants, and to test the conformity of the observed data to the theoretical distribution. The subject matter studied ranged from such broad areas as physics to the literature on a single drug, or from the physical sciences to the humanities. The scope of the bibliographic searches for data sets varied widely from works taken from selected quality bibliographies to items listed in a single journal. Some distributions spanned the entire history of the subject, whereas others took works from a single year. Still others could not agree on the unit of publication credited to each author. Although Lotka assigned each publication to only the senior author, ignoring all coauthors, others have argued that “full productivity” or authorship should be used. This method credits the individual author with every publication in which his or her name appears. In subjects in which coauthoring is intense, discounting coauthors would eliminate a substantial portion of authors. Counting authorship appeared to be reasonable, yet some investigators made no apparent distinction between senior author count and authorship count in data collection.

Despite the 1974 publication in which Vlachy noted that the value of the slope n varied according to the characteristics of the author population, many simply took the inverse square law as the theoretical distribution without considering the difference between a theoretical distribution calculated from the given data and one calculated with n = 2 [7]. Other studies offered the values of the two constants without stating how these were derived. Finally, the relative merits of the chi-square and the Kolmogorov-Smirnov goodness-of-fit tests have not been resolved to everyone’s satisfaction.

Received March 6, 1985; revised May 21, 1985; accepted June 13, 1985

© 1986 by John Wiley & Sons, Inc.

In this context, it may be useful to retest a sample of data sets using a common replicable method. The aim is to focus on the parameters of the data sets and their theoretical formulations and to observe the effects of, or correlation between, the characteristics found in the data and the resulting degree of conformity or nonconformity to Lotka’s distribution. Potter’s review noted that subjects, languages, and formats of publication in selecting author data may be important variables in determining the conformity of Lotka’s law [5]. Vlachy’s extensive works have delved into such factors as time span and the scope of the author communities without a definitive conclusion [2,7]. In a recent article, Pao suggested a methodology to test Lotka’s law [8]; its procedures closely followed those used by Lotka himself. A systematic comparison of a reasonable sample of author data from various fields may contribute toward a better understanding of the sensitivity of some of the characteristics to Lotka’s theoretical construct.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 37(1):26-33, 1986 CCC 0002-8231/86/010026-08$04.00

Data Sources

It was decided to include as many published author productivity data sets as possible. Articles cited in Potter’s review article and Vlachy’s extensive bibliography were the starting points for the literature search [5,9]. Papers since 1978 that cited Lotka’s original 1926 article were added. This total pool of articles was drastically reduced by the need to limit the study to those articles with complete tabular data on the number of authors, y, each contributing x number of publications. The massive data and extensive work done by Vlachy could not be used since his results had been presented in graphic and summary form [2,7,10]. Except for the highly productive authors, grouped data such as those given in Aiyepeku’s Table 1, which contained 66 authors, each credited with three to four papers, also could not be used [11]. This resulted in a total of 48 data sets that included such diverse subject matters as drugs, computer science, and the humanities.

TABLE 1. Parameters obtained for Lotka’s test.

Authors = number of authors; Papers = number of papers; % incl. = percentage of authors included; SD = standard deviation; K-S = Kolmogorov-Smirnov critical value (0.01 level of significance).

Data set                               Authors   Papers % incl.     r2      SD       n       C     K-S   D_max   D_max (n = 2)
Map librarianship (Schorr)                 306      463    98.4   97.2  0.1547  2.8370  0.8079  0.0932  0.0399   0.1601
Map librarianship (coauthors)              326      463    98.5   97.0  0.1653  2.9292  0.8213  0.0903  0.0698   0.1436
Legal medicine (Schorr)                    997     1487    99.0  100.0  0.0126  2.4737  0.7397  0.0516  0.0228   0.1504
Legal medicine (coauthors)                1010     1487    99.3   98.7  0.1031  2.7478  0.7932  0.0513  0.0338   0.1515
History of technology (Murphy)             164      231   100.0   90.5  0.2695  2.6161  0.7691  0.1273  0.0354   0.1543
History of technology (coauthors)          170      231   100.0   90.8  0.2678  2.6469  0.7750  0.1250  0.0333   0.1568
Library science-LQ (Schorr)                198      229   100.0   96.9  0.1752  3.0964  0.8446  0.1158  0.0645   0.3012
Library science-LQ (coauthors)             210      229   100.0   96.8  0.1811  3.1352  0.8494  0.1125  0.0649   0.3064
Computational musicology (Pao)             458      970    99.1   91.4  0.2313  2.1680  0.6616  0.0762  0.0238   0.0537a
Computational musicology (coauthors)       544      970    99.1   88.9  0.2726  2.2725  0.6909  0.0699  0.0310   0.0526a
American Revolution (Pao)                  764     1075   100.0   92.0  0.2708  2.6321  0.7722  0.0590  0.0262   0.1905
American Revolution (coauthors)            790     1075    99.2   95.7  0.1951  2.8364  0.8078  0.0580  0.0158   0.1896
American Revolution-Shy (Pao)             1316     2043    97.8   99.2  0.0712  2.4822  0.7416  0.0449  0.0197   0.1905
American Revolution-Shy (coauthors)       1356     2043    97.9   98.6  0.0964  2.5459  0.7551  0.0443  0.0203   0.1896
American Rev.-Shy (62-71) (Pao)            416      587   100.0   92.0  0.2695  2.7772  0.7981  0.0799  0.0217   0.1219
American Rev.-Shy (62-71) (coauthors)      431      587   100.0   92.1  0.2725  2.8107  0.8036  0.0785  0.0426   0.1258
Ethnomusicology (McCreery)                2269     4434    98.4   99.4  0.0620  2.2851  0.6942  0.0342  0.0237   0.1100
Ethnomusicology (coauthors)               2422     4655    98.5   99.6  0.0511  2.2914  0.6959  0.0331  0.0291   0.1172
Ethnomusicology-journals                  1266     2185    99.7   97.9  0.1234  2.4082  0.7248  0.0458  0.0133   0.1251
Chemistry (Lotka)                         6891    22939    98.9   99.2  0.0331  1.8907  0.5679  0.0196  0.0315b  0.0288
Chemistry-A (Lotka)                       1543     5355    97.5   97.6  0.0562  1.8981  0.5707  0.0415  0.0307   0.0854
Chemistry-B (Lotka)                       5348    17584    98.7   97.7  0.0537  1.9125  0.5763  0.0223  0.0288b  0.0825
Physics (Lotka)                           1325     3398    98.8   98.1  0.1011  2.0210  0.6151  0.0448  0.0236    0.254a
Drosophila (Hersh)                         826     3662    96.7   92.9  0.1920  1.7828  0.5239  0.0567  0.0437   0.1018
Human engineering (Mantell)               2255     3182   100.0   96.3  0.2255  2.9828  0.8295  0.0343  0.0677b  0.1539
Cephalexin (Worthen)                      1198      630    99.6   99.0  0.0896  2.7559  0.7945  0.0471  0.0249   0.1617
Flurazepam (Worthen)                       432      262    99.8   93.0  0.2351  2.5386  0.7536  0.0784  0.0557   0.2014
Schistosomiasis (Braga)                   1908     5285    97.8   95.8  0.1666  2.1998  0.6708  0.0373  0.0403b  0.0363a
Econometrics (Leavens)                     721     1759    99.0   93.6  0.1645  1.9030  0.5727  0.0607  0.0613b  0.0395a
Library science-CRL (Schorr)               408      444   100.0   98.8  0.1278  3.6933  0.9034  0.0807  0.0353   0.3308
Information science-1966 (Voos)           1282     1502   100.0   98.3  0.1485  3.4507  0.8830  0.0455  0.0061   0.2743
Information science-1967                  1339     1548   100.0  100.0  0.0281  3.4089  0.8791  0.0445  0.0105   0.2748
Information science-1968                  1666     2002    99.6   98.4  0.1540  3.7747  0.9094  0.0399  0.0366   0.2675
Information science-1969                  3206     3796    99.8   98.9  0.1122  3.2867  0.8667  0.0288  0.0167   0.2754
Information science-1970                  3512     4299    99.7   99.3  0.0921  3.3535  0.8736  0.0275  0.0069   0.2591
Computer science-AFIPS (Rao)              1021     1295   100.0   99.1  0.0897  2.9216  0.8208  0.0510  0.0209   0.2266
Computer science-IEEE                      851     1112   100.0   98.4  0.1144  2.7664  0.7963  0.0559  0.0239   0.2123
Computer science-CACM                      599      471   100.0   99.0  0.1116  3.4880  0.8864  0.0666  0.0367   0.2418
Computer science-JACM                      301      266   100.0   97.3  0.1661  3.0442  0.8378  0.0940  0.0107   0.2227
Computer science-CACM (Mantell)             97      164   100.0   90.1  0.2691  1.9828  0.6019  0.1655  0.1467   0.1412a
29 journals (Mantell)                     2461     2842   100.0   99.5  0.0784  3.6669  0.9014  0.0329  0.0200   0.2734
Applied mycology (Williams)               1527     2229   100.0   94.7  0.2561  3.2833  0.8663  0.0417  0.0581b  0.1373
Applied entomology-1913 (Williams)         411      656   100.0   97.1  0.1494  2.5481  0.7556  0.0804  0.0621   0.1142
Applied entomology-1936 (Williams)        1534     2379    99.7   96.3  0.1748  2.7337  0.7906  0.0416  0.0983b  0.1139
Finno-Ugric (Rudmann)                     1111     2291    99.6   91.7  0.2554  2.4462  0.7336  0.0489  0.1053b  0.1445
Univ. Illinois (Potter)                   2345    13148    95.9   99.6  0.0456  2.1156  0.6458  0.0337  0.0136   0.0271a
Univ. Wisconsin (Potter)                  2762     5276    98.3   99.6  0.0339  2.2375  0.6814  0.0310  0.0221   0.0854
Library of Congress (McCallum)          695074  1336182    98.4   99.6  0.0500  2.3450  0.7096  0.0020  0.0423b  0.0825

a Values fall within the K-S statistics.
b Values exceed the K-S statistics.

Several adjustments were made to the data. From the articles from which the data were taken, a determination had to be made as to whether each publication was credited only to its senior author; often, coauthors had also been credited. Whenever possible, coauthor data were eliminated, crediting only senior authors as had been done by Lotka. In several papers, this detail in data collection was unclear. Thus in only nine subjects were there both senior author and coauthor distributions for comparison.

In a study of authors in information science, Voos listed an author productivity distribution for each of the 5 years included [12-14]. These annual distributions were used; however, his total distribution was excluded, since it was a summation of those from each year without his having accumulated the same authors for successive years. Consequently, if an author had published one paper in 1967 and one in 1968, he would be counted as two distinct authors, each having contributed one paper to the subject.
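The effect of summing annual distributions without accumulating authors across years can be illustrated with a short Python sketch. The author names and years below are hypothetical and are not taken from Voos’s data; the function name `productivity_distribution` is likewise an illustrative choice.

```python
from collections import Counter

def productivity_distribution(papers_per_author):
    """Map x (papers per author) -> y (number of authors credited with x papers)."""
    return Counter(papers_per_author.values())

# Hypothetical records, one (author, year) tuple per paper.
records = [("Adams", 1967), ("Adams", 1968), ("Baker", 1967), ("Clark", 1968)]

# Accumulated count: each author is tallied once across all years.
accumulated = productivity_distribution(Counter(a for a, _ in records))

# Summed annual distributions: an author active in two years is counted
# as two distinct one-paper authors.
summed = Counter()
for year in sorted({y for _, y in records}):
    annual = Counter(a for a, y in records if y == year)
    summed += productivity_distribution(annual)

print(dict(accumulated))
print(dict(summed))
```

In this toy case the accumulated distribution contains one two-paper author, whereas the summed version reports only single-paper authors, which is the distortion that led to the exclusion of the total distribution.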

Lotka’s original two sets of data on physics and chemistry were among the most comprehensive [1]. The physics data originated from a quality selected bibliography covering the entire subject up to 1900. The chemistry data represented an unorthodox sampling in that only authors with surnames beginning with the first two letters of the alphabet as listed in the 1907-1916 volume of Chemical Abstracts were included. Separate observed distributions for each of the two letter samplings were also given. Therefore, data for two subsets of the entire chemistry data were presented. Since only senior authors of each paper were tabulated, the distribution for coauthors was unavailable.

Schorr published four sets of author data including all coauthors. His map librarianship data came from a bibliography covering 50 years [15]. He also compiled a distribution from a comprehensive bibliography of the world history of legal medicine, which was published in 1974 [16]. Coile has regrouped the map librarianship data, eliminating coauthors [4]. His distribution was included in the calculation, although it was unclear from Schorr’s original data how the elimination of coauthors could be accomplished without a recount of the entries. From the information in a footnote to the distribution for the history of legal medicine, data were extracted for senior authors only. The distributions for senior authors and for coauthors were tested for both subjects. Schorr also took papers published in the journals College and Research Libraries and Library Quarterly from 1963 through 1967 to represent authors in library science [17]. For each journal, he counted coauthors in two separate distributions. It was only possible to retest the Library Quarterly distribution for senior authors.

Murphy’s data on the history of technology were limited to a 10 year coverage of a single journal [18]. All coauthors were included, and Coile had recast the distribution, crediting only the senior authors [4]. Both senior and coauthor distributions were used.

Worthen studied the literatures of two drug products: cephalexin and flurazepam [19]. Collaborative activities among authors were intense. Thus he reasoned that author productivity in these areas should include all coauthors. Although he used percentage figures to compare his data with Lotka’s formulation, it was possible to extract the cumulative 10 year data. These two unusual distributions represented a specific type of technical literature.

Braga extended the study of schistosomiasis literature by Goffman and Warren [20] and included the Brazilian literature from 1908 to 1974. It totaled 2600 papers by 1908 authors and included coauthors.

From an exhaustive bibliography on the genetics of Drosophila, which is a genus of fruit flies used extensively in experimental genetics, Hersh published an author-productivity distribution [21,22]. No mention was made as to whether coauthors were included. However, the original bibliography did not provide a separate entry for each coauthor, even though a cross reference was made to each coauthored work. It was assumed that only senior authors were included.

Leavens offered a coauthor productivity distribution for econometricians [23]. Data were taken from the first 20 annual volumes of Econometrica, 1933-1952, and from papers presented at the meeting of their society.

David Rudmann’s unpublished data in the field of Finno-Ugric were taken from Rao’s article [24]. No information was available as to how the data were collected or what years were covered.

In Mantell’s work, he listed three distributions [25]. The first was taken from the Human Engineering Bibliography from the Office of Naval Research, which covered one year’s publications. The second was from the Communications of the Association for Computing Machinery Author Index, 1958-1961. The third was from a sample of 20 learned and technical journals compiled primarily from annual indexes. This composite pool of authors wrote in psychology, physics, chemistry, mathematics, ceramics, and other engineering and scientific subjects. Very little detail on the data was given and, as far as can be discerned, only senior authors were included.

Similarly, Radhakrishnan and Kernizan listed 4 distributions representing 4 separate computer science journals in a 5 year period [26]. Each coauthor was credited with the full publication. The two investigators also included two other distributions in their paper that consisted of random samples of authors from the first four groups. Since this procedure was radically different from all other sampling methods, these distributions were excluded from consideration.

Williams published a paper in 1944 that contained two sets of data on entomology, one from the 1913 volume of the journal Review of Applied Entomology and another from its 1936 volume [27]. He also included the distribution of papers published in Review of Applied Mycology in the year 1935. The latter distribution was reported in an earlier paper by Dufrenoy. These three sets of data represented author productivity found in a single journal limited to a single year. These were the three most restrictive data sets in the sample.

The last group of distributions was taken from subjects in the humanities. Data on computational musicology (25 years), ethnomusicology (10 years), history of the American Revolution (10 years), and a quality selected bibliography on the history of the American Revolution for an extended period of 190 years were available for both senior authors and coauthors [28-31]. In addition, a subset of the quality American Revolution data corresponding to the same decade of the general bibliography of the same subject was also available for analysis. Altogether there were 11 sets of data on 3 humanistic subjects.

To these were also added three related distributions found in Potter’s references [32,33]: a random sample of 2345 personal names in the University of Illinois card catalog, for which the number of monographs attributed to each name was tabulated; a random sample from the University of Wisconsin’s card catalog; and a list of 1969-1979 MARC records distributed by the Library of Congress. The number of occurrences of distinct personal name headings in the MARC records was presented in tabulated form. Even though those names occurring more than 10 times were grouped, they constituted only 1.6% of the total 695,074 name headings. These three distributions were included to represent three large samples of author productivity of monographs, even though it was unclear whether added entries had been included. The Library of Congress distribution included all entries of all forms of publications: books, serials, maps, and films. Although the majority of the data with name headings were from books, those from other forms may have skewed the distribution. However, the extremely large sample represented a gross picture of author productivity of books.

At least 16 distributions listed authorships rather than the number of papers, since each coauthor was credited with the same paper. Several other publications simply made no mention of the unit of measurement chosen. Therefore, the column under “Papers” in Table 1 listed several authorship figures rather than the number of papers.
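The difference between the two units of measurement can be shown with a minimal Python sketch. The bylines are hypothetical, and the assumption that the first name on each byline is the senior author is made only for this illustration.

```python
from collections import Counter

# Hypothetical bylines; the first name on each paper is treated as the senior author.
papers = [["Lee", "Kim"], ["Lee"], ["Kim", "Lee", "Park"]]

# Senior-author count (Lotka's practice): one credit per paper.
senior_count = Counter(byline[0] for byline in papers)

# Full count: every named author receives a credit for the paper.
full_count = Counter(name for byline in papers for name in byline)

n_papers = len(papers)
n_authorships = sum(full_count.values())

print(n_papers, dict(senior_count))
print(n_authorships, dict(full_count))
```

Under the full count, the total of credits equals the number of authorships rather than the number of papers, and authors who never appear first on a byline enter the distribution.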

Method of Analysis

Lotka’s law states that an inverse exponential function,

$$x^n \cdot y = c \tag{1}$$

is a close description of the relation of x and y, in which y is the number of authors in a subject, each credited with x number of publications. To test a given set of author data, the first concern is to find the optimal theoretical distribution from the data set in order to conduct a statistical goodness-of-fit test. That is, the values of n and c must be computed. The method chosen for this experiment consisted of procedures used to calculate the values of n and c and application of the Kolmogorov-Smirnov test of conformity [8]. However, there was one troublesome aspect directly traceable to Lotka’s original methodology. The logarithms of x and y were plotted to determine visually the approximate number of points to be included in the calculation of the regression equation [1]. Naturally it would be desirable to replace the visual inspection with a more precise method in order to find the optimal cutoff for the distribution. The following attempts were made.
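Taking logarithms of Eq. (1) gives log y = log c − n log x, so n can be estimated as the negative slope of a least-squares line through the points (log x, log y). The following Python sketch illustrates this step only; the function name `fit_loglog` and the synthetic data are assumptions introduced for the example, not part of the original procedure.

```python
import math

def fit_loglog(xs, ys):
    """Least-squares fit of log10(y) = a - n*log10(x); returns (n, a, r2)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    k = len(lx)
    mx, my = sum(lx) / k, sum(ly) / k
    sxx = sum((u - mx) ** 2 for u in lx)
    syy = sum((v - my) ** 2 for v in ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy)   # coefficient of determination
    return -slope, intercept, r2

# Synthetic data that follow an exact inverse square law (illustration only).
xs = list(range(1, 11))
ys = [1000 / x ** 2 for x in xs]
n_est, a_est, r2_est = fit_loglog(xs, ys)
print(n_est, r2_est)
```

With real author counts the fitted slope depends on how many high-productivity points are retained, which is the cutoff problem taken up next.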

Determination of the Number of Data Points To Be Used in the Computation of the Regression Equation

Although it has been known for some time that Lotka’s law as stated in Eq. (1) requires some adjustment for the high producers in the group, no clear cutoff has been offered [34]. Price discussed the square root law of elitism [35] and suggested that any population of size $\sum y$ contains an effective elite of size $\sqrt{\sum y}$ who are the high producers in the field, implying that these authors do not conform to the inverse exponential relationship of productivity. On the other hand, Yablonsky noted that the distribution of low producing scientists is best described by the frequency approach of Lotka and that the high producers are best characterized by the rank approach of Zipf [36]. Thus scientists with $\sqrt{y_1}$ or more publications are considered high producers, where $y_1$ is the number of single-paper authors. Productivity $i = \sqrt{y_1}$ becomes the “watershed” separating the heavy weights from those of low productivity. Without any theoretical basis upon which to rely, these two measures appear to be practical guides in the initial determination of the author group to be used in computing the slope of the regression line.
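As a rough illustration, the two guides can be computed directly from a frequency distribution. The sketch below assumes that Price’s elite is read as the square root of the total number of authors and Yablonsky’s watershed as the square root of the number of single-paper authors; the function name `elite_cutoffs` and the distribution are hypothetical.

```python
import math

def elite_cutoffs(dist):
    """dist maps x (papers) -> y (authors). Returns the two guide values."""
    total_authors = sum(dist.values())
    price_elite_size = math.sqrt(total_authors)        # Price: size of the elite
    yablonsky_watershed = math.sqrt(dist.get(1, 0))    # Yablonsky: productivity i = sqrt(y_1)
    return price_elite_size, yablonsky_watershed

# Hypothetical distribution used only to show the arithmetic.
example = {1: 100, 2: 25, 3: 11, 4: 6, 5: 2}
print(elite_cutoffs(example))
```

For this toy distribution of 144 authors, the first guide suggests an elite of about 12 authors and the second treats authors with 10 or more papers as high producers.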

Statistical Indicators for the “Best” Fit of the Regression Line

In statistics, a regression equation describes the functional relationship between the two variables, x and y. The aim is to make an estimate of one variable from the other. The standard error of estimate measures the divergence of the actual values of y, i.e., the dependent variable, from the computed values. This measure gives the degree of dependability of estimates made by the regression equation in absolute terms. Lastly, r2, or the coefficient of determination, enables one to state the relative amount of variation in y that has been explained by the regression equation. This is obtained by taking the square of the coefficient of correlation r between the two variables [37]. The extent of correlation r is also known as the Pearson product-moment correlation coefficient and is directly derivable from the standard error of estimates.

The object of the first procedure is to find the “best” regression line for the data set. Given an approximate area at which the distribution should be truncated, a regression equation can be calculated for each version of the distribution by removing one data point at a time from the high productivity end of the distribution. That is, the value of the slope can vary depending on the number of data points used in the computation of the regression equation. In most cases the value of n tends to stabilize around the cutoff [8].

The value of n will take on a small variation unless the removed data point deviates substantially from the line. In computing the regression equation, one could identify those points by noting that they are points with large standard residuals. In other words, those points with large deviations from the expected values can be singled out as those observations with two or more standard deviations (σ) from the regression line. Thus, if these were values for the more productive members of the population, they could be safely eliminated from the calculation to form a closely fitted line. On the other hand, if these points are points for authors with fewer publications, the resultant regression line would probably not be descriptive of the population as a whole. In sum, a regression line is sought that can give the best fit to the set of data.

In computing the regression equation by means of statistical packages such as MINITAB, the output printout identifies each individual observation together with its expected value and its standardized residual. Those with two or more standard deviations are tagged. Thus in computing several possible regression equations for each data set with a different number of points from the high producing end removed, it was possible to find regression lines without any point of large standard deviation, or with the least number of such points. Only 6 of the 48 sets of data needed to include points with large deviations. All four distributions taken from Lotka’s article had at least one such point; the chemistry data had three points. The distributions from the Library of Congress data and schistosomiasis literature each had one such point.

Lastly, the coefficient of determination r2 is the proportion of the variation explained by the regression line. The object was to have the highest possible value of r2 so that the proportion explained by the regression line would be as close to 100% as possible. It could be computed as either the square of the coefficient of correlation r or from the sum of the squares of the deviations of the estimated y values from their own mean, which is also the mean of the actual y values, i.e., $\sum (\hat{y}_i - \bar{y})^2$, expressed as a proportion of the total variation $\sum (y_i - \bar{y})^2$.

In the experiment, several regression lines were calculated from different cutoff points of each set of data. The one with as few large deviations as possible was selected. At the same time, this was balanced by the guideline that the number of authors excluded from consideration in this regression equation must lie within the larger of the two values suggested by Price and Yablonsky: $\sqrt{\sum y}$, the square root of the total number of authors in the group, or the excluded authors must each publish more than $\sqrt{y_1}$ publications. In cases of ties, the regression line with the larger coefficient of determination was preferred. Thus it was possible to obtain the best value for n, the slope of the regression line, with the best fit for the author data.
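The selection procedure can be sketched in Python as follows. This is an approximation under stated assumptions, not the original computation: standardized residuals are taken as residual / (s · sqrt(1 − leverage)), which is assumed to match what a statistical package reports; the Price and Yablonsky guideline is reduced to a single cap on the number of excluded authors (`max_excluded_authors`); and the function names and data are hypothetical.

```python
import math

def fit_with_diagnostics(xs, ys):
    """Fit log10(y) = a + b*log10(x); return (n, r2, count of flagged points)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    k = len(lx)
    mx, my = sum(lx) / k, sum(ly) / k
    sxx = sum((u - mx) ** 2 for u in lx)
    syy = sum((v - my) ** 2 for v in ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = my - b * mx
    resid = [v - (a + b * u) for u, v in zip(lx, ly)]
    sse = sum(e * e for e in resid)
    r2 = 1.0 - sse / syy
    s = math.sqrt(sse / (k - 2))              # standard error of estimate
    flagged = 0
    if s > 0:
        for u, e in zip(lx, resid):
            leverage = 1.0 / k + (u - mx) ** 2 / sxx
            if abs(e / (s * math.sqrt(1.0 - leverage))) >= 2.0:
                flagged += 1                  # two or more standard deviations
    return -b, r2, flagged

def choose_cutoff(dist, max_excluded_authors, min_points=5):
    """Drop points from the high-x end one at a time; prefer the fit with the
    fewest flagged points, breaking ties by the larger r2."""
    xs = sorted(dist)
    best = None
    for keep in range(len(xs), min_points - 1, -1):
        excluded = sum(dist[x] for x in xs[keep:])
        if excluded > max_excluded_authors:
            continue                          # guideline on excluded authors violated
        kept = xs[:keep]
        n, r2, flagged = fit_with_diagnostics(kept, [dist[x] for x in kept])
        key = (flagged, -r2)
        if best is None or key < best[0]:
            best = (key, keep, n, r2)
    return best[1:] if best else None

# Hypothetical distribution: near inverse-square counts plus one odd high-x point.
dist = {x: round(1000 / x ** 2) for x in range(1, 9)}
dist[9] = 40
print(choose_cutoff(dist, max_excluded_authors=40))
```

The returned tuple gives the number of points retained, the estimated n, and r2 for the selected line.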

Calculation of the Constant C

With the value of n calculated, the value of C was computed by substituting the value of n into the following formula:

$$C = \frac{1}{\displaystyle\sum_{x=1}^{19} \frac{1}{x^{n}} + \frac{1}{(n-1)\,20^{\,n-1}} + \frac{1}{2\,(20^{n})} + \frac{n}{24 \times 19^{\,n+1}}}$$

The derivation of this formula has been given elsewhere [8].
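The formula is straightforward to evaluate numerically. The Python sketch below is one possible transcription; the function name `lotka_c` and the parameter `p` (the point at which the sum is replaced by the correction terms, 20 in the formula above) are illustrative choices.

```python
import math

def lotka_c(n, p=20):
    """C = 1 / [ sum_{x=1}^{p-1} x^-n + 1/((n-1) p^(n-1)) + 1/(2 p^n) + n/(24 (p-1)^(n+1)) ]."""
    if n <= 1:
        raise ValueError("the series converges only for n > 1")
    head = sum(1.0 / x ** n for x in range(1, p))
    tail = (1.0 / ((n - 1) * p ** (n - 1))
            + 1.0 / (2 * p ** n)
            + n / (24 * (p - 1) ** (n + 1)))
    return 1.0 / (head + tail)

print(round(lotka_c(2.0), 4))      # inverse square law
print(round(lotka_c(2.8370), 4))   # slope reported for the first row of Table 1
```

For n = 2 the value agrees closely with 6/π², the constant of the inverse square law, which gives a quick check on the transcription.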

Statistical Test of Conformity

The appropriate columns in Table 1 show the optimal values of the slope n and the constant C that characterized each set of data. At this point the Kolmogorov-Smirnov goodness-of-fit test was applied to each of the data sets against its theoretical distribution, as suggested by Coile’s bibliometric study [4]. Lotka’s expected distribution was calculated in each case by substituting the computed values of n and C. Under the condition that grouped data were inevitable in the experiment, the Kolmogorov-Smirnov test is more powerful than the chi-square test [8]. In each case, the maximum deviation between the cumulative proportions of the observed and estimated y was noted. Since each of the data sets contained more than 35 authors, the critical values at the .01 level and the .05 level were calculated by $1.63/\sqrt{\sum y}$ and $1.36/\sqrt{\sum y}$, respectively, where $\sum y$ is the total population under consideration. Results of the tests are summarized in Table 1.
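A minimal Python sketch of the test follows. It assumes that the expected proportion of authors with x papers is C/x^n and that observed proportions are taken over all authors in the distribution; the function name `ks_lotka` and the sample distribution are hypothetical.

```python
import math

def ks_lotka(dist, n, c):
    """dist maps x -> observed y. Returns (D_max, critical .01, critical .05)."""
    total = sum(dist.values())
    obs_cum = exp_cum = d_max = 0.0
    for x in range(1, max(dist) + 1):
        obs_cum += dist.get(x, 0) / total     # observed cumulative proportion
        exp_cum += c / x ** n                 # expected cumulative proportion
        d_max = max(d_max, abs(obs_cum - exp_cum))
    root = math.sqrt(total)
    return d_max, 1.63 / root, 1.36 / root

# Hypothetical distribution tested against the inverse square law (n = 2, C = 0.6079).
sample = {1: 60, 2: 15, 3: 7, 4: 4, 5: 2, 6: 2}
d_max, crit_01, crit_05 = ks_lotka(sample, 2.0, 0.6079)
print(d_max < crit_01)
```

A distribution is taken to conform when D_max does not exceed the critical value; for a population of 306 authors, for example, the .01 critical value works out to 0.0932, as listed in the first row of Table 1.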

Discussion

Conformity

The 48 sets of author data represented 20 subject fields and 3 large research library catalogs. Not surprisingly, overwhelming conformity with Lotka’s law was found. A wide range of subjects were represented. Nine data sets did

not fit the law. The Kolmogorov-Smirnov statistics at both the .Ol and .05 levels were applied. Of the 39 remaining

data sets, only 3 lay between these critical values. Although it is entirely possible that nine subjects did

not fit the law, irregularities were found in the data of these distributions. Of the nine nonconforming distributions, two were taken from Lotka’s chemistry data. These figures were collected by a highly irregular sampling technique, which has been noted earlier [5]. The author’s names

listed under the A’s and B’s were given individually and were combined for these two alphabets. They were tested to observe if there were any differences between subsets

versus the parent set. Although the distribution from the names starting with A fit the theoretical distribution, the other two distributions did not. Because of the obvious ir- regularity, conclusive evidence was lacking for the data from chemistry.

The nonconformity of the distribution taken from the Library of Congress personal names file may be attributed to a lack of detail in the data set. Names with more than 10 publications were presented in a grouped format, thus precluding the application of the method of finding the best regression line to fit the observed data. In using only the first 10 points given, more authors had to be excluded than the guidelines set by Price or Yablonsky allowed. Therefore, it was not possible to test this distribution using the same method as the others.

Of the six remaining nonconforming data sets, there was little information on Rudmann’s data or Mantell’s Human Engineering Bibliography. The article by Rao made no mention at all of the nature of the data or of the manner in which they were collected. It was only known that the human engineering bibliography was compiled by the Office of Naval Research and that the items may have come from a single year.

It was noted that the law appears to be insensitive to a variety of time frames. Some data sets were collected from over 100 years; others were taken from as few as a single year’s publication period. However, of those from a single year’s output, most were published in a variety of sources. The five annual distributions for information science were compiled from Information Science Abstracts, which includes many primary sources. However, the two nonconforming distributions published in Williams’s article on mycology and entomology contained articles on their respective subjects from a single year’s papers found in a single journal. For mycology, he took the data from an earlier paper by Dufrenoy. Intuitively, Lotka’s law has been considered to be a description of the papers contributed by authors in a field. A reasonable time span is needed to give authors the opportunity to produce a representative sample. In all likelihood, most authors tend not to send all their papers to a single journal in any given year. Therefore, if a shorter period is used, items should be collected from at least most of the possible primary sources. The distribution would then reflect a better representation of the contributions in a subject. In those cases where only a single journal such as the Communications of the Association for Computing Machinery or IEEE was chosen, each appears to have been a major journal in its respective field, and several years were included in order to capture a representative percentage of contributions by their contributors.

30 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE-January 1986

TABLE 2. Correlations of parameters in distributions.

Variables (rows and columns, in order): Number of authors; Number of papers; Percentage of authors included; r² (R-squared); Standard deviation; n; C; K-S statistic (.05 level of significance); Dmax; Dmax (n = 2).

1.00 -0.131 0.148 -0.184 -0.095 -0.073 -0.274 0.021

- - -0.137 0.149 -0.187 -0.102 -0.081 -0.276 0.021 - - -0.211 0.412 0.601 0.631 0.360 0.178

-0.905 0.270 0.228 -0.594 -0.442

- 0.061 0.117 0.619 0.462

- 0.969 0.060 -0.158 - - - - 0.103 -0.150

0.447

-0.135

-0.141

0.537

0.289

-0.030

0.848 0.811

0.217

-0.162

One finds it more difficult to explain the causes of the nonconformity of the data from the Brazilian schistosomiasis literature and the literature on econometrics. The schistosomiasis bibliography spans a long period, and data were taken from numerous sources. However, the criteria of inclusion are questionable: the papers had to be published in a Brazilian journal, published in Brazil, or the author had to be Brazilian. Therefore, many items included were by Brazilians published in non-Brazilian journals, although other papers on schistosomiasis in these same journals were excluded. These selection criteria may have introduced a bias that adversely influenced the shape of the distribution. There is a similar clue in Leavens’s article on the econometric data. The data were taken from 20 annual volumes of a major journal in the field; however, they also include papers presented at meetings of the Society for Econometrics, in which a single contribution may be a full paper, an abstract, or even a mere listing of the title. One can only speculate on the effect caused by such techniques of data collection.

Other than the obvious correlation between the two constants n and C, there was little correlation between the other parameters (Table 2). The most surprising fact was that the data compiled from a short period, such as a single year, had larger values of n. The commonly held notion that longer periods of data offer a fairer representation of an author group is warranted. It is clear, however, that a more conservative method should be employed in the compilation of data to avoid any possible problems.

Coauthors Versus Senior Authors

Each of the nine sets of author data was tested using only senior authors and tested again using all coauthors. Thus it was possible to compare the differences in the testing of Lotka’s law. Obviously, there were more single-paper authors in the coauthor distribution. Using the same number of points to compute the value of n, the slope of the regression line had to be larger. For the nine sets, the differences in maximum deviation between the two ways of data compilation were not large enough to cause nonconformity with the law. Only one of the nine sets produced a maximum deviation between the .01 and .05 levels. On the other hand, it must be pointed out that none of the nine data sets was representative of the scientific or biomedical literatures, in which coauthor activity was significantly higher, as shown in the two drug literatures. The average number of papers per author in these two literatures was substantially lower than in the rest of the sample, since all coauthors were counted. Therefore, more tests in the scientific literature are needed for further study.
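The two ways of compiling the data compared above can be illustrated with a short sketch. The function name `productivity_distribution` and the byline data are hypothetical and introduced only for illustration; the sketch assumes each paper is given as a list of author names with the senior author first.

```python
from collections import Counter

def productivity_distribution(papers, senior_only):
    """Return {x: y}: y authors are credited with x papers each.

    papers: list of author lists, senior author first.
    senior_only: True credits only the first-named author (Lotka's count);
                 False credits every listed author (all coauthors).
    """
    papers_per_author = Counter()
    for authors in papers:
        credited = authors[:1] if senior_only else authors
        for name in credited:
            papers_per_author[name] += 1
    # Count how many authors share each productivity level x.
    return dict(Counter(papers_per_author.values()))

# Hypothetical byline data, invented for illustration only.
papers = [["Adams", "Baker"], ["Adams"], ["Clark", "Adams", "Davis"], ["Baker"]]
senior = productivity_distribution(papers, senior_only=True)
full = productivity_distribution(papers, senior_only=False)
```

Either resulting distribution can then be fitted and tested in the same way, which is what makes the comparison between the two counting schemes possible.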

Population Size

One of the most important factors influencing the conformity of Lotka’s law is the total population of authors under consideration. Since all author data exceeded a total of 35 authors, the critical values at the .01 and .05 levels were calculated by $1.63/\sqrt{\sum y}$ and $1.36/\sqrt{\sum y}$, respectively. Thus the critical value is inversely proportional to the square root of the total population. When a truly large population size of half a million is taken, the critical value decreases dramatically, making the allowable maximum deviation smaller. That is, the condition for conformity is more restrictive. Unfortunately, the two large populations in the sample, the 6891 authors from Chemical Abstracts and the 695,074 personal names from the Library of Congress file, did not meet the prerequisite condition for proper testing. Although the conformity of the law tolerates substantial differences in population, from 97 to 3512 in this sample, a truly comprehensive population size awaits testing.
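The effect of population size on the allowable deviation can be checked with a small calculation. The sketch below is illustrative only; the function name `ks_critical_value` is an assumption introduced here, and the population sizes are the ones mentioned in this section.

```python
import math

def ks_critical_value(total_authors, level=0.05):
    """Large-sample Kolmogorov-Smirnov critical value for a given population."""
    coefficients = {0.05: 1.36, 0.01: 1.63}
    if total_authors <= 35:
        raise ValueError("the large-sample formula assumes more than 35 authors")
    return coefficients[level] / math.sqrt(total_authors)

# Population sizes mentioned in the text: smallest and largest tested sets,
# Chemical Abstracts, and the Library of Congress personal names file.
for population in (97, 3512, 6891, 695074):
    print(population,
          round(ks_critical_value(population, 0.05), 4),
          round(ks_critical_value(population, 0.01), 4))
```

Because the critical value shrinks with the square root of the population, a population 100 times larger allows only one-tenth of the maximum deviation.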

Subset Versus Parent Set

It was not possible to draw many meaningful conclusions from one of the two subjects tested. Since the results from Chemical Abstracts had to be disregarded, comments are limited to the subject of the American Revolution. The Shy bibliography is truly a comprehensive and quality selection from the total literature. Its subset is limited only by a shortened time period and can be considered a true subset of the total set, using only one-third of the total data. Except for the steeper slope, it was found that the subset is almost a replica of the original set. On this basis, it was concluded that true subsets from properly compiled data can be representative of the population.

Quality Versus Quantity Selection

Except for the subject of the American Revolution, there is no direct parallel between quality and quantity selections on the same subject. When representative selections were made based either on quality or on comprehensiveness, such as those on the American Revolution, there appeared to be very little difference in the values of the slope or the constant C. The maximum deviations were quite similar.

Although insufficient information was found in the source papers on the selection criteria applied, the data sets selected by quality, such as Drosophila, map librarianship, physics, and legal medicine, were taken from a long time span. The slope tended to be in the range of 1.76 to 2.54. Their maximum deviations also tended to be smaller in value. One can tentatively conclude that a representative selection based on quality or on quantity reflects a true cross section of the total population. Therefore, in testing Lotka’s law, either criterion may be utilized.

The “Best” Regression Line

One of the key problems in testing Lotka’s law is to find the “best” regression line. In the past, the observed data distribution has been truncated by a visual inspection of the graph obtained by plotting the logarithms of x and y; this excluded the points representing the highly productive authors, which usually fluctuate wildly. Nevertheless, a precise method to determine the exact cutoff would be more desirable. It was found that a fairly substantial variation in the value of n can produce a maximum deviation of the cumulative percentage of authors that still lies within the critical value for a test of statistical significance. On the other hand, the nonconforming distributions in the sample deviated so widely from the theoretical distribution that, no matter how the distribution was truncated, the resulting regression line did not lie within the critical region for conformity. The reason is obvious. In fitting the regression line to the observed distribution by systematically deleting highly productive authors, one finds that the slope stabilizes after the few end points are removed. Therefore, in practice, a visual inspection to determine the cutoff point is a close first approximation. From there, several calculations may be made to determine the best regression line.

All the data sets were also tested against the inverse square relation, i.e., n = 2 (see Table 1). Only 7 of the 48 distributions fit the theoretical distribution with n = 2. In 6 of the 7 cases, the value of n computed from actual data was 2.00 ± 0.20. The exception was 2.272 from the coauthor data in computational musicology. Despite Coile’s arguments that each set of data ought to be tested against the theoretical distribution based on n = 2, a more general case could be made from the results of this experiment. The value of n ranged from 1.7828, taken from Drosophila, to a high of 3.7747 from the 1968 information science data. When the parameters for each case were extracted from the observed data, each distribution conformed to the law. Clearly, Lotka’s law can be more accurately described as an inverse exponential function rather than an inverse square function.
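A short worked equation may help show why the constant C depends on n. It rests on an assumption that is not spelled out in this section: that y_x denotes the proportion of authors contributing x papers, so that the proportions sum to one over all x. Under that assumption:

```latex
y_x = \frac{C}{x^{n}}, \qquad
\sum_{x=1}^{\infty} y_x = 1
\;\Longrightarrow\;
C = \frac{1}{\sum_{x=1}^{\infty} x^{-n}} = \frac{1}{\zeta(n)},
\qquad
n = 2:\quad C = \frac{6}{\pi^{2}} \approx 0.6079 .
```

Here \zeta denotes the Riemann zeta function. Since \zeta(n) decreases as n increases for n > 1, a steeper slope implies a larger C under this normalization, which agrees with the strong correlation between n and C noted in the discussion of Table 2.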

Conclusion

The principal aim of this paper was to empirically examine author productivity data to determine if there were characteristics that influenced the conformity to Lotka’s law. The findings indicated conclusively that most of the data did not fit the inverse square function. The two constants in Lotka’s formulation, the slope n and the constant C, must be derived from the observed distribution, and the inverse square law, thought to apply to scientific literature, was but a special case of the inverse exponential relationship. In general, data from nonscientific subjects collected with quality or with comprehensiveness as selection criteria deviated substantially from the inverse square relationship. However, no relation was found between the value of n and the so-called “hardness of science.”

Over 80% of the data sets conformed to Lotka’s law. From the empirical evidence, it is recommended that data be compiled from a comprehensive source to capture a true representation of the target population. Either quality or quantity may be used as the selection criterion. If only a single major primary journal is used to collect data, a longer period of coverage is advised.


The focus of this method of testing was to fit as many data points as possible to the regression line, from which the theoretical distribution was computed. The high-producing authors in the group were excluded up to the maximum allowed by the guidelines set by Price and Yablonsky. Within this limitation and from the possible regression equations, the one selected had the largest coefficient of determination. In other words, the equation should explain the maximum proportion of the variation in the data points and thus minimize the standard error of estimate. These considerations should identify the best regression line that gives the optimal fit to the set of observed data. This method comes close to a truly replicable method to test Lotka’s law.
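The selection rule just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used in the study: the function names are hypothetical, and the parameter `min_points` merely stands in for the Price and Yablonsky guidelines, which are not spelled out in this section. Each candidate cutoff keeps the lowest-x points, a least-squares line is fitted to log y against log x, and the fit with the largest coefficient of determination is kept.

```python
import math

def fit_loglog(points):
    """Least-squares fit of log10(y) on log10(x); returns (n, r_squared)."""
    lx = [math.log10(x) for x, _ in points]
    ly = [math.log10(y) for _, y in points]
    m = len(points)
    mean_x, mean_y = sum(lx) / m, sum(ly) / m
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return -slope, r_squared  # Lotka's n is the negated slope

def best_regression_line(observed, min_points):
    """Return (n, r_squared, points_used) for the cutoff with the largest r^2.

    observed: dict mapping x (papers) -> y (authors), with y > 0.
    min_points: fewest low-x points that must be kept (assumed stand-in
                for the exclusion limit set by the guidelines).
    """
    if min_points < 3:
        raise ValueError("at least three points are needed for a meaningful r^2")
    points = sorted(observed.items())
    best = None
    for k in range(min_points, len(points) + 1):
        n, r2 = fit_loglog(points[:k])
        # '>=' prefers the cutoff that keeps more points when fits tie.
        if best is None or r2 >= best[1]:
            best = (n, r2, k)
    return best

# Hypothetical data following y = 3600 / x**2 exactly, for illustration only.
example = {1: 3600, 2: 900, 3: 400, 4: 225, 5: 144, 6: 100}
n_hat, r2_hat, used = best_regression_line(example, min_points=3)
```

The constant C and the Kolmogorov-Smirnov test would then be computed from the selected value of n.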

Acknowledgments

The author gratefully acknowledges the support of two grants awarded by the National Library of Medicine, NIH Grants R01-LM04177 and K04-LM00078. The computing cost was generously covered by Case Western Reserve University. Help in the data analysis was provided by Ida Hutomo and Teresa W. T. Fok.

References

1. Lotka, A. J. “The frequency distribution of scientific productivity.” Journal of the Washington Academy of Sciences. 16(12):317-323; 1926.

2. Vlachy, J. “Distribution patterns in creative communities.” World Congress of Sociology, Toronto; 1974; 1-20.

3. Hubert, J. J. “Letter to the Editor-Lotka’s law in the humanities.” Journal of the American Society for Information Science. 28(1):66; 1977.

4. Coile, R. C. “Lotka’s frequency distribution of scientific productivity.” Journal of the American Society for Information Science. 28(6):366-370; 1977.

5. Potter, W. G. “Lotka’s law revisited.” Library Trends. 30(1):21-39; 1981.

6. Drott, M. C.; and Griffith, B. C. “An empirical examination of Bradford’s law and the scattering of scientific literature.” Journal of the American Society for Information Science. 29(5):238-246; 1978.

7. Vlachy, J. “Time factor in Lotka’s law.” Probleme de Informare si Documentare. 10(2):44-87; 1976.

8. Pao, M. L. “Lotka’s law: A testing procedure.” Information Processing and Management. 21(4):305-320; 1985.

9. Vlachy, J. “Frequency distribution of scientific performance: A bibliography of Lotka’s law and related phenomena.” Scientometrics. 1:109-130; 1978.

10. Vlachy, J. “Evaluating the distribution of individual performance.” Scientia Yugoslavica. 6(1-4):267-275; 1980.

11. Aiyepeku, W. O. “The productivity of geographical authors: A case study from Nigeria.” Journal of Documentation. 32(2):105-117; 1976.

12. Voos, H. “Lotka and information science.” Journal of the American Society for Information Science. 25(4):270-272; 1974.

13. Coile, R. C. “Letter to the Editor-Lotka and information science.” Journal of the American Society for Information Science. 26(2):133-134; 1975.

14. Voos, H. “Letter to the Editor-Author’s reply.” Journal of the American Society for Information Science. 26(2):134; 1975.

15. Schorr, A. E. “Lotka’s law and map librarianship.” Journal of the American Society for Information Science. 26(3):189-190; 1975.

16. Schorr, A. E. “Lotka’s law and history of legal medicine.” Research in Librarianship. 5(30):205-209; 1975.

17. Schorr, A. E. “Lotka’s law and library science.” Reference Quarterly. 14(1):32-33; 1974.

18. Murphy, L. J. “Lotka’s law in the humanities?” Journal of the American Society for Information Science. 24(6):461-462; 1973.

19. Worthen, D. B. “Short-lived technical literatures: A bibliometric analysis.” Methods of Information in Medicine. 17(3):190-198; 1978.

20. Braga, G. M. “Dynamics of Scientific Communication: An Application to Science Funding Policy.” Unpublished dissertation. Cleveland, OH: Case Western Reserve University; 1977.

21. Hersh, A. H. “Drosophila and the course of research.” Ohio Journal of Science. 42:198-200; 1941.

22. Muller, H. J. Bibliography on the Genetics of Drosophila. Edinburgh, Scotland: Oliver & Boyd; 1939.

23. Leavens, D. H. “Letter to the Editor.” Econometrica. 21:630-632; 1953.

24. Rao, I. K. R. “The distribution of scientific productivity and social change.” Journal of the American Society for Information Science. 31(2):111-122; 1980.

25. Mantell, L. H. “On laws of special abilities & the production of scientific literature.” American Documentation. 17(1):8-16; 1966.

26. Radhakrishnan, T.; and Kernizan, R. “Lotka’s law and computer science literature.” Journal of the American Society for Information Science. 30(1):51-54; 1979.

27. Williams, C. B. “The numbers of publications written by biologists.” Annals of Eugenics. 12:143-146; 1944.

28. Pao, M. L. “Bibliometrics and computational musicology.” Collection Management. 3(1):97-109; 1979.

29. Pao, M. L. “Collaboration in computational musicology.” Journal of the American Society for Information Science. 33(1):38-43; 1982.

30. Pao, M. L. “Characteristics of American Revolution literature.” Collection Management. 6(3):119-128; 1984.

31. McCreery, L. S. “Bibliometric Study of Ethnomusicology, A Humanities Subject.” Unpublished dissertation. Cleveland, Ohio: Case Western Reserve University; 1984; also data collected during the study.

32. Potter, W. G. “When names collide: Conflict in the catalog and AACR2.” Library Resources & Technical Services. 24:7-16; 1980.

33. McCallum, S. H.; and Godwin, J. L. “Statistics in headings in the MARC file.” Journal of Library Automation. 14(3):194-201; 1981.

34. Price, D. de S. Little Science, Big Science. New York, N.Y.: Columbia University; 1963.

35. Price, D. de S. “Some remarks on elitism in information and the invisible college phenomenon in science.” Journal of the American Society for Information Science. 22(2):74-75; 1971.

36. Yablonsky, A. I. “On fundamental regularities of the distribution of scientific productivity.” Scientometrics. 2(1):3-34; 1980.

37. Croxton, F. E.; and Cowden, D. J. Applied General Statistics. Englewood Cliffs, N.J.: Prentice-Hall; 1966; 454-463.
