[handbook of statistics] bioenvironmental and public health statistics volume 18 || 9...



Page 1: [Handbook of Statistics] Bioenvironmental and Public Health Statistics Volume 18 || 9 Non-parametrics in bioenvironmental and public health statistics

P. K. Sen and C. R. Rao, eds., Handbook of Statistics, Vol. 18 © 2000 Elsevier Science B.V. All rights reserved.

Non-parametrics in Bioenvironmental and Public Health Statistics

Pranab Kumar Sen

1. Introduction

Some of the major interdisciplinary issues pertaining to the consolidated bioenvironment and public health disciplines have been addressed in Sen (2000), where the statistical perspectives in this context have also been appraised. In bioenvironmental and public health studies, the statistical triplet of planning, modeling and analysis schemes plays a fundamental role. Because of various extraneous factors, the setups may be quite different from the conventional experimental ones, and hence these schemes are generally of nonstandard types. It may therefore be wiser to examine critically the suitability of standard parametrics, and to appraise how far non-parametrics (comprising both semiparametrics and nonparametrics) can be advocated as a better alternative. In this sense, nonparametrics refers to 'beyond parametrics': methods that emerged from the parametrics and carried the torch well into the domain of modern nonparametrics and semiparametrics. The current study relates to this appraisal task, with special reference to some of the basic statistical problems that are relevant in this broad field of investigation. As some of these problems have been studied in detail in other accompanying articles in this volume, we avoid duplication as far as possible by suitable cross-referencing to them.

Statistical planning, modeling and analysis have made remarkable progress in various areas of bioenvironmental and public health studies. In some of these areas, specific models have emerged from natural or underlying background factors, and hence the relevant statistical analysis schemes have been posed as reasonably model oriented, or parametric in flavor. In some other areas, the situation is somewhat different: a specific parametric model, though it may be chosen for convenience and simplicity, may not be very appropriate for the particular application. Non-parametrics may be particularly relevant in the latter context: whereas parametrics may not have sufficient strength from validity and robustness perspectives, non-parametrics may fare better in these respects. For this reason, we cover the following broad areas, contrast the parametrics with non-parametrics, and illustrate the effective role of the latter.


1. Quantitative bioassays: Dose-response regression functions;
2. Quantal bioassays: Dose-response models;
3. Generalized linear and additive models;
4. Correlated polychotomous response data models;
5. Multivariate models in biostatistics;
6. Longitudinal data models;
7. Robust statistical inference in general linear models;
8. Nonlinear and nonparametric regression analysis;
9. Clinical trials and survival analysis;
10. Design and analysis of bioenvironmental studies;
11. Case-control studies;
12. Molecular biology and genetics.

With respect to the last four items, it might be appropriate to point out that in these contexts, the primary emphasis is often placed on risk assessment, following proper identification of the sources of hazards, the levels of exposure to such hazards, and a study of the dose-response relationships pertinent to the specific context. We refer to a very useful introduction to risk assessment (characterization) in a decision making setup by Ohanian et al. (1997). Most of the areas referred to above have been reviewed, in a nontechnical manner, in Sen (1999a); hence, we deal only with their statistical aspects here, and also relate them to some other accompanying articles in this volume where there is generally a more natural, application oriented emphasis. In this way, the current write-up provides a complementary methodological account of some non-parametrics that are potentially useful in bioenvironmental and public health statistical modeling and analysis.

2. Quantitative bioassays

In a bioassay, usually a new and an old preparation, termed respectively the test and the standard preparation, are applied to some living organisms, and based on their responses, one may want to assess the relative potency of the test preparation with respect to the standard one. Thus, typically, bioassays for assessing such relative potencies relate to clinical therapeutic equivalence trials, in the sense that the preparations need not be different forms of administration of a common drug. In contrast, there are other types of bioassays that deal with bioavailability and bioequivalence studies, where the relative bioavailability of different formulations (viz., tablet vs. liquid, dose/time and frequency of application, etc.) of essentially a common drug is studied; the emphasis is more on the equivalence or correspondence of doses etc. for such different formulations with respect to the primary response variable as well as concomitant ones. For this reason, statistical modeling and analysis schemes for bioassays in the conventional case and for bioequivalence studies may not be isomorphic. We place more emphasis here on the conventional bioassays, and refer to the article by Chinchilli et al. (2000) in this volume for some discussion of bioequivalence trials.

Bioassays may be either quantitative or of quantal type. Quantitative bioassays are of two different types: direct and indirect assays. In a direct assay, the dose required to produce a given response is stochastic, and from the dose distributions we desire to draw statistical conclusions on the biological study. Generally these assays are used for animal studies where the dose relates to a toxic substance, and the response relates to death or failure of a certain type. In a simple setup, if $F(x)$, $x \ge 0$, is the dose (tolerance) distribution, we may define the median lethal dose ($\mathrm{LD}_{50}$) or median effective dose ($\mathrm{ED}_{50}$) as the median of the distribution $F$. In a similar fashion, $\mathrm{LD}_{100\alpha}$ or $\mathrm{ED}_{100\alpha}$ can be defined for any $\alpha \in (0, 1)$; often, the $\mathrm{ED}_{90}$ or $\mathrm{ED}_{95}$ is of some importance in therapeutic studies.

It is generally the case that $F$ is highly skewed, so that normality of $F$ may not be a very reasonable assumption. Sometimes, a dosage based on a Box-Cox type transformation (viz., log-dose) is used to induce more symmetry in the tolerance distribution, and yet the normality assumption may not be very appropriate. Sometimes, logistic distributions are chosen as appropriate for such dosages, but that might result in a similar nonrobustness as in the normal case. Further, in the absence of a precise tolerance distribution, the mean dose or other parametric formulations of the central tendency of $F$ may not have much appeal from a practical standpoint. In this respect, the conventional nonparametric methods fare well. Recall that the median or a percentile of the original $F$ and that of the dosage tolerance distribution (say, $F^*$) are related by the same functional relationship that ties the dosage to the dose; in other words, these are equivariant under strictly monotone dosage transformations. This feature is not generally true for the mean: the mean of the log-dose values is the logarithm of the geometric mean of the original dose values, not the logarithm of their arithmetic mean, and hence the equivariance may not be tenable. On the other hand, ranks of the observations are invariant under any strictly monotone (not necessarily linear) transformation of the dose, and hence estimates based on suitable rank statistics share this invariance property as well.
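The equivariance of the median, and its failure for the mean, can be checked numerically. The following is a minimal sketch in Python; the dose values are hypothetical and chosen only to be right-skewed, and the sample size is odd so that the sample median is an actual observation.

```python
import math
import statistics

# Hypothetical, right-skewed tolerance doses (illustrative values only).
doses = [0.8, 1.1, 1.3, 1.9, 2.4, 3.7, 9.5]
log_doses = [math.log(d) for d in doses]

# The median commutes with the strictly increasing map log(.):
median_then_log = math.log(statistics.median(doses))
log_then_median = statistics.median(log_doses)

# The mean does not: the mean of the logs is the log of the geometric mean,
# which is smaller than the log of the arithmetic mean for non-constant data.
mean_of_logs = statistics.fmean(log_doses)
log_of_geom_mean = math.log(statistics.geometric_mean(doses))
log_of_arith_mean = math.log(statistics.fmean(doses))

print(median_then_log, log_then_median)
print(mean_of_logs, log_of_geom_mean, log_of_arith_mean)
```

With an even sample size the sample median is an average of two observations, so the identity then holds only approximately for the sample version, although it remains exact for the population median.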

Motivated by such considerations, we present the case of a direct bioassay involving a test (T) and a standard (S) preparation, and we denote the respective dose-tolerance distributions by $F_T(x)$ and $F_S(x)$, both defined on $\mathbb{R}^+ = (0, \infty)$. In a typical dilution assay, it is assumed that the test preparation behaves as if it were a dilution (or concentration) of the standard one. This feature can be statistically represented as

$$F_T(x) = F_S(\rho x), \quad \forall x \in \mathbb{R}^+, \ \rho > 0, \tag{2.1}$$

where $\rho$ is termed the relative potency of the test preparation with respect to the standard one. This constitutes the fundamental assumption of a direct dilution assay. The two main problems of interest are (i) to test for the validity of the fundamental assumption, and (ii) to draw statistical conclusions on the relative potency. Standard parametrics rest on the basic assumption that $F_S$ is normal (or that the log-dose for the standard preparation has a normal distribution). In the former case, the ratio of the means for the two preparations provides the estimate of $\rho$, while in the latter case, the difference of the means of the log-doses provides an estimator of $\log\rho$, and these estimates are not necessarily interrelated by the same dose to dosage transformation. Though the classical Fieller theorem provides a parametric resolution for drawing statistical inference on a ratio of parameters, it is subject to numerous shortcomings, lack of robustness to model departures being one of the major ones. No wonder there has been a spurt of research activity on this topic, where more and more emphasis is being laid on Bayes, empirical Bayes and hierarchical Bayes procedures. Nevertheless, these procedures may not possess the basic invariance property that statistical conclusions should not be affected by the choice of a particular dosage (dose transformation) or a response metameter. Nonparametric procedures satisfy such an equivariance property to a greater extent.

Let us work with the log-dose (= dosage), so that the two tolerance distributions (say, $F_S^*$, $F_T^*$) for the standard and test preparations then differ by the shift parameter $\log\rho$. Let $X_i^* (= \log X_i)$, $i = 1, \ldots, m$, stand for the dosages of the $m$ subjects in the test preparation group, and let $Y_i^* (= \log Y_i)$, $i = 1, \ldots, n$, be the dosages for the $n$ subjects used in the standard preparation. Consider the set of $mn$ paired differences

$$Z_{ij} = Y_i^* - X_j^*, \quad \text{for } i = 1, \ldots, n; \ j = 1, \ldots, m, \tag{2.2}$$

and denote their ordered values by $Z_{(k)}$, $k = 1, \ldots, N = mn$. Then a distribution-free point estimator of $\log\rho$, based on the classical two-sample Wilcoxon-Mann-Whitney statistic, is given by (Sen, 1963):

$$\log\hat\rho_{m,n} = \mathrm{Median}\{Z_{(k)} : 1 \le k \le N\}. \tag{2.3}$$

A distribution-free confidence interval for $\log\rho$ can similarly be obtained in terms of two complementary quantiles $Z_{(r)}$, $Z_{(N-r+1)}$, where $r$ is so chosen that under the null hypothesis $\log\rho = 0$, the Wilcoxon statistic lies between the corresponding critical values with a confidence coefficient equal to $1 - \alpha_N$ that is not smaller than the desired level $1 - \alpha$. For large $m$, $n$, $\alpha_N$ can be well approximated by $\alpha$. It also follows from the above that these estimates are equivariant under any common strictly monotone transformation on the dose, a feature that is not shared by the parametric point estimates and confidence intervals based on the Fieller theorem. Similarly, a test for the validity of the fundamental assumption can be based on a Q-Q plot of the two tolerance distributions, or on the constancy of the $p_j$-quantile differences for the two log-dose distributions, for a finite set of $p_j$ values.
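The point estimator in (2.3) and the order-statistic interval can be sketched as follows. This is only an illustration: the function name `log_potency_estimate` is hypothetical, and the choice of $r$ uses the large-sample normal approximation to the null distribution of the Mann-Whitney count (an assumption made here; exact tables would be used for small $m$, $n$). Rounding $r$ down widens the interval, which keeps it on the conservative side.

```python
import math
import statistics
from statistics import NormalDist

def log_potency_estimate(test_doses, standard_doses, alpha=0.05):
    """Median of the mn differences Y_i* - X_j*, with an approximate CI."""
    x_star = [math.log(x) for x in test_doses]       # dosages, test preparation
    y_star = [math.log(y) for y in standard_doses]   # dosages, standard preparation
    m, n = len(x_star), len(y_star)
    big_n = m * n
    z = sorted(y - x for y in y_star for x in x_star)  # ordered Z_(1), ..., Z_(N)
    point = statistics.median(z)

    # Normal approximation (assumption): the Mann-Whitney count has null mean
    # mn/2 and null variance mn(m + n + 1)/12.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    r = math.floor(big_n / 2 - z_crit * math.sqrt(big_n * (m + n + 1) / 12))
    r = max(r, 1)
    lower, upper = z[r - 1], z[big_n - r]             # Z_(r), Z_(N-r+1)
    return point, (lower, upper)

# Hypothetical doses in which the standard needs three times the test dose.
point, (lower, upper) = log_potency_estimate([1.0, 2.0, 4.0], [3.0, 6.0, 12.0])
print(math.exp(point), (math.exp(lower), math.exp(upper)))
```

Exponentiating the end points gives an interval for $\rho$ itself, which is legitimate because the interval is built from order statistics.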

Let us next consider the case of an indirect assay. Here the dose levels are nonstochastic, while the response variable is stochastic. Through a choice of some dosage and response metameter, usually a linear regression relationship is assumed to be true. There are two popular types of indirect quantitative bioassays, namely the parallel line and slope ratio assays, which adapt well to log-dose and power-dose transformations respectively. In the former case, the regression lines for the standard and test preparations are assumed to have a common slope, and the difference of their intercepts (adjusted by the slope) defines the log-relative potency. In the latter case, the two dosage-response regression lines are assumed to have a common intercept, and the ratio of the slopes defines a power of the relative potency. Viewed from this angle, we encounter nonlinear functions of the parameters of two independent regression lines, and in a parametric setup, if the errors can be assumed to be normally distributed, the Fieller theorem can be called on to provide the desired inferential tools. However, the normality assumption may be quite crucial in this context, and possible departures from this basic distributional assumption are likely to have a serious impact through lack of robustness. In this respect, nonparametric and semiparametric methods are more robust, and can even be quite efficient.

For a set of observations $(Y_i, t_i)$, $i = 1, \ldots, n$, the Sen-Theil estimator, a simple nonparametric estimator of the slope based on the Kendall tau coefficient, is given by the median of the divided differences:

$$\hat\beta_n = \mathrm{median}\{(Y_j - Y_i)/(t_j - t_i) : t_j \ne t_i, \ 1 \le i < j \le n\}, \tag{2.4}$$

and a distribution-free confidence interval for the slope can be obtained in a similar fashion in terms of suitable quantiles of these divided differences (Sen, 1968).
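The estimator in (2.4) takes only a few lines to compute. The sketch below uses a hypothetical function name, `sen_theil_slope`, and made-up data in which one response is a gross outlier; the median of the divided differences is unaffected by it.

```python
import statistics
from itertools import combinations

def sen_theil_slope(t, y):
    """Median of the divided differences (y_j - y_i)/(t_j - t_i), t_j != t_i."""
    slopes = [
        (y[j] - y[i]) / (t[j] - t[i])
        for i, j in combinations(range(len(t)), 2)
        if t[j] != t[i]
    ]
    return statistics.median(slopes)

# Hypothetical data: exact slope 2 on the first four points, last point an outlier.
t = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 100]
print(sen_theil_slope(t, y))
```

Six of the ten divided differences equal 2 here, so the median stays at 2, whereas a least squares slope would be pulled strongly towards the outlier.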

In a parallel line assay, we have two dosage-response regressions

$$Y_{Si} = \alpha_S + \beta_S x_i + e_{Si}; \quad Y_{Ti} = \alpha_T + \beta_T x_i + e_{Ti}; \quad \beta_S = \beta_T = \beta, \tag{2.5}$$

where $\beta$ is the common slope, and the intercept parameters satisfy the constraint that

$$\log\rho = (\alpha_T - \alpha_S)/\beta, \quad \rho > 0, \tag{2.6}$$

and $\rho$ is the relative potency of the test with respect to the standard preparation. Keeping this in mind, the divided differences from each preparation are pooled into a combined set, and $\hat\beta^0$, the median of this combined set, is then taken as the estimator of the common slope. Further, for each preparation, residuals ($\tilde Y_{Si} = Y_{Si} - \hat\beta^0 x_i$, $\tilde Y_{Ti} = Y_{Ti} - \hat\beta^0 x_i$) are then obtained by using this common slope estimator, and for each preparation, we compute the median of the midranges (pairwise averages) of the residuals, namely,

$$\mathrm{med}\{(\tilde Y_{Si} + \tilde Y_{Sj})/2 : i \le j\}, \quad \mathrm{med}\{(\tilde Y_{Ti} + \tilde Y_{Tj})/2 : i \le j\}, \tag{2.7}$$

which are used as the estimators of the intercept parameters (Sen, 1971); each estimator is based on the well known Wilcoxon signed-rank statistic relating to the residuals for the respective preparation. These estimates are then used to draw statistical conclusions on the relative potency measure. We refer to Sen (1971) for details.
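The three steps just described (pooled slope, residuals, medians of pairwise averages) can be put together as below. This is a sketch under simplifying assumptions rather than the procedure of Sen (1971) in full: it gives point estimates only, the function names are hypothetical, and the two preparations are allowed to have their own dosage vectors.

```python
import statistics
from itertools import combinations, combinations_with_replacement

def pairwise_slopes(x, y):
    """Divided differences within one preparation, as in (2.4)."""
    return [(y[j] - y[i]) / (x[j] - x[i])
            for i, j in combinations(range(len(x)), 2) if x[j] != x[i]]

def median_of_pairwise_averages(values):
    """Median of (v_i + v_j)/2 over i <= j, as in (2.7)."""
    return statistics.median(
        (a + b) / 2 for a, b in combinations_with_replacement(values, 2))

def parallel_line_log_potency(x_s, y_s, x_t, y_t):
    # Step 1: pool the divided differences of both preparations.
    beta0 = statistics.median(pairwise_slopes(x_s, y_s) + pairwise_slopes(x_t, y_t))
    # Step 2: residuals after removing the common slope.
    res_s = [y - beta0 * x for x, y in zip(x_s, y_s)]
    res_t = [y - beta0 * x for x, y in zip(x_t, y_t)]
    # Step 3: intercept estimates, then log(rho) from (2.6).
    alpha_s = median_of_pairwise_averages(res_s)
    alpha_t = median_of_pairwise_averages(res_t)
    return beta0, alpha_s, alpha_t, (alpha_t - alpha_s) / beta0

# Hypothetical noiseless lines with common slope 2 and intercepts 1 and 3.
x = [0.0, 1.0, 2.0, 3.0]
print(parallel_line_log_potency(x, [1 + 2 * v for v in x], x, [3 + 2 * v for v in x]))
```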

For the slope ratio assay, in the dose-response regression in (2.5), the intercepts are the same, that is, $\alpha_S = \alpha_T = \alpha$, while the relative potency $\rho$ can be expressed as

$$\rho^\lambda = \beta_T/\beta_S, \quad \lambda > 0, \tag{2.8}$$

where $\lambda$ relates to the power-dosage (i.e., dosage $= (\text{dose})^\lambda$). For each preparation, we estimate the slope by the estimator in (2.4). Then we compute the residuals for each preparation separately. Afterwards, we pool these residuals into a combined set, and compute the median of the midranges from this pooled set. This is the Wilcoxon score estimator of the common intercept parameter. Then statistical conclusions are drawn from the two slope estimators and the common intercept estimator, and suitable aligned rank tests are used for hypothesis testing problems. For details, we refer to Sen (1972).
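A matching sketch for the slope ratio assay follows. Again it gives point estimates only; the function names and the parameter `lam` (standing for $\lambda$) are hypothetical, and the inputs are assumed to be already on the power-dosage scale.

```python
import statistics
from itertools import combinations, combinations_with_replacement

def median_slope(x, y):
    """Slope estimator (2.4) for one preparation."""
    return statistics.median(
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2) if x[j] != x[i])

def slope_ratio_potency(x_s, y_s, x_t, y_t, lam=1.0):
    beta_s = median_slope(x_s, y_s)
    beta_t = median_slope(x_t, y_t)
    # Residuals are computed separately, then pooled into one set.
    pooled = [y - beta_s * x for x, y in zip(x_s, y_s)]
    pooled += [y - beta_t * x for x, y in zip(x_t, y_t)]
    # Common intercept: median of the pairwise averages of the pooled residuals.
    alpha = statistics.median(
        (a + b) / 2 for a, b in combinations_with_replacement(pooled, 2))
    rho = (beta_t / beta_s) ** (1.0 / lam)   # from rho**lam = beta_T / beta_S
    return alpha, beta_s, beta_t, rho

# Hypothetical noiseless lines with common intercept 5 and slopes 2 and 6.
x = [0.0, 1.0, 2.0, 3.0]
print(slope_ratio_potency(x, [5 + 2 * v for v in x], x, [5 + 6 * v for v in x]))
```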

In passing, we may remark that the use of the Wilcoxon signed-rank and two-sample rank-sum statistics, and of the Kendall tau statistic, has a special appeal: the solutions are closed, simple and quite robust for nearly normal error distributions, and they are asymptotically optimal when the underlying error distributions are logistic. It may be quite tempting to use general linear rank or signed-rank statistics based on appropriate scores (such as the log-rank and normal scores), which would retain efficiency to a greater extent without much compromise on the robustness properties of the derived estimates and tests. However, for such general rank statistics, the solutions generally have to be obtained by some iterative procedure, and the solutions prescribed here can be used as the preliminary estimates in this venture. The same remark pertains to robust estimators in linear models that are based on suitable M-statistics or L-statistics. At the present time, there is a vast literature on such robust statistical estimates and tests (viz., Jurečková and Sen (1996) for an up-to-date treatise of these developments) that can be tapped for bioassay problems as well. However, much of the computational simplicity could be lost in this way, and from an applications perspective, they may therefore have much less appeal to users.

3. Quantal bioassays

In many biometric and animal studies (dosimetric experiments), the response variable is typically binary, while, as in the case of an indirect assay, the dose levels are nonstochastic. Such a response variable is often characterized as quantal, that is, all or nothing, and bioassays pertaining to such studies are therefore termed (indirect) quantal assays. For each preparation (T and S), we conceive of a number of dose levels (along with other concomitant information), and at each level, out of the number of subjects to whom the dose is administered, a (random) number respond positively, while the rest do not. Thus, we conceive of a binomial model where the probability of a positive response depends on the administered dose level, and we are interested in estimating the quantal response relation from such binary data models (with a view to studying the relative potency of the test preparation with respect to the standard one). The classical probit (normit) and logit models are the most commonly used ones.

For a given dose level $d$ and other concomitant variates, denoted by $x$, the probabilities of a positive response for the S and T preparations are denoted by $\pi_S(d, x)$ and $\pi_T(d, x)$ respectively. As in the case of a quantitative (indirect) assay, we conceive of a model

$$\pi_S(d, x) = \pi(\alpha_S + \beta_S d + \gamma' x), \quad \pi_T(d, x) = \pi(\alpha_T + \beta_T d + \gamma' x), \tag{3.1}$$

where $\pi(t)$, $t \in \mathbb{R}$, is a proper distribution function. In a parallel line assay model, we assume that

$$\beta_S = \beta_T = \beta, \quad \alpha_T - \alpha_S = \beta \log\rho, \tag{3.2}$$

while in a slope-ratio assay, we let

$$\alpha_S = \alpha_T = \alpha, \quad \rho^\lambda = \beta_T/\beta_S, \tag{3.3}$$

where $\lambda$ appears in the power-dose transformation. In probit analysis, $\pi(t)$ is taken to be a normal distribution, while in logit models, it is taken to be a logistic distribution. In the latter case, we define the logit entities as

$$\zeta_S(x) = \log\{\pi_S(d, x)/(1 - \pi_S(d, x))\} = \alpha_S + \beta_S d + \gamma' x, \tag{3.4}$$

and a similar expression for the test preparation. For some discussion of the logit analysis (not necessarily in the context of a bioassay), we refer to the article by De Long and De Long (2000) in this volume. We therefore skip some of these details.

Consider first a parallel line assay conducted in a $2k$-point design, where for each preparation there are $k$ different dosage levels. For the standard preparation, at the $j$th dosage level $d_{Sj}$, let there be $n_{Sj}$ subjects, and let the number of positive responses be denoted by $r_{Sj}$; we let $p_{Sj} = r_{Sj}/n_{Sj}$, $j = 1, \ldots, k$. We denote the corresponding entities for the test preparation by $d_{Tj}$, $n_{Tj}$, $r_{Tj}$ and $p_{Tj}$, for $j = 1, \ldots, k$. Based on the two sets of sample logits

$$Z_{Sj} = \log\{p_{Sj}/(1 - p_{Sj})\}, \quad Z_{Tj} = \log\{p_{Tj}/(1 - p_{Tj})\}, \quad j = 1, \ldots, k, \tag{3.5}$$

along with their asymptotic normality and estimated variances, we consider the weighted sum of squares due to residuals:

$$Q(\alpha_S, \alpha_T, \beta, \gamma) = \sum_{j=1}^{k} \big[ n_{Sj} p_{Sj}(1 - p_{Sj})(Z_{Sj} - \alpha_S - \beta d_{Sj} - \gamma' x_{Sj})^2 + n_{Tj} p_{Tj}(1 - p_{Tj})(Z_{Tj} - \alpha_T - \beta d_{Tj} - \gamma' x_{Tj})^2 \big]. \tag{3.6}$$

We minimize this with respect to the unknown parameters, and obtain the estimating equations for the weighted least squares estimators based on this logistic regression model. These in turn provide the estimate of the relative potency $\rho$. Also, large sample tests for suitable hypotheses can be based on this set of estimators. A similar case holds for the slope-ratio assay (where the parametric restraints are different).
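The minimization of (3.6) is an ordinary weighted least squares problem, so it reduces to a small linear system. The sketch below makes two simplifying assumptions that are not in the text: there are no concomitant variates (the $\gamma' x$ term is dropped), and every observed proportion lies strictly between 0 and 1 so that the sample logits are finite. The function names are hypothetical.

```python
import math

def solve_linear(a, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def min_logit_parallel(d_s, n_s, r_s, d_t, n_t, r_t):
    """Minimize Q in (3.6) without covariates; returns alpha_S, alpha_T, beta, log(rho)."""
    rows, z, w = [], [], []
    for prep, (d, n, r) in enumerate([(d_s, n_s, r_s), (d_t, n_t, r_t)]):
        for dj, nj, rj in zip(d, n, r):
            p = rj / nj                                  # needs 0 < p < 1
            rows.append([1.0 - prep, float(prep), dj])   # columns: alpha_S, alpha_T, beta
            z.append(math.log(p / (1 - p)))              # sample logit (3.5)
            w.append(nj * p * (1 - p))                   # weight in (3.6)
    # Normal equations X'WX theta = X'Wz.
    xtwx = [[sum(wi * ri[u] * ri[v] for wi, ri in zip(w, rows)) for v in range(3)]
            for u in range(3)]
    xtwz = [sum(wi * ri[u] * zi for wi, ri, zi in zip(w, rows, z)) for u in range(3)]
    alpha_s, alpha_t, beta = solve_linear(xtwx, xtwz)
    return alpha_s, alpha_t, beta, (alpha_t - alpha_s) / beta   # log(rho) by (3.2)

# Hypothetical 6-point design: dosages -1, 0, 1 and 50 subjects per level.
print(min_logit_parallel([-1, 0, 1], [50] * 3, [9, 19, 31],
                         [-1, 0, 1], [50] * 3, [19, 31, 41]))
```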

For probit analysis, we choose the transformations $\Phi^{-1}(p_{Sj})$ and $\Phi^{-1}(p_{Tj})$, for $j = 1, \ldots, k$, and, side by side, consider the weighted sum of squares due to residuals (for the parallel line assay):

$$Q(\alpha_S, \alpha_T, \beta, \gamma) = \sum_{j=1}^{k} \left[ \frac{n_{Sj}\,\phi^2[\Phi^{-1}(p_{Sj})]}{p_{Sj}(1 - p_{Sj})} (Z_{Sj} - \alpha_S - \beta d_{Sj} - \gamma' x_{Sj})^2 + \frac{n_{Tj}\,\phi^2[\Phi^{-1}(p_{Tj})]}{p_{Tj}(1 - p_{Tj})} (Z_{Tj} - \alpha_T - \beta d_{Tj} - \gamma' x_{Tj})^2 \right], \tag{3.7}$$

where $Z_{Sj}$ and $Z_{Tj}$ now denote the sample probits $\Phi^{-1}(p_{Sj})$ and $\Phi^{-1}(p_{Tj})$, and $\phi(\cdot)$ stands for the standard normal density function; a similar case holds for slope-ratio assays as well. Minimizing this with respect to $\alpha_S$, $\alpha_T$, $\beta$ and $\gamma$, we obtain the estimating equations that provide the BAN (best asymptotically normal) estimators of these parameters, and these in turn provide the estimator of the relative potency. Though the estimating equations for the probit analysis are comparatively more complex than for the logit model, extensive tables are available for the normal density and quantile functions, so numerical approximations are not difficult to adopt.
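In place of printed tables, the probit transform and the weights appearing in (3.7) can be computed directly. The short sketch below uses Python's standard library normal distribution; the function names are hypothetical, and the logit weight from (3.6) is included only for comparison.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def probit_and_weight(n, p):
    """Sample probit and its weight n * phi(probit)^2 / (p (1 - p)) from (3.7)."""
    probit = STD_NORMAL.inv_cdf(p)   # requires 0 < p < 1
    weight = n * STD_NORMAL.pdf(probit) ** 2 / (p * (1 - p))
    return probit, weight

def logit_weight(n, p):
    """Weight n p (1 - p) attached to a sample logit in (3.6)."""
    return n * p * (1 - p)

# Hypothetical dose level: 100 subjects, half of them responding.
print(probit_and_weight(100, 0.5), logit_weight(100, 0.5))
```

The two weights are on different scales because the probit and logit metameters are themselves scaled differently, so they should be compared only within one analysis.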

The classical stochastic approximation methodology, developed by Robbins and Monro (1951) and Kiefer and Wolfowitz (1952), may be quite amenable to quantal bioassays, and the vast research literature on stochastic approximation can be incorporated to facilitate such adaptation. However, in that approach, the choice of successive dose levels depends on the outcome of the preceding stage dose and response levels, and thereby results in stochastic dose as well as response levels. This may be difficult to administer effectively in practical applications.
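To make the sequential nature of this scheme concrete, here is a minimal sketch of a Robbins-Monro type recursion for locating the dosage at which the response rate equals a target value. Everything specific in it is an assumption made for illustration: the logistic tolerance curve with median at dosage 2, the gain sequence `gain / k`, the starting dosage, and the function names.

```python
import math
import random

def robbins_monro_ed(respond, target=0.5, start=0.0, gain=4.0, steps=2000):
    """Move the dosage down after a response above target, up otherwise."""
    dose = start
    for k in range(1, steps + 1):
        outcome = respond(dose)                  # 1 or 0 (or a response rate)
        dose -= (gain / k) * (outcome - target)  # step sizes shrink like 1/k
    return dose

def true_response_probability(dose):
    """Hypothetical logistic tolerance curve with median (ED50) at dosage 2."""
    return 1.0 / (1.0 + math.exp(-(dose - 2.0)))

rng = random.Random(0)

def simulated_subject(dose):
    """One binary response drawn at the given dosage."""
    return 1 if rng.random() < true_response_probability(dose) else 0

print(robbins_monro_ed(simulated_subject))
```

Each new dosage depends on the previous response, which is exactly the feature noted above: the dose levels themselves become random, and the experiment has to be run one stage at a time.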

There are other types of bioassays, namely, radioimmunoassays and immunoradiometric assays, where antigens and antibodies are labeled with (usually small) doses of radio-isotopes; such assays are based on radiation counts at various doses in a fixed time period. The logistic curve has been found to be quite satisfactory for such dose-response patterns, though the doses may generally be quite low. However, unlike the logistic distribution function, we may need to use a version of it whose lower asymptote may be different from 0 and whose upper asymptote may be less than 1. Generalized linear models are also often found to be appropriate for such data, and we shall deal with them later. Finally, there may be other complexities in statistical modeling and analysis of bioassays. These may be due to (i) possible censoring that may not conform to the usual Type I, Type II or random censoring type (as, for example, informative censoring), (ii) possible measurement errors that may either be differentiable or nondifferentiable, (iii) stochastic compliance of dose, and (iv) multiple end-points that in quantal assays may lead to correlated binary or polychotomous responses. For measurement error models, we refer to the accompanying article by Lyles and Kupper (2000) in this volume. Nondifferentiable compliance error models have also been considered (Chen-Mok and Sen, 1999) to cover some measurement error models where there is a stochastic compliance factor that relates the administered (mostly nonstochastic) dose levels to the actual (stochastic) intake levels. The use of historical control on the stochastic compliance and of a logit model for the response variable has been incorporated to modify the classical logit models, so as to suit statistical modeling and analysis better. Some other aspects relating to these issues will be discussed later on. For multiple end-points, a competing risk setup cannot be ruled out either (DeMasi, 1999), and more detailed treatments are provided in some other accompanying articles in this volume.

4. Generalized linear models

In biostatistical applications, dose-response relations dominate the scenario. In a conventional setup, a linear model is generally adopted. In this setup, if Y stands for the response variable and x for the dose variable, then it is assumed that

$$Y = \beta' x + e, \tag{4.1}$$

where $\beta$ stands for the vector of (regression) parameters, and the error component $e$ satisfies the following:

(i) $e \sim \mathcal{N}(0, \sigma^2)$, $0 < \sigma^2 < \infty$;
(ii) for different observations, the errors are independent; and
(iii) the errors are homoscedastic at all levels of $x$.

In actual practice, in addition to the basic linearity of the model, a departure from the model based assumptions can take place in one or more of the above three clauses. The delicate role of the normality of the errors and of the additivity of the model may therefore need a critical appraisal. In biometric studies, the response variable is often a nonnegative random variable that typically has a highly skewed distribution; we have already commented on this in the previous two sections. In such a case, a logarithmic or power transformation, known as the Box-Cox transformation, is often advocated so as to induce more affinity to a normal distribution for the errors; however, such a transformation may also affect the linearity of the model, and thereby call for some nonlinear models to have meaningful statistical interpretations. Similarly, the dose variables may also be subject to suitable transformations, termed dosages, so as to induce more linearity in the resulting model. Nevertheless, from robustness and validity perspectives, the classical normal theory linear model may not appear to be very appropriate in many biostatistical applications, and therefore alternative approaches have been developed to suit specific types of applications. Generalized linear models (GLM) explore the potential of linear models through appropriate transformations (known as the link functions) that facilitate the formulation of suitable estimating equations (EE), which may work even without the basic normality assumption for the error component.

The genesis of G L M lies in the so called exponential family of densities. Let I11,..-, Yn be n independent random variables, and assume that Y/has the density function

f/(y, Oj, O) - c(y, Oj, 4) exp{(y0j - b(Oj))/a(dp)}, j = 1 , . . . , n , (4.2)

where a(.), b(.) and c(.) are functions of known forms, ~b is taken as a nuisance (scale) parameter, while the unknown parameters 0 / m a y depend on some con-

Page 10: [Handbook of Statistics] Bioenvironmental and Public Health Statistics Volume 18 || 9 Non-parametrics in bioenvironmental and public health statistics

256 P.K. Sen

comi tan t variates in a suitable pa ramet r ic form. Fo r example, the logistic re- gression model considered in the previous section can be character ized as a m e m b e r of this family, for which the logit 0j is expressed as [~Pxj for suitable dosage xj-; we shall e laborate this later on. I t is easy to verify that

EYj = #j(Oj) = b'(Oj), var(Yj) = a(c~)b"(Oj), j = 1 , . . . , n ; (4.3)

The last equat ion also enables us to write b"(Oj) = (~/~0j)#j(0j) = vj(#j(Oj)), for j = 1 , . . . , n which are known as the variance functions. For the par t icular case o f a no rma l density, b(O) = ½02, so that b"(O) = 1. As such, taking clue f rom the normal density, we can conceive of a lower dimensional pa rame te r (p-vector) p and a specification matrix X (of order n x p) such that

0 = ( 0 1 , . . . , 0 n ) ' = X p . ( 4 . 4 )

Based on such a formula t ion , we m a y also conceive of a t rans format ion , called the link function, 9(#j), J = 1 , . . . , n, such that

G, = ( g ( # l ) , . - - , g (# , ) ) ' = X[~ . (4.5)

Therefore letting g(#j) = x}p, we express

Oj ~- (g" # ) - 1 (X}~), j = 1 , . . . , n . (4.6)

In part icular , if #(.) is a mon tone function, we can choose 9(.) = #-1 (.), so that the 0j are themselves linear in p; this corresponds to the case of a canonical link function. In this special case, the est imating equat ions for the M L E of p reduces to

$$\sum_{j=1}^{n} \{Y_j - b'(\mathbf{x}_j'\boldsymbol{\beta})\}\mathbf{x}_j = \mathbf{0}, \tag{4.7}$$

so that the MLE can be expressed as

(4.8)

that resembles the classical linear model MLE. The EE may become more complex when we do not have a natural link function. Generalized estimating equations (GEE) based on the weighted least squares methodology are advocated, and the GEE can be expressed as

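$$\mathbf{X}'\mathbf{D}_n^{-1}(\boldsymbol{\beta})\,\mathbf{r}_n(\boldsymbol{\beta})\big|_{\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}} = \mathbf{0}, \tag{4.9}$$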

where

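$$\mathbf{r}_n(\boldsymbol{\beta}) = (Y_1 - \mu_1(\boldsymbol{\beta}), \ldots, Y_n - \mu_n(\boldsymbol{\beta}))', \quad \mathbf{D}_n(\boldsymbol{\beta}) = \operatorname{diag}(g'(\mu_1(\boldsymbol{\beta}))v_1(\boldsymbol{\beta}), \ldots, g'(\mu_n(\boldsymbol{\beta}))v_n(\boldsymbol{\beta})). \tag{4.10}$$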


Since the two expressions in (4.10) involve the unknown parameters, an iterative solution of the GEE may generally be needed. It is worth noting in this context that there is a dominant asymptotic flavor in the GEE (as (4.10) involves unknown weights), and extra regularity assumptions are usually needed to establish the usual asymptotic properties of the estimators derived from the GEE. These regularity assumptions have been studied in detail in Section 7.4 of Sen and Singer (1993).

Important applications of GLMs in biostatistics include the Poisson regression model as well as the logistic regression model; we have already discussed the latter model in the preceding sections, and we briefly introduce the former model here. Consider the simple exponential family in (4.2), where $Y_i$ has a Poisson distribution with mean $\lambda_i$, so that $\theta_i = \log\lambda_i$, which may depend on some design and concomitant variates in a functional way. For example, we may consider either of the following two models:

$$\theta_i = \log\lambda_i = \alpha + \boldsymbol{\beta}'\mathbf{c}_i, \quad i = 1, \ldots, n, \tag{4.11}$$

$$\lambda_i = \exp\theta_i = \alpha + \boldsymbol{\beta}'\mathbf{c}_i, \quad i = 1, \ldots, n,$$

subject to the constraints that all the $\lambda_i$ are nonnegative. In the first case, we have a canonical link function, while in the latter case, we have $g(z) = z$. The former model corresponds to the usual log-linear model for Poisson counts, while the other case corresponds to a linear regression model. Such Poisson models are now used in various spatial models arising in environmental and epidemiologic studies (such as the e-mapping for pollution or disease-incidence mapping), where the $\mathbf{c}_i$ refer to demographic or other variates. It is easy to verify that the usual regularity conditions needed for the asymptotics hold here.
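As an illustration of the log-linear case, the sketch below solves the canonical-link estimating equation (4.7), which for Poisson counts reads $\sum_i \{Y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})\}\mathbf{x}_i = \mathbf{0}$, by iteratively reweighted least squares. The function name `fit_poisson_loglinear`, the starting value and the small data set are assumptions made for this illustration only; the intercept $\alpha$ is absorbed into the first column of the design matrix.

```python
import numpy as np

def fit_poisson_loglinear(X, y, n_iter=50, tol=1e-10):
    """Solve X'(y - exp(X beta)) = 0 by iteratively reweighted least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu = y + 0.5                    # crude starting value that keeps log(mu) finite
    eta = np.log(mu)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = eta + (y - mu) / mu     # working response for the log link
        XtW = X.T * mu              # X' diag(mu): the weights equal the Poisson variance
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        eta = X @ beta
        mu = np.exp(eta)
        if converged:
            break
    return beta

# Hypothetical data: an intercept column and a single covariate c_i.
c = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
X = np.column_stack([np.ones_like(c), c])
y = np.array([1.0, 2.0, 2.0, 4.0, 6.0, 9.0])

beta_hat = fit_poisson_loglinear(X, y)
# At the solution the estimating equation should hold up to rounding error.
residual_score = X.T @ (y - np.exp(X @ beta_hat))
```

With the canonical link this iteration coincides with Newton-Raphson on the log-likelihood, which is why no separate line search is included in the sketch.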

Wedderburn (1974) considered a quasi-score or quasi-likelihood estimating equation (QLEE) approach that works under less specific regularity assumptions. Suppose that $EY_i = \mu_i$, with the $\mu_i$ depending on $\boldsymbol{\beta}$ as in the case of the simple GLM. Further, assume that $\operatorname{Var}(Y_i) = \sigma^2 V_i(\mu_i)$ for some (possibly unknown) scalar $\sigma^2\,(>0)$, where as before $V_i(\cdot)$ is a completely known variance function. However, apart from these first and second moment conditions, no specific distributional assumption is made for the $Y_i$. Then, we are not in a position to incorporate the likelihood function in the formulation of the usual EEs on which the GLM methodology rests. Nevertheless, in the spirit of the weighted least squares methodology, we may consider the EE:

$$\sum_{i=1}^{n} \frac{\partial\mu_i}{\partial\boldsymbol{\beta}}\,(V_i(\mu_i))^{-1}(Y_i - \mu_i) = \mathbf{0}, \tag{4.12}$$

and inserting the assumed relationship between $\mu_i$ and $\boldsymbol{\beta}$ we obtain some GEE for solving for $\boldsymbol{\beta}$. Here also, the dependence of the $V_i(\cdot)$ on $\boldsymbol{\beta}$ (through the $\mu_i$) may make it necessary to use an iterative procedure for the solution.
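The iterative procedure for (4.12) can be sketched as a Fisher-scoring loop that needs only the mean function, its derivative and the variance function. Everything named below (`quasi_score_fit`, its arguments, the data) is hypothetical. The moment estimator of $\sigma^2$ returned at the end, the Pearson statistic divided by $n - p$, is the usual choice in the quasi-likelihood literature and is an addition not spelled out in the text.

```python
import numpy as np

def quasi_score_fit(X, y, mean_fn, dmean_fn, var_fn, beta_init, n_iter=100, tol=1e-10):
    """Fisher scoring for sum_i (d mu_i / d beta) V(mu_i)^{-1} (y_i - mu_i) = 0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.array(beta_init, dtype=float)
    for _ in range(n_iter):
        eta = X @ beta
        mu = mean_fn(eta)
        D = X * dmean_fn(eta)[:, None]     # row i holds d mu_i / d beta
        v = var_fn(mu)
        score = D.T @ ((y - mu) / v)       # left-hand side of the estimating equation
        info = D.T @ (D / v[:, None])      # D' V^{-1} D
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    mu = mean_fn(X @ beta)
    # Moment estimate of the scale factor (Pearson statistic over residual d.f.).
    sigma2 = np.sum((y - mu) ** 2 / var_fn(mu)) / (len(y) - X.shape[1])
    return beta, sigma2

# Hypothetical overdispersed counts with a log link and V(mu) = mu.
c = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
X = np.column_stack([np.ones_like(c), c])
y = np.array([2.0, 1.0, 5.0, 4.0, 11.0, 14.0])
beta_init = np.linalg.lstsq(X, np.log(y), rcond=None)[0]   # rough start on the log scale

beta_hat, sigma2_hat = quasi_score_fit(X, y, np.exp, np.exp, lambda mu: mu, beta_init)
```

Only first and second moments enter the computation, so the same function can be reused with another variance function (for instance `lambda mu: mu ** 2`) without specifying a distribution.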

Liang and Zeger (1986) formulated another extended GLM approach which allows the variance functions $V_i$, though of known functional forms, to have some unknown parameters. Such a situation is more likely to arise in a multivariate setup where the $V_i$ are matrices. They suggested that the variance matrix be decomposed into a correlation matrix and a diagonal matrix of completely known variances. Then under suitable correlation or dependence patterns, such as the intra-class correlation or the autoregressive model, the correlation matrix can be estimated in an iterative manner (in conjunction with the GEE for the iterative solution of $\boldsymbol{\beta}$), and that provides an extension of the GEE methodology to a more complex setup. It is no surprise that asymptotic distributional problems are even more complex for such extended models, and in addition, their adaptability in small to moderate sample sizes may often be questionable. Moreover, from robustness perspectives there is even greater concern for such extended methodology, as here, in addition to the appropriateness of the chosen link function(s), plausible departures from the assumed correlation pattern may also affect the validity and efficiency of statistical procedures based on such GEEs.

Such Poisson regression models paved the way for the so-called Cox (1972) proportional hazards model (PHM). In a general setup, for a nonnegative random variable $Y$ having an absolutely continuous distribution function $F$ with a density function $f(x)$, we define the survival function $S(x) = 1 - F(x)$, and the hazard function $h(x)$ as

$$h(x) = -(d/dx)\log S(x) = f(x)/S(x), \quad x > 0. \tag{4.13}$$

If $F$ is an exponential distribution with mean $\theta\,(>0)$, then $h(x) = \theta^{-1}$ for all $x > 0$, so that we have a constant hazard or failure rate. There are other families of distributions for which $h(x)$ is not a constant (for all $x$), and in that context, the increasing failure rate (IFR) and decreasing failure rate (DFR) families are particularly important. The Weibull distribution, for which $S(x) = \exp\{-\rho x^{\gamma}\}$, $x \geq 0$, belongs to the IFR or DFR class according as the shape parameter $\gamma$ is greater than 1 or lies in $(0, 1)$. Similarly, a gamma distribution with scale parameter $\theta$ and shape parameter $\gamma$, both positive, belongs to the IFR or DFR class according as $\gamma$ is greater or less than 1; for both the Weibull and gamma distributions, $\gamma = 1$ relates to the simple exponential model. Let us now consider two such distributions, say $F$ and $G$, and denote the corresponding hazard functions by $h_F(x)$ and $h_G(x)$. If both $F$ and $G$ are exponential, then obviously $h_F(x)/h_G(x)$ is constant for all $x$, so that the two hazards are proportional to each other. This feature may not generally hold if $F$, $G$ are not exponential, even when they belong to a common IFR or DFR family. On the other hand, in most biostatistical applications (especially in survival analysis), it may not be very reasonable to assume a constant hazard function. Moreover, there are usually some concomitant variables that may influence the hazard function. Motivated by this feature, Lehmann (1953) considered a model where the two hazard functions $h_F$ and $h_G$, though not necessarily constant, are proportional to each other (albeit his formulation was somewhat different and in a different context too). Led by this simple formulation, Cox (1972) considered a general conditional setup and established the basic concept of the PHM in a very innovative manner. Let us consider a model where the primary variate $Y$ is accompanied by a concomitant variate (vector) $\mathbf{Z}$ that may also contain the design variates (as dummy variables). Then conditional on $\mathbf{Z} = \mathbf{z}$, the hazard function of $Y$ at $y$ is denoted by $h(y|\mathbf{z})$. We denote the baseline level for the concomitant $\mathbf{z}$ by $\mathbf{0}$, and also let $h(y|\mathbf{0}) = h_0(y)$. Cox (1972) allowed the baseline hazard function $h_0(y)$ to be quite arbitrary (nonnegative) and assumed that

$$h(y|\mathbf{z})/h_0(y) = g(\mathbf{z}), \quad \forall\, y, \mathbf{z}, \tag{4.14}$$

where $g(\cdot)$ is nonnegative and of a parametric form. Specifically, he let

$$g(\mathbf{z}) = \exp\{\boldsymbol{\beta}'\mathbf{z}\}, \tag{4.15}$$

where $\boldsymbol{\beta}$ stands for the (hazard) regression parameter on the concomitant variates. In particular, if we let $\mathbf{z}$ be binary (i.e., 0 or 1, according as the subject belongs to the placebo or treatment group), we have the Lehmann model described earlier. This specific choice of $g(\cdot)$ allows it to be nonnegative and also leads to the following log-hazard regression model:

$$\log h(y|\mathbf{z}) = \log h_0(y) + \boldsymbol{\beta}'\mathbf{z}, \tag{4.16}$$

and this brings out the relevance of GLM in a broad sense. In this sense, it may also be tempting to prescribe this PHM for the indirect quantitative bioassays described in Section 2. If we denote the two hazard functions for the test and standard preparations, corresponding to a given dosage $\mathbf{x}$, by $h_T(y|\mathbf{x})$ and $h_S(y|\mathbf{x})$ respectively, we let

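$$h_T(y|\mathbf{x}) = h_0(y)\exp\{\alpha_T + \boldsymbol{\beta}_T'\mathbf{x}\}, \quad h_S(y|\mathbf{x}) = h_0(y)\exp\{\alpha_S + \boldsymbol{\beta}_S'\mathbf{x}\}, \tag{4.17}$$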

where we may put the homogeneity constraints on the parameters $\boldsymbol{\beta}_T$, $\boldsymbol{\beta}_S$ or $\alpha_T$, $\alpha_S$ depending on the parallel-line or slope-ratio assay model. This GLM approach (Sen, 1996b, 1997) allows the relative potency to be interpreted in terms of the parameters in the two log-hazard regressions, though the nice interpretation we had in the dilution assay model (based on the location-scale family of distributions) may no longer be tenable under this PHM (as the location-scale model may not be readily amenable to log-hazard linear regression models). From this point of view, for bioequivalence models the adoption of such a PHM may be more appropriate than in the classical bioassay models. The statistical analysis of such PHM-based bioassay models may no longer be as simple as in the conventional case treated in Section 2. Instead of the likelihood function conventionally adopted in drawing statistical conclusions, here we have to go for some partial likelihood function formulations. These may in general require a martingale approach that rests on a relatively more sophisticated counting processes methodology. We will review this in greater detail in a later section.
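To give a feel for what a partial likelihood looks like under (4.14)-(4.15), here is a minimal sketch of the standard Cox log partial likelihood for right-censored data without tied failure times. The formula is the usual textbook form and is not spelled out in this section; the function name `cox_partial_loglik` and the toy data are assumptions for illustration. The baseline hazard $h_0(y)$ never appears, because it cancels within each risk set.

```python
import numpy as np

def cox_partial_loglik(beta, times, events, Z):
    """Log partial likelihood for h(y|z) = h0(y) exp(beta'z), assuming no tied failures."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    beta = np.atleast_1d(np.asarray(beta, dtype=float))
    Z = np.asarray(Z, dtype=float).reshape(len(times), -1)
    eta = Z @ beta
    loglik = 0.0
    for i in np.flatnonzero(events):
        at_risk = times >= times[i]      # subjects still under observation at this failure
        loglik += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return loglik

# Hypothetical two-group data: z = 0 (placebo) or 1 (treatment); events = 0 marks censoring.
times = np.array([2.0, 3.0, 5.0, 7.0, 11.0])
events = np.array([1, 0, 1, 1, 0])
z = np.array([0.0, 1.0, 0.0, 1.0, 1.0])

ll_at_zero = cox_partial_loglik([0.0], times, events, z)
ll_at_one = cox_partial_loglik([1.0], times, events, z)
```

At $\boldsymbol{\beta} = \mathbf{0}$ every subject in a risk set is equally likely to fail, so the value reduces to minus the sum of the logarithms of the risk-set sizes (here 5, 3 and 2).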


Though the primary emphasis in dose-response regression models has been the incorporation of suitable transformations and link functions that render a linear model, there are situations where it may be quite difficult to have a reduced linear model or a GLM in a broad sense. In a general nonlinear model, we conceive of a response variable $Y$ and a set of related (dose or concomitant) variates $\mathbf{x}$, and consider a stochastic model

$$Y = g(\mathbf{x}) + e, \quad \mathbf{x} \in \mathcal{X}, \tag{4.18}$$

where the error component follows a given distribution $F$ (that generally involves some unknown parameters), and the form of the regression function $g(\cdot)$ is assumed to be given (though possibly nonlinear), and it also involves some unknown parameters (which appear as algebraic constants in its functional form). The predominant parametric flavor of such a typical nonlinear model is clearly perceptible. A semiparametric formulation, along the lines of the PHM, is conceivable in either of two ways: (i) retain the parametric flavor of the regression function $g(\cdot)$ but allow the distribution $F$ to be rather arbitrary, and (ii) allow the distribution $F$ to be of a given parametric type, while letting $g(\cdot)$ be of nonparametric form (i.e., quite arbitrary). If we allow both $g(\cdot)$ and $F$ to be nonparametric, we have a genuinely nonparametric regression model. Let us illustrate this situation with a bioassay model similar to the ones treated earlier. For a given dosage (and design) variate (vector) $\mathbf{x}$, we denote the distribution functions of the test and standard preparation response variables $Y_T$, $Y_S$ by $F_T(y|\mathbf{x})$ and $F_S(y|\mathbf{x})$, respectively. We also consider the corresponding regression functions $g_T(\mathbf{x})$ and $g_S(\mathbf{x})$ and express

$$F_T(y|\mathbf{x}) = F(y - g_T(\mathbf{x})), \quad F_S(y|\mathbf{x}) = F(y - g_S(\mathbf{x})), \quad y \geq 0, \; \mathbf{x} \in \mathcal{X}, \tag{4.19}$$

where the distribution $F$ may have an assumed parametric form (such as the logistic, normal, or double exponential distribution), while the two regression functions satisfy the same fundamental regularity condition of a parallel-line or slope-ratio assay but otherwise need not be linear. For example, in a parallel-line assay setup, we may let

$$g_T(\mathbf{x}) - g_S(\mathbf{x}) = c^{\circ}, \quad \forall\, \mathbf{x} \in \mathcal{X}, \tag{4.20}$$

though neither one is deemed to be a linear regression function.

Taking a clue from this two-sample model, it is possible to conceive of a more general regression model involving some design variables $\mathbf{c}_i$ and other (possibly stochastic) concomitant variables $\mathbf{X}_i$ along with the primary response variables $Y_i$, for $i = 1, \ldots, n$, and consider the following model:

$$Y_i = g_1(\mathbf{c}_i) + g_2(\mathbf{x}_i) + e_i, \quad i = 1, \ldots, n, \tag{4.21}$$

where the errors $e_i$ can be assumed to be independent and identically distributed with a distribution $F$, while much more flexibility can be introduced with respect to the two regression functions $g_1(\cdot)$ and $g_2(\cdot)$. For example, with respect to the nonstochastic $\mathbf{c}_i$, it may be quite reasonable (following appropriate transformations, if necessary) to assume that $g_1(\mathbf{c}_i)$ is of a linear parametric form involving a finite dimensional (unknown) parameter. But, with respect to the stochastic concomitant variates $\mathbf{X}_i$, sans appropriate multinormal laws, it might not be very reasonable to assume a linear regression pattern, homoscedasticity and other conventional regularity conditions that underlie the usual linear models. Often, more complex nonlinear models are therefore advocated for $g_2(\mathbf{x})$ as well as the errors $e_i$. If the $e_i$ can be regarded as i.i.d.r.v.'s with a finite variance, then it may be quite reasonable to consider the total sum of squares due to errors, namely,

$$\sum_{i=1}^{n} \{Y_i - g_1(\mathbf{c}_i) - g_2(\mathbf{x}_i)\}^2, \tag{4.22}$$

based on some assumed parametric forms for $g_1(\cdot)$ and $g_2(\cdot)$, and to minimize this with respect to the unknown parameters that appear in the expressions for $g_1(\cdot)$, $g_2(\cdot)$. This simple least squares estimation (LSE) methodology, in a general nonlinear model, yields suitable estimating equations for which solutions may not always have closed algebraic expressions, and thereby may require iterative procedures. Moreover, the rationale of this LSE methodology may not be totally tenable if the errors are not i.i.d.; of course, it is possible to adopt here the quasi-likelihood principle that has been presented earlier, and to obtain relatively more efficient estimates and test statistics that allow for some relaxation of the i.i.d. clause for the errors. However, the formulation of the variance function may be a delicate matter, and may also involve additional nuisance parameters. Either way, these statistical inference procedures may lack robustness against plausible departures from the model-based assumptions, and are thereby often judged unsuitable for adoption in specific biostatistical applications. Kim and Sen (2000) have considered some robust statistical procedures in bioassays that allow for some arbitrariness in the functions $g_T(\cdot)$, $g_S(\cdot)$ in the case where the dose levels are themselves stochastic. They incorporated suitable conditional quantile processes in the formulation of robust estimators and test statistics as well. Nevertheless, that may generally entail slower rates of convergence (similar to the smoothing methods in statistical inference).
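The iterative minimization of (4.22) can be sketched with a Gauss-Newton loop. The parametric forms used below, $g_1(c) = \gamma c$ and $g_2(x) = a\exp(bx)$, are purely hypothetical choices made for this illustration, as are the function names, the starting value and the noise-free data; step halving is added as a simple safeguard and is not part of the text.

```python
import numpy as np

def gauss_newton(residual, jacobian, theta0, n_iter=50, tol=1e-10):
    """Minimize sum(residual(theta)**2) by Gauss-Newton with step halving."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        r = residual(theta)
        J = jacobian(theta)                          # derivative of the residual vector
        step = np.linalg.lstsq(J, -r, rcond=None)[0]
        t = 1.0
        # Halve the step until the error sum of squares does not increase.
        while np.sum(residual(theta + t * step) ** 2) > np.sum(r ** 2) and t > 1e-8:
            t *= 0.5
        theta = theta + t * step
        if np.max(np.abs(t * step)) < tol:
            break
    return theta

# Hypothetical design variate c_i (0/1), concomitant x_i, and error-free responses.
c = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
x = np.arange(8) * 0.5
truth = np.array([2.0, 1.5, -0.7])                   # (gamma, a, b)
y = truth[0] * c + truth[1] * np.exp(truth[2] * x)

def residual(theta):
    gamma, a, b = theta
    return y - gamma * c - a * np.exp(b * x)

def jacobian(theta):
    gamma, a, b = theta
    e = np.exp(b * x)
    return np.column_stack([-c, -e, -a * x * e])

theta_hat = gauss_newton(residual, jacobian, theta0=[1.5, 1.0, -0.5])
```

Because the data are generated without error, the minimizer reproduces the assumed parameter values; with real data the same loop would return the nonlinear LSE, whose quality depends on the i.i.d. error assumption discussed above.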

It may be intuitively more appealing to consider a semiparametric GLM in such a mixed model statistical analysis; we can consider a suitable link function that leads to a linear parametric form for $g_1(\cdot)$, while we may consider a nonparametric form for the concomitant function $g_2(\cdot)$. We shall discuss some of these later in connection with ANOCOVA models with mixed effects (Sen, 1996a).

GLMs have also found their utility in case-control studies and in some other related areas. We shall briefly discuss this area in a later section. Also, we shall provide a treatment of generalized additive models in a later section; there is some need to introduce the nonparametric regression models first, and we shall consider them in that order.


5. Correlated polychotomous response data models

First, we consider the case of multiple dichotomous attributes. This typically arises when there are multiple characteristics, each with a binary response variate that signifies a positive (yes) or negative (no) response, and these binary outcome variables are generally not statistically independent. The $p\,(\geq 2)$ dichotomous attributes can be represented by a vector $\mathbf{j} = (j_1, \ldots, j_p)'$, where each $j_i$ can be either 0 or 1, for $i = 1, \ldots, p$. Note that $\mathbf{j}$ can take on $2^p$ possible realizations, and we denote this set by $\mathcal{J}$. Consider next a random ($p$-)vector $\mathbf{X} = (X_1, \ldots, X_p)'$, such that

$$\pi(\mathbf{j}) = P\{\mathbf{X} = \mathbf{j}\}, \quad \mathbf{j} \in \mathcal{J}. \tag{5.1}$$

Therefore, the probability law is defined on a $2^p$-simplex:

$$\pi(\mathbf{j}) \geq 0, \quad \forall\, \mathbf{j} \in \mathcal{J}, \qquad \sum_{\mathbf{j} \in \mathcal{J}} \pi(\mathbf{j}) = 1. \tag{5.2}$$

In this way, we have a general probability model involving $2^p - 1$ unknown parameters; with increasing $p$, the dimension ($2^p - 1$) of the parameter space becomes very large, and that creates some problems with the adoption of standard statistical analysis tools. For example, when $p = 4$, we have 15 unknown parameters, and in order that each of the 16 possible realizations ($\mathbf{j} \in \mathcal{J}$) has an adequately large cell count, we need a much larger sample size compared to the case of $p = 1$. Moreover, with so many parameters, we may not have an estimator that is uniformly better than others, or a test that is uniformly most powerful for all alternatives (or even some subclass of the same). Further, our primary interest may be confined to a suitably chosen subset of parameters, and in that case, we may have a better prospect for drawing statistical inference. For these reasons, often a reparametrization is advocated, and this is incorporated in the reduction of the high dimensionality. Bahadur (1961) considered an elegant reparametrization that we find very useful in this context. The roots of the Bahadur representation lie in the earlier work of M. S. Bartlett and S. N. Roy on interpretations of higher order interactions in high dimensional tables; a detailed account may be found in Roy (1957). In the sequel, we refer to it as the Bahadur-Roy reparametrization.

First, we consider the p marginal parameters

$$\pi^{(j)}_{*(i)} = P\{X_j = i\}, \quad i = 0, 1; \; j = 1, \ldots, p. \tag{5.3}$$

Note that $\pi^{(j)}_{*(0)} + \pi^{(j)}_{*(1)} = 1$, for all $j\,(= 1, \ldots, p)$, and hence, there are only $p$ unknown quantities among these parameters. We denote by

$$\boldsymbol{\pi}^{*} = (\pi^{(1)}_{*(0)}, \ldots, \pi^{(p)}_{*(0)})'. \tag{5.4}$$

Next, for every $l$: $2 \leq l \leq p$, and $1 \leq j_1 < \cdots < j_l \leq p$, we define an $l$th order association parameter $\theta_{j_1 \cdots j_l}$ in the usual way; note that there are $\binom{p}{l}$ such association parameters of the $l$th order. We denote the set of all such association parameters by

$$\boldsymbol{\Theta} = \{\theta_{j_1 \cdots j_l}: 1 \leq j_1 < \cdots < j_l \leq p; \; 2 \leq l \leq p\}. \tag{5.5}$$

Thus, the total number of association parameters (i.e., the cardinality of the set $\boldsymbol{\Theta}$) is equal to $\sum_{l=2}^{p} \binom{p}{l} = 2^p - p - 1$. This leads us to the reparametrization:

$$\{\pi(\mathbf{j}): \mathbf{j} \in \mathcal{J}\} \to \{\boldsymbol{\pi}^{*}, \boldsymbol{\Theta}\}, \tag{5.6}$$

connected by the relationship (Sen, 1995b)

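$$\pi(\mathbf{j}) = \prod_{i=1}^{p} \pi^{(i)}_{*(j_i)} + \sum_{1 \leq i_1 < i_2 \leq p} (-1)^{j_{i_1} + j_{i_2}}\, \theta_{i_1 i_2} \prod_{r=1}^{2} \pi^{(i_r)}_{*(0)} \prod_{s=1,\, s \neq i_1, i_2}^{p} \pi^{(s)}_{*(j_s)} + \sum_{1 \leq i_1 < i_2 < i_3 \leq p} (-1)^{j_{i_1} + j_{i_2} + j_{i_3}}\, \theta_{i_1 i_2 i_3} \prod_{r=1}^{3} \pi^{(i_r)}_{*(0)} \prod_{s=1,\, s \neq i_1, i_2, i_3}^{p} \pi^{(s)}_{*(j_s)} + \cdots + (-1)^{j_1 + \cdots + j_p}\, \theta_{1 \cdots p} \prod_{r=1}^{p} \pi^{(r)}_{*(0)}, \quad \forall\, \mathbf{j} \in \mathcal{J}. \tag{5.7}$$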

For the case of $p = 2$, we have two marginal probability parameters and a single two-factor association parameter ($2^p - 1 = 3$), and hence, there is no reduction in the dimension of the parameter space. On the other hand, whenever $p \geq 3$, it may be reasonable to assume that only two-factor association parameters capture the association patterns among the $p$ attributes, while higher order association parameters can be taken as null; in this way, we would have $p + \binom{p}{2} = \binom{p+1}{2}$ unknown parameters, and for this reduced parameter space more precise statistical conclusions can be drawn. Actually, when $p$ is large, we may even include three or higher order association parameters and still have a reasonable degree of reduction of the parameter space. Of course, the gain in the efficacy of statistical analysis based on such a reduced parameter space is contingent on the assumption that the neglected association parameters do not really contribute any significant statistical information (as regards the parameters retained in the reduced model), and a parametric orthogonality condition that is obtained as a by-product of the Bahadur-Roy representation generally conforms to this expectation to a certain extent. In principle, this is quite comparable to (partial or total) confounding in factorial experiments: whereas confounding is mainly achieved by properly designing the experiment, the reduction of the parameter space is achieved through sacrifice of information on higher-order association measures.
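The parameter bookkeeping in this reparametrization is easy to check numerically. The short sketch below (the helper name `bahadur_counts` is hypothetical) tabulates, for a given $p$, the $2^p - 1$ free cell probabilities, the $p$ marginal parameters, the $\binom{p}{l}$ association parameters of each order, and the size $p + \binom{p}{2}$ of the model that retains only two-factor associations.

```python
from math import comb

def bahadur_counts(p):
    """Parameter counts for p dichotomous attributes under the reparametrization."""
    total = 2 ** p - 1                                    # free cell probabilities
    marginals = p                                         # one marginal parameter per attribute
    associations = {l: comb(p, l) for l in range(2, p + 1)}
    reduced = p + comb(p, 2)                              # marginals plus two-factor associations
    return total, marginals, associations, reduced

total, marginals, associations, reduced = bahadur_counts(4)
# For p = 4: 15 parameters in all, 4 marginal, 6 + 4 + 1 = 11 association
# parameters, and 10 parameters in the reduced two-factor model.
```

The saving grows quickly with $p$: the full model is exponential in $p$, whereas the two-factor model is quadratic.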

Let us now illustrate the utility of the Bahadur-Roy (Bahadur, 1961) reparametrization of multiple dichotomous response models in quantal bioassay models or in some other multiple end-point survival analysis models. For the marginal probabilities $\pi^{(j)}_{*(i)}$, introduced in (5.4), we can conceive of suitable logit or normit models; in the case of the logit model, we may therefore take

$$\log\{\pi^{(j)}_{*(1)}/\pi^{(j)}_{*(0)}\} = \boldsymbol{\beta}_j'\mathbf{x}, \quad j = 1, \ldots, p, \tag{5.8}$$


where $\mathbf{x}$ stands for suitable dosage levels (along with other concomitant variates), and $\boldsymbol{\beta}_j$ stands for unknown regression parameters that characterize the dose-response regression model. Suppose now that there are two preparations, say, a test ($T$) and a standard ($S$), and for each one the response is multiple dichotomous. We can then formulate the marginal logits for each preparation as above, and denote the corresponding parameters (vectors) by

$$\boldsymbol{\beta}_{Sj} \text{ and } \boldsymbol{\beta}_{Tj}, \quad j = 1, \ldots, p. \tag{5.9}$$

In addition, we can bring in the association parameters of various orders, namely, $\boldsymbol{\Theta}_S$ and $\boldsymbol{\Theta}_T$, defined as in the Bahadur reparametrization in (5.5); here also, we may neglect the higher-order association parameters, and only consider the lower-order ones. This way, we reduce the number of association parameters in the model. We may also assume the homogeneity of the association parameter vectors, namely that $\boldsymbol{\Theta}_S = \boldsymbol{\Theta}_T = \boldsymbol{\Theta}$, say. Moreover, depending on the nature of the dosage and response metameters, we may (as in the parallel-line or slope-ratio assays) set suitable restraints on the $\boldsymbol{\beta}_{Sj}$ and $\boldsymbol{\beta}_{Tj}$ that reflect the relative potency or bioequivalence properties in an interpretable manner.

In the context of the logit model we have considered the sample logits based on the observed proportions. In conformity with the notation introduced above, let us denote the sample counterpart of $\pi^{(j)}_{*(1)}$ by $p_*^{(j)}$, and let $q_*^{(j)} = 1 - p_*^{(j)}$, $j = 1, \ldots, p$. Then the marginal logits are defined as

$$\log p_*^{(j)} - \log q_*^{(j)} = Z^{(j)}, \quad j = 1, \ldots, p. \tag{5.10}$$

Also, we assume that corresponding to a dosage level $\mathbf{x}_i$ there are $n_i$ subjects, and the responses refer to the proportions among them; we denote the corresponding $Z^{(j)}$ by $Z^{(j)}(\mathbf{x}_i)$, for $j = 1, \ldots, p$; $i = 1, \ldots, k$. Denoting the population counterparts of the $Z^{(j)}(\mathbf{x}_i)$ by $\zeta^{(j)}(\mathbf{x}_i)$, we obtain by some standard manipulations that for large $n_i$,

$$\sqrt{n_i}\,(\mathbf{Z}_i - \boldsymbol{\zeta}_i) \sim \mathcal{N}_p(\mathbf{0}, \boldsymbol{\Gamma}_i), \tag{5.11}$$

where $\mathbf{Z}_i$ and $\boldsymbol{\zeta}_i$ refer to the $p$-vectors of the individual sample and population logits for the $p$ coordinates, and the dispersion matrix $\boldsymbol{\Gamma}_i$ can be estimated by $\mathbf{V}_i = ((v_{jl,i}))$, where $v_{jj,i} = (p_{*i}^{(j)} q_{*i}^{(j)})^{-1}$, $j = 1, \ldots, p$, and for $j \neq l = 1, \ldots, p$,

$$v_{jl,i} = \sum_{r=0}^{1}\sum_{s=0}^{1} (-1)^{r+s}\, \frac{p_{*i}^{(jl)}(r, s)}{p_{*i}^{(j)}(r)\, p_{*i}^{(l)}(s)}, \tag{5.12}$$

where $p_{*i}^{(j)}(1) = 1 - p_{*i}^{(j)}(0) = p_{*i}^{(j)}$, $j = 1, \ldots, p$; the joint proportion for the $(r, s)$th combination of the responses of the $j$th and $l$th coordinates is denoted by $p_{*i}^{(jl)}(r, s)$, $j \neq l = 1, \ldots, p$; $r, s = 0, 1$. Note that for $j = l$, these joint proportions reduce to the marginal ones when $r = s$, and to 0 otherwise. This way, $v_{jj,i}$ can be obtained from the general expression for $v_{jl,i}$ by letting $j = l$. Further, note that the $\boldsymbol{\zeta}_i$ can be expressed in a linear form as


$$\boldsymbol{\zeta}_i = \boldsymbol{\beta}\mathbf{x}_i, \quad i = 1, \ldots, k, \tag{5.13}$$

where $\boldsymbol{\beta}$ has the rows $\boldsymbol{\beta}_j'$, $j = 1, \ldots, p$. Finally, for different $i$, the $\mathbf{Z}_i$ are stochastically independent. Hence, using the general multivariate WLSE methodology, we may consider the quadratic norm:

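$$Q(\boldsymbol{\beta}) = \sum_{i=1}^{k} n_i (\mathbf{Z}_i - \boldsymbol{\beta}\mathbf{x}_i)'\mathbf{V}_i^{-1}(\mathbf{Z}_i - \boldsymbol{\beta}\mathbf{x}_i); \tag{5.14}$$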

we minimize this with respect to $\boldsymbol{\beta}$ and obtain a set of estimating equations for the general multivariate logit analysis in a GLM setup.
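The computations in (5.10)-(5.14) can be sketched as follows. The first function forms the sample logits and the matrix $\mathbf{V}_i$ of (5.12) from an $n_i \times p$ array of 0/1 responses; the second minimizes the quadratic norm (5.14) in closed form by solving the normal equations for the column-stacked coefficient matrix (called `B` here). The function names, the simulated data and the dose coding $\mathbf{x}_i = (1, \text{dose}_i)'$ are assumptions made only for this illustration.

```python
import numpy as np

def sample_logits_and_cov(data):
    """Sample marginal logits Z and the matrix V with entries as in (5.12)."""
    data = np.asarray(data)
    p = data.shape[1]
    marg = data.mean(axis=0)                       # proportions of 1's per coordinate
    Z = np.log(marg) - np.log(1.0 - marg)
    V = np.empty((p, p))
    for j in range(p):
        for l in range(p):
            total = 0.0
            for r in (0, 1):
                for s in (0, 1):
                    joint = np.mean((data[:, j] == r) & (data[:, l] == s))
                    pj_r = marg[j] if r == 1 else 1.0 - marg[j]
                    pl_s = marg[l] if s == 1 else 1.0 - marg[l]
                    total += (-1) ** (r + s) * joint / (pj_r * pl_s)
            V[j, l] = total                        # reduces to 1/(p q) when j == l
    return Z, V

def wlse(Z_list, V_list, x_list, n_list):
    """Minimize sum_i n_i (Z_i - B x_i)' V_i^{-1} (Z_i - B x_i) over the p x q matrix B."""
    p, q = len(Z_list[0]), len(x_list[0])
    A = np.zeros((p * q, p * q))
    b = np.zeros(p * q)
    for Z, V, x, n in zip(Z_list, V_list, x_list, n_list):
        Vinv = np.linalg.inv(V)
        A += n * np.kron(np.outer(x, x), Vinv)     # normal equations for vec(B)
        b += n * np.kron(x, Vinv @ Z)
    return np.linalg.solve(A, b).reshape((p, q), order="F")

# Hypothetical experiment: three dose levels, p = 3 correlated binary end-points.
rng = np.random.default_rng(0)
doses, n_list = [0.0, 1.0, 2.0], [200, 200, 200]
data_by_dose, x_list = [], []
for d, n in zip(doses, n_list):
    prob = 1.0 / (1.0 + np.exp(-(-0.5 + 0.6 * d)))
    shared = rng.random((n, 1))                    # shared component induces dependence
    data_by_dose.append(((0.5 * shared + 0.5 * rng.random((n, 3))) < prob).astype(int))
    x_list.append(np.array([1.0, d]))

Z_list, V_list = zip(*(sample_logits_and_cov(d) for d in data_by_dose))
B_hat = wlse(Z_list, V_list, x_list, n_list)       # row j estimates beta_j'
```

Because the sketch relies on sample proportions in every cell, it inherits the large-sample character of the method: empty or nearly empty cells would make the logits or $\mathbf{V}_i$ unstable.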

In the context of a bioassay model, for each preparation ($S$ or $T$), we will have a quadratic norm as in (5.14). We add them up to obtain a pooled quadratic norm. Moreover, we put whatever restraints are reasonable for the two matrices $\boldsymbol{\beta}_S$, $\boldsymbol{\beta}_T$ (depending on the design of the bioassay), and then incorporate the WLSE to obtain the appropriate estimating equations. Therefore the quasi-likelihood methodology related to the GLM can again be adopted here to carry out the desired statistical analysis. The large sample flavor of this generalized logit analysis is quite apparent from the above; the degree of largeness will depend on $p$, the number of end-points, as well as the extent of their interdependence. Although the case of the multivariate probit analysis can be presented in a similar manner, the expressions for the associated asymptotic dispersion matrices ($\boldsymbol{\Gamma}_i$) would be more complex, and as a result, the estimators $\mathbf{V}_i$ will also be more complex in nature. These in turn may require even larger sample sizes to justify the asymptotics in actual practice.

We conclude this section with a discussion of the polychotomous response models covering both univariate and multivariate responses. These models are generalizations of the binary response logit models, and they allow for the estimation of unordered, polychotomous responses using either continuous or categorical explanatory variables. Basically, we allow for $J\,(\geq 2)$ categories for a response variable, so that there are $J - 1$ nonredundant logit equations; to accomplish this, we choose one of the categories as a baseline, and for each of these $J - 1$ nonredundant logit equations, we use a separate regression (Agresti, 1990). If we have a multiresponse model involving $p$ traits (as has been discussed earlier), we could allow even the number of categories (i.e., $J$) to be possibly different from one response to another. For the sake of simplicity of presentation, we consider first the uniresponse polychotomous logit model, and then append briefly the multiresponse case (along the lines of the binary model treated earlier).

For the $j$th category, and corresponding to a set $\mathbf{x}_i$ of explanatory variables, let $\pi_j(\mathbf{x}_i)$ be the probability of response, for $j = 1, \ldots, J$; $i = 1, \ldots, I$. Note that we may choose, without any loss of generality, $J$ as the baseline. Then, as in the binary case, by contrasting with the baseline, we define the logits by

$$\log\{\pi_j(\mathbf{x}_i)/\pi_J(\mathbf{x}_i)\} = \zeta_{ji}, \text{ say}, \quad j = 1, \ldots, J - 1. \tag{5.15}$$

Next, for each j, we conceive of a possibly different regression equation, and set


$$\zeta_{ji} = \boldsymbol{\beta}_j'\mathbf{x}_i, \quad j = 1, \ldots, J - 1; \; i = 1, \ldots, I, \tag{5.16}$$

where the $\boldsymbol{\beta}_j$ are unknown regression vectors. We can therefore express the probabilities equivalently as

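$$\pi_j(\mathbf{x}_i) = \frac{\exp\{\boldsymbol{\beta}_j'\mathbf{x}_i\}}{\sum_{l=1}^{J} \exp\{\boldsymbol{\beta}_l'\mathbf{x}_i\}}, \quad j = 1, \ldots, J; \; i = 1, \ldots, I, \tag{5.17}$$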

where conventionally, we let $\boldsymbol{\beta}_J = \mathbf{0}$.

With this modeling, we may virtually extend the same WLSE methodology employed in the binary case; the only extra computation here is the estimate of the asymptotic covariance matrix of the sample counterpart of the vector $\boldsymbol{\zeta}_i = (\zeta_{1i}, \ldots, \zeta_{J-1,i})'$, which is denoted by $\mathbf{Z}_i$. This is easily seen to be $\mathbf{V}_i$, which can be expressed as

$$(p_J(\mathbf{x}_i))^{-1}\mathbf{1}\mathbf{1}' + \operatorname{diag}((p_1(\mathbf{x}_i))^{-1}, \ldots, (p_{J-1}(\mathbf{x}_i))^{-1}). \tag{5.18}$$

Finally, for different levels $i\,(= 1, \ldots, I)$, the $\mathbf{Z}_i$ are independent. As such, we may consider the quadratic norm

$$\sum_{i=1}^{I} n_i (\mathbf{Z}_i - \boldsymbol{\zeta}_i)'(\mathbf{V}_i)^{-1}(\mathbf{Z}_i - \boldsymbol{\zeta}_i), \tag{5.19}$$

which involves the unknown $\boldsymbol{\beta}_j$, $j = 1, \ldots, J - 1$. Therefore, we are to minimize this quadratic norm with respect to these unknown parameters, leading to the estimating equations that yield the estimators of the unknown regression parameters for the polychotomous logistic model.
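A small numerical sketch of (5.17) and (5.18) follows. The function names and the numbers are hypothetical. The first function evaluates the category probabilities with $\boldsymbol{\beta}_J = \mathbf{0}$ for the baseline, and the second builds the $(J-1) \times (J-1)$ matrix in (5.18) from a vector of cell proportions. Used with the WLSE criterion (5.19), this matrix plays the role that $\mathbf{V}_i$ played in the binary case.

```python
import numpy as np

def category_probs(betas, x):
    """Probabilities (5.17); betas is (J-1) x q and the baseline category J has beta_J = 0."""
    eta = np.append(np.asarray(betas, dtype=float) @ x, 0.0)
    eta = eta - eta.max()            # numerical stabilisation; cancels in the ratio
    w = np.exp(eta)
    return w / w.sum()

def baseline_logit_cov(p):
    """Matrix (5.18) for the J-1 baseline-category logits, given J cell proportions."""
    p = np.asarray(p, dtype=float)
    J = len(p)
    return np.ones((J - 1, J - 1)) / p[-1] + np.diag(1.0 / p[:-1])

# Hypothetical example with J = 3 categories and x = (1, dose)'.
betas = np.array([[0.4, -0.3],
                  [-0.2, 0.5]])
x = np.array([1.0, 1.5])
probs = category_probs(betas, x)
V = baseline_logit_cov(probs)
```

The form of (5.18) can be checked against the delta method: with the multinomial covariance $\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}'$ and the Jacobian of the map from proportions to baseline logits, the cross-product term vanishes and exactly the two pieces in (5.18) remain.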

Let us now consider the multivariate polychotomous response data model. A generalization of the (Roy-) Bahadur (1961) representation from the binary to the polychotomous responses, though possible, would lose all its charm and may be quite complex. Rather, we adopt the spirit of identifying the interaction parameters of various orders and for various combinations of the levels ($j = 1, \ldots, J - 1$). If for each of the $p$ responses, we define the $\boldsymbol{\zeta}_{ik}$, $k = 1, \ldots, p$, and their sample counterparts as above, we would have a compound covariance matrix for this entire vector, whose diagonal blocks can be estimated by the $\mathbf{V}_{ik}$ that are defined as above but for the $k$th response variable, while the off-diagonal blocks have the elements as follows:

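$$\frac{p^{(jl)}_{kq,i}}{p^{(j)}_{ki}\, p^{(l)}_{qi}} - \frac{p^{(jJ)}_{kq,i}}{p^{(j)}_{ki}\, p^{(J)}_{qi}} - \frac{p^{(Jl)}_{kq,i}}{p^{(J)}_{ki}\, p^{(l)}_{qi}} + \frac{p^{(JJ)}_{kq,i}}{p^{(J)}_{ki}\, p^{(J)}_{qi}}, \tag{5.20}$$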

for $j, l = 1, \ldots, J - 1$; $k \neq q = 1, \ldots, p$ and $i = 1, \ldots, I$; the $p^{(j)}_{ki}$ refer to the $j$th level, $k$th variate cell, for the dose level $\mathbf{x}_i$, while the $p^{(jl)}_{kq,i}$ refer to the corresponding joint cell proportions at the dose level $\mathbf{x}_i$. Again the WLSE methodology can be adopted to estimate the $\boldsymbol{\beta}_j^{(k)}$ efficiently, and to construct suitable test statistics.


6. Multivariate models in biostatistics

There is an abundance of multivariate models in biostatistical problems. Some of these have already been discussed under biological assays and dosimetric (animal) studies, and generalized linear models, and some others will be discussed in the context of longitudinal data analysis as well as clinical trials. In the present section, we deal with some of the classical multivariate models, and discuss the salient features of nonparametrics and semiparametrics in these perspectives.

In comparing a treatment group with a control group when there are multiple response variables, we typically encounter a multivariate two-sample model. For the treatment group, let $\mathbf{X}_i$, $i = 1, \ldots, m$ be $m$ independent and identically distributed random vectors (i.i.d.r.v.) with a distribution function $F(\mathbf{x})$, $\mathbf{x} \in R^p$, for some $p \geq 1$. Similarly, for the control group, let $\mathbf{Y}_i$, $i = 1, \ldots, n$ be i.i.d.r.v.'s with a distribution function $G(\mathbf{x})$, defined on $R^p$. In a conventional parametric approach, it is generally assumed that

$$F \sim \mathcal{N}_p(\boldsymbol{\theta}_1, \boldsymbol{\Sigma}_1), \qquad G \sim \mathcal{N}_p(\boldsymbol{\theta}_2, \boldsymbol{\Sigma}_2), \tag{6.1}$$

so that all the relevant statistical information is contained in the two mean vectors and the two dispersion matrices that are associated with the normal populations. In this setup, the two sample mean vectors (denoted by $\bar{\mathbf{X}}, \bar{\mathbf{Y}}$) and the dispersion matrices $\mathbf{S}_1, \mathbf{S}_2$ are jointly sufficient for the parameters. Therefore, for drawing statistical conclusions on the parameters, it may be convenient to construct suitable statistics that are functions of $\bar{\mathbf{X}}, \bar{\mathbf{Y}}, \mathbf{S}_1, \mathbf{S}_2$. The classical Hotelling $T^2$-test for the equality of the two mean vectors and the analysis of dispersion tests, studied extensively in the literature, are all functions of these (joint) sufficient statistics. Much of their operational simplicity and theoretical justification would be lost when the underlying distributions are not multinormal. In many biometric studies, the response variables may be typically nonnegative, and even marginally they may have highly skewed distributions, so that assuming that their distributions are (multi-)normal may not be very prudent. Using coordinatewise Box-Cox type transformations (that may not be isomorphic for all coordinates) may sometimes induce a greater degree of symmetry in their marginal laws, though there is no guarantee that the joint distribution of the transformed variables would be actually (or even closely) multinormal. Thus, such parametric procedures may be quite vulnerable to plausible model departures, and the extent of this nonrobustness may be considerably greater than in univariate situations. Moreover, because of some characteristic properties of multivariate normal distributions, canonical reduction of the variates as well as of the parameter space is often advocated; from theoretical perspectives, such affine transformations are often used to express the theoretical results in a compact form.
Yet in many biometric studies, the different responses may be recorded in different units of measurement, and there may not be enough rationale for such affine transformations. It may therefore be argued that invariance with respect to strictly monotone (not necessarily linear) transformations for each coordinate variable may be more prudent. Multivariate nonparametrics and semiparametrics fare better in this perspective. Coordinatewise ranks can be adapted for various other rank tests, and in that way, they also allow for arbitrary transformations from one coordinate to others. This flexibility, on the other hand, precludes affine-invariance, so that the canonical reduction usually employed for the study of distribution theory and optimality properties of linear statistical inference tools in the normal case may no longer be tenable for the general nonparametric case. Moreover, in the univariate case, such nonparametric procedures are genuinely distribution-free, but in a bona fide multivariate situation, they are only permutationally (or conditionally) distribution-free; we refer to Chatterjee and Sen (1964, 1966) for the basic rank-permutation principle that renders this permutational distribution-freeness of coordinatewise rank based (such as the median and the Wilcoxon-Mann-Whitney type) tests; tests based on more general scores are reported in Puri and Sen (1971). We shall discuss here only the multivariate rank-sum and median tests for their simplicity in biostatistical applications.

We denote the coordinate elements of $\mathbf{X}_i$ (and $\mathbf{Y}_i$) by $X_i^{(j)}$ (and $Y_i^{(j)}$), for $j = 1, \ldots, p$. Then, for the $j$th coordinate, we have a set of $N (= m + n)$ observations, and we denote by $R_{ij}^{(1)}$ (and $R_{ij}^{(2)}$) the rank of $X_i^{(j)}$ (and $Y_i^{(j)}$) within this set, for $i = 1, \ldots, m$ (respectively $n$), and $j = 1, \ldots, p$. Recall that for each $j$, the elements (ranks) within that row are the numbers $1, \ldots, N$ permuted in some (random) order, so that the average rank within each row is equal to $(N + 1)/2$. We denote the coordinatewise sample average ranks by

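$$\bar{R}_j^{(1)} = \frac{1}{m}\sum_{i=1}^{m} R_{ij}^{(1)}, \qquad \bar{R}_j^{(2)} = \frac{1}{n}\sum_{i=1}^{n} R_{ij}^{(2)}, \qquad j = 1, \ldots, p. \tag{6.2}$$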

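We also consider the rank-covariance matrix $\mathbf{V}_N = ((v_{Njl}))$ with elements

$$v_{Njl} = \frac{1}{N-1}\left\{\sum_{i=1}^{m}\left(R_{ij}^{(1)} - \frac{N+1}{2}\right)\left(R_{il}^{(1)} - \frac{N+1}{2}\right) + \sum_{i=1}^{n}\left(R_{ij}^{(2)} - \frac{N+1}{2}\right)\left(R_{il}^{(2)} - \frac{N+1}{2}\right)\right\}, \qquad j, l = 1, \ldots, p. \tag{6.3}$$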

Then the multivariate version of the rank-sum test statistic for the two-sample problem can be expressed as

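$$\mathcal{L}_{NW} = \frac{mn}{N}\left(\bar{\mathbf{R}}^{(1)} - \bar{\mathbf{R}}^{(2)}\right)' \mathbf{V}_N^{-1}\left(\bar{\mathbf{R}}^{(1)} - \bar{\mathbf{R}}^{(2)}\right), \tag{6.4}$$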

where $\bar{\mathbf{R}}^{(k)} = (\bar{R}_1^{(k)}, \ldots, \bar{R}_p^{(k)})'$, $k = 1, 2$. Under the null hypothesis $H_0: F = G$, the $N$ vectors of the two sample observations are independent and identically distributed, so that all possible $N!$ permutations of these vectors among themselves are conditionally equally likely. This generates the rank-permutation invariance structure, and a conditionally (permutationally) distribution-free test can be based on this law. For large values of $m, n$, under $H_0$, $\mathcal{L}_{NW}$ has approximately a central chi-square distribution with $p$ degrees of freedom (DF). Also, for local alternatives, it has asymptotically a noncentral chi-square distribution with $p$ DF and a noncentrality parameter that depends on the chosen alternative.
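The construction in (6.2)-(6.4) can be sketched numerically. The following is a minimal illustration, not a reference implementation: the function names are hypothetical, ties are assumed absent (continuous data), and the divisor $N - 1$ in the rank-covariance matrix is an assumption of this sketch (with that choice, the case $p = 1$ reproduces the two-sample Kruskal-Wallis statistic).

```python
import numpy as np

def coordinatewise_ranks(Z):
    """Rank each column of Z separately, giving 1..N (assumes no ties)."""
    return np.argsort(np.argsort(Z, axis=0), axis=0) + 1.0

def multivariate_rank_sum(X, Y):
    """Rank-sum statistic in the spirit of (6.4); X is m x p, Y is n x p."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    m, n = len(X), len(Y)
    N = m + n
    R = coordinatewise_ranks(np.vstack([X, Y]))   # pooled coordinatewise ranks
    centred = R - (N + 1) / 2.0
    V = centred.T @ centred / (N - 1)             # assumed divisor N - 1
    d = R[:m].mean(axis=0) - R[m:].mean(axis=0)   # difference of average ranks
    return (m * n / N) * d @ np.linalg.solve(V, d)

# Small univariate check: ranks 1,2,3 versus 4,5,6,7.
stat = multivariate_rank_sum([[1], [2], [3]], [[4], [5], [6], [7]])
```

Because only coordinatewise ranks enter, the value is unchanged when any single coordinate is transformed by a strictly increasing function, which is the invariance property discussed above.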

Let us next consider the multivariate two-sample median test. We consider an integer $a = [N/2]$, and define the scores

$$T_{Nj}^{(1)} = \frac{1}{m}\sum_{i=1}^{m} I\left(R_{ij}^{(1)} \leq a\right), \qquad T_{Nj}^{(2)} = \frac{1}{n}\sum_{i=1}^{n} I\left(R_{ij}^{(2)} \leq a\right), \qquad j = 1, \ldots, p. \tag{6.5}$$

Also, in the combined sample of size $N$, if we consider a $2 \times 2$ table for the $(j, l)$th coordinates, by counting the number of pairs, $N_{jl}$, that lie in the cell where both the ranks are $\leq a$, we may define the matrix $\mathbf{V}_N^*$ with the elements

$$v_{Njl}^* = \frac{N_{jl}}{N} - \left(\frac{a}{N}\right)^2, \qquad j, l = 1, \ldots, p; \tag{6.6}$$

note that the diagonal elements are nonstochastic and all equal to $a(N - a)/N^2$, but the off-diagonal ones are stochastic. We let $m = n_1$, $n = n_2$ and $\mathbf{T}_N^{(k)} = (T_{N1}^{(k)}, \ldots, T_{Np}^{(k)})'$, $k = 1, 2$. Then the multivariate two-sample median test statistic can be written as

$$\mathcal{L}_{NM} = \sum_{k=1}^{2} n_k\left[\left(\mathbf{T}_N^{(k)} - \frac{a}{N}\mathbf{1}\right)' \mathbf{V}_N^{*-1}\left(\mathbf{T}_N^{(k)} - \frac{a}{N}\mathbf{1}\right)\right]. \tag{6.7}$$

Here also, for small values of $m, n$, the permutation distribution of $\mathcal{L}_{NM}$ can be generated by the $N!$ conditionally equally likely permutations of the $N$ stochastic vectors in the pooled sample; this generates the conditionally (permutationally) distribution-free test for the null hypothesis of homogeneity of the two distributions. For large sample sizes, this conditional null distribution can be approximated well (in probability) by the central chi-square distribution with $p$ DF. Further, for local alternatives, noncentral chi-square distributional approximations are tenable.
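As a rough numerical companion to (6.5)-(6.7), the sketch below computes the two-sample median statistic from coordinatewise pooled ranks. The function name is hypothetical, ties are assumed absent, and the matrix in (6.6) is assumed to be nonsingular for the data at hand.

```python
import numpy as np

def multivariate_median_test(X, Y):
    """Two-sample median statistic in the spirit of (6.7); X is m x p, Y is n x p."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    m, n = len(X), len(Y)
    N = m + n
    a = N // 2                                          # a = [N/2]
    R = np.argsort(np.argsort(np.vstack([X, Y]), axis=0), axis=0) + 1
    low = (R <= a).astype(float)                        # indicators I(rank <= a)
    V = low.T @ low / N - (a / N) ** 2                  # elements as in (6.6)
    stat = 0.0
    for block, size in ((low[:m], m), (low[m:], n)):
        d = block.mean(axis=0) - a / N                  # T_N^(k) - (a/N) 1
        stat += size * d @ np.linalg.solve(V, d)
    return stat

# Univariate check with completely separated samples.
stat = multivariate_median_test([[1], [2], [3]], [[4], [5], [6], [7]])
```

In this check the three smallest observations all come from the first sample, so the statistic attains the value $N = 7$.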

Both the rank-sum and median tests extend to the general case of $c (> 2)$ samples (of sizes $n_1, \ldots, n_c$) drawn from continuous multivariate distributions ($F_1, \ldots, F_c$). Defining the average coordinatewise ranks $\bar{R}_j^{(k)}$, $k = 1, \ldots, c$; $j = 1, \ldots, p$ as in the two-sample case, and also the rank-covariance matrix $\mathbf{V}_N$ as there, we may write the multi-sample multivariate rank-sum test statistic (which is a direct multivariate generalization of the Kruskal-Wallis (1952) test statistic) as

$$\mathcal{L}_{NKW} = \sum_{k=1}^{c} n_k\left[\left(\bar{\mathbf{R}}^{(k)} - \frac{N+1}{2}\mathbf{1}\right)' \mathbf{V}_N^{-1}\left(\bar{\mathbf{R}}^{(k)} - \frac{N+1}{2}\mathbf{1}\right)\right]. \tag{6.8}$$

Similarly, defining $m_j^{(k)} = \sum_{i=1}^{n_k} I(R_{ij}^{(k)} \leq a)$, $j = 1, \ldots, p$; $k = 1, \ldots, c$ and the covariance matrix $\mathbf{V}_N^*$ as in the two-sample multivariate median test, we have the following multi-sample multivariate median test statistic:

$$\mathcal{L}_{NMM} = \sum_{k=1}^{c} n_k \sum_{j=1}^{p}\sum_{l=1}^{p} v_N^{*jl}\left(\frac{m_j^{(k)}}{n_k} - \frac{a}{N}\right)\left(\frac{m_l^{(k)}}{n_k} - \frac{a}{N}\right), \tag{6.9}$$

where $((v_N^{*jl})) = \mathbf{V}_N^{*-1}$. For both the test statistics, for small values of $n_1, \ldots, n_c$, the permutational (conditional) distribution can be generated by the $N!$ conditionally equally likely permutations of the $N$ pooled sample observations (vectors) among themselves, and in this way, one can generate conditionally distribution-free tests for the homogeneity of the $c$ distributions $F_1, \ldots, F_c$. For large values of the sample sizes, their null hypothesis distributions can be approximated by the central chi-square distribution with $p(c - 1)$ DF. Noncentral chi-square distributional approximations remain tenable for local alternatives.

In terms of robustness properties, the median tests generally perform better than the rank-sum tests. However, in terms of (asymptotic) efficiency properties, particularly for nearly normal parent distributions, the rank-sum tests perform better than the median tests. We refer to Puri and Sen (1971, ch. 5) for detailed discussion of the asymptotic properties of these tests along with other multivar- iate nonparametric tests (that are generally computationally more cumbersome).

In a parametric setup, paired-sample tests based on the Student t-statistics or their generalizations are usually advocated. As in the case of two or more independent samples, such parametric tests are vulnerable to plausible model departures, and therefore are not so robust. In the univariate paired sample case, the classical sign test and the Wilcoxon signed-rank test are the nonparametric analogues of the Student paired t-test. In a similar manner, there are multivariate sign tests and signed-rank tests that are the natural analogues of the classical Hotelling $T^2$-test, and they are more robust. For the particular bivariate case (i.e., $p = 2$), the sign test due to Chatterjee (1966) deserves special mention. Suppose that we have $n$ bivariate observations $(X_i^{(1)}, X_i^{(2)})$, $i = 1, \ldots, n$ with a bivariate continuous distribution $F(x_1, x_2)$, defined on $R^2$. Suppose that we want to test the null hypothesis that both the marginal distributions for $F$ have the median zero (any other specified value can be reduced to the null one by simple translation). Let

$$n_{jl} = \sum_{i=1}^{n} I\left(\mathrm{sign}(X_i^{(1)}) = (-1)^j,\ \mathrm{sign}(X_i^{(2)}) = (-1)^l\right), \qquad j, l = 1, 2. \tag{6.10}$$

Note that $n_{1\cdot} = n_{11} + n_{12}$ and $n_{\cdot 1} = n_{11} + n_{21}$ refer to the number of $X_i^{(1)}$ and $X_i^{(2)}$ that are negative. Further, $n_C = n_{11} + n_{22}$ is the number of concordant pairs among the $n$ observations, while $n_D = n_{12} + n_{21} = n - n_C$ is the number of discordant pairs. Chatterjee (1966) considered a conditional test, where given $n_C$ (and $n_D$), he proposed the test statistic

$$\mathcal{L}_{NC} = \frac{(n_{11} - n_{22})^2}{n_C} + \frac{(n_{12} - n_{21})^2}{n_D}, \tag{6.11}$$

and showed that under the null hypothesis, its conditional distribution can be readily obtained from the conditional law

$$P\{n_{11} = k,\ n_{12} = q \mid n_C, n_D\} = \binom{n_C}{k}\binom{n_D}{q} 2^{-n}, \qquad k = 0, \ldots, n_C,\ q = 0, \ldots, n_D. \tag{6.12}$$
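A small sketch of the bivariate sign test may help fix ideas. It assumes that, given $n_C$ and $n_D$, the counts $n_{11}$ and $n_{12}$ are independent symmetric binomials (the sign-invariance structure underlying the conditional law), that no coordinate is exactly zero, and that an empty concordant or discordant margin contributes zero to the statistic; the function names are hypothetical.

```python
from math import comb

def quadrant_counts(pairs):
    """Return (n11, n12, n21, n22); index 1 means negative, 2 means positive."""
    n11 = n12 = n21 = n22 = 0
    for x1, x2 in pairs:
        if x1 < 0 and x2 < 0:
            n11 += 1
        elif x1 < 0:
            n12 += 1
        elif x2 < 0:
            n21 += 1
        else:
            n22 += 1
    return n11, n12, n21, n22

def _statistic(k, q, n_c, n_d):
    # With n11 = k and n12 = q: n11 - n22 = 2k - n_c and n12 - n21 = 2q - n_d.
    part_c = (2 * k - n_c) ** 2 / n_c if n_c else 0.0   # assumed 0 if no concordant pairs
    part_d = (2 * q - n_d) ** 2 / n_d if n_d else 0.0   # assumed 0 if no discordant pairs
    return part_c + part_d

def chatterjee_sign_test(pairs):
    """Return the statistic (6.11) and its exact conditional p-value."""
    n11, n12, n21, n22 = quadrant_counts(pairs)
    n_c, n_d = n11 + n22, n12 + n21
    observed = _statistic(n11, n12, n_c, n_d)
    favourable = sum(
        comb(n_c, k) * comb(n_d, q)
        for k in range(n_c + 1)
        for q in range(n_d + 1)
        if _statistic(k, q, n_c, n_d) >= observed - 1e-12
    )
    return observed, favourable / 2 ** (n_c + n_d)

# Four observations in the positive quadrant, one in each discordant quadrant.
stat, p_value = chatterjee_sign_test([(1, 1)] * 4 + [(-1, 1), (1, -1)])
```

For these six observations the statistic equals 4 and the conditional p-value is 1/8, since only the two extreme splits of the four concordant pairs are at least as extreme as the observed one.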

He also showed that this conditional null distribution of $\mathcal{L}_{NC}$, given $n_C$, can be approximated well (in probability) by the central chi-square distribution with 2 DF, when $n$ is large. Motivated by this test, we may consider the general multivariate case where $\mathbf{X}_i = (X_i^{(1)}, \ldots, X_i^{(p)})'$, $i = 1, \ldots, n$, and we want to test the null hypothesis that each of the $p$ marginal distributions has the null median. We define

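$$\begin{aligned} n_j^+ &= \sum_{i=1}^{n} I\left(\mathrm{sign}\, X_i^{(j)} = 1\right), \qquad j = 1, \ldots, p; \\ n_{jl} &= \sum_{i=1}^{n} I\left(X_i^{(j)} > 0,\ X_i^{(l)} > 0\right), \qquad j, l = 1, \ldots, p; \end{aligned} \tag{6.13}$$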

.jt = n= ) ' j ' 1 = 1 , . . . ,p; v . =

Further, we define $\mathbf{T}_n = (T_{n1}, \ldots, T_{np})'$ with $T_{nj} = (n_j^+ - n/2)/n$, $j = 1, \ldots, p$. Then a natural analogue of the bivariate sign-test statistic in the general multivariate case is the following:

$$\mathcal{L}_{nS} = n\,\mathbf{T}_n' \mathbf{V}_n^{-1} \mathbf{T}_n. \tag{6.14}$$

Here also, the conditional null distribution can be generated by the $2^n$ conditionally equally likely sign-inversions of the $n$ observation vectors, while for large values of $n$, this conditional law can be well approximated, in probability, by the central chi-square distribution with $p$ DF. Though such (bi- and) multivariate sign tests are conditionally distribution-free and are robust, they may not in general be fully efficient (even asymptotically) for near (multi)normal $F$. For this reason, we shall consider next the multivariate signed-rank tests (Sen and Puri, 1967) that generally combine robustness and conditional distribution-freeness with good asymptotic efficiency properties.
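The quadratic form (6.14) can be illustrated as follows. This is only a sketch under a stated assumption: the matrix $\mathbf{V}_n$ is taken here to be the conditional covariance matrix of $\sqrt{n}\,\mathbf{T}_n$ under the $2^n$ equally likely sign-inversions, namely $(4n)^{-1}\sum_i \mathbf{s}_i\mathbf{s}_i'$ with $\mathbf{s}_i$ the vector of coordinate signs, which may differ in form from the estimator used in the text. The function name is hypothetical and exact zeros are assumed absent.

```python
import numpy as np

def multivariate_sign_statistic(X):
    """Sign statistic n T' V^{-1} T with an assumed sign-inversion covariance V."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    S = np.sign(X)                          # +1 / -1 (no exact zeros assumed)
    n_plus = (S > 0).sum(axis=0)            # n_j^+ for each coordinate
    T = (n_plus - n / 2.0) / n              # T_nj = (n_j^+ - n/2) / n
    V = S.T @ S / (4.0 * n)                 # assumed conditional covariance
    return n * T @ np.linalg.solve(V, T)

# Univariate check: 7 positive values out of 8.
stat = multivariate_sign_statistic([[1], [2], [3], [4], [5], [6], [7], [-1]])
```

For $p = 1$ this reduces to the familiar form $(2n^+ - n)^2/n$ of the sign test, which equals 4.5 in the check above.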

A basic assumption in this context is the following: the distribution $F$ is diagonally symmetric about its median vector. This means that under the null hypothesis that all the $p$ marginal medians are null, $\mathbf{X}$ and $(-1)\mathbf{X}$ both have the same distribution (say $F_0$), which is diagonally symmetric about $\mathbf{0}$. We therefore write

$$F(\mathbf{x}) = F_0(\mathbf{x} - \boldsymbol{\theta}), \qquad \forall\, \mathbf{x} \in R^p, \tag{6.15}$$

where $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)'$ stands for the vector of marginal medians, and we frame the null hypothesis as $H_0: \boldsymbol{\theta} = \mathbf{0}$ against the set of alternatives that $\boldsymbol{\theta} \neq \mathbf{0}$. Let us consider the vector $\mathbf{W}_n$ of coordinatewise signed-rank statistics:

$$W_n^{(j)} = n^{-1}\sum_{i=1}^{n} R_{ni}^{(j)} S_{ij}, \qquad j = 1, \ldots, p, \tag{6.16}$$

where $S_{ij} = \mathrm{sign}(X_i^{(j)})$, and $R_{ni}^{(j)} = \sum_{r=1}^{n} I(|X_r^{(j)}| \leq |X_i^{(j)}|)$, for $i = 1, \ldots, n$; $j = 1, \ldots, p$. We also consider a $p \times p$ (stochastic) matrix $\mathbf{V}_n$ whose elements are given by

$$v_{njl} = n^{-1}\sum_{i=1}^{n} R_{ni}^{(j)} R_{ni}^{(l)} S_{ij} S_{il}, \qquad j, l = 1, \ldots, p. \tag{6.17}$$

Then the multivariate signed-rank test statistic can be expressed as

$$\mathcal{L}_{nW} = n\,\mathbf{W}_n' \mathbf{V}_n^{-1} \mathbf{W}_n; \tag{6.18}$$

(Sen and Puri, 1967). Here also, the exact (conditional) distribution of $\mathcal{L}_{nW}$ can be obtained by enumerating all possible $2^n$ conditionally equally likely sign-inversions of the $\mathbf{X}_i$, $i = 1, \ldots, n$. Further, for large sample sizes, this null distribution can be approximated in probability by the central chi-square distribution with $p$ DF; for local alternatives, noncentral chi-square distributional approximations hold.
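The statistic (6.16)-(6.18) and its sign-inversion distribution can be sketched as below. The function names are hypothetical; the sketch assumes no ties among the absolute values within a coordinate, no exact zeros, and a nonsingular $\mathbf{V}_n$. Full enumeration of the $2^n$ sign-inversions is feasible only for small $n$.

```python
import itertools
import numpy as np

def signed_rank_statistic(X):
    """Signed-rank statistic n W' V^{-1} W following (6.16)-(6.18)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    S = np.sign(X)
    # Rank of |X_i^(j)| among the n absolute values of coordinate j (1..n).
    R = np.argsort(np.argsort(np.abs(X), axis=0), axis=0) + 1.0
    A = R * S                         # signed ranks
    W = A.mean(axis=0)                # (6.16)
    V = A.T @ A / n                   # (6.17)
    return n * W @ np.linalg.solve(V, W)

def sign_inversion_pvalue(X, stat_fn=signed_rank_statistic):
    """Exact conditional p-value over all 2^n sign-inversions of the rows of X."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    observed = stat_fn(X)
    count = 0
    for g in itertools.product((1.0, -1.0), repeat=n):
        flipped = X * np.array(g)[:, None]    # flip whole observation vectors
        if stat_fn(flipped) >= observed - 1e-12:
            count += 1
    return count / 2 ** n

# Univariate check with four positive observations.
X = [[1.0], [2.0], [3.0], [4.0]]
stat, p_value = signed_rank_statistic(X), sign_inversion_pvalue(X)
```

Here the statistic equals 10/3, and only the all-positive and all-negative sign patterns are as extreme, giving a conditional p-value of 2/16.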

There are several important applications of the multivariate sign and signed-rank tests in biostatistical studies. One of the most important ones relates to the so-called multivariate paired sample problems. For example, the status of a health disorder or disease may be assessed by means of $p$ response variables, so that for each observation, we have a set of $2p$ responses, $p$ before a treatment is initiated, and $p$ after the course of the treatment is completed. Let us denote these by $\mathbf{X}_i$ and $\mathbf{Y}_i$, respectively, for $i = 1, \ldots, n$. Note that $(\mathbf{X}_i, \mathbf{Y}_i)$ has a joint distribution defined on $R^{2p}$. If the treatment is effective then we would expect that the distribution of $\mathbf{Z}_i = \mathbf{Y}_i - \mathbf{X}_i$ would be shifted in some way, while under the null hypothesis of no treatment effect, the distribution of $\mathbf{Z}_i$ would be diagonally symmetric about $\mathbf{0}$. In this way, we have a nonparametric analogue of the multivariate extension of the classical paired-sample t-test. Based on computational simplicity and robustness properties, both the multivariate sign and signed-rank tests are used in this context, and there are some SAS programs (known as the M-rank procedures) available in the literature.

These multivariate rank test statistics, both in the one-sample and linear model perspectives, are also useful in providing robust estimators of the parameters of interest. For example, for the multivariate location model (where the underlying d.f. $F$ is taken as $F_0(\mathbf{x} - \boldsymbol{\theta})$ with $F_0$ diagonally symmetric about $\mathbf{0}$, and $\boldsymbol{\theta}$ is treated as the unknown location parameter), we may consider the coordinatewise Wilcoxon-scores estimator

$$\hat{\theta}_{nj} = \mathrm{median}\left\{\tfrac{1}{2}\left(X_i^{(j)} + X_l^{(j)}\right) : 1 \leq i \leq l \leq n\right\}, \qquad j = 1, \ldots, p; \tag{6.19}$$

these estimators have already been discussed in earlier sections. In a similar manner, for the multivariate two-sample location model, we may consider the coordinatewise R-estimators based on the coordinatewise two-sample rank-sum test statistics. For the multivariate linear models too, we may consider the coordinatewise regression and intercept parameters, and provide robust estimators of them based on appropriate rank statistics. We refer to Jurečková and Sen (1996, ch. 6) for details. We shall discuss these in a more general context in the next section.
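The coordinatewise estimator in (6.19) is the median of the pairwise (Walsh) averages, commonly known as the Hodges-Lehmann estimator. A minimal sketch follows; the function names are hypothetical, and the averages are taken over all pairs $i \leq l$, which costs $O(n^2)$ per coordinate.

```python
from statistics import median

def walsh_average_median(x):
    """Median of the averages (x[i] + x[l]) / 2 over all pairs i <= l."""
    n = len(x)
    return median((x[i] + x[l]) / 2 for i in range(n) for l in range(i, n))

def coordinatewise_location(X):
    """Apply the univariate estimator to each coordinate of the rows of X."""
    p = len(X[0])
    return [walsh_average_median([row[j] for row in X]) for j in range(p)]

# Three bivariate observations; the second coordinate is ten times the first.
estimate = coordinatewise_location([[1, 10], [2, 20], [6, 60]])
```

For the first coordinate the six Walsh averages are 1, 1.5, 2, 3.5, 4 and 6, so the estimate is 2.75; the second coordinate scales accordingly to 27.5.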

Another important extension of the multivariate one-sample (or paired-sample) model is the so-called multivariate paired comparison models. In a general setup of multivariate analysis of covariance (MANOCOVA) problems such models are treated in Sen (1995b), where references to the pertinent literature (covering simpler models as well) have also been cited. We consider $t$ objects (players), forming $\binom{t}{2}$ possible pairs; for the pair $(i, j): 1 \leq i < j \leq t$, we denote the response vector (judged on a preference scale) by $\mathbf{X}_{ij} = (X_{ij}^{(1)}, \ldots, X_{ij}^{(p)})'$, when there are $p$ dichotomous attributes, where each $X_{ij}^{(k)}$ can only assume the values 0 and 1. Thus, the probability law of $\mathbf{X}_{ij}$ is defined over the $2^p$-simplex, and it is denoted by $\boldsymbol{\pi}_{ij} = \{\pi_{ij}(\mathbf{i}) : \mathbf{i} \in \mathcal{J}\}$, where $\mathcal{J}$ is defined as in Section 5, with $\mathbf{i} = (i_1, \ldots, i_p)'$ and each $i_j$ assuming two possible values 0 and 1. In this way, we confront the set of probability laws

$$\boldsymbol{\Pi} = \{\boldsymbol{\pi}_{ij},\ 1 \leq i < j \leq t\}, \tag{6.20}$$

having the same discouraging factor (as in Section 5) that there are too many parameters (when $p$ is large). Therefore, we are tempted to use the same (Bahadur-)reparametrization as formulated in (5.1)-(5.7). In view of the multiple subscripts here, we simplify the notations in (5.3) and (5.4) a bit more, and let

$$\boldsymbol{\pi}_{*ij} = \left(\pi_{*ij(0)}^{(1)}, \ldots, \pi_{*ij(0)}^{(p)}\right)', \qquad 1 \leq i < j \leq t. \tag{6.21}$$

We also define the set of association parameters of various orders for the pair $(i, j)$ by $\boldsymbol{\Theta}_{ij}$, and make a basic assumption:

$$[\mathrm{A}]: \quad \boldsymbol{\Theta}_{ij} = \boldsymbol{\Theta}, \qquad \forall\, 1 \leq i < j \leq t. \tag{6.22}$$

Drawing a parallel with the univariate model, for the marginal probabilities $\pi_{*ij(0)}^{(k)}$ we set

$$\pi_{*ij(0)}^{(k)} = \frac{\alpha_{ki}}{\alpha_{ki} + \alpha_{kj}}, \qquad 1 \leq i < j \leq t;\ k = 1, \ldots, p, \tag{6.23}$$

where the $\alpha_{kr}$ are all nonnegative and

$$\sum_{r=1}^{t} \alpha_{kr} = 1, \qquad k = 1, \ldots, p. \tag{6.24}$$

In this way, we reduce the number of linearly independent $\alpha_{kr}$ to $p(t - 1)$ (as it should be). The MANOCOVA paired comparison models basically address the homogeneity of these $\alpha$ in a well defined manner, treating (under [A]) the association parameters as nuisance (but homogeneous). This formulation is more general than that of Davidson and Bradley (1970), who, in the setup of a multivariate analysis of variance (MANOVA) model, additionally imposed the restraints that all association parameters involving three or more traits are null. Keeping the univariate paired comparison models in mind, we set the component null hypothesis as

$$H_0^{(k)}: \alpha_{k1} = \cdots = \alpha_{kt}, \qquad \text{for } k = 1, \ldots, p. \tag{6.25}$$

Our MANOCOVA model can be posed in a more general way than the MANOVA model. We let $p = p_1 + p_2$ for some nonnegative integers $p_1, p_2$, and define

$$H_0 = \bigcap_{k=1}^{p} H_0^{(k)} = H_{01} \cap H_{02}; \qquad H_{01} = \bigcap_{k=1}^{p_1} H_0^{(k)}, \qquad H_{02} = \bigcap_{k=p_1+1}^{p} H_0^{(k)}. \tag{6.26}$$

In a general MANOCOVA model, we want to test the null hypothesis that $H_{01}$ holds, assuming that $H_{02}$ holds (under [A]). This can be written equivalently as

$$H_0^* = H_0 \mid H_{02} \equiv H_{01} \mid H_{02}. \tag{6.27}$$

In particular, if $p_2 = 0$, the MANOCOVA model reduces to the MANOVA model. Hence, we describe here the general MANOCOVA case. Basically, as in Sen (1995b), we extend Chatterjee's (1966) bivariate sign-invariance arguments to the multivariate paired comparison models, and thereby obtain some conditionally distribution-free tests that have simple large sample properties. For each pair $(i, j): 1 \leq i < j \leq t$, and each pair of traits $(k, q): k, q = 1, \ldots, p$, as in (6.10)-(6.13), we denote by $n_{ij}$ the number of (independent) observations ($\mathbf{X}_{ijl}$), and let $n_{*ij(1)}^{(k)}$ and $n_{*ij(0)}^{(k)}$ be the number of observations on the $k$th trait having the values 1 and 0 respectively ($n_{*ij(1)}^{(k)} + n_{*ij(0)}^{(k)} = n_{ij}$, $\forall (i, j)$). Similarly, we pool all the $\binom{t}{2}$ tables into a combined table of $n = \sum_{1 \leq i < j \leq t} n_{ij}$ observations, and as in (6.10)-(6.13), for each pair $(k, l): k, l = 1, \ldots, p$, we obtain the estimate $\hat{\theta}_{kl}$ of $\theta_{kl}$. The corresponding matrix of order $p \times p$ is denoted by $\hat{\boldsymbol{\Theta}}_n$. We then define

t

=" n i j { n , i j ( o ) , i j ( l ) J , , ' ' ' , . . . . ' P "

j = l , ¢ i

Further , we let

$$\mathbf{T}_{ni} = \left(T_{ni}^{(1)}, \ldots, T_{ni}^{(p)}\right)', \qquad i = 1, \ldots, t. \tag{6.29}$$

Then for the MANOVA hypothesis ($H_0$) testing problem, we consider the test statistic

$$\mathcal{L}_{nPC} = t^{-1}\sum_{i=1}^{t} \mathbf{T}_{ni}' \hat{\boldsymbol{\Theta}}_n^{-1} \mathbf{T}_{ni}. \tag{6.30}$$

Under assumption [A] and the null hypothesis $H_0$, the conditional distribution of $\mathcal{L}_{nPC}$, given the vector of signs of the individual observations and their inversions, can be enumerated for small values of $n_{ij}$, $1 \leq i < j \leq t$, and in large samples, this can be well approximated (in probability) by the central chi-square distribution with $p(t - 1)$ DF. In a similar manner, we consider the partition of the $\mathbf{T}_{ni}$ relating to the first $p_1$ coordinates, and also the corresponding partition of $\hat{\boldsymbol{\Theta}}_n$ of order $p_1 \times p_1$, and construct a test statistic similar to the one considered above; we denote this by $\mathcal{L}_{nPC}^{(1)}$. Its distribution theory (conditional as well as asymptotic) then follows the same lines as above, but the DF would be equal to $p_1(t - 1)$ instead of $p(t - 1)$. Therefore, if we want to test only for the null hypothesis $H_{01}$, ignoring the remaining $p_2$ traits, we could use $\mathcal{L}_{nPC}^{(1)}$ as a MANOVA test statistic. In a similar manner, to test the null hypothesis $H_{02}$, ignoring $H_{01}$, we may consider a similar test statistic, denoted by $\mathcal{L}_{nPC}^{(2)}$, which is solely based on the last $p_2$ traits. Finally, to test the null hypothesis $H_0^*$ (treating $H_{02}$ as tenable), we consider the concomitant-adjusted (MANOCOVA) test statistic

$$\mathcal{L}_{nPC}^{(1|2)} = \mathcal{L}_{nPC} - \mathcal{L}_{nPC}^{(2)}. \tag{6.31}$$

It is easy to show that $\mathcal{L}_{nPC}^{(1|2)}$ is nonnegative, and its asymptotic null distribution is central chi-square with $p_1(t - 1)$ DF. It is also possible to express $\mathcal{L}_{nPC}^{(1|2)}$ in terms of a residual sign-statistics vector (Sen, 1995b), but computationally, the formula given above is simpler. A comparison of the MANOCOVA test based on $\mathcal{L}_{nPC}^{(1|2)}$ with the MANOVA test based on $\mathcal{L}_{nPC}^{(1)}$ reveals the power-superiority of the MANOCOVA test to the MANOVA test; we refer to Sen (1995b) for details.

This natural way of formulating a (M)ANOCOVA test statistic as a difference of two test statistics, namely, the (M)ANOVA test statistic for the entire set of response and concomitant variates and a parallel test statistic only for the concomitant variates, has a far reaching impact in nonparametrics and robustness studies. This technique works out well for the usual (M)ANOCOVA models when the concomitant variates are stochastic, and it also holds for semiparametric and robust test statistics which will be considered in Section 8. For some motivation of this type of MANOCOVA procedure, we refer to Sen (1984).

We may also illustrate the role of multivariate nonparametrics in another important area: aligned rank tests for blocked designs. In one sense, paired comparison designs relate to incomplete block designs of plot size equal to 2. To start with, we consider a complete block (randomised or two-way layout) design. Suppose that there are $n$ blocks of $t (\geq 2)$ plots each, where $t$ different treatments are applied. Let $X_{ij}$ be the response of the plot in the $i$th block receiving the $j$th treatment, for $j = 1, \ldots, t$; $i = 1, \ldots, n$. In a conventional normal error model, we let

$$X_{ij} = \mu + \beta_i + \tau_j + e_{ij}, \qquad j = 1, \ldots, t;\ i = 1, \ldots, n, \tag{6.32}$$

where $\mu$ is the overall mean effect, $\beta_i$ stands for the specific $i$th block effect, $\tau_j$ for the $j$th treatment effect, and the errors $e_{ij}$ are assumed to be i.i.d. normal random variables with 0 mean and a finite (positive) but unknown variance $\sigma^2$. The additivity of the block and treatment effects, and the independence, normality and homoscedasticity of the errors constitute the fundamental assumptions for standard parametric procedures. In practice, particularly in biomedical applications, one or more of these assumptions may not be tenable, and as a result, the standard parametric procedures may be quite nonrobust. In many biomedical and psychometric applications, instead of the observations $X_{ij}$, $j = 1, \ldots, t$, we may have only a relative ranking of the $t$ objects within each block. For example, the blocks may relate to independent judges who are to rank (independently) $t$ players, so that the observed data relate to the set of rankings $\mathbf{r}_i = (r_{i1}, \ldots, r_{it})'$ made by the $i$th judge, for $i = 1, \ldots, n$. These rank vectors can be taken as independent from judge to judge, and the hypothesis of no player (i.e., treatment) effect relates to the interchangeability of the within-block ranks among themselves. Note that if there is no tie, the elements of each $\mathbf{r}_i$ are the numbers $1, \ldots, t$, permuted in some (stochastic) order. Therefore the null hypothesis of interchangeability can be formulated in terms of all possible $t!$ permutations of $(1, \ldots, t)$ being equally likely for each $\mathbf{r}_i$, $i = 1, \ldots, n$. In this way, we generate a set of $(t!)^n$ equally likely realizations (intra-column permutations) of $\mathbf{R} = (\mathbf{r}_1, \ldots, \mathbf{r}_n)$, and a test based solely on this rank-collection matrix $\mathbf{R}$ (of order $t \times n$) will be exactly distribution-free (under the hypothesis of interchangeability). The classical Brown-Mood (1951) median test and the Friedman (1937) rank-sum test are exclusively based on this rank-collection matrix, and are therefore exactly distribution-free. Such tests are also called intra-block rank tests, and the procedure of ranking objects within the blocks separately is termed the method of $n$-ranking.
Such intrablock rank tests even allow the flexibility of having an error distribution different from block to block (that is, possible nonadditivity of the block effects), and the hypothesis of interchangeability may also hold when the errors are not necessarily independent; this case arises frequently in mixed models where the block effects are possibly random while the treatment effects are not. In such a case, no specific distributional assumption may be needed on the block effects. Similarly, the spread or the variance (when the latter exists) of the errors may vary from block to block, so that the homoscedasticity condition as imposed in the normal theory case may not be needed here.

Corresponding to the numbers $1, \ldots, t$, we introduce a set of scores $a(1), \ldots, a(t)$, and let

$$T_{nj} = \sum_{i=1}^{n} a(r_{ij}), \qquad j = 1, \ldots, t. \tag{6.33}$$

If we let $\bar{a} = t^{-1}\sum_{i=1}^{t} a(i)$, we note that by definition, in the absence of any tie among the ranks, $\sum_{j=1}^{t} T_{nj} = nt\bar{a}$. We also denote by

$$A^2 = (t - 1)^{-1}\sum_{j=1}^{t}\left(a(j) - \bar{a}\right)^2. \tag{6.34}$$

Then a test statistic for testing the hypothesis of interchangeability may be posed as

$$\mathcal{L}_{na} = (nA^2)^{-1}\sum_{j=1}^{t}\left(T_{nj} - n\bar{a}\right)^2. \tag{6.35}$$

The Brown-Mood test statistic corresponds to the score function $a(j) = 1$ or $0$, according as $j \geq a$ or not, where $a: 1 < a \leq t$, and typically $a$ is taken as $[(t + 1)/2]$. The Friedman test is based on the score function $a(j) = j$, $j = 1, \ldots, t$. Though the Brown-Mood test performs better than the Friedman test when the underlying error distribution is Laplace (or double-exponential), the latter performs better than the former for nearly normal distributions, and for the logistic distribution it is asymptotically best within the class of tests based on the rank-collection matrix $\mathbf{R}$.
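The general score statistic (6.33)-(6.35) is easy to compute directly. The sketch below is illustrative only: the function names are hypothetical, ties within a block are assumed absent, and the score function is passed in as an argument, so that the identity scores give the Friedman statistic and a 0-1 score gives a Brown-Mood type statistic.

```python
def within_block_ranks(row):
    """Ranks 1..t of the entries of one block (assumes no ties)."""
    order = sorted(range(len(row)), key=row.__getitem__)
    ranks = [0] * len(row)
    for r, idx in enumerate(order, start=1):
        ranks[idx] = r
    return ranks

def intrablock_score_statistic(X, score=lambda j: j):
    """Statistic (6.35) for an n x t layout X (rows are blocks)."""
    n, t = len(X), len(X[0])
    a = [score(j) for j in range(1, t + 1)]
    a_bar = sum(a) / t
    A2 = sum((v - a_bar) ** 2 for v in a) / (t - 1)          # (6.34)
    T = [0.0] * t
    for row in X:
        for j, r in enumerate(within_block_ranks(row)):
            T[j] += score(r)                                  # (6.33)
    return sum((Tj - n * a_bar) ** 2 for Tj in T) / (n * A2)

blocks = [[1, 2, 3], [1, 2, 3], [1, 3, 2], [2, 1, 3]]
friedman = intrablock_score_statistic(blocks)                          # a(j) = j
brown_mood = intrablock_score_statistic(blocks, lambda j: 1 if j >= 2 else 0)
```

With the identity scores, the value agrees with the usual Friedman formula $12\{nt(t+1)\}^{-1}\sum_j T_{nj}^2 - 3n(t+1)$, which is 4.5 for these four blocks; the 0-1 scores with cut-off $a = 2$ give 3.5.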

In spite of having all such nice robustness properties, the intrablock rank tests may not incorporate the information contained in the interblock comparisons (particularly noticeable when the block effects are additive), so that they are generally not fully efficient, even asymptotically and even for some specific error distributions. To illustrate this feature, let us consider the special case of blocks of size 2. In that case, the intrablock rank tests are essentially based on the sign statistic involving the signs of the intrablock contrasts (or paired differences). In the same setup, if instead of the sign test we consider the Wilcoxon signed-rank test statistic, then, through the ranks of the absolute values of these differences, it incorporates some interblock information, and as a result it is typically more efficient than the sign statistic. Guided by this observation, one can consider a general class of aligned rank tests posed as follows.

We choose a measure of the central tendency in each block; the mean, median, modified mean, trimmed mean, and Winsorized mean are typical examples of such a translation-equivariant estimator. We denote such a measure for the $i$th block observations by $\tilde{X}_i$, $i = 1, \ldots, n$. We then define the aligned observations by

$$Y_{ij} = X_{ij} - \tilde{X}_i, \qquad j = 1, \ldots, t;\ i = 1, \ldots, n. \tag{6.36}$$

In an additive model, we may set $\sum_{j=1}^{t} \tau_j = 0$, so that the $Y_{ij}$ are free from the block effects, and also under the hypothesis of no treatment effect, for each $i (= 1, \ldots, n)$, $(Y_{i1}, \ldots, Y_{it})$ are exchangeable. Therefore, we may rank all the $N (= nt)$ aligned observations among themselves (in the way we did for the Kruskal-Wallis test), and base a test on these aligned ranks. The basic difference between the two situations is that here, for each $i (= 1, \ldots, n)$, the $Y_{ij}$, $j = 1, \ldots, t$ are dependent while in the $c$-sample problem all the observations are assumed to be independent. Adjustments for this dependence can be made with the multivariate rank-permutation and asymptotic distribution theory (Sen, 1968b), and these we explain below.

Let $R_{ij}$ be the rank of $Y_{ij}$ among the $N$ aligned observations $Y_{rs}$, $s = 1, \ldots, t$, $r = 1, \ldots, n$; by virtue of the assumed (absolute) continuity of the joint distribution of $(X_{i1}, \ldots, X_{it})$ (for each $i = 1, \ldots, n$), ties among the $Y_{ij}$ can be neglected in probability, so the aligned ranks $R_{ij}$, $1 \leq i \leq n$, $j = 1, \ldots, t$ represent the numbers $1, \ldots, N$ permuted in some (stochastic) order. We write this aligned rank-collection matrix of order $t \times n$ as

$$\mathbf{R}_N = (\mathbf{R}_1, \ldots, \mathbf{R}_n), \tag{6.37}$$

so that there are $N!$ possible realizations of this aligned rank-collection matrix. We denote by $\mathbf{R}_N^0$ the reduced rank-collection matrix that is obtained from $\mathbf{R}_N$ by permuting the elements in each column in such a way that they are in natural order (though they need not be consecutive integers). Note that on letting $M = (N!)/(t!)^n$, we can partition the totality of $N!$ realizations of $\mathbf{R}_N$ into $M$ subsets, such that each of these subsets contains exactly $(t!)^n$ elements, and further, the conditional distribution of $\mathbf{R}_N$, given $\mathbf{R}_N^0$, under the null hypothesis of interchangeability, is discrete uniform over the $(t!)^n$ realizations with a common probability mass $(t!)^{-n}$. We denote this permutational (conditional) probability law by $\mathcal{P}_n$, and construct a test statistic by incorporating this law. For this, we let $\mathbf{T}_N = (T_{N1}, \ldots, T_{Nt})'$, where

$$T_{Nj} = n^{-1} \sum_{i=1}^{n} R_{ij}, \quad j = 1, \ldots, t. \tag{6.38}$$

Also, let $\bar{R}_{i\cdot} = t^{-1} \sum_{j=1}^{t} R_{ij}$, $i = 1, \ldots, n$, and

$$\sigma^2(\mathcal{P}_n) = \frac{1}{n(t-1)} \sum_{i=1}^{n} \sum_{j=1}^{t} (R_{ij} - \bar{R}_{i\cdot})^2. \tag{6.39}$$

Then, it follows that $E\{\mathbf{T}_N \mid \mathcal{P}_n\} = \frac{N+1}{2}\mathbf{1}_t$, and

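$$n\,E\{(\mathbf{T}_N - \tfrac{N+1}{2}\mathbf{1}_t)(\mathbf{T}_N - \tfrac{N+1}{2}\mathbf{1}_t)' \mid \mathcal{P}_n\} = (\mathbf{I}_t - t^{-1}\mathbf{1}_t\mathbf{1}_t')\,\sigma^2(\mathcal{P}_n). \tag{6.40}$$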

This suggests the Kruskal-Wallis type test statistic

$$\mathcal{L}_N^{AKW} = \frac{n}{\sigma^2(\mathcal{P}_n)} \sum_{j=1}^{t} \left( T_{Nj} - \frac{N+1}{2} \right)^2. \tag{6.41}$$

The test statistic can be immediately extended to general score aligned rank test statistics by using scores aN(i), i = 1 , . . . , N instead of the natural numbers 1, . . . ,N (Sen, 1968b).
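To make the construction in (6.37) to (6.41) concrete, the following minimal Python sketch computes the aligned rank statistic for a small complete block layout. It assumes alignment by subtraction of the block mean (one possible choice of alignment), no ties among the aligned observations, and the natural-number scores; the function name `aligned_rank_statistic` and the sample data are hypothetical and serve only as an illustration.

```python
def aligned_rank_statistic(x):
    """x: list of n blocks, each a list of t responses.

    Returns (statistic of (6.41), treatment rank averages T_N, sigma^2(P_n)).
    Assumes there are no ties among the aligned observations.
    """
    n, t = len(x), len(x[0])
    N = n * t
    # Alignment: subtract the block mean (an assumed choice of alignment).
    y = [[v - sum(row) / t for v in row] for row in x]
    # Rank all N aligned observations jointly.
    flat = sorted((y[i][j], i, j) for i in range(n) for j in range(t))
    R = [[0] * t for _ in range(n)]
    for rank, (_, i, j) in enumerate(flat, start=1):
        R[i][j] = rank
    # (6.38): average aligned rank for each treatment.
    T = [sum(R[i][j] for i in range(n)) / n for j in range(t)]
    # (6.39): pooled within-block variance of the aligned ranks.
    sigma2 = sum((R[i][j] - sum(R[i]) / t) ** 2
                 for i in range(n) for j in range(t)) / (n * (t - 1))
    # (6.41): Kruskal-Wallis type quadratic form.
    stat = n / sigma2 * sum((T[j] - (N + 1) / 2) ** 2 for j in range(t))
    return stat, T, sigma2


# Hypothetical data: n = 4 blocks, t = 3 treatments.
data = [[10.1, 11.3, 12.9], [20.4, 21.0, 23.2],
        [5.2, 6.9, 7.1], [15.0, 15.8, 18.1]]
stat, T, sigma2 = aligned_rank_statistic(data)
print(stat, T, sigma2)
```

Because the alignment removes any additive block constant, adding a different constant to each block leaves the statistic unchanged.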

Such aligned rank tests are clearly conditionally (permutationally) distribution-free under the null hypothesis of interchangeability, though their distribution then would depend on the reduced rank-collection matrix $\mathbf{R}^{\circ}$ that is held fixed. Though the task of enumerating this conditional distribution of $\mathcal{L}_N^{AKW}$ is manageable for small values of $n$ and $t$, the job becomes prohibitively laborious when $n$ is not so small. However, for large $n$, the conditional (as well as the unconditional) null distribution of $\mathcal{L}_N^{AKW}$ can be well approximated by the central chi square distribution with $(t - 1)$ DF (Sen, 1968b). Various asymptotic properties of such aligned rank tests have also been discussed in Puri and Sen (1971, ch. 7). The main advantage of such aligned rank tests over their intrablock rank test counterparts is their power superiority, particularly when the block and treatment effects are additive and $t$ is small. For example, if we compare the aligned rank-sum test and the Friedman rank-sum test, for normally distributed errors, the asymptotic relative efficiency of the aligned test with respect to the other is equal to $(t + 1)/t$, so that for small values of $t$ there is considerable gain in efficiency for the aligned one. A very similar picture holds for general scores rank tests. On the other hand, in terms of robustness to possible model departures (particularly to nonadditivity of block effects), the intrablock ranking method has a distinct advantage over the aligned one. Therefore, in actual practice, we may decide on the choice between the two ranking methods on the ground of their validity, robustness and efficiency considerations.
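As a quick numerical illustration of the efficiency ratio $(t + 1)/t$ quoted above for normally distributed errors (plain arithmetic, no further assumptions):

$$\frac{t+1}{t} = 1.50\ (t = 2), \quad 1.33\ (t = 3), \quad 1.25\ (t = 4), \quad 1.10\ (t = 10),$$

so the gain of the aligned test over the Friedman test is largest with two or three treatments per block and fades as $t$ grows.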

Both the intrablock ranking and ranking after alignment methods have been extended to more complex models, including the following:

(1) Nonorthogonal designs arising due to possibly unequal number of observations per cell,

(2) Incomplete block designs wherein not all treatments are applied in all blocks,

(3) Factorial designs comprising two or more treatments each at more than one level, and

(4) Multiresponse designs including the types (1), (2) and (3) mentioned above.

We refer to chapter 7 of Puri and Sen (1971) where a detailed treatise of these procedures has been made, along with the citation of the original references.

Aligned rank tests and derived estimates also crop up in general linear models where part of the regression parameters appearing in the linear model are treated as nuisance, while the other part is of genuine interest. For example, in a multifactor design, we may be principally interested in drawing statistical conclusions on the main effects of each factor as well as their two-factor interactions, treating all higher order interactions as nuisance. This problem has been attacked by adopting confounding or partial confounding tools that provide greater precision for the parameters of interest at the cost of no or reduced precision for the parameters that are totally or partially confounded by skillful designs. For such confounded designs too, intrablock ranking methods work out, and their aligned ranking counterparts also work well (Sen, 1970). We shall discuss some of these in Section 8.

There are other multivariate ranking methods which arise in connection with longitudinal data or growth curve models and repeated measurement designs. We shall discuss these in the next section.

7. Longitudinal data models

In a longitudinal study model, typically, we have a set of repeated measurements on the same unit or individual over differing conditions or periods of time, so that we have a genuine multivariate model. On top of that we have generally some covariates or auxiliary variates. Since the measurements on the same individual are comparable (and they are typically stochastically interdependent), it may be possible to reduce the dimension of the parameters in the regression model by suitable constraints on these parameters, and also to impose some restriction on the interdependence pattern of the response variates. Generally, such longitudinal data models are more akin to mixed-effects MANOCOVA models. Note that in a MANOVA model, we let

$$\mathbf{Y} = (\mathbf{Y}_1, \ldots, \mathbf{Y}_n) = \boldsymbol{\beta}\mathbf{X} + \mathbf{e}, \quad \mathbf{e} = (\mathbf{e}_1, \ldots, \mathbf{e}_n), \tag{7.1}$$

where the $\mathbf{Y}_j$ are independent stochastic $p$-vectors, $\boldsymbol{\beta}$ is a $p \times q$ matrix of unknown parameters, $\mathbf{X}$ is a $q \times n$ matrix of known regressors or explanatory variables, and the $\mathbf{e}_j$ are i.i.d. random vectors with a distribution $F$, defined on $R^p$. The growth curve models are the precursors of such repeated measurements models, and we refer to the article by Singer and Dalton (2000) in this Volume for a nice introduction to such models. Suppose that measurements (say, of weights) of an individual (say, a newborn baby) are taken at $p\ (\ge 2)$ time points $t_1 < t_2 < \cdots < t_p$, so that the columns of $\mathbf{X}$ relate to these time points along with other concomitant variates. When $p$ is not small, it may be reasonable to assume suitable polynomial functions for the time-response regression, so that we may set

$$\boldsymbol{\beta} = \mathbf{G}\boldsymbol{\Theta}, \tag{7.2}$$

where $\mathbf{G}$ is a known matrix of order $p \times r$, and $\boldsymbol{\Theta}$ is an unknown matrix of order $r \times q$; $r \le p$, and typically $r$ is small compared to $p$. Moreover, we can also conceive of a suitable dependence pattern among the $p$ error components of the $\mathbf{e}_i$. For example, in a mixed-model setup, we may assume a stochastic individual effect $e_{0i}$ and assume that

$$\mathbf{e}_i = e_{0i}\mathbf{1} + \mathbf{e}_i^{*}, \quad i = 1, \ldots, n, \tag{7.3}$$

where the $p$ elements of $\mathbf{e}_i^{*}$ are independent; this is the analogue of interchangeability in a conventional Gaussian error model, and in general, for possibly non-Gaussian errors, this is the conditional independence which is slightly more stringent than interchangeability of the error components. Likewise, we may use a Markov dependence model. In any case, for the usual multivariate model there are $p \times q$ unknown regression parameters ($\boldsymbol{\beta}$) and $p(p + 1)/2$ unknown parameters in the associated dispersion matrix, while for the above growth curve model there are $r \times q$ regression parameters and the number of unknown parameters in the associated dispersion matrix will also be typically less than $p(p + 1)/2$. This illustrates that there is (generally a considerable) reduction of the dimension of the parameter space, and hence, whenever this reduction is tenable in actual practice, we can draw statistical conclusions with relatively greater precision.
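For a rough sense of the size of this reduction, take the purely hypothetical values $p = 8$ time points, $q = 3$ regressors and a linear growth curve ($r = 2$). Then

$$pq = 24, \qquad \frac{p(p+1)}{2} = 36, \qquad rq = 6,$$

so the regression part shrinks from 24 to 6 parameters. If, in addition (an assumption not made in the text), $e_{0i}$ is independent of $\mathbf{e}_i^{*}$ in (7.3) and the elements of $\mathbf{e}_i^{*}$ share a common variance $\sigma^2$, the dispersion matrix takes the form $\sigma_0^2 \mathbf{1}\mathbf{1}' + \sigma^2 \mathbf{I}_p$ and involves only 2 unknown parameters instead of 36.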

There is a connection between the classical time-series models and longitudinal studies models. In a time-series model, apart from the general trend, there is also considerable interest in the study of seasonal patterns as well as dependence patterns. For the latter aspect, some specific schemes such as the moving average, autoregressive, and autoregressive moving average models are prescribed. Such dependence patterns are generally taken to be of stationary type, while plausible nonstationarity in the model is generally observed in the trend components. On the other hand, stationarity of cross-correlations etc. may not be generally tenable in a longitudinal data model. Moreover, in the longitudinal studies, the prevalence of concomitant variables at different time-points often makes it quite cumbersome to adopt (multivariate) time-series models, and the task of dimension reduction of the parameter space remains as one of the top priorities. For these reasons, these two branches of statistical modeling and analysis have not merged into a single unified one. Especially in bioenvironmental and public health studies, where longitudinal data relate to nonstandard setups, it is more convenient to use some alternative approaches that we describe below. As we shall see later on, in simpler models, such as the growth curve models, such prescriptions can have some optimality properties, but in more complex setups, optimality considerations are to be compromised with robustness and practical adaptability (i.e., validity) considerations. Semiparametrics and nonparametrics are particularly attractive from such considerations.

A simple way (though not necessarily optimal) to induce this reduction in the statistical modeling and analysis scheme is to work with the transformed data set (Potthoff and Roy, 1964):

$$\mathbf{Z} = (\mathbf{Z}_1, \ldots, \mathbf{Z}_n) = (\mathbf{G}'\mathbf{G})^{-1}\mathbf{G}'\mathbf{Y}, \tag{7.4}$$

where we have

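$$\mathbf{Z} = \boldsymbol{\Theta}\mathbf{X} + \mathbf{e}^{\circ}; \quad \mathbf{e}^{\circ} = (\mathbf{G}'\mathbf{G})^{-1}\mathbf{G}'\mathbf{e}. \tag{7.5}$$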

Note that (7.5) corresponds to the conventional multivariate model, so that standard parametrics and their nonparametric counterparts can be readily imported for drawing statistical conclusions on $\boldsymbol{\Theta}$. However, this reduction technique generally incurs some loss of efficiency due to two major reasons:

(i) Statistical information contained in the complementary part not included in Z is not tapped by this method, and

(ii) if there were some structure underlying the covariance matrix of $\mathbf{e}_i$, one could look for a similar structure in the covariance matrix of $\mathbf{e}_i^{\circ}$, and this too could have reduced the number of unknown nuisance parameters, and thereby increased the precision of statistical conclusions.
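The reduction in (7.4) can be sketched in a few lines of Python for the simplest case of a linear growth curve, where $\mathbf{G}$ has the two columns $(1, \ldots, 1)'$ and $(t_1, \ldots, t_p)'$, so that $r = 2$ and each $\mathbf{Z}_i$ is just the least squares intercept and slope of individual $i$. The function name `potthoff_roy_linear` and the numbers are hypothetical; this is an illustration under these assumptions, not a general implementation.

```python
def potthoff_roy_linear(times, Y):
    """Map each p-vector y in Y to Z_i = (G'G)^{-1} G'y for G = [1, t].

    times: the p observation times; Y: list of n response vectors of length p.
    Returns a list of (intercept, slope) pairs, one per individual.
    """
    p = len(times)
    s1 = sum(times)
    s2 = sum(t * t for t in times)
    det = p * s2 - s1 * s1  # determinant of G'G = [[p, s1], [s1, s2]]
    Z = []
    for y in Y:
        g0 = sum(y)                                # first entry of G'y
        g1 = sum(t * v for t, v in zip(times, y))  # second entry of G'y
        intercept = (s2 * g0 - s1 * g1) / det
        slope = (-s1 * g0 + p * g1) / det
        Z.append((intercept, slope))
    return Z


# Two hypothetical individuals observed at p = 4 time points.
print(potthoff_roy_linear([0, 1, 2, 3], [[2, 5, 8, 11], [1.2, 0.8, 1.1, 0.9]]))
```

The reduced vectors $\mathbf{Z}_i$ can then be passed to any conventional multivariate procedure for $\boldsymbol{\Theta}$, at the cost of the efficiency losses listed in (i) and (ii).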

To take into account the information contained in the complementary part, we may incorporate a MANOCOVA model where we choose a $p \times (p - r)$ matrix $\mathbf{H}$ of known constants, such that $\mathbf{G}'\mathbf{H} = \mathbf{0}$, and let

$$\mathbf{W} = (\mathbf{W}_1, \ldots, \mathbf{W}_n) = \mathbf{H}'\mathbf{Y}. \tag{7.6}$$

As such, we consider a $p \times n$ matrix partitioned into $\mathbf{Z}$ and $\mathbf{W}$, where by the orthogonality of $\mathbf{H}$ to $\mathbf{G}$, as adopted above, we have $\mathbf{W} = \mathbf{H}'\mathbf{e}$, which is free from $\boldsymbol{\beta}$. Therefore, $\mathbf{W}$ qualifies as a matrix of concomitant variates, and in addition, $\mathbf{e}^{\circ}$ and $\mathbf{W}$ are uncorrelated too. Therefore, under multinormality of the $\mathbf{e}_i$, $\mathbf{W}$ and $\mathbf{e}^{\circ}$ are mutually stochastically independent too. This enables us to incorporate the classical normal theory MANOCOVA model in the statistical analysis of such growth curve models, resulting in some utilization of the concomitant information and greater precision too. For nonnormal errors, recall that uncorrelation may not imply independence, and hence, some further adjustments are generally needed in order to incorporate a suitable MANOCOVA model. We shall discuss some of these in the next section. In the rest of this section, we consider an alternative procedure (Ghosh et al., 1973) that works out well in many longitudinal data models (though it might not be the most efficient way of handling such complex models). For further work along this line, we refer to Sen (1973) and Kozial et al. (1981).
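A small Python sketch may help to see why $\mathbf{W}$ in (7.6) is free from the growth curve parameters. It assumes $p = 4$ equally spaced time points $0, 1, 2, 3$ and a linear growth curve, so that the columns of $\mathbf{H}$ can be taken as the quadratic and cubic orthogonal polynomial contrasts; the function name `complement_contrasts` and the data are hypothetical.

```python
def complement_contrasts(y):
    """W_i = H'y for p = 4 equally spaced times and G = [1, t].

    The two columns of H are the quadratic and cubic contrasts; both are
    orthogonal to the constant vector and to the times (0, 1, 2, 3).
    """
    H = [(1, -1, -1, 1), (-1, 3, -3, 1)]
    return tuple(sum(h * v for h, v in zip(col, y)) for col in H)


y = [1.0, 4.0, 2.0, 7.0]
# Add an arbitrary linear trend 2 + 3t: the contrasts do not change.
y_trend = [v + 2 + 3 * t for t, v in enumerate(y)]
print(complement_contrasts(y), complement_contrasts(y_trend))
```

Any linear trend added to the responses is annihilated by $\mathbf{H}'$, which is the sense in which $\mathbf{W}$ carries only error information and can serve as a concomitant variate.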

By (7.1) and (7.2), denoting the columns of X as xi, i = 1 , . . . ,n, we have

$$\mathbf{Y}_i = \mathbf{G}\boldsymbol{\Theta}\mathbf{x}_i + \mathbf{e}_i, \quad i = 1, \ldots, n, \tag{7.7}$$

where we may even impose suitable structures on the covariance matrix of $\mathbf{e}_i$, and where $r$ is typically small compared to $p$. If the covariance matrix of $\mathbf{e}_i$ is known up to an unknown scalar constant (i.e., $\sigma^2\mathbf{V}$, where $\mathbf{V}$ is a known matrix while $\sigma^2$ is an unknown (positive) scalar constant), then we could use the classical weighted least squares estimation (WLSE) methodology to estimate $\boldsymbol{\Theta}$ based on $\mathbf{Y}_i$ alone; we denote this estimator by $\hat{\boldsymbol{\Theta}}_i$, for $i = 1, \ldots, n$. If the covariance matrix of $\mathbf{e}_i$ is of some other structural form (for example, first order Markov model), we could use a similar WLSE approach to estimate $\boldsymbol{\Theta}$ from $\mathbf{Y}_i$ alone. In the conventional case where the covariance matrix of $\mathbf{e}_i$ is arbitrary and unspecified, we could even use the unweighted LSE methodology to obtain such an estimator based on the individual $\mathbf{Y}_i$. Though such estimates may not be optimal in a conventional sense, they capture the interdependence of the elements of $\mathbf{e}_i$, which will be reflected in the dispersion matrix of the individual $\hat{\boldsymbol{\Theta}}_i$. Based on this methodology, we obtain from the original $\mathbf{Y}$ a set of $n$ independent $r \times q$ stochastic matrices $\hat{\boldsymbol{\Theta}}_i$, $i = 1, \ldots, n$. We have thus reduced the longitudinal data model into a conventional MANOVA model, and can then use some of the methodologies presented in the previous section. We rewrite $\hat{\boldsymbol{\Theta}}_i$ into an $rq$-column vector $\mathbf{Z}_i = \mathrm{vec}\,\hat{\boldsymbol{\Theta}}_i$, for $i = 1, \ldots, n$, and proceed as follows.

In the usual setup, we partition $\mathrm{vec}\,\boldsymbol{\Theta}$ as $(\boldsymbol{\theta}_0', \boldsymbol{\theta}_1', \boldsymbol{\theta}_2')'$, where the first component stands for an intercept parameter while the rest stand for regression type parameters, and they are of the order $r$, $rs_1$ and $rs_2$ respectively, and where $s_1, s_2$ are nonnegative integers such that $s_1 + s_2 = q - 1$. In this setup, we can consider some typical subhypothesis testing problems. For example, we may set

$$H_0: \boldsymbol{\theta}_1 = \mathbf{0} \quad \text{vs.} \quad H_1: \boldsymbol{\theta}_1 \ne \mathbf{0}, \tag{7.8}$$

treating both $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_2$ as nuisance parameters. Now, in a balanced design, we can treat the $\mathbf{Z}_i$ as i.i.d.r.v.'s, so that the multivariate rank procedures based on aligned ranks work out well. We shall treat these briefly in the next section. For possibly unbalanced designs, we need to assume that the $\mathbf{Z}_i$ are centered around $\mathrm{vec}\,\boldsymbol{\Theta}$ and their joint distribution is either diagonally symmetric or some other regularity conditions hold.

For longitudinal data models relating to continuous response variables, not only do aligned rank tests work out well, but other robust tests based, for example, on suitable M-statistics or L-statistics, or regression rank scores, also work out. The choice between such robust procedures may depend on some extraneous factors, and we shall discuss these features as well in the next section. For possibly polychotomous response variables (including the correlated binary responses as a special case), one can also work with the generalized linear models that have been discussed in earlier sections. However, faced with the multivariate response models, in such a GLM, we would have a vector of link functions that are not generally stochastically independent, so the simplicity of the GLM formulation may have to be given away, and a more complex quasi-likelihood type formulation would be preferred. But then the exact statistical analysis may have to be replaced by asymptotics, for which the underlying regularity assumptions should be appraised critically; otherwise lack of robustness may surface abruptly. A recent monograph by Diggle et al. (1996) has addressed some of these issues, and we also refer to the article by Singer and Dalton (2000) in this accompanying volume for some further details.

8. Robust statistical inference in general linear models

We have discussed earlier the importance of dosage and response metameters in transforming a dose-response regression into a linear one (or at least approximately so). However, as has already been pointed out, with such a transformed model there might not be any assurance that a specified parametric (such as the normal or logistic) model would be appropriate. This is the basic reason why nonparametric or semiparametric models are often judged more appropriate in such studies.

Basically, robust (point as well as confidence set) estimation of associated parameters and tests of significance of suitable hypotheses constitute the two major areas of statistical inference having useful impact in all biomedical, public health and environmental investigations. The rank estimators of location and regression parameters considered in earlier sections in different contexts are the precursors of such robust estimators. However, there are additional complications in general linear models that arise due to a larger number of associated parameters as well as possible lack of monotonicity properties of aligned rank statistics which generate the estimators. To illustrate this point, we consider the simple linear model:

$$Y_i = \boldsymbol{\beta}'\mathbf{x}_i + e_i, \quad i = 1, \ldots, n, \tag{8.1}$$

where the $e_i$ are i.i.d.r.v.'s with a continuous d.f. $F$, the $\mathbf{x}_i$ are $p$-vectors of known regression constants, while $\boldsymbol{\beta}$ is an unknown parameter (vector). We may as well include the intercept parameter in this set by letting $x_{i1} = 1$, $i = 1, \ldots, n$. In the case of a simple regression model involving an intercept and a single slope parameter, an aligned rank statistic based on a monotone score function is translation invariant and monotone in the regression parameter, and this provided the access to define the R-estimator of the slope by the alignment principle that makes the corresponding aligned statistic closest to the null median or mean. In a general linear model, we can define the aligned observations as

$$Y_i(\mathbf{b}) = Y_i - \mathbf{b}'\mathbf{x}_i, \quad i = 1, \ldots, n, \quad \mathbf{b} \in R^p. \tag{8.2}$$

If we consider a vector of aligned rank statistics

$$\mathbf{L}_n(\mathbf{b}) = \sum_{i=1}^{n} \mathbf{x}_i\, a_n(R_{ni}(\mathbf{b})), \quad \mathbf{b} \in R^p, \tag{8.3}$$

where the $a_n(k)$, $k = 1, \ldots, n$, are the (monotone) scores, and $R_{ni}(\mathbf{b})$ is the rank of $Y_i(\mathbf{b})$ among the $Y_r(\mathbf{b})$, $r = 1, \ldots, n$, for $i = 1, \ldots, n$, then in general, $\mathbf{L}_n(\mathbf{b})$ may not be monotone in each coordinate of $\mathbf{b}$ when the others are held fixed. Though under the null hypothesis of $\boldsymbol{\beta} = \mathbf{0}$, $\mathbf{L}_n(\mathbf{0})$ has null expectation, equating $\mathbf{L}_n(\mathbf{b})$ to $\mathbf{0}$ (with respect to $\mathbf{b}$) may not yield a unique solution or even a closed one. To eliminate this problem, Jaeckel (1972) considered a rank measure of dispersion, presented below, that provides a reasonable solution. Let

$$D_n(\mathbf{b}) = \sum_{i=1}^{n} (Y_i - \mathbf{b}'\mathbf{x}_i)\, a_n(R_{ni}(\mathbf{b})), \quad \mathbf{b} \in R^p. \tag{8.4}$$

Noting that the $a_n(R_{ni}(\mathbf{b}))$ are translation invariant, it is easy to show that $D_n(\mathbf{b})$ is translation invariant too. Therefore, it cannot be used for the estimation of the intercept parameter, though it is usable for the other parameters. Moreover (Jurečková and Sen, 1996, ch. 6), it is known that $D_n(\mathbf{b})$ is a nonnegative, continuous, piecewise linear and convex function of $\mathbf{b} \in R^p$. Further, the gradient of $D_n(\mathbf{b})$ at a point $\mathbf{b}$, whenever it exists, is equal to $-\mathbf{L}_n(\mathbf{b})$. Therefore, minimizing $D_n(\mathbf{b})$ with respect to $\mathbf{b}$, we obtain a suitable estimator of $\boldsymbol{\beta}$; this is equivalent to defining

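$$\mathbf{b}^{\circ}: \ \|\mathbf{L}_n(\mathbf{b}^{\circ})\| = \inf_{\mathbf{b} \in R^p} \|\mathbf{L}_n(\mathbf{b})\|, \tag{8.5}$$

and letting

$$\mathcal{D}_n = \text{set of all } \mathbf{b}^{\circ} \text{ that lead to the minimum}, \tag{8.6}$$

so that we may set

$$\hat{\boldsymbol{\beta}}_n = \text{center of gravity of } \mathcal{D}_n. \tag{8.7}$$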

In the particular case of a simple regression, the set of minimizing points in (8.6) reduces either to a single-point set or to an interval, so that (8.7) is defined unambiguously. Nevertheless, it may require a trial and error (iterative) solution to obtain the estimator in (8.7). Such R-estimators in linear models are globally robust, consistent, and under fairly general regularity conditions, they are asymptotically (multi-)normally distributed. For details, we refer to Jurečková and Sen (1996, ch. 6). Confidence set estimation of $\boldsymbol{\beta}$ can then be based on these asymptotic multinormality results, and these procedures are also considered there.
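The iterative minimization of Jaeckel's dispersion (8.4) can be sketched as follows for a simple regression with a single slope (the intercept is left out, since $D_n$ is translation invariant). The sketch assumes centered Wilcoxon scores $a_n(k) = k/(n+1) - 1/2$ and uses a ternary search, which is adequate here only because $D_n$ is convex in the slope; the function names, the search bracket and the simulated data are all hypothetical choices made for illustration.

```python
import random


def wilcoxon_scores(n):
    """Centered Wilcoxon scores a_n(k) = k/(n+1) - 1/2 (an assumed choice)."""
    return [k / (n + 1) - 0.5 for k in range(1, n + 1)]


def ranks(values):
    """Ranks 1..n of the values (ties assumed absent)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for k, i in enumerate(order, start=1):
        r[i] = k
    return r


def jaeckel_dispersion(b, x, y):
    """D_n(b) of (8.4) for a single regressor x and slope b."""
    resid = [yi - b * xi for xi, yi in zip(x, y)]
    a = wilcoxon_scores(len(y))
    return sum(e * a[r - 1] for e, r in zip(resid, ranks(resid)))


def r_estimate_slope(x, y, lo=-100.0, hi=100.0, iters=200):
    """Minimize the convex function D_n over [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if jaeckel_dispersion(m1, x, y) < jaeckel_dispersion(m2, x, y):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


# Simulated data with true slope 2 (hypothetical example).
random.seed(1)
x = [i / 10 for i in range(50)]
y = [1.0 + 2.0 * xi + random.gauss(0, 0.5) for xi in x]
print(r_estimate_slope(x, y))
```

With more than one slope parameter the same objective would have to be minimized by a multivariate convex optimizer; the one-dimensional search above is only meant to show the structure of the problem.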

Consider now a subhypothesis testing problem, where we rewrite (8.1) as

$$Y_i = \mathbf{x}_{i1}'\boldsymbol{\beta}_1 + \mathbf{x}_{i2}'\boldsymbol{\beta}_2 + e_i, \quad i = 1, \ldots, n, \tag{8.8}$$

where $\boldsymbol{\beta}_j$ is a $p_j$-vector (so are the $\mathbf{x}_{ij}$), for $j = 1, 2$, and $p_1 + p_2 = p$. Then we consider the hypotheses:

$$H_0: \boldsymbol{\beta}_2 = \mathbf{0} \quad \text{vs.} \quad H_1: \boldsymbol{\beta}_2 \ne \mathbf{0}, \tag{8.9}$$

treating $\boldsymbol{\beta}_1$ as a nuisance parameter. Note that under the null hypothesis, the $Y_i$ are independent but not necessarily identically distributed, and hence, EDF rank tests may not generally exist for such a testing problem. The alignment principle, discussed in detail in Sen and Puri (1977), can be incorporated to develop some aligned rank tests that are robust and asymptotically distribution-free.

Under $H_0$, we have the reduced model $Y_i = \mathbf{x}_{i1}'\boldsymbol{\beta}_1 + e_i$, $i = 1, \ldots, n$, and we use the R-estimation procedure, discussed in (8.2)-(8.7), to estimate $\boldsymbol{\beta}_1$. Denote this R-estimator by $\hat{\boldsymbol{\beta}}_{1n}$. Consider next the residuals (aligned observations) $\hat{Y}_i = Y_i - \mathbf{x}_{i1}'\hat{\boldsymbol{\beta}}_{1n}$, $i = 1, \ldots, n$. Let $\hat{R}_{ni}$ be the rank of $\hat{Y}_i$ among the $n$ residuals as defined above. Consider then the $p_2$-vector of aligned rank statistics

$$\hat{\mathbf{L}}_{n2} = \sum_{i=1}^{n} \mathbf{x}_{i2}\, a_n(\hat{R}_{ni}), \tag{8.10}$$

where the scores $a_n(k)$, $k = 1, \ldots, n$, are defined as before. We define $\mathbf{X}'\mathbf{X} = \mathbf{C}_n$ and partition it into $2 \times 2$ submatrices $\mathbf{C}_{nrs}$, of order $p_r \times p_s$, $r, s = 1, 2$. Let then

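$$\mathbf{C}_{n22:1} = \mathbf{C}_{n22} - \mathbf{C}_{n21}\mathbf{C}_{n11}^{-1}\mathbf{C}_{n12}, \quad A_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} \{a_n(i) - \bar{a}_n\}^2, \quad \bar{a}_n = n^{-1}\sum_{i=1}^{n} a_n(i). \tag{8.11}$$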

The aligned rank test statistic can then be posed as

$$\mathcal{L}_{n2} = A_n^{-2}\, \hat{\mathbf{L}}_{n2}'\, \mathbf{C}_{n22:1}^{-1}\, \hat{\mathbf{L}}_{n2}. \tag{8.12}$$

Under the null hypothesis, the statistic in (8.12) has closely the central chi square distribution with $p_2$ DF, and its (null as well as local non-null) distribution theory is based on some uniform asymptotic linearity results on rank statistics, which are exploited to a unified extent in Jurečková and Sen (1996, ch. 6). Such tests share the same asymptotic properties as the ones for the null hypothesis $\boldsymbol{\beta} = \mathbf{0}$ against local alternatives that $\boldsymbol{\beta} \ne \mathbf{0}$.
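The quantities in (8.10) to (8.12) can be illustrated in the simplest special case, assumed here purely for exposition: the nuisance part is an intercept only ($p_1 = 1$, $x_{i1} = 1$) and there is a single regressor of interest ($p_2 = 1$). Since ranks are translation invariant, no preliminary estimate of the intercept is needed in this case, and $\mathbf{C}_{n22:1}$ reduces to $\sum_i (x_i - \bar{x})^2$. The sketch uses Wilcoxon scores, centers them explicitly, and obtains the chi square (1 DF) tail probability from the normal distribution; the function name and data are hypothetical.

```python
import math


def aligned_rank_slope_test(x, y):
    """Aligned rank test of H0: slope = 0 with an intercept as nuisance.

    Special case p1 = 1 (intercept), p2 = 1 of (8.10)-(8.12), Wilcoxon scores.
    Returns (test statistic, asymptotic chi-square(1) p-value).
    """
    n = len(y)
    # Ranks of the observations (equal to ranks of intercept-aligned residuals).
    order = sorted(range(n), key=lambda i: y[i])
    rank = [0] * n
    for k, i in enumerate(order, start=1):
        rank[i] = k
    a = [k / (n + 1) - 0.5 for k in range(1, n + 1)]   # assumed scores a_n(k)
    a_bar = sum(a) / n
    # (8.10) with centered scores.
    L = sum(xi * (a[r - 1] - a_bar) for xi, r in zip(x, rank))
    # (8.11): score variance and C_{n22:1} = sum x^2 - (sum x)^2 / n.
    A2 = sum((ak - a_bar) ** 2 for ak in a) / (n - 1)
    C = sum(xi * xi for xi in x) - sum(x) ** 2 / n
    # (8.12).
    stat = L * L / (A2 * C)
    # P(chi-square with 1 DF > stat) = P(|Z| > sqrt(stat)) for standard normal Z.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


x = list(range(20))
y = [0.5 * v + 1 for v in x]   # perfectly monotone (hypothetical) response
print(aligned_rank_slope_test(x, y))
```

For general $p_1$ and $p_2$ the residual ranks would be based on an R-estimate of $\boldsymbol{\beta}_1$ and the quadratic form would require matrix inversion; the scalar case above only shows how the pieces fit together.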

From a somewhat local robustness perspective, M-estimators in linear models have been considered by a host of researchers, and these have also been used to construct suitable (aligned) M-tests for the subhypothesis testing problem treated above. First, we consider the M-estimation problem (parallel to the R-estimation case). For an absolutely continuous $\rho(t)$, $t \in R$, having a derivative $\psi(t)$ that is assumed to be monotone, we define an M-estimator by

$$\hat{\boldsymbol{\beta}}_n = \arg\min \left\{ \sum_{i=1}^{n} \rho(Y_i - \mathbf{x}_i'\mathbf{t}) : \mathbf{t} \in R^p \right\}. \tag{8.13}$$

Exploiting the absolute continuity of p(t), we can also write down the corre- sponding estimating equations as

$$\mathbf{M}_n(\mathbf{b}) = \sum_{i=1}^{n} \mathbf{x}_i\, \psi(Y_i - \mathbf{x}_i'\mathbf{b}) = \mathbf{0}. \tag{8.14}$$

(For modifications to eliminate possible arbitrariness of the solution, an adjustment similar to the R-estimation case may also be prescribed here.) Among the various possible choices of $\rho(t)$, or equivalently, the influence curve $\psi(t)$, the following are noteworthy:

(1) $L_1$-norm estimators, for which $\psi(t) = \mathrm{sign}(t) = 1, 0,$ or $-1$, according as $t$ is $>$, $=$, or $< 0$. Here $\rho(t) \equiv |t|$.

(2) Huber-score function, where $\psi(t) = t\,I(|t| \le \kappa) + \kappa\,\mathrm{sign}(t)\,I(|t| > \kappa)$, for some suitably chosen positive $\kappa\ (< \infty)$.

Note that the case of $\psi(t) \equiv t$ corresponds to the least squares estimator that is known to be nonrobust. The idea of flattening the score function at the two ends is to curb the influence of the heavy tails, and thereby to induce more robustness in the estimation procedure. In fact, from robustness considerations, generally a bounded influence curve is prescribed.
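The effect of flattening the score function can be seen in a minimal Python sketch that solves the estimating equation (8.14) for a single regressor without intercept. It relies on the fact that, for a nondecreasing $\psi$, $M_n(b)$ is nonincreasing in $b$ in this one-parameter case, so bisection applies. The tuning constant 1.345 is a conventional but here assumed value of $\kappa$, the residuals are not standardized by any scale estimate, and the function names and data are hypothetical.

```python
def psi_huber(t, k=1.345):
    """Huber score: identity on [-k, k], clipped to +/- k outside (k assumed)."""
    return t if abs(t) <= k else (k if t > 0 else -k)


def m_equation(b, x, y, psi):
    """M_n(b) of (8.14) for a single regressor x."""
    return sum(xi * psi(yi - xi * b) for xi, yi in zip(x, y))


def m_estimate_slope(x, y, psi, lo=-1e3, hi=1e3, iters=200):
    """Solve M_n(b) = 0 by bisection; M_n is nonincreasing in b for monotone psi."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if m_equation(mid, x, y, psi) > 0:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical data on the line y = 2x with one gross outlier in the last point.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 100]
print(m_estimate_slope(x, y, lambda t: t))   # least squares: pulled far from 2
print(m_estimate_slope(x, y, psi_huber))     # Huber: stays close to 2
```

In practice the residuals would usually be divided by a robust scale estimate before applying $\psi$; that refinement is omitted here to stay close to (8.14).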

As in the case of R-estimators, the M-estimators are also implicitly defined statistical functionals, and they are consistent, robust, regression equivariant, and asymptotically (multi-)normal. If we define

$$\sigma_\psi^2 = \int_R \psi^2(t)\, dF(t), \quad \gamma = \int_R \psi'(t)\, dF(t), \tag{8.15}$$

and assume that both $\sigma_\psi$ and $\gamma$ are positive finite constants, then under mild regularity conditions,

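$$\sqrt{n}\,(\hat{\boldsymbol{\beta}}_n - \boldsymbol{\beta}) \to_{\mathcal{D}} \mathcal{N}_p(\mathbf{0},\ \gamma^{-2}\sigma_\psi^2\, \mathbf{C}^{-1}), \tag{8.16}$$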

where we assume that $\lim_{n \to \infty} n^{-1}\mathbf{X}'\mathbf{X} = \mathbf{C}$ exists and is p.d. Note that $\sigma_\psi^2$ can be consistently estimated by

$$\hat{\sigma}_n^2 = n^{-1} \sum_{i=1}^{n} \psi^2(Y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_n), \tag{8.17}$$

and further, under mild regularity conditions (Jurečková and Sen, 1996, ch. 5), it can be concluded that

$$n^{-1/2}\,\hat{\sigma}_n^{-1}\,\mathbf{M}_n(\boldsymbol{\beta}) \to_{\mathcal{D}} \mathcal{N}_p(\mathbf{0}, \mathbf{C}). \tag{8.18}$$

The last convergence result provides an easy access to a robust (asymptotic) test for the null hypothesis $H_0: \boldsymbol{\beta} = \boldsymbol{\beta}_0$ (for some specified $\boldsymbol{\beta}_0$ that, without any loss of generality, we may take as $\mathbf{0}$) against the alternatives that $H_1: \boldsymbol{\beta} \ne \boldsymbol{\beta}_0$. We use the Wald type test statistic

$$\mathcal{L}_n = \hat{\sigma}_n^{-2}\,(\mathbf{M}_n(\mathbf{0}))'\,(\mathbf{X}'\mathbf{X})^{-1}\,(\mathbf{M}_n(\mathbf{0})), \tag{8.19}$$

which under the null hypothesis $\boldsymbol{\beta} = \mathbf{0}$ has closely the central chi square distribution with $p$ DF. The same quadratic form can also be used to derive a Scheffé type (simultaneous) confidence set for $\boldsymbol{\beta}$, defined as

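$$\mathcal{I}_{n,\alpha} = \{\mathbf{b} \in R^p : (\mathbf{M}_n(\mathbf{b}))'\,(\mathbf{X}'\mathbf{X})^{-1}\,(\mathbf{M}_n(\mathbf{b})) \le \chi^2_{p,1-\alpha}\,\hat{\sigma}_n^2\}, \tag{8.20}$$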

where $\chi^2_{p,1-\alpha}$ stands for the $(1 - \alpha)$-quantile of the chi square distribution with $p$ DF. Since $\mathbf{M}_n(\mathbf{b})$ is not necessarily linear in $\mathbf{b}$ (though it is asymptotically so), computationally it can be quite cumbersome to obtain this confidence set, and an iteration procedure is therefore recommended. Alternatively, the parameter $\gamma$, defined above, may also be estimated from the sample, and hence, the asymptotic multinormality of the M-estimators as stated above can also be used to provide an asymptotic confidence set in the same manner as the classical Scheffé method. In passing, we may remark that both the above methods of setting confidence sets for $\boldsymbol{\beta}$ are asymptotically equivalent, and moreover, they are also valid for parallel aligned rank statistics or related R-estimators, with an additional advantage that $A_n^2$, the variance of the scores $a_n(k)$, $k = 1, \ldots, n$, is a nonstochastic and known quantity which converges to a limit $A^2$ under very general regularity conditions. Therefore, we omit the details, and refer to Jurečková and Sen (1996, ch. 9) for a treatise of this type of robust confidence sets.

Let us proceed on to the subhypothesis testing problem based on M-statistics. We consider the formulation as in (8.8)-(8.9), and for the reduced model $Y_i = \mathbf{x}_{i1}'\boldsymbol{\beta}_1 + e_i$, $i = 1, \ldots, n$, we consider the M-estimator of $\boldsymbol{\beta}_1$ based on the score function $\psi(\cdot)$; this is denoted by $\tilde{\boldsymbol{\beta}}_{n1}$. We consider then the residuals

$$\tilde{Y}_i = Y_i - \mathbf{x}_{i1}'\tilde{\boldsymbol{\beta}}_{n1}, \quad i = 1, \ldots, n, \tag{8.21}$$

and incorporate them in the construction of the aligned M-statistic ($p_2$-vector) as

$$\tilde{\mathbf{M}}_{n2} = \sum_{i=1}^{n} \mathbf{x}_{i2}\, \psi(\tilde{Y}_i). \tag{8.22}$$

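We define $\mathbf{C}_{n22:1}$ as in (8.11), and $\hat{\sigma}_n^2$ as in (8.17), where we may replace the fitted values from the full model by those from the reduced model, $\mathbf{x}_{i1}'\tilde{\boldsymbol{\beta}}_{n1}$. Then an aligned M-test statistic for testing $H_0: \boldsymbol{\beta}_2 = \mathbf{0}$ against alternatives that it is nonnull (and treating $\boldsymbol{\beta}_1$ as a nuisance parameter) can be posed as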

$$\tilde{\mathcal{L}}_{n2} = \hat{\sigma}_n^{-2}\, \tilde{\mathbf{M}}_{n2}'\, \mathbf{C}_{n22:1}^{-1}\, \tilde{\mathbf{M}}_{n2}. \tag{8.23}$$

The test is asymptotically distribution-free, having closely the central chi square distribution with $p_2$ DF under the null hypothesis. Notice the similarity of the aligned rank and aligned M-tests; they are asymptotically power-equivalent when their score functions are conformable. This feature is discussed in detail in Chapters 7 and 10 of Jurečková and Sen (1996).

Though for the location-scale models, L-statistics (that are based on linear combinations of functions of sample order statistics) have been extensively studied in the literature, their counterparts for general linear models are formulated under somewhat less generality. However, in recent years there have been some notable developments on L-estimation theory and allied hypotheses testing problems, and here we shall mention briefly two important classes, namely, the Regression Quantiles (Koenker and Bassett, 1978), and the Regression Rank Scores estimators (Gutenbrunner and Jurečková, 1992). The latter ones are closely related to R-estimators while the former ones to M-estimators.

For the linear model in (8.1), Koenker and Bassett (1978) defined the $p$-regression quantile, for a $p: 0 < p < 1$, denoted by $\hat{\boldsymbol{\beta}}_n(p)$, as

$$\hat{\boldsymbol{\beta}}_n(p) = \arg\min \left\{ \sum_{i=1}^{n} \rho_p(Y_i - \mathbf{x}_i'\mathbf{b}) : \mathbf{b} \in R^p \right\}, \tag{8.24}$$

where

$$\rho_p(x) = |x|\,\{(1 - p)\,I(x < 0) + p\,I(x > 0)\}. \tag{8.25}$$
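The role of the function in (8.25) is easiest to see in the location model (all regressors equal to 1), where minimizing (8.24) yields a sample quantile. The following Python sketch, with hypothetical function names and data, evaluates the loss and searches for the minimizer among the observations themselves, which suffices in this special case because the objective is piecewise linear with kinks only at the data points.

```python
def check_loss(x, p):
    """The function rho_p(x) of (8.25); it equals 0 at x = 0."""
    return abs(x) * ((1 - p) if x < 0 else p)


def location_quantile(y, p):
    """Minimizer of sum_i rho_p(y_i - b) over b, searched among the data points."""
    return min(sorted(y), key=lambda b: sum(check_loss(v - b, p) for v in y))


y = [3, 1, 4, 1.5, 9, 2.6, 5]          # hypothetical sample
print(location_quantile(y, 0.5))        # the sample median
print(location_quantile(y, 0.25))       # a lower sample quartile
```

With genuine regressors the minimization is no longer over data points but over $\mathbf{b} \in R^p$, and it is usually carried out by linear programming.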

They showed that the solution in (8.24) can be characterized as the optimal solution of a linear program; this has also been explored in Section 4.7 of Jurečková and Sen (1996), along with a treatise of general asymptotic properties of the regression quantile estimators in linear models. It follows from their treatment that for a given $p: 0 < p < 1$, if we denote by $\xi_p$ the $p$-quantile of the error distribution $F$, by $f$ the density of $F$, and by $\mathbf{C}$ the matrix as in (8.16), then under the usual regularity conditions as needed for the asymptotic normality of sample quantiles in the conventional i.i.d. model,

$$\sqrt{n}\,(\hat{\boldsymbol{\beta}}_n(p) - \boldsymbol{\beta}(p)) \to_{\mathcal{D}} \mathcal{N}_p\left(\mathbf{0},\ p(1 - p)\, f^{-2}(\xi_p)\, \mathbf{C}^{-1}\right), \tag{8.26}$$

where $\boldsymbol{\beta}(p)$ is $\boldsymbol{\beta}$ with its intercept component shifted by $\xi_p$; this resembles the usual result for the i.i.d. case. Considering then a weight function (a signed measure on $(0, 1)$) $\nu(t)$, $t \in (0, 1)$, one may consider a general class of L-statistics that are based on such regression quantiles. This may be defined as

$$\mathbf{L}_n(\nu) = \int_0^1 \hat{\boldsymbol{\beta}}_n(p)\, d\nu(p). \tag{8.27}$$

Various choices of the weight function lead to particular estimators that share similar robustness and other asymptotic properties with the R-estimators and M-estimators. Like the R-estimators (but unlike the general M-estimators), such L-estimators are scale equivariant. It is possible to choose $\nu(t)$, $t \in (0, 1)$, either as a smooth (continuous and differentiable) function, or even as a step function that has only finitely many jumps on the unit interval. In general, we may take $\nu$ as a linear combination of an absolutely continuous component and a step function, and with that study the asymptotic (multi-)normality of $\sqrt{n}\,(\mathbf{L}_n(\nu) - \boldsymbol{\beta})$; for details, we again refer to Section 4.7 of Jurečková and Sen (1996).

Next, we discuss briefly the regression rank scores estimators in linear models that are closely related to the regression quantiles considered before. Let $u \in (0, 1)$, and $\hat{\boldsymbol{\beta}}_n(u)$ be the $u$-regression quantile as defined above. Then the vector of regression rank scores at $u \in (0, 1)$, denoted by $\hat{\mathbf{a}}_n(u) = (\hat{a}_{n1}(u), \ldots, \hat{a}_{nn}(u))'$, is defined as the optimal solution of the linear programming problem:

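$$\sum_{i=1}^{n} Y_i\, \hat{a}_{ni}(u) = \max, \quad \text{subject to} \quad \sum_{i=1}^{n} x_{ij}\, \hat{a}_{ni}(u) = (1 - u) \sum_{i=1}^{n} x_{ij}, \quad j = 1, \ldots, p; \quad \hat{a}_{ni}(u) \in [0, 1], \ \forall\ 1 \le i \le n, \ 0 < u < 1. \tag{8.28}$$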

This is dual to the optimal solution of linear programming for the $u$-regression quantile that can be put as:

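$$u \sum_{i=1}^{n} r_i^{+} + (1 - u) \sum_{i=1}^{n} r_i^{-} = \min, \quad \text{subject to} \quad \sum_{j=1}^{p} x_{ij}\beta_j + r_i^{+} - r_i^{-} = Y_i, \quad i = 1, \ldots, n; \quad \beta_j \in R, \ 1 \le j \le p; \quad r_i^{+} \ge 0, \ r_i^{-} \ge 0, \ \forall\ i, \tag{8.29}$$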

where $r_i^{+}$ and $r_i^{-}$ are respectively the positive and negative parts of the residual $Y_i - \mathbf{x}_i'\boldsymbol{\beta}$, $i = 1, \ldots, n$. Whereas the regression quantiles are generally useful in the context of estimation of $\boldsymbol{\beta}$, the regression rank scores can be used for both the estimation and hypothesis testing problems. With that in mind, we choose a score generating function $\varphi(u)$, $u \in (0, 1)$, let, without loss of generality, $\int_0^1 \varphi(u)\,du = 0$, and compute the scores

$$
\hat b_{ni} = -\int_0^1 \phi_n(u)\, d\hat a_{ni}(u), \quad i = 1, \ldots, n. \tag{8.30}
$$
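A minimal sketch of (8.29) and (8.30), again in the location submodel and assuming no ties. In that case the objective in (8.29) is the familiar check loss, whose minimiser is a sample quantile. Taking the Wilcoxon score function $\phi(u) = u - 1/2$ (our illustrative choice) and using the fact that the rank score $\hat a_{ni}(u)$ falls linearly from 1 to 0 on $[(R_i - 1)/n,\, R_i/n]$, where $R_i$ is the rank of $Y_i$, the integral in (8.30) evaluates to $(2R_i - 1)/(2n) - 1/2$. All function names below are hypothetical.

```python
def check_loss(y, b, u):
    """Objective of (8.29) at b: u * sum(r_i^+) + (1 - u) * sum(r_i^-)."""
    pos = sum(max(yi - b, 0.0) for yi in y)   # positive parts of residuals
    neg = sum(max(b - yi, 0.0) for yi in y)   # negative parts of residuals
    return u * pos + (1.0 - u) * neg


def location_quantile(y, u):
    """Minimise the check loss over the data points.

    The loss is convex and piecewise linear in b with kinks at the
    observations, so a minimiser can always be found among them.
    """
    return min(y, key=lambda b: check_loss(y, b, u))


def wilcoxon_rr_scores(y):
    """Scores (8.30) in the location model for phi(u) = u - 1/2 (no ties)."""
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])
    scores = [0.0] * n
    for r, i in enumerate(order, start=1):
        scores[i] = (2 * r - 1) / (2 * n) - 0.5
    return scores


y = [2.3, -0.4, 1.1, 5.0, 0.7]
median = location_quantile(y, 0.5)
b_scores = wilcoxon_rr_scores(y)
```

The scores sum to zero, in line with the centering $\int_0^1 \phi(u)\,du = 0$ imposed on the score generating function.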

We use the notation $\hat b_{ni}(Y - Xb)$ to denote the scores based on $Y - Xb$. Let then

$$
D_n(b) = \sum_{i=1}^{n} (Y_i - x_i' b)\, \hat b_{ni}(Y - Xb), \quad b \in \mathbb{R}^p. \tag{8.31}
$$


Then the regression rank estimator of $\beta$ is defined by

$$
\hat\beta_n = \arg\min\{D_n(b) : b \in \mathbb{R}^p\}. \tag{8.32}
$$

In the same way, working with the $Y_i - x_{i1}' b_1$, $i = 1, \ldots, n$, we can define the RR estimator of $\beta_1$. The technical advantage here is that the regression rank scores $\hat a_{ni}$ are regression invariant, and hence the computation of the $\hat b_{ni}$ remains the same. Once this is done, we can proceed precisely as in the case of subhypothesis testing based on the aligned rank statistics, where an RR score statistic is defined as $\sum_{i=1}^{n} c_{ni} \hat b_{ni}$, and the $c_{ni}$ depend on $X$. Certain general asymptotic equivalence results on aligned rank tests and RR scores tests based on the same $\phi(\cdot)$ are discussed in detail in Section 6.7 of Jurečková and Sen (1996), where additional results are also presented.

All these procedures have also been extended to multivariate linear models. Whereas in the classical multinormal models, parametric linear statistical inference procedures generally enjoy the affine invariance property, the same may not hold in the case of nonparametric or semiparametric procedures (as they need not be based on linear statistics). Of course, there are many practical situations where affine invariance is not so crucial, and hence, robust inference procedures, typically involving nonlinear statistics, remain strong contenders to the parametric ones.

9. Nonlinear regression analysis

In the context of indirect quantitative bioassays, some nonlinear (dose-response) regression models have been briefly discussed in Section 4. In a parametric setup, some simple nonlinear models (such as the Gompertz curve) sometimes crop up in a natural manner; often, they can be included in the general framework of generalized linear models, which have been treated earlier. Vonesh and Chinchilli (1997) give a fairly extensive treatment of nonlinear models in longitudinal data analysis; there is a dominant parametric flavour in their presentation. It may also be possible to use quasi-likelihood, profile likelihood, and pseudo-likelihood methods in such a parametric formulation of a nonlinear regression model. As has been emphasized before, in bioenvironmental and public health applications, a parametric nonlinear model may often not be very suitable from the robustness and scope-of-validity points of view. To illustrate this point, we refer to physiologically based pharmacokinetic (PBPK) models, where often the biological factors and their interactions can be modeled in a parametric setup, albeit generally in a nonlinear form. In the context of drug developmental studies, in order to have a better understanding of the complex relationship between dose, drug concentration and therapeutic response, pharmacokinetic/pharmacodynamic (PKPD) analyses are currently being advocated on generally scientifically acceptable grounds. In this setting, pharmacokinetics considers the absorption, distribution and elimination of a drug and its metabolites, whereas pharmacodynamics relates to the action of a drug on specific organs or the body. Based on some generally accepted biological relations, for a PKPD model, a parametric dose-response regression can be formulated, though typically it could be highly nonlinear and involve a large number of parameters. The prospect for suitable transformations that might reduce the model to a simpler, linear one may not be great either. In such a setup, though often some simple distributions are assumed to hold for the inherent variations (within and between individuals), in reality, the situation can be far away from the presumed one. As such, sometimes, Bayesian methods are prescribed with a view to minimizing the stringency of these assumed distributional patterns. But even in that way, we have a larger number of parameters (arising from the priors), which raises broad issues of appropriateness of the priors or hyper-priors that are incorporated (in an empirical or hierarchical Bayes method), and the nonparametric models to be discussed in the next section might offer better resolutions. Of course, in terms of sample size requirements, Bayesian methods may fare better than their nonparametric counterparts (if the assumed priors are appropriate).

10. Nonparametric regression analysis

For the models treated in earlier sections, often, it has been tacitly assumed that the dosage-response regression is linear. For nonstochastic regressors, this can be mostly justified by suitable transformations on the dose and response variables. The situation is more complex when the regressors are themselves stochastic. A classical example of this type is the so-called ANOCOVA model, where the treatment effects are taken as fixed, but there may be some concomitant variates that are stochastic in nature. If the joint distribution of the primary response variable and the covariables is multinormal, the conditional distribution, given the covariates, is univariate normal with a mean that is linear in the covariates, and a constant conditional variance that is smaller than its marginal variance when the multiple correlation on the covariates is positive; these provide all the justifications for linear statistical inference. The situation is different when this conditional distribution is not necessarily normal, resulting in either or both of possibly nonlinear regression on the covariates, and heteroscedasticity. Yet in actual biomedical and environmental studies, one is generally confronted with multiple concomitant variates, and rarely is there sufficient background information on the adequacy of standard linear models for random-effects or mixed-effects predictors. This naturally calls for a critical appraisal of alternative nonparametric and semiparametric models that are less sensitive to normality of the conditional distribution as well as to all the consequent regularity conditions that are assumed in standard parametric models. Following Sen (1996a,b), we may outline this scenario as follows.

Consider the classical ANOCOVA model for completely randomized (i.e., one-way) layouts. Let $Y$ and $Z$ stand respectively for the primary and concomitant variates, and let the $\tau_i$ stand for the nonstochastic treatment-effects. Then we have, given $Z_{ij} = z_{ij}$,

$$
Y_{ij} = \mu + \tau_i + \gamma' z_{ij} + e_{ij} \quad \text{for } j = 1, \ldots, n_i, \ i = 1, \ldots, k, \tag{10.1}
$$

where $\mu$ stands for the mean-effect, and in addition to the linearity of regression of $Y$ on $Z$, the other important assumptions are:

(i) The distribution of $Z_{ij}$ does not depend on $j$ ($= 1, \ldots, n_i$) and $i$ ($= 1, \ldots, k$). $Z$ is a bona fide concomitant variate if this holds.

(ii) Independence of the errors $e$ on the covariates $Z$.

(iii) $\sigma_e^2 = \operatorname{var}(e \mid Z) = \sigma^2(1 - R^2) \le \sigma^2$, where $\sigma^2$ is the marginal variance of $e$, and $R^2$ is the multiple correlation of $Y$ on $Z$. If $R^2 = 0$, there is no incentive in using an ANOCOVA model instead of a parallel ANOVA model, where the covariates are dropped from the picture.

Robust and nonparametric methods aim to emphasize mostly (i) but de-emphasize (ii) and (iii). As a first step toward this evolution, we define a regression function $m(z)$, and assume that (10.1) holds, i.e., $m(z)$ is linear in $z$, but possibly with a nonnormal distribution of $e$, given $Z = z$. In addition, we assume that the $Z_{ij}$ are i.i.d. r.v.'s, i.e., they are not affected by possible treatment differences. This leads us to the simplest semiparametric (homoscedastic) model:

$$
P\{Y_{ij} \le y \mid Z_{ij} = z\} = F_i(y \mid z) = F(y - \tau_i - \gamma' z), \quad i = 1, \ldots, k, \tag{10.2}
$$

where $F$ is a continuous d.f. which does not depend on $z$, and the other notations are adapted from (10.1); we term this Semiparametric Model I. It may be remarked that though conventionally we define $m(z) = E\{Y_{ij} - \tau_i \mid Z_{ij} = z\}$, it is not necessary to do so always; we may as well define this conditional measure of central tendency in terms of the conditional median or some other location-functional of the conditional distribution. In this sense, this model relaxes the normality of $F$ and introduces some flexibility in the definition of the regression function. On the other hand, the assumption that the form of $F$ is free from $z$ amounts to a broader interpretation of the homoscedasticity condition where the scatter of the distribution is not necessarily measured in terms of the associated standard deviation; in fact, the latter may not even exist for an arbitrary $F$.

Next, we extend this model to allow possible heteroscedasticity without assuming normality of $F$ and thereby bypassing undue emphasis on the associated standard deviation (that might not be a natural parameter of $F$ or may not even exist). Here we take

$$
F_i(y \mid z) = F\big((y - \tau_i - \gamma' z)/\sigma(z)\big), \quad i = 1, \ldots, k, \tag{10.3}
$$

where the form of $F$ does not depend on $z$, though the (positive) scale parameter $\sigma(z)$ may depend on $z$ in an arbitrary manner. We have thus a semiparametric heteroscedastic model that we term Semiparametric Model II.

We consider next another extension of Model I that is termed Semiparametric Model III, and is a semiparametric fixed-effects but nonparametric covariate-effects (homoscedastic) model. This is expressed as

$$
F_i(y \mid z) = F(y - \tau_i - \theta(z)), \quad i = 1, \ldots, k, \tag{10.4}
$$


where the form of $F$ is free from $z$, and $\theta(z)$ is a suitable regression function (on $z$); generally some smoothness conditions are imposed on this function, but not specifically as a linear or some other strict parametric form. Likewise, we may consider a similar extension of Model II wherein we introduce possible heteroscedasticity on $F$ as in Model II. In general, the dependence of the conditional d.f. on the covariate levels (held fixed) need not be only through a scale factor that may vary over the domain of the concomitant variates, and hence, we may also consider a more general model as follows.

$$
F_i(y \mid z) = F(y - \tau_i - \theta(z) \mid z), \quad i = 1, \ldots, k, \tag{10.5}
$$

where the form of the conditional d.f. $F(\cdot \mid z)$ may depend on $z$ in a more intricate manner. We term this Semiparametric Model IV.
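To make the differences between Models I, II and III concrete, the following Python sketch draws a single response under each of (10.2), (10.3) and (10.4). The specific choices are ours and purely illustrative: $F$ is taken as a Laplace d.f., $\sigma(z) = 1 + z^2$, $\theta(z) = \sin z$, and both the covariate and the regression coefficient `gamma` are scalars.

```python
import math
import random


def simulate_response(model, tau_i, z, rng, gamma=1.0):
    """Draw one Y given Z = z under Semiparametric Model I, II or III.

    Illustrative assumptions (not from the text): F is Laplace,
    sigma(z) = 1 + z**2 for Model II, theta(z) = sin(z) for Model III.
    """
    # Laplace(0, 1) error as the difference of two unit exponentials.
    e = rng.expovariate(1.0) - rng.expovariate(1.0)
    if model == "I":      # (10.2): linear covariate effect, homoscedastic
        return tau_i + gamma * z + e
    if model == "II":     # (10.3): linear covariate effect, scale depends on z
        return tau_i + gamma * z + (1.0 + z * z) * e
    if model == "III":    # (10.4): smooth nonlinear covariate effect
        return tau_i + math.sin(z) + e
    raise ValueError("model must be 'I', 'II' or 'III'")


y_draw = simulate_response("III", tau_i=0.5, z=1.2, rng=random.Random(2000))
```

In all three cases the treatment effect enters as a pure location shift `tau_i`; only the way the covariate enters the conditional distribution changes.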

A completely nonparametric model (termed Semiparametric Model V) may be written as

$$
F_i(y \mid z) = F(y - \tau_i(z) - \theta(z) \mid z), \quad i = 1, \ldots, k, \tag{10.6}
$$

where the form of $F(\cdot \mid z)$ may depend on $z$ in an involved way. In this vein, the semiparametric Model I is a semiparametric linear model, and Model III is a partially linear semiparametric additive model. Note further that the additivity of the model is vitiated for Models II, IV and V. In the semiparametric linear model, the classical nonparametric ANOCOVA analysis schemes discussed in Section 6 (for the paired comparisons designs) remain pertinent; we need to compute the test statistic based on the entire set of $Y$ and $Z$ variables and subtract the parallel statistic based on the $Z$ variables alone. They are also extendible to the allied regression rank scores procedures (Sen 1996b), as well as to M- and L-procedures based on suitable M-statistics and linear functions of order statistics. As these follow along the lines of our discussions in Section 8, we do not repeat them here. Rather, we discuss Model III along with related developments on partially additive models. Avoiding the theoretical developments sketched in Sen (1996a,b, 1998), we only present the basic methodological background here.

We present here the semiparametric MANOCOVA one-way layout model that includes Model III as a particular case. Let $Y_i$, $i = 1, \ldots, n$, be $n$ independent stochastic vectors, with which are associated the i.i.d. concomitant $q$-vectors $Z_i$, $i = 1, \ldots, n$, respectively. We assume that conditionally on $Z_i = z$,

$$
F_i(y \mid z) = P\{Y_i \le y \mid Z_i = z\} = F(y - \beta t_i - \theta(z)), \quad i = 1, \ldots, n, \tag{10.7}
$$

where the $t_i$ are known design $r$-vectors ($r \ge 1$), $\beta$ is an unknown parameter (matrix of order $p \times r$), $p \ge 1$, $\theta(z)$ is an arbitrary smooth (vector valued) function of $z$ (not necessarily linear), and the d.f. $F$, defined on $\mathbb{R}^p$, is continuous but unknown. We may need to impose suitable smoothness conditions on $F$ too. This model may also be characterized as a partial linear model, and it belongs to the class of semiparametric generalized additive models (GAM).


By assumption the $Z_i$ are i.i.d. r.v.'s, with a $q$-variate (unknown) d.f., say $G$, so on integration, we obtain that

$$
F_{0i}(y) = P\{Y_i \le y\} = F_0(y - \beta t_i), \quad i = 1, \ldots, n; \qquad F_0(y) = \int \cdots \int F(y \mid z)\, dG(z). \tag{10.8}
$$

As such, if we ignore the concomitant variates, we end up with a semiparametric MANOVA model for which various robust as well as nonparametric procedures are available in the literature (Puri and Sen, 1971). This enables us to estimate the regression matrix $\beta$ in a robust manner, and we can adopt some of the methods presented in earlier sections. Let us denote such an estimator by $\tilde\beta_n$. Note that in general such an estimator may not be fully efficient for the MANOCOVA model (as the concomitant variate effects are ignored in this formulation). Nevertheless, it allows a plausible way of estimating $\theta$ in a robust nonparametric way, and that in turn provides better estimates of $\beta$.

For the conventional case of i.i.d. random vectors $(Y_i, Z_i)$, nonparametric regression estimation of $Y$ on $Z$ has been considered by a host of researchers. Sen (1996a,b) considered a relatively more general situation wherein nuisance linear fixed-effects are incorporated. We proceed as in Sen (1998) and consider the residuals (aligned vectors):

$$
\tilde Y_i = Y_i - \tilde\beta_n t_i, \quad i = 1, \ldots, n. \tag{10.9}
$$

Note that these residuals are not independent any more; they might not be exchangeable or even marginally identically distributed. Nevertheless, by virtue of the $\sqrt{n}$-consistency property of the semiparametric MANOVA estimates of $\beta$ (Puri and Sen (1985), ch. 6), we claim that the perturbations of these residuals are $O_p(n^{-1/2})$. Hence, as long as the functional $\theta(z)$ is estimated with a rate of stochastic convergence of the order $n^{-a}$, for some $a$: $0 < a < 1/2$, these residuals serve the purpose well. Thus, effectively, we aspire for a slower rate of convergence for the nonparametric estimator of the nuisance functional $\theta(z)$, while achieving better results for the estimates of the finite-dimensional parameter $\beta$. Moreover, in the semiparametric MANOCOVA model, we may not have the affine-equivariance of the estimators of $\beta$ and particularly $\theta(z)$. As such, it is quite conceivable to estimate these parameters or functionals for each of the $p$ coordinates separately, and then to adjust for their stochastic dependence in a plausible way (Sen, 1998).

We incorporate the K-NN (nearest neighborhood) methodology for the estimation of $\theta$. The allied kernel method of smoothing may also be used and that yields parallel results; for simplicity of presentation we only consider the K-NN method. In either setup, we generally confine ourselves to a compact set $\mathcal{C} \subset \mathbb{R}^q$ and allow $z$ to be an inner point of this set. Thus, we intend to have a robust estimator of

$$
\theta(\mathcal{C}) = \{\theta(z) : z \in \mathcal{C}\}. \tag{10.10}
$$


Typically, $\theta(z)$ is a location measure for the conditional d.f. of $Y_i - \beta t_i$, given $Z_i = z$, and hence, we assume that the usual translation equivariance property holds here. We may therefore express $\theta(z)$ as a (vector valued) functional of the conditional d.f. $F(\cdot \mid z)$; we denote this by $\theta_F(z)$. Basically, we incorporate a nonparametric estimator ($F_n(\cdot \mid z)$) of this conditional d.f., and express the estimator $\hat\theta_n(z) = \theta_{F_n}(z)$ as the same functional of the empirical conditional d.f. Whenever $z$ is real valued, $\mathcal{C}$ reduces to a compact interval, while for vector valued $Z_i$, we choose a suitable metric $\rho(z, z_0)$, for every $z, z_0 \in \mathcal{C}$; this norm could be taken as the usual Euclidean, 'max' or some other quadratic norm.

For a chosen pivot $z_0 \in \mathcal{C}$ and a suitable norm $\rho(\cdot)$, we consider the nonnegative r.v.'s

$$
D_i^0 = \rho(Z_i, z_0), \quad i = 1, \ldots, n; \tag{10.11}
$$

we order these r.v.'s and denote the corresponding order statistics by $D^0_{n:1} < \cdots < D^0_{n:n}$, where the ties can be neglected with probability one whenever the d.f. $G$ (of $Z$) is nondegenerate and continuous (that we assume). We rewrite

$$
D^0_{n:i} = D^0_{S_i^0}, \quad i = 1, \ldots, n; \qquad S_n^0 = (S_1^0, \ldots, S_n^0), \tag{10.12}
$$

where the $S_i^0$ stand for the anti-ranks of the $D_i^0$ relative to the pivot $z_0$. We next consider a sequence $\{k_n\}$ of positive integers such that $k_n$ ($< n$) is nondecreasing in $n$ with

$$
\lim_{n \to \infty} k_n = \infty; \qquad \lim_{n \to \infty} n^{-1} k_n = 0. \tag{10.13}
$$

Typically, we choose some $a > 0$ and let $k_n = [a n^{4/(q+4)}]$. Consider then the subset of observations:

$$
(D^0_{S_i^0}, \tilde Y_{S_i^0}), \quad i = 1, \ldots, k_n. \tag{10.14}
$$
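The typical choice $k_n = [a n^{4/(q+4)}]$ can be tabulated directly. The short Python sketch below (the function name `knn_size` and the clamping to $1, \ldots, n-1$ are our own conventions) shows how the neighborhood size grows with $n$ while the fraction $k_n/n$ shrinks, as required by (10.13).

```python
import math


def knn_size(n, q, a=1.0):
    """k_n = [a * n**(4 / (q + 4))], the integer part, kept within 1..n-1."""
    k = math.floor(a * n ** (4.0 / (q + 4.0)))
    return max(1, min(k, n - 1))


# Neighbourhood sizes for a scalar covariate (q = 1) and for q = 4.
sizes = {(n, q): knn_size(n, q) for n in (1000, 10**6) for q in (1, 4)}
```

For larger $q$ the exponent $4/(q+4)$ is smaller, so fewer neighbors are used at a given $n$, which reflects the usual dimensionality cost of local smoothing.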

Having defined all these entities, we define the empirical conditional d.f. at $z_0$ as

$$
F_{n,k_n}(y \mid z_0) = k_n^{-1} \sum_{i=1}^{k_n} I(\tilde Y_{S_i^0} \le y), \quad y \in \mathbb{R}^p. \tag{10.15}
$$

This process can be repeated for a mesh $\mathcal{M}_n = \{z_0\}$ of pivots, dense on $\mathcal{C}$, and we may consider the estimator

$$
\hat\theta_n(z_0) = \theta_{F_{n,k_n}}(z_0), \quad z_0 \in \mathcal{M}_n. \tag{10.16}
$$

The number of grid-points in $\mathcal{M}_n$ can be made to increase (albeit slowly) with $n$, and moreover local smoothing may also be made on these estimators to obtain a smooth estimator of $\theta(\mathcal{C})$. Consistency and asymptotic properties of these estimators have been studied by Sen (1996a,b; 1998).
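The steps (10.11) to (10.16) can be sketched for the simplest case of a scalar response and a scalar covariate, taking the location functional to be the conditional median. This is only one admissible choice of functional, and the function name and toy data below are hypothetical.

```python
import statistics


def knn_conditional_median(z, y, z0, k):
    """K-NN estimate of theta(z0), with theta taken as the conditional median.

    z, y : equal-length lists of scalar covariates and (aligned) responses.
    The k observations whose covariates are closest to the pivot z0 form
    the empirical conditional d.f. of (10.15); its median is returned.
    """
    # Anti-ranks of the distances D_i = |z_i - z0|, as in (10.11)-(10.12).
    anti_ranks = sorted(range(len(z)), key=lambda i: abs(z[i] - z0))
    neighbours = [y[i] for i in anti_ranks[:k]]
    return statistics.median(neighbours)


z = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
y = [1.0, 1.2, 0.8, 5.0, 5.5, 4.5]
# Repeat over a small mesh of pivots, in the spirit of (10.16).
mesh_estimates = {z0: knn_conditional_median(z, y, z0, k=3) for z0 in (0.1, 1.0)}
```

Using a median rather than a mean keeps the local estimate insensitive to a few outlying responses within the neighborhood.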

Let us now illustrate how improved estimation of $\beta$ can be made with the aid of the K-NN methodology. Essentially, we consider a set of disjoint subsets of $\mathcal{C}$ that we denote by $\mathcal{C}_j$, $j = 1, \ldots, M$, where $M$ can be made to increase with $n$ (though slowly) and the diameter of each $\mathcal{C}_j$ can be made sufficiently small. Within each subset, by choosing a suitable pivot, we obtain a set of $\{Z_i\}$ that are related to the set of $i$ for which $Z_i \in \mathcal{C}_j$. We denote the number of such observations by $n_j$, so that $\sum_{j \le M} n_j = n$ and individually the $n_j$ are not small. Using these $n_j$ observations within the subset $\mathcal{C}_j$, again we proceed as in the MANOVA model (ignoring the concomitant variates) and estimate $\beta$; we denote these estimators by $\hat\beta_n^{(j)}$, $j = 1, \ldots, M$. We can either take their weighted average with weights proportional to the $n_j$ and arrive at the pooled estimator, or we could estimate the covariance matrix of each $\hat\beta_n^{(j)}$, and use the weighted least squares methodology to arrive at the pooled estimator (Sen, 1996a,b). Generally the second alternative is computationally more cumbersome, and in homoscedastic models, the first alternative works out well. In view of the fact that in a homoscedastic model, the difference of the unconditional dispersion matrix and the conditional one is p.s.d., this modification generally yields better estimators, especially when there is a strong regression (not necessarily linear) on the concomitant variate.
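The first pooling rule described above, a weighted average with weights proportional to the cell sizes $n_j$, is sketched below for a scalar parameter. The within-cell estimates and sizes are made-up numbers used only to show the arithmetic.

```python
def pooled_estimate(estimates, sizes):
    """Weighted average of within-cell estimates with weights n_j / n."""
    n = sum(sizes)
    return sum(nj * bj for bj, nj in zip(estimates, sizes)) / n


# Three covariate cells with hypothetical estimates and cell sizes.
beta_pooled = pooled_estimate([1.0, 2.0, 4.0], [10, 30, 60])
```

For a matrix-valued parameter the same rule applies element by element; the weighted least squares alternative would replace the weights $n_j/n$ by inverse estimated covariance matrices.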

The testing problem (for the fixed-effect parameter $\beta$) in a nonparametric setup (treating $\theta(z)$ as a nuisance (functional) parameter) is comparatively simpler, and the basic methodology presented in Puri and Sen (1985, ch. 8) can be adopted with minor modifications. In view of the presentation made in the preceding section, we briefly outline this as follows. The basic idea is that the concomitant random vectors $Z_i$ are i.i.d., with a distribution that does not depend on the design (nonstochastic) variates $t_i$. Therefore, even if the regression functional of the primary variates $Y_i$ on the concomitant variates is unknown and arbitrary, we can appeal to multivariate nonparametrics in a simpler setup. Note that we have $p$ characteristics for the primary variate and $q$ for the concomitant variates. Thus, we consider a $(p + q)$-dimensional setup with the dependent vectors $(Y_i', Z_i')'$, $i = 1, \ldots, n$, and with the design variates $t_i$, $i = 1, \ldots, n$. The null hypothesis of interest is either the overall one $H_0 : \beta = 0$, or some subhypothesis on $\beta$ that we can formulate as in Section 8. In this framework, instead of the $p \times r$ matrix $\beta$, we consider the $(p + q) \times r$ matrix $\beta^*$ obtained by augmenting $\beta$ with a $q \times r$ null matrix at the lower part (which reflects the independence of the concomitant variates from the design variates). Further, we denote the joint distribution of $(Y_i, Z_i)$ by $H_i(y, z)$, so that the $q$-variate marginal d.f. for $Z_i$ is $G(z)$, and the conditional d.f. for $Y_i$, given $Z_i = z$, is $F_i(y \mid z)$, as defined earlier. In this setup, as in Section 8, we consider a test for a hypothesis reformulated in terms of $\beta^*$ based on the entire $(p + q)$-dimensional observations; this test statistic is denoted by $\mathcal{L}_n^0$. Next, we ignore the primary variate, and based solely on the concomitant variates, we construct a parallel test statistic which is denoted by $\mathcal{L}_n^{(2)}$. Then a test statistic for the same hypothesis but treating the $Z_i$ as concomitant variates is taken as

$$
\mathcal{L}_n^{(1|2)} = \mathcal{L}_n^0 - \mathcal{L}_n^{(2)}. \tag{10.17}
$$

For example, if we want to test the overall null hypothesis $H_0 : \beta = 0$ against the set of alternatives that it is a nonnull matrix, $\mathcal{L}_n^0$ is the classical permutation test statistic based on all the $p + q$ variates, and under the null hypothesis, when $n$ is large, it has closely the central chi square distribution with $(p + q)r$ degrees of freedom. Similarly, $\mathcal{L}_n^{(2)}$ is the permutation test statistic based solely on the concomitant variates, and for large $n$, it has closely the central chi square distribution with $qr$ degrees of freedom. Further, as in Section 6, we can verify that

$$
\mathcal{L}_n^0 \ge \mathcal{L}_n^{(2)}, \quad \text{with probability one}. \tag{10.18}
$$

Therefore, $\mathcal{L}_n^{(1|2)}$ is nonnegative with probability one, and under the null hypothesis ($H_0$) it is permutationally (conditionally) distribution-free, and for large $n$, it has closely the central chi square distribution with $pr$ degrees of freedom. If we denote the MANOVA test statistic based solely on the $Y_i$ and $t_i$ by $\mathcal{L}_n^{(1)}$, then it is also permutationally distribution-free and for large $n$, under the null hypothesis, it has closely the central chi square distribution with $pr$ degrees of freedom. Moreover, for local alternatives, as in Sen (1984), we can show that the noncentrality parameter for $\mathcal{L}_n^{(1|2)}$ is larger than that of $\mathcal{L}_n^{(1)}$. This clearly exhibits the better performance characteristics of the MANOCOVA over the MANOVA procedures, even when the regression on the concomitant variates may not be linear. A very similar picture holds for the subhypothesis testing problem in this GAM setup (Sen, 1999d).

11. Generalized additive models

The generalized additive models (GAM) combine the flexibility of nonparametric models with that of generalized linear models at the cost of additivity of the different components. We refer to a GLM, treated in Section 4, and in addition, we conceive of some other auxiliary (stochastic or not) variates $z$ associated with the response variate $Y$ and the design variate $x$; if the $Z$ are stochastic, we work with a conditional model, given $Z = z$. As in (4.2)-(4.6), for the observable $Y_i$, we conceive of a suitable link function $g(\cdot)$, and write (conditional on $Z_i = z_i$, in the stochastic case)

$$
g(\mu_i) = x_i'\beta + \sum_{j=1}^{q} h_j(z_{ij}), \quad i = 1, \ldots, n, \tag{11.1}
$$

where $z_i = (z_{i1}, \ldots, z_{iq})'$, for some $q \ge 1$, and the $h_j(\cdot)$ are smooth functions of the respective argument. In this sense, partially linear models belong to the GAM family as well; these models are also sometimes referred to as semiparametric generalized linear models (SGLM). A special case of SGLM or GAM where some of the fixed-effects are linearly additive while the regression on the random-effects component is nonparametric, as treated in a general multivariate setup in Sen (1999a), has been discussed in the previous section. The general setup of GAM incorporates the link function in this formulation, and in that way, the conventional likelihood or even the quasi-likelihood approach may not always work out well. Generally, a penalized likelihood function, such as the following, is considered.

$$
l_n(\theta, \phi) = \sum_{i=1}^{n} \log c(Y_i, \phi) + \sum_{i=1}^{n} [Y_i \theta_i - b(\theta_i)]/a(\phi) - \sum_{j=1}^{q} \lambda_j \int_{C_j} \{h_j''(z_j)\}^2\, dz_j, \tag{11.2}
$$

where $g(b'(\theta_i)) = x_i'\beta + \sum_{j=1}^{q} h_j(z_{ij})$, the $C_j$ are suitable compact intervals, and the $\lambda_j$ are suitably chosen penalty coefficients. Note that in this formulation, the additive functions $h_j(\cdot)$ are assumed to have continuous (and often, bounded) second derivatives, and by virtue of the compactness of the $C_j$, effectively, we confine ourselves to a compact set $C$ in $\mathbb{R}^q$. Whenever the $z_i$ are nonstochastic, the choice of $C$ can be made by proper designing. For the stochastic case, such a compact set should be worked out with due experimental considerations. The choice of the $\lambda_j$ has also to be made on the basis of such constraints. We refer to Hastie and Tibshirani (1990) for a treatise of such GAM's and pertinent statistical analysis. We also refer to Green and Silverman (1994) for some related discussion with more emphasis on nonparametrics. In passing, we may remark that the penalized likelihood approach generally takes away the computational simplicity of the classical MLE for the family of exponential densities, and at the same time, may induce a slower rate of convergence of the penalized MLE; the regression parameter $\beta$ can often be estimated with the conventional $n^{1/2}$ rate of convergence, but the estimators of the additive nonparametric components (the $h_j(\cdot)$) typically have a slower rate. In this respect, the situation is quite comparable to the nonparametric regression problem, treated in the previous section. Sen (1999d) has emphasized the need for assessing robustness properties of statistical procedures based on GLAM in a more semiparametric setup. His assessments remain pertinent, even to a greater extent, in a nonparametric context. The concerns voiced in earlier sections become more pertinent when, in addition to the nonparametric components, the compatibility of the chosen link function with the assumed distributional laws becomes a far more important issue. In bioenvironmental and public health studies, as generally we have much less control over such distributions, and the sampling designs are not usually that standard, the GLAM may have somewhat limited scope in this setup. Rather, we should place more emphasis on nonparametrics to the extent permissible with cost and sample size restraints.
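To see what the penalty term in a penalized likelihood of this kind measures, the sketch below approximates $\lambda \int_C \{h''(z)\}^2\, dz$ numerically for a given function $h$ on a compact interval. It is a rough illustration under our own choices (central second differences, a Riemann sum over interior grid points, and the hypothetical name `roughness_penalty`), not a fitting algorithm.

```python
def roughness_penalty(h, lo, hi, lam=1.0, m=1000):
    """Approximate lam * integral over [lo, hi] of h''(z)**2 dz.

    h'' is replaced by central second differences on a grid of m cells,
    and the integral by a Riemann sum over the m - 1 interior points.
    """
    dz = (hi - lo) / m
    total = 0.0
    for j in range(1, m):
        zj = lo + j * dz
        second = (h(zj - dz) - 2.0 * h(zj) + h(zj + dz)) / dz ** 2
        total += second ** 2 * dz
    return lam * total


linear_pen = roughness_penalty(lambda z: 3.0 * z + 1.0, 0.0, 1.0)   # essentially 0
quadratic_pen = roughness_penalty(lambda z: z * z, 0.0, 1.0)        # close to 4
```

A linear component incurs essentially no penalty, so large values of $\lambda_j$ push the fitted $h_j$ toward linearity, whereas small values allow more wiggly fits.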

12. Clinical trials and survival analysis

During the past thirty years there has been a spectacular development of statistical methodology and analysis schemes for clinical trials and, in general, survival analysis models. Clinical trials pertain to the core of biomedical and bioenvironmental studies, and because of the 'failure-time' or 'follow-up time' responses that generally arise in such a context, one typically encounters nonnegative response variables (that could be binary, polychotomous, count-variable, interval censored or continuous), often with markedly skewed distributions. The presence of numerous concomitant variates adds more complexities in these models. Further, administrative protocols may also add further restraints. In general, clinical trials have persistent clinical-epidemiologic perspectives, confounded with medical ethics, cost-benefit prospects, and drug-marketing incentives. For all these reasons, statistical decision making prospects in clinical trials are to be judged from a much broader perspective; usually, a simple parametric model may not be very appealing, and semiparametric as well as nonparametric models are often preferred on the ground of flexibility, robustness and practical adaptability. Some of these models are discussed in detail in some other chapters in this volume, and hence, citing appropriate cross references, here only complementary discussions are presented.

In a parametric setup, the survival function $\bar F(x)$ ($= 1 - F(x)$) and its dual, the (integrated) hazard function $H(x)$, related by $\bar F(x) = \exp\{-H(x)\}$, can be used in an equivalent manner. Though such a relation holds (even for conditional models) in nonparametric or semiparametric setups, the process of estimating them from the sample may often lead to possibly different procedures. As has been discussed in Sen (1999b), in a clinical trial we might have a Phase I, II, III or IV plan, and statistical perspectives may vary considerably from one to the other. One particular aspect that needs a lot of attention for statistical resolutions is the formulation of an interim analysis scheme that fits well with the clinical perspectives and yet preserves the efficacy of statistical reasoning; multiple looks into accumulating data sets, with a provision for an early termination of the trial, constitute an important statistical aspect of this interim analysis. Moreover, large-scale clinical studies involving human subjects are cropping up in most developed countries, and these need special attention to statistical principles relating to planning (design of experiment), statistical modeling, statistical monitoring (data-quality management), and statistical analysis. From all these perspectives, there is a genuine need to examine the adequacy of parametrics, and to judge the appropriateness of nonparametrics. Censoring of various types is also a common feature in clinical trials.
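The remark that estimating the survival function and the integrated hazard may lead to different procedures can be illustrated with two standard nonparametric estimators, which are our choice of example and are not singled out in the text: the Kaplan-Meier product-limit estimator of $\bar F$ and the Nelson-Aalen estimator of $H$. The Python sketch below assumes distinct observation times and uses made-up data.

```python
import math


def km_and_na(times, events):
    """Kaplan-Meier survival and exp(-Nelson-Aalen hazard) at each failure time.

    times  : observed follow-up times (assumed distinct, for simplicity)
    events : 1 if the time is a failure, 0 if it is right-censored
    Returns a list of (time, km_survival, exp_minus_na) triples.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    km, na, out = 1.0, 0.0, []
    for t, d in data:
        if d == 1:
            km *= 1.0 - 1.0 / at_risk   # product-limit step
            na += 1.0 / at_risk         # cumulative hazard increment
            out.append((t, km, math.exp(-na)))
        at_risk -= 1
    return out


steps = km_and_na([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
```

Since $1 - x \le e^{-x}$, the transform $\exp\{-\hat H\}$ is never below the Kaplan-Meier estimate, so the two routes give close but not identical answers in finite samples.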

Early statistical developments in interim analysis include the so called repeated significance tests (RST) (Armitage et al., 1969), progressively censoring schemes (PCS) (Chatterjee and Sen, 1973) and group sequential tests (GST) (Pocock, 1977; O'Brien and Fleming, 1979), among others. These procedures have not been dealt with in detail in other chapters, and hence, we shall discuss them here. In addition to these developments, considerable amount of work has been accomplished in the area of partial likelihood, proportional hazards and multiplicative intensity models based on counting processes (Cox, 1972, 1975; Andersen et al., 1993, and others). As some of these latter topics are discussed in greater detail in some other chapters, we will touch on these only briefly. Further, while discussing RST and GST, we shall mainly confine ourselves to clinical trial setups, and in the same vein, we shall contrast them with the PCS based tests.

A broad review of RST covering both the frequency and time domains is due to Sen (1991). Armitage et al. (1969) considered RST mainly in the frequency domain, where it has been tacitly assumed that the observations, though possibly arriving sequentially over time, can be recorded instantaneously (as is the situation with the classical Wald (1947) sequential probability ratio test). This way the independence of the subsets of observations at different stages can be presumed; in addition, they consider specifically the case of normal and binomial populations. A very similar assumption underlies the GST methodology. Such a basic regularity assumption (of independent, and often homogeneous, increments) may not be very reasonable in most clinical trials, wherein a time-domain setup is encountered; the primary response variate relates to an event (usually termed the failure time) that typically occurs sequentially over time and involves a follow-up scheme. Moreover, exact normality of distribution, as typically assumed in RST/GST, may not be tenable in most of the cases arising in clinical trials. Nevertheless, with some martingale characterizations (see, for example, Chatterjee and Sen, 1973, for rank statistics, and Andersen et al., 1993, for counting processes), the frequency-domain results provide good approximations for the time-domain results as well.

In a follow-up study, arising typically in a clinical trial or a life testing problem, the failures (responses) occur sequentially over time. Further, because of cost, time and other limitations, often the study cannot be conducted until all the failures have occurred. Typically, the study is conducted for a fixed period of time (Type I censoring or right-truncation), or until a prespecified number or proportion of failures has been observed (Type II censoring); in either scheme, responses not occurring during the tenure of the study constitute the censored observations. Note that in Type I censoring, the duration is prefixed, but the number of failures occurring in that period is random, while in Type II censoring, the duration is random but the number of responses is prefixed. In many cases, particularly in exploratory studies, if the response pattern is not known to a certain extent, a single-point censoring, be it Type I or II, may lead to considerable loss of statistical information, and increase the risk of making incorrect decisions. Also, for other medical reasons, an early termination of the study is often advocated when the accumulating evidence points to the untenability of the null hypothesis. For these reasons, statistical monitoring is often advocated: this is termed a progressively censoring scheme (Chatterjee and Sen, 1973). The PCS can either be adopted in a continuous monitoring setup, or a discretized version can be used for interim analysis, which is a common practice in clinical trials. For example, if a clinical trial is planned for a three-year duration, one may look into the accumulating trial dataset every 3 months (quarters), thus allowing at most 12 statistical looks, or every month, resulting in a maximum of 36 statistical looks.
The idea is simple: if there are serious side effects or toxicity, or if the treatment group performs better than the placebo group, then on grounds of medical ethics the trial should be stopped as soon as possible, and all the surviving patients should be put under the treatment group for better health prospects. On the other hand, if no significant difference is perceived during the progressive monitoring scheme, then there is no harm in letting the study continue up to the prefixed endpoint, and drawing useful statistical conclusions from the acquired trial data. From a statistical viewpoint, there are certain basic constraints that we need to take into account in interim analysis or, in general, in time-sequential inference.
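The contrast between the two single-point censoring plans (fixed duration with a random number of failures, versus a fixed number of failures with a random duration) can be made concrete with a small sketch. The function names and the toy lifetimes are hypothetical and serve only as an illustration.

```python
def type1_censor(lifetimes, T):
    """Type I plan: the duration T is prefixed, the number of failures seen is random."""
    observed = sorted(x for x in lifetimes if x <= T)
    return observed, T                      # (failure times seen, study duration)

def type2_censor(lifetimes, r):
    """Type II plan: the number of failures r is prefixed, the duration is random."""
    ordered = sorted(lifetimes)
    return ordered[:r], ordered[r - 1]      # the study ends at the r-th failure

def n_censored(lifetimes, observed):
    """Units still surviving when the study ends are the censored observations."""
    return len(lifetimes) - len(observed)
```

With lifetimes `[0.9, 0.2, 1.7, 0.4, 3.1]`, a Type I plan with `T = 1.0` observes three failures and censors two units, while a Type II plan with `r = 2` stops at time `0.4` and censors three units.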

First, we have an RST scheme here, where testing is done either on a continuous time scale or regularly over the study period. Therefore, if we adopt the conventional level of significance (say $\alpha$, $0 < \alpha < 1$) at each time point, the overall significance level of this RST scheme might be considerably higher; this feature is analogous to the usual RST as prescribed by Armitage et al. (1969) and others. Secondly, in view of the nature of the (ordered) failures and other monitoring features, it is quite likely that for suitable stochastic processes related to the accumulating dataset, the classical assumption of independent and homogeneous increments may not be tenable either. This is particularly noticeable if the failure distribution is not exponential, or if there are a number of concomitant variates and the regression on them is not that simple. Thirdly, the clinical design might have a bearing on the desired statistical modeling and analysis schemes. For example, the situation might be different for Type I and II censoring. Fourthly, random censoring, as will be defined later on, is often used in the statistical research literature to prescribe reasonably handy methodology for such incomplete data models. There are certain basic regularity assumptions that are to be made in this context, and in actual practice, these may not universally hold. Finally, early termination considerations may call for a more complex statistical analysis scheme which pays adequate attention to the multiple looking provision in a sound statistical manner. Armitage (1991) and Jennison and Turnbull (1990) have addressed some of the basic aspects of interim analysis in clinical trials.
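The first constraint, the inflation of the overall significance level under repeated testing at a fixed nominal level, can be checked with a small Monte Carlo sketch. It deliberately uses the idealized frequency-domain setup (independent, homogeneous standard normal increments, equally spaced looks), which the discussion here regards as questionable for clinical trials; the function name, the number of replications and the seed are arbitrary choices for this illustration.

```python
import math
import random

def overall_level(K, z_crit=1.96, reps=20000, seed=1):
    """Estimate P(reject at some look) when a two-sided test at the nominal
    level implied by z_crit is applied at each of K equally spaced looks."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        s = 0.0
        for k in range(1, K + 1):
            s += rng.gauss(0.0, 1.0)              # one new unit of information
            if abs(s) / math.sqrt(k) > z_crit:    # standardized cumulative sum
                rejections += 1
                break                             # early termination
    return rejections / reps
```

Calling `overall_level(1)` returns a value near the nominal level, whereas `overall_level(5)` or `overall_level(12)` returns a markedly larger value, which is the reason the critical levels have to be adjusted for the multiple looks.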

Cox (1972) came up with a clever idea of incorporating the partial likelihood principle in conjunction with a proportional hazard model (PHM) that provides reasonably handy statistical analysis schemes as long as the PHM is tenable for the specific application. The PHM in a more general mold (covering some multiplicative intensity counting processes) is discussed in some other chapters of this volume, and hence, we shall avoid the duplication here. However, in passing, we may note that the PHM assumption may not be universally tenable (Sen, 1994). Therefore, we would like to explore here alternative procedures that are based on suitable progressively censored rank statistics (Chatterjee and Sen, 1973; Sen, 1981, 1984, and others).

In survival analysis, the response variable is typically nonnegative and has a skewed distribution. It is customary to work with log-responses, which not only induces more symmetry in the response distribution but also extends the range to the real line $\mathbb{R}$. Moreover, if we work with rank statistics, we may note that ranks are invariant with respect to any strictly monotone increasing transformation (and hence, under the logarithmic transformation as well). For this reason, we may conceive of the following model (and later on we will incorporate censoring in our treatment of rank statistics). Let $X_1, \ldots, X_n$ be $n$ independent random variables with continuous distribution functions $F_1, \ldots, F_n$, respectively, all defined on the real line $\mathbb{R}$. Associated with the $X_i$ there are some concomitant variates that we denote by $\mathbf{c}_i$, $i = 1, \ldots, n$. The null hypothesis of interest is $H_0 : F_1 \equiv F_2 \equiv \cdots \equiv F_n = F$ (unknown) against alternatives that the $F_i$ depend on the $\mathbf{c}_i$ in a convenient regression model. In view of the aforementioned invariance property of the ranks, we may as well conceive of the following simple regression model:

$$F_i(x) = F(x - \boldsymbol{\beta}'\mathbf{c}_i), \quad x \in \mathbb{R}, \ i = 1, \ldots, n, \tag{12.1}$$

where $F$ is unknown (continuous) and the regression parameter (vector) $\boldsymbol{\beta}$ is also unknown. Under this regression model, we have a semiparametric setup, and the null hypothesis reduces to $H_0 : \boldsymbol{\beta} = \mathbf{0}$, and we are interested in the set of alternatives that $\boldsymbol{\beta} \neq \mathbf{0}$. Now, in a clinical trial setup, the $X_i$ are not observable at the start of the study; rather, they become observable sequentially over time. In order to depict this flow of events, let us denote by $Z_{n:1} < Z_{n:2} < \cdots < Z_{n:n}$ the ordered r.v.'s corresponding to the $X_i$, $i = 1, \ldots, n$ (ties neglected with probability one, by virtue of the assumed continuity of the $F_i$). Let $R_{ni}$ be the rank of $X_i$ among the $n$ observations $X_1, \ldots, X_n$, for $i = 1, \ldots, n$; these are the numbers $1, \ldots, n$ permuted in some (random) order. Then we have by definition

$$X_i = Z_{n:R_{ni}} \quad \text{and} \quad Z_{n:i} = X_{S_{ni}}, \quad i = 1, \ldots, n, \tag{12.2}$$

where the $S_{ni}$ are called the anti-ranks, and where $R_{nS_{ni}} = i = S_{nR_{ni}}$, $i = 1, \ldots, n$.
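The duality between ranks and anti-ranks in (12.2) is easy to verify numerically. The following minimal sketch (function name and data hypothetical, indices reported 1-based to match the notation) computes both from a sample without ties.

```python
def ranks_and_antiranks(x):
    """Return (R, S): R[i-1] is the rank of X_i, and S[k-1] is the index of
    the observation that produced the k-th smallest value."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])   # 0-based indices, increasing values
    S = [i + 1 for i in order]                     # anti-ranks S_n1, ..., S_nn
    R = [0] * n
    for pos, i in enumerate(order, start=1):
        R[i] = pos                                 # rank of X_{i+1}
    return R, S
```

For `x = [2.3, 0.7, 5.1, 1.9]` this gives ranks `[3, 1, 4, 2]` and anti-ranks `[2, 4, 1, 3]`, and composing the two in either order returns the identity, as stated after (12.2). In a follow-up study it is the anti-ranks that become available one at a time, as the failures occur.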

Also, for every $n \,(\ge 1)$, we define a set of scores by $a_n(1), \ldots, a_n(n)$; we take these scores to be monotone, and without any loss of generality, set

$$\bar a_n = n^{-1}\sum_{i=1}^n a_n(i) = 0; \qquad A_n^2 = (n-1)^{-1}\sum_{i=1}^n [a_n(i) - \bar a_n]^2 = 1. \tag{12.3}$$

Further, we let $\bar{\mathbf{c}}_n = n^{-1}\sum_{i=1}^n \mathbf{c}_i$ and

$$\mathbf{C}_n = \sum_{i=1}^n (\mathbf{c}_i - \bar{\mathbf{c}}_n)(\mathbf{c}_i - \bar{\mathbf{c}}_n)'. \tag{12.4}$$

We assume that $\mathbf{C}_n$ is positive definite, and further that the generalized Noether condition

$$\max_{1 \le i \le n} (\mathbf{c}_i - \bar{\mathbf{c}}_n)'\,\mathbf{C}_n^{-1}(\mathbf{c}_i - \bar{\mathbf{c}}_n) \to 0 \quad \text{as } n \to \infty \tag{12.5}$$

is satisfied. Note that if the $X_i$ were all observable, we could have considered the vector of linear rank statistics

$$\mathbf{L}_n = \sum_{i=1}^n (\mathbf{c}_i - \bar{\mathbf{c}}_n)\,a_n(R_{ni}). \tag{12.6}$$

We now intend to modify $\mathbf{L}_n$ in the light of the observable r.v.'s in our contemplated setup. As a first step, we rewrite $\mathbf{L}_n$ as

$$\mathbf{L}_n = \sum_{i=1}^n (\mathbf{c}_{S_{ni}} - \bar{\mathbf{c}}_n)\,a_n(i). \tag{12.7}$$
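The equivalence of the rank form (12.6) and the anti-rank form (12.7) can be checked numerically. The sketch below is only an illustration for scalar concomitant variates; it assumes Wilcoxon-type scores (proportional to $i$), a choice not made in the text, and normalizes them as in (12.3). All function names are hypothetical.

```python
import math

def normalized_wilcoxon_scores(n):
    """Scores proportional to i, centred and scaled so that the mean is 0
    and the (n - 1)-divisor variance is 1, as required in (12.3)."""
    raw = [float(i) for i in range(1, n + 1)]
    mean = sum(raw) / n
    var = sum((a - mean) ** 2 for a in raw) / (n - 1)
    return [(a - mean) / math.sqrt(var) for a in raw]

def linear_rank_statistic(x, c):
    """Return L_n computed from the ranks (12.6) and from the anti-ranks (12.7)."""
    n = len(x)
    a = normalized_wilcoxon_scores(n)
    cbar = sum(c) / n
    order = sorted(range(n), key=lambda i: x[i])   # order[k] = index of (k+1)-th smallest
    rank = [0] * n
    for pos, i in enumerate(order):
        rank[i] = pos                              # 0-based rank of x[i]
    by_ranks = sum((c[i] - cbar) * a[rank[i]] for i in range(n))          # (12.6)
    by_antiranks = sum((c[order[k]] - cbar) * a[k] for k in range(n))     # (12.7)
    return by_ranks, by_antiranks
```

The two sums contain the same terms in a different order. The anti-rank form is the useful one in a follow-up study, because its first $k$ terms are known as soon as the first $k$ failures have been observed.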

Next we note that at a time point $t$: $Z_{n:k} \le t < Z_{n:k+1}$ (for some $k = 0, 1, \ldots, n$) the observable random elements are

$$(Z_{n:1}, S_{n1}), \ldots, (Z_{n:k}, S_{nk}) = \mathcal{B}_{n,k}, \ \text{say}, \tag{12.8}$$

along with the additional censoring information that the remaining $n - k$ surviving units have survival times greater than $t$; here we let $Z_{n:0} = 0$, $Z_{n:n+1} = \infty$. If we consider a Type I censoring plan with a truncation point $T \,(< \infty)$, we let $r(T) = \max\{k : Z_{n:k} \le T\}$, so that the observable data set corresponds to $(\mathcal{B}_{n,r(T)}, r(T))$, where $r(T)$ is a nonnegative integer-valued r.v., bounded from above by $n$ with probability one. For a Type II censoring scheme, corresponding to a prefixed integer $r \,(< n)$, the study is curtailed at the failure point $Z_{n:r}$, so that the dataset corresponds to $(\mathcal{B}_{n,r}, Z_{n:r})$. In view of these formulations, we now define

$$\mathbf{L}_{nk} = E_0\{\mathbf{L}_n \mid \mathcal{B}_{n,k}\}, \quad k = 0, 1, \ldots, n, \tag{12.9}$$

where $E_0$ denotes the (conditional) expectation under the null hypothesis that all the $X_i$ are i.i.d. r.v.'s. Two implications of this definition are the following:

(i) $\{(\mathbf{L}_{nk}, \mathcal{B}_{n,k}),\ k = 0, 1, \ldots, n\}$ is a martingale (array) (for every $n \ge 1$), and (ii) for every $k$: $0 \le k \le n$,

$$\mathbf{L}_{nk} = \sum_{i=1}^k (\mathbf{c}_{S_{ni}} - \bar{\mathbf{c}}_n)\,[a_n(i) - a_n^*(k)], \tag{12.10}$$

where

$$a_n^*(k) = \begin{cases} (n-k)^{-1}\sum_{i=k+1}^n a_n(i), & 0 \le k \le n-1; \\ 0, & k = n. \end{cases} \tag{12.11}$$

Note that $a_n^*(0) = \bar a_n = n^{-1}\sum_{j=1}^n a_n(j) = 0$. Also, we define $\mathbf{C}_n$ as in (12.4), and let

$$A_{n,k}^2 = (n-1)^{-1}\Big\{\sum_{i=1}^k a_n^2(i) + (n-k)[a_n^*(k)]^2 - n\bar a_n^2\Big\}, \quad k < n; \qquad A_{n,n}^2 = A_n^2, \tag{12.12}$$

and note that by virtue of the martingale property, $0 \le A_{n,1}^2 \le A_{n,2}^2 \le \cdots \le A_{n,n}^2 = A_n^2 = 1$, for every $n \ge 1$. For scalar $c_i$, we may consider a test statistic based on $\mathcal{B}_{n,k}$ simply as $\mathcal{T}_{nk} = A_{n,k}^{-1} C_n^{-1/2} L_{nk}$, while for the general ($q$-)vector case, we take

$$\mathcal{L}_{nk} = A_{n,k}^{-2}\{\mathbf{L}_{nk}'\,\mathbf{C}_n^{-1}\,\mathbf{L}_{nk}\}. \tag{12.13}$$
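A direct transcription of (12.10), (12.11) and (12.12) for scalar concomitant variates may help in reading the formulas. This is a hedged sketch, not an optimized implementation: the function names are hypothetical, and the Wilcoxon-type scores are an assumed choice used only to produce scores that satisfy the normalization (12.3).

```python
import math

def normalized_scores(n):
    """Assumed score choice: proportional to i, normalized as in (12.3)."""
    mean = (n + 1) / 2.0
    var = sum((i - mean) ** 2 for i in range(1, n + 1)) / (n - 1)
    return [(i - mean) / math.sqrt(var) for i in range(1, n + 1)]

def pcs_statistics(antiranks, c, a):
    """Return [(k, L_nk, A2_nk)] for k = 0, ..., n.

    antiranks : 1-based anti-ranks S_n1, ..., S_nn in order of failure
    c         : scalar concomitant variates c_1, ..., c_n
    a         : scores a_n(1), ..., a_n(n)
    """
    n = len(c)
    cbar = sum(c) / n
    abar = sum(a) / n
    out = []
    for k in range(n + 1):
        # (12.11): average of the scores not yet used; 0 when k = n
        a_star = sum(a[k:]) / (n - k) if k < n else 0.0
        # (12.10): censored linear rank statistic after k failures
        L = sum((c[antiranks[i] - 1] - cbar) * (a[i] - a_star) for i in range(k))
        # (12.12): variance function of the censored scores
        A2 = (sum(v * v for v in a[:k]) + (n - k) * a_star ** 2 - n * abar ** 2) / (n - 1)
        out.append((k, L, A2))
    return out
```

For $q = 1$ the statistic in (12.13) is then `L * L / (A2 * Cn)` with `Cn = sum((ci - cbar) ** 2 for ci in c)`, defined whenever `A2 > 0`. At `k = n` the censored statistic coincides with the uncensored statistic in (12.7).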

Under a Type II censoring scheme, after the $r$th failure point $Z_{n:r}$ (for a prefixed $r$), we compute $\mathcal{L}_{nr}$ as a test statistic for testing the null hypothesis of homogeneity of the $F_i$ (against the regression alternative stated before). This is distribution-free under the null hypothesis, and for large $n$, its distribution is closely approximated by the central chi-squared distribution with $q$ degrees of freedom. Under Type I censoring, $r(T)$ is itself stochastic, and under $H_0$, $r(T)$ has the $\mathrm{Bin}(n, F(T))$ distribution, so that $r(T)/n \to F(T)$ in probability, as $n \to \infty$. In this scheme, we consider $\mathcal{L}_{nr(T)}$ as the test statistic, and note that it is conditionally (given $r(T) = r$) distribution-free under $H_0$; in large samples, it has the central chi-squared distribution with $q$ degrees of freedom.

The situation is different in an interim analysis scheme. Typically, on a calendar time basis, say $0 < T_1 < \cdots < T_K$, one looks into the accumulating data sets $\mathcal{B}_{n,r(T_j)}$, $j = 1, \ldots, K$, with the provision of an early termination if advocated by the flow of the experimental outcomes. Thus, it seems quite appealing to consider the sequence of test statistics

$$\mathcal{L}_{nr(T_j)}, \quad j = 1, \ldots, K, \tag{12.14}$$

and formulate a stopping rule and decision rule based on this sequence. For $q = 1$, marginally (under $H_0$), each $\mathcal{T}_{nr}$, for $r \sim np$: $0 < p < 1$, has asymptotically the standard normal distribution; however, the $\mathcal{T}_{nr_j}$, $j \ge 1$, are not independent. For small values of $K$ (and nonstochastic $r(T_j)$, $j = 1, \ldots, K$), some numerical tabulations of the critical values of the test statistic

$$\mathcal{L}_n^* = \max_{j \le K}\{w_j^{-1}[\mathcal{L}_{nr(T_j)}]\} \tag{12.15}$$

for suitable weight functions $w_j$, $j = 1, \ldots, K$, have been considered by Pocock (1977) and O'Brien and Fleming (1979), among others. Their numerical studies, based on strictly independent and homogeneous Gaussian increments, are not only too complex for large $K$ but also inadequate to adjust for the actual case where the $r(T_j)$ are themselves stochastic. DeLong (1981) considered some extensive tabulations of boundary crossing probabilities for standardized Bessel processes, which not only cover the particular case of $q = 1$, but also provide good approximations for the general case of $q \ge 1$, allowing $K$ to be arbitrary, and retaining the stochastic nature of the $r(T_j)$. We elaborate this in the case of the progressively censoring schemes (PCS) that include Type I and II censoring as special cases.

In a PCS, one generally initiates monitoring from the beginning of the study, often continuously over time, with a view to terminating the study at an intermediate stage if the accumulated outcome at that stage calls for the rejection of the null hypothesis in favor of the alternative. Recall that the picture changes only at the successive failure points $Z_{n:k}$, $k \ge 1$, where at the point $Z_{n:k}$ one has the cumulative picture $\mathcal{B}_{n,k}$. In this way, we have the sequence of test statistics

$$\{\mathcal{L}_{n,k};\ k \ge 0\}, \quad \text{where we let } \mathcal{L}_{n,0} = 0, \tag{12.16}$$

and the basic problem is to formulate a stopping rule that permits the interim analysis subject to a specified overall level of significance and a good power of the test. Since these statistics may not generally have independent or homogeneous increments, our basic formulation rests on suitable martingale characterizations and related invariance principles (Chatterjee and Sen, 1973; Majumdar and Sen, 1979). We have two related types of test statistics, which we refer to as Type A and Type B time-sequential tests. In Type A testing, we take the analog of the classical sequential probability ratio type of test statistics, while in the other case, we incorporate the statistics in (12.16) directly.

First, we define a target number $r \,(\le n)$, which can be settled from extraneous considerations. For example, if a study is planned for a maximum duration of 5 years, and for the specific cohort group the probability of not surviving beyond these 5 years, as estimated from census or other studies, is, say, 0.10, then we may set $r = [0.1n]$. Next, we define a process $k_n = \{k_n(t),\ 0 < t < 1\}$ of nonnegative integers by letting

$$k_n(t) = \max\{k : A_{n,k}^2 \le t A_{n,r}^2\}, \quad 0 < t < 1, \tag{12.17}$$

so that $k_n(t)$ is nondecreasing in $t \in (0, 1)$. Let then

$$U_n(t) = \mathcal{L}_{n,k_n(t)}, \quad V_n(t) = t^{-1}U_n(t), \quad t \in (0, 1). \tag{12.18}$$

In the Type B procedure, we propose the test statistic

$$U_n^* = \sup\{U_n(t) : 0 < t < 1\}, \tag{12.19}$$

so that for a suitable critical level, say $u_{n,\alpha}^{*2}$, we may define the stopping variable as

$$K_n(B) = \inf\{k_n(t) \ge 0 : U_n(t) > u_{n,\alpha}^{*2}\}. \tag{12.20}$$
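A stopping rule of the kind in (12.19) and (12.20) can be sketched for $q = 1$ as follows. Two simplifications are assumptions of this illustration and not part of the text: the supremum over $t$ is replaced by a maximum over the failure indices $k_0, \ldots, r$ (with a hypothetical starting index `k0`), and the critical level is calibrated by Monte Carlo over random permutations of the anti-ranks, which is legitimate because the statistics are distribution-free under the null hypothesis, instead of being read from tables for Bessel processes. The Wilcoxon-type scores and all function names are likewise hypothetical.

```python
import math
import random

def wilcoxon_scores(n):
    """Assumed score choice: proportional to i, normalized as in (12.3)."""
    mean = (n + 1) / 2.0
    var = sum((i - mean) ** 2 for i in range(1, n + 1)) / (n - 1)
    return [(i - mean) / math.sqrt(var) for i in range(1, n + 1)]

def stat_path(antiranks, c, a, r):
    """Statistics (12.13) for q = 1 at failures k = 1, ..., r.
    Assumes mean-zero scores, so the n * abar^2 term of (12.12) vanishes."""
    n = len(c)
    cbar = sum(c) / n
    Cn = sum((ci - cbar) ** 2 for ci in c)
    tail = sum(a)     # sum of a_n(i) over i > k
    head_sq = 0.0     # sum of a_n(i)^2 over i <= k
    part = 0.0        # sum of (c_{S_ni} - cbar) * a_n(i) over i <= k
    csum = 0.0        # sum of (c_{S_ni} - cbar) over i <= k
    path = []
    for k in range(1, r + 1):
        dev = c[antiranks[k - 1] - 1] - cbar
        ak = a[k - 1]
        tail -= ak
        head_sq += ak * ak
        part += dev * ak
        csum += dev
        a_star = tail / (n - k) if k < n else 0.0          # (12.11)
        L = part - a_star * csum                            # (12.10)
        A2 = (head_sq + (n - k) * a_star ** 2) / (n - 1)    # (12.12)
        path.append(L * L / (A2 * Cn))                      # (12.13), q = 1
    return path

def mc_critical_level(c, a, r, k0=1, alpha=0.05, reps=2000, seed=0):
    """Upper-alpha point of max_{k0 <= k <= r} of the path under the permutation null."""
    rng = random.Random(seed)
    perm = list(range(1, len(c) + 1))
    maxima = []
    for _ in range(reps):
        rng.shuffle(perm)                                   # null: anti-ranks are a random permutation
        maxima.append(max(stat_path(perm, c, a, r)[k0 - 1:]))
    maxima.sort()
    return maxima[min(reps - 1, math.ceil((1 - alpha) * reps))]

def stopping_index(path, crit, k0=1):
    """First failure index k >= k0 at which the statistic reaches crit, else None."""
    for k in range(k0, len(path) + 1):
        if path[k - 1] >= crit:
            return k
    return None
```

If `stopping_index` returns an index, the trial would be stopped at that failure with rejection of the null hypothesis; if it returns `None`, the study runs to the target number `r` without rejection. Starting the monitoring at `k0 > 1` plays a role similar to the truncation used in the Type A scheme, since the normalized statistic is unstable over the first few failures.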

In the Type A scheme, as $V_n(t)$ might not be properly defined when $t$ is close to 0, we choose a positive $\varepsilon$, which could be small, and define the stopping variable as

$$K_n(A, \varepsilon) = \inf\{k_n(t) \ge k_n(\varepsilon) : V_n(t) \ge v_{n,\alpha}^{*2}\}, \tag{12.21}$$

where $v_{n,\alpha}^{*2}$ stands for the critical level. Note that $v_{n,\alpha}^{*2}$ may generally depend on the chosen $\varepsilon$. These critical levels are well approximated by the corresponding levels for the ($q$-parameter) Bessel processes (for Type B) and their normalized and truncated versions for Type A (Sen, 1981, ch. 2 and ch. 11). DeLong (1981) has tabulated these entries for various values of $q \,(\ge 1)$, $\alpha$ and $\varepsilon$. Incidentally, in the particular case of $q = 1$, the Brownian motion approximation (having independent and homogeneous increments) also allows us to make use of Pocock's GST formulation (based on independent and homogeneous normal subsamples of equal size and known variance). Some comparisons made in Sen (1999b) reveal that Pocock's numerical results might not be very accurate, particularly when $K$ is not small; moreover, his numerical approach runs into enormous computational difficulties when $K$ is not small, whereas DeLong's results are accurate up to 4 decimal places for a much wider range of $\varepsilon$ values. The same criticism can be levelled against the O'Brien and Fleming (1979) numerical studies. Both assume homogeneous and independent increments and choose equally spaced time-points; as has been explained earlier, in clinical trials this may not always be the case. Lan and DeMets (1983) introduced the clever idea of a spending function that allows possibly uneven (but prefixed) spacings of the time-points; again, if these time-points are stochastic (as is the case in clinical trials, where the $r(T_j)$ are stochastic), then there is a need to make suitable adjustments to their exact formulation, which may result in a cruder approximation. DeMets and Lan (1994) have addressed some of these issues heuristically. We refer to Wei et al. (1990) and Wu and Lan (1992) for some discussions of monitoring in a sequential clinical trial.
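The spending-function idea can be sketched as follows. The two functional forms are the ones commonly associated with Lan and DeMets (an O'Brien-Fleming-type and a Pocock-type function); they are not written out in this chapter and are supplied here as an assumption of the illustration, with $t$ denoting the prefixed information fraction in $(0, 1]$. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def obf_type_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending: very little level is spent at early looks."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - NormalDist().cdf(z / math.sqrt(t)))

def pocock_type_spending(t, alpha=0.05):
    """Pocock-type spending: the level is spent more evenly over the looks."""
    return alpha * math.log(1.0 + (math.e - 1.0) * t)

def alpha_increments(look_times, spend, alpha=0.05):
    """Level made available at each look, for possibly unevenly spaced look times."""
    spent_before = 0.0
    increments = []
    for t in look_times:
        spent = spend(t, alpha)
        increments.append(spent - spent_before)
        spent_before = spent
    return increments
```

For unevenly spaced looks such as `[0.2, 0.5, 0.6, 1.0]`, `alpha_increments` returns positive increments that add up to the overall level under either function. The sketch treats the look times as prefixed constants; it does not address the case raised here, in which the information fractions are themselves stochastic.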

The PCS approach provides a simple resolution for Type I and Type II censoring, as $r(T)$ or $r$ can be chosen from extraneous considerations. The situation is more complex with random censoring schemes. However, it is generally assumed in practice that the censoring times $T_i$ and failure times $X_i$ are independent, and in addition that the $T_i$ are unaffected by possible treatment effects (i.e., noninformative censoring). In this setup, the Cox (1972) partial likelihood approach places prime weight on the (nested) risk sets at the observable failure points, and a similar consideration applies to the Kaplan-Meier product limit estimator. In the present regression setup, if we work with the d.f. $G$ of the $T_i$, let $Y_i = \min(X_i, T_i)$, and denote the d.f. of $Y_i$ by $H_i(y)$, then on working with the respective survival functions $\bar G$, $\bar F_i$ and $\bar H_i$, we have

$$\bar H_i(t) = \bar G(t)\,\bar F_i(t) = \bar G(t)\,\bar F(t - \boldsymbol{\beta}'\mathbf{c}_i), \quad i = 1, \ldots, n, \ t \ge 0. \tag{12.22}$$
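The product form in (12.22) rests on the independence of failure and censoring times, and it can be checked by simulation. The sketch assumes, purely for illustration, exponential failure and censoring times, for which the survival function of the minimum is again exponential with the two rates added; the function name and default values are hypothetical.

```python
import math
import random

def simulate_min_survival(t, n=20000, rate_x=1.0, rate_t=0.5, seed=0):
    """Monte Carlo estimate of P(min(X, T) > t) for independent exponential
    failure times X (rate rate_x) and censoring times T (rate rate_t)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n):
        x = rng.expovariate(rate_x)    # failure time
        c = rng.expovariate(rate_t)    # censoring time, independent of x
        if min(x, c) > t:
            count += 1
    return count / n
```

With the default rates, `simulate_min_survival(0.5)` should be close to `math.exp(-1.5 * 0.5)`, the product of the two individual survival probabilities. The product is smaller than the survival probability of the failure time alone, which is one way to see why the observable $Y_i$ carry a damped version of the treatment effect.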

As such, the null hypothesis of equality of the $F_i$ implies the equality of the $H_i$, so that a PCS rank test (based on the $Y_i$ instead of the $X_i$) can be prescribed as before. Since the divergence of the $H_i$ will be damped by the presence of $G$ (relative to that of the $F_i$), such a procedure will lose some power due to censoring. Again, a similar loss of power due to censoring occurs in the Cox model as well. The two approaches share a common property, namely the weak convergence to a Bessel process under the null hypothesis (and to a drifted one under a local (contiguous) alternative), and again DeLong's findings for the corresponding critical levels can be incorporated in the test formulation.

In clinical trials, in interim analysis, it is more customary to consider a discretized monitoring scheme, though the number of possible looks may not typically be very small. There are certain points worth pondering in this context (Sen, 1999b), and we list some of these below.

(I) Large sample approximations, alternative hypothesis and the nature of the nonnull distribution. The Gaussian approximations, referred to earlier, hold mostly under the null hypothesis, and for local (contiguous) alternatives. Therefore, it remains to judge carefully whether such local alternatives are appropriate in the given context. Moreover, even if such a local alternative is pertinent in the given experimental setup, typically the noncentral distributions involve a nonlinear drift function. Often, such a nonlinear drift cannot be transformed into a linear one by a simple time-transformation. As such, analytical studies of the (asymptotic) power function for such local alternatives become difficult, and the prospect rests on simulation and numerical studies. In this respect, the setup of GST (wherein a linear drift function is presumed) is not of much relevance to clinical trials, and the well-known results on linear boundary crossing probabilities for a Brownian motion process may not be of much help.

(II) Power and optimal test. Not only is it difficult to study the (asymptotic or exact) power of multiple hypothesis tests in interim analysis, but also, because of their complex nature, no uniformly better test may exist. This feature, of course, suggests the need for simple testing procedures, so that the power can be studied conveniently, but experimental setups might make it difficult to justify the validity of such a simple testing scheme.

(III) Optimal designs. Unlike in conventional agricultural or biometric experiments, we do not generally have a simple design that captures the true objectives of the study and yet conveys a linear model. Therefore, optimal designs may not exist. Mostly, the design is adopted from certain experimental considerations, and for such complex designs, it may be quite difficult to claim any optimality properties, even in an asymptotic sense. Rather, censoring, staggered entry plans, and other experimental constraints need to be appraised properly in formulating a suitable clinical study design, and one needs to probe how far a desirable testing scheme can be pursued in that way with due statistical safeguards.

Therefore, validity and robustness considerations, along with experimental constraints, dominate the choice of clinical trial designs. We shall discuss this item further in the next section.

13. Design of bioenvironmental studies

Planning or design of a study precedes statistical modeling, and that in turn precedes statistical analysis or the drawing of statistical conclusions. This way, the modeling depends on the (sampling) design, and, of course, the drawing of statistical conclusions depends a lot on the underlying statistical model and the sampling design. As has been stressed throughout, in bioenvironmental and public health studies there is a predominant emphasis on hazard identification, exposure to hazard levels and mensuration, and the relationship between the level of exposure and the response (hazard). For this triplet, not only could the modeling part be quite complex, but also the sampling scheme may generally be quite nonstandard, and thereby the classical parametric inference procedures may not usually be valid or efficient. Thus, in statistical modeling, we need to incorporate appropriate regularity conditions that enhance the scope of statistical conclusions, and this can be facilitated with proper safeguards on the planning or design aspects of the study. To illustrate this point, we present side by side an agricultural or biometrical study where typically for the response variable (say, X) a linear model with additive effects and normally distributed, homoscedastic and independent error components (say, e) is presumed. We have already indicated that possible departures from the assumed normality, independence, homoscedasticity as well as the additivity can make parametric inference procedures not only inefficient but also inconsistent in some extreme cases. The Box-Cox type transformations are sometimes used to render linearity of the effects, but there might not be any guarantee that after the transformation, the additivity or the normality and homoscedasticity conditions will still be tenable. In a bioassay setup, say, in a quantal assay, the response variables are dichotomous, with probability laws that depend on the dose levels, typically in a nonlinear form. Both the logit and probit models try to linearize the dosage response regression by using a strictly parametric (logistic or normal) form of the underlying tolerance distribution. Again an incorrect assumption might lead to an incorrect model, and hence, the derived statistical conclusions might not be precise or even consistent in some extreme cases; this is particularly true either for low doses, where the chance of a positive response is very small, or for very high doses, where it is close to 1. Yet in practice, due to various experimental (medical/environmental) constraints, only low doses are permissible. This often makes accelerated life testing models quite nonrobust, with only very limited scope for drawing efficient statistical conclusions. We refer to Chen-Mok and Sen (1999) for some discussion of compliance models in bioenvironmental studies; other references are cited there. In the classical statistical experimental designs the tripos (of randomization, replication and local control) occupies the focal point. Although these are still the essential ingredients in designing clinical, biomedical, bioenvironmental and other public health studies, in view of the associated sampling scheme and the model relevant to such a sampling design, each of these criteria has to be appraised in a possibly different manner. In many epidemiologic studies, matching, cohort studies, and case-control studies are adopted, resulting in different sampling designs. In many retrospective studies, length-biased sampling schemes are employed.
In biomedical studies, cross-over designs are often adopted on medical ethics and other extraneous considerations. Most of these models have been considered in some other accompanying articles in this volume, and hence, by cross-referencing most of them, we will avoid the duplication of presentation.
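The remark on the logit and probit models at low doses can be illustrated numerically. The sketch compares a normal tolerance distribution with a logistic one rescaled to agree closely over the central dose range; the rescaling constant 1.702 is a commonly used matching value and is an assumption of this illustration, as are the function names and the standardized dose scale `z`.

```python
import math
from statistics import NormalDist

SCALE = 1.702   # assumed constant that makes the logistic curve track the normal one

def probit_response(z):
    """Response probability under a standard normal tolerance distribution."""
    return NormalDist().cdf(z)

def logit_response(z):
    """Response probability under a rescaled logistic tolerance distribution."""
    return 1.0 / (1.0 + math.exp(-SCALE * z))

def tail_ratio(z):
    """Ratio of the logistic to the normal response probability at dose z."""
    return logit_response(z) / probit_response(z)
```

Over the central range the two curves differ by less than about 0.01 in absolute terms, so dose-response data from that range can hardly discriminate between them. At a low standardized dose such as `z = -4`, however, `tail_ratio(-4.0)` is well above 10, so extrapolation to low doses depends heavily on which tolerance distribution was assumed.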

The general objectives of an intended study dictate the basic planning aspects to a great extent, and hence, in many bioenvironmental and public health investigations with complex objectives and various operational constraints, the designs are likely to be nonstandard and much more complex. As has already been emphasized in earlier sections, adoption of standard linear models in a simple design (like the completely randomized or randomized block design) may not be generally tenable here, and with greater complications in the interpretation and measurement of the primary endpoint and concomitant variates, proper safeguards are needed to incorporate reliable and appropriate statistical analysis tools in order to draw valid and efficient statistical conclusions.

To illustrate the nature of experimental setups in bioenvironmental and public health studies, and allied designs, we consider the following examples.

(i) Life testing: industrial vs. clinical studies. In either setup, sampling may be destructive, as the selected units are followed until the failures occur. In that setup, apart from the relative cost of sampling, there is a basic difference: in dealing with living subjects (mostly subhuman primates and human volunteers), medical ethics prompts us to avoid loss of life as far as possible, and to take proper safeguards so that no subject is intentionally put at any extra risk due to some undesirable experimental factors. In an industrial setup, items are subject to simultaneous life testing under varied experimental conditions, and their survival functions are to be compared in order to draw statistical conclusions on the experimental factor. In an industrial setting, usually exponential or Weibull survival models are considered in a GLM setup wherein the associated parameters are tied down to suitable link functions. The choice of a link function, particularly a canonical one, may require a good deal of background information on the underlying complex and their induced variations. In this manner, there may be ample room for lack of robustness properties for such GLM-based designs. For this reason, sometimes, monotone failure rate distributions are considered that result in a semiparametric model with a somewhat greater robustness prospect. In a clinical setup, a parametric model, such as the Weibull or exponential, may not be universally tenable, and the presence of numerous concomitant and auxiliary variables may even call for some mixed models where a parametric approach could be even more nonrobust. Some of these issues in connection with designing survival analysis regression models have been addressed in the accompanying article by Klein and Johnson (2000) in this volume. The classical Cox (1972) hazard regression model approach has better appeal from the survival analysis point of view, and this has been explored quite extensively in the literature. However, the very basic PHM assumption inherent in this approach may not hold in all bioenvironmental and public health investigations (Sen, 1994), and hence, in formulating appropriate designs for such studies, proper precautions should be taken.

(ii) Sampling design for bioenvironmental studies. In conventional agricultural and biometric studies, generally, the collection of observations follows traditional routes, and linear or generalized linear models (mostly, parametric) can be in- corporated in the planning and statistical analysis. The sampling scheme could be quite different in many bioenvironmental studies. For example, in a study of atmospheric pollution and its potential impact on health of human beings, there may be a good deal of inherent longitudinal data or repeated measurement design aspects; in such studies, spatial-temporal aspects also call for a critical appraisal. Conventional completely randomized or some simple blocked designs may not be very appropriate in such studies. For example, in monitoring the level of carbon particles and carbon monoxide in the air of a specific area, such as a town or a traffic-congested sector of a metropolitan area, it should be kept in mind that the level might not be stationary over an entire day period, nor from one day to another as a process; within a day, it can depend on the high-traffic intensity time periods as well as other humidity, moisture and accompanying atmospheric factors. There is generally a carry-over effect from earlier accumulation; cooking and (house-)heating practices might also contribute a lot to this phenomenon. Seasonality of the pattern can often be clearly identified. The picture prevailing in the adjacent places (viz., an industrial plant nearby) can also have some spatial effect. Moreover, the pollution level at the ground level may differ considerably from that above, say, 15 feet or more of higher elevations. Further, how to sample the air quality keeping in mind this three dimensional differential picture.


Naturally, some nonstandard sampling schemes may need to be incorporated in such investigations. There is an additional complication in such bioenvironmental studies, namely, identifying the true end-point(s) as well as their total entity. For example, in studying the health hazards from atmospheric pollution, which are the relevant response variates to be included in the study plan? There may be a large number of such end-points and they may be highly interacting. Besides, there may be some latent effects. This complex may call for some time-varying parametric or semiparametric models, for which an optimal simple design may not exist, and with a very complex design, either the statistical analysis may become quite naive, or it may require an enormously large number of observations that could be unattainable on cost and other grounds. We refer to the accompanying article by Weller et al. (2000) in this volume for some additional discussion on inhalation toxicology, where such design aspects are paramount. Basically, inhalation toxicity itself is of a highly complex nature (Sen and Margolin, 1995), and hence, defining and interpreting it in simple terms could entail some loss of information and precision as well. Mutagenesis occurs in such studies, and that affects the response variables in a completely different direction. At present, in many such studies, some markers are used to gain additional information on these genetic impacts, and in that way more precise statistical analysis can be made. However, from the design as well as operational points of view, there are additional complications in such molecular biological studies, and much more remains to be accomplished in this direction.

(iii) Multiple end-points and allied designs. As has been mentioned earlier, in most bioenvironmental studies with a broad objective, it is usual to have multiple end-points; often, these end-points can be ordered in accordance with their relevance and impact on the study scheme. In such a case, a multi-response design can be adopted and a step-down procedure for drawing statistical conclusions can be incorporated in a suitable manner so as to have good control over the basic features of inference procedures (viz., Roy et al. (1971) for a general treatise of parametric designs). But, as has been explained earlier, such parametric models may not appear to be very realistic in the specific cases, and therefore a more general treatment of this type of designs in nonparametric and semiparametric setups is very much in need. Of course, such designs could be more complex, and in order to incorporate them in specific applications, sufficient care is needed to ensure validity and efficacy of statistical procedures that are appropriate for such studies.

There are other situations where multiple endpoints arise in a natural way, and on top of that there are clusters within which the observations may no longer be stochastically independent. This may arise in familial aggregation studies (for genetic disease or disorder) or in other contexts (viz., Clegg et al., 2000). Design for such clustered-sample, multiple-endpoint studies could be quite different from conventional clinical trials, and there might be some competing risk setups in a broad sense (viz., DeMasi et al., 1997). In such cases, it may be harder to validate a simple parametric approach based on specific distributional assumptions, and therefore, semiparametric and nonparametric methods are being worked out more and more. In this respect, there is some emphasis on incorporating the marginal models along with suitable covariance adjustments to account for possible dependence (Clegg et al., 2000) or to define cause specific conditional hazard functions and to incorporate the Cox PHM along with appropriate adjustments for dependence (DeMasi et al., 1997). As these are discussed in detail in some accompanying articles in this volume, we avoid the repetition. Matrix-valued counting processes have also been incorporated in depicting the statistical flow of events in multiple endpoints clinical trials. We refer to Pedroso de Lima and Sen (1997, 1999) where other pertinent references are also cited.

There is an important factor that underlies many clinical studies where the primary end-point is failure or loss of life. As such, with human subjects, it is neither desirable nor possible to let the experiment run with the provision of this fatal end-point. In many clinical trials the variables of interest, known as the true endpoints, are either too costly or hard to measure, and hence some endpoints that are easier and less costly to measure are chosen. These are known as surrogate endpoints, and there is some lack of a unified statistical interpretation of such a surrogate endpoint. Prentice (1989) advocated that a surrogate endpoint should be a response variate for which a test for the null hypothesis of no relationship to the treatment groups under comparison is also a valid test for the corresponding null hypothesis based on the true endpoint. Thus, the surrogate should not only be informative about the primary endpoint but also should fully capture the effect of treatment on the true endpoint. In this sense, it differs from the usual measurement error models where a surrogate is a substitute for a true covariate. It also differs from latent-class models (see El-Moalem and Sen, 1998). In this sense, a validation sample with both surrogate and primary endpoints is generally needed to ensure a valid statistical estimate of the relationship between the surrogate and the true endpoints. Using a validation sample in addition to a surrogate endpoint sample, Pepe (1992) and Tsiatis et al. (1995) dealt with semiparametric approaches. These semiparametric procedures retain some flexibility of a comparatively smaller sample size with some compromise on the underlying model structures. The nonparametric formulations (Sen, 1994; El-Moalem and Sen, 1998), though flexible with respect to model structures, are generally more complex, mainly because a reduction of the relevant statistical information to only a few summary measures might not serve the purpose, and a large dimensional parameter space may invariably require a comparatively larger sample size. The last two articles explored nonparametric rank tests for ANOCOVA models in dealing with surrogate endpoints, and by cross reference, we omit the duplication here.

(iv) Crossover designs. Tudor et al. (2000) have nicely elaborated the setups in biomedical and environmental studies, where the primary endpoint or the prime response variate can re-occur in a pattern over a period of time. There are certain complications that arise in the interpretation of the component effects and in their modeling. For example, in the simplest case, in a 2 x 2 design, involving two treatments, say, A and B, for a number of subjects, treatment A is applied for a period of time, followed by B for a later period, while for some other subjects, the sequence {A, B} is reversed. Even if these two periods are separated, there could be a carry-over effect from one period to another, and this may well depend on the sequence A, B (i.e., whichever is administered first). Greater complications may arise in more complex designs involving a larger number of treatments and/or periods. Let us illustrate this with the following simple example. Suppose that there are three milk formulae, say A, B and C. Those newly born babies who are separated from their mothers, either due to the mother's death or possible moving to some orphanage, are fed with such substitute milk formulae. The plan is to judge the efficacy of the formulae by recording the gain in weight and height of a baby over a six-month period. A simple one-way layout could relate to prescribing a particular formula over a six-month period, and invoking the usual growth curve models over repeated measurements at convenient time points, say, every 4 weeks. However, there is a feeling that one of the three formulae has some deficiencies (e.g., iron/vitamin/protein), and hence, its administration over a long period may induce serious health hazards. So it might be argued in favor of removing that formula from the study protocol. On the other hand, pertinent statistical and medical information on that formula might be valuable in future nutritional research. One way of achieving this goal is to have a sequence of 3 periods of two months each, and have all possible 3! (= 6) sequences, each one containing A, B, C in a specific permutation (so that no specific formula is used for a period longer than 2 months). In this setup, a lot might depend on whether or not a sequence starts with the weaker formula. Thus, there are carry-over effects that need to be properly identified and interpreted in modeling and statistical analysis. For other interesting examples, we refer to Tudor et al. (2000).
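The three-formula layout described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a design taken from Tudor et al. (2000): the function names `crossover_sequences` and `carryover_pairs` are hypothetical, and only first-order carry-over (from the immediately preceding period) is listed.

```python
from itertools import permutations


def crossover_sequences(treatments):
    """Return every ordering of the treatments, one ordering per sequence group."""
    return [tuple(p) for p in permutations(treatments)]


def carryover_pairs(sequence):
    """List (previous, current) treatment pairs for periods 2, 3, ... of a sequence."""
    return list(zip(sequence[:-1], sequence[1:]))


# Three formulae, three periods of two months each (labels as in the text).
sequences = crossover_sequences(["A", "B", "C"])
for seq in sequences:
    print(" -> ".join(seq), "| first-order carry-over pairs:", carryover_pairs(seq))
```

Each formula opens exactly two of the six sequences, so the effect of starting with the weaker formula can, in principle, be separated from the direct formula effects.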

In traditional cases, a linear model is incorporated in crossover designs allowing carryover effects in cyclic patterns and assuming normal errors. As has been repeatedly stressed in earlier sections, in many bioenvironmental and public health (especially environmental epidemiologic) studies, there could be very little justification for the adoption of normal-theory parametric models, and hence, more emphasis is being placed nowadays on suitable nonparametric and semiparametric models. The recent text by Senn (1993) has captured some of these developments from a sound applications standpoint, though there remains ample room for further methodological developments. A more recent text by Diggle et al. (1997) gives a broader treatment of semiparametrics in this context. Tudor et al. (2000) contains a good account of some application oriented nonparametrics in crossover designs for bioenvironmental and public health studies.

(v) Case-control studies. In epidemiologic studies, case-control designs are often used not only for observational convenience but also for more informative data collection. In familial aggregation studies in the context of genetic epidemiology, a nice account of case-control studies is due to Laird et al. (2000). Whereas in the cohort (or prospective) studies, pertaining to the relationship between a disease and a hypothesized risk factor, subjects are selected on the basis of their exposure to the risk factor, in case-control or retrospective studies, subjects are selected on the basis of their disease status (along with various concomitant variates that are generally associated with the disease status). This results in a different sampling scheme, and thereby calls for alternative criteria for choosing an optimal (or at least desirable) design. In many such studies, the outcome variable is binary (or polychotomous), and hence, we have suitable contingency tables with different restraints for different sampling schemes. In epidemiology, there is a natural emphasis on the odds ratio (OR) (or its generalizations for polychotomous responses) in the assessment of such disease-risk factor relationships, and many conventional nonparametric models have been adapted. In the simplest 2 x 2 case (disease-nondisease and exposed-nonexposed) there are good asymptotic results for such nonparametric procedures, as may be found in contemporary texts (see for example, Agresti, 1995).
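For the simplest 2 x 2 case, the sample odds ratio and a large-sample interval can be sketched as follows. This is a minimal illustration using the standard Wald interval on the log scale, which is a textbook device and not a procedure specific to this chapter; the function name `odds_ratio` and the cell counts are hypothetical.

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """Sample odds ratio with a Wald-type interval on the log scale.

    a = exposed cases,    b = nonexposed cases,
    c = exposed controls, d = nonexposed controls.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive for this sketch")
    or_hat = (a * d) / (b * c)
    # Large-sample standard error of log(OR).
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, (lower, upper)


# Hypothetical counts, for illustration only.
or_hat, (lower, upper) = odds_ratio(20, 10, 10, 20)
print(f"OR = {or_hat:.2f}, approximate 95% interval = ({lower:.2f}, {upper:.2f})")
```

The interval relies on asymptotic normality of the log odds ratio, so it should be read with caution when any cell count is small.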

Semiparametric developments are of comparatively recent origin. Among these, the logistic regression model and the Cox proportional hazards model have been advocated in some studies (Whittemore, 1995; Zhao and Prentice, 1990). These have been discussed in the accompanying article by Laird et al. (2000) in this volume, and hence, we avoid the repetition.

14. Molecular biology and genetics

In recent years there has been a spectacular evolution of statistical reasoning in molecular biology and genetics. This not only has strengthened the frontiers of statistical genetics but also has linked epidemiologic and environmental genetics in a broader field of profound scientific as well as social interest. Statistical genetics has its genesis in the Mendelian hereditary principles that rest on some simple probability structures. In more complex genetic studies, such probability laws may also become highly complex, and there are genuine statistical issues that merit careful appraisals. Chakraborti and Rao (2000) have addressed some of these issues in an accompanying article in this volume. For this reason, we take recourse to a complementary area and review the recent developments where non-parametrics play a basic role.

Population genetics and epidemiologic genetics have been an active domain of fruitful statistical research for quite some time, and the accompanying article by Pinheiro et al. (2000) in this volume pertains to some of these developments. Mutagenesis is an important topic of in-depth study not only from academic interest but also from practical considerations arising in the emerging fields of biotechnology as well as environmental health sciences. The evolution of DNA/RNA research with the focus on sexually transmitted diseases has truly opened an enormous field, and there is a profound need to import more sophisticated statistical tools in such studies. In this sense, computational biology may be regarded as an interdisciplinary field whose practitioners come from diverse backgrounds, including molecular biology, mathematics, statistics and biostatistics, computer science, and physics. The basic principles of molecular genetics provide the foundation of this complex field of study. Computational biology has emerged as especially important since the advent of the GENOME projects. The Human GENOME project alone gives us the raw sequence of an estimated 70,000 to 100,000 human genes, only a fraction of which have been studied experimentally. Most of the problems in computational sequence analysis (CSA) are essentially statistical. Stochastic evolutionary forces act on genomes. Discerning significant similarities between anciently diverged sequences amid a chaos of random mutation, natural selection, and genetic drift presents serious signal-to-noise problems for probability theory. On these grounds, probabilistic models have been advocated, though there is ample room for further developments in this evolutionary field. Hidden Markov models (HMM), discussed for example in Durbin et al. (1998), are advocated strongly, though there are certain limitations for such an approach, the main one being a lack of the spatial topography that is needed for HMM. This we explain below with two simple problems in molecular biology. We are primarily interested in both internal analysis and external analysis that are typically comparable to internal multivariate analysis and multivariate analysis of variance (MANOVA) models, though in structure they differ fundamentally. For example, let us consider DNA sequences from human immunodeficiency virus (HIV), as may be studied on a geographical (spatial) basis or over a period of time (i.e., temporal) basis; they represent the MANOVA models. Also, in an internal analysis, resembling the canonical analysis, we may be interested in the covariation at different sites for a set of biological sequences.

In the DNA sequence, typically, we have K, a large number, of sites, and at each site, at a point of time, there is a polychotomous response relating to C categories that represent the prevalent amino acid or nucleotide (e.g., the nucleotide levels (A, C, T, G)). There is no prevalent ordering (even partial) of these categories. Though the outcome $X_{ik}$ for the ith sequence, kth site can take on the indices $\{1, \ldots, C\}$, for each $k \, (= 1, \ldots, K)$, the coordinates of the vector $X_i = (X_{i1}, \ldots, X_{iK})'$ are generally not stochastically independent. In addition the number (K) of sites could be very large, with very little information on the proximity of different positions in a conveniently interpretable spatial sense. In this manner, we encounter a $K \times C$ categorical response model with the marginal probabilities

$$\pi_k(c) = P\{X_{ik} = c\}, \quad c = 1, \ldots, C; \; k = 1, \ldots, K, \tag{14.1}$$

where we have the restraints

$$\sum_{c=1}^{C} \pi_k(c) = 1, \quad \forall k = 1, \ldots, K. \tag{14.2}$$

On top of that, as the elements of $X_i$ are not generally independent, we may not be in a position to adopt the conventional product-multinomial law that is used for categorical data models. Technically, we need to define a K-vector $\mathbf{c} = (c_1, \ldots, c_K)'$, where each $c_k$ can take on the values $1, \ldots, C$, and define the joint probabilities as

$$\pi(\mathbf{c}) = P\{X_i = \mathbf{c}\}, \quad \mathbf{c} \in \mathcal{C}, \tag{14.3}$$

where $\mathcal{C} = \{\mathbf{c} : c_k = 1, \ldots, C, \; k = 1, \ldots, K\}$ is a K-dimensional grid-set. If the K sites were independent then we would have $\pi(\mathbf{c}) = \prod_{k=1}^{K} \pi_k(c_k)$, $\forall \mathbf{c} \in \mathcal{C}$. The crux of the statistical problem is to deemphasize the independence assumption and develop suitable statistics for analysis of such multi-site DNA sequence data models. In this respect, we may want to test for the hypothesis of independence at two particular sites, or we may consider several groups of DNA sequences, and want to study within and between group statistical variations. We refer to Karnoub et al. (1999) for such tests for independence, and to Pinheiro et al. (2000) for the genetic analysis of variation problem. There are some other developments based on Markov chain Monte Carlo (MCMC) modelings and Gibbs sampling tools; however, with our emphasis on the nonparametrics, we shall mostly confine ourselves to some recently proposed nonparametric tools.
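To make the grid-set and the product form under independence concrete, the following sketch enumerates all $C^K$ cells for a toy case and evaluates the product of the site-wise marginals. The marginal probabilities and the function name `joint_under_independence` are hypothetical, chosen only for illustration; with realistic K the grid is far too large to enumerate, which is part of the difficulty noted above.

```python
from itertools import product
from math import prod

NUCLEOTIDES = "ACGT"  # C = 4 unordered categories

# Hypothetical marginal laws for K = 3 sites; each row sums to one.
marginals = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.10, "C": 0.20, "G": 0.30, "T": 0.40},
]


def joint_under_independence(marginals, cell):
    """Product of site-wise marginals: the joint law if the K sites were independent."""
    return prod(marginals[k][c_k] for k, c_k in enumerate(cell))


# The K-dimensional grid-set has C**K cells.
grid = list(product(NUCLEOTIDES, repeat=len(marginals)))
total = sum(joint_under_independence(marginals, cell) for cell in grid)
print(len(grid), "cells; total probability =", round(total, 12))
```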

Consider now a set of N sequences $X_i$, $i = 1, \ldots, N$, and for any two such sequences $X_i$, $X_j$, each relating to K sites, define the Hamming distance $D_{ij}$ as

$$D_{ij} = (\text{number of positions where } X_{ik}, X_{jk} \text{ differ})/K = K^{-1} \sum_{k=1}^{K} I(X_{ik} \neq X_{jk}). \tag{14.4}$$

Note that if the two sequences are independent and identically distributed, then

$$\begin{aligned}
E(D_{ij}) &= K^{-1} \sum_{k=1}^{K} P\{X_{ik} \neq X_{jk}\} \\
&= K^{-1} \sum_{k=1}^{K} \sum_{c=1}^{C} \pi_k(c)\{1 - \pi_k(c)\} \\
&= K^{-1} \sum_{k=1}^{K} \Big\{1 - \sum_{c=1}^{C} \pi_k^2(c)\Big\} = K^{-1} \sum_{k=1}^{K} \mathcal{I}_k,
\end{aligned} \tag{14.5}$$

where $\mathcal{I}_k = 1 - \sum_{c=1}^{C} \pi_k^2(c)$ is the well known Gini-Simpson index of biodiversity (Simpson, 1949) for the kth position, for $k = 1, \ldots, K$. Thus, the Hamming distance between two sequences is the sample counterpart of the average (over the K sites) of the Gini-Simpson indexes. In this formulation, we are not assuming the stochastic independence of the K sites, and in $D_{ij}$ their dependence pattern will show up in the formula for its sampling variance.
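The two quantities just related, the Hamming distance of (14.4) and the average Gini-Simpson index on the right-hand side of (14.5), can be sketched as below. The function names and the example inputs are hypothetical and serve only as an illustration.

```python
def hamming_distance(x, y):
    """Proportion of sites at which two equal-length sequences differ, as in (14.4)."""
    if len(x) != len(y):
        raise ValueError("sequences must cover the same K sites")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def gini_simpson(probs):
    """One minus the sum of squared category probabilities at a single site."""
    return 1.0 - sum(p * p for p in probs)


def expected_hamming(site_marginals):
    """Average of the site-wise Gini-Simpson indexes, the right-hand side of (14.5)."""
    return sum(gini_simpson(p) for p in site_marginals) / len(site_marginals)


print(hamming_distance("ACGT", "ACGA"))            # one of four sites differs
print(gini_simpson([0.25, 0.25, 0.25, 0.25]))      # maximal diversity for C = 4
print(expected_hamming([[1, 0, 0, 0], [0.5, 0.5, 0, 0]]))
```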

The last equation paves the way for two related but apparently different statistical approaches for the analysis of such sequences of data sets. First, the U-statistics approach: with the $D_{ij}$ defined as in (14.4), for every $(i, j): 1 \le i < j \le N$, we can define

$$\bar{D}_N = \binom{N}{2}^{-1} \sum_{1 \le i < j \le N} D_{ij} \tag{14.6}$$

as the best (symmetric, unbiased and minimum risk) estimator of $\Delta = E(D_{ij})$, the average Gini-Simpson index in (14.5). Clearly, $\bar{D}_N$ is a U-statistic (Hoeffding, 1948) where the kernel $D_{ij} = D(X_i, X_j)$ is of degree 2 and is bounded. Secondly, we can estimate the individual $\pi_k(c)$, $c = 1, \ldots, C$, $k = 1, \ldots, K$, by the respective sample proportions among the N sequences (belonging to the cth category in the kth site); we denote these by $\hat{\pi}_k(c)$, $c = 1, \ldots, C$, $k = 1, \ldots, K$. Then we can consider a plug-in estimator

$$\hat{\mathcal{I}}_N = 1 - K^{-1} \sum_{k=1}^{K} \sum_{c=1}^{C} \hat{\pi}_k^2(c). \tag{14.7}$$

This formulation corresponds to the von Mises (1947) functional, and may not be unbiased for $\Delta$. However, the two estimators in (14.6) and (14.7) are very close to each other. In fact, it follows readily that

$$|\bar{D}_N - \hat{\mathcal{I}}_N| = O(N^{-1}) \text{ almost surely as } N \to \infty. \tag{14.8}$$

For this reason, Pinheiro et al. (2000) advocated the U-statistics approach.

Consider now the usual ANOVA model in this extended categorical data model context (see Light and Margolin (1972), for some special cases). Let there be G groups of individuals, and for each group let there be N sequences $X_i^{(g)}$, $i = 1, \ldots, N$, $g = 1, \ldots, G$. For the gth group, we define $\Delta^{(g)} (= \Delta^{(gg)})$ as in (14.5) as the within-group Gini-Simpson index, while for a pair $(g, g')$ of groups, we define a between-group measure by

$$\Delta^{(g,g')} = K^{-1} \sum_{k=1}^{K} P\{X_{ik}^{(g)} \neq X_{jk}^{(g')}\}. \tag{14.9}$$

Note that the $\Delta^{(g,g')}$ can be unbiasedly estimated by a (generalized) U-statistic

$$\bar{D}_N^{(g,g')} = (N^2 K)^{-1} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{K} I(X_{ik}^{(g)} \neq X_{jk}^{(g')}). \tag{14.10}$$

On the other hand, pooling all the G groups together into NG sequences, we can define an overall measure by $\bar{D}_{NG}$ in the same way as in (14.6). Pinheiro et al. (2000) succeeded in showing that the usual ANOVA type statistical analysis can be performed (though in an asymptotic setup where N is taken large) in a reasonable way, and that paves the way for testing for between-group divergence in the light of the within-group measures.
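The within-group and between-group measures above can be sketched numerically. This is an illustrative sketch only, not the procedure of Pinheiro et al. (2000): the function names and the two toy groups are hypothetical. It also checks the elementary identity that the average pairwise Hamming distance equals N/(N - 1) times the plug-in estimator, which is one way to see the $O(N^{-1})$ closeness of the two estimators.

```python
from collections import Counter
from itertools import combinations


def hamming(x, y):
    """Proportion of sites at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def u_statistic(seqs):
    """Average Hamming distance over all pairs i < j (the U-statistic form)."""
    pairs = list(combinations(seqs, 2))
    return sum(hamming(x, y) for x, y in pairs) / len(pairs)


def plug_in(seqs):
    """One minus the site-averaged sum of squared sample proportions."""
    n, n_sites = len(seqs), len(seqs[0])
    total = 0.0
    for k in range(n_sites):
        counts = Counter(s[k] for s in seqs)
        total += sum((m / n) ** 2 for m in counts.values())
    return 1.0 - total / n_sites


def between_group(group_a, group_b):
    """Average Hamming distance over all cross-group pairs of sequences."""
    cross = [hamming(x, y) for x in group_a for y in group_b]
    return sum(cross) / len(cross)


# Two hypothetical groups of N = 4 sequences over K = 4 sites.
g1 = ["ACGT", "ACGA", "ACCT", "TCGT"]
g2 = ["GCGT", "GCTT", "GCGA", "ACTT"]
n = len(g1)
print(u_statistic(g1), plug_in(g1), n / (n - 1) * plug_in(g1))
print(between_group(g1, g2))
```

Comparing `between_group(g1, g2)` with the two within-group values gives a first, purely descriptive impression of between-group divergence; any formal test would need the asymptotic theory referred to above.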

In passing, we may note here that the Gini-Simpson diversity index has been intensively studied and its relationship with entropy based measures has been explored by a host of researchers; we refer to Rao (1982a-c), Nayak (1986a,b) and others; most of these works relate to genetic diversity and related problems. Sen (1999e) has incorporated such indexes in a more general utility-oriented form, not only in income inequality studies but also in the broader context of quality of life studies.

A high rate of mutation due to error-prone reverse transcriptase and a high rate of replication between the two RNA strands lead to an evolution of the HIV genome at a fast rate. An appraisal of their simultaneous mutation processes at several sites may not only provide useful information on protein structures, but also on the linkage between these sites. In this respect, for sequence data, usually it is assumed that the positions undergo independent mutation processes, though in all likelihood this assumption of independence might not be very realistic. Karnoub et al. (1999) have considered a specific pair of sites, say Position I and II, and assume that in Position I, a pair of amino acids, say V and W, are present while in Position II, another pair, say D and E, prevail. A consensus pair in the resulting 2 x 2 table refers to the most frequent configuration. We place the consensus pair on the left top corner of the 2 x 2 table, so that the pair on the right bottom corner refers to the double mutation. Suppose that we have N sequences of which $N_{11}$ relate to the consensus pair, while the other three entries are denoted by $N_{12}$, $N_{21}$ and $N_{22}$ respectively. Basically, we would like to model the underlying probability law for these counts in such a way that we could test for the null hypothesis of independence of mutations in the two sites, against possible dependences. Though tempting, the conventional Fisher's (1932) exact (conditional) test for independence in a 2 x 2 contingency table may not be very appealing in the current situation for the following reason. In the conventional case, the conditional test is based on conditioning on both the marginals (namely $N_{1\cdot} = N_{11} + N_{12}$ and $N_{\cdot 1} = N_{11} + N_{21}$). In the present case, the way the 2 x 2 table is constructed, the conditioning is to be made on the consensus pair count ($N_{11}$), and not on the marginals. Hence, Karnoub et al. (1999) have formulated another conditional test that addresses this conditioning in the proper manner. Essentially, they assume that the $N_{ij}$, $i, j = 1, 2$, are independent Poisson variables with parameters, say, $\lambda_{ij}$, $i, j = 1, 2$, and in conformity with the stochastic largeness of the consensus pair count over the others, they further assume that

$$\lambda_{11} \gg (\lambda_{12}, \lambda_{21}, \lambda_{22}). \tag{14.11}$$

Under the null hypothesis of independence, we have

$$\lambda_{ij} = \lambda \alpha_i \beta_j, \quad i, j = 1, 2, \tag{14.12}$$

where $\lambda = \lambda_{11} + \lambda_{12} + \lambda_{21} + \lambda_{22}$, and the $\alpha_i$, $\beta_j$ are nonnegative quantities satisfying $\alpha_1 + \alpha_2 = 1 = \beta_1 + \beta_2$. They show that the event that $N_{11}$ is greater than all the other three counts has a probability that goes to 1 as N increases, and advocate the estimation of $\theta = \alpha_1 \beta_1$ from the marginal binomial law for $N_{11}$, given N. Then they consider the conditional law of $N_{12}, N_{21}, N_{22}$, given $N_{11}, N$, and estimate $\alpha_1$ subject to the restraint that $\alpha_1 \beta_1 = N_{11}/N$. Let this estimator be denoted by $\hat{\alpha}_1$. The other estimator $\hat{\beta}_1$ is obtained by using the same restraint on $\theta$, and the complementary parts are denoted by $\hat{\alpha}_2$, $\hat{\beta}_2$. Then as a measure of divergence from independence, they consider the statistic

$$Z_{22} = N^{-1/2}(N_{22} - N \hat{\alpha}_2 \hat{\beta}_2). \tag{14.13}$$

An estimator of the asymptotic mean square error of $Z_{22}$ (under the null hypothesis), denoted by $V_N^2$, has also been prescribed in this context. As such, as a test statistic (for testing the null hypothesis of independence under the consensus setup), one may consider

$$T_N = Z_{22}/V_N, \tag{14.14}$$

and exploit the asymptotic normality of $T_N$ (under $H_0$) for the construction of the rejection region. Some simulation studies made by them conform to the expected pattern.

In reality, of course, the situation is much more complex than the simple 2 x 2 case considered above. Not only do we have a general r x c case, for some $r \ge 2$, $c \ge 2$, but also there are many sites, resulting in a very high dimensional contingency table, with possibly small counts in many of the cells formed in this multiple categorization. The general concept of a consensus cell or even a multitude of consensus pairs needs to be appraised properly. Secondly, in view of this identification, the proper conditioning arguments in favor of the distribution theory of suitable test statistics need to be examined. Further, in formulating suitable null hypotheses and their alternatives, we need to keep in mind that in general nonnormal or categorical multivariate laws, there might not be an explicit relationship between measures of pairwise independence and higher order or total independence. Therefore, a null hypothesis has to be chosen on the basis of the set objectives of the study and their statistical resolutions. There are at least two other important factors we should keep in mind. First, there may not be a spatial proximity of the different sites, and there are too many such sites that have been identified on merely biological or genetical observations. Hence, reducing the number of sites to a canonical set may not be feasible in all such studies. Working with too many sites may not only require an enormously large number of sequences in order to achieve an appropriate margin of sampling errors but also the development of some summary measures that reflect the impact of all these sites in a comprehensive manner. Secondly, even for the simple 2 x 2 case treated above, it may not always be prudent to assume that the observations in a sequence are all stochastically independent. This drawback has been discussed in the literature in the context of HIV problems, and some researchers have advocated the use of Markov chains of appropriate dimension. While that can be worked out to a certain extent for introducing some dependence patterns to the sequence data, there is still a big controversy on the use of Markov chains or Markov fields to adequately address the dependence pattern for the different sites for a single observation. This difficulty primarily arises from the lack of any geographical proximity of the sites or any other norm that could order them or single them out in suitable neighborhoods (as is done in neuronal network studies relating to the CNS (central nervous system) or the cortex). Statisticians and biomathematicians need to have a better understanding of the genome complex from molecular geneticists and biologists in order to devise more appropriate statistical tools for valid and fruitful statistical analysis of molecular genetical data models.
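The remark that pairwise independence need not imply total independence for categorical laws can be illustrated with a standard textbook construction, which is not taken from the HIV literature cited above: two independent fair binary sites and a third site equal to their exclusive-or. The function names below are hypothetical, and exact rational arithmetic is used so that the factorization check involves no rounding.

```python
from fractions import Fraction
from itertools import product

# Joint law of three binary sites: X1 and X2 are fair and independent,
# and X3 = X1 XOR X2 (an assumed toy construction, for illustration only).
joint = {(a, b, a ^ b): Fraction(1, 4) for a, b in product((0, 1), repeat=2)}


def marginal(joint, sites):
    """Marginal law of the listed sites, obtained by summing out the others."""
    out = {}
    for cell, p in joint.items():
        key = tuple(cell[s] for s in sites)
        out[key] = out.get(key, Fraction(0)) + p
    return out


def is_independent(joint, sites):
    """Check whether the law of the listed sites factorizes into one-site marginals."""
    sub = marginal(joint, sites)
    singles = [marginal(joint, (s,)) for s in sites]
    for cell in product((0, 1), repeat=len(sites)):
        expected = Fraction(1)
        for single, value in zip(singles, cell):
            expected *= single.get((value,), Fraction(0))
        if sub.get(cell, Fraction(0)) != expected:
            return False
    return True


for pair in [(0, 1), (0, 2), (1, 2)]:
    print("sites", pair, "independent:", is_independent(joint, pair))
print("all three sites independent:", is_independent(joint, (0, 1, 2)))
```

Every pair of sites passes the factorization check while the triple does not, so a battery of two-site tests alone cannot establish total independence across sites.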

There are many researchers currently engaged in the broad field of bioenvironment and public health (science and practice); their primary emphasis is placed on the identification of hazards from various factors, identification and mensuration of human exposures to such prevailing hazards, their potential growth over time, and in the light of appropriate dose-response (hazards) relationship, to assess the extent of ecological damages to our surviving prospects. This assessment has a genuine need for our survival as well as betterment of life on earth, and we look forward to more meaningful interactions between interdisciplinary scientists to achieve a fruitful resolution. Statistical reasoning, however, occupies a focal point in this respect, and the paradigm is far beyond the conventional parametrics related to model building, as well as drawing useful and relevant conclusions. Non-parametrics would naturally hold the keyword in the much anticipated statistical developments in these vital bioenvironment and public health disciplines.

References

Adichie, J. N. (1978). Rank tests for subhypotheses in the general linear regression. Ann. Statist. 6, 1012-1026.

Agresti, A. (1990). Categorical Data Analysis, John Wiley, New York. Andersen, P. K., O. Borgan, R. D. Gill and N. Keiding (1993). Statistical Models Based on Counting

Processes'. Springer-Verlag, New York. Armitage, P., C. K. McPherson and B. C. Rowe (1969). Repeated significance tests on accumulating

data. J. Roy. Statist. Soc. A 132, 235-244. Armitage, P. (1991). Interim analysis in clinical trials. Statist. Med. 10, 925-937. Bahadur, R. R. (1961). A representation of the joint distribution of responses to n dichotomous items.

In: Studies in Item Analysis and Prediction (Ed., H. Solomon) pp. 158-176. Stanford Univ. Press, Calif.

Brown, G. W. and A. M. Mood (1951). On median tests for linear hypotheses. In Proc. 2nd Berkeley Symp. Math. Statist. Prob. (Ed., J. Neyman), vol. 1, pp. 159-166. Univ. Calif. Press, Los Angeles.

Carroll, R. J., D. Ruppert and L. A. Stefanski (1995). Nonlinear Measurement Error Models, Chapman and Hall, London.

Chakraborti, R. and C. R. Rao (2000). Selection biases of samples and their resolutions. In Handbook of Statistics, Vol. 18: Bioenvironmental and Public Health Statistics (Eds., C. R. Rao and P. K. Sen), pp. 673-712. Elsevier, Amsterdam.

Clegg, L. X., J. Cai and P. K. Sen (2000). Modeling multivariate failure time data. In Handbook of Statistics, Vol. 18: Bioenvironmental and Public Health Statistics (Eds., C. R. Rao and P. K. Sen), pp. 803-838. North Holland, Amsterdam.

Chatterjee, S. K. (1966). A bivariate sign-test for location. Ann. Math. Statist. 37, 1771-1781.

Chatterjee, S. K. and P. K. Sen (1964). Nonparametric tests for the bivariate two-sample location problem. Calcutta Statist. Assoc. Bull. 13, 18-58.

Chatterjee, S. K. and P. K. Sen (1965). Nonparametric tests for the bivariate two-sample association problem. Calcutta Statist. Assoc. Bull. 14, 14-34.

Chatterjee, S. K. and P. K. Sen (1966). Nonparametric tests for the multivariate multisample location problem. In Essays in Probability and Statistics in Memory of S. N. Roy (Eds., R. C. Bose et al.), pp. 197-228. Univ. N. Carolina Press, Chapel Hill, NC.

Chatterjee, S. K. and P. K. Sen (1973). Nonparametric testing under progressive censoring. Calcutta Statist. Assoc. Bull. 22, 13-50.

Chen-Mok, M. and P. K. Sen (1999). Nondifferentiable dose-compliance error logistic models. Comm. Statist. Theor. Meth. 28, 931-946.

Cox, D. R. (1972). Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser. B 34, 187-220.

Cox, D. R. (1975). Partial likelihood. Biometrika 62, 269-276.

Davidson, R. and R. A. Bradley (1970). Multivariate paired comparisons: Some large sample results on estimation and tests of equality of preference. In Nonparametric Techniques in Statistical Inference (Ed., M. L. Puri), pp. 111-125. Cambridge Univ. Press, New York.

DeLong, D. M. (1981). Crossing probabilities for a square root boundary by a Bessel process. Comm. Statist. Ser. A 10, 2197-2213.

DeLong, E. R. and D. M. DeLong (2000). Statistical applications in cardiovascular diseases. In Handbook of Statistics, Vol. 18: Bioenvironmental and Public Health Statistics (Eds., C. R. Rao and P. K. Sen), pp. 915-940. North Holland, Amsterdam.

DeMasi, R. A. (2000). Statistical methods for multivariate failure time data and competing risks. In Handbook of Statistics, Vol. 18. Bioenvironmental and Public Health Statistics (Eds., C. R. Rao and P. K. Sen), pp. 749-782. North Holland, Amsterdam.

DeMasi, R., B. Qaqish and P. K. Sen (1997). Statistical models and asymptotic results for multivariate failure time data with generalized competing risks. Sankhyā, Ser. A 59, 408-435.

DeMets, D. L. and K. K. G. Lan (1994). Interim analysis: The alpha spending approach. Statist. Med. 13, 1341-1352.

Diggle, P. J., K. Y. Liang and S. L. Zeger (1997). Analysis of Longitudinal Data, Oxford Univ. Press, Oxford, UK.

Durbin, R., S. Eddy, A. Krogh and G. Mitchison (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge Univ. Press, UK.

El-Moalem, H. and P. K. Sen (1998). Nonparametric recovery of interblock information in clinical trials with a surrogate endpoint. J. Statist. Plan. Infer. 72, 185-205.

Finney, D. J. (1964). Statistical Method in Biological Assay, 2nd ed. Charles Griffin, London.

Fisher, R. A. (1932). Statistical Methods for Research Workers, Oliver-Boyd, Edinburgh.

Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Amer. Statist. Assoc. 32, 675-701.

Ghosh, M., J. E. Grizzle and P. K. Sen (1973). Nonparametric methods in longitudinal studies. J. Amer. Statist. Assoc. 68, 29-36.

Green, P. J. and B. W. Silverman (1994). Nonparametric Regression and Generalized Linear Models, Chapman-Hall, London.

Gutenbrunner, C. and J. Jurečková (1992). Regression rank scores and regression quantiles. Ann. Statist. 20, 305-330.

Hastie, T. J. and R. J. Tibshirani (1990). Generalized Additive Models, Chapman and Hall, London.

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statist. 19, 293-325.

Jaeckel, L. A. (1972). Estimating regression coefficients by minimizing the dispersion of the residuals. Ann. Math. Statist. 43, 1449-1458.

Jennison, C. and B. W. Turnbull (1990). Statistical approaches to interim monitoring of medical trials; a review and commentary. Statist. Sci. 3, 299-317.

Jurečková, J. and P. K. Sen (1996). Robust Statistical Procedures: Asymptotics and Interrelations, John Wiley, New York.

Karnoub, M. C., F. Seillier-Moiseiwitsch and P. K. Sen (1999). A conditional approach to the detection of correlated mutations. Inst. Math. Statist. Lect. Notes and Mon. Ser. 33, 221-235.

Kiefer, J. and J. Wolfowitz (1952). Stochastic estimation of the maximum of a regression function. Ann. Math. Statist. 23, 462-466.

Kim, H. and P. K. Sen (2000). Robustness in bioassays and bioequivalence studies. Sankhyā, Ser. B 62, in press.

Klein, J. P. and R. A. Johnson (1999). Regression model for survival data. In Handbook of Statistics, Vol. 18: Bioenvironmental and Public Health Statistics (Eds., C. R. Rao and P. K. Sen), pp. 161-192. North Holland, Amsterdam.

Koenker, R. and G. Bassett (1978). Regression quantiles. Econometrica 46, 33-50.

Koziol, J. A., D. A. Maxwell, M. Fukushima, M. E. M. Colmerauer and Y. H. Pilch (1981). A distribution-free test for tumor growth curve analysis with applications to an animal tumor immunotherapy experiment. Biometrics 37, 383-390.

Kruskal, W. H. and W. A. Wallis (1952). Use of ranks in one-criterion variance analysis. J. Amer. Statist. Assoc. 47, 583-621.

Laird, N. M., G. M. Fitzmaurice and A. G. Schwartz (2000). The analysis of case-control data: Epidemiologic studies of familial aggregation. In Handbook of Statistics, Vol. 18: Bioenvironmental and Public Health Statistics (Eds., C. R. Rao and P. K. Sen), pp. 465-482. North Holland, Amsterdam.

Lan, K. K. G. and D. L. DeMets (1983). Discrete sequential boundaries for clinical trials. Biometrika 70, 659-663.

Lehmann, E. L. (1953). The power of rank tests. Ann. Math. Statist. 24, 23-43.

Liang, K. Y. and S. Zeger (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13-22.

Light, R. J. and B. H. Margolin (1971). An analysis of variance for categorical data. J. Amer. Statist. Assoc. 66, 534-544.

Light, R. J. and B. H. Margolin (1974). An analysis of variance for categorical data, II: Small sample comparisons with chi-square and other competitors. J. Amer. Statist. Assoc. 69, 755-764.

Lyles, R. and L. L. Kupper (2000). Measurement error models for environmental and occupational health applications. In Handbook of Statistics, Vol. 18: Bioenvironmental and Public Health Statistics (Eds., C. R. Rao and P. K. Sen), pp. 501-517. North Holland, Amsterdam.

Maguar, D. and V. M. Chinchilli (2000). Methods for establishing in vitro-in vivo relationships for modified release drug products. In Handbook of Statistics, Vol. 18: Bioenvironmental and Public Health Statistics (Eds., C. R. Rao and P. K. Sen), pp. 975-1002. North Holland, Amsterdam.

Majumdar, H. and P. K. Sen (1978). Nonparametric tests for multiple regression under progressive censoring. J. Multivar. Anal. 8, 73-95.

McCullagh, P. and J. Nelder (1989). Generalized Linear Models, 2nd ed. Chapman Hall, London.

Nayak, T. K. (1986a). Sampling distributions in analysis of diversity. Sankhyā B 48, 1-9.

Nayak, T. K. (1986b). An analysis of diversity using Rao's quadratic entropy. Sankhyā B 48, 315-330.

Nelder, J. A. and R. W. M. Wedderburn (1972). Generalized linear models. J. Roy. Statist. Soc. Ser. A 135, 370-384.

O'Brien, P. C. and T. R. Fleming (1979). A multiple testing procedure for clinical trials. Biometrics 35, 549-556.

Ohanian, E. V., J. A. Moore, J. R. Fowle III, G. S. Omenn, S. C. Lewis, G. M. Gray and D. W. North (1997). Risk characterization: A bridge to informed decision making (Workshop Overview). Fundamen. Appl. Toxicol. 39, 81-88.

Pedroso de Lima, A. C. and P. K. Sen (1997). A matrix valued counting process model with first-order interactive intensity. Ann. Appl. Prob. 7, 494-507.

Pedroso de Lima, A. C. and P. K. Sen (1999). Time-dependent coefficients in a multi-event model for survival data. J. Statist. Plann. Infer. 75, 393-414.

Pepe, M. S. (1992). Inference using surrogate outcome data and a validation sample. Biometrika 79, 355-365.

Peto, R., M. C. Pike, P. Armitage, et al. (1976). Design and analysis of randomized clinical trials requiring prolonged observations of each patient: I. Introduction and design. Brit. J. Cancer 43, 153-162.

Pinheiro, H. P., F. Seillier-Moiseiwitsch, P. K. Sen and J. Eron (1999). Multivariate CATANOVA and applications to DNA sequences in categorical data. In this volume, pp. 713-746.

Pocock, S. J. (1977). Group sequential methods in the design and analysis of clinical trials. Biometrika 64, 191-199.

Potthoff, R. F. and S. N. Roy (1964). A generalized multivariate analysis of variance model especially useful for growth curve problems. Biometrika 51, 313-326.

Prentice, R. L. (1989). Surrogate endpoints in clinical trials: Definition and operational criteria. Statist. Med. 8, 431-440.

Puri, M. L. and P. K. Sen (1971). Nonparametric Methods in Multivariate Analysis, John Wiley, New York.

Puri, M. L. and P. K. Sen (1985). Nonparametric Methods in General Linear Models, John Wiley, New York.

Rao, C. R. (1982a). Gini-Simpson index of diversity: A characterization, generalization and applications. Utilitas Mathematica 21, 273-282.

Rao, C. R. (1982b). Diversity and dissimilarity coefficients: A unified approach. Theor. Popul. Biol. 21, 24-43.

Rao, C. R. (1982c). Diversity: Its measurement, decomposition, apportionment and analysis. Sankhyā A 44, 1-21.

Robbins, H. and S. Monro (1951). A stochastic approximation method. Ann. Math. Statist. 22, 400-407.

Roy, S. N. (1957). Some Aspects of Multivariate Statistical Analysis, John Wiley, New York/Statist. Pub. Calcutta

Roy, S. N., R. Gnanadesikan and J. N. Srivastava (1971). Design and Analysis of Some Multifactor and Multiresponse Experiments. Pergamon Press, New York.

Sen, P. K. (1963). On the estimation of relative potency in dilution (-direct) assays by distribution-free methods. Biometrics 19, 532-552.

Sen, P. K. (1964). Tests for the validity of the fundamental assumption in dilution (-direct) assays by distribution-free methods. Biometrics 20, 770-784.

Sen, P. K. (1965). Some further applications of nonparametric methods in dilution (-direct) assays. Biometrics 21, 799-810.

Sen, P. K. (1968a). Estimates of the regression coefficient based on Kendall's tau. J. Amer. Statist. Assoc. 63, 1379-1389.

Sen, P. K. (1968b). On a class of aligned rank order tests in two-way layouts. Ann. Math. Statist. 39, 1115-1124.

Sen, P. K. (1968c). Robustness of some nonparametric procedures in linear models. Ann. Math. Statist. 39, 1913-1922.

Sen, P. K. (1968d). Asymptotically efficient tests by the method of n-rankings. J. Roy. Statist. Soc. Ser. B 30, 312-317.

Sen, P. K. (1969). On a class of rank order tests for the parallelism of several regression lines. Ann. Math. Statist. 40, 1668-1683.

Sen, P. K. (1970). Nonparametric inference in replicated 2^m factorial experiments. Ann. Inst. Statist. Math. 22, 281-294.

Sen, P. K. (1971). Robust statistical procedures in problems of linear regression with special reference to quantitative bio-assays, I. Internat. Statist. Rev. 39, 21-38.

Sen, P. K. (1972). Robust statistical procedures in problems of linear regression with special reference to quantitative bio-assays, II. Internat. Statist. Rev. 40, 161-172.

Sen, P. K. (1973). Some aspects of nonparametric methods in multivariate statistical analysis. In Multivariate Statistical Analysis (Eds., D. G. Kabe and R. P. Gupta), pp. 230-240. North Holland, Amsterdam.

Sen, P. K. (1981). Sequential Nonparametrics: Invariance Principles and Statistical Inference, John Wiley, New York.

Sen, P. K. (1984). Nonparametric procedures for some miscellaneous problems. In Handbook of Statistics, Vol. 4: Nonparametric Methods (Eds., P. R. Krishnaiah and P. K. Sen), pp. 699-740. North Holland, Amsterdam.

Sen, P. K. (1985). Theory and Applications of Sequential Nonparametrics, CBMS-NSF SIAM Publication, No. 49, Philadelphia.

Sen, P. K. (1988). Functional jackknifing: Rationality and general asymptotics. Ann. Statist. 16, 450-469.

Sen, P. K. (1991). Repeated significance tests in frequency and time domains. In Handbook of Sequential Analysis (Eds., B. K. Ghosh and P. K. Sen), pp. 169-198. Marcel Dekker, New York.

Sen, P. K. (1993). Perspectives in multivariate nonparametrics: Conditional functionals and ANOCOVA models. Sankhyā, Ser. A 55, 516-532.

Sen, P. K. (1994a). Incomplete multiresponse designs and surrogate endpoints in clinical trials. J. Statist. Plan. Infer. 42, 161-186.

Sen, P. K. (1994b). Some change-point problems in survival analysis: Relevance of nonparametrics in applications. J. Appl. Statist. Sci. 1, 425-444.

Sen, P. K. (1994c). Bridging the biostatistics-epidemiology gap: the Bangladesh task. J. Statist. Res. 28, 21-39.

Sen, P. K. (1994d). Incomplete multiresponse designs and surrogate endpoints in clinical trials. J. Statist. Plan. Infer. 42, 161-186.

Sen, P. K. (1995a). Censoring in theory and practice: Statistical perspectives and controversies. In Analysis of Censored Data, IMS Lecture Notes Monogr. Ser. 27 (Eds., J. V. Deshpande and H. L. Koul), pp. 177-192.

Sen, P. K. (1995b). Paired comparisons for multiple characteristics: An ANOCOVA approach. In: Statistical Theory and Applications: Papers in Honor of Herbert A. David (Eds., H. N. Nagaraja, P. K. Sen and D. F. Morrison), pp. 247-264. Springer-Verlag, New York.

Sen, P. K. (1995c). Statistical analysis of some reliability models: parametrics, semiparametrics and nonparametrics. J. Statist. Plan. Infer. 43, 41-66.

Sen, P. K. (1996a). Regression rank scores estimation in ANOCOVA. Ann. Statist. 24, 1586-1602.

Sen, P. K. (1996b). Robust and nonparametric estimation in linear models with mixed effects. Tatra Mount. Math. Publ. 7, 231-243.

Sen, P. K. (1996c). Generalized linear models in biomedical applications. In Applied Statistical Sciences, I (Eds., M. Ahsanullah and D. Bhoj), pp. 1-22. Nova Publ., New Jersey.

Sen, P. K. (1997). A critical appraisal of generalized linear models in biostatistical analysis. J. Appl. Statist. Sci. 5, 69-83.

Sen, P. K. (1999a). Robust nonparametrics in mixed-MANOVA models. J. Statist. Plan. Infer. 75, 433-451.

Sen, P. K. (1999b). Multiple comparisons in interim analysis. J. Statist. Plan. Infer. 82, 5-23.

Sen, P. K. (1999c). Utility-oriented Simpson-type indexes and inequality measures. Calcutta Statist. Assoc. Bull. 49, 1-22.

Sen, P. K. (1999d). Generalized linear and additive models: Robustness perspectives. Revista Brasileira de Probabilidade e Estatistica 13, 91-112.

Sen, P. K. (2000). Bioenvironment and public health: Statistical perspectives. In Handbook of Statistics, Vol. 18: Bioenvironmental and Public Health Statistics (Eds., C. R. Rao and P. K. Sen), pp. 3-29. North Holland, Amsterdam.

Sen, P. K. and H. A. David (1968). Paired comparisons for paired characteristics. Ann. Math. Statist. 39, 200-208.

Sen, P. K. and B. H. Margolin (1995). Inhalation toxicology: Awareness, identifiability, statistical perspectives and risk assessments. Sankhyā, Ser. B 57, 252-276.

Sen, P. K. and M. L. Puri (1967). On the theory of rank order tests for location in the multivariate one sample problem. Ann. Math. Statist. 38, 1216-1228.

Sen, P. K. and M. L. Puri (1977). Asymptotically distribution-free aligned rank order tests for composite hypotheses for general multivariate linear models. Zeit. Wahrsch. verw. Geb. 39, 175-186.

Sen, P. K. and J. M. Singer (1993). Large Sample Methods in Statistics: An Introduction with Applications, Chapman Hall, New York.

Senn, S. (1993). Cross-over Trials in Clinical Research. John Wiley, New York.

Simpson, E. H. (1949). Measurement of diversity. Nature 163, 688.

Singer, J. M. and A. Dalton (2000). Analysis of longitudinal data. In Handbook of Statistics, Vol. 18: Bioenvironmental and Public Health Statistics (Eds., C. R. Rao and P. K. Sen), pp. 115-160. North Holland, Amsterdam.

Tsiatis, A., V. DeGruttola and M. Wulfsohn (1995). Modeling the relationship of survival to longitudinal data measured with error, application to survival and CD4 counts in patients with AIDS. J. Amer. Statist. Assoc. 90, 27-37.

Tudor, G., G. G. Koch and D. Catellier (2000). Statistical methods for crossover designs in bioenvironmental and public health. In Handbook of Statistics, Vol. 18: Bioenvironmental and Public Health Statistics (Eds., C. R. Rao and P. K. Sen), pp. 571-613. North Holland, Amsterdam.

von Mises, R. (1947). On the asymptotic distribution of differentiable statistical functions. Ann. Math. Statist. 18, 309-348.

Vonesh, E. F. and V. M. Chinchilli (1997). Linear and Nonlinear Models for the Analysis of Repeated Measurements. Marcel Dekker, New York.

Wald, A. (1947). Sequential Analysis, Wiley, New York.

Wedderburn, R. W. M. (1974). Quasi-likelihood function, generalized linear models, and the Gauss-Newton method. Biometrika 45, 939-955.

Wei, L. J., J. Q. Su and J. M. Lachin (1990). Interim analyses with repeated measurements in a sequential clinical trial. Biometrika 77, 359-364.

Weller, E., L. Ryan and D. Dockery (2000). Statistical issues in inhalation toxicology. In Handbook of Statistics, Vol. 18: Bioenvironmental and Public Health Statistics (Eds., C. R. Rao and P. K. Sen), pp. 423-440. North Holland, Amsterdam.

Westlake, W. J. (1988). Bioavailability and bioequivalence of pharmaceutical formulations. In Biopharmaceutical Statistics for Drug Development (Ed., K. E. Peace), Marcel Dekker, New York.

Whittemore, A. S. (1995). Logistic regression of family data from case-control studies. Biometrika 82, 57-67.

Wu, M. C. and K. K. G. Lan (1992). Sequential monitoring for comparison of changes in a response variable in clinical trials. Biometrics 48, 765-780.

Zhao, L. P. and R. L. Prentice (1990). Correlated binary regression using a quadratic exponential model. Biometrika 77, 642-648.