
Using the SAS® System to Assess Multivariate Normality

Richard P. Steiner, University of Akron, Akron, OH

ABSTRACT

Assessment of multivariate normality is an important and difficult problem for those applying multivariate statistical methods. No single graphical or inferential technique alone seems adequate for identifying the many possible types of departure from multivariate normality. Therefore, this paper discusses four tools for assessing multivariate normality, and presents easy-to-use SAS programs for their implementation.

INTRODUCTION

Less than three decades ago multivariate statistical methods were little more than a theoretical curiosity. For even modest numbers of variables and cases, the calculations were prohibitive. With the advent of electronic computers all that changed, and with the advent of powerful statistical software packages such as SAS, multivariate statistics became readily available to a broad spectrum of users.

Along with that availability came the need to assess whether the theoretical assumptions of multivariate methods were being reasonably met in practice. An underlying assumption common to many multivariate methods is that of multivariate normality, just as univariate normality is assumed by many univariate techniques (e.g., t-tests, F-tests). Multivariate data may be represented as a p × 1 vector of random variables X = (X_1, X_2, ..., X_p)′, with mean vector μ′ = (μ_1, μ_2, ..., μ_p) and p × p variance-covariance matrix Σ = [σ_ij]. The assumption of multivariate normality is that X ~ N_p(μ, Σ).

Methods for assessing multivariate normality may be classified by two dichotomies: 1) extensions of methods for univariate normality vs strictly multivariate procedures (Koziol, 1986); and 2) graphical vs inferential techniques. No single procedure alone seems adequate for identifying the many possible types of departure from multivariate normality. Different procedures are more sensitive to particular kinds of nonnormality than others. An enlightening approach, although one which should be viewed as essentially exploratory, is to construct a battery of tests and plots. If all are indicative of multivariate normality, one may consider it a reasonable assumption. If there are indications of nonnormality, then the procedure that suggested it may also provide insight into the type of departure.

This paper discusses several tools for assessing multivariate normality, and presents easy-to-use SAS programs for their implementation. The programs employ existing SAS procedures (CORR, PLOT, UNIVARIATE, RANK), DATA step programming, and PROC MATRIX programming. They were developed in the OS and CMS environments, but should be applicable in any mainframe environment.

METHODS

Strategies

At the center of any approach to assessing multivariate normality is the problem of reducing dimensionality. Graphical techniques need to reduce the p-dimensional observations to one or two dimensions for plotting, while inferential methods are aimed at producing scalar test statistics with tractable null distributions.

The following properties of the multivariate normal distribution have been exploited in the development of methods for assessing multivariate normality:

(1) If X ~ N_p(μ, Σ), then each X_i ~ N(μ_i, σ_ii), i = 1, ..., p. That is, if X is multivariate normal, then each component of X will be univariate normal. Thus marginal univariate normality is a necessary condition for multivariate normality, but it is not sufficient.

(2) If X ~ N_p(μ, Σ), then the quadratic forms d² = (X − μ)′ Σ⁻¹ (X − μ) have a chi-square distribution with p degrees of freedom (d² ~ χ²_p). The d² are called squared radii or squared statistical distances.
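In practice μ and Σ are unknown, so the procedures described below work with sample squared radii, in which the parameters are replaced by the sample mean vector x̄ and sample variance-covariance matrix S. The sample version implied by the implementation sections that follow is

   d_i² = (x_i − x̄)′ S⁻¹ (x_i − x̄),   i = 1, ..., n,

where x_i is the i-th observation vector.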

Assessing Univariate Normality of Components

The SAS system directly provides a useful tool for assessing univariate normality, PROC UNIVARIATE. Coefficients of skewness and kurtosis are produced. With the PLOT option a plot of observed sample quantiles vs expected normal quantiles (quantile vs quantile, or Q-Q, plot) is also generated. The NORMAL option produces the Shapiro and Wilk (1965) test for normality (n ≤ 50) or Lilliefors' (1967) modification of the Kolmogorov-Smirnov goodness of fit test (n > 50). If any of the components of X are nonnormal, there is evidence that X is not multivariate normal; however, if all components are univariate normal, this does not ensure multivariate normality (Property 1).
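A minimal sketch of such a call is shown below; the data set MYDATA and the variables X1-X3 are placeholder names, not names taken from the paper:

   proc univariate data=mydata normal plot;
      var x1 x2 x3;   /* skewness, kurtosis, normality test, and Q-Q plot for each component */
   run;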

Chi-square Plot

Healy (1968) discussed a probability plot for multivariate normality based on Property 2. Sample squared radii are plotted vs corresponding quantiles of χ²_p. Multivariate normality is indicated by a near linear plot. This type of plot is sometimes called a chi-square plot.

The technique was implemented in the SAS system by using PROC CORR to output a data set containing the sample mean vector and variance-covariance matrix. This data set was input to PROC MATRIX, where the d_i² were computed. The d_i² were then output to the RANK procedure to produce their empirical cumulative distribution function (cdf). The CINV function was then applied to this empirical cdf to produce the appropriate quantiles of χ²_p. PROC PLOT was used to generate the final plot.
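A sketch of the essential computation follows, under a few assumptions of mine: it uses PROC IML (the successor to PROC MATRIX), it computes the mean vector and covariance matrix directly rather than importing them from PROC CORR, the data set MYDATA and variables X1-X3 are placeholders, and the plotting position (rank − 0.5)/n is a common choice the paper does not specify.

   proc iml;
      use mydata;                                  /* placeholder data set name          */
      read all var {x1 x2 x3} into X;              /* placeholder variable list          */
      n = nrow(X);   p = ncol(X);
      xbar = X[:,];                                /* 1 x p sample mean vector           */
      C = X - j(n,1,1)*xbar;                       /* centered data matrix               */
      S = C`*C / (n-1);                            /* sample variance-covariance matrix  */
      d2 = vecdiag(C * inv(S) * C`);               /* squared radii d_i^2                */
      r = rank(d2);                                /* ranks give the empirical cdf       */
      q = cinv((r - 0.5)/n, p);                    /* matching chi-square quantiles      */
      create chiplot var {d2 q};   append;         /* save for plotting                  */
      close chiplot;
   quit;

   proc plot data=chiplot;
      plot d2*q;                                   /* near linear plot suggests multivariate normality */
   run;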

Kolmogorov-Smirnov Goodness of Fit Test on Squared Radii

Malkovich and Afifi (1973) proposed testing multivariate normality by testing the goodness of fit of the d_i² to χ²_p using the Kolmogorov-Smirnov (K-S) test. A significant test statistic (D) suggests lack of multivariate normality. However, since μ and Σ are estimated from sample data for computing the d_i², the null distribution of D is not the same as for the case with no parameters estimated (Koziol, 1986). Analytic results for the distribution of D do not yet exist and numerical results are limited. The test using the usual K-S distribution is probably conservative.

For implementing this test in the SAS system the d_i² were computed as described above. The computation of the K-S D statistic and its P-value were programmed using the MATRIX procedure. The P-value was computed using an algorithm due to Smirnov (1948), which assumes no estimated parameters.
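A sketch of the D computation in PROC IML is given below. It reads the squared radii saved in the CHIPLOT data set created by the previous sketch (p = 3 matches that placeholder variable list), and the P-value shown uses the standard asymptotic series for the Kolmogorov distribution rather than Smirnov's (1948) algorithm, so it only approximates what the paper computes.

   proc iml;
      use chiplot;   read all var {d2};   close chiplot;   /* squared radii from previous sketch */
      p = 3;                                       /* placeholder: number of variables    */
      n = nrow(d2);
      s = d2;   call sort(s, 1);                   /* order statistics of the d_i^2       */
      F = probchi(s, p);                           /* fitted chi-square cdf               */
      i = T(1:n);
      D = max(i/n - F, F - (i-1)/n);               /* two-sided K-S statistic             */
      k = T(1:100);                                /* asymptotic P-value, no parameters estimated */
      pval = 2 # sum( (-1)##(k-1) # exp(-2 # k##2 # n # D##2) );
      print D pval;
   quit;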

Multivariate Generalizations of the Shapiro-Wilk Statistic

Royston (1983) proposed a test of multivariate normality based on the combined information in the p separate univariate Shapiro-Wilk W-statistics. The technique involves a normalizing transformation of the univariate Ws, which are then combined to form a chi-square test statistic, H.

PROC UNIVARIATE was used to produce a data set containing the univariate W-statistics (n ≤ 50). For n > 50, DATA step programming, PROC RANK, and the MATRIX procedure were used to obtain W-statistics employing Royston's (1982) large sample method. The calculation of H was done primarily in PROC MATRIX. The P-value of H was obtained with the PROBCHI function.
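One way to capture the univariate W-statistics in a data set for small samples is sketched below; the data set and variable names are placeholders, and the NORMALTEST output keyword returns the Shapiro-Wilk W only for small n, so this is an approximation of the step described above rather than the paper's original code.

   proc univariate data=mydata noprint;
      var x1 x2 x3;
      output out=wstats
             normaltest=w1 w2 w3      /* univariate Shapiro-Wilk W statistics */
             probn=pw1 pw2 pw3;       /* corresponding P-values               */
   run;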

A multivariate generalization (W*) of the Shapiro-Wilk statistic was developed by Malkovich and Afifi (1973). This test is based on Roy's union-intersection principle. The MATRIX and RANK procedures allowed implementation of this test. W* has the same null distribution as the univariate Shapiro-Wilk statistic (W). Royston's (1982) method of finding significance levels of W was used to find the P-value of W*. Royston claims this method works well for n > 6. For 3 ≤ n ≤ 6, the P-value was obtained by linear interpolation in the table of critical values for the univariate W-statistic published by Shapiro and Wilk (1965).

USING THE PROGRAMS

The programs are easy to use. They are packaged in a statement macro named MNORM, which contains several other modules written in the SAS macro language. Any or all of the methods for multivariate normality described in this paper may be requested, as well as the marginal univariate techniques available in PROC UNIVARIATE. The macro MNORM may be invoked as follows:

MNORM DATA = SASdatasetname VAR = varlist [CHIPLOT MAKS MASW ROYSTON UNI] [ALL] PRINT = [DSQ OME];

Where CHIPLOT requests the chi-square plot, MAKS requests Malkovich and Afifi's K-S D, MASW requests W*, ROYSTON requests H, UNI requests PROC UNIVARIATE output for each variable on varlist, and ALL requests all of the above output. If no analysis requests (e.g., CHIPLOT, MAKS, ..., UNI) are made, none are performed.

The DATA option defaults to the last SAS data set created (_LAST_) and the VAR option defaults to all numeric variables in the data set (_NUMERIC_). The PRINT option defaults to no additional output, but may be set to request a casewise listing of the squared radii d_i² (DSQ) and/or the observed minus expected cdfs of the sample d_i² (OME).
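As an illustration only (the data set and variable names here are hypothetical, following the syntax shown above), a call requesting the chi-square plot, both Malkovich and Afifi tests, and Royston's H might be:

   MNORM DATA = MYDATA VAR = X1 X2 X3 CHIPLOT MAKS MASW ROYSTON;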

REFERENCES

Healy, M. J. R. (1968) Multivariate normal plotting. Appl. Statist., 17: 157-161.

Koziol, J. A. (1986) Assessing multivariate normality: A compendium. Commun. Statist.-Theor. Meth., 15: 2763-2783.

Lilliefors, H. W. (1967) On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J. Amer. Statist. Assoc., 62: 399-402.

Malkovich, J.F. and A.A. Afifi. (1973) On tests for multivariate normality. J. Amer. Statist. Assoc., 68: 176-179.

Royston, J.P. (1982) An extension of Shapiro and Wilk's W test for normality to large samples. Appl. Statist., 31: 115-124.

Royston, J.P. (1983) Some techniques for assessing multivariate normality based on the Shapiro-Wilk W. Appl. Statist., 32: 121-133.

Shapiro, S.S. and M.B. Wilk. (1965) An analysis of variance test for normality (complete samples). Biometrika, 52: 591-611.

Smirnov, N.V. (1948) Table for estimating the goodness of fit of empirical distributions. Ann. Math. Statist., 19: 279-281.