SAS Procedures

Anil Kumar

PROC tabulate

• Summarize the data in the form of a well organized table

• Syntax:

PROC tabulate DATA=dataname;

CLASS class variables;

VAR variables;

TABLE page, row, column description/options;

RUN;

PROC tabulate – example (1)

proc tabulate data=sashelp.Class;
  class sex;
  var height weight;
  table sex, height weight;
run;

Result:

PROC tabulate – example (2)

proc tabulate data=sashelp.Class;
  class sex;
  var height weight age;
  table sex all, (age height weight)*(std mean sum);
run;

Result:

Gplot – A simple example

• The SAS/GRAPH module features the flexible PROC GPLOT

• A simple example:

proc gplot data=sashelp.Class;
  symbol i=none v=star;
  plot height*weight;
run;
quit;

Resulting graph

Gplot – further example

• The following example shows more flexibility of the procedure

goptions reset=all;
symbol1 color=green i=join v=diamond line=1 w=2 h=2;
symbol2 color=red i=join v=star line=2 w=2 h=2;
legend1 down=1 across=1 position=(top center inside) cshadow=blue frame
        value=(f=duplex) label=(font=duplex h=1.5);
title f=zapf color=blue h=5pct 'Testing the graph';
proc gplot data=sashelp.Class;
  plot Height*Weight=Sex / hminor=0 legend=legend1;
run;
quit;

General Outline of Model Choices

| Model | Output Variable | Types of Inputs | Assumptions |
|---|---|---|---|
| ANOVA | Interval | Categorical, Fixed Effects Only | Normality |
| REG | Interval | Interval, Fixed Effects Only | Normality |
| LOGISTIC | Binary | Categorical, Interval, Fixed Effects Only | Log-Normal |
| GLM | Interval | Categorical, Interval, Fixed Effects Only | Normality |
| GENMOD | Categorical, Interval | Categorical, Interval, Fixed Effects Only | Exponential Family |
| MIXED | Interval | Categorical, Interval, Random Effects | Normality |
| GLIMMIX | Categorical, Interval | Categorical, Interval, Random Effects | Exponential Family |

Source: Cerrito (2005)

PROC REG

• Inputs and output are interval

• Ordinal data may be included

• Assumptions on the error term:
  – Normally distributed
  – Has mean zero and constant variance
  – Is independent

• Residual analysis should be a routine part of the analysis

Residuals

• The studentized residual, the RSTUDENT statistic, is similar to the standardized residual except that the mean square error is calculated omitting the observation.

• Observations with studentized residual absolute values of greater than 2 are potential outliers.
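The rule of thumb above can be sketched in a few lines of SAS. This is only an illustration on sashelp.Class; the output data set names (work.resid, work.outliers) and the variable names pred, res, and rstud are chosen here and are not part of the original slides.

```sas
/* Fit a simple regression and save studentized residuals */
proc reg data=sashelp.class;
  model weight = height / r influence;   /* R and INFLUENCE print residual diagnostics */
  output out=work.resid p=pred r=res rstudent=rstud;
run;
quit;

/* Keep only observations whose |RSTUDENT| exceeds 2 (potential outliers) */
data work.outliers;
  set work.resid;
  if abs(rstud) > 2;
run;

proc print data=work.outliers;
  var name height weight pred res rstud;
run;
```

An empty work.outliers data set simply means no observation crossed the threshold; the cutoff of 2 is a screening guide, not a formal test.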

Regression Example

Output

Scatterplot With Regression Line

Residuals

PROC ANOVA

• Each treatment should have exactly the same number of observations; that is, every category has the same number of observations (balanced data).

• Caution: If you use PROC ANOVA for analysis of unbalanced data, you must assume responsibility for the validity of the results.

• Use PROC GLM instead.
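A minimal sketch of the PROC GLM alternative, assuming sashelp.Class as the data and Sex as the single treatment factor (both choices are illustrative). The PROC FREQ step is there to check whether the groups are in fact balanced.

```sas
/* Check the group sizes first */
proc freq data=sashelp.class;
  tables sex;
run;

/* One-way analysis that remains valid for unbalanced data */
proc glm data=sashelp.class;
  class sex;
  model height = sex / ss3;        /* Type III sums of squares */
  means sex / hovtest=levene;      /* homogeneity-of-variance check (one-way models) */
  lsmeans sex / pdiff cl;          /* least-squares means and their difference */
run;
quit;
```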

Categorical Procedures


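| Model | Output Variable | Types of Inputs | Assumptions |
|---|---|---|---|
| LOGISTIC | Binary | Categorical, Interval, Fixed Effects Only | Log-Normal |
| CATMOD | Data that can be represented by a two-dimensional contingency table | Raw data, cell counts, or direct input of a covariance matrix | |
| GENMOD | Categorical, Interval | Categorical, Interval, Fixed Effects Only | Exponential Family |
| GLIMMIX | Categorical, Interval | Categorical, Interval, Random Effects | Exponential Family |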

PROC CATMOD

• PROC CATMOD provides a wide variety of categorical data analyses.

• Now that PROC LOGISTIC handles classification variables, there is less of a need to use PROC CATMOD for regression.

• PROC CATMOD should not be used when a continuous input variable has many distinct values.

Output

Logistic Regression

• Binary outcomes

• Allows for any combination of nominal, ordinal, or continuous explanatory variables

• Computes predicted values, the receiver operating characteristic (ROC) curve and an approximation to the area beneath the curve (c), and a number of regression diagnostics

• If the occurrence is rare, use the Poisson distribution in PROC GENMOD.
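As a hedged illustration of the points above, the following sketch fits a binary logistic model to sashelp.Heart. The choice of data set, the event level 'Dead', and the three predictors are assumptions made for this example, not taken from the original presentation.

```sas
ods graphics on;

proc logistic data=sashelp.heart plots(only)=roc;
  class sex / param=ref;                       /* nominal predictor */
  model status(event='Dead') = sex ageatstart cholesterol;
run;
```

The c statistic is reported in the "Association of Predicted Probabilities and Observed Responses" table, and the ROC plot is produced when ODS Graphics is enabled.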

Generalized Linear Models

In generalized linear models the response is assumed to possess a probability distribution of exponential form. That is, the probability density of the response Y for continuous response variables, or the probability function for discrete responses, can be expressed as

$$f(y) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$$

for some functions a, b, and c that determine the specific distribution (omitting some requirements for these functions), where $\theta$ is the natural parameter and $\phi$ is the dispersion parameter. Expressions for the mean and variance are

$$E(Y) = b'(\theta), \qquad \mathrm{Var}(Y) = b''(\theta)\, a(\phi)$$

Note that the exponential family (or form) of distributions constitutes a broad class of probability density functions. Don’t confuse this broad family with the exponential pdf.

Distributions and Associated Default Link Functions Available in PROC GENMOD
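To show how a distribution and its default link are requested, here is a small sketch of a Poisson regression in PROC GENMOD. The data set work.claims and every number in it are made up for illustration only. No LINK= option is given, so the Poisson default (the log link) applies.

```sas
/* Hypothetical event counts with exposure; values are invented */
data work.claims;
  input agegrp $ sex $ exposure events;
  logexp = log(exposure);          /* offset on the log scale */
  datalines;
young F 400  6
young M 450 11
mid   F 600 14
mid   M 550 19
old   F 300 13
old   M 250 17
;
run;

proc genmod data=work.claims;
  class agegrp sex;
  model events = agegrp sex / dist=poisson offset=logexp;   /* default link: log */
run;
```

Adding LINK= to the MODEL statement overrides the default for the chosen distribution.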


ANOVA Interval Categorical, Fixed Effects only

Normality

REG Interval Interval, Fixed Effects only Normality

GLM Interval Categorical, Interval, Fixed Effects Only

Normality


Exponential Family

MIXED Interval Categorical, Interval, Random Effects

Normality



Exponential Family

Interval (Quantitative) Procedures

Assessing Goodness of Fit: Akaike’s Information Criterion (AIC)

• Information criteria use the covariance matrix and the number of parameters in a model to calculate a statistic that summarizes the information represented by the model, balancing a lack-of-fit term against a penalty term.

• SAS calculates Akaike’s Information Criterion (AIC) for all 2^p possible models for p ≤ 10 independent variables.

• AIC estimates a measure of the difference between a given model and the “true” model. The model with the smallest AIC among all competing models is deemed the best model.

• Beal’s example provides SAS code that can be used to simultaneously evaluate up to 1024 models to determine the best subset of variables that minimizes the information criteria among all possible subsets.

Minimum AIC

• The AIC statistic is widely used to select the best model among alternative parametric models.

• AIC = -2(maximum log-likelihood) + 2(number of free parameters)

• The magnitude of an AIC value is not meaningful in itself; only differences between models matter.

• The difference between two AIC values is considered insignificant if it is far less than 1.
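A minimal sketch of minimum-AIC selection with built-in PROC REG options. This is not Beal's code; the data set sashelp.Class, the predictors, and the output name work.est are assumptions made for this example. PROC REG computes AIC for linear models as n·ln(SSE/n) + 2p, which differs from the likelihood form above only by a constant for a fixed data set, so the ranking of models is unaffected.

```sas
/* Evaluate all subsets and store AIC for each candidate model */
proc reg data=sashelp.class outest=work.est;
  model weight = height age / selection=rsquare aic sbc rmse;
run;
quit;

/* Smallest AIC first */
proc sort data=work.est;
  by _aic_;
run;

proc print data=work.est;
  var _in_ _rsq_ _rmse_ _aic_ height age;
run;
```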

Beal’s Simulation

• Implements five common statistical techniques to determine the best linear model
  – Minimizing the RMSE
  – Maximizing R²
  – Forward selection
  – Backward elimination
  – Stepwise regression

• The RMSE is a function of the sum of squared errors (SSE), number of observations n and the number of parameters p:

RMSE =sqrt(SSE/(n - p))
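The three sequential selection techniques listed above can be requested directly in PROC REG. The sketch below is illustrative rather than a reproduction of Beal's simulation; the data set sashelp.Cars, the response, the candidate predictors, and the 0.05 entry and stay levels are all assumptions.

```sas
proc reg data=sashelp.cars;
  /* Each labeled MODEL statement runs one selection method */
  forward:  model mpg_city = weight horsepower enginesize length
                  / selection=forward slentry=0.05;
  backward: model mpg_city = weight horsepower enginesize length
                  / selection=backward slstay=0.05;
  stepwise: model mpg_city = weight horsepower enginesize length
                  / selection=stepwise slentry=0.05 slstay=0.05;
run;
quit;
```

The Root MSE reported for each final model is the RMSE defined above and can be compared across the three methods.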

Generate the Data

Partial Code for Regressions

Simulation Results: n=1000

Simulation Result: n=10000

AIC Selected Coefficients for Five Runs

Generalized Linear Mixed Models

PROC MIXED

• The mixed model generalizes the standard linear model: $y = X\beta + Z\gamma + \epsilon$

• $\gamma$ is an unknown vector of random-effects parameters with known design matrix Z, and $\epsilon$ is an unknown random error vector whose elements are no longer required to be independent and homogeneous.

• PROC MIXED is a generalization of the GLM procedure in the sense that PROC GLM fits standard linear models, and PROC MIXED fits the wider class of mixed linear models.

• Both procedures have similar CLASS, MODEL, CONTRAST, ESTIMATE, and LSMEANS statements.

• But their RANDOM and REPEATED statements differ.

RANDOM and REPEATED Statements in PROC GLM and PROC MIXED

• The RANDOM statement in PROC MIXED incorporates random effects constituting the $\gamma$ vector in the mixed model.

• However, in PROC GLM, effects specified in the RANDOM statement are still treated as fixed as far as the model fit is concerned, and they serve only to produce corresponding expected mean squares.

• The REPEATED statement in PROC MIXED is used to specify covariance structures for repeated measurements on subjects.

• The REPEATED statement in PROC GLM is used to specify various transformations with which to conduct the traditional univariate or multivariate tests.

• In repeated measures situations, the mixed model approach used in PROC MIXED is more flexible and more widely applicable than either the univariate or multivariate approaches.
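The two PROC MIXED statements described above might be used as in the following sketch. The data set work.rm and all of its values are hypothetical, invented only so that the steps can run. The compound-symmetry structure (TYPE=CS) is one possible choice among many.

```sas
/* Hypothetical repeated measures: 6 subjects, 2 treatments, 3 time points */
data work.rm;
  input subject trt $ time y @@;
  datalines;
1 A 1 10.2  1 A 2 11.0  1 A 3 12.1
2 A 1  9.1  2 A 2 10.3  2 A 3 10.9
3 A 1 11.5  3 A 2 12.2  3 A 3 13.4
4 B 1 10.0  4 B 2 12.4  4 B 3 14.9
5 B 1  8.8  5 B 2 11.1  5 B 3 13.2
6 B 1 11.2  6 B 2 13.5  6 B 3 16.1
;
run;

/* REPEATED: model the within-subject covariance structure directly */
proc mixed data=work.rm;
  class subject trt time;
  model y = trt time trt*time / solution;
  repeated time / subject=subject type=cs;
run;

/* RANDOM: a random intercept per subject enters the model as a random effect */
proc mixed data=work.rm;
  class subject trt time;
  model y = trt time trt*time / solution;
  random intercept / subject=subject;
run;
```

Changing TYPE= (for example to AR(1) or UN) is how alternative covariance structures would be compared, typically with the fit statistics that PROC MIXED prints.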

PROC GLIMMIX

• The GLIMMIX procedure fits statistical models to data with correlations or nonconstant variability and where the response is not necessarily normally distributed.

• These models are known as generalized linear mixed models (GLMM).

• November 2005: Production level version can now be downloaded from http://support.sas.com/rnd/app/da/glimmix.html


PROC GLIMMIX (continued)

• The GLMMs, like linear mixed models, assume normal (Gaussian) random effects.

• Conditional on these random effects, data can have any distribution in the exponential family.

• The binary, binomial, Poisson, and negative binomial distributions, for example, are discrete members of this family.

• The normal, beta, gamma, and chi-square distributions are representatives of the continuous distributions in this family.
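A minimal sketch of a GLMM along these lines: a binomial response with a normally distributed random intercept for each clinic. The data set work.trial and its counts are hypothetical and exist only to make the example self-contained.

```sas
/* Hypothetical multi-clinic trial: events out of n per clinic and treatment */
data work.trial;
  input clinic trt $ events n;
  datalines;
1 drug 12 40
1 ctrl  9 38
2 drug 15 30
2 ctrl 11 31
3 drug  7 25
3 ctrl  4 24
4 drug  5 20
4 ctrl  3 22
5 drug  9 28
5 ctrl  6 27
;
run;

proc glimmix data=work.trial;
  class clinic trt;
  model events/n = trt / dist=binomial link=logit solution;
  random intercept / subject=clinic;   /* Gaussian random effect on the logit scale */
run;
```

Swapping DIST= and LINK= (for example dist=poisson with the log link) fits other conditional distributions from the exponential family with the same random-effects structure.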

Summary

• Know what your assumptions are and check them.

• Theory, methods and techniques evolve.

• Consider using
  – PROC GLIMMIX
  – Enterprise Guide

• Fit the model to the data!

References

• Akaike, H. (1973), "Information Theory and an Extension of the Maximum Likelihood Principle," in Petrov and Csaki, eds., Proceedings of the Second International Symposium on Information Theory, 267-281.

• Beal, Dennis J. (2005), "SAS Code to Select the Best Multiple Linear Regression Model for Multivariate Data Using Information Criteria," Proceedings, Southeast SAS Users Group Conference.

• Bickel, Peter J. and Doksum, Kjell A. (2001), Mathematical Statistics, Upper Saddle River, NJ: Prentice-Hall, Inc.

• Cerrito, Patricia B. (2005), "From GLM to GLIMMIX-Which Model to Choose?" Workshop, Southeast SAS Users Group Conference.

• Long, J. Scott (1997), Regression Models for Categorical and Limited Dependent Variables, Thousand Oaks, CA: Sage Publications, Inc.

• McCullagh, P. and Nelder, J. A. (1989), Generalized Linear Models, Second Edition, London: Chapman and Hall.

• Seber, G.A.F. (1984), Multivariate Observations, New York: John Wiley & Sons.

• Stokes, M.E., Davis, C.S., and Koch, G.G. (2000), Categorical Data Analysis Using the SAS System, Second Edition, Cary, NC: SAS Institute Inc.

• SAS Online Documentation, http://www.sas.com

• GLIMMIX Procedure Documentation, "The GLIMMIX Procedure, Nov. 2005," SAS Institute.


Note: EG project files, programs and other SAS source used in the original presentation are available by request, but they are not contained in this online version - kmg
