SAS Procedures
Anil Kumar
PROC tabulate
• Summarize the data in the form of a well organized table
• Syntax:
PROC TABULATE DATA=dataname;
  CLASS class-variables;
  VAR analysis-variables;
  TABLE page, row, column description / options;
RUN;
PROC tabulate – example (1)
proc tabulate data=sashelp.class;
  class sex;
  var height weight;
  table sex, height weight;
run;
Result:
PROC tabulate – example (2)
proc tabulate data=sashelp.class;
  class sex;
  var height weight age;
  table sex all, (age height weight)*(std mean sum);
run;
Result:
Gplot – A simple example
• The SAS/GRAPH module features the flexible PROC GPLOT
• A simple example:
proc gplot data=sashelp.class;
  symbol i=none v=star;
  plot height*weight;
run;
quit;
Resulting graph
Gplot – further example
• The following example shows more flexibility of the procedure
goptions reset=all;
proc gplot data=sashelp.class;
  symbol1 color=green i=join v=diamond line=1 w=2 h=2;
  symbol2 color=red i=join v=star line=2 w=2 h=2;
  plot height*weight=sex / hminor=0 legend=legend1;
  legend1 down=1 position=(top center inside) cshadow=blue frame
          value=(f=duplex) across=1 label=(font=duplex h=1.5);
  title f=zapf color=blue h=5pct 'Testing the graph';
run;
quit;
Model    | Output Variable       | Types of Inputs                           | Assumptions
ANOVA    | Interval              | Categorical, Fixed Effects Only           | Normality
REG      | Interval              | Interval, Fixed Effects Only              | Normality
LOGISTIC | Binary                | Categorical, Interval, Fixed Effects Only | Log-Normal
GLM      | Interval              | Categorical, Interval, Fixed Effects Only | Normality
GENMOD   | Categorical, Interval | Categorical, Interval, Fixed Effects Only | Exponential Family
MIXED    | Interval              | Categorical, Interval, Random Effects     | Normality
GLIMMIX  | Categorical, Interval | Categorical, Interval, Random Effects     | Exponential Family
Cerrito
General Outline of Model Choices
PROC REG
• Inputs and output are interval
• Ordinal data may be included
• Assumptions on the error term ε:
– Is normally distributed
– Has mean zero and constant variance
– Is independent
• Residual analysis should be a routine part of the analysis
Residuals
• The studentized residual, the RSTUDENT statistic, is similar to the standardized residual except that the mean square error is calculated omitting the observation.
• Observations whose studentized residuals exceed 2 in absolute value are potential outliers.
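The residual check described above can be sketched as follows. This is a minimal illustration, assuming the SASHELP.CLASS data used in the earlier examples; the output data set name (diag) and the variable names (pred, resid, rstud) are made up for this sketch.

```sas
/* Fit a simple linear regression and save residual diagnostics. */
proc reg data=sashelp.class;
  model weight = height;
  /* p= predicted values, r= ordinary residuals,
     rstudent= studentized residuals computed with the current
     observation omitted. Names diag, pred, resid, rstud are
     illustrative, not from the original slides. */
  output out=diag p=pred r=resid rstudent=rstud;
run;
quit;

/* List observations flagged as potential outliers (|RSTUDENT| > 2). */
proc print data=diag;
  where abs(rstud) > 2;
  var name height weight pred resid rstud;
run;
```

Flagged observations are candidates for a closer look, not automatic deletions.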
Regression Example
Output
Scatterplot With Regression Line
Residuals
PROC ANOVA
• PROC ANOVA is intended for balanced data: each treatment should have exactly the same number of observations, that is, every category has the same number of observations.
• Caution: If you use PROC ANOVA for analysis of unbalanced data, you must assume responsibility for the validity of the results.
• Use PROC GLM instead.
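A minimal sketch of the PROC GLM alternative, assuming SASHELP.CLASS as a convenient unbalanced data set (the sex by age cells do not all have the same count). The choice of variables is only for illustration.

```sas
/* Two-way main-effects model on unbalanced data. */
proc glm data=sashelp.class;
  class sex age;
  /* ss3 requests Type III sums of squares, which adjust each
     effect for the other effects in the model */
  model height = sex age / ss3;
  /* least-squares means adjust for the unequal cell counts */
  lsmeans sex;
run;
quit;
```

For unbalanced designs, least-squares means from LSMEANS are usually more informative than raw group means.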
Categorical Procedures
Model    | Output Variable       | Types of Inputs                           | Assumptions
LOGISTIC | Binary                | Categorical, Interval, Fixed Effects Only | Log-Normal
CATMOD   | Analyzes data that can be represented by a two-dimensional contingency table. Input can be raw data, cell counts, or direct input of a covariance matrix.
GENMOD   | Categorical, Interval | Categorical, Interval, Fixed Effects Only | Exponential Family
GLIMMIX  | Categorical, Interval | Categorical, Interval, Random Effects     | Exponential Family
PROC CATMOD
• PROC CATMOD provides a wide variety of categorical data analyses.
• Now that PROC LOGISTIC handles classification variables, there is less of a need to use PROC CATMOD for regression.
• PROC CATMOD should not be used when a continuous input variable has many distinct values.
Output
Logistic Regression
• Binary outcomes
• Allows for any combination of nominal, ordinal, or continuous explanatory variables
• Computes predicted values, the receiver operating characteristic (ROC) curve and an approximation to the area beneath the curve (c), and a number of regression diagnostics
• If the occurrence is rare, use the Poisson distribution in PROC GENMOD.
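The points above can be sketched with PROC LOGISTIC. This is an illustrative example only, assuming SASHELP.CLASS with sex as a stand-in binary outcome; the output data set and variable names (scored, phat) are made up for this sketch.

```sas
ods graphics on;

/* Binary logistic regression with an ROC plot. */
proc logistic data=sashelp.class plots(only)=roc;
  /* model the probability that sex = 'F' */
  model sex(event='F') = height weight;
  /* p= saves the predicted probability for each observation;
     scored and phat are illustrative names */
  output out=scored p=phat;
run;

ods graphics off;
```

The c statistic is reported in the table of association between predicted probabilities and observed responses.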
Generalized Linear Models
In generalized linear models the response is assumed to possess a probability distribution of exponential form. That is, the probability density of the response Y for continuous response variables, or the probability function for discrete responses, can be expressed as
\[ f(y) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\} \]
for some functions a, b, and c that determine the specific distribution (omitting some requirements for these functions). Expressions for the mean and variance are
\[ E(Y) = b'(\theta), \qquad \mathrm{Var}(Y) = b''(\theta)\, a(\phi) \]
Note that the exponential family (or form) of distributions constitutes a broad class of probability density functions. Do not confuse this broad family with the exponential pdf.
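As a worked illustration (not part of the original slides), the Poisson distribution with mean μ can be written in the standard exponential form f(y) = exp{(yθ − b(θ))/a(φ) + c(y, φ)}, and the usual mean and variance follow from the derivatives of b:

```latex
\[
f(y) = \frac{\mu^{y} e^{-\mu}}{y!}
     = \exp\{\, y \log\mu - \mu - \log y! \,\}
\]
% Matching terms with exp{ (y*theta - b(theta)) / a(phi) + c(y, phi) }:
\[
\theta = \log\mu, \qquad b(\theta) = e^{\theta}, \qquad
a(\phi) = 1, \qquad c(y, \phi) = -\log y!
\]
% Mean and variance from the derivatives of b:
\[
E(Y) = b'(\theta) = e^{\theta} = \mu, \qquad
\mathrm{Var}(Y) = b''(\theta)\, a(\phi) = e^{\theta} = \mu
\]
```

The same matching exercise works for the normal, binomial, and gamma distributions with different choices of a, b, and c.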
Distributions and Associated Default Link Functions Available in PROC GENMOD
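A minimal sketch of choosing a distribution and link in PROC GENMOD, here a Poisson model for counts. The data set (claims) and its values are hypothetical and exist only to make the example self-contained. The log link is the default for the Poisson distribution and is written out for clarity.

```sas
/* Hypothetical count data, for illustration only. */
data claims;
  input group $ exposure count;
  logexp = log(exposure);   /* offset on the log scale */
  datalines;
A 100 4
A 150 7
B 120 11
B 90 9
C 200 6
C 160 5
;
run;

/* Poisson regression of counts on group, adjusting for exposure. */
proc genmod data=claims;
  class group;
  model count = group / dist=poisson link=log offset=logexp;
run;
```

Changing dist= and link= switches to another member of the exponential family without changing the rest of the program.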
Model    | Output Variable       | Types of Inputs                           | Assumptions
ANOVA    | Interval              | Categorical, Fixed Effects Only           | Normality
REG      | Interval              | Interval, Fixed Effects Only              | Normality
GLM      | Interval              | Categorical, Interval, Fixed Effects Only | Normality
GENMOD   | Categorical, Interval | Categorical, Interval, Fixed Effects Only | Exponential Family
MIXED    | Interval              | Categorical, Interval, Random Effects     | Normality
GLIMMIX  | Categorical, Interval | Categorical, Interval, Random Effects     | Exponential Family
Interval (Quantitative) Procedures
Assessing Goodness of Fit - Akaike’s Information Criterion (AIC)
• Information criteria use the covariance matrix and the number of parameters in a model to calculate a statistic that summarizes the information represented by the model by balancing a trade-off between a lack-of-fit term and a penalty term.
• SAS calculates Akaike’s Information Criterion (AIC) for each of the 2^p possible models for p ≤ 10 independent variables.
• AIC estimates a measure of the difference between a given model and the “true” model. The model with the smallest AIC among all competing models is deemed the best model.
• Beal’s example provides SAS code that can be used to simultaneously evaluate up to 1024 models to determine the best subset of variables that minimizes the information criteria among all possible subsets.
Minimum AIC
• The AIC statistic is widely used to select the best model among alternative parametric models.
• AIC = -2(maximum log-likelihood) + 2(number of free parameters)
• The magnitude of an AIC value is not meaningful on its own.
• The difference between two AIC values is considered insignificant if it is far less than 1.
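The minimum-AIC idea can be sketched in PROC REG by fitting all subsets and sorting on AIC. This is an illustrative sketch, not Beal's actual code; it assumes SASHELP.CLASS with only two candidate predictors, and the data set name est is made up. AIC values are comparable only among models fit to the same data.

```sas
/* Fit every subset of the candidate predictors and write the
   fit statistics, including AIC, to the OUTEST= data set. */
proc reg data=sashelp.class outest=est;
  model weight = height age / selection=adjrsq aic sse;
run;
quit;

/* Smallest AIC first. */
proc sort data=est;
  by _aic_;
run;

/* The first row is the minimum-AIC model among the subsets. */
proc print data=est(obs=1);
run;
```

With p candidate predictors this approach evaluates every non-empty subset, so it is practical only for small p.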
Beal’s Simulation
• Implements five common statistical techniques to determine the best linear model
– Minimizing the RMSE
– Maximizing R^2
– Forward selection
– Backward elimination
– Stepwise regression
• The RMSE is a function of the sum of squared errors (SSE), number of observations n and the number of parameters p:
RMSE =sqrt(SSE/(n - p))
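The selection techniques listed above map onto the SELECTION= option of the MODEL statement in PROC REG. The sketch below is only an illustration, assuming SASHELP.CLASS; the entry and stay significance levels (0.15) are arbitrary example values, not recommendations from the original slides.

```sas
proc reg data=sashelp.class;
  /* all subsets, reporting R-square and RMSE for each model */
  allsubs:  model weight = height age / selection=rsquare rmse;
  /* forward selection: add variables that meet the entry level */
  forward:  model weight = height age / selection=forward slentry=0.15;
  /* backward elimination: drop variables that fail the stay level */
  backward: model weight = height age / selection=backward slstay=0.15;
  /* stepwise: forward selection with re-checking of entered variables */
  stepwise: model weight = height age / selection=stepwise
                                        slentry=0.15 slstay=0.15;
run;
quit;
```

Comparing the models chosen by each method is a quick way to see how sensitive the selection is to the technique.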
Generate the Data
Partial Code for Regressions
Simulation Results: n=1000
Simulation Result: n=10000
AIC Selected Coefficients for Five Runs
Generalized Linear Mixed Models
PROC MIXED
• The mixed model generalizes the standard linear model: y = Xβ + Zγ + ε
• γ is an unknown vector of random-effects parameters with known design matrix Z, and ε is an unknown random error vector whose elements are no longer required to be independent and homogeneous.
• PROC MIXED is a generalization of the GLM procedure in the sense that PROC GLM fits standard linear models, and PROC MIXED fits the wider class of mixed linear models.
• Both procedures have similar CLASS, MODEL, CONTRAST, ESTIMATE, and LSMEANS statements.
• But their RANDOM and REPEATED statements differ.
RANDOM and REPEATED Statements in PROC GLM and PROC MIXED
• The RANDOM statement in PROC MIXED incorporates random effects constituting the γ vector in the mixed model.
• However, in PROC GLM, effects specified in the RANDOM statement are still treated as fixed as far as the model fit is concerned, and they serve only to produce corresponding expected mean squares.
• The REPEATED statement in PROC MIXED is used to specify covariance structures for repeated measurements on subjects.
• The REPEATED statement in PROC GLM is used to specify various transformations with which to conduct the traditional univariate or multivariate tests.
• In repeated measures situations, the mixed model approach used in PROC MIXED is more flexible and more widely applicable than either the univariate or multivariate approaches.
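The two PROC MIXED statements discussed above can be sketched as follows. The data set (growth) and its values are hypothetical and serve only to make the example self-contained; the AR(1) covariance structure is one possible choice, picked here for illustration.

```sas
/* Hypothetical repeated measures: 6 subjects, 3 time points each. */
data growth;
  input subject trt $ time y;
  datalines;
1 A 1 10.2
1 A 2 11.1
1 A 3 12.3
2 A 1 9.8
2 A 2 10.9
2 A 3 11.6
3 A 1 10.5
3 A 2 11.8
3 A 3 12.9
4 B 1 10.1
4 B 2 12.0
4 B 3 13.8
5 B 1 9.9
5 B 2 11.7
5 B 3 13.5
6 B 1 10.4
6 B 2 12.3
6 B 3 14.1
;
run;

/* RANDOM: a random intercept for each subject (part of Z*gamma). */
proc mixed data=growth;
  class subject trt time;
  model y = trt time trt*time;
  random intercept / subject=subject;
run;

/* REPEATED: model the within-subject covariance of the errors. */
proc mixed data=growth;
  class subject trt time;
  model y = trt time trt*time;
  repeated time / subject=subject type=ar(1);
run;
```

Fit statistics such as AIC from the two runs can be compared to choose between covariance structures.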
PROC GLIMMIX
• The GLIMMIX procedure fits statistical models to data with correlations or nonconstant variability and where the response is not necessarily normally distributed.
• These models are known as generalized linear mixed models (GLMM).
• November 2005: Production level version can now be downloaded from http://support.sas.com/rnd/app/da/glimmix.html
PROC GLIMMIX (continued)
• The GLMMs, like linear mixed models, assume normal (Gaussian) random effects.
• Conditional on these random effects, data can have any distribution in the exponential family.
• The binary, binomial, Poisson, and negative binomial distributions, for example, are discrete members of this family.
• The normal, beta, gamma, and chi-square distributions are representatives of the continuous distributions in this family.
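A minimal GLMM sketch in PROC GLIMMIX: a binomial response with a normally distributed random intercept for each cluster. The data set (clinics) and its values are hypothetical and exist only for illustration.

```sas
/* Hypothetical events/trials data by clinic and treatment. */
data clinics;
  input clinic trt $ events trials;
  datalines;
1 A 3 20
1 B 5 20
2 A 2 18
2 B 6 19
3 A 4 22
3 B 9 21
4 A 1 15
4 B 4 16
5 A 5 25
5 B 8 24
6 A 2 17
6 B 7 18
;
run;

/* Conditional on the clinic effect, events follow a binomial
   distribution with a logit link. */
proc glimmix data=clinics;
  class clinic trt;
  model events/trials = trt / dist=binomial link=logit solution;
  random intercept / subject=clinic;
run;
```

Replacing dist= and link= (for example dist=poisson link=log) fits other members of the exponential family with the same random-effects structure.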
Summary
• Know what your assumptions are and check them.
• Theory, methods and techniques evolve.
• Consider using
– PROC GLIMMIX
– Enterprise Guide
• Fit the model to the data!
References
• Akaike, H. (1973), “Information Theory and an Extension of the Maximum Likelihood Principle,” in Petrov and Csaki, eds., Proceedings of the Second International Symposium on Information Theory, 267-281.
• Beal, Dennis J. (2005), “SAS Code to Select the Best Multiple Linear Regression Model for Multivariate Data Using Information Criteria,” Proceedings, Southeast SAS Users Group Conference.
• Bickel, Peter J. and Doksum, Kjell A. (2001), Mathematical Statistics, Prentice-Hall, Inc., Upper Saddle River, NJ.
• Cerrito, Patricia B. (2005), “From GLM to GLIMMIX-Which Model to Choose?” Workshop, Southeast SAS Users Group Conference.
• Long, J. Scott (1997), Regression Models for Categorical and Limited Dependent Variables, Thousand Oaks, CA: Sage Publications, Inc.
• McCullagh, P. and Nelder, J. A. (1989), Generalized Linear Models, Second Edition, London: Chapman and Hall.
• Seber, G.A.F. (1984), Multivariate Observations, John Wiley & Sons, New York.
• Stokes, M.E., Davis, C.S., and Koch, G.G. (2000), Categorical Data Analysis Using the SAS System, Second Edition, Cary, NC: SAS Institute Inc.
• SAS Online Documentation, http://www.sas.com
• GLIMMIX Procedure Documentation, “The GLIMMIX Procedure, Nov. 2005,” SAS Institute.
Note: EG project files, programs and other SAS source used in the original presentation are available by request, but they are not contained in this online version - kmg