
EIPB 698D Lecture 5

Raul Cruz-Cano, Spring 2013

Midterm Comments

• PROC MEANS vs. PROC SURVEYMEANS

• For non-parametric: Kruskal-Wallis

Proc Reg

• The REG procedure is one of many regression procedures in the SAS System.

PROC REG < options > ;
   MODEL dependents=<regressors> < / options > ;
   BY variables ;
   OUTPUT < OUT=SAS-data-set > keyword=names ;

http://v8doc.sas.com/sashtml/stat/chap55/sect6.htm




data blood;
   INFILE 'F:\blood.txt';
   INPUT subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
run;

/* Create 0/1 dummy variables for the categorical predictors */
data blood1;
   set blood;
   if gender='Female' then sex=1; else sex=0;
   if bloodtype='A' then typeA=1; else typeA=0;
   if bloodtype='B' then typeB=1; else typeB=0;
   if bloodtype='AB' then typeAB=1; else typeAB=0;
   if age_group='Old' then Age_old=1; else Age_old=0;
run;

proc reg data=blood1;
   model cholesterol = sex typeA typeB typeAB Age_old RBC WBC;
run;

Proc reg output

Analysis of Variance

                          Sum of           Mean
Source             DF    Squares         Square    F Value    Pr > F
Model               7      41237     5891.02895       2.54    0.0140
Error             655    1521839     2323.41811
Corrected Total   662    1563076

DF - These are the degrees of freedom associated with the sources of variance. (1) The total variance has N-1 degrees of freedom (663-1=662). (2) The model degrees of freedom equal the number of estimated parameters minus 1 (P-1). Counting the intercept, there are 8 parameters (the intercept plus 7 predictors), so the model has 8-1=7 degrees of freedom. (3) The residual (Error) degrees of freedom are the total DF minus the model DF: 662-7=655.


Sum of Squares - associated with the three sources of variance, total, model and residual.

SSTotal - The total variability around the mean: Sum((Y - Ybar)^2).

SSResidual - The sum of squared errors in prediction: Sum((Y - Ypredicted)^2).

SSModel - The improvement in prediction from using the predicted value of Y over just using the mean of Y. Hence, this is the sum of squared differences between the predicted value of Y and the mean of Y: Sum((Ypredicted - Ybar)^2).

Note that the SSTotal = SSModel + SSResidual. SSModel / SSTotal is equal to the value of R-Square, the proportion of the variance explained by the independent variables
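As a quick check of this identity, the DATA step below is a minimal sketch that rebuilds the total sum of squares and R-Square from the two sums of squares printed in the Analysis of Variance table. The variable names (ss_model, ss_error, and so on) are made up for illustration; only the numbers come from the output.

```sas
/* Sketch: verify SSTotal = SSModel + SSResidual and compute R-Square. */
/* Variable names are illustrative; values are copied from the output. */
data _null_;
   ss_model = 41237;                  /* Model sum of squares */
   ss_error = 1521839;                /* Error sum of squares */
   ss_total = ss_model + ss_error;    /* should equal the Corrected Total */
   r_square = ss_model / ss_total;    /* proportion of variance explained */
   put ss_total= r_square= 6.4;       /* written to the SAS log */
run;
```

The log should show the Corrected Total of 1563076 and an R-Square that rounds to 0.0264.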


Mean Square - These are the Mean Squares, the Sum of Squares divided by their respective DF. These are computed so you can compute the F ratio, dividing the Mean Square Model by the Mean Square Residual to test the significance of the predictors in the model


F Value and Pr > F - The F-value is the Mean Square Model divided by the Mean Square Residual. F-value and P value are used to answer the question "Do the independent variables predict the dependent variable?". The p-value is compared to your alpha level (typically 0.05) and, if smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable". Note that this is an overall significance test assessing whether the group of independent variables when used together reliably predict the dependent variable, and does not address the ability of any of the particular independent variables to predict the dependent variable.
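The arithmetic behind the F test can be reproduced by hand. The sketch below, with illustrative variable names, divides each sum of squares by its DF, forms the F ratio, and uses the PROBF function to get the upper-tail probability. It assumes the values printed in the Analysis of Variance table.

```sas
/* Sketch: rebuild Mean Square, F Value and Pr > F from the ANOVA table. */
data _null_;
   df_model = 7;
   df_error = 655;
   ms_model = 41237   / df_model;     /* Mean Square Model */
   ms_error = 1521839 / df_error;     /* Mean Square Error */
   f_value  = ms_model / ms_error;    /* F ratio */
   /* PROBF returns the lower-tail probability, so subtract from 1 */
   p_value  = 1 - probf(f_value, df_model, df_error);
   put ms_model= ms_error= f_value= 6.2 p_value= 6.4;
run;
```

Up to rounding, the values written to the log should agree with the F Value (2.54) and Pr > F (0.0140) columns.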

Proc reg output

Root MSE           48.20185    R-Square    0.0264
Dependent Mean    201.69683    Adj R-Sq    0.0160
Coeff Var          23.89817

Root MSE - Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Residual (or Error).


Dependent Mean - This is the mean of the dependent variable.

Coeff Var - This is the coefficient of variation, a unit-less measure of variation in the data. It is the Root MSE divided by the mean of the dependent variable, multiplied by 100: 100*(48.20/201.70) = 23.90.
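Both summary numbers follow directly from the Mean Square Error and the dependent mean. A small sketch, again with made-up variable names and values taken from the output:

```sas
/* Sketch: Root MSE and coefficient of variation from the printed output. */
data _null_;
   ms_error  = 2323.41811;                  /* Mean Square Error */
   dep_mean  = 201.69683;                   /* Dependent Mean */
   root_mse  = sqrt(ms_error);              /* standard deviation of the error term */
   coeff_var = 100 * root_mse / dep_mean;   /* unit-less, in percent */
   put root_mse= 8.5 coeff_var= 8.5;
run;
```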

R-Square - How much of the variability in the dependent variable is explained by the model (here 0.0264, or about 2.6%).

Proc reg output

Parameter Estimates

                  Parameter     Standard
Variable    DF     Estimate        Error    t Value    Pr > |t|
Intercept    1    187.91927     17.45409      10.77      <.0001
sex          1      1.48640      3.79640       0.39      0.6955
typeA        1      0.74839      4.01841       0.19      0.8523
typeB        1     10.14482      6.97339       1.45      0.1462
typeAB       1    -19.90314     10.45833      -1.90      0.0575
Age_old      1    -11.61798      3.85823      -3.01      0.0027
RBC          1      0.00264      0.00191       1.38      0.1676
WBC          1      0.20512      1.88816       0.11      0.9135

t Value and Pr > |t| - These columns provide the t-value and two-tailed p-value used in testing the null hypothesis that the coefficient/parameter is 0. The t-value is the Parameter Estimate divided by its Standard Error.
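To see where one row of the Parameter Estimates table comes from, the sketch below recomputes the t Value and two-tailed p-value for Age_old. It assumes the test uses the Error degrees of freedom (655) from the Analysis of Variance table; variable names are illustrative.

```sas
/* Sketch: t Value and Pr > |t| for the Age_old coefficient. */
data _null_;
   estimate = -11.61798;              /* Parameter Estimate for Age_old */
   std_err  = 3.85823;                /* its Standard Error */
   df_error = 655;                    /* Error DF from the ANOVA table */
   t_value  = estimate / std_err;
   /* two-tailed p-value from the t distribution */
   p_value  = 2 * (1 - probt(abs(t_value), df_error));
   put t_value= 6.2 p_value= 6.4;
run;
```

Up to rounding, the result should match the -3.01 and 0.0027 shown for Age_old.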

Another (better?) approach for weighted data

• Experimental design data have all the properties that we learned about in statistics classes.
   – The data are independent
   – Observations are identically distributed with some known error distribution
   – There is an underlying assumption that the data come to us as a finite number of observations from a conceptually infinite population
   – Simple random sampling without replacement for the sample data

• Sample survey data:
   – Come from a finite target population, not from a conceptually infinite population
   – Do not have independent errors
   – May cover many small sub-populations, so we do not expect that the errors are identically distributed


Household Component of the Medical Expenditure Panel Survey (MEPS HC)

• The MEPS HC is a nationally representative survey of the U.S. civilian noninstitutionalized population.

• It collects medical expenditure data along with information on demographic characteristics, access to health care, health insurance coverage, income, and employment.

• MEPS is cosponsored by the Agency for Healthcare Research and Quality (AHRQ) and the National Center for Health Statistics (NCHS).

• For the comparisons reported here we used the MEPS 2005 Full Year Consolidated Data File (HC-097).

• This is a public use file available for download from the MEPS web site (http://www.meps.ahrq.gov).


Transforming from SAS transport (SSP) format to SAS Dataset (SAS7BDAT)

• The MEPS is not a simple random sample; its design includes:
   – Stratification
   – Clustering
   – Multiple stages of selection
   – Disproportionate sampling

• The MEPS public use files (such as HC-097) include variables for generating weighted national estimates and for use of the Taylor method for variance estimation. These variables are:
   – person-level weight (PERWT05F on HC-097)
   – stratum (VARSTR on HC-097)
   – cluster/PSU (VARPSU on HC-097)

LIBNAME PUFLIB 'C:\';
FILENAME IN1 'C:\H97.SSP';

PROC XCOPY IN=IN1 OUT=PUFLIB IMPORT;
RUN;

Needed for even better estimates of the CI

H97.SAS7BDAT occupies 408MB vs. 257MB for H97.SSP vs. 14MB for H97.ZIP


PROC SURVEYREG Simple Example

SAS7BDAT


PROC SURVEYREG DATA=mylib.H97;
   strata VARSTR;
   cluster VARPSU;
   model TTLP05X = SEX;
   weight PERWT05F;
RUN;

Predict Total Income Based on Sex
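The same three design statements carry over to the other survey procedures, which ties back to the PROC MEANS vs. PROC SURVEYMEANS midterm comment. The sketch below assumes the H97 data set sits in the mylib library as in the call above. Both steps give the same weighted mean of total income, but only PROC SURVEYMEANS uses the strata and clusters when computing the standard error and confidence limits.

```sas
/* Sketch: design-based estimate of mean total income (TTLP05X). */
proc surveymeans data=mylib.H97 mean stderr clm;
   strata  VARSTR;      /* variance estimation stratum */
   cluster VARPSU;      /* primary sampling unit */
   weight  PERWT05F;    /* person-level weight */
   var     TTLP05X;
run;

/* For comparison: weighted mean only, design ignored in the standard error. */
proc means data=mylib.H97 mean stderr;
   weight PERWT05F;
   var    TTLP05X;
run;
```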

Logistic regression

• For binary response models, the response, Y, of an individual or an experimental unit can take on one of two possible values, denoted for convenience by 1 and 0 (for example, Y=1 if a disease is present, otherwise Y=0). Suppose x is a vector of explanatory variables and p = P(Y=1 | x) is the response probability to be modeled. The logistic regression model has the form

logit(p) = log[ p / (1-p) ] = β0 + β1x

Proc logistic

The following statements are available in PROC LOGISTIC:

PROC LOGISTIC < options >;
   BY variables ;
   CLASS variable ;
   MODEL response = < effects > < / options >;
   MODEL events/trials = < effects > < / options >;
   OUTPUT < OUT=SAS-data-set > < keyword=name...keyword=name > / < option >;

The PROC LOGISTIC and MODEL statements are required; only one MODEL statement can be specified. The CLASS statement (if used) must precede the MODEL statement.

http://www.technion.ac.il/docs/sas/stat/chap39/sect4.htm






High school data

• The data were collected on 200 high school students, with scores on various tests, including science, math, reading and social studies.

• The response variable is high writing test score (high_write): a writing score of 60 or above is considered high, and a score below 60 is considered low.

• We explore its relationship with gender (female), reading test score (read), and science test score (science).

High school data

data new;
   set d.hsb2;
   if write>=60 then high_write=1; else high_write=0;
   keep ID female math read science write high_write;
run;

proc logistic data=new descending;
   model high_write = female read science;
run;

Logistic output

Model Information

Data Set                       WORK.NEW
Response Variable              high_write
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    200
Number of Observations Used    200

Data Set - This is the data set used in this procedure.

Model - This is the type of regression model that was fit to our data. The terms logit and logistic are interchangeable.

Logistic output

Response Profile

Ordered    high_        Total
  Value    write    Frequency
      1    1               53
      2    0              147

Probability modeled is high_write=1.

Ordered Value refers to how SAS orders the levels of the dependent variable. Because we specified the descending option, SAS treats the levels in descending order (high to low), so that when the regression coefficients are estimated, a positive coefficient corresponds to a positive relationship with high write status. By default SAS models the lower level.

“Probability modeled is high_write=1” is a note informing us which level of the response variable we are modeling.

Logistic output

Model Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                             Intercept
              Intercept         and
Criterion       Only         Covariates
AIC            233.289         168.236
SC             236.587         181.430
-2 Log L       231.289         160.236

Model Convergence Status - This describes whether the maximum-likelihood algorithm has converged or not, and what kind of convergence criterion is used to assess convergence.

Intercept Only - The model with no predictors, just the intercept term.

Intercept and Covariates - The fitted model.

Model Fit Statistics - These are various measurements used to assess the model fit. The smaller the value, the better the fit.
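AIC and SC are both penalized versions of -2 Log L: AIC adds 2 per estimated parameter and SC adds log(n) per estimated parameter. The sketch below rebuilds the two rows from the -2 Log L row, taking n = 200 observations, 1 parameter for the intercept-only model and 4 for the fitted model. Variable names are illustrative.

```sas
/* Sketch: AIC = -2 Log L + 2k and SC = -2 Log L + k*log(n). */
data _null_;
   n = 200;                                /* observations used */
   neg2logl_int = 231.289;  k_int = 1;     /* intercept only */
   neg2logl_cov = 160.236;  k_cov = 4;     /* intercept + 3 covariates */
   aic_int = neg2logl_int + 2 * k_int;
   aic_cov = neg2logl_cov + 2 * k_cov;
   sc_int  = neg2logl_int + k_int * log(n);   /* log() is the natural log */
   sc_cov  = neg2logl_cov + k_cov * log(n);
   put aic_int= 8.3 aic_cov= 8.3 sc_int= 8.3 sc_cov= 8.3;
run;
```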

Logistic output

Testing Global Null Hypothesis: BETA=0

Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio       71.0525     3        <.0001
Score                  58.6092     3        <.0001
Wald                   39.8751     3        <.0001

These are three asymptotically equivalent Chi-Square tests. They test the null hypothesis that all of the predictors' regression coefficients are equal to zero in the model. With p < .0001, we reject Ho and conclude that at least one of the predictors' regression coefficients is not equal to zero.
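The Likelihood Ratio statistic is simply the drop in -2 Log L between the intercept-only model and the fitted model, with DF equal to the number of added predictors. A minimal sketch using the rounded values from the Model Fit Statistics table (so the result can differ from the printed 71.0525 in the last digit):

```sas
/* Sketch: likelihood ratio chi-square from the two -2 Log L values. */
data _null_;
   lr_chisq = 231.289 - 160.236;           /* intercept only minus fitted model */
   df       = 3;                           /* female, read, science */
   p_value  = 1 - probchi(lr_chisq, df);   /* upper-tail chi-square probability */
   put lr_chisq= 8.3 p_value= pvalue6.4;
run;
```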

Logistic output

Analysis of Maximum Likelihood Estimates

                             Standard          Wald
Parameter    DF    Estimate     Error    Chi-Square    Pr > ChiSq
Intercept     1    -12.7772    1.9759       41.8176        <.0001
female        1      1.4825    0.4474       10.9799        0.0009
read          1      0.1035    0.0258       16.1467        <.0001
science       1      0.0948    0.0305        9.6883        0.0019

Here are the parameter estimates along with their p-values. Based on the estimates, our model is log[ p / (1-p) ] = -12.78 + 1.48*female + 0.10*read + 0.09*science.
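To turn the fitted equation into a predicted probability, compute the linear predictor and apply the inverse logit, p = 1 / (1 + exp(-xbeta)). The sketch below does this for a hypothetical student (female, read = 60, science = 60); these values are an assumption chosen only for illustration and are not taken from the data.

```sas
/* Sketch: predicted probability of a high writing score for one student. */
data _null_;
   female = 1;  read = 60;  science = 60;    /* hypothetical student */
   xbeta = -12.7772 + 1.4825*female + 0.1035*read + 0.0948*science;
   p_hat = 1 / (1 + exp(-xbeta));            /* inverse of the logit */
   put xbeta= 8.4 p_hat= 8.4;
run;
```

For this student the linear predictor is about 0.60, which corresponds to a predicted probability of roughly 0.65.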

Odds Ratio Estimates

              Point          95% Wald
Effect     Estimate    Confidence Limits
female        4.404     1.832     10.584
read          1.109     1.054      1.167
science       1.099     1.036      1.167

The odds ratio is obtained by exponentiating the Estimate, exp[Estimate]. We can interpret the odds ratio as follows: for a one-unit increase in the predictor variable, the odds of a positive outcome are multiplied by the odds ratio, given the other variables in the model are held constant. For example, each additional point on the reading test multiplies the odds of a high writing score by 1.109.
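The point estimate and Wald confidence limits for an odds ratio can be rebuilt from the Estimate and Standard Error columns: exponentiate the estimate, and exponentiate the estimate plus or minus the normal quantile times the standard error. A sketch for the female effect, with illustrative variable names:

```sas
/* Sketch: odds ratio and 95% Wald limits for female. */
data _null_;
   estimate = 1.4825;                 /* Estimate for female */
   std_err  = 0.4474;                 /* its Standard Error */
   z        = probit(0.975);          /* normal quantile for a 95% interval */
   odds_ratio = exp(estimate);
   lower      = exp(estimate - z * std_err);
   upper      = exp(estimate + z * std_err);
   put odds_ratio= 8.3 lower= 8.3 upper= 8.3;
run;
```

Because the inputs are the rounded printed values, the results should agree with 4.404, 1.832 and 10.584 only up to rounding in the last digit.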


If the 95% CI does not cover 1, it suggests the estimate is statistically significant.


Weighted Example

• Just as with linear regression, logistic regression allows you to look at the effect of multiple predictors on an outcome.

• Consider the following example: 15- and 16-year-old adolescents were asked if they have ever had sexual intercourse.
   – The outcome of interest is intercourse.
   – The predictors are race (white and black) and gender (male and female).

Example from Agresti, A. Categorical Data Analysis, 2nd ed. 2002.

Here is a table of the data:

                    Intercourse
Race     Gender     Yes      No
White    Male        43     134
         Female      26     149
Black    Male        29      23
         Female      22      36

Data Set Intercourse

DATA intercourse;
   INPUT white male intercourse count;
   DATALINES;
1 1 1 43
1 1 0 134
1 0 1 26
1 0 0 149
0 1 1 29
0 1 0 23
0 0 1 22
0 0 0 36
;
RUN;

SAS:

• “descending” models the probability that intercourse = 1 (yes) rather than = 0 (no).

• “rsquare” requests a generalized R2 value from SAS; it is read much like the R2 from linear regression, as a rough measure of how much of the variation in the outcome the model accounts for.

• “lackfit” requests the Hosmer and Lemeshow Goodness-of-Fit Test. This tells you if the model you have created is a good fit for the data.

PROC LOGISTIC DATA = intercourse descending;
   weight count;
   MODEL intercourse = white male / rsquare lackfit;
RUN;

SAS Output: R2

Interpreting the R2 value

The R2 value is 0.9907. This means that 99.07% of the variability in our outcome (intercourse) is explained by including gender and race in our model.

PROC LOGISTIC Output

The odds of having intercourse are 1.911 times as high for males as for females.

Hosmer and Lemeshow GOF Test

H-L GOF Test

The Hosmer and Lemeshow Goodness-of-Fit Test tests the hypotheses: Ho: the model is a good fit, vs. Ha: the model is NOT a good fit.

With this test, we want to FAIL to reject the null hypothesis, because that means our model is a good fit (this is different from most of the hypothesis testing you have seen).

Look for a p-value > 0.10 in the H-L GOF test. This indicates the model is a good fit.

In this case, the p-value = 0.2419, so we do NOT reject the null hypothesis, and we conclude the model is a good fit.

Model Selection in SAS

• Can be applied to both Linear and Logistic Models

• Often, if you have multiple predictors and interactions in your model, SAS can systematically select significant predictors using forward selection, backwards selection, or stepwise selection.

• In forward selection, SAS starts with no predictors in the model. It then selects the predictor with the smallest p-value and adds it to the model. It then selects another predictor from the remaining variables with the smallest p-value and adds it to the model. It continues doing this until no remaining predictor has a p-value less than 0.05.

• In backwards selection, SAS starts with all of the predictors in the model and eliminates the non-significant predictors one at a time, refitting the model between each elimination. It stops once all the predictors remaining in the model are statistically significant.
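As a sketch of how backward selection could be requested, the call below uses the intercourse data set defined earlier and starts from all three candidate effects. The SLSTAY= value of 0.05 is written out here as an assumption about the cutoff you want, and the output of this call is not shown in these slides.

```sas
/* Sketch: backward elimination starting from white, male and white*male. */
proc logistic data=intercourse descending;
   weight count;
   model intercourse = white male white*male
         / selection=backward   /* start with all effects, drop one at a time */
           slstay=0.05;         /* an effect must have p < 0.05 to stay */
run;
```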

Forward Selection in SAS

We will let SAS select a model for us out of the three predictors: white, male, white*male. Type the following code into SAS:

PROC LOGISTIC DATA = intercourse descending;
   weight count;
   MODEL intercourse = white male white*male / selection = forward lackfit;
RUN;

Output from Forward Selection: “white” is added to the model

“male” is added to the model

No more predictors are found to be statistically significant

The Final Model:

Hosmer and Lemeshow GOF Test: The model is a good fit

SAS Weighted vs. Survey Procedures

• A random sample

• 300 students from each of the four classes: freshman, sophomore, junior, and senior.

proc format;
   value Design 1='A' 2='B' 3='C';
   value Rating 1='dislike very much' 2='dislike' 3='neutral' 4='like' 5='like very much';
   value Class 1='Freshman' 2='Sophomore' 3='Junior' 4='Senior';
run;

/* Population totals per class (stratum) */
data Enrollment;
   format Class Class.;
   input Class _TOTAL_;
   datalines;
1 3734
2 3565
3 3903
4 4196
;
run;

/* Survey counts: one record per Class x Design x Rating cell */
data WebSurvey;
   format Class Class. Design Design. Rating Rating.;
   do Class=1 to 4;
      do Design=1 to 3;
         do Rating=1 to 5;
            input Count @@;
            output;
         end;
      end;
   end;
   datalines;
10 34 35 16 15   8 21 23 26 22   5 10 24 30 21
 1 14 25 23 37  11 14 20 34 21  16 19 30 23 12
19 12 26 18 25  11 14 24 33 18  10 18 32 23 17
 8 15 35 30 12  15 22 34  9 20   2 34 30 18 16
;
run;

/* Sampling weight = class enrollment / 300 students sampled per class */
data WebSurvey;
   set WebSurvey;
   if Class=1 then Weight=3734/300;
   if Class=2 then Weight=3565/300;
   if Class=3 then Weight=3903/300;
   if Class=4 then Weight=4196/300;
run;
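The weights above are hard-coded, but they are just the class enrollment divided by the 300 students sampled in each class. As a sketch of an alternative, the step below reads the totals from the Enrollment data set instead. The output name WebSurveyW is made up for illustration, and the merge relies on both data sets already being in ascending Class order, which is how they are created above.

```sas
/* Sketch: compute the weights from Enrollment rather than typing them in. */
data WebSurveyW;                    /* hypothetical output data set name */
   merge WebSurvey Enrollment;      /* attaches _TOTAL_ to every row of a class */
   by Class;
   Weight = _TOTAL_ / 300;          /* population size / sample size per class */
run;
```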

PROC Logistic

proc logistic data=WebSurvey;
   freq Count;
   class Design;
   model Rating (ref='neutral') = Design;
   weight Weight;
run;

PROC surveylogistic

proc surveylogistic data=WebSurvey total=Enrollment;
   freq Count;
   class Design;
   model Rating (ref='neutral') = Design;
   stratum Class;
   weight Weight;
run;

If you want “better” results..

For the Ratings for Design B vs. Design C compare
1. The point estimate
2. 95% Confidence Interval
