depression diagnosis

Page 1: Depression Diagnosis

Austin Kinion
STA 138
December 4, 2014

Final Project

Introduction: The data that I will be analyzing for this report deal with whether a patient is diagnosed or not diagnosed with depression in a visit during one year of care. There are many ways in which a patient can be diagnosed with depression, so there are many variables not taken into account by this model that may affect the results greatly. For the sake of this project, I will try to predict whether a patient will be diagnosed with depression using stepwise logistic regression. The variables for this data are as follows:

----------------------------------------------------------------------
DAV      Diagnosis of depression in any visit during one year of care (0 = Not diagnosed, 1 = Diagnosed)
PCS      Physical component of SF-36 measuring health status of the patient
MCS      Mental component of SF-36 measuring health status of the patient
BECK     The Beck depression score of the patient
PGEND    The gender of the patient
AGE      The patient's age in years
EDUCAT   The number of years of formal schooling
----------------------------------------------------------------------

The response variable is DAV. The explanatory variables are PCS, MCS, BECK, PGEND (gender), AGE (age in years), and EDUCAT (number of years of formal schooling).

Materials and Methods: For this project, I will test whether we can predict if a patient will be diagnosed with depression based on the variables above, and pick the best model using stepwise logistic regression. 400 patients were randomly selected from primary care facilities, and the above 7 variables were recorded for each patient.

SAS Code and Results: To read in the data and format:


The very first thing I did was check that the model with main effects did not suffer from multicollinearity. I did this with the following code:

And partial output:

Next, I want to create the logistic model. I will first show the model with only the main effects (no interactions) chosen by the stepwise logistic regression in SAS, though this is not the model that I will use. The code used to obtain the best model through forward stepwise regression was:

And the output for the best model was:

I chose to use a model similar to the one above, with two extra interaction terms. I chose these interaction terms because I believe, firstly, that the interaction between PGEND and EDUCAT can help with the prediction of depression, since education's effect may differ depending on the gender of the patient. Secondly, I believe the interaction between PGEND and BECK may help with the prediction of depression, because the effect of the Beck depression score may differ depending on gender as well. So the SAS code for the model described above is:

From the output, it is clear that there is no multicollinearity in the model, since the variance inflation factor for each of the variables is much lower than 10.
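The variance inflation factor check can also be reproduced by hand: the VIF of predictor j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. The following is a minimal Python sketch of that calculation, not the SAS code used for this report. The three short columns are made-up toy values rather than the depression data, and the function names are my own.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long numeric columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fine for small matrices)."""
    k = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def vifs(columns):
    """VIF_j is the j-th diagonal element of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Toy predictor columns (hypothetical values, NOT the depression data).
toy_columns = [
    [1, 2, 3, 4, 5, 6],
    [2, 1, 4, 3, 6, 5],
    [5, 3, 6, 2, 4, 1],
]

for name, vif in zip(["x1", "x2", "x3"], vifs(toy_columns)):
    flag = "possible multicollinearity" if vif > 10 else "ok"
    print(f"{name}: VIF = {vif:.2f} ({flag})")
```

The cutoff of 10 is the same rule of thumb used in the text; a VIF of 1 means a predictor is uncorrelated with the others.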

So the best model picked by SAS, with no interactions, is:

$$\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -2.3093 - 0.047\,\mathrm{MCS} + 0.0721\,\mathrm{BECK} - 0.6633\,\mathrm{PGEND} + 0.1785\,\mathrm{EDUCAT}$$
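To show how the main-effects model above turns into a predicted probability, here is a small Python sketch that evaluates the linear predictor and applies the inverse logit. The coefficients are copied from the equation above; the patient values are hypothetical, and treating PGEND as a 0/1 indicator is an assumption, since the coding is not stated in this report.

```python
import math

# Coefficients copied from the main-effects model above.
COEF = {
    "intercept": -2.3093,
    "MCS": -0.047,
    "BECK": 0.0721,
    "PGEND": -0.6633,
    "EDUCAT": 0.1785,
}

def linear_predictor(mcs, beck, pgend, educat):
    """Log-odds of a depression diagnosis for one patient."""
    return (COEF["intercept"]
            + COEF["MCS"] * mcs
            + COEF["BECK"] * beck
            + COEF["PGEND"] * pgend
            + COEF["EDUCAT"] * educat)

def predicted_probability(mcs, beck, pgend, educat):
    """Inverse logit of the linear predictor."""
    eta = linear_predictor(mcs, beck, pgend, educat)
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical patient (illustrative values only; PGEND assumed coded 0/1).
p = predicted_probability(mcs=50, beck=10, pgend=0, educat=12)
print(f"Estimated probability of diagnosis: {p:.3f}")
```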


and the partial output from this to obtain the model is:

So the model is:

$$\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots$$

$$\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -2.7921 - 0.0487\,\mathrm{MCS} + 0.0813\,\mathrm{BECK} - 1.3659\,\mathrm{PGEND} + 0.2151\,\mathrm{EDUCAT} - 0.1296\,\mathrm{EDUCAT}\times\mathrm{PGEND} - 0.0419\,\mathrm{BECK}\times\mathrm{PGEND}$$

An explanation of the variables used for my final model is as follows: The intercept β0 is -2.7921, which is the log-odds of a diagnosis when all of the explanatory variables are equal to zero. The slope estimate for MCS is -0.0487, which means that when MCS increases by one unit, the odds of the patient being diagnosed with depression are multiplied by e^(-0.0487) = 0.9524, holding the other variables fixed. For EDUCAT, with one extra year of formal schooling, the odds of the patient being diagnosed with depression are multiplied by e^(.2151) = 1.24. For BECK, when a patient's Beck depression score increases by one unit, the odds of that patient being diagnosed with depression are multiplied by e^(.0813) = 1.085. For PGEND, we can say that the odds of the patient being diagnosed with depression as a male are e^(1.3659) = 3.92 times the odds of the patient being diagnosed with depression as a female. Strictly, because PGEND interacts with EDUCAT and BECK in this model, each of these odds ratios describes the effect when the interacting variable is at its reference (zero) value.

From the SAS output, we can see the reported 95% Wald confidence intervals for the variables described above. For MCS, we can be 95% confident that when MCS increases by one unit, the odds of that person being diagnosed with depression will decrease by between 2.6% and 6.3%. For EDUCAT, we can be 95% confident that with one year of additional formal education, the odds of that person being diagnosed with depression will increase by between 5.9% and 34.9%. For BECK, we can be 95% confident that when the Beck depression score increases by one unit, the odds of that person being diagnosed with depression will increase by between 1.0% and 14.3%. For PGEND, since the 95% confidence interval for the odds ratio contains 1, the effect is not statistically significant.
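The odds-ratio arithmetic above can be checked with a few lines of Python. This is a sketch under stated assumptions: the exponents are the ones used in the interpretation paragraph, the Wald interval uses the usual exp(β ± 1.96·SE) form, and the standard error passed to `wald_ci` below is a placeholder, because the SAS standard errors are not reproduced in this report.

```python
import math

# Exponents as used in the interpretation paragraph above.
EXPONENTS = {"MCS": -0.0487, "EDUCAT": 0.2151, "BECK": 0.0813, "PGEND": 1.3659}

def odds_ratio(beta):
    """Multiplicative change in the odds for a one-unit increase."""
    return math.exp(beta)

def percent_change(beta):
    """Same effect expressed as a percent change in the odds."""
    return (odds_ratio(beta) - 1.0) * 100.0

def wald_ci(beta, se, z=1.96):
    """95% Wald interval for the odds ratio: exp(beta -/+ z * se)."""
    return math.exp(beta - z * se), math.exp(beta + z * se)

for name, beta in EXPONENTS.items():
    print(f"{name}: OR = {odds_ratio(beta):.4f}, change in odds = {percent_change(beta):+.1f}%")

# Placeholder standard error (hypothetical, NOT taken from the SAS output).
low, high = wald_ci(EXPONENTS["EDUCAT"], se=0.05)
print(f"EDUCAT OR interval with the placeholder SE: ({low:.3f}, {high:.3f})")
```

An interval for the odds ratio that contains 1 corresponds to a coefficient interval that contains 0, which is why such a variable is called not statistically significant.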

-Residual Analysis
It is clear from the residual chart that there are many outliers with Pearson and deviance residuals greater than 2.0 in absolute value, so they may be influencing the coefficients and the goodness of fit. After looking at the data output (which is not displayed because it is too big), I can see that the following observations have Pearson and deviance residuals greater than 2.0 in absolute value: observations 22, 115, 173, 194, 255, 260, 286, 316, 323, 325, 333, 353, and 368. These observation numbers are the ones corresponding to the output from SAS, with observation 1 being nothing (the header). So if I were to adjust the observations to match exactly the observations from the data, they would be the observation numbers listed above minus 1: 21, 114, 172, 193, 254, 259, 287, 315, 322, 324, 332, 352, and 367. Out of these “adjusted” observations, the 5 observations with the highest Pearson and deviance residuals are (with Pearson residual, deviance residual): 193 (5.47, 2.62), 259 (3.54, 2.28), 315 (4.20, 2.42), 324 (4.31, 2.44), 352 (5.75, 2.65).
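For a 0/1 response, both residuals are simple functions of the observed outcome and the fitted probability. The sketch below uses the standard definitions for binary logistic regression; the fitted probability in the example is a hypothetical value chosen for illustration, not one read from the SAS output.

```python
import math

def pearson_residual(y, p):
    """(observed - fitted) / sqrt(fitted variance) for a 0/1 outcome."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def deviance_residual(y, p):
    """Signed square root of the observation's contribution to the deviance."""
    log_lik = y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return math.copysign(math.sqrt(-2.0 * log_lik), y - p)

def is_outlier(y, p, cutoff=2.0):
    """Flag an observation when both residuals exceed the cutoff in absolute value."""
    return (abs(pearson_residual(y, p)) > cutoff
            and abs(deviance_residual(y, p)) > cutoff)

# Hypothetical case: a diagnosed patient (y = 1) whose fitted probability is small.
y, p = 1, 0.0323
print(f"Pearson = {pearson_residual(y, p):.2f}, deviance = {deviance_residual(y, p):.2f}")
print("flagged as outlier:", is_outlier(y, p))
```

Large residuals of this kind arise when a patient was diagnosed although the model assigned a low probability of diagnosis, or the reverse.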

-Influential Observations
Looking at the hat matrix diagonal column (what we were told to do in class) from the SAS output, it is clear, after looking carefully, that there is really only one influential observation, and it is not among the large residuals listed above: observation 378, with a hat matrix diagonal equal to .0892, which is much higher than any of the others (the next highest is .02). A hat matrix diagonal this high means that this observation has high leverage and may be affecting the parameter estimates.


-Goodness of Fit
The percent concordant is 76.5 and the percent discordant is 23.1. This is relatively good, with Somers’ D, Gamma, and c being relatively high (.535, .537, and .767 respectively). Somers’ D is the difference between the proportion of concordant pairs and the proportion of discordant pairs, so the model ranks a diagnosed and a not-diagnosed patient in the correct order 53.5 percentage points more often than in the wrong order. This isn’t an excellent number, but it still implies that there is some association, which means our model is doing an “okay” job at predicting. We could do a similar analysis for Gamma and say that since it is positive and relatively large (.537), there is some association.
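The three association statistics follow directly from the percent concordant and percent discordant. A minimal Python sketch using the standard pair-based definitions is given below; because the inputs are the rounded percentages quoted above, the results agree with the reported values only up to rounding.

```python
def rank_association(pct_concordant, pct_discordant):
    """Somers' D, Gamma and c from percent concordant / discordant pairs."""
    conc = pct_concordant / 100.0
    disc = pct_discordant / 100.0
    tied = 1.0 - conc - disc                 # remaining pairs are ties
    somers_d = conc - disc                   # (nc - nd) / all pairs
    gamma = (conc - disc) / (conc + disc)    # ties excluded from the denominator
    c_stat = conc + 0.5 * tied               # ties count as half
    return somers_d, gamma, c_stat

somers_d, gamma, c_stat = rank_association(76.5, 23.1)
print(f"Somers' D = {somers_d:.3f}, Gamma = {gamma:.3f}, c = {c_stat:.3f}")
```

Note that c = (Somers' D + 1) / 2, so a c of .767 and a Somers' D of .535 carry the same information on different scales.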

With the lowest AIC (305.201), I chose to work with the ‘best model’ chosen by SAS with stepwise regression over the model with only the intercept (AIC: 353.736), and over the ‘best’ model with the interaction terms (AIC: 308.519). With the Hosmer and Lemeshow goodness-of-fit test in SAS, we can see that the χ2 statistic is χ2 = 7.4172 with 8 degrees of freedom and p-value = 0.4924. Since the p-value is so large, we fail to reject H0: the model fits the data well, and treat the model as an adequate fit for the data.
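Both steps in this paragraph are easy to verify numerically. The Python sketch below picks the model with the smallest AIC and recomputes the Hosmer and Lemeshow p-value from the reported statistic. It relies on the closed-form upper tail of the chi-square distribution that holds for an even number of degrees of freedom; the model labels are my own shorthand.

```python
import math

# AIC values as reported above (labels are shorthand for the three models).
aic = {
    "intercept only": 353.736,
    "main effects (stepwise)": 305.201,
    "with interactions": 308.519,
}
best = min(aic, key=aic.get)
print("lowest AIC:", best, aic[best])

def chi2_sf_even_df(x, df):
    """P(chi-square_df > x) for even df: exp(-x/2) * sum_{i<df/2} (x/2)^i / i!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

p_value = chi2_sf_even_df(7.4172, 8)
print(f"Hosmer-Lemeshow p-value = {p_value:.4f}")
```

A smaller AIC is preferred, and the penalty for the two extra interaction parameters is what pushes the interaction model's AIC above that of the main-effects model here.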

I have repeated the analysis of maximum likelihood estimates table from above for the best-fit model so we can test whether the β’s are statistically significant. To start, let's take a look at β0. To test, we have H0: β0=0, against Ha: β0≠0. Since our Wald chi-square statistic is 3.897, with p-value = 0.0484 (p-value < .05), we reject H0 and conclude that β0 is statistically significant, Ha: β0≠0.


Let's take a look at β1. To test, we have H0: β1=0, against Ha: β1≠0. Since our Wald chi-square statistic is 9.773, with p-value = 0.0018 (p-value < .05), we reject H0 and conclude that β1 is statistically significant, Ha: β1≠0.

Let's take a look at β2. To test, we have H0: β2=0, against Ha: β2≠0. Since our Wald chi-square statistic is 5.2214, with p-value = 0.0223 (p-value < .05), we reject H0 and conclude that β2 is statistically significant, Ha: β2≠0.

Let's take a look at β3. To test, we have H0: β3=0, against Ha: β3≠0. Since our Wald chi-square statistic is 3.8280, with p-value = 0.0504 (p-value ≥ .05), we fail to reject H0: β3=0, so β3 is not statistically significant at the .05 level. Failing this test does not mean the variable should be dropped from the model, though. It is right on the edge of being statistically significant and it can play a part in predicting one’s depression diagnosis, so I believe it should stay in the model.

Lastly, let's take a look at β4. To test, we have H0: β4=0, against Ha: β4≠0. Since our Wald chi-square statistic is 8.36009, with p-value = 0.0038 (p-value < .05), we reject H0 and conclude that β4 is statistically significant, Ha: β4≠0.
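Each Wald test above compares the statistic to a chi-square distribution with 1 degree of freedom, whose upper tail can be written with the complementary error function. The short Python sketch below recomputes the p-values from the statistics reported above; the function name is my own.

```python
import math

def wald_p_value(chi_square):
    """P(chi-square_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(chi_square / 2.0))

# Wald chi-square statistics as reported in the text above.
reported = {"β0": 3.897, "β1": 9.773, "β2": 5.2214, "β3": 3.8280, "β4": 8.36009}

for name, stat in reported.items():
    p = wald_p_value(stat)
    verdict = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"{name}: chi-square = {stat}, p = {p:.4f} -> {verdict}")
```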

-Goodness-of-link function
This was done with the code:

Output:

From the output, we can see that the estimated coefficient of the new variable (linkf) is 1.4111, with a Wald chi-square statistic of .4469 and p-value = .5038. Since the p-value > .05, this variable is statistically insignificant, so we conclude that the link function is appropriate.

Conclusion and Discussion: A logistic regression model was fit to the data using stepwise logistic regression. I found that we can be 95% confident that when MCS increases by one unit, the odds of that person being diagnosed with depression will decrease by between 2.6% and 6.3%. For EDUCAT, we can be 95% confident that with one year of additional formal education, the odds of that person being diagnosed with depression will increase by between 5.9% and 34.9%. For BECK, we can be 95% confident that when the Beck depression score increases by one unit, the odds of that person being diagnosed with depression will increase by between 1.0% and 14.3%. For PGEND, since the 95% confidence interval for the odds ratio contains 1, the effect is not statistically significant.

I also did a residual analysis on the data and searched for influential observations. I found one influential observation that was not among the large residuals, which was observation 378. Even though there were several large residuals and an influential observation, the model was still found to be an adequate fit for the data, as determined from the Hosmer and Lemeshow goodness-of-fit test.

As a result of the statements above, I was able to conclude that the variables BECK, EDUCAT, MCS, and PGEND are all associated with the depression diagnosis of a patient. Based on the result, we cannot reject the null hypothesis that the model fits the data, yet I would be more comfortable with a model that provides more support of fit. Therefore, I recommend researching additional covariates in order to make more reliable predictions, such as how many people the patient talks to on a daily basis, whether the patient has a hobby, and so on.