logistic regression analysis
A STUDY OF LOW BIRTH WEIGHT OF CHILDREN
WITH SPECIAL EMPHASIS ON LOGISTIC
REGRESSION ANALYSIS
PREPARED BY VINYA.P
INTRODUCTION
• Regression methods have become a vital component of any data analysis concerned with explaining the relationship between a response variable and one or more explanatory variables.
Logistic regression measures the relationship between the dichotomous dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by estimating probabilities.
• Logistic regression is a model used for prediction of the probability of occurrence of an event. It is a generalized linear model used for binomial regression. It makes use of several predictor (explanatory) variables that may be either numerical or categorical. Specifically, logistic regression can be used only with two types of target (response or dependent) variables.
A categorical target variable that has exactly two categories (i.e. a binary or a dichotomous variable)
A continuous target variable that has values in the range 0 to 1 representing probability values or proportions.
• In binary logistic regression the outcome is usually coded as “0” or “1”. Success is coded as “1” and failure is coded as “0”.
• Logistic regression is used to predict the odds of being a case based on the values of the independent variables.
• Logistic regression is used widely in many fields, including the medical and social sciences.
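As a minimal sketch of how a logistic model turns predictor values into a probability, the following Python snippet applies the logistic function to a linear predictor. The function names, coefficients and predictor values are hypothetical, chosen only for illustration; they are not estimates from this study.

```python
import math

def logistic(z):
    """Map a linear predictor z (the log odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept, coefs, xs):
    """Probability of the outcome coded "1" for one subject."""
    z = intercept + sum(b * x for b, x in zip(coefs, xs))
    return logistic(z)

# Hypothetical intercept, coefficients and predictor values (not from the study).
p_example = predict_probability(-1.0, [0.5, -0.02], [1, 120])
print(round(p_example, 4))
```

A linear predictor of zero corresponds to a probability of 0.5, and negative values correspond to probabilities below 0.5.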
ODDS AND ODDS RATIO
The odds of occurrence of an event are defined as the ratio of the probability that the event will occur to the probability that the event will not occur. That is, the odds of the event E are given by
$\mathrm{Odds}(E) = \frac{P(E)}{P(E')} = \frac{P(E)}{1 - P(E)}$
Odds(E) = n/m is interpreted to mean that the probability of occurrence of the event is n/m times the probability of its not occurring. Equivalently, the odds are "m to n" that the event will not happen.
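The odds ratio $OR_{A,B}$, which compares the odds of events $E_A$ and $E_B$ (that is, event E occurring in groups A and B, respectively), is defined as the ratio between the two odds; that is,
$OR_{A \text{ vs } B} = \frac{\mathrm{Odds}(E_A)}{\mathrm{Odds}(E_B)} = \frac{P(E_A) / (1 - P(E_A))}{P(E_B) / (1 - P(E_B))}$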
In particular, if an odds ratio is equal to one, the odds are the same for the two groups. Note that, if we define a factor with levels corresponding to groups A and B, respectively, then an odds ratio equal to one is equivalent to there being no factor-effect.
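The definitions of odds and odds ratio above can be sketched in a few lines of Python. The probabilities used here are hypothetical and serve only to show the arithmetic.

```python
def odds(p):
    """Odds of an event with probability p: P(E) / (1 - P(E))."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)

def odds_ratio(p_a, p_b):
    """Odds ratio comparing group A with group B."""
    return odds(p_a) / odds(p_b)

# Hypothetical probabilities (not from the study).
print(odds(0.75))             # 3.0, i.e. odds of "3 to 1" in favour of the event
print(odds_ratio(0.5, 0.5))   # 1.0, i.e. the same odds in both groups
```

An odds ratio above one means the event has higher odds in group A than in group B; a value below one means the opposite.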
Low Birth Weight
• Low birth weight is defined as a birth weight of a live-born infant of less than 2,500 g (5 pounds 8 ounces), regardless of gestational age.
• Low birth weight may lead to an increased risk of complications such as mental retardation, vision loss, or learning problems.
OBJECTIVES OF THE STUDY
The main objective of the survey is to study low birth weight in children, with special emphasis on logistic regression analysis.
My study focuses on characteristics such as the mother's age, her weight at the last menstrual period, race, hypertension, and the number of physician visits during the first trimester of pregnancy.
ANALYSIS
The statistical tools used for the analysis are
• Logistic regression
• Odds and odds ratio
• Hosmer-Lemeshow test
• Wald statistic
• Likelihood ratio statistic
• Cox and Snell's R2
• Q-Q plot
• Box plot
• ANOVA
• Kruskal-Wallis analysis of variance
• ANCOVA
The Q-Q plot shows a departure from normality. If the data were normally distributed, the data points would lie close to the diagonal line, which represents the values expected under a normal distribution. Here the points follow a non-linear pattern, so the data are not normally distributed.
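For readers who want to see what a normal Q-Q plot actually compares, here is a minimal sketch that computes the plotted pairs using only the Python standard library. The plotting position (i - 0.5)/n is one common convention and is an assumption here, and the sample values are hypothetical rather than the study data.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    reference = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i - 0.5) / n keep the probabilities strictly inside (0, 1).
    return [(reference.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

# Hypothetical right-skewed sample (not the study data).
sample = [95, 100, 105, 110, 115, 120, 130, 150, 190, 250]
for theoretical, observed in normal_qq_points(sample):
    print(round(theoretical, 1), observed)
```

If the sample were normal, the observed values would track the theoretical quantiles closely; large gaps in the upper tail indicate positive skew.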
BOX PLOT
From the box plot we see that the distribution is positively skewed, since the upper whisker is longer and the line corresponding to the median lies in the lower part of the box.
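The reading of the box plot can be checked numerically. The sketch below, on hypothetical data, computes the quartiles that define the box and tests whether the median sits in the lower part of it, which is the pattern described above for positive skew.

```python
from statistics import quantiles

def box_summary(data):
    """Return (Q1, median, Q3), the three values that define the box."""
    q1, med, q3 = quantiles(data, n=4)  # standard-library default ("exclusive") method
    return q1, med, q3

def median_in_lower_part_of_box(data):
    """True when the median is closer to Q1 than to Q3."""
    q1, med, q3 = box_summary(data)
    return (med - q1) < (q3 - med)

# Hypothetical right-skewed sample (not the study data).
weights = [95, 100, 105, 110, 115, 120, 130, 150, 190, 250]
print(box_summary(weights))
print(median_in_lower_part_of_box(weights))
```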
The Kolmogorov-Smirnov goodness-of-fit test is used to examine the suitability of the normal distribution for describing low birth weight. From the table, the p-value is reported as 0.000 (that is, p < 0.001). Based on this p-value, the normal distribution does not give a good fit to the data on low birth weight. Since the p-value of the Shapiro-Wilk test is also less than 0.05, low birth weight is not normally distributed; that is, the data deviate significantly from the normal distribution.
The asymptotic significance estimates the probability of obtaining a chi-square statistic greater than or equal to the one displayed if there truly are no differences between the group ranks. A chi-square of 13.960 with 2 degrees of freedom should occur only about 1 time per 1,000.
Kruskal-Wallis One-Way ANOVA
Since the p-value (0.001) is less than 0.05, the weight in pounds at the last menstrual period differs across races. Hence we use pairwise Wilcoxon rank sum tests to examine where this difference lies.
Test statistics
Chi-square df Asymp. Sig.
13.960 2 .001
a. Kruskal-Wallis test
Grouping variable: race
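As a rough sketch of how the Kruskal-Wallis statistic and its asymptotic significance are obtained, the code below computes H from rank sums (without a tie correction, which is a simplifying assumption) and converts a chi-square value with 2 degrees of freedom to an upper-tail probability. The three small groups are hypothetical; only the value 13.960 and its 2 degrees of freedom come from the output reported above.

```python
import math

def average_ranks(values):
    """Rank values from 1 to N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction applied)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-x / 2)

# Hypothetical groups, only to exercise the function.
h_example = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Chi-square value and degrees of freedom as reported in the study output.
p_reported = chi2_sf_df2(13.960)
print(h_example, p_reported)
```

The probability obtained for 13.960 is a little under 0.001, which agrees with the reported asymptotic significance of .001.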
From the test statistics we conclude that there is a significant difference in weight in pounds at the last menstrual period between mothers belonging to the first and second races.
Wilcoxon rank sum test
TEST STATISTICS
Mann-Whitney U Wilcoxon W Z Asymp. Sig. (2-tailed)
1017.500 5673.500 -1.442
Grouping variable: race
For the ANCOVA we first establish the relationship between the mother's weight in pounds at the last menstrual period and the mother's age. The p-value for age is 0.028, which is less than 0.05, so age is a significant covariate. The p-values for ptl and ui are not less than 0.05, so ptl and ui are not significant covariates. The p-values for race, low, smoke, ht and ftv are less than 0.05. Hence the ANCOVA rejects the null hypothesis; that is, the mother's weight in pounds at the last menstrual period differs significantly across the levels of these variables. To find where the difference lies, we carry out a post-hoc analysis, here for race.
Under Model Summary we see that the -2 Log Likelihood statistic is 203.343. The Cox & Snell R2 can be interpreted like R2 in a multiple regression, but it cannot reach a maximum value of 1. The Nagelkerke R2 can reach a maximum of 1. The larger the value of R2, the better the model fits the data.
MODEL SUMMARY
Step -2 Log likelihood Cox & Snell R2 Nagelkerke R2
race 203.343 .153 .215
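The two pseudo-R2 values in the Model Summary can be related to the -2 Log Likelihood with the standard Cox & Snell and Nagelkerke formulas. The sketch below is illustrative only: the null-model -2 Log Likelihood (234.672) and the sample size (189) are not given in this presentation and are assumptions, taken to be those of the widely distributed low birth weight teaching data set whose variable names (low, race, smoke, ptl, ht, ui, ftv) match the ones used here.

```python
import math

def cox_snell_r2(dev_null, dev_model, n):
    """Cox & Snell R2 from the -2 log likelihoods of the null and fitted models."""
    return 1.0 - math.exp(-(dev_null - dev_model) / n)

def nagelkerke_r2(dev_null, dev_model, n):
    """Cox & Snell R2 rescaled by its maximum attainable value."""
    max_r2 = 1.0 - math.exp(-dev_null / n)
    return cox_snell_r2(dev_null, dev_model, n) / max_r2

DEV_MODEL = 203.343  # -2 Log Likelihood reported in the Model Summary
DEV_NULL = 234.672   # ASSUMPTION: null-model -2 Log Likelihood, not stated in the source
N = 189              # ASSUMPTION: sample size, not stated in the source

print(round(cox_snell_r2(DEV_NULL, DEV_MODEL, N), 3))
print(round(nagelkerke_r2(DEV_NULL, DEV_MODEL, N), 3))
```

The rescaling step shows why the Nagelkerke value is always at least as large as the Cox & Snell value: the latter is divided by a maximum that is below 1.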
Since the p-value for the chi-square statistic is greater than 0.05 (0.546 > 0.05), the Hosmer and Lemeshow test is not significant, which indicates that the model fits the data adequately.
Hosmer and Lemeshow Test
Step Chi-square df Sig.
5 6.910 8 .546
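The significance value in the Hosmer and Lemeshow table can be reproduced from the chi-square statistic and its degrees of freedom. The sketch below uses the closed-form upper-tail probability of the chi-square distribution that holds when the degrees of freedom are even; the function name is hypothetical, and the inputs 6.910 and 8 are taken from the table above.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with an even number of df.

    For df = 2k: P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

p_hl = chi2_sf_even_df(6.910, 8)
print(round(p_hl, 3))
```

A value well above 0.05 means the observed and expected counts across the risk groups do not differ significantly, which is the desired outcome for this test.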
The classification table compares the values of the dependent variable predicted by the regression model with the values actually observed in the data. In this case, the model's predictor variables correctly predict the observed value of low birth weight 71.4% of the time. This is a small increase in the overall percentage correct, from 68% for the model with no predictors to 71.4%, so classification accuracy improves.
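How a classification table and its overall percentage are built can be sketched as follows. The cut-off of 0.5 is the usual default and is an assumption here; the observed outcomes and predicted probabilities are hypothetical, not the study data.

```python
def classification_table(observed, predicted_prob, cutoff=0.5):
    """Count cases by (observed value, predicted value) using a probability cut-off."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(observed, predicted_prob):
        table[(y, int(p >= cutoff))] += 1
    return table

def overall_accuracy(table):
    """Share of cases whose predicted value equals the observed value."""
    correct = table[(0, 0)] + table[(1, 1)]
    return correct / sum(table.values())

# Hypothetical outcomes and fitted probabilities (not the study data).
observed = [0, 0, 0, 1, 1]
fitted = [0.1, 0.4, 0.6, 0.7, 0.3]
table = classification_table(observed, fitted)
print(table, overall_accuracy(table))
```

With no predictors, every case is assigned to the more frequent category, so the baseline accuracy equals the share of that category; the fitted model is judged against that baseline.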
FINDINGS AND CONCLUSION
• A logistic regression analysis was conducted to evaluate the relationships between the likelihood of having a low birth weight (LBW) baby and certain maternal characteristics.
• A history of hypertension increases the log odds of having a low birth weight baby by 1.856.
• The weight of the mother is negatively related to the log odds of having a low birth weight baby.
• The presence of premature labour increases the chance of having a low birth weight baby.
• The Hosmer and Lemeshow statistic indicates a better fit when the chi-square statistic is not significant.
• From the box plot and Q-Q plot, we identify that the mother's weight in pounds at the last menstrual period is positively skewed.
• The Shapiro-Wilk test reveals that the mother's weight in pounds at the last menstrual period does not follow a normal distribution. Therefore we use the Kruskal-Wallis ANOVA to compare the mother's weight at the last menstrual period across races.
• By ANCOVA, the mother's weight in pounds at the last menstrual period differs significantly across the groups considered.
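The hypertension finding above is stated on the log-odds scale. As a small worked example, exponentiating the reported coefficient converts it to an odds ratio, which is often easier to interpret; the variable names are hypothetical, and only the value 1.856 comes from the findings.

```python
import math

LOG_ODDS_HT = 1.856  # change in log odds for history of hypertension, as reported

odds_ratio_ht = math.exp(LOG_ODDS_HT)
print(round(odds_ratio_ht, 2))
```

The result is roughly 6.4, meaning the odds of a low birth weight baby are about 6.4 times as high for mothers with a history of hypertension, holding the other predictors in the model fixed.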