regression analysis on health insurance coverage rate



Georgia Institute of Technology

ISyE 6414 Spring 2014

April 25

REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATE

A Study on Influential Factors of Uninsured Rate in Georgia

Group 8

Xueying Linghu 903004963

Chaoyi Wu 903001682


Summary

The proportion of people without health insurance coverage in the United States is influenced by demographic and geographical factors such as income, unemployment rate and gender. In this paper, we use multiple linear regression to build a model that estimates the uninsured rate of counties in Georgia from several predictors. We find that the uninsured rate is closely related to the age distribution, median income, poverty level, employment situation, gender distribution and citizenship status of each county. In the fitted model, a larger share of residents aged 18 to 64, a larger native-born share, a larger share of residents with income well above the poverty level and a higher median income are each associated with a lower uninsured rate, and the employment conditions, summarized by a principal component, also have a significant effect. With the chosen predictors, the model can be used to predict the uninsured population in the future.

The project is divided into four parts. First, we chose the topic and set up the problem statement. In the data preparation stage, we cleaned the data from the data source and applied Principal Component Analysis (PCA) to the correlated employment variables. In the model fitting stage, we used several model selection methods (exhaustive search, stepwise regression and LASSO) together with an F test to select the best model, followed by inference and interpretation. The last part discusses directions for future study.

Table of Contents

Summary

Background and Problem Statement

Data for Modeling

1. Data Source

2. Data Exploration

3. Principal Component Analysis for Employment Related Variables

Fitting the Multiple Linear Regression Model

1. Full Model

2. Exploration on Transformation

3. Variable Selection

4. Exploration on Interaction

5. Model Comparison

6. Interpretation

7. Diagnostics

Evolution

Appendix

Appendix A. Code

Appendix B. ANOVA Table for Full Model

Appendix C. Exhaustive Search Result

Appendix D. Stepwise Process

Background and Problem Statement

The number of people without health insurance coverage in the United States is one of the primary concerns raised by advocates of health care reform. A person without health insurance is commonly termed uninsured. According to the United States Census Bureau, the percentage of the non-elderly population who are uninsured has been generally increasing since the year 2000.

The causes of this uninsured rate remain a matter of political debate. Americans may be uninsured because their job does not offer insurance, because they are unemployed and cannot pay for insurance, or because they are financially able to buy insurance but consider the cost prohibitive. Other factors that may influence the health insurance coverage rate include age, education level, race and sex.

To better understand the factors influencing the health insurance coverage rate and to make predictions about it, we set up a multiple linear regression model that can be used to predict the uninsured population given demographic and geographical information for Georgia.

Income can be expected to influence people's decision on whether to purchase health insurance. The age structure may also indicate, to some extent, what the uninsured rate looks like: young people who believe they are healthy and that money should not be spent on healthcare are less likely to have health insurance. In addition, education, race, gender and even macroeconomic indicators such as the unemployment rate may all have a significant relation with the uninsured rate.

Data for Modeling

1. Data Source

The data is from American FactFinder [1]. County-level data for the state of Georgia is extracted from three Health Insurance Coverage Status tables (2010 ACS 3-year estimates, 2011 ACS 3-year estimates, 2012 ACS 5-year estimates [2]) and three Income in the Past 12 Months tables (2010 ACS 3-year estimates, 2011 ACS 3-year estimates, 2012 ACS 5-year estimates). In the health insurance tables, population data is collected and categorized by subject, such as Age, Race and Sex.

We extract data from the tables and choose the following variables:

Pop: total civilian noninstitutionalized population [3] (for simplicity, we refer to it as the total population in the following). All data collected is for the civilian noninstitutionalized population.

[1] American FactFinder is a web site used to distribute data collected by the United States Census Bureau.

[2] The description is available at http://www.census.gov/acs/www/guidance_for_data_users/estimates/

[3] People 16 years of age and older residing in the 50 States and the District of Columbia who are not inmates of institutions (penal, mental facilities, homes for the aged), and who are not on active duty in the Armed Forces.

Uninsure: uninsured population/total population. This is the response in our model.

Age_18_64 (%): population aged 18 to 64 years/total population. We exclude the population under 18 because their insurance is usually covered by their families, and the population aged 65 and over because they qualify for the government healthcare program (Medicare). A better variable might be the population aged 19 to 25 years, because this group usually does not have a high income and is more likely to ignore the importance of health insurance. However, data for this group is not complete, so we use Age_18_64 instead.

HgSch_below (%): proportion of the population 25 years and older with less than a high school diploma.

Unem (%): unemployed population/total labor force.

LaFrc (%): population in the labor force/population 18 years and older.

FT (%): population that worked full time in the past 12 months/population 18 years and older.

NonFT (%): population that worked less than full time in the past 12 months/population 18 years and older.

NoWork (%): population that did not work/population 18 years and older.

NtBrn (%): native-born population/total population.

Female (%): female population/total population.

Black (%): Black or African American alone population/total population.

Inc2Pov_High (%): population with a ratio of income to poverty level of 2.00 and over in the past 12 months/total population for whom poverty status is determined. A higher value indicates a better financial condition.

Income (dollars): median household income.

Yr: categorical variable used to capture differences in the data due to the three different surveys. 1 stands for the 2012 ACS 5-year estimates, 2 for the 2011 ACS 3-year estimates and 3 for the 2010 ACS 3-year estimates.

Data is available for all 159 counties in Georgia in the 2012 ACS 5-year estimates table but covers only 92 counties in the other two tables. After removing some observations with missing data, we have 309 observations.

2. Data exploration

There are five points that should be mentioned about the data.

a) Correlated variables. The variables Unem, LaFrc, FT, NonFT and NoWork are all related to employment; in particular, FT + NonFT + NoWork = 1. Interestingly, the correlations among FT, NonFT and NoWork are not as strong as the correlations between each of these three variables and LaFrc. TABLE 1 shows the correlation coefficients. At this point we do not choose one variable out of the five. Instead, we use Principal Component Analysis (PCA) later to deal with them. Correlations also exist among other variables, so we need to be cautious in modeling and variable selection.

Correlation     Unem    LaFrc       FT    NonFT   NoWork
Unem           1.000   -0.124   -0.435    0.084    0.285
LaFrc         -0.124    1.000    0.831    0.549   -0.961
FT            -0.435    0.831    1.000    0.074   -0.812
NonFT          0.084    0.549    0.074    1.000   -0.641
NoWork         0.285   -0.961   -0.812   -0.641    1.000

TABLE 1. Correlation between Employment Variables

b) Outlier detection. Observation 50 (highlighted in red in FIGURE 1) is considered an outlier. One reason is that, judging from both plots in FIGURE 1, it lies far from the other points in Uninsure but is not unusual in the other variables, which means it is abnormal only in the response variable. The other reason is that the observation is for Echols County, which is the only place in Georgia of banishment for many of Georgia's criminals. Although no confirmed causal link connects this fact to the abnormally high uninsured proportion, we prefer to remove the observation before modeling. This leaves 308 observations for modeling.

FIGURE 1. Scatter Plot with Outlier in Red

c) Total population. The total population is highly skewed (see FIGURE 2) and is already taken into account in the other data, since it is the denominator of most percentage variables. We therefore prefer not to add it to the model as a predictor.

FIGURE 2. Population Distribution

d) Model family. The data is not suitable for Binomial or Poisson regression because no cross-tabulated count data is available. For example, no population count is available for females who were between 18 and 64 and worked full time in the last 12 months. Given the restricted data availability, multiple linear regression is the most practical choice.

e) Nonlinearity. In the right-hand scatter plots in FIGURE 1, only the plots for HgSch_below and Inc2Pov_High show a clear pattern between predictor and response. HgSch_below shows a curved positive trend with Uninsure, and Inc2Pov_High shows a nonlinear negative trend. The plots indicate that some data transformation may be needed. After several tries, a log transformation of both Uninsure and HgSch_below appears to improve the linearity between them, and the ladder plot (FIGURE 3) supports this. The index at the top of the plot is the power applied to HgSch_below, with 0 meaning log transformation; the index on the right is the power applied to Uninsure. In the plot, log(Uninsure) vs log(HgSch_below) is the most linear. We try a regression with the transformed data in the next section. No effective transformation was found for the other variables.

FIGURE 3. Summary of Transformations with Different Powers
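The constraint noted in point a), FT + NonFT + NoWork = 1, makes the three variables exactly collinear once an intercept is in the model. The following is a minimal sketch on simulated data (the values and the response are made up, not the project's data) of how `lm` reacts in that situation: it cannot estimate all three coefficients and returns NA for the aliased one.

```r
# Simulated shares in percent (hypothetical values, not the project's data).
set.seed(4)
n <- 50
FT     <- runif(n, 30, 50)
NonFT  <- runif(n, 10, 25)
NoWork <- 100 - FT - NonFT          # the three shares sum to 100%
y      <- 20 - 0.2 * FT + rnorm(n)  # made-up response

fit <- lm(y ~ FT + NonFT + NoWork)

# NoWork is a linear combination of the intercept, FT and NonFT,
# so its coefficient cannot be estimated and is returned as NA.
aliased <- names(which(is.na(coef(fit))))
print(aliased)
```

This is one reason for not entering all five employment variables directly, and it motivates replacing them with a small number of principal components.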


3. Principal Component Analysis for Employment Related Variables

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The transformation is defined so that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. Principal components are guaranteed to be independent if the data set is jointly normally distributed.

In our data, five predictors describe the employment status and working experience of people in each county of Georgia: Unem, LaFrc, FT, NonFT and NoWork. FIGURE 4 shows the relationships between these predictors. Several pairs show a clear linear relationship, which indicates that they are highly correlated. To avoid including too many correlated predictors and to simplify the model, we apply PCA to convert the observations of these five variables into a set of uncorrelated principal components, and use the influential principal components as new predictors in the regression analysis.

After applying the PCA function in R, we find that the first two principal components explain about 93.8% (0.9382964) of the variability, as shown in FIGURE 5. We therefore use them as new predictors, named work1 and work2, since they represent factors concerning employment and working status. The linear combinations of the five original predictors (which are standardized) and their weights (loadings) in work1 and work2 are:

work1 = -0.613 LaFrc - 0.422 FT - 0.208 NonFT + 0.630 NoWork

work2 = 0.355 Unem - 0.584 FT + 0.715 NonFT - 0.128 NoWork

FIGURE 4. Scatter Plot for Unem, LaFrc, FT, NonFT and NoWork

FIGURE 5. Proportion of Variance
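The PCA step described above can be sketched in R as follows. This is only an illustration on simulated data: the data frame `emp` and its values are made up, and `prcomp` with `scale. = TRUE` is used here as one way to standardize the variables before extracting components (the appendix code uses `princomp`). The sketch shows where the proportion of variance and the scores work1 and work2 come from.

```r
# Simulated stand-in for the five employment variables (hypothetical values).
set.seed(1)
n <- 100
LaFrc  <- rnorm(n, 60, 5)
FT     <- 0.6 * LaFrc + rnorm(n, 0, 2)
NonFT  <- 0.3 * LaFrc + rnorm(n, 0, 2)
NoWork <- 100 - FT - NonFT
Unem   <- rnorm(n, 9, 2)
emp <- data.frame(Unem, LaFrc, FT, NonFT, NoWork)

# PCA on centered and standardized variables.
pca <- prcomp(emp, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the components.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Scores are the standardized data multiplied by the loadings (rotation matrix).
scores_manual <- scale(emp) %*% pca$rotation

# The first two score columns play the role of work1 and work2.
work1 <- pca$x[, 1]
work2 <- pca$x[, 2]
```

The sign of each component is arbitrary, so loadings from another run or another function may come out with all signs flipped without changing the fit.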

The data is now ready for modeling. We carry the variables Age_18_64, HgSch_below, NtBrn, Black, Inc2Pov_High, Yr, Female, work1, work2 and Income, together with the response variable Uninsure, into the next section.

Fitting the Multiple Linear Regression Model

1. Full Model

We first include all variables in the model; TABLE 2 shows the estimates. According to the p values, Age_18_64, NtBrn, Inc2Pov_High, Female, work1 and Income have a significant effect on the response at a significance level of 0.001, while the coefficients of HgSch_below, Black, Yr and work2 are not significant. This means that either education level is not an important factor or its effect is absorbed by other correlated predictors. Black and work2 might not be effective predictors for similar reasons. Yr is used as an indicator of the possible impact on the regression of the different approaches used to collect the data. Its insignificance suggests that we can drop the indicator and do not need to worry about this hard-to-control factor.

Coefficients       Estimate      Std. Error    p value
Intercept           1.265e+02    1.015e+01     < 2e-16
Age_18_64          -2.693e-01    6.565e-02     5.32e-05
HgSch_below         3.432e-04    3.861e-02     0.992913
NtBrn              -2.915e-01    3.998e-02     2.82e-12
Black              -6.016e-03    1.022e-02     0.556413
Inc2Pov_High       -1.829e-01    4.693e-02     0.000121
as.factor(Yr)2     -1.234e-01    3.299e-01     0.708713
as.factor(Yr)3     -2.908e-01    3.514e-01     0.408615
Female             -8.332e-01    1.193e-01     1.91e-11
work1              -1.864e-01    2.506e-02     1.13e-12
work2              -3.794e-03    3.372e-02     0.910496
Income             -2.166e-06    3.806e-07     3.03e-08

TABLE 2. Summary of Full Model

The ANOVA table in Appendix B shows that, with all preceding predictors already in the model, HgSch_below and Black still explain some of the variation in the response, although their individual p values in the summary are large.

2. Exploration on Transformation

In the data exploration we saw that a log transformation of Uninsure and HgSch_below may improve linearity, so we apply the log transformation to both and run the regression again. Because the residual variance of this model is not constant (see FIGURE 6), we do not adopt the transformation. Other data transformations were also tried, but none of them improves the model fit and some violate the constant variance assumption, so we give up on data transformation.

FIGURE 6. Residuals in model with log transformation for Uninsure and HgSch_below

3. Variable Selection

We select variables with three approaches: exhaustive search with Mallows' Cp, stepwise regression with AIC, and LASSO with Mallows' Cp.

All three approaches remove the same variables from the model, namely HgSch_below, Black, Yr and work2, which happen to be the four with insignificant coefficients in the full model. FIGURE 7 shows that Cp reaches its minimum after 10 steps, when Age_18_64, NtBrn, Inc2Pov_High, Female, work1 and Income are in the model (see TABLE 3 for the sequence of moves). Outputs for the exhaustive search and the stepwise regression are in Appendix C and Appendix D.

FIGURE 7. LASSO Step and corresponding Cp

Step   Var   Variable
1      5     Inc2Pov_High
2      2     HgSch_below
3      3     NtBrn
4      10    Female
5      9     Income
6      7     work1
7      1     Age_18_64
8      4     Black
9      6     Yr
10     8     work2

TABLE 3. Sequence of LASSO moves

After model selection, we decide to include Age_18_64, NtBrn, Inc2Pov_High,

Female, work1 and Income into the model (Model 1).
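As an illustration of the stepwise part of this selection, the sketch below runs backward selection with `step()` on simulated data. The data frame `sim`, its variables `x1` to `x4` and the coefficients are invented for the example; only `step()` itself, which selects by AIC by default, corresponds to what the appendix code calls.

```r
# Simulated data: y depends on x1 and x2 only; x3 and x4 are pure noise.
set.seed(5)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 10 + 2 * sim$x1 - 3 * sim$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3 + x4, data = sim)

# Backward stepwise selection; step() uses AIC by default (penalty k = 2).
selected <- step(full, direction = "backward", trace = 0)

# Formula of the model that step() settled on.
print(formula(selected))
```

Because AIC penalizes each extra coefficient only mildly, a noise variable can occasionally survive the selection, which is one reason to cross-check with other criteria as done in this section.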

4. Exploration on Interaction

The variable HgSch_below is dropped, although in the data exploration it showed an obvious relation with the response Uninsure. We do not want to run a stepwise regression with this variable forced to stay in the model, since that comes at the cost of model quality. Instead, we ask whether interaction effects between HgSch_below and the other variables should be considered. In the next step, we include the interactions between HgSch_below and all other variables and obtain a stepwise regression model (Model 2) with residual sum of squares = 1470.15 on 294 degrees of freedom.

Coefficients                  Estimate      Std. Error    p value
Intercept                      2.054e+02    2.963e+01     2.65e-11
Age_18_64                     -7.309e-01    2.190e-01     0.000953
NtBrn                         -6.005e-01    1.181e-01     6.57e-07
Female                        -1.430e+00    3.678e-01     0.000125
work1                         -3.169e-02    8.249e-02     0.701121
Income                        -4.527e-06    1.175e-06     0.000142
Inc2Pov_High                   1.571e-01    1.672e-01     0.348378
HgSch_below                   -4.060e+00    1.388e+00     0.003718
Age_18_64:HgSch_below          2.645e-02    1.201e-02     0.028454
NtBrn:HgSch_below              1.468e-02    5.922e-03     0.013743
Female:HgSch_below             2.828e-02    1.706e-02     0.098480
work1:HgSch_below             -6.987e-03    3.852e-03     0.070756
Income:HgSch_below             1.292e-07    5.923e-08     0.029967
Inc2Pov_High:HgSch_below      -1.667e-02    7.642e-03     0.029991

Adjusted R-squared: 0.6731

TABLE 4. Summary of Model 2

In TABLE 4, the standard errors for Female:HgSch_below and work1:HgSch_below are large relative to their estimates, and their p values (0.098 and 0.071) indicate that these coefficients are not significant at the 0.05 level. We therefore consider removing these two terms to obtain a reduced model with residual sum of squares = 1504.34 on 296 degrees of freedom.

We compare the stepwise regression model with interactions (Model 2) and its reduced model with a hypothesis test:

$H_0$: the reduced model captures the information vs. $H_1$: the reduced model loses information.

$$F = \frac{(RSS_{\mathrm{reduce}} - RSS_{\mathrm{model\,2}}) / (df_{\mathrm{reduce}} - df_{\mathrm{model\,2}})}{RSS_{\mathrm{model\,2}} / df_{\mathrm{model\,2}}}$$

follows an F(2, 294) distribution under the null hypothesis. Here

$$F = \frac{(1504.34 - 1470.15) / (296 - 294)}{1470.15 / 294} = 3.419.$$

The p value is 0.03, so we can reject the null hypothesis and choose Model 2.
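The arithmetic of this test can be checked with a few lines of R. The sketch below uses only the residual sums of squares and degrees of freedom quoted in the text; the variable names are chosen for the example.

```r
# Residual sums of squares and degrees of freedom as reported in the text.
rss_reduced <- 1504.34; df_reduced <- 296
rss_model2  <- 1470.15; df_model2  <- 294

# Partial F statistic for dropping the two interaction terms.
f_value <- ((rss_reduced - rss_model2) / (df_reduced - df_model2)) /
  (rss_model2 / df_model2)

# Upper-tail probability under the F(2, 294) distribution.
p_value <- pf(f_value, df_reduced - df_model2, df_model2, lower.tail = FALSE)

print(round(f_value, 3))
print(round(p_value, 3))
```

The same comparison is obtained directly from two fitted `lm` objects with `anova(reduced, full, test = "F")`.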

5. Model Comparison

So far we have found two candidates for the final model. Model 1 is the reduced model obtained by model selection without any interaction. Model 2 is the stepwise model with interaction terms. TABLE 4 and TABLE 5 list their coefficients and summary statistics. The adjusted R-squared of Model 2 (0.6731) is slightly larger than that of Model 1 (0.6626). The residual plots show that Model 2 does not improve on Model 1 with respect to the constant variance and normality assumptions.

For prediction, we prefer the smaller model to avoid overfitting, so we choose Model 1 as our final model (see TABLE 5 for the model summary):

Uninsure = 1.288e+02 - 2.826e-01*Age_18_64 - 2.904e-01*NtBrn - 1.820e-01*work1 - 8.731e-01*Female - 2.137e-06*Income - 1.810e-01*Inc2Pov_High

Coefficients     Estimate      Std. Error    t value    p value
Intercept         1.288e+02    7.531e+00     17.098     < 2e-16
Age_18_64        -2.826e-01    5.531e-02     -5.110     5.73e-07
NtBrn            -2.904e-01    3.950e-02     -7.351     1.87e-12
Inc2Pov_High     -1.810e-01    4.234e-02     -4.275     2.57e-05
Female           -8.731e-01    9.337e-02     -9.351     < 2e-16
work1            -1.820e-01    2.426e-02     -7.501     7.17e-13
Income           -2.137e-06    3.513e-07     -6.084     3.56e-09

Adjusted R-squared: 0.6626

TABLE 5. Summary of Final Model (Model 1)


6. Interpretation

The standard errors in our final model are small relative to the estimates, which indicates that the estimates are reliable. All coefficients in the model are significant, with very small p values. In other words, all of these predictors have a significant influence on the estimated uninsured rate.

Since all the coefficients are negative, the uninsured rate decreases as any of the following increases: the proportion of people aged between 18 and 64, the proportion of native-born people, the proportion of the population with a ratio of income to poverty level of 2.00 and over, the proportion of females, the employment principal component work1, and the median household income of the county. The variable work1 equals -0.613 LaFrc - 0.422 FT - 0.208 NonFT + 0.630 NoWork. With these loading signs, work1 increases when NoWork increases or when LaFrc, FT and NonFT decrease, so work1 as reported is inversely related to labor-market activity, and its coefficient should be read with that orientation in mind (the sign of a principal component is arbitrary). In general, the income-related predictors in the model indicate that a better financial condition is associated with a lower uninsured rate.

We use Fulton County's data from the 2012 5-year estimates to see how the model works for prediction. The predicted value is 17.67, with a 95% prediction interval of [13.15001, 22.19856]. The interval is quite wide. To predict the uninsured population for the whole state of Georgia, we can obtain the predicted uninsured rates of the individual counties simultaneously and sum up the resulting uninsured populations. To get simultaneous prediction intervals, we can use the Bonferroni method to control the overall significance level.
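The Bonferroni idea mentioned above can be sketched with `predict()`. The example uses simulated data: the data frames `train` and `new_counties`, the predictors `x1` and `x2` and all values are hypothetical stand-ins for the county data. Each of the m intervals is computed at level 1 - alpha/m so that the joint coverage is at least 1 - alpha.

```r
# Simulated training data (hypothetical; not the county data).
set.seed(2)
n <- 200
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$y <- 20 - 2 * train$x1 + 1.5 * train$x2 + rnorm(n, sd = 2)
fit <- lm(y ~ x1 + x2, data = train)

# m new observations to predict simultaneously (made-up predictor values).
new_counties <- data.frame(x1 = c(-1, 0, 1), x2 = c(0.5, 0, -0.5))
m <- nrow(new_counties)
alpha <- 0.05

# Individual 95% prediction intervals.
pi_single <- predict(fit, new_counties, interval = "prediction",
                     level = 1 - alpha)

# Bonferroni: each interval at level 1 - alpha/m, so that all m intervals
# hold jointly with probability at least 1 - alpha.
pi_bonf <- predict(fit, new_counties, interval = "prediction",
                   level = 1 - alpha / m)
```

The point predictions are unchanged; only the intervals widen, and they widen further as the number of counties predicted at once grows.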

7. Diagnostics

We use the residual plots in FIGURE 8 to check the assumptions of the multiple linear regression model. First, we assume constant variance. The Residuals vs Fitted plot and the Standardized Residual plot show no clear pattern in the distribution of the residuals, which suggests that the constant variance assumption is not violated. The absence of a pattern in the residual plot also supports linearity. However, we suspect that slight dependence may exist in the data, although we have no efficient way to check it. Another assumption is that the errors are normally distributed. In the Q-Q plot, the points fall almost on the line except for a few values in the tails, which is acceptable in our case, so we conclude that the normality assumption holds. Finally, in the Cook's Distance plot all distances are close to zero, so we conclude that there are no influential outliers in the model.

FIGURE 8. Residual Analysis of Final Model
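The quantities behind these diagnostic plots can also be inspected numerically. The following sketch runs on simulated data (the data frame `sim` and its variables are invented), and the 4/n cutoff for Cook's distance is a common rule of thumb assumed here, not a threshold used in the report.

```r
# Simulated data and fit (hypothetical; stands in for the final model).
set.seed(6)
n <- 120
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 5 + 1.5 * sim$x1 - 0.8 * sim$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = sim)

# Standardized residuals and Cook's distances, the quantities that the
# residual plots display.
std_res <- rstandard(fit)
cooks   <- cooks.distance(fit)

# Shapiro-Wilk test as a numeric complement to the Q-Q plot.
sw <- shapiro.test(residuals(fit))

# Observations above the 4/n rule of thumb (an assumed cutoff)
# deserve a closer look.
flagged <- which(cooks > 4 / n)
```

A numeric check like this complements the plots but does not replace them, since patterns such as curvature or a funnel shape are easier to see than to test.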

Evolution

The data used for modeling contains only demographic and geographical information. Other factors, such as the prices of insurance products, are also very important and should be included in the model in future study.

In this project we use a linear regression model for prediction. To obtain a narrower prediction interval, Poisson or Binomial regression would be a better approach if cross-tabulated count data were available.

The results of PCA depend on the scaling of the variables. To make a prediction with the PCA variable work1 in our model, we first have to scale the variables Unem, LaFrc, FT, NonFT and NoWork before computing the value of work1. This step is problematic because the mean and standard deviation used to scale the training data do not take the new data into account, while updating the mean and standard deviation would change the coefficients of the predictors. How to use PCA in prediction deserves further study.
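One common convention for this problem, shown here as an assumption rather than as what the project did, is to keep the training mean and standard deviation fixed and apply them unchanged to new observations, so that the regression coefficients stay valid. With `prcomp`, `predict()` does exactly this. The data and variable names below are simulated and hypothetical.

```r
# Simulated training data with correlated columns (hypothetical values).
set.seed(3)
train <- data.frame(a = rnorm(50, 10, 2), b = rnorm(50, 50, 10))
train$c <- train$a + 0.1 * train$b + rnorm(50)

pca <- prcomp(train, center = TRUE, scale. = TRUE)

# A new observation to be scored (made-up values).
new_obs <- data.frame(a = 12, b = 45, c = 17)

# predict() reuses the training center and scale stored in the prcomp object.
score_new <- predict(pca, newdata = new_obs)

# The same computation by hand: standardize with training statistics,
# then multiply by the loadings.
z <- scale(as.matrix(new_obs), center = pca$center, scale = pca$scale)
score_manual <- z %*% pca$rotation
```

Under this convention the first column of `score_new` would be the value passed to the regression model in place of work1.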


Appendix

Appendix A. Code

############## Raw data is Data_Group8.csv ##
test=read.csv("Data_Group8.csv",header=TRUE)

############## First check of data ################
# Color the observation with the largest Uninsure value (the outlier) in red
outcol=ifelse(test$Uninsure==max(test$Uninsure), "red", "black")
plot(test$Uninsure, col=outcol, ylab="Uninsure")

# Scatter plot of Uninsure against each candidate predictor
par(mfrow=c(4,3))
vars=c("Age_18_64","HgSch_below","Unem","LaFrc","FT","NonFT","NoWork",
       "NtBrn","Female","Black","Inc2Pov_High","Income")
for (v in vars) {
  plot(test[[v]], test$Uninsure, col=outcol, xlab=v, ylab="Uninsure")
}

############## Remove the outlier Obs 50 from data #######
# Show the row with the largest Uninsure value
test[which(test$Uninsure==max(test$Uninsure)),]
# Drop row 50 and the first column
data=data.frame(test[-50,-1])
attach(data)

############## PCA ###########################
############## Check the collinearity #########
testpca=cbind(Unem,LaFrc,FT,NonFT,NoWork)
round(cor(testpca),3)


testpcam= as.data.frame(testpca)
plot(testpcam)

############## Start PCA #####################
# center= and scale= are prcomp() arguments, not princomp() arguments.
# cor=TRUE runs the PCA on standardized variables, as described in the report.
first<-princomp(testpcam,cor=TRUE)
summary(first)
plot(first)
first$loadings
first$scores
first$scores[,1]
first$scores[,2]

############# Add scores in the table ########
data$work1 <- first$scores[,1]
data$work2 <- first$scores[,2]

############# Reset data frame. data1 is the dataset used for modeling #####
# Drop columns 4 to 8 of data
data1=data[,-4:-8]
attach(data1)

############# Regression ###########
### Regression with PCA scores
out=lm(Uninsure~Age_18_64+HgSch_below+NtBrn+
Black+Inc2Pov_High+as.factor(Yr)+Female+work1+work2+Income)
summary(out)
par(mfrow=c(2,2))
plot(out,which=c(1,2,3,4))

############## Data transformation ##
library(HH)
ladder(Uninsure~HgSch_below , data=data1)
ladder(Uninsure~Inc2Pov_High , data=data1)
Uni<-log(Uninsure)
HS<-log(HgSch_below)
out2=lm(Uni~Age_18_64+HS+NtBrn+Black+Inc2Pov_High+work1+work2+Income+Female+as.factor(Yr))
plot(out2,which=c(1,2,3,4))
summary(out2)

############## Model selection ######
############## Exhaustive Search #####
library(leaps)
x=cbind(Age_18_64,HgSch_below,NtBrn,Black,Inc2Pov_High,as.factor(Yr),work1,work2,Income,Female)
outmc=leaps(x,Uninsure,method="Cp",nbest=2)
outmc
which(outmc$Cp==min(outmc$Cp))
# Outcome: exclude HgSch_below, Black, Yr, work2


############# Stepwise ##############
outstep=step(out)
# outstep is the Model 1 in report
summary(outstep)
plot(outstep,which=c(1,2,3,4))
# Outcome: exclude HgSch_below, Black, Yr, work2

############# LASSO #################
library(lars)
predictor=scale(x)
uniscale=scale(Uninsure)
outla=lars(x=predictor,y=uniscale)
outla
par(mfrow=c(1,1))
plot(outla)
outla$Cp
plot.lars(outla,xvar="df",plottype="Cp")
# Outcome: exclude HgSch_below, Black, Yr, work2

############## Model with interactions ##################
outi=lm(Uninsure~(Age_18_64+NtBrn+Black+Female+work1+work2+Income+
Inc2Pov_High+as.factor(Yr))*HgSch_below)
par(mfrow=c(2,2))
plot(outi,which=c(1,2,3,4))
summary(outi)
outistep=step(outi)
plot(outistep,which=c(1,2,3,4))
summary(outistep)
# outistep is the Model 2 in report

############# Reduced model with interactions ############
outire=lm(Uninsure~Age_18_64+NtBrn+Female+work1+Income+Inc2Pov_High+HgSch_below+
Age_18_64:HgSch_below+NtBrn:HgSch_below+Income:HgSch_below+Inc2Pov_High:HgSch_below)
par(mfrow=c(2,2))
plot(outire,which=c(1,2,3,4))
# Partial F test of the reduced model against Model 2, as in the report
anova(outire, outistep, test = "F")

############ Prediction with Final model #########
############ Final model is outstep ##############
####### Use Fulton as an example
newdata = data.frame(Age_18_64=67.04666, NtBrn=87.0677, Inc2Pov_High=66.8271,
Female=51.46529, work1=-13.71593, Income=5766400)
predict(outstep, newdata, interval="prediction")

Appendix B. ANOVA Table for Full Model

Appendix C. Exhaustive Search Result

Appendix D. Stepwise Process