Income Determinants: An Empirical Analysis

Ryan T. Anzalone Wabash College

May 2014

Abstract: How do different personal factors affect the wage you will earn as a working adult? Many have asked this question, but I look deeper to see how age, educational attainment, gender, and length of time as a United States resident affect a person’s predicted wage. This paper uses data from IPUMS USA to test the effects of these variables. After regression, I found that age affects income through a quadratic model, while educational attainment increases predicted income through an exponential model.

Table of Contents

I. Introduction
II. Literature Review
III. Empirical Analysis
    Data
    Results
IV. Conclusions
V. Tables
VI. Bibliography

I. Introduction

Income isn’t randomly assigned to people; it is earned through hard work along with many other factors. I would like to see what those other factors are and how strongly they affect income. No single factor guarantees a high income, but we can still look for positive or negative correlations in large data sets for variables such as urban/rural status, age, gender, age at first marriage, times married, race, education level, public/private school education, industry of employment, veteran status, and number of years lived in the US.

I think that age, educational attainment, and public/private school education will have a significant influence on predicted wage. I expect higher predicted wages for older people because they likely have more experience and hold higher-paying jobs. I think people with more education will also earn more because they can work in more complex jobs, which usually pay higher wages. I also think that private school education is superior to public school education, and that this education gap will translate into an income gap. I will run regressions and report the findings to see whether these predictions are correct.

There has been extensive research on the correlations between individual characteristics and income. What I hope to achieve with this paper is to test some of those findings and expand upon them. I doubt I am the first person to look at these correlations and run these regressions, but little has been published on most of these variables. Gender, education, and family situation all have well-tested correlations, but what about the number of times a person has married, or the number of years a person has lived in the US? Before conducting any regressions, I believe these data sets hold meaningful results that will shed some light on income and its determinants. According to Lewis Solmon, there are “statistically significant differences in the predicted earnings of graduates from the various types of colleges” (Solmon, 75). I believe there will be statistically significant differences in predicted earnings for many other factors as well.

II. Literature Review

An article in the Public Administration Review, Unequal Pay: The Role of Gender, presents the findings of Mohamad Alkadry on the difference in pay between males and females. He starts by acknowledging that confounding variables may lie behind the common perception that men make more than women. Women may only seem to make less because jobs popular among women are lower-paying in general, regardless of gender. “A 2003 study by the General Accounting Office found that women earned 79.9 percent of what men earned” (Alkadry, 888). Alkadry thought a more refined study could look at the pay of individuals who hold comparable positions at comparable agencies, and his study is designed to address the question of confounding variables. The table below was taken directly from his published paper. Note that this table only shows the pay disparities in procurement positions; other types of positions may show different results. The table shows that, even when many confounding variables are controlled for, there is still a pay disparity between men and women. It also shows that some jobs (Senior Buyers in procurement positions, in this case) don’t have significant pay gaps at all.

This idea that controlling for confounding variables gives us a clearer picture of a research question is the main driver behind my empirical paper. I’m going to look at many different factors behind income disparities and try to find omitted variables whose inclusion will sharpen my regressions.

III. Empirical Analysis

The data set used in this empirical paper was obtained via ipums.org and the 2012 Census. The variables I chose to include are listed below.

This is a sample of 1,429,901 people from a 2012 census of the United States. The minimum age of any person in this sample is 18, and the maximum is top-coded at 95, meaning that anyone over the age of 95 is listed as 95 to preserve anonymity. The summary statistics for the variable age can be found in Table 1.

The variable marrno represents each person’s number of marriages, which ranges from 0 to 3. Although it is not specifically stated, it is safe to assume that this variable is top-coded in a similar way to age, so that people with more than 3 marriages are listed at 3. The summary statistics for the variable marrno can be found in Table 1.

In the process of breaking down these variables, I found it useful to create a new dummy variable to better represent the race variable. Using the Stata command gen, I created a new variable with values 0 and 1. This variable, called white, can be seen along with its summary statistics in Table 1.
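The recoding that gen performs can be sketched outside Stata as well. The Python snippet below is only an illustration: the race labels and the five-person sample are made up, and the helper name make_dummy is hypothetical rather than anything taken from the IPUMS extract.

```python
def make_dummy(values, target):
    """Return a 0/1 indicator list: 1 where the value equals target."""
    return [1 if value == target else 0 for value in values]


# Hypothetical race labels for five people (not real IPUMS codes).
race = ["white", "black", "asian", "white", "other"]

white = make_dummy(race, "white")

# The mean of a 0/1 dummy is the share of the sample in that group.
share_white = sum(white) / len(white)
print(white, share_white)
```

Because a dummy only takes the values 0 and 1, its mean is the share of the sample in that group, which is how the mean of 0.786 reported for white in Table 1 can be read.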

The variable YRSUSA1 represents the number of years each person has lived in the United States. Summary statistics can be found in Table 1. The variable educ represents the level of education each person has completed. Some notable values for educ are Grade 9 (educ=3), Grade 12 (educ=6), Bachelor’s Degree (educ=10), and any additional education past the undergraduate level (educ=11). The summary statistics can be found in Table 1.

There are some limitations with this data due to missing information and unrefined variables. The IPUMS data had a variable called schltype, which I thought would record the type of school each person attended, but it turned out to cover only current enrollment, so I dropped it from the data set. Because this was the only available IPUMS data on school type, I can’t look at the effects of different types of school on income.

I would like to start my regressions with the following model of predicted incwage.

Model 1

Predicted incwage = β0 + β1(age)

From this regression’s output (Table 3), we can see that as age increases by 1 year, predicted wage increases by $754.91. At first glance the result seems logical, but is it possible that omitted variables are biasing it? Other variables that are strongly correlated with age, and that also affect wages, can bias the age coefficient in the above regression. To find these correlations, I will regress age on different variables. One such output can be found in Table 2.

Each additional marriage is associated with an increase in expected age of 10.22 years. That’s a strong correlation, so our initial regression is likely biased. A new regression which adds times married follows this model:

Model 2

Predicted incwage = β0 + β1(age) + β2(times married)

As you can see in Table 3, the coefficient on age went down substantially. Instead of increasing predicted wage by $754.91, an extra year of age now increases it by only $632.90. A completely unrestricted regression comes next, and it takes the form:

Model 3

Predicted incwage = β0 + β1(age) + β2(male) + β3(marrno) + β4(educ) + β5(YRSUSA1) + β6(white)

This regression has some interesting results (Table 3). The coefficient of 509.48 on age means that for every additional year of age, a person’s predicted wage goes up by $509.48, holding the other variables fixed. Additionally, being male increases predicted wage by $20,050.74, and each additional marriage increases it by $5,050.31. Other variables which increase predicted wage are educ ($8,066.57 per level) and YRSUSA1 ($243.04 per year).
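The drop in the age coefficient between Models 1 and 2 is what the standard omitted-variable-bias identity predicts: the short-regression coefficient equals the long-regression coefficient plus the coefficient on the omitted variable times the slope from regressing the omitted variable on age. The sketch below checks this with the rounded figures reported in Tables 1, 2, and 3. Since Table 2 reports age regressed on times married rather than the reverse, the needed slope is backed out from the reported standard deviations, which is an assumption of this sketch, and rounding means the match can only be approximate.

```python
# Rounded values as reported in this paper's tables.
AGE_COEF_MODEL_1 = 754.908      # Table 3, Model 1 (marrno omitted)
AGE_COEF_MODEL_2 = 632.904      # Table 3, Model 2 (marrno included)
MARRNO_COEF_MODEL_2 = 4424.356  # Table 3, Model 2
SLOPE_AGE_ON_MARRNO = 10.219    # Table 2
SD_AGE = 14.559                 # Table 1
SD_MARRNO = 0.743               # Table 1


def slope_marrno_on_age():
    """Back out the slope of marrno on age from the reverse regression.

    slope(age on marrno) = cov / var(marrno), so cov = slope * var(marrno),
    and slope(marrno on age) = cov / var(age).
    """
    covariance = SLOPE_AGE_ON_MARRNO * SD_MARRNO ** 2
    return covariance / SD_AGE ** 2


def implied_short_coefficient():
    """Age coefficient that Model 1 should show if the identity holds."""
    return AGE_COEF_MODEL_2 + MARRNO_COEF_MODEL_2 * slope_marrno_on_age()


implied = implied_short_coefficient()
gap = implied - AGE_COEF_MODEL_1
print(round(implied, 2), round(gap, 2))
```

With these inputs the implied Model 1 coefficient lands within a few dollars of the reported 754.908, which supports reading the change between the two models as omitted-variable bias.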

To see the relationships between these variables, I asked Stata to create some new variables for me. The Stata command egen mInc = mean(incwage), by(age) created a new variable called mInc, which stores the mean value of incwage at each age. A plot of mInc against age shows how mean wage varies across ages.
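What egen with by(age) does can be sketched in a few lines of Python: compute the mean of incwage within each age group and attach that group mean to every row. The rows below are made-up records, and the function names are hypothetical; only the variable names age, incwage, and mInc come from the paper.

```python
from collections import defaultdict


def mean_by_group(rows, group_key, value_key):
    """Return {group: mean of value_key within that group}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
        counts[row[group_key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


def attach_group_mean(rows, group_key, value_key, new_key):
    """Copy each row and add the group mean, as egen ... by() does."""
    means = mean_by_group(rows, group_key, value_key)
    return [{**row, new_key: means[row[group_key]]} for row in rows]


# Made-up sample: two people aged 30, two aged 50.
people = [
    {"age": 30, "incwage": 30000},
    {"age": 30, "incwage": 40000},
    {"age": 50, "incwage": 50000},
    {"age": 50, "incwage": 70000},
]

with_means = attach_group_mean(people, "age", "incwage", "mInc")
for row in with_means:
    print(row)
```

Every person of a given age receives the same mInc value, so plotting mInc against age yields one point per age.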

The parabolic shape of this curve suggests that we may want to consider age^2 as a possible explanatory variable for incwage. The Stata regression of incwage on this new variable, agesq = age^2, and age follows this form:

Model 4

Predicted incwage = β0 + β1(agesq) + β2(age)

The fitted model is:

Predicted incwage = -51.66(age^2) + 5161.63(age) - 72355.18

The above graph shows the predicted incwage values (blue dots) plotted against the mean incwage values at each age. For everyone under the age of 75, this is almost a perfect fit. To reduce the error in this model, I’m going to drop all observations with ages above 75.
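A downward-opening parabola has a single peak, so the fitted Model 4 coefficients imply an age at which predicted wage is highest. The short sketch below computes that turning point, -β2 / (2·β1) in the notation of Model 4, from the coefficients reported above. The function names are illustrative only.

```python
# Model 4 coefficients as reported in the text and Table 4.
B_AGESQ = -51.66
B_AGE = 5161.63
B_CONST = -72355.18


def predicted_incwage(age):
    """Predicted wage from the fitted quadratic in age."""
    return B_AGESQ * age ** 2 + B_AGE * age + B_CONST


def peak_age():
    """Age at which the fitted parabola reaches its maximum."""
    return -B_AGE / (2 * B_AGESQ)


print(round(peak_age(), 1))
for age in (18, 30, 50, 70):
    print(age, round(predicted_incwage(age), 2))
```

With the reported coefficients, the peak falls just under age 50, and predicted wage is lower on both sides of it.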

We also need to look at how educational attainment influences predicted incwage. Below is a scatter plot of mean income at each level of education.

It seems to follow an exponential curve. Creating a variable to represent 2^educ is the next step. A regression of incwage on this new variable produces the model:

Predicted incwage = 29.4*(2^educ) + 27,870
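To see what this exponential form implies, the fitted equation can be evaluated at the notable educ values mentioned earlier. The sketch below is a direct transcription of the reported coefficients into Python; the function name is illustrative.

```python
def predicted_incwage_educ(educ):
    """Predicted wage from the fitted model 29.4 * 2^educ + 27,870."""
    return 29.4 * (2 ** educ) + 27870


# Notable levels from the data description: Grade 9, Grade 12,
# Bachelor's Degree, and education past the undergraduate level.
for educ in (3, 6, 10, 11):
    print(educ, round(predicted_incwage_educ(educ), 2))

# Because the model is built on 2^educ, the gain from one more level
# doubles each time: the step from 10 to 11 is twice the step from 9 to 10.
step_9_to_10 = predicted_incwage_educ(10) - predicted_incwage_educ(9)
step_10_to_11 = predicted_incwage_educ(11) - predicted_incwage_educ(10)
print(round(step_9_to_10, 2), round(step_10_to_11, 2))
```

Under this model the early levels add very little to predicted wage, while the top levels account for most of the difference.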

When plotted against the mean values of incwage for each level of educational attainment, this model looks like a very good fit. The graph can be found below.

IV. Conclusion

After looking at these relationships between age, education, and income, it is time to revise the unrestricted model from before.

Predicted incwage = β0 + β1(age) + β2(male) + β3(marrno) + β4(educ) + β5(YRSUSA1) + β6(white)

With the new variables agesq and 2^educ (named expINCeduc), the regression will now look like this:

Model 5

Predicted incwage = β0 + β1(agesq) + β2(age) + β3(expINCeduc) + β4(educ) + β5(male) + β6(white) + β7(YRSUSA1) + β8(marrno)

The results of Model 5 are shown in Table 4.
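To read Model 5 as a prediction rule, the coefficients in Table 4 can be plugged into a small function. The sketch below does so in Python. The coefficient values are copied from the Model 5 column of Table 4, while the example person (a 40-year-old white male with a bachelor's degree, one marriage, and YRSUSA1 = 0) is hypothetical.

```python
# Model 5 coefficients as reported in Table 4.
MODEL_5 = {
    "agesq": -40.250,
    "age": 3976.387,
    "expINCeduc": 19.601,   # coefficient on 2^educ
    "educ": 2627.731,
    "male": 19329.381,
    "white": 4896.109,
    "YRSUSA1": 86.238,
    "marrno": 2472.004,
    "constant": -91351.13,
}


def predict_model_5(age, educ, male, white, yrsusa1, marrno):
    """Predicted incwage for one person under the reported Model 5."""
    features = {
        "agesq": age ** 2,
        "age": age,
        "expINCeduc": 2 ** educ,
        "educ": educ,
        "male": male,
        "white": white,
        "YRSUSA1": yrsusa1,
        "marrno": marrno,
    }
    total = MODEL_5["constant"]
    for name, value in features.items():
        total += MODEL_5[name] * value
    return total


# Hypothetical person: age 40, bachelor's degree (educ=10), male, white,
# YRSUSA1 = 0, married once.
example = predict_model_5(age=40, educ=10, male=1, white=1, yrsusa1=0, marrno=1)
print(round(example, 2))
```

Changing one argument at a time, for instance male from 1 to 0, shifts the prediction by exactly the corresponding coefficient, which is the "holding everything else fixed" interpretation of the table.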

To check this regression output for heteroskedasticity, I used the Breusch-Pagan (BP) test. The BP test checks whether the variance of the residuals is homogeneous. This homogeneity is the null hypothesis; the alternative hypothesis is that the variance of the residuals is not homogeneous. The test returned a p-value of 0.000, which means we must reject the null hypothesis. Heteroskedasticity is present here, so we have to use robust standard errors in the regression. Model 6 shows the regression output for the same variables as Model 5, except that it uses robust standard errors. Model 6 can be found in Table 4.
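For intuition about what the BP test measures, the sketch below implements a simplified one-regressor version in plain Python: fit the regression, regress the squared residuals on the same regressor, and multiply that auxiliary R-squared by the sample size. This n·R² form is the studentized (Koenker) variant of the test, not necessarily what Stata computed for Table 4, and the data are synthetic, built so that the error spread either grows with x or stays constant.

```python
def ols_simple(x, y):
    """Intercept and slope of a one-regressor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """R-squared of regressing y on x (with a constant)."""
    intercept, slope = ols_simple(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def breusch_pagan_lm(x, y):
    """LM statistic: n times R-squared of squared residuals on x."""
    intercept, slope = ols_simple(x, y)
    squared_resid = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    return len(x) * r_squared(x, squared_resid)


# Synthetic data: alternating-sign errors, deterministic for repeatability.
x = list(range(1, 51))
signs = [1 if i % 2 == 0 else -1 for i in range(len(x))]
y_hetero = [2 + 3 * xi + xi * s for xi, s in zip(x, signs)]  # spread grows with x
y_homo = [2 + 3 * xi + s for xi, s in zip(x, signs)]         # constant spread

lm_hetero = breusch_pagan_lm(x, y_hetero)
lm_homo = breusch_pagan_lm(x, y_homo)

# Under the null, LM is approximately chi-squared with 1 degree of
# freedom here; the 5% critical value is about 3.84.
print(lm_hetero > 3.84, lm_homo > 3.84)
```

A large statistic, as in the first synthetic series, corresponds to the near-zero p-value reported above and is the reason for switching to robust standard errors.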

With this new regression, we finally have a model that describes predicted income much better than the earlier ones: the R-squared rises from 0.044 in Model 1 to 0.245 in Model 6. The very large t-values for each of these variables show that they are all statistically significant, so the differences seen in predicted incwage with an increase or decrease in any of these variables are very unlikely to be due to chance alone, and the variables appear to have a true correlation with income. There isn’t much we can do with this information except encourage people to get more education, because the other variables are mostly out of a person’s control.

Tables

Table 1: Summary Statistics

Variable    Mean      S.D.      Min   Max
age         42.819    14.559    18    95
marrno      0.913     0.743     0     3
YRSUSA1     3.252     9.308     0     95
educ        7.561     2.27      0     11
male        0.513     0.499     0     1
white       0.786     0.41      0     1

Table 2: Age regressed on times married

age              Coefficient   Std. Err.   t-stat      p-value
Times married    10.219        0.013       733.42      0.000
constant         33.475        0.016       2,039.57    0.000

Table 3: Models 1, 2, 3

incwage regression        Model 1      Model 2      Model 3
Age                       754.908*     632.904*     509.483*
                          (2.95)       (3.47)       (3.20)
Times married                          4424.356*    5050.308*
                                       (66.37)      (60.90)
Male                                                20050.744*
                                                    (76.77)
Educational Attainment                              8066.571*
                                                    (17.04)
Years in the US                                     243.043*
                                                    (4.29)
white                                               3767.599*
                                                    (96.85)
constant                                            -58054.23*
                                                    (187.44)
r2                        0.044        0.047        0.204

Table 4: Models 4, 5, 6

incwage regression        Model 4       Model 5       Model 6
age^2                     -51.661*      -40.250*      -40.250*
                          (0.21)        (0.19)        (0.19)
age                       5161.633*     3976.387*     3976.387*
                          (17.81)       (17.21)       (16.07)
2^educ                                  19.601*       19.601*
                                        (0.11)        (0.14)
Educational Attainment                  2627.731*     2627.731*
                                        (32.89)       (29.38)
male                                    19329.381*    19329.381*
                                        (74.82)       (75.39)
white                                   4896.109*     4896.109*
                                        (94.40)       (79.47)
Years in the US                         86.238*       86.238*
                                        (4.23)        (4.94)
Times Married                           2472.004*     2472.004*
                                        (60.91)       (58.23)
Constant                  -72355.18*    -91351.13*    -91351.13*
                          (357.41)      (399.89)      (350.30)
r2                        0.084         0.245         0.245

Works Cited

Lewis Solmon, The Effects on Income of Type of College Attended, Sociology of Education, Volume 48 No. 1, pg 75-90. <http://www.jstor.org/stable/2112051>.

Mohamad Alkadry, Unequal Pay: The Role of Gender, Public Administration Review, Volume 66 No. 6, pg 888-898. <http://www.jstor.org/stable/4096605>.

"IPUMS-USA." Minnesota Population Center. University of Minnesota, n.d. Web. 08 May 2014. <https://usa.ipums.org/usa/>.
