homework i: stata guide - university of california, san...

Econ 120B Stata Guide Hw1 Claudio Labanca Love Lofstrom


Homework I: Stata Guide

This guide will help you learn Stata, a program used to process data for statistical inference. These instructions will aid you in completing your first homework assignment. If anything at all is unclear, four of your best resources are:
I. Office Hours (found on TED).
II. The help command in Stata: type help x, or google "help x stata", replacing x with the command that you are unsure of.
III. E-mail [email protected]
IV. http://www.ats.ucla.edu/stat/stata/modules/default.htm, a great self-help guide from UCLA.

Commands will be in bold (type the phrase in bold, then hit enter). For example, describe will show the variables contained in the dataset. Stata is case sensitive: if you enter a command and the variable cannot be found, it is possible that you entered happins instead of Happins. Clicking will be in italics. The title is what you click, and -> indicates what you click next. E.g., File -> Open -> Documents -> School -> Stata Homework -> dataset.dta.

I. Logistics

A) If you are using your own computer, this first step may be redundant. Once you open Stata, clear will remove all previous variables from the program. This ensures that the only variables in Stata are related to the homework assignment.
B) set more off tells Stata not to pause the output and wait for a keypress each time the Results window fills up, so your commands run without interruption.
C) set mem 15 (only if you run Stata 11 or earlier, which is unlikely if you use it through VCL or a UCSD computer).
D) cap log close will close any existing log file. A log file records what is done in Stata.
E) Choose the working directory: File -> Change Working Directory -> select a folder
F) Create a log file in which the results will be saved. E.g.: File -> Log -> Begin -> select the folder where you want to save it -> pick a name -> Save
G) Open the dataset (dta file): File -> Open -> find and select the file country_happiness.dta
H) Save your data as a new file. This makes sure that you do not tamper with the original file. File -> Save As -> select the folder where you want to save it -> pick a name -> Save
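The menu steps above can also be typed as commands, which is convenient if you keep your work in a do-file. The following is only a sketch: the folder path, the log name hw1.log and the copy name country_happiness_hw1.dta are made-up examples, so replace them with your own.

```stata
* Sketch of the Logistics steps as commands (path and file names are assumptions)
clear
set more off
* close any open log; capture ignores the error if no log is open
capture log close
* hypothetical folder: replace with the folder that holds the homework data
cd "C:/Users/yourname/Documents/hw1"
* start a plain-text log, overwriting an older log of the same name
log using hw1.log, replace text
* open the original data, then save a working copy under a new name
use country_happiness.dta, clear
save country_happiness_hw1.dta, replace
```

Forward slashes in the path are accepted by Stata on Windows as well as on Mac.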

IA. Analysis: Happiness

A) describe allows you to see which variables are contained in the dataset. The dataset contains socioeconomic information and happiness scores for 75 countries.
describe happins gdp2002 (the two variables that we are interested in for this homework assignment).

B) summarize gives you summary statistics for the variables that you enter: number of observations, mean, Std. Dev., and min/max values.
summarize happins
summarize gdp2002

C) sort re-arranges the observations in ascending order of the variable you name. This lets us see which countries are the happiest/saddest.
sort happins
browse shows you the data in cell format (like Excel). Enter the command to see for yourself that the observations are re-arranged.

D) We want to find the least/most happy country in the dataset. To do so, we will use list.
* _N is the total number of observations.
* _n is the observation/row number. E.g., after sorting, _n==5 is the fifth unhappiest country in the dataset.
i) To find the unhappiest country: list country_name happins if _n==1
ii) To find the happiest country: list country_name happins if _n==_N
* list can also be used to find the happiness index of particular countries. We want to see how happy people are in the USA and Italy:
iii) list cty happins if country_name == "United States"
iv) list cty happins if country_name == "Italy"
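As a variation on the _n and _N conditions above, list also accepts an in range, which makes it easy to look at several rows at once. This is a small sketch that assumes country_happiness.dta is in memory. Keep in mind that sort places missing values last, so if happins had missing values they would appear among the "happiest" rows.

```stata
* Assumes country_happiness.dta is in memory
sort happins
* the five unhappiest countries (first five rows after sorting)
list country_name happins in 1/5
* the five happiest countries (last five rows; l stands for "last")
list country_name happins in -5/l
```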



E) We can use count to see how many countries are happier than a specific country. Let's see how many countries are happier than the U.S., using the happiness index for the U.S. found above. It is also possible to see which countries those are, and which countries have values between those of two countries, here Portugal and the USA.
i) count if happins > 3.32452
ii) list country_name if happins > 3.32452
iii) a. list cty happins if country_name == "Portugal"
b. list if happins > 2.9510 & happins < 3.32452
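Instead of retyping 3.32452 by hand, the U.S. score can be picked up from the results that summarize leaves behind. The sketch below assumes country_happiness.dta is in memory and that exactly one row has country_name equal to "United States"; us_score is a made-up scalar name. The !missing() condition is there because Stata treats a missing value as larger than any number, so it would otherwise be counted as "happier".

```stata
* Assumes country_happiness.dta is in memory
quietly summarize happins if country_name == "United States"
* with a single matching observation, r(mean) is that country's score
scalar us_score = r(mean)
count if happins > us_score & !missing(happins)
list country_name happins if happins > us_score & !missing(happins)
```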

IB. Analysis: Religion

A. Religion is a string (non-numerical) variable, so summarize won't work for it. Instead we will use the tabulate command. It gives us the frequency, percentage, and cumulative percentage for each religion type.
* describe religion, see for yourself
* tabulate religion

B. We can look at different countries to see which religion has a majority in a particular country. For example, let's see in which countries Shiites are in the majority. It is also possible to see which countries do not practice a certain religion. To do so we use !=, the "does not equal" operator. Don't forget quotation marks for string values!
i) list country_name happins religion if religion == "Shia Islam"
ii) list country_name happins if religion != "Catholic Heavily"

C. Once again we want to look at the summary statistics for happiness scores and GDP/capita. * summarize happins gdp2002

D. As you saw previously, it is extremely easy to find the standard deviation, mean, etc. Let's test your understanding of statistics by finding the Std. Dev. manually in Stata. This will be done in a few steps.
i) We need to create a variable for the deviation. We will subtract the mean from each observation of happins.
generate happins_deviation = happins-3.043835 (we got the mean by using summarize happins).
ii) The deviation must be squared.
generate happins_deviation_sq = happins_deviation^2
iii) Now it is time to add up all of the squared deviations. tabstat allows us to produce a table of statistics.
tabstat happins_deviation_sq, statistics(sum)
iv) To do calculations in Stata we use display, e.g. display 1+1, display 5*5, display 1-1, display 5/0 (j/k, you can't divide by 0). To get the sample variance we divide the sum of squared deviations by N-1.
display 5.6961/74. Alternatively, display 5.6961/(_N-1).
v) To get the standard deviation we take the square root of the sample variance.
display sqrt(.07697432).

E. Now try to calculate the standard deviation for the GDP variable.
i) generate gdp_deviation_sq = (gdp2002 - 14099.65)^2
ii) tabstat gdp_deviation_sq, statistics(sum) columns(variables)
iii) display sqrt(1.05e+10/74)
iv) The value won't be exactly the same as the one shown by summarize. This is due to rounding.
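The rounding issue in the manual calculation can be avoided by letting Stata carry the intermediate numbers instead of retyping them. The sketch below repeats the calculation for happins using the results stored by summarize (r(mean), r(N), r(sum), r(sd)). It assumes country_happiness.dta is in memory; dev_sq, hap_mean, hap_n and sd_manual are made-up names.

```stata
* Assumes country_happiness.dta is in memory
quietly summarize happins
scalar hap_mean = r(mean)
scalar hap_n = r(N)
* squared deviation from the mean, stored in double precision
generate double dev_sq = (happins - hap_mean)^2
quietly summarize dev_sq
* sample standard deviation: sqrt( sum of squared deviations / (N-1) )
scalar sd_manual = sqrt(r(sum)/(hap_n - 1))
quietly summarize happins
display "manual: " sd_manual "   summarize: " r(sd)
```

The two numbers displayed should agree up to many decimal places, since nothing was rounded along the way.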

F. It is possible to plot the distribution using Stata's graphical tools. We will plot a histogram of happins and overlay a normal density that has the same mean and standard deviation as happins. histogram happins, frequency normal

G. Let’s plot a histogram for GDP as well. histogram gdp2002, normal

H. We can look at the correlation between GDP and Happiness in two ways. i) corr happins gdp2002, which gives us the correlation between happiness and GDP. ii) scatter happins gdp2002, which will graph a scatter plot of their relationship. iii)Save the graph. In the graph window File -> Save As -> Save as type: Portable document format (*.pdf) -> select the folder where you want to save it -> pick a name -> Save File



I. Let's figure out which country has a GDP/capita closest to $60,000. This could be hard to do by eye. Fortunately, we can add labels to the scatter plot.
i) scatter happins gdp2002, mlab(country_name) mlabsize(small). Luxembourg should be that country. Notice the titles on the two axes, which are the dataset labels for our two variables, happins and gdp2002. The variable you type first is displayed on the y-axis.
ii) Let's make the graph user friendly. We can do so by naming the graph and the axes.
scatter happins gdp2002, mlabel(country_name) mlabsize(vsmall) title(Scatterplot: Happiness Score and GDP/capita) ytitle(Happiness Score) xtitle(GDP/capita)
iii) Outliers can be dangerous in econometrics. If we consider Luxembourg an outlier, we can easily get rid of the observation. Adding an "if" condition to the scatter command would graph the diagram without displaying Luxembourg; the command below instead removes the observation from the data in memory.
drop if country_name == "Luxembourg"
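For comparison, here is what the "if" version of the graph could look like. It is only a sketch and assumes country_happiness.dta is in memory. Unlike drop, the if condition leaves the dataset untouched and only restricts which observations are plotted.

```stata
* Assumes country_happiness.dta is in memory
* plot every country except Luxembourg, without deleting any data
scatter happins gdp2002 if country_name != "Luxembourg", mlabel(country_name) mlabsize(small)
```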

J. We are done with the analysis for these variables. However, let’s save the dataset and close the log file before moving on. i) File -> save

II. Analysis: Money

A) It is now time to use a different dataset. Before getting started we need to use some of the commands from the Logistics section.
i) clear
ii) set more off
iii) cap log close
iv) File -> Log -> Begin -> select a folder -> pick a name -> Save
v) Open the dataset: File -> Open -> find and open CEOSAL1.dta
vi) Then save the file before getting started: File -> Save As -> select a folder -> pick a name -> Save

B) It's generally a good idea to look at the variables in the dataset. describe

C) The two variables of interest are CEO salaries and return on equity. list salary roe if _n <25

D) sum

E) The industry that the data is drawn from should give additional information. This is a discrete variable. A better way to describe this type of variable is through the command tab indus

F) It is possible to look at the cross-tab of two discrete variables. The cross-tab reports the relative frequency within its row for each cell. In our example, it gives the conditional distribution of financial firms given that industrial firms take value 0 for the first row or 1 for the second row. It is essentially the conditional distribution of the column variable given the row variable. tabulate indus finance, row

G) We can also find the conditional distribution of the row variable given the column variable. tabulate indus finance, column

H) Lastly it is possible to get the joint distribution of industrial firms and financial firms. tabulate indus finance, cell

I) Let us look at the correlation between salary and return on equity while excluding potential outliers. corr salary roe if salary <5000

J) It is time to create another scatter plot. To make the graph easier to read we add reference lines to it. We want to see how many CEOs make more than $5,000,000/year and how many companies have an ROE of 50% or higher (salary is recorded in thousands of dollars, so the line is drawn at 5000). scatter salary roe, yline(5000) xline(50)

K) Let's plot a histogram for salary.
i) hist salary
ii) histogram salary, normal (this compares the histogram of salary to a normal density)
iii) Income variables such as salary are often analyzed as the natural log of the variable. histogram lsalary, normal. This creates a histogram that is easier to read than the previous one.

L) hist roe, normal

M) File -> Save

N) File -> log -> close

Good luck!



Summary Table of the Logical Expressions in Stata

Command   Short description
<         less than
<=        less than or equal
==        equal
>         greater than
>=        greater than or equal
!=        not equal
&         and
|         or
!         not

Summary Table of the Stata Commands seen in Tutorial 1

Each entry below gives the command, a short description, and an example.

describe: shows characteristics of the variable/s contained in the dataset. Example: des variable_name
summarize: gives you summary statistics on the variables that you enter. Example: sum variable_name
sort: re-arranges the observations in ascending order of the variable. Example: sort variable_name
browse: shows you the data in cell format (like Excel).
list: can be used to find the value of a particular variable. Example: list country_name happins religion if religion == "Shia Islam"
count: counts the observations that satisfy a condition, e.g. to see how many countries are happier than a specific country. Example: count if happins > 3.32452
generate: creates a variable. Example: gen variable_name = insert_formula
tabstat: allows us to produce a table of statistics. Example: tabstat variable_name
tabstat variable_name, statistics(sum): adds up all of the values stored for a certain variable. Example: tabstat variable_name, statistics(sum)
display: used to do calculations in Stata. Example: display sqrt(1.05e+10/74)
histogram: plots a histogram. Example: histogram variable_name, normal
corr: shows the correlation between variables. Example: corr variable_name1 variable_name2
scatter: graphs a scatter plot of the relationship between two variables. Example: scatter variable_name1 variable_name2
tab: gives additional information to describe discrete variables. Example: tab indus

STATA Tutorial #2

If you need any additional guidance, or are having other issues with STATA, try the following:

Attend office hours, the exact times of which can be found on TED.

Use the “help” command in STATA or Google (e.g. help scatter if you want clarification on how the “scatter” command works).

Send questions to [email protected].

1. → clear

2. → cap log close

a. The “cap log close” command, in this case, tells STATA to close any log files you may

currently have open.

3. □ File > □ Log > □ Begin

a. This allows you to begin a new log (which you will need to do in order to turn in your

homework assignments). Make sure to save your log as a .log to receive full points on

your homework assignment!

4. □ File > □ Open

a. Open your dataset (wine.dta).

b. Alternatively, you could choose to use STATA’s “use” command, which also tells STATA

to load a designated dataset.

5. → save wine_out.dta, replace

a. We don’t want to actually alter the original dataset (wine.dta) so we will save it under a

new name – in this case, “wine_out.dta.”

b. The “replace” option here tells STATA to overwrite any existing file named wine_out.dta with the data currently in memory.

6. → describe

a. The “describe” command shows us what our dataset contains: the number of

observations, variables, etc. Often, it will also give a brief description of what each

variable represents.

7. → scatter alcohol heart, mlabel(country) mlabsize(vsmall)

a. We are now using the “scatter” command to create a scatterplot representing the

relationship between alcohol consumption and heart disease. Note that alcohol

consumption, listed first here, is on the Y-axis; while heart disease, listed second here, is

on the X-axis.

KEY

→ Type into Command box □ Left Click


b. The “mlabel” option allows us to label the points by country, while the “mlabsize”

option allows us to manipulate the appearance of said labels (in this case, “vsmall” tells

STATA to make the label text very small).

c. We can see, based on the scatterplot produced, that the two variables appear to be

negatively correlated such that the higher the wine consumption, the lower the deaths

by heart disease.

8. → scatter alcohol liver, mlabel(country) mlabsize(vsmall)

a. We can create a similar scatterplot to observe the relationship between alcohol

consumption and deaths by liver disease (in this case, the variables appear to be

positively correlated).

9. → regress heart alcohol, robust

a. We now want to run a regression of deaths by heart disease on wine consumption. The “regress” command tells STATA to run a linear regression.

i. Recall that if errors are not homoscedastic, we must use heteroscedasticity-robust standard errors in order to make valid inferences. We can tag on the robust option to accommodate this.

b. STATA gives us a lot of information: in the top right corner, we can see the sample size, the standard error of the regression (Root MSE), and the R-squared. We are also told the degrees of freedom, estimated coefficients, and standard errors, displayed in other regions of the command output.

10. → display 46817.5108/ 107044.286

a. We can manually calculate the R-squared using the “display” command.

i. The Explained Sum of Squares (ESS) is given to us by Stata as the Model SS; the

Unexplained Sum of Squared Residuals (SSR) is given to us as the Residual SS;

and the Total Sum of Squares (TSS) is given to us as the Total SS.

ii. To calculate the R-squared, divide the ESS value by the TSS value (46817.5108/

107044.286).

11. → display 1-(60226.7749/ 107044.286)

a. Alternatively, we can calculate the R-squared using the formula 1-(SSR/TSS). Again we

can show this on STATA using the “display” command.
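Both versions of the R-squared calculation can also be done without retyping the sums of squares, because regress keeps them as stored results. The sketch below assumes the wine data are in memory and runs the regression without the robust option; the coefficients and the R-squared are the same either way.

```stata
* Assumes the wine data (wine_out.dta) are in memory
regress heart alcohol
* e(mss) is the Model SS (ESS) and e(rss) is the Residual SS (SSR)
display e(mss)/(e(mss) + e(rss))
display 1 - e(rss)/(e(mss) + e(rss))
* the R-squared stored by regress, for comparison
display e(r2)
```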

12. → display _b[_cons] + _b[alcohol]* 8

a. After a regression, STATA stores the estimated coefficients under the name “_b.” Thus “_b[_cons]” gives us the coefficient of the constant term (the intercept). Meanwhile “_b[alcohol]” gives us the slope of the regression line.

b. To predict the value of deaths by heart disease in a country with a wine-per-capita

consumption of 8 liters per year, use the display command as shown above. We are

essentially plugging “8 [liters]” into the regression line.
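If you also want a standard error and confidence interval for this prediction, one option is the lincom command, which evaluates a linear combination of the estimated coefficients. This is a sketch that assumes the wine data are in memory; it goes slightly beyond what the tutorial asks for.

```stata
* Assumes the wine data (wine_out.dta) are in memory
regress heart alcohol, robust
* same point prediction as display _b[_cons] + _b[alcohol]*8
lincom _cons + 8*alcohol
```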

13. → twoway (lfit heart alcohol) (scatter heart alcohol, mlabel(country) mlabsize(vsmall))

a. The “twoway” command produces a twoway graph according to our specifications.

i. The “lfit” plot type generates a line of best fit through our original scatterplot (initially generated in step 7).

The next two steps (14-15) are somewhat irrelevant to the tutorial as a whole but will help you in the completion of your second homework assignment.

14. → twoway (lfit heart alcohol) (scatter heart alcohol, mlabel(country) mlabsize(vsmall)) (function y= 253.78 - 21.733*x, range(alcohol))

a. The “function” plot appended to our command from step 13 draws a function in the above graph, in this case y = 253.78 - 21.733*x.

15. → twoway (lfit heart alcohol) (scatter heart alcohol, mlabel(country) mlabsize(vsmall)) (function y= 253.78 - 21.733*x, range(alcohol)), legend(order(1 2 "Observed" 3 "A function of interest"))

a. Here we’ll attempt to make the graph legend a little clearer. The legend option allows us to label our graph more deliberately (to better illustrate this, try also twoway (lfit heart alcohol) (scatter heart alcohol, mlabel(country) mlabsize(vsmall)) (function y= 253.78 - 21.733*x, range(alcohol)) and see what your key would look like in this case).

16. → predict yhat_h

a. This saves all fitted values.

b. □ Data > □ Variables Manager shows the new variable “yhat_h,” labelled “Fitted

Values.”

17. → predict uhat_h, residuals

a. Let’s also save the residuals from the regression. Again, □ Data > □ Variables Manager

should show you the new variable “uhat_h,” labelled “Residuals.”

18. → generate uhat_alt= heart - yhat_h

a. Experimentally, we can verify that the difference between the actual observed value

and the value predicted by the model equals the residual.

19. → drop uhat_alt

a. Drop the variable uhat_alt.

20. → tabstat uhat_h, statistic(sum)

a. We can check to see that the sum of the residuals equals zero using the tabstat

command, with the statistic(sum) option.
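Steps 16-20 can be condensed into a short self-check. The sketch below assumes the wine data are in memory and uses the made-up names yhat_chk, uhat_chk and resid_sum so that it does not clash with the variables created above. The assert command stops with an error if the condition fails for any observation, so no output from it means the check passed.

```stata
* Assumes the wine data (wine_out.dta) are in memory
regress heart alcohol, robust
* fitted values and residuals for the observations used in the regression
predict double yhat_chk if e(sample), xb
predict double uhat_chk if e(sample), residuals
* residual = observed value minus fitted value
assert abs(uhat_chk - (heart - yhat_chk)) < 1e-8 if e(sample)
* OLS residuals sum to zero when the model includes a constant
quietly summarize uhat_chk
scalar resid_sum = r(sum)
display resid_sum
```

The displayed sum will typically be a very small number rather than exactly 0, because of floating-point arithmetic.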

21. → rvpplot alcohol, yline(0) mlabel(country) mlabsize(vsmall)

a. Using the “rvpplot” command, we can plot the residuals. Note that the values of the residuals are shown on the vertical axis, and that the level of alcohol consumption is displayed on the horizontal axis.

22. → rvfplot, yline(0) mlabel(country) mlabsize(vsmall)

a. Let’s now instead plot the residuals against the fitted values. We observe a plot of the

residuals against the fitted values, given by the “rvfplot” command.

23. → sort uhat_h

a. Use the “sort” command to organize the residuals in ascending order (recall the “sort”

command from the first tutorial and homework assignment).

24. → list country alcohol heart yhat_h uhat_h

a. Using the “list” command, try to observe the typical size of the residuals. By observing

the residual values, we can more readily see the countries that don’t work well with the

OLS regression.

25. → regress heart alcohol if country != "Japan"

a. We can see that Japan doesn’t seem to work well with this regression model (note its

large residual). Let’s try running the regression without Japan.

b. The condition if country != "Japan" tells STATA to run the regression only on observations whose country name is not Japan.

26. → set seed 101040

a. STATA can be used to draw a random sample of size n from the data in memory; the command that does this is called “bsample.” So that the same sample can be drawn again later, we first set a “seed” value, in this case a number. The seed can be whatever number you like; let’s here use 101040.

27. → bsample 10

a. To take our random sample, we’ll use the “bsample” command, followed by our desired

sample size. We’ll use a sample size of n=10.

28. → describe

a. Use the “describe” command to confirm that the dataset now contains 10 observations.

29. → regress heart alcohol

a. Let’s run the regression again, on our 10 observations.
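Keep in mind that bsample replaces the data in memory with the sample it draws. If you would like to get the full dataset back afterwards without reloading it, one possibility is to wrap the sampling in preserve and restore. This is only a sketch and assumes wine_out.dta still holds the full wine data.

```stata
* Assumes wine_out.dta (the full wine data) is in the working directory
use wine_out.dta, clear
* keep a temporary copy of the full data
preserve
set seed 101040
bsample 10
regress heart alcohol
* bring the full data back into memory
restore
describe, short
```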

30. → save wine_out.dta, replace

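a. Save the current dataset before moving on. Note that the data in memory are now the 10-observation sample, so this overwrites wine_out.dta with that sample.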

31. → clear

a. Let’s begin anew.

32. □ File > □ Open

a. We will now use the dataset with CEO salaries. Locate and open it in STATA.

33. → save ceosal2_tut2.dta, replace

34. → describe

a. Use the “describe” command to familiarize yourself with the new dataset. Observe the

variables, their descriptions, etc.

35. → regress salary ceoten

a. Let’s regress salary (salary) on the number of years an individual has been a CEO (ceoten).

36. → twoway (scatter salary ceoten) (lfit salary ceoten), legend(order(1 "Observed" 2 "Fitted by

Linear Model"))

a. Use the “twoway” command to create a twoway graph that illustrates the relationship

between salary and length of CEO tenure. Note the line of best fit that appears

alongside the data points on the scatterplot.

37. → regress lsalary ceoten

a. We’ll use the “regress” command to regress the log of salary on CEO tenure.

38. → twoway (scatter lsalary ceoten) (lfit lsalary ceoten)

a. Again, let’s use the “twoway” command to create a twoway graph that shows us visually

the line of best fit through a scatterplot of the data points.

39. Predicted_salary = exp(b0_hat + b1_hat * ceoten)

a. This step is a formula rather than a command to type. It is possible for us to observe this relationship using salary instead of the log of salary. Note that if Predicted_log(salary) = b0_hat + b1_hat * ceoten, then we can find a value for the predicted salary such that Predicted_salary = exp(b0_hat + b1_hat * ceoten).
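To see this formula at work on the data, the fitted values of the log regression can be saved and exponentiated. The sketch below assumes the CEO salary data are in memory; lsal_hat and salary_hat are made-up variable names.

```stata
* Assumes the CEO salary data (ceosal2_tut2.dta) are in memory
regress lsalary ceoten
* fitted values of log(salary): b0_hat + b1_hat*ceoten
predict double lsal_hat if e(sample), xb
* predicted salary according to the formula in this step
generate double salary_hat = exp(lsal_hat)
list salary salary_hat ceoten in 1/5
```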

40. → twoway (scatter salary ceoten) (function y = exp(_b[_cons] + _b[ceoten]*x), range(ceoten)),

legend(order(1 "Observed" 2 "Fitted by Log Model"))

a. From here, we can now graph a twoway graph that visually expresses the relationship

between salary and CEO tenure.

41. → regress lsalary lsales

a. Let’s regress the log of salary on the log of sales. We are effectively estimating a

constant elasticity model that relates the CEO’s salary to sales generated by the firm in

millions of dollars. This relationship is modeled by log(salary) = b0 + b1 log(sales) + u.

42. → regress salary ceoten

43. → summarize

a. Recall that the “summarize” command can be used to familiarize ourselves with the

dataset: here we can use it to find values such as the average salary and tenure of a

CEO.

44. → display _b[_cons] + _b[ceoten]*7.954802

a. If we plug the average tenure of the CEO in our estimated regression, we should get

back the average salary of a CEO. We can use STATA to verify this.
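The same check can be run without typing the average tenure by hand, using the mean that summarize stores in r(mean). This is a sketch that assumes the CEO salary data are in memory; pred_at_mean is a made-up scalar name. The two numbers displayed should coincide, since an OLS line with a constant passes through the point of sample means.

```stata
* Assumes the CEO salary data (ceosal2_tut2.dta) are in memory
regress salary ceoten
quietly summarize ceoten if e(sample)
* prediction at the average tenure
scalar pred_at_mean = _b[_cons] + _b[ceoten]*r(mean)
quietly summarize salary if e(sample)
display "prediction at mean tenure: " pred_at_mean
display "average salary:            " r(mean)
```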

45. → regress salary ceoten, robust

a. Recall that if the errors are not homoscedastic, homoscedasticity-only standard errors of the estimators are not appropriate. If errors are not homoscedastic, then we must use heteroscedasticity-robust standard errors in order to make valid inferences.

b. To tell STATA that we want heteroscedasticity-robust errors (as opposed to homoscedasticity-only errors, which STATA gives us by default) we tag on the “robust” option.

46. → set seed 101040

a. Again, STATA allows us to draw a random sample of size n. Recall that before doing so we set a seed value, here just a numeric value, so that the same sample can be drawn again. Let’s use 101040.

47. → bsample 100

a. Let’s set our sample size to 100.

48. → describe

a. The “describe” command should show you that we do in fact have 100 observations in

our dataset now.


a. We can perform our last regression again, but this time with our new, reduced set of

100 observations.

50. → use CEOSAL2_tut2.DTA, clear

a. Let’s return to our old dataset.

51. → describe

a. Note that we are back to our original 177 observations.

52. → set seed 050735

a. Now we’ll take a different random sample and perform the regression again. In this

case, let’s now use a different seed value, 050735.

53. → bsample 100


a. Observe that the estimated coefficients are different than those obtained before, since

we took a different random sample of size 100.

55. → save CEOSAL2_tut2.dta, replace

56. □ File > □ Log > □ Close

a. Close the log and finish!



Summary Table of the Stata Commands seen in Tutorial 2

regress: performs linear regression on variables. Example: regress depvar indepvar, option. Note: depvar: vertical axis; indepvar: horizontal axis. The option robust can be used to obtain correct standard errors when errors are heteroskedastic.
twoway: plots twoway graphs (scatter, line, etc.). Example: twoway scatter variable1 variable2. Note: when the only type of graph is scatterplot or line, “twoway” may be omitted when inputting the command.
twoway lfit: adds a line of best fit to the graph. Example: twoway (scatter variable1 variable2) (lfit variable1 variable2)
predict: obtains predictions, residuals, etc., after estimation. Example: predict variable, option. Note: the option residuals generates residuals.
rvpplot: plots the residual on the vertical axis and the specified variable on the horizontal axis. Example: rvpplot variable. Note: variable can be, for example, the x variable of a regression.
rvfplot: plots the residual on the vertical axis and the fitted y on the horizontal axis. Example: rvfplot, options. Note: some examples of options are yline(), mlabel(), mlabsize().
bsample: draws bootstrap samples (random samples with replacement) from the data in memory. Example: bsample sample_size. Note: before inputting the command, set seed number.
set seed: sets the seed value before generating a sample. Example: set seed number

STATA Tutorial #3

If you need any additional guidance, or are having other issues with STATA, try the following:

Attend office hours, the exact times of which can be found on TED.

Use the “help” command in STATA or Google (e.g. help scatter if you want clarification on how the “scatter” command works).

Send questions to [email protected].

--------------------------------------------------------------------------------------------------------------------------------

1. clear

2. → cap log close

a. The “cap log close” command, in this case, tells STATA to close any log files you may

currently have open.

3. cd "CURRENT DIRECTORY PATH"

The “cd” command sets the current directory in Stata. This is the directory where your data are saved and where you want the log files, graphs, etc. to be saved. For Stata to find that folder we need to give it the CURRENT DIRECTORY PATH. To get this to work, create a folder on your desktop. In that folder create two more folders, one called “logs” and the other called “data”. Save your data (i.e. dta files) in the “data” folder. To find the CURRENT DIRECTORY PATH, right click on either the logs or the data folder, then click on “Properties”. In the window that pops up, copy the path shown to the right of “Location” and paste it in place of the words CURRENT DIRECTORY PATH after cd. Don’t forget to keep the quotes. Example: cd "C:\Desktop\Stata Tutorial 3\" sets the current directory to the folder called “Stata Tutorial 3” on the “Desktop” of drive “C” of this computer.

4. log using logs\tutorial3.log, replace

a. This allows you to begin a new log (which you will need to do in order to turn in your homework assignments). Make sure to save your log as a .log to receive full points on your homework assignment! The replace option will replace any existing log file.

5. use data\vote1.dta, clear

a. Begin by opening the dataset (vote1.dta). The clear option clears any existing data from Stata's memory.

6. → save vote1_out.dta, replace

a. We don’t want to actually alter the original dataset (vote1.dta), so we will save it under a new name, in this case “vote1_out.dta.”

b. The “replace” option here tells STATA to overwrite any existing file named vote1_out.dta.
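Steps 1-6 are a natural candidate for the top of a do-file, so that the whole setup can be re-run in one go. The following is only a sketch: the folder path is a made-up example, and it assumes the data file is named vote1.dta and sits in the data folder described in step 3.

```stata
* Sketch of steps 1-6 as a do-file header (path and file names are assumptions)
clear
capture log close
* hypothetical path: replace with the folder that contains logs and data
cd "C:/Desktop/Stata Tutorial 3"
* start a plain-text log in the logs folder
log using "logs/tutorial3.log", replace text
* open the original data and save a working copy
use "data/vote1.dta", clear
save vote1_out.dta, replace
```

Forward slashes are used in the paths here; Stata accepts them on Windows as well.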


7. → describe

a. The “describe” command shows us what our dataset contains: the number of observations, variables, etc. Often, it will also give a brief description of what each variable represents.

8. generate id=_n

a. Let’s generate an id for each observation; with this command the observations are now numbered.

9. browse

a. Notice that there is a new variable (last column), the one you just generated (id). Also notice the units in which the variables are measured: for example, voteA and prtystr are in percentage points, so a value of 43 for voteA means that candidate A got 43% of the votes.

10. reg voteA expendA expendB, robust

a. Let’s start by regressing the percentage of votes received by the incumbent on the campaign expenditures incurred by each candidate.

b. In the top right corner you will find, among others, the overall F-statistic (test of the joint hypothesis that all the slope coefficients are zero), the R-squared, and what we call the SER (standard error of the regression), which STATA calls Root MSE (root mean squared error). In the table below it, you find the 3 coefficient estimates, the robust standard errors, and the t-statistics (which test the hypothesis that each individual coefficient is zero).

Now, let’s interpret the meaning of the estimated regression coefficients.

i. When expenditures for both candidates are 0, the percentage of votes received by candidate A (the incumbent) is predicted to be 49.6 percentage points, on average.

ii. An increase in expenditures by candidate A of $1000 is predicted to increase, on average, his/her vote share by about 0.038 percentage points, keeping candidate B's (the challenger) expenditures constant.

iii. For each $1000 increase in expenditures by candidate B, candidate A will lose, on average, about .036 percentage points, when candidate A's expenditures are held constant.

11. display _b[_cons] + _b[expendB]*2+_b[expendA]

a. Use this command to show the predicted percentage of votes for candidate A when expendA=1 (i.e. $1000) and expendB=2.

12. test expendA expendB

a. To test the joint hypothesis that both coefficients are equal to zero.

13. test expendA

a. To test the null hypothesis that the coefficient on expendA is equal to 0 (against the alternative that it is different from 0), we can use the command test as shown above.

b. Since the p-value is smaller than 0.01, we reject the null hypothesis at the 1% level.

14. test (expendA=1) (expendB=0)

a. We use this to test the joint hypothesis that the coefficient on expendA is equal to 1 and that the coefficient on expendB equals 0.

To comment on the fit of the model, notice that both slope coefficients are highly significant and the R-squared shows that this model explains about 53% of the variance of the vote share.

i. The SER (Root MSE) indicates that the typical deviation from the predicted value for each electoral district is about 11.6 percentage points, but this number is hard to evaluate in isolation.

In short, this is a reasonably good fit for a model.

15. sum expendA expendB

display _b[_cons]+ _b[expendA]* 310.611 + _b[expendB]*305.0885

a. To predict the percentage of votes for candidate A at the average expenditures of A and B, first find the averages of expendA and expendB using the command sum (above).

b. Then multiply the coefficient of each variable by its average found in point a, and add the constant.

16. sum expendA

a. We can see what happens to the percentage of votes for the incumbent if the incumbent's campaign spending increases by one standard deviation while the challenger's expenditures remain fixed.

17. display _b[expendA]* 280.9854

a. Multiply the coefficient on expendA by its standard deviation.

b. All else equal, a one standard deviation increase in expenditures by the incumbent would lead to an increase in vote share of about 10.8 percentage points.
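Steps 16-17 can be combined so that the standard deviation does not have to be copied by hand. The sketch below assumes the vote data are in memory; effect_1sd is a made-up scalar name. It re-runs the regression from step 10 first, so that _b[expendA] refers to the linear model.

```stata
* Assumes the vote data (vote1_out.dta) are in memory
regress voteA expendA expendB, robust
quietly summarize expendA if e(sample)
* effect on voteA of a one standard deviation increase in expendA
scalar effect_1sd = _b[expendA]*r(sd)
display effect_1sd
```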

18. gen lnvoteA=log(voteA)

gen lnexpendA=log(expendA)

reg lnvoteA lnexpendA expendB, robust

a. Suppose you want to know the percentage change in voteA for a 1% change in expendA. You can obtain this result directly from the regression by running the regression in logs. Keeping expenditure for candidate B constant, a 1% increase in expenditure for candidate A corresponds to a 0.17% increase in the percentage of votes received by candidate A.

19. generate expendA_sq= expendA^2

reg voteA expendA expendA_sq, robust

a. Imagine you are the adviser for an incumbent candidate. You come across a theory that there are diminishing marginal returns to campaign expenditures by incumbent candidates.

b. You want to test this theory, so you decide to model the relationship between

percent vote and expenditures for the incumbents as a quadratic function.

i. What do the regression results show you?

ii. There appear to be diminishing marginal returns to expenditures. Notice

that the coefficient on the squared value of incumbent expenditures is

negative.

iii. This indicates that each additional increase in expenditures yields a smaller return than the one before. Eventually, we reach a point where increasing expenditures actually costs an incumbent votes. How do you explain this turning point?

iv. A possible explanation is that airwaves become fully saturated and overexposure leads voters in a particular district to turn against the candidate.
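The turning point can also be computed from the coefficients. For a quadratic, predicted voteA stops rising where the slope, _b[expendA] + 2*_b[expendA_sq]*expendA, equals zero. A minimal sketch, assuming the quadratic regression from this step (the label turn is arbitrary):

```stata
quietly reg voteA expendA expendA_sq, robust

* Expenditure level (in thousands of dollars) at which predicted voteA peaks
display -_b[expendA]/(2*_b[expendA_sq])

* Same quantity with a delta-method standard error and confidence interval
nlcom (turn: -_b[expendA]/(2*_b[expendA_sq]))
```

The confidence interval from nlcom gives a sense of how precisely the turning point is estimated, which matters when few observations lie near it.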

20. twoway (scatter voteA expendA) (qfit voteA expendA), legend(order(1 2 "Quadratic Fit"))

a. We plot the estimated relation.

b. Scatter shows you the points in your sample, qfit plots the estimated quadratic

relationship

21. twoway (scatter voteA expendA) (qfit voteA expendA) (lfit voteA expendA), legend(order(1 2 "Quadratic Fit" 3 "Linear Fit"))

a. In this graph, we compare the quadratic fit with the linear fit.

To test the theory beyond a visual comparison of the two fits, we can formally test the hypothesis that the relationship between voteA and expendA is linear against the alternative that it is nonlinear. If the relationship is linear, the coefficient on expendA_sq in the quadratic regression of step 19 is zero. The t-statistic for this test is -6, thus we reject the null hypothesis. There is evidence that the relationship is nonlinear.
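The same hypothesis can be checked with the test command used earlier. This is a sketch that assumes the quadratic regression of step 19 is re-run first:

```stata
quietly reg voteA expendA expendA_sq, robust

* Wald test of the null hypothesis that the coefficient on expendA_sq is 0
test expendA_sq
```

With a single restriction, the F statistic reported by test is the square of the t-statistic from the regression table, so a t-statistic of about -6 corresponds to an F statistic of roughly 36, and the two p-values agree.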

22. display (_b[_cons]+_b[expendA]*110+_b[expendA_sq]*110^2) - (_b[_cons]+_b[expendA]*100+_b[expendA_sq]*100^2)

23. display (_b[_cons]+_b[expendA]*510+_b[expendA_sq]*510^2) - (_b[_cons]+_b[expendA]*500+_b[expendA_sq]*500^2)

a. To show that there are diminishing marginal returns to campaign expenditures, we compute the effect of increasing campaign expenditure by $10,000 (expenditures are measured in thousands of dollars, so this is a change of 10 units) when spending is $100,000 and when spending is $500,000.

i. Adding an additional $10,000 in spending after having already spent $100,000 will lead to an additional 0.69 percentage points of the vote for candidate A.

ii. But adding an additional $10,000 in spending after having already spent $500,000 will only lead to an additional 0.23 percentage points of the vote for candidate A.
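Because the constant cancels in each difference, the two effects are linear combinations of the coefficients, so lincom can report them together with standard errors. A minimal sketch, assuming the quadratic regression of step 19:

```stata
quietly reg voteA expendA expendA_sq, robust

* From 100 to 110: 10*b1 + (110^2 - 100^2)*b2 = 10*b1 + 2100*b2
lincom 10*expendA + 2100*expendA_sq

* From 500 to 510: 10*b1 + (510^2 - 500^2)*b2 = 10*b1 + 10100*b2
lincom 10*expendA + 10100*expendA_sq
```

The point estimates should match the two display commands; lincom adds a confidence interval for each effect.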

24. count if expendA > 700

a. The visual analysis of the scatter plot reveals that there is a turning point at

around $700,000 in spending. We want to see if there are a lot of districts with

incumbent expenditures over $700,000.

25. list id state district expendA if expendA > 700

a. To see which districts those are, you can use the list command.

26. gen shareA_dummy=(shareA>50)

gen voteA_dummy=(voteA>50)

tab shareA_dummy

tab voteA_dummy

reg voteA_dummy shareA_dummy, robust

a. Suppose candidate A wants to know: what is the effect of spending more than candidate B on the probability of getting more than 50% of the votes? You can find that out by generating the dummy variables above and regressing one on the other.

b. Having higher expenditure increases the probability of having the majority of votes by 0.84*100 = 84 percentage points.
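Two checks are worth running alongside this regression. First, Stata treats missing values as larger than any number, so (shareA>50) would be coded 1 for an observation with missing shareA. Second, with a single dummy regressor the slope equals the difference between the two groups' win proportions, which a cross-tabulation shows directly. The sketch below uses hypothetical variable names ending in _chk so that it does not clash with the variables above; whether this dataset contains any missing values is not stated in this guide.

```stata
* The if qualifier leaves the dummy missing where the underlying variable is missing
gen shareA_chk = (shareA>50) if !missing(shareA)
gen voteA_chk  = (voteA>50)  if !missing(voteA)

* Column percentages: share of wins among lower and higher spenders
tab voteA_chk shareA_chk, column
```

The difference between the two column percentages in the voteA_chk = 1 row should equal the coefficient on the dummy, times 100.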

27. reg voteA expendA expendA_sq expendB prtystrA, robust

a. There are other factors besides incumbent spending that influence votes. The vote share of the incumbent is also affected by the opponent's spending (expendB) and the strength of the incumbent's party (prtystrA). We run a regression controlling for those factors.

b. All coefficients are significantly different from zero at the 1% significance level. There are still diminishing marginal returns to incumbent campaign expenditure.

c. With other variables held constant, an increase of $1000 in the opponent's spending will cost the incumbent about 0.03 percentage points of the vote share.

d. An increase in the strength of the incumbent's party of 1 percentage point, keeping all other variables constant, will yield a 0.32 percentage point increase in the incumbent's vote share.

e. With this model we have now explained 65% of the variation in the vote share of the incumbent. More importantly, we have reduced the SER, which indicates that we are starting to achieve a relatively good fit.

28. sum expendA expendB prtystrA

29. display _b[_cons]+_b[expendA]*310.611+_b[expendA_sq]*(310.611^2)+_b[expendB]*305.0885+_b[prtystrA]*65

a. You want to predict the incumbent's share of the vote if party strength were 65 percent and the candidates kept their expenditures at their mean levels.

b. About 58.46% of the vote.
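An alternative that avoids typing the means is to let Stata build the squared term with factor-variable notation and then ask margins for the prediction. This is a sketch of a different route to the same number, not the method used in this guide:

```stata
* c.expendA##c.expendA includes expendA and its square, and tells Stata they are linked
reg voteA c.expendA##c.expendA expendB prtystrA, robust

* Prediction with expenditures at their means and party strength set to 65
margins, at((mean) expendA expendB prtystrA=65)
```

Because Stata knows the squared term is a function of expendA, margins evaluates it at the square of the mean, which is what the display command in step 29 does by hand. It also reports a standard error for the prediction.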

30. reg voteA lexpendA, robust

a. In general, when you want to run a regression with a variable in logarithmic form, you have to generate that variable first, for example by writing generate ln_expendA=ln(expendA). In this case, the logs of campaign expenditures for each candidate are already variables in this dataset, so we don't need to generate them.

b. The coefficient is highly significant and indicates that a 1% increase in expenditure would yield an increase in vote share of (6.51/100)=0.0651 percentage points.
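Dividing the coefficient by 100 is an approximation that works well for small percentage changes. The exact change in predicted voteA for a 1% increase multiplies the coefficient by ln(1.01). A short sketch comparing the two, assuming the regression from this step:

```stata
quietly reg voteA lexpendA, robust

* Approximation used in the text: coefficient divided by 100
display _b[lexpendA]/100

* Exact change in predicted voteA when expendA rises by 1%
display _b[lexpendA]*ln(1.01)
```

The two numbers are close because ln(1.01) is about 0.00995; the gap widens for larger percentage changes.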

31. twoway (scatter voteA lexpendA) (lfit voteA lexpendA), legend(order(1 "Actual Values" 2 "Fitted Values"))

a. Plot the relationship between voteA and log(expendA) and the fitted line.

32. reg voteA lexpendA lexpendB prtystrA, robust

a. Now, we keep the linear-log specification but, fearing omitted variable bias, we

add control variables log(expendB) and prtystrA.

b. Interpretation of results: A 1% increase in incumbent expenditures leads to an increase in incumbent vote share of about 0.0608 percentage points (the coefficient of about 6.08 divided by 100, as in step 30), keeping all other variables constant. Equivalently, a 10% increase raises vote share by about 0.608 percentage points.

c. A 1% increase in challenger expenditures leads to a reduction in incumbent vote share of about 0.0662 percentage points (about 0.662 percentage points for a 10% increase), keeping all other variables constant.

d. An increase in the incumbent's party strength of 1 percentage point leads to an increase in incumbent vote share of 0.15 percentage points, keeping all other variables constant.

e. We are confident in the results of this model. All variables are highly significant. We have explained 79% of the variation in incumbent vote share and the SER has been reduced to only 7.7 percentage points.

33. display _b[_cons]+_b[lexpendA]*(ln(400))+_b[lexpendB]*(ln(500))+_b[prtystrA]*50

a. Compute the predicted vote share for your candidate if his/her expenditures are $400,000, the opponent's are $500,000, and the incumbent's party strength is 50%.

34. display _b[_cons]+_b[lexpendA]*(ln(600))+_b[lexpendB]*(ln(500))+_b[prtystrA]*50

a. Compute what happens if your candidate increases expenditures to $600,000, keeping the other variables constant.

35. display _b[lexpendA]*(ln(600)-ln(400))

a. The increase in your candidate's vote share would be 2.47 percentage points, from 48.01 to 50.48 percent. You can compute this increase directly by using the command above.
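The shortcut works because everything held constant cancels when the two predictions are subtracted. Writing the coefficient on lexpendA as \hat\beta_1 (notation introduced here for illustration), the constant, the lexpendB term and the prtystrA term are identical in steps 33 and 34, so only the lexpendA term remains:

```latex
\begin{aligned}
\widehat{voteA}_{600} - \widehat{voteA}_{400}
  &= \hat\beta_1 \ln(600) - \hat\beta_1 \ln(400) \\
  &= \hat\beta_1 \ln\!\left(\tfrac{600}{400}\right)
   = \hat\beta_1 \ln(1.5) \approx 0.405\,\hat\beta_1
\end{aligned}
```

The same reasoning shows that the change depends only on the ratio of the two expenditure levels, not on the levels themselves.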

36. save vote1_out.dta, replace

clear

a. Save the modified dataset under a new name, then clear it from memory.

37. log close

a. Close the log.
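Putting the logistics together, the whole session can be kept in a do-file so that it can be re-run from start to finish. The skeleton below is only a sketch: the do-file name, the log name and the input file name vote1.dta are assumptions, not names given in this guide.

```stata
* hw1.do (hypothetical file name)
capture log close
clear
set more off

* Hypothetical log name; the text option writes a plain-text log
log using hw1_log.txt, replace text

* Assumed input file name
use vote1.dta, clear

* ... analysis commands from the steps above go here ...

save vote1_out.dta, replace
clear
log close
```

Running the file with do hw1.do reproduces every result in the log, which makes it easier to check your answers.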



regress
Running a linear regression on multiple variables: reg voteA expendA expendB, robust
Running a log regression on multiple variables: reg lnvoteA lnexpendA expendB, robust

test
To test the hypothesis that the coefficient equals 0: test expendA
To test the joint hypothesis that the coefficient on variable 1 equals 1 and that the coefficient on variable 2 equals 0: test (expendA=1) (expendB=0)

twoway
To plot the estimated relation between two variables: twoway (scatter voteA expendA) (qfit voteA expendA) (lfit voteA expendA), legend(order(1 2 "Quadratic Fit" 3 "Linear Fit"))

generate
To generate dummy variables: gen shareA_dummy=(shareA>50)
To generate an id for each observation: generate id=_n

count
To see how many districts are over a particular value: count if expendA > 700

list
To show the districts that are over the particular value: list id state district expendA if expendA > 700
