BIOSTATS 640 – Spring 2017 5. Logistic Regression Stata Illustration

….1. Teaching\stata\stata version 14\Stata for Logistic Regression.docx Page 1 of 30

5. Logistic Regression Illustration – Stata version 14

March 2017

1. Tip: "1/2" Variables versus "0/1" Variables
2. Tip: How to Create Quartile Groupings of Continuous Variable
3. Fit a Logistic Regression Model
4. Likelihood Ratio Test for 2 "Hierarchical" Models
5. Regression Diagnostics for Logistic Regression: Numerical
   a. Numerical Measures of Fit Using fitstat
   b. Test of Model Adequacy Using linktest
   c. Test of Overall Goodness-of-Fit Using lfit
6. Regression Diagnostics for Logistic Regression: Graphical
   a. Plot of ROC Curve Using lroc
   b. Plot of Standardized Residuals versus Observation Number
   c. Plot of Influential Observations Using Cook's Distances
7. Tip: Save Your Commands to a DO File for Later Use

Preliminary – Download the Stata data set illeetvilaine.dta.
Note – This data set is accessible through the internet. Alternatively, you can download it from the course website.

(a) In Stata, read the data directly from the internet using the command use:
. use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear

(b) From the course website, right-click to download. Afterwards, in Stata, use FILE > OPEN.
See http://people.umass.edu/biep640w/webpages/demonstrations.html

1. Tip - “1/2” Variables versus “0/1” Variables

Why the fuss? Answer – Sometimes the arrangement of rows and columns in a 2x2 table is not what you expected.

tab2
Stata orders the rows and columns according to the numeric values of the row and column variables. For a 0/1 variable, row 1 is the value "0" row and row 2 is the value "1" row. For a 1/2 variable, row 1 is the value "1" row and row 2 is the value "2" row. Columns are ordered similarly.

cc, cs
Stata assumes that you are using 0/1 variables here, with 1=event and 0=non-event. Stata orders the rows and columns according to event, with event being the first row (or column). Thus, row 1 is the value "1=event" row and row 2 is the value "0=non-event" row. Columns are ordered similarly.

Ille-et-Vilaine Data: Illustration
Suppose we are interested in the 2x2 table cross-classification of heavy smoking (30+ gm/day versus other) and case status (esophageal cancer case versus control):

                              Disease (Esophageal Cancer)
Exposure (Heavy Smoking)        Yes        No     Total
Yes (30+ gm/day)                 31        51        82
No                              169       724       893
Total                           200       775       975

Preliminary: Introduction to the command recode
Use recode to re-set the values of a variable. This is especially handy in the creation of a new variable. You can recode a single old value to a new value, or you can recode a whole range of values to a new value. For example:

. use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear
. * recode variablename (oldvalue=newvalue) (rangelower/rangeupper=newvalue) etc.
. generate age12=age
. recode age12 (18=1) (19/max=2)

. * Create "1/2" variables when you want to use command tab2
. * "1/2" measure of heavy smoking (1=30+ gm/day versus 2=other)
. * Exposure will be heavy smoking defined as tobgp=4 (30+ gm/day)
. generate exposure12=tobgp
. recode exposure12 (1=2) (2=2) (3=2) (4=1)
(exposure12: 739 changes made)
. label define exposure12f 2 "other" 1 "heavy"
. label values exposure12 exposure12f

. * "1/2" variable for case status (1=case versus 2=other)
. generate case12=case
. recode case12 (0=2)
(case12: 775 changes made)
. label define case12f 2 "control" 1 "case"
. label values case12 case12f

. * Check variable creations
. tab2 tobgp exposure12

-> tabulation of tobgp by exposure12

   Grouped |
   tobacco |      exposure12
   consum. |     heavy      other |     Total
-----------+----------------------+----------
0-9 gm/day |         0        526 |       526
     10-19 |         0        236 |       236
     20-29 |         0        131 |       131
       30+ |        82          0 |        82
-----------+----------------------+----------
     Total |        82        893 |       975

. tab2 case case12

-> tabulation of case by case12

      Case |
    status |
  (1=case, |        case12
0=control) |      case    control |     Total
-----------+----------------------+----------
         0 |         0        775 |       775
         1 |       200          0 |       200
-----------+----------------------+----------
     Total |       200        775 |       975

. * Create "0/1" variables when you want to use commands cc, cs
. * "0/1" measure of heavy smoking (1=30+ gm/day versus 0=other)
. * Exposure will be heavy smoking defined as tobgp=4 (30+ gm/day)
. generate exposure01=tobgp
. recode exposure01 (1=0) (2=0) (3=0) (4=1)
(exposure01: 975 changes made)
. label define exposure01f 0 "other" 1 "heavy"
. label values exposure01 exposure01f

. * "0/1" variable for case status (1=case versus 0=other)
. * This already exists as the variable case

. * Check variable creations
. tab2 tobgp exposure01

-> tabulation of tobgp by exposure01

   Grouped |
   tobacco |      exposure01
   consum. |     other      heavy |     Total
-----------+----------------------+----------
0-9 gm/day |       526          0 |       526
     10-19 |       236          0 |       236
     20-29 |       131          0 |       131
       30+ |         0         82 |        82
-----------+----------------------+----------
     Total |       893         82 |       975

. * The command cc works fine with 0/1 variables
. cc case exposure01

                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        31         169  |        200       0.1550
        Controls |        51         724  |        775       0.0658
-----------------+------------------------+------------------------
           Total |        82         893  |        975       0.0841
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         2.604014       |    1.557944      4.2894 (exact)
 Attr. frac. ex. |         .6159775       |    .3581283    .7668672 (exact)
 Attr. frac. pop |         .0954765       |
                 +-------------------------------------------------
                               chi2(1) =    16.42  Pr>chi2 = 0.0001

The commands cc and cs are commands for epidemiological analyses of 2x2 tables, where the convention is to have cases in row 1 (controls in row 2) and exposed in column 1 (non-exposed in column 2).
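The point estimates reported by cc can be reproduced by hand from the four cell counts. The following is a minimal Python sketch of that arithmetic, offered only as a cross-check outside Stata; the variable names are illustrative and not part of the handout. It assumes the usual cross-product odds ratio, the attributable fraction among the exposed computed as (OR - 1)/OR, and the population attributable fraction computed as that value times the proportion of cases exposed.

```python
# Cell counts copied from the cc output above.
cases_exposed, cases_unexposed = 31, 169
controls_exposed, controls_unexposed = 51, 724

# Cross-product odds ratio: (a*d) / (b*c).
odds_ratio = (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)

# Attributable fraction among the exposed, using the odds ratio.
attr_frac_exposed = (odds_ratio - 1) / odds_ratio

# Population attributable fraction = AF(exposed) * proportion of cases exposed.
prop_cases_exposed = cases_exposed / (cases_exposed + cases_unexposed)
attr_frac_pop = attr_frac_exposed * prop_cases_exposed

print(round(odds_ratio, 6), round(attr_frac_exposed, 7), round(attr_frac_pop, 7))
```

The three printed values should agree, up to rounding, with the Odds ratio, Attr. frac. ex. and Attr. frac. pop rows of the cc output.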

. * tab2 with 0/1 variables
. tab2 exposure01 case

-> tabulation of exposure01 by case

           | Case status (1=case,
           |      0=control)
exposure01 |         0          1 |     Total
-----------+----------------------+----------
     other |       724        169 |       893
     heavy |        51         31 |        82
-----------+----------------------+----------
     Total |       775        200 |       975

Heavy exposure is in row 2, and the outcome of case=yes is in column 2.

. * tab2 with 1/2 variables - more to your liking?
. tab2 exposure12 case12

-> tabulation of exposure12 by case12

           |        case12
exposure12 |      case    control |     Total
-----------+----------------------+----------
     heavy |        31         51 |        82
     other |       169        724 |       893
-----------+----------------------+----------
     Total |       200        775 |       975

Better. Heavy exposure is now in row 1 and cases are now in column 1.

. tab2 case12 exposure12

-> tabulation of case12 by exposure12

           |      exposure12
    case12 |     heavy      other |     Total
-----------+----------------------+----------
      case |        31        169 |       200
   control |        51        724 |       775
-----------+----------------------+----------
     Total |        82        893 |       975

Or, you might like this arrangement. Cases are now in row 1 and heavy exposure is now in column 1.

2. Tip – How to Create Quartile Groupings of a Continuous Variable

Creating Quartiles is Useful in Assessing Linearity of Logit
In regression analysis, it is often of interest to explore linearity of the outcome in relationship to a continuous predictor. To do this, a new variable is created that is a grouped measure of the original continuous variable.

Ille-et-Vilaine Data: Illustration
This data set has two continuous variables: age (age, years) and tob (tobacco consumption, gm/day). Here we consider the variable age. In this lab session, let's create two new variables. Each is a two-step process.

age_quartile = Quartile of age, coded 1, 2, 3 or 4
age_qmedian = Median of age, within quartile of age

. *

. *** Create age_quartile = quartiles of age, coded 1, 2, 3, 4

. centile age, c(0 25 50 75 100)

                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         age |     975          0           25              25          25*
             |                 25           41              40          42
             |                 50           52              51          53
             |                 75           63              62          64
             |                100           91              91          91*

* Lower (upper) confidence limit held at minimum (maximum) of sample

. generate age_quartile=age
. recode age_quartile (min/41=1) (41.01/52=2) (52.01/63=3) (63.01/91=4)
(age_quartile: 975 changes made)

. *
. *** Create age_qmedian = quartile medians of age, in years
. sort age_quartile
. by age_quartile: centile age, c(50)

------------------------------------------------------------------------------------------
-> age_quartile = 1
                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         age |     250         50           35              34          35

------------------------------------------------------------------------------------------
-> age_quartile = 2
                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         age |     248         50           47              46          48

------------------------------------------------------------------------------------------
-> age_quartile = 3
                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         age |     239         50           59              59          60

------------------------------------------------------------------------------------------
-> age_quartile = 4
                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
         age |     238         50           69              68          69

. generate age_qmedian=age_quartile
. recode age_qmedian (1=35) (2=47) (3=59) (4=69)
(age_qmedian: 975 changes made)
. label variable age_quartile "Quartile of Age"
. label variable age_qmedian "Quartile Median Age"

. * Check variable creations
. tab2 age_quartile age_qmedian

-> tabulation of age_quartile by age_qmedian

  Quartile |             Quartile Median Age
    of Age |        35         47         59         69 |     Total
-----------+--------------------------------------------+----------
         1 |       250          0          0          0 |       250
         2 |         0        248          0          0 |       248
         3 |         0          0        239          0 |       239
         4 |         0          0          0        238 |       238
-----------+--------------------------------------------+----------
     Total |       250        248        239        238 |       975
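The two-step grouping (cut points first, then within-group medians) can be summarised as a small lookup. The sketch below is a Python illustration only, assuming ages recorded in whole years; the function names age_quartile and age_qmedian are hypothetical helpers that mirror the Stata variable names, and the cut points and medians are the ones read off the centile output.

```python
from bisect import bisect_left

# Upper bounds of quartiles 1-3, read from the centile output (41, 52, 63).
CUTS = [41, 52, 63]
# Within-quartile medians, read from the by-quartile centile output.
QMEDIAN = {1: 35, 2: 47, 3: 59, 4: 69}

def age_quartile(age):
    """Return the quartile code 1-4; an age equal to a cut point stays in the lower group."""
    return bisect_left(CUTS, age) + 1

def age_qmedian(age):
    """Return the median age of the quartile that `age` falls in."""
    return QMEDIAN[age_quartile(age)]

for age in (25, 41, 42, 63, 64, 91):
    print(age, age_quartile(age), age_qmedian(age))
```

Replacing each age by its quartile median gives a grouped predictor that keeps the original units (years), which is convenient when plotting the logit against age.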

3. Fit a Logistic Regression Model

Summary
The commands logit and logistic will fit logistic regression models. Using logit with no option will produce betas. Using logistic will produce odds ratios. You can also get odds ratios using the command logit with or as an option. Stata also has commands eststo, estout and esttab for producing comparisons of models that are easier to read.

Ille-et-Vilaine Data: Illustration
After creating some new variables for illustration purposes, 4 logistic regression models are fit and then compared side-by-side.

Model 1: Predictors = heavy drinking, age
Model 2: Predictors = heavy smoking, age
Model 3: Predictors = heavy drinking, heavy smoking, age
Model 4: Predictors = heavy drinking, heavy smoking, drinking x smoking interaction, age

. * Create some new variables for illustration purposes
. * HEAVY DRINKER: Create alcohol_80plus = 0/1 measure of alcohol use >=80 gm/day.
. generate alcohol_80plus=alcgp
. recode alcohol_80plus (1=0) (2=0) (3=1) (4=1)
(alcohol_80plus: 975 changes made)
. label define alcoholf 0 "< 80 gm/day" 1 "80+ gm/day"
. label values alcohol_80plus alcoholf
. label variable alcohol_80plus "Heavy Drinker"

. * Check variable creation
. tab2 alcgp alcohol_80plus

-> tabulation of alcgp by alcohol_80plus

    Grouped |
    alcohol |     Heavy Drinker
    consum. | < 80 gm/d  80+ gm/da |     Total
------------+----------------------+----------
0-39 gm/day |       414          0 |       414
      40-79 |       355          0 |       355
     80-119 |         0        139 |       139
       120+ |         0         67 |        67
------------+----------------------+----------
      Total |       769        206 |       975

. * HEAVY SMOKER: Create smoking_30plus = 0/1 measure of tobacco use >=30 gm/day.
. generate smoking_30plus=tobgp
. recode smoking_30plus (1=0) (2=0) (3=0) (4=1)
(smoking_30plus: 975 changes made)
. label define smokingf 0 "< 30 gm/day" 1 "30+ gm/day"
. label values smoking_30plus smokingf

. * Check variable creation
. numlabel, add
. tab2 tobgp smoking_30plus

-> tabulation of tobgp by smoking_30plus

      Grouped |
      tobacco |    smoking_30plus
      consum. | 0. < 30 g  1. 30+ gm |     Total
--------------+----------------------+----------
1. 0-9 gm/day |       526          0 |       526
     2. 10-19 |       236          0 |       236
     3. 20-29 |       131          0 |       131
       4. 30+ |         0         82 |        82
--------------+----------------------+----------
        Total |       893         82 |       975

. * INTERACTION: Create drinker_smoker = interaction of heavy drinking and heavy smoking
. generate drinker_smoker=alcohol_80plus*smoking_30plus
. label variable drinker_smoker "Interaction alcohol*smoking"

. * USER CREATED DESIGN VARIABLES FOR AGEGP
. * Note – If you do not have the command fre, type findit fre and download.
. fre agegp

agegp -- Age group
-------------------------------------------------------------
                |      Freq.    Percent      Valid       Cum.
----------------+--------------------------------------------
Valid  1 25-34  |        116      11.90      11.90      11.90
       2 35-44  |        199      20.41      20.41      32.31
       3 45-54  |        213      21.85      21.85      54.15
       4 55-64  |        242      24.82      24.82      78.97
       5 65-74  |        161      16.51      16.51      95.49
       6 75+    |         44       4.51       4.51     100.00
       Total    |        975     100.00     100.00
-------------------------------------------------------------

. generate age3544=(agegp==2) if agegp !=.
. generate age4554=(agegp==3) if agegp !=.
. generate age5564=(agegp==4) if agegp !=.
. generate age6574=(agegp==5) if agegp !=.
. generate age75plus=(agegp==6) if agegp !=.
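The five generate statements build one 0/1 design (indicator) variable per age group, leaving the youngest group (25-34) as the reference. As a language-neutral illustration of that coding scheme, here is a short Python sketch; design_variables is a hypothetical helper, not a Stata or library function, and None stands in for Stata's missing value.

```python
# Age-group codes and labels as listed by `fre agegp`.
AGE_GROUPS = {1: "25-34", 2: "35-44", 3: "45-54", 4: "55-64", 5: "65-74", 6: "75+"}

def design_variables(agegp, reference=1):
    """Map an age-group code to 0/1 indicators, omitting the reference group.

    A missing code (None) yields None, mirroring `if agegp != .` in the handout.
    """
    if agegp is None:
        return None
    return {level: int(agegp == level) for level in AGE_GROUPS if level != reference}

print(design_variables(1))  # reference group: every indicator is 0
print(design_variables(3))  # 45-54: only the indicator for level 3 is 1
```

With k groups there are k - 1 indicators, and each odds ratio for an indicator compares that group with the reference group.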

. * Check variable creations.
. numlabel, add
. tab2 agegp age3544

-> tabulation of agegp by age3544

           |        age3544
 Age group |         0          1 |     Total
-----------+----------------------+----------
  1. 25-34 |       116          0 |       116
  2. 35-44 |         0        199 |       199
  3. 45-54 |       213          0 |       213
  4. 55-64 |       242          0 |       242
  5. 65-74 |       161          0 |       161
    6. 75+ |        44          0 |        44
-----------+----------------------+----------
     Total |       776        199 |       975

. tab2 agegp age4554

-> tabulation of agegp by age4554

           |        age4554
 Age group |         0          1 |     Total
-----------+----------------------+----------
  1. 25-34 |       116          0 |       116
  2. 35-44 |       199          0 |       199
  3. 45-54 |         0        213 |       213
  4. 55-64 |       242          0 |       242
  5. 65-74 |       161          0 |       161
    6. 75+ |        44          0 |        44
-----------+----------------------+----------
     Total |       762        213 |       975

. tab2 agegp age5564

-> tabulation of agegp by age5564

           |        age5564
 Age group |         0          1 |     Total
-----------+----------------------+----------
  1. 25-34 |       116          0 |       116
  2. 35-44 |       199          0 |       199
  3. 45-54 |       213          0 |       213
  4. 55-64 |         0        242 |       242
  5. 65-74 |       161          0 |       161
    6. 75+ |        44          0 |        44
-----------+----------------------+----------
     Total |       733        242 |       975

. tab2 agegp age6574

-> tabulation of agegp by age6574

           |        age6574
 Age group |         0          1 |     Total
-----------+----------------------+----------
  1. 25-34 |       116          0 |       116
  2. 35-44 |       199          0 |       199
  3. 45-54 |       213          0 |       213
  4. 55-64 |       242          0 |       242
  5. 65-74 |         0        161 |       161
    6. 75+ |        44          0 |        44
-----------+----------------------+----------
     Total |       814        161 |       975

. tab2 agegp age75plus

-> tabulation of agegp by age75plus

           |       age75plus
 Age group |         0          1 |     Total
-----------+----------------------+----------
  1. 25-34 |       116          0 |       116
  2. 35-44 |       199          0 |       199
  3. 45-54 |       213          0 |       213
  4. 55-64 |       242          0 |       242
  5. 65-74 |       161          0 |       161
    6. 75+ |         0         44 |        44
-----------+----------------------+----------
     Total |       931         44 |       975

. * Logistic model with user defined design variables for agegp
. logistic case alcohol_80plus age3544 age4554 age5564 age6574 age75plus

Logistic regression                             Number of obs     =        975
                                                LR chi2(6)        =     199.30
                                                Prob > chi2       =     0.0000
Log likelihood = -395.09465                     Pseudo R2         =     0.2014

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus |   5.228385   .9892462     8.74   0.000     3.608397    7.575664
       age3544 |   4.683066   4.991105     1.45   0.147     .5798813       37.82
       age4554 |   24.53994   25.10521     3.13   0.002     3.304189    182.2562
       age5564 |    40.6956   41.44652     3.64   0.000     5.528909    299.5404
       age6574 |   52.77508   53.98653     3.88   0.000     7.107029     391.895
     age75plus |   52.41941   55.81879     3.72   0.000     6.502652    422.5653
         _cons |   .0064139   .0064735    -5.00   0.000     .0008872    .0463704
--------------------------------------------------------------------------------

. *
. * Logistic model with STATA defined design variables for agegp
. logistic case alcohol_80plus i.agegp

Logistic regression                             Number of obs     =        975
                                                LR chi2(6)        =     199.30
                                                Prob > chi2       =     0.0000
Log likelihood = -395.09465                     Pseudo R2         =     0.2014

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus |   5.228385   .9892462     8.74   0.000     3.608397    7.575664
               |
         agegp |
     2. 35-44  |   4.683066   4.991105     1.45   0.147     .5798813       37.82
     3. 45-54  |   24.53994   25.10521     3.13   0.002     3.304189    182.2562
     4. 55-64  |    40.6956   41.44652     3.64   0.000     5.528909    299.5404
     5. 65-74  |   52.77508   53.98653     3.88   0.000     7.107029     391.895
       6. 75+  |   52.41941   55.81879     3.72   0.000     6.502652    422.5653
               |
         _cons |   .0064139   .0064735    -5.00   0.000     .0008872    .0463704
--------------------------------------------------------------------------------

The two outputs match. For the rest of this lab session, I will use Stata-defined design variables.

. * MODEL 1 –
. * Logistic Regression Heavy Drinking Alone - adjusted for age
. logistic case alcohol_80plus i.agegp

Logistic regression                             Number of obs     =        975
                                                LR chi2(6)        =     199.30
                                                Prob > chi2       =     0.0000
Log likelihood = -395.09465                     Pseudo R2         =     0.2014

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus |   5.228385   .9892462     8.74   0.000     3.608397    7.575664
               |
         agegp |
     2. 35-44  |   4.683066   4.991105     1.45   0.147     .5798813       37.82
     3. 45-54  |   24.53994   25.10521     3.13   0.002     3.304189    182.2562
     4. 55-64  |    40.6956   41.44652     3.64   0.000     5.528909    299.5404
     5. 65-74  |   52.77508   53.98653     3.88   0.000     7.107029     391.895
       6. 75+  |   52.41941   55.81879     3.72   0.000     6.502652    422.5653
               |
         _cons |   .0064139   .0064735    -5.00   0.000     .0008872    .0463704
--------------------------------------------------------------------------------

. * ESTSTO to save results for later comparison
. eststo model1

. * MODEL 2 –
. * Logistic Regression Heavy Smoking Alone - adjusted for age
. logistic case smoking_30plus i.agegp

Logistic regression                             Number of obs     =        975
                                                LR chi2(6)        =     145.72
                                                Prob > chi2       =     0.0000
Log likelihood = -421.88661                     Pseudo R2         =     0.1473

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
smoking_30plus |   4.211492   1.205916     5.02   0.000     2.402722    7.381906
               |
         agegp |
     2. 35-44  |   6.267996   6.675411     1.72   0.085     .7773197    50.54262
     3. 45-54  |   38.39114   39.30348     3.56   0.000     5.161798     285.536
     4. 55-64  |   65.17199   66.48418     4.09   0.000      8.82513    481.2834
     5. 65-74  |   82.44814   84.59853     4.30   0.000     11.03516    616.0035
       6. 75+  |    59.4483   63.32511     3.84   0.000     7.369337    479.5683
               |
         _cons |   .0060567   .0061361    -5.04   0.000     .0008315    .0441165
--------------------------------------------------------------------------------

. * ESTSTO to save results for later comparison
. eststo model2

. * MODEL 3 –
. * Logistic Regression Heavy Drinking and Heavy Smoking - adjusted for age
. logistic case alcohol_80plus smoking_30plus i.agegp

Logistic regression                             Number of obs     =        975
                                                LR chi2(7)        =     219.23
                                                Prob > chi2       =     0.0000
Log likelihood = -385.12755                     Pseudo R2         =     0.2216

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus |   5.122038   .9859232     8.49   0.000     3.512352    7.469432
smoking_30plus |   3.989895   1.226166     4.50   0.000     2.184601    7.287035
               |
         agegp |
     2. 35-44  |   6.300504   6.788536     1.71   0.088     .7625007    52.06074
     3. 45-54  |   33.11226   34.34676     3.37   0.001     4.335635    252.8861
     4. 55-64  |   56.18964   58.08311     3.90   0.000     7.409194    426.1294
     5. 65-74  |   80.93014   84.28445     4.22   0.000     10.51063    623.1486
       6. 75+  |   71.41624   77.18044     3.95   0.000     8.588013    593.8835
               |
         _cons |   .0040597   .0041822    -5.35   0.000      .000539    .0305757
--------------------------------------------------------------------------------

. * ESTSTO to save results for later comparison
. eststo model3

. * MODEL 4 –
. * Logistic Regression Heavy Drinking and Heavy Smoking PLUS INTERACTION - adjusted
. logistic case alcohol_80plus smoking_30plus i.agegp drinker_smoker

Logistic regression                             Number of obs     =        975
                                                LR chi2(8)        =     219.35
                                                Prob > chi2       =     0.0000
Log likelihood = -385.07068                     Pseudo R2         =     0.2217

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus |   5.018372   1.012542     7.99   0.000     3.379235    7.452591
smoking_30plus |   3.726905   1.379709     3.55   0.000     1.803979    7.699546
               |
         agegp |
     2. 35-44  |   6.563396   7.142474     1.73   0.084     .7777266    55.38986
     3. 45-54  |   34.21248   35.78455     3.38   0.001     4.404231    265.7657
     4. 55-64  |   58.12253   60.60994     3.90   0.000     7.528611    448.7187
     5. 65-74  |   83.55392   87.73638     4.21   0.000     10.66981    654.3001
       6. 75+  |   74.15214   80.85386     3.95   0.000     8.749685    628.4273
               |
drinker_smoker |   1.251609    .839525     0.33   0.738     .3361396    4.660342
         _cons |   .0039536   .0041028    -5.33   0.000     .0005172    .0302218
--------------------------------------------------------------------------------

. * ESTSTO to save results for later comparison
. eststo model4
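In a model with an interaction term, the odds ratio for a person who is both a heavy drinker and a heavy smoker (relative to a person of the same age group who is neither) is the product of the two main-effect odds ratios and the interaction odds ratio. The Python sketch below works that product out from the Model 4 estimates; it is an illustrative calculation only, uses the rounded values printed in the output, and ignores the uncertainty in the estimates.

```python
# Odds ratios copied from the Model 4 output above.
or_alcohol = 5.018372       # alcohol_80plus
or_smoking = 3.726905       # smoking_30plus
or_interaction = 1.251609   # drinker_smoker

# Heavy drinker AND heavy smoker versus neither, same age group.
or_both = or_alcohol * or_smoking * or_interaction

# What a model with no interaction term would imply from the same main effects.
or_both_no_interaction = or_alcohol * or_smoking

print(round(or_both, 2), round(or_both_no_interaction, 2))
```

Because the interaction odds ratio is close to 1 and its confidence interval (.34 to 4.66) is wide, the two products are not very different, which is in line with the non-significant z test for drinker_smoker.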

. * ESTOUT for side-by-side comparison of the 4 models
. * BETAs
. estout model1 model2 model3 model4, prehead("Logistic Regression of Esophageal Cancer - BETA's")

Logistic Regression of Esophageal Cancer - BETA's
----------------------------------------------------------------
                   model1       model2       model3       model4
                        b            b            b            b
----------------------------------------------------------------
case
alcohol_80~s     1.654102                  1.633552     1.613106
1b.agegp                0            0            0            0
2.agegp          1.543953     1.835457      1.84063     1.881508
3.agegp          3.200302     3.647827     3.499904      3.53259
4.agegp           3.70612      4.17703     4.028732     4.062553
5.agegp          3.966039     4.412169     4.393586     4.425492
6.agegp          3.959277     4.085107     4.268525     4.306119
smoking_30~s                  1.437817     1.383765     1.315578
drinker_sm~r                                              .22443
_cons           -5.049291    -5.106586    -5.506652    -5.533117
----------------------------------------------------------------

Okay. We see the betas.

. * ODDS RATIOS – use option eform
. estout model1 model2 model3 model4, eform prehead("Logistic Regression of Esophageal Cancer - ODDS RATIO's")

Logistic Regression of Esophageal Cancer - ODDS RATIO's
----------------------------------------------------------------
                   model1       model2       model3       model4
                        b            b            b            b
----------------------------------------------------------------
case
alcohol_80~s     5.228385                  5.122038     5.018372
1b.agegp                1            1            1            1
2.agegp          4.683066     6.267996     6.300504     6.563396
3.agegp          24.53994     38.39114     33.11226     34.21248
4.agegp           40.6956     65.17199     56.18964     58.12253
5.agegp          52.77508     82.44814     80.93014     83.55392
6.agegp          52.41941      59.4483     71.41624     74.15214
smoking_30~s                  4.211492     3.989895     3.726905
drinker_sm~r                                            1.251609
_cons            .0064139     .0060567     .0040597     .0039536
----------------------------------------------------------------

The option "eform" stands for "exponentiated coefficients." Thus, these are the odds ratios.
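The link between the two tables is simply odds ratio = exp(beta). As a quick check outside Stata, the Python sketch below exponentiates a few of the Model 1 betas and compares them with the odds ratios in the eform table; the dictionary keys are illustrative labels, and small differences in the last digit are expected because the printed betas are rounded.

```python
import math

# A few Model 1 betas copied from the first estout table.
betas_model1 = {
    "alcohol_80plus": 1.654102,
    "2.agegp": 1.543953,
    "_cons": -5.049291,
}

# Exponentiating each beta gives the corresponding odds ratio
# (for _cons, the baseline odds).
odds_ratios = {name: math.exp(beta) for name, beta in betas_model1.items()}

for name, value in odds_ratios.items():
    print(name, round(value, 6))
```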

. * ESTTAB for side-by-side comparison of the 4 models
. * BETAs with chi square statistics
. esttab model1 model2 model3 model4, stats(n chi2 bic, star(chi2)) prehead("Logistic Regression of Esophageal Cancer - BETA's")

Logistic Regression of Esophageal Cancer - BETA's
                      (1)             (2)             (3)             (4)
                     case            case            case            case
----------------------------------------------------------------------------
case
alcohol_80~s        1.654***                        1.634***        1.613***
                   (8.74)                          (8.49)          (7.99)

1b.agegp                0               0               0               0
                      (.)             (.)             (.)             (.)

2.agegp             1.544           1.835           1.841           1.882
                   (1.45)          (1.72)          (1.71)          (1.73)

3.agegp             3.200**         3.648***        3.500***        3.533***
                   (3.13)          (3.56)          (3.37)          (3.38)

4.agegp             3.706***        4.177***        4.029***        4.063***
                   (3.64)          (4.09)          (3.90)          (3.90)

5.agegp             3.966***        4.412***        4.394***        4.425***
                   (3.88)          (4.30)          (4.22)          (4.21)

6.agegp             3.959***        4.085***        4.269***        4.306***
                   (3.72)          (3.84)          (3.95)          (3.95)

smoking_30~s                        1.438***        1.384***        1.316***
                                   (5.02)          (4.50)          (3.55)

drinker_sm~r                                                        0.224
                                                                   (0.33)

_cons              -5.049***       -5.107***       -5.507***       -5.533***
                  (-5.00)         (-5.04)         (-5.35)         (-5.33)
----------------------------------------------------------------------------
n
chi2                199.3***        145.7***        219.2***        219.3***
bic                 838.4           892.0           825.3           832.1
----------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

This tabulation shows the betas. Underneath each beta, in parentheses, is the test statistic = (beta/standard error). esttab labels these "t statistics"; they are the same values reported as z in the logistic output.
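Two of the quantities in this table can be recovered from the earlier logistic output. The Python sketch below is a rough cross-check under two stated assumptions: (1) the standard error of a beta is the odds-ratio standard error divided by the odds ratio (delta method), and (2) BIC is computed as -2 ln(likelihood) + k ln(n), where k counts all estimated coefficients including the constant. The function and variable names are illustrative.

```python
import math

# --- z statistic for alcohol_80plus in Model 1 ---
odds_ratio, se_odds_ratio = 5.228385, 0.9892462   # from the logistic output
beta = math.log(odds_ratio)
se_beta = se_odds_ratio / odds_ratio               # delta-method assumption
z_alcohol = beta / se_beta

# --- BIC for each model: (log likelihood, number of coefficients incl. _cons) ---
N_OBS = 975
fits = {
    "model1": (-395.09465, 7),
    "model2": (-421.88661, 7),
    "model3": (-385.12755, 8),
    "model4": (-385.07068, 9),
}

def bic(log_likelihood, k, n=N_OBS):
    """BIC = -2 lnL + k ln(n); smaller values indicate a preferred model."""
    return -2 * log_likelihood + k * math.log(n)

bics = {name: bic(ll, k) for name, (ll, k) in fits.items()}

print(round(z_alcohol, 2))
print({name: round(value, 1) for name, value in bics.items()})
```

By this criterion Model 3 (heavy drinking, heavy smoking, age) has the smallest BIC of the four, consistent with the bic row of the table.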

. * ODDS RATIOS and 95% CI's
. * Models 1, 2, and 3
. esttab model1 model2 model3, stats(n chi2 bic, star(chi2)) eform ci prehead("Logistic Regression of Esophageal Cancer - ODDS RATIO's")

Logistic Regression of Esophageal Cancer - ODDS RATIO's
                              (1)                     (2)                     (3)
                             case                    case                    case
------------------------------------------------------------------------------------------
case
alcohol_80~s                5.228***                                        5.122***
                    [3.608,7.576]                                   [3.512,7.469]

1b.agegp                        1                       1                       1
                            [1,1]                   [1,1]                   [1,1]

2.agegp                     4.683                   6.268                   6.301
                    [0.580,37.82]           [0.777,50.54]           [0.763,52.06]

3.agegp                     24.54**                 38.39***                33.11***
                    [3.304,182.3]           [5.162,285.5]           [4.336,252.9]

4.agegp                     40.70***                65.17***                56.19***
                    [5.529,299.5]           [8.825,481.3]           [7.409,426.1]

5.agegp                     52.78***                82.45***                80.93***
                    [7.107,391.9]           [11.04,616.0]           [10.51,623.1]

6.agegp                     52.42***                59.45***                71.42***
                    [6.503,422.6]           [7.369,479.6]           [8.588,593.9]

smoking_30~s                                        4.211***                3.990***
                                            [2.403,7.382]           [2.185,7.287]
------------------------------------------------------------------------------------------
n
chi2                        199.3***                145.7***                219.2***
bic                         838.4                   892.0                   825.3
------------------------------------------------------------------------------------------
Exponentiated coefficients; 95% confidence intervals in brackets
* p<0.05, ** p<0.01, *** p<0.001

I am showing models 1, 2, and 3 only because the output wraps around (and is unreadable) if I try to show all 4 models. This tabulation shows the odds ratios and associated 95% CI.

. * ODDS RATIOS and 95% CI's
. * Models 3 and 4
. esttab model3 model4, stats(n chi2 bic, star(chi2)) eform ci prehead("Logistic Regression of Esophageal Cancer - ODDS RATIO's")

Logistic Regression of Esophageal Cancer - ODDS RATIO's
                              (1)                     (2)
                             case                    case
----------------------------------------------------------------
case
alcohol_80~s                5.122***                5.018***
                    [3.512,7.469]           [3.379,7.453]

smoking_30~s                3.990***                3.727***
                    [2.185,7.287]           [1.804,7.700]

1b.agegp                        1                       1
                            [1,1]                   [1,1]

2.agegp                     6.301                   6.563
                    [0.763,52.06]           [0.778,55.39]

3.agegp                     33.11***                34.21***
                    [4.336,252.9]           [4.404,265.8]

4.agegp                     56.19***                58.12***
                    [7.409,426.1]           [7.529,448.7]

5.agegp                     80.93***                83.55***
                    [10.51,623.1]           [10.67,654.3]

6.agegp                     71.42***                74.15***
                    [8.588,593.9]           [8.750,628.4]

drinker_sm~r                                        1.252
                                            [0.336,4.660]
----------------------------------------------------------------
n
chi2                        219.2***                219.3***
bic                         825.3                   832.1
----------------------------------------------------------------
Exponentiated coefficients; 95% confidence intervals in brackets
* p<0.05, ** p<0.01, *** p<0.001

4. Likelihood Ratio Test for 2 “Hierarchical” Models

Summary It is of interest to know whether the inclusion of extra predictors to a model is statistically significant. The smaller model (“reduced”) contains the control variables. The larger model (“full”) contains the control variables plus the extra variables in question. Models.

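$$\text{Reduced:}\quad \operatorname{logit}[\pi \mid X_1, X_2, \ldots, X_p] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

$$\text{Full:}\quad \operatorname{logit}[\pi \mid X_1, X_2, \ldots, X_p, X_{p+1}, X_{p+2}, \ldots, X_{p+k}] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_{p+1} X_{p+1} + \cdots + \beta_{p+k} X_{p+k}$$

Null and Alternative Hypotheses:

$$H_O: \beta_{p+1} = \beta_{p+2} = \cdots = \beta_{p+k} = 0$$

$$H_A: \text{not } H_O$$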

Definition: Likelihood Ratio Test (LR)

$$\text{LR statistic} = \text{Deviance}_{\text{REDUCED}} - \text{Deviance}_{\text{FULL}} = \left[(-2)\ln(\text{Likelihood})_{\text{REDUCED}}\right] - \left[(-2)\ln(\text{Likelihood})_{\text{FULL}}\right]$$

Under the null hypothesis, LR is distributed chi-square with DF = k, where k is the number of extra predictors in the full model.

Ille-et-Vilaine Data: Illustration
A likelihood ratio test is performed to assess the statistical significance of the interaction of heavy drinking and heavy smoking, controlling for age and the main effects of heavy drinking and heavy smoking. Thus,

Model “reduced”: Predictors = age, heavy drinking, heavy smoking
Model “full”: Predictors = age, heavy drinking, heavy smoking + (drinking x smoking)
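Written out for this illustration, the two models and the hypothesis being tested look roughly as follows. The coefficient labels (\beta_S, \beta_A, \beta_{AS}) and the indicator notation for the age groups are notation assumed here for readability, and drinker_smoker is taken to be the drinking x smoking product term named in the model description above.

```latex
\begin{aligned}
\text{Reduced:}\quad \operatorname{logit}(\pi) &= \beta_0 + \sum_{j=2}^{6} \beta_j\, I(\text{agegp}=j)
   + \beta_S\,\text{smoking\_30plus} + \beta_A\,\text{alcohol\_80plus} \\
\text{Full:}\quad \operatorname{logit}(\pi) &= \beta_0 + \sum_{j=2}^{6} \beta_j\, I(\text{agegp}=j)
   + \beta_S\,\text{smoking\_30plus} + \beta_A\,\text{alcohol\_80plus}
   + \beta_{AS}\,\text{drinker\_smoker} \\
H_O:\ \beta_{AS} &= 0 \qquad H_A:\ \beta_{AS} \neq 0 \\
\text{LR} &= \text{Deviance}_{\text{REDUCED}} - \text{Deviance}_{\text{FULL}} \ \sim\ \chi^2_{DF=1}\ \text{under } H_O
\end{aligned}
```

Only one extra predictor separates the two models, so k = 1 and the test has 1 degree of freedom.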


. *** LR Test WITH DISPLAY OF INTERMEDIATE RESULTS
. * Reduced model
. logistic case i.agegp smoking_30plus alcohol_80plus

Logistic regression                             Number of obs     =        975
                                                LR chi2(7)        =     219.23
                                                Prob > chi2       =     0.0000
Log likelihood = -385.12755                     Pseudo R2         =     0.2216

(-2) Log likelihood Reduced Model = 770.2551

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
         agegp |
     2. 35-44  |   6.300504    6.788536    1.71   0.088     .7625007    52.06074
     3. 45-54  |   33.11226    34.34676    3.37   0.001     4.335635    252.8861
     4. 55-64  |   56.18964    58.08311    3.90   0.000     7.409194    426.1294
     5. 65-74  |   80.93014    84.28445    4.22   0.000     10.51063    623.1486
       6. 75+  |   71.41624    77.18044    3.95   0.000     8.588013    593.8835
               |
smoking_30plus |   3.989895    1.226166    4.50   0.000     2.184601    7.287035
alcohol_80plus |   5.122038    .9859232    8.49   0.000     3.512352    7.469432
         _cons |   .0040597    .0041822   -5.35   0.000      .000539    .0305757
--------------------------------------------------------------------------------

. estimates store reduced

. * Full model
. logistic case i.agegp smoking_30plus alcohol_80plus drinker_smoker

Logistic regression                             Number of obs     =        975
                                                LR chi2(8)        =     219.35
                                                Prob > chi2       =     0.0000
Log likelihood = -385.07068                     Pseudo R2         =     0.2217

(-2) Log likelihood Full Model = 770.14136

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
         agegp |
     2. 35-44  |   6.563396    7.142474    1.73   0.084     .7777266    55.38986
     3. 45-54  |   34.21248    35.78455    3.38   0.001     4.404231    265.7657
     4. 55-64  |   58.12253    60.60994    3.90   0.000     7.528611    448.7187
     5. 65-74  |   83.55392    87.73638    4.21   0.000     10.66981    654.3001
       6. 75+  |   74.15214    80.85386    3.95   0.000     8.749685    628.4273
               |
smoking_30plus |   3.726905    1.379709    3.55   0.000     1.803979    7.699546
alcohol_80plus |   5.018372    1.012542    7.99   0.000     3.379235    7.452591
drinker_smoker |   1.251609     .839525    0.33   0.738     .3361396    4.660342
         _cons |   .0039536    .0041028   -5.33   0.000     .0005172    .0302218
--------------------------------------------------------------------------------

. estimates store full

. lrtest reduced full

Likelihood-ratio test                                 LR chi2(1)  =      0.11
(Assumption: reduced nested in full)                  Prob > chi2 =    0.7359

CHECK: [(-2) ln L reduced] - [(-2) ln L full] = 770.2551 - 770.14136 = 0.11374, match!


. *** LR Test - WITHOUT DISPLAY OF RESULTS using quietly:
. * Reduced model
. quietly: logistic case i.agegp smoking_30plus alcohol_80plus
. estimates store reduced

. * Full model
. quietly: logistic case i.agegp smoking_30plus alcohol_80plus drinker_smoker
. estimates store full

. lrtest reduced full

Likelihood-ratio test                                 LR chi2(1)  =      0.11
(Assumption: reduced nested in full)                  Prob > chi2 =    0.7359

CHECK: [(-2) ln L reduced] - [(-2) ln L full] = 770.2551 - 770.14136 = 0.11374, match!
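The CHECK arithmetic, together with the p-value, can also be reproduced by hand in Stata. The following is a minimal sketch that uses only the two log likelihoods printed in the outputs above; the scalar names (ll_reduced, ll_full, lr_stat, lr_p) are made up for this illustration.

```stata
* Log likelihoods copied from the reduced and full logistic outputs above
scalar ll_reduced = -385.12755
scalar ll_full    = -385.07068

* LR statistic = Deviance(reduced) - Deviance(full), where deviance = (-2) ln L
scalar lr_stat = (-2*ll_reduced) - (-2*ll_full)

* Upper-tail chi-square probability with DF = 1 (one extra predictor: the interaction)
scalar lr_p = chi2tail(1, lr_stat)

display "LR chi2(1)  = " %7.5f lr_stat
display "Prob > chi2 = " %6.4f lr_p
```

The two displayed values should agree with the lrtest output up to rounding. After running lrtest itself, the same quantities are also left behind in r(chi2), r(df), and r(p).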


5. Regression Diagnostics for Logistic Regression: Numerical

Preliminary – Install the suite of commands in the package spost9_ado

. * Step 1: Install SPost Using net install
. net install spost9_ado
checking spost9_ado consistency and verifying not already installed...
installing into /Users/cbigelow/Library/Application Support/Stata/ado/plus/...
installation complete.

. * Step 2: Now obtain all the ancillary files
. net get spost9_do
checking spost9_do consistency and verifying not already installed...
copying into current directory...
      copying  st9all.do
      copying  st9ch2tutorial.do
      copying  st9ch3estimate.do
      copying  st9ch4binary.do
      copying  st9ch5ordinal.do
      copying  st9ch6nomcase.do
      copying  st9ch7nomalt.do
      copying  st9ch8count.do
      copying  st9ch9other.do
      copying  binlfp2.dta
      copying  couart2.dta
      copying  gsskidvalue2.dta
      copying  nomocc2.dta
      copying  ordwarm2.dta
      copying  science2.dta
      copying  sciwork.dta
      copying  travel2.dta
      copying  travel2case.dta
      copying  wlsrnk.dta
ancillary files successfully copied.


Summary
Now you have a model that is your “candidate” final model. There are lots of further explorations you can do to assess whether this really is a “good” final model.

Ille-et-Vilaine Data: Illustration
Having failed to reject the null hypothesis in our likelihood ratio test of the interaction of heavy smoking and heavy drinking (p = 0.7359), our “candidate” final model contains: heavy drinking, heavy smoking, and age.

. * Before requesting any diagnostics of a model, you must have fit it.
. logistic case i.agegp smoking_30plus alcohol_80plus

Logistic regression                             Number of obs     =        975
                                                LR chi2(7)        =     219.23
                                                Prob > chi2       =     0.0000
Log likelihood = -385.12755                     Pseudo R2         =     0.2216

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
         agegp |
        35-44  |   6.300504    6.788536    1.71   0.088     .7625007    52.06074
        45-54  |   33.11226    34.34676    3.37   0.001     4.335635    252.8861
        55-64  |   56.18964    58.08311    3.90   0.000     7.409194    426.1294
        65-74  |   80.93014    84.28445    4.22   0.000     10.51063    623.1486
          75+  |   71.41624    77.18044    3.95   0.000     8.588013    593.8835
               |
smoking_30plus |   3.989895    1.226166    4.50   0.000     2.184601    7.287035
alcohol_80plus |   5.122038    .9859232    8.49   0.000     3.512352    7.469432
         _cons |   .0040597    .0041822   -5.35   0.000      .000539    .0305757
--------------------------------------------------------------------------------


. *
. ***** 5a) Numerical measures of fit using command FITSTAT
. fitstat

Measures of Fit for logistic of case

Log-Lik Intercept Only:       -494.744   Log-Lik Full Model:           -385.128
D(967):                        770.255   LR(7):                         219.233
                                         Prob > LR:                       0.000
McFadden's R2:                   0.222   McFadden's Adj R2:               0.205
ML (Cox-Snell) R2:               0.201   Cragg-Uhler(Nagelkerke) R2:      0.316
McKelvey & Zavoina's R2:         0.466   Efron's R2:                      0.224
Variance of y*:                  6.157   Variance of error:               3.290
Count R2:                        0.817   Adj Count R2:                    0.110
AIC:                             0.806   AIC*n:                         786.255
BIC:                         -5885.062   BIC':                         -171.056
BIC used by Stata:             825.315   AIC used by Stata:             786.255

PARTIAL KEY:
Log-Lik Intercept Only = -494.744: This is the log likelihood for the intercept-only model.
Log-Lik Full Model = -385.128: This is the log likelihood for the current model.
LR(7) = 219.233: This is the likelihood ratio chi square statistic, which tests whether the current model predicts better than the intercept-only model.
Prob > LR = 0.000: This is the p-value for the LR(7) test; it is displayed as 0.000, that is, p < .001.
Then there is a series of pseudo-R2 measures.
Finally, there is a series of information criterion measures that are used to compare different models.

. *
. ***** 5b) Test of Model Adequacy Using command LINKTEST
. linktest

-- iteration output omitted --

Logistic regression                             Number of obs     =        975
                                                LR chi2(2)        =     219.24
                                                Prob > chi2       =     0.0000
Log likelihood = -385.12412                     Pseudo R2         =     0.2216

------------------------------------------------------------------------------
        case |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   1.009135    .1404655    7.18   0.000     .7338274    1.284442
      _hatsq |   .0039801    .0479037    0.08   0.934    -.0899094    .0978696
       _cons |   .0008299    .1243723    0.01   0.995    -.2429353    .2445952
------------------------------------------------------------------------------

WHAT TO LOOK FOR:
We expect _HAT to be highly significant (a very small p-value).
Evidence of a GOOD FIT is reflected in a NON-SIGNIFICANT _HATSQ.
Here the p-value for _HATSQ is .934.
This suggests good model adequacy.
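Several of the fitstat entries in 5a can be recovered from just the two log likelihoods, the number of estimated parameters, and the sample size. The sketch below is a hand check under these assumptions: 8 estimated parameters (5 age-group indicators, smoking, alcohol, and the intercept), the usual textbook formulas for McFadden's R2, AIC, and BIC, and scalar names invented for this illustration.

```stata
* Inputs copied from the fitstat and logistic outputs above
scalar ll_null  = -494.744
scalar ll_model = -385.128
scalar n_par    = 8
scalar n_obs    = 975

* LR chi-square versus the intercept-only model
scalar lr_chi2 = 2*(ll_model - ll_null)

* McFadden's R2 and its parameter-penalized (adjusted) version
scalar r2_mcf     = 1 - ll_model/ll_null
scalar r2_mcf_adj = 1 - (ll_model - n_par)/ll_null

* Information criteria in the "used by Stata" form
scalar aic = -2*ll_model + 2*n_par
scalar bic = -2*ll_model + n_par*ln(n_obs)

display "LR(7)             = " %9.3f lr_chi2
display "McFadden's R2     = " %9.3f r2_mcf
display "McFadden's Adj R2 = " %9.3f r2_mcf_adj
display "AIC used by Stata = " %9.3f aic
display "BIC used by Stata = " %9.3f bic
```

Because the inputs are rounded to three decimals, the displayed values may differ from the fitstat output in the last digit.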


. *
. ***** 5c) Test of Overall Goodness of Fit Using command LFIT
. lfit, group(10) table

Logistic model for case, goodness-of-fit test

  (Table collapsed on quantiles of estimated probabilities)
  (There are only 9 distinct quantiles because of ties)
  +--------------------------------------------------------+
  | Group |   Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
  |-------+--------+-------+-------+-------+-------+-------|
  |     1 | 0.0159 |     0 |   0.6 |   106 | 105.4 |   106 |
  |     2 | 0.0249 |     6 |   3.9 |   153 | 155.1 |   159 |
  |     3 | 0.1158 |     4 |   5.1 |    45 |  43.9 |    49 |
  |     4 | 0.1185 |    16 |  17.5 |   132 | 130.5 |   148 |
  |     6 | 0.1857 |    27 |  29.7 |   133 | 130.3 |   160 |
  |-------+--------+-------+-------+-------+-------+-------|
  |     7 | 0.2473 |    42 |  38.0 |   115 | 119.0 |   157 |
  |     8 | 0.3462 |     0 |   0.3 |     1 |   0.7 |     1 |
  |     9 | 0.5388 |    66 |  62.8 |    67 |  70.2 |   133 |
  |    10 | 0.8704 |    39 |  41.9 |    23 |  20.1 |    62 |
  +--------------------------------------------------------+

       number of observations =       975
             number of groups =         9
      Hosmer-Lemeshow chi2(7) =         4.43
                  Prob > chi2 =         0.7291

WHAT TO LOOK FOR:
Evidence of GOOD OVERALL FIT is reflected in a NON-SIGNIFICANT p-value.
Here the Hosmer-Lemeshow test p-value is .7291.
This suggests good overall fit.
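The Hosmer-Lemeshow p-value can be checked by hand from the reported statistic, using degrees of freedom equal to the number of groups minus 2. A minimal sketch follows; the scalar names are invented for this illustration and the chi-square value is the rounded one printed above, so small differences in the last decimal are possible.

```stata
* Values copied from the lfit output above
scalar hl_chi2   = 4.43
scalar hl_groups = 9
* Hosmer-Lemeshow degrees of freedom = number of groups - 2
scalar hl_df = hl_groups - 2

* Upper-tail chi-square probability
scalar hl_p = chi2tail(hl_df, hl_chi2)
display "Hosmer-Lemeshow chi2(" hl_df ") = " %5.2f hl_chi2 "   Prob > chi2 = " %6.4f hl_p

* In Stata 14 the same test is also available after logistic as a
* postestimation command (requires the fitted model, so it is commented out here):
*   estat gof, group(10) table
```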


6. Regression Diagnostics for Logistic Regression: Graphical

. *

. ***** 6a) Plot of ROC Curve using LROC

. predict xb, xb

. lroc

Logistic model for case

number of observations =      975
area under ROC curve   =   0.8119

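WHAT TO LOOK FOR:
Classification that is no better than a coin toss is represented by the 45-degree reference line.
Evidence of GOOD FIT is reflected in an ROC curve that lies above the 45-degree reference line.
Area under the ROC curve = .8119 says that, for a randomly chosen case and a randomly chosen control, there is about an 81% chance that the model assigns the higher predicted probability to the case. It measures how well the model discriminates cases from controls; it is not the percent of observations correctly classified.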
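The area under the ROC curve and the percent correctly classified are different quantities, and both can be requested after the model has been fit. The sketch below assumes that the candidate model from section 5 (logistic case i.agegp smoking_30plus alcohol_80plus) is the most recent estimation command in the session; the 0.5 cutoff is an illustrative choice.

```stata
* Area under the ROC curve; nograph suppresses the plot
lroc, nograph
display "Area under ROC curve = " %6.4f r(area)

* Classification table at a predicted-probability cutoff of 0.5
estat classification, cutoff(0.5)
display "Percent correctly classified = " %5.2f r(P_corr)
```

The percent correctly classified depends on the cutoff chosen, whereas the area under the curve summarizes discrimination across all possible cutoffs. At a 0.5 cutoff it should correspond to the Count R2 reported by fitstat in 5a.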


. *
. ***** 6b) Plot of Y=Standardized Residual versus X=Observation Number
. predict std_residual, rs
. label variable std_residual "Standardized Residual"
. generate index=_n
. label variable index "Observation Number"
. graph twoway (scatter std_residual index,msymbol(d)), xlabel(0(100)1000) ylabel(-4(2)4) title("Plot of Standardized Residuals versus Observation Number") xtitle("Observation Number") ytitle("Standardized Residual") yline(0) caption("stdresidual.png", size(vsmall))

WHAT TO LOOK FOR:
Think of standardized residuals as Z-scores, approximately.
We’d like the majority to be within 1.96 of the expected value of 0.
Values outside ±1.96 are potentially extreme.
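The ±1.96 rule of thumb can also be applied numerically rather than by eye. The sketch below assumes that std_residual and index were created exactly as in 6b and are still in memory; the cutoff of 3 used for the listing is an arbitrary illustrative choice, not a recommendation from the handout.

```stata
* How many standardized residuals fall outside +/-1.96?
count if abs(std_residual) > 1.96 & !missing(std_residual)
display "Proportion outside +/-1.96 = " %5.3f r(N)/_N

* List the most extreme observations for a closer look
list index case std_residual if abs(std_residual) > 3 & !missing(std_residual)
```

If the residuals behave roughly like Z-scores, only a small share (on the order of 5%) would be expected outside ±1.96.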


. *
. ***** 6c) Plot of Influential Observations: Y=Cook versus X=Observation Number
. predict cook, dbeta
. label variable cook "Cook Distance"
. graph twoway (scatter cook index, msymbol(d)), xlabel(0(100)1000) title("Plot of Cook Distance versus Observation Number") xtitle("Observation Number") ytitle("Cook Distance") caption("cook.png", size(vsmall))

WHAT TO LOOK FOR:
Look for an even ribbon of Cook's distance values with no spikes. A spike flags an observation whose removal would noticeably change the estimated coefficients.
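Any spikes in the plot can be identified by listing the observations with the largest influence values. This sketch assumes that cook and index were created as in 6b and 6c and are still in memory; showing the top 10 is an arbitrary illustrative choice.

```stata
* Sort from largest to smallest influence value (missing values are placed last)
gsort -cook

* Show the 10 most influential observations with their covariate values
list index case agegp smoking_30plus alcohol_80plus cook in 1/10

* Restore the original observation order
sort index
```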


7. Tip – Save your Commands to a DO File for Later Use

Step 1 – Right click anywhere inside the Review window. From the drop down menu, choose SELECT ALL

Step 2 – Right click again. From the drop down menu, choose SEND TO DO-FILE EDITOR

Stata will put you into the Do-File Editor, where you should see the commands you selected.

Step 3 – Click on the SAVE icon. At SAVE AS: provide a name. At WHERE: provide a path
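As an alternative to the menu steps above, Stata can record commands to a file as you type them, using cmdlog. A minimal sketch follows; the file name is hypothetical, and the two commands in the middle stand in for whatever you run interactively (they assume the Ille-et-Vilaine data are in memory).

```stata
* Start recording every command typed from this point on
cmdlog using "logistic_illustration_commands.txt", replace

* ... work interactively, for example:
logistic case i.agegp smoking_30plus alcohol_80plus
lroc

* Stop recording
cmdlog close
```

The resulting text file contains the commands only (no output), so it can later be re-run with the do command or opened in the Do-File Editor and saved with a .do extension.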