Introduction to Epi Info (Version 3.4.1)
Analyze Data Module

By Kevin M. Sullivan, PhD, MPH, MHA and Minn Minn Soe, MD, MCTM, MPH

Department of Epidemiology
Rollins School of Public Health of Emory University

Figure 1. Epi Info Introductory Screen.

Intro to Epi Info 3.4.1 Analysis.doc
October 2007




TABLE OF CONTENTS

I. Introduction
II. Basic Commands
    Reading Data and Writing Data
    List Data
    Display Variables
III. Simple Analytic Commands and Graphics
    Frequencies
    Means
    Tables
    Match
    Summarize
    Graph
    Exercise 1
IV. Navigating and Managing the Output and the Program Editor windows
V. Data manipulation commands
    Sort/Cancel Sort
    Select/Cancel Select
    Define/Undefine
    Assign
    Recode
    If
    Exercise 2
VI. Setting System Defaults
VII. Advanced Statistics
    Linear Regression
        Simple linear regression
        Multiple linear regression
    Logistic Regression
        Unconditional logistic regression
        Conditional logistic regression
    Survival Analysis
        Kaplan-Meier
        Cox Proportional Hazards
    Complex sample commands
        Complex Sample Frequencies
        Complex Sample Tables
        Complex Sample Means
    Exercise 3
VIII. Statistics Command Options
    Stratify by
    Weight
IX. Advanced Data Management Topics
    Write (Export)
    Delete file/table
    Delete/undelete records
    Relate
    Merge
Acknowledgments
References
APPENDICES
    Appendix 1. Data Dictionaries
    Appendix 2. Operators/Functions
    Appendix 3. Answers to Exercises
    Appendix 4. Analysis Commands By Type of Variables


I. INTRODUCTION

Epi Info is a program developed by the Centers for Disease Control and Prevention (CDC) that runs under the Microsoft Windows® operating system and provides programs for data entry and analysis. The Epi Info program and help information can be found at www.cdc.gov/epiinfo. The purpose of this document is to introduce the Analyze Data module, discussing commands in a sequence appropriate for learning the program. More detailed information on some topics is provided in the later chapters, and details of the commands can be found in Epi Info’s on-line help. Also presented is the program OpenEpi (www.OpenEpi.com) and how it can be used to supplement Epi Info.

Figure 1 on the front of this document presents the Epi Info (3.4.1) introductory screen (release date: July 3, 2007). To start the Analyze Data module, click on the Analyze Data button in the lower left of the screen, or select Programs → Analyze Data from the pull-down menu (see Figure 1).

The main Analyze Data screen is shown in Figure 2. The screen is composed of three windows: a narrow window on the left side of the screen, labeled Analysis, that presents the commands (i.e., the “command tree”); the largest window, labeled Analysis Output, that occupies the upper right portion of the screen and is where the output is presented; and a smaller window in the bottom right of the screen, labeled Program Editor, where text commands appear.

Figure 2. Analyze Data main screen, Epi Info.

Before discussing the Analysis commands, let’s review some basics. To exit the Analyze Data module, click on the Exit button at the top of the left window; to minimize the module, click on the “_” button in the upper right-hand corner of the left window. The Help button at the bottom of the left window takes you to the Epi Info Help system. You can resize any of the three windows by placing the cursor where two windows meet, holding down the left mouse button, and dragging the border.


The commands listed in the narrow window on the left are grouped according to their general functions, such as those related to reading, relating, writing and merging data files (the Data section); those relating to creating and assigning values to variables (Variables and Select/If sections); and the analytic commands (Statistics and Advanced Statistics sections). A brief description of some of these commands is presented in Figure 3. In this document, the Analyze Data commands are described in the following chapters in this order:

II. Basic Commands
o Read (Import) – usually the first command used; to “read” or “open” a data file
o List – to view or update the data in a spreadsheet format
o Display – to display variable names and types

III. Simple Analytic Commands and Graphics
o Frequencies – for viewing the frequencies of values for a variable
o Means – similar to Frequencies for a single variable except for numeric data where the Means command provides summary statistics; can also perform independent t-test, one-way ANOVA, and their nonparametric equivalents
o Tables – for single and stratified 2x2 tables (where the odds ratio, risk ratio, and other measures of association are provided) or any size r x c table
o Match – for matched case-control data
o Summarize – to create a new table containing a summary of descriptive statistics for the current dataset
o Graph – graphing data

IV. Navigating and Managing the Output and the Program Editor Windows
o Issues related to using and navigating the Output and the Program Editor windows

V. Data manipulation commands
o Sort/Cancel Sort – sort the data/cancel a sort
o Select/Cancel Select – “select” or “unselect” a subset of records for analysis
o Define/Undefine – “define” or “undefine” new variables
o Assign – “assign” values to a variable
o Recode – recode from one variable to another variable
o If – If commands for conditional logic, also Then and Else

VI. Setting System Defaults
o Set – choose default settings

VII. Advanced Statistics
o Linear Regression – simple linear and multiple linear regression
o Logistic Regression – both unconditional and conditional logistic regression
o Survival Analysis
    Kaplan-Meier – simple survival analysis
    Cox Proportional Hazards – advanced survival analysis
o Complex sample commands – commands for use with cross-sectional data which include elements of cluster and/or stratification
    Complex Sample Frequencies
    Complex Sample Tables
    Complex Sample Means

VIII. Statistics Command Options
o Stratify by
o Weight

IX. Advanced Data Management Topics
o Write (Export)
o Delete File/Table
o Merge
o Relate


An introduction to these commands follows. Because the goal of this document is to provide an introduction, some details of the commands are not presented. For more detailed information on the commands, please consult the on-line help in Epi Info. For some examples of command output in this document, we have removed or slightly modified parts of the output to save space.

Figure 3. Short description of selected commands in Analyze Data

Window buttons
o The minimize button (“_”) on the Analysis commands window will minimize all 3 Analyze Data windows; the close button (“X”) closes all 3 windows, the same as the Exit button described next.
o The Exit button will exit from the Analyze Data module.

Data – data-related commands
o Read – read an Epi 2000 (i.e., Access 2000) file; can Import other file types, such as Epi 6, dBase, and Excel
o Relate – for relating files
o Write – for writing new data files or Exporting to a different file type (e.g., Epi Info 6, dBase, and others)
o Merge – to merge data files
o Delete – commands to Delete a file, table, or records, or to undelete records

Variables – commands for creating variables and assigning values
o Define/Undefine – to create/remove a temporary variable
o Assign – to assign values to a variable
o Recode – to recode a variable (a short-hand version of If/Then/Else)
o Display – to display the variables in the data set, and any temporary defined variables, by their variable names and types

Select/If – commands for selecting a subset of records, If/Then/Else statements, and sorting a file
o Select/Cancel Select – for selecting/unselecting a subset of records
o If – If/Then/Else commands
o Sort/Cancel Sort – for sorting/unsorting the file on one or more variables

Statistics – commands for presenting data and statistical analyses
o List – data depicted as a spreadsheet (“grid”); allows data entry
o Frequencies – frequency, %, cum %, and confidence intervals
o Tables – cross-tabulations; stratified analysis
o Match – for matched case-control analysis; can have one or more controls for each case
o Means – frequency, descriptive statistics, independent t-test, one-way ANOVA, and nonparametric tests
o Summarize – creates a new table with descriptive statistics
o Graph – graphing module
o Map – mapping module

Advanced Statistics
o Linear Regression – simple or multiple linear regression
o Logistic Regression – regression for dichotomous outcome variables
o Kaplan-Meier Survival – survival analysis
o Cox Proportional Hazards – survival analysis
o Complex Sample commands – for surveys using complex sample design, such as stratification and clusters, for Complex Sample Frequencies, Tables, and Means

Help – to open the Epi Info Help program

There are a number of commands lower in this window not described in this figure.


II. BASIC COMMANDS

Reading and Writing Data

One of the first things you will want to do in the Analyze Data module is to Read a data set, or, as usually stated in most Windows programs, to open a file. To Read a file, click on Read (Import) in the command window; a dialog box will be presented as shown in Figure 4.

Figure 4. Dialog box for Read (Import), Epi Info.

Epi Info uses the Microsoft Access data file format; within Epi Info these files are referred to as Epi 2000 files. If you are not familiar with Access files, they can seem complicated because they may contain many “tables”, which are the equivalent of data files such as Epi Info DOS .REC files and SPSS .SAV files. When Epi Info is installed on a computer, the default Current Project (see top of dialog box in Figure 4) is usually on drive C: in the Epi_Info folder, and the default Data Source is C:\Epi_Info\Sample.mdb, a file distributed with Epi Info that contains a number of example files. Descriptions and data dictionaries of these example files can be found in Appendix 1.

Microsoft Access files have an .mdb file extension (which stands for Microsoft Data Base). An .mdb file can contain more than one file and there can be several file types. Let’s not get into the details of .mdb files right now but just read one of the Views. The example file to be used right now is viewEvansCounty, a file from the textbook by Kleinbaum, Kupper, and Morgenstern, Epidemiologic Research. To open the file, either double-click on viewEvansCounty with the mouse or click once to highlight the file and then click the OK button. The file will be read and in the Output window you should see something similar to:

Current View: C:\Epi_Info\Sample.mdb:viewEvansCounty
Record Count: 609 (Deleted records excluded)    Date: 5/16/2005 4:35:27 PM

Of course, the date and time will differ depending on when the file is opened. More details on the viewEvansCounty data can be found in Appendix 1.


Reading file types other than Epi 2000

Epi Info can also read other types of files, including Microsoft Access, Epi 6, dBase, FoxPro, Excel, Paradox, and text/ASCII files. To select a file type other than Epi 2000, click on the down arrow to the right of Data Formats in the dialog box and select the file type from the pull-down menu. More details on reading other file types are provided in Section IX.

Exporting to other file types

The Write (Export) command allows users to save data into a different Epi Info .mdb data file or into another file format. The formats to which a file can be written are similar to those that can be imported, as described in the previous paragraph. More detail is provided in Section IX, Advanced Data Management Topics.

List Data

Usually one of the first things you might want to do with a data set that you are not familiar with is view the data in a spreadsheet format. This provides information such as variable names, the type of coding used, etc. To do this in Epi Info, click on List in the command window and a dialog box will be presented as depicted in Figure 5.

Figure 5. Dialog box for the List command, Epi Info.

There are three modes for displaying data listed under the Display Mode section of the dialog box: Web (HTML), Grid, and Allow Updates. The default selection is Grid, which presents the “spreadsheet-like” view of the data in the Output window as shown in Figure 6. As in a spreadsheet, the first row presents the variable names (e.g., ID, CHD, AGE, etc.). The rows below, in this data set, relate to individuals. The first individual in this data set was assigned an ID number 21, did not develop coronary heart disease (CHD), and was 56 years old. You can navigate through the data as in a spreadsheet. If you click on the Output window, the header will turn from gray to blue and you can then use the arrow keys and the Page Up and Page Down keys for navigation.

Another option on the List dialog box is the Allow Updates option that looks just like the Grid option but allows the user to change values in cells. The changes are permanent, so be careful! The Web (HTML) mode differs from the Grid and Allow Updates modes in that, rather than being spreadsheet-like in appearance, the data are written to the Output screen in HTML format, which is useful if you want to print the data. The default for all three List options is to present all the variables in a dataset. The user can select which variables to view and their order by clicking on the down arrow in the Variables box.

Figure 6. Example of Output screen for the Grid mode in the List command, viewEvansCounty data, Epi Info.

Display Variables

To view the variable names and types, use the Display command, which will present a dialog box as shown in Figure 7. Click on Display in the left command window, and then click on the OK button; the output is presented in Figure 8. The variable name, the field type (in this example, either a number, Yes/No, or text field), and the format are presented.

Figure 7. Dialog box for the Display command, Epi Info.


Figure 8. Example of the Display command, viewEvansCounty data, Epi Info.

DISPLAY DBVARIABLES

Variable   Table         Field Type   Format/Value   Special Info   Prompt
AGE        EvansCounty   NUMBER       ##                            AGEC
AGEG1      EvansCounty   YES/NO                                     AGEG1
AGEG2      EvansCounty   NUMBER       #                             AGEG2
CAT        EvansCounty   YES/NO                                     CAT
CHD        EvansCounty   YES/NO                                     CHD
CHL        EvansCounty   NUMBER       ###                           CHL
CHLG       EvansCounty   YES/NO                                     CHLG
DBP        EvansCounty   NUMBER       ###                           DBP
ECG        EvansCounty   YES/NO                                     ECG
HEM        EvansCounty   NUMBER       ##                            HEM
HPT        EvansCounty   YES/NO                                     HPT
ID         EvansCounty   NUMBER       #####                         ID
MAR        EvansCounty   YES/NO                                     MAR
MP         EvansCounty   NUMBER       ###                           MP
OCC        EvansCounty   NUMBER       #                             OCC
PLS        EvansCounty   NUMBER       ###                           PLS
QTI        EvansCounty   NUMBER       #####                         QTI
QTIG       EvansCounty   YES/NO                                     QTIG
SBP        EvansCounty   NUMBER       ###                           SBP
SES        EvansCounty   NUMBER       ##                            SES
SESG       EvansCounty   YES/NO                                     SESG
SMK        EvansCounty   YES/NO                                     SMK
Language   Defined       Text         ENGLISH        Predefined


III. SIMPLE ANALYTIC COMMANDS AND GRAPHICS

The simpler analytic commands, Frequencies, Means, Tables, and Match, are presented next. For additional information on when to use the various analytic commands (depending on, for example, the type of variable and the number of variables being analyzed at one time), see Appendix 4.

Frequencies

The Frequencies command provides a table of the values and frequency of each of the levels of a variable. The dialog box is depicted in Figure 9. For example, perform a frequency using one of the variables in the Evans County data called CHD; the output is shown in Figure 10.

Figure 9. Dialog box for the Frequencies command, Epi Info.

(Note: the dialog box for the Frequencies command is called FREQ)

Figure 10. Example of output for Frequencies command, viewEvansCounty data, Epi Info.

FREQ CHD

CHD      Frequency   Percent   Cum Percent
Yes      71          11.7%     11.7%
No       538         88.3%     100.0%
Total    609         100.0%    100.0%

95% Conf Limits
Yes      9.3%     14.5%
No       85.5%    90.7%

In Figure 10, the frequency (number of observations), the percent, and the cumulative percent are presented for each level of the CHD variable. A small horizontal bar graph of the frequencies is also presented, and 95% confidence intervals are provided for each level. For example, in Figure 10, 11.7% (71/609) of the men developed CHD during the study period, with a 95% confidence interval of (9.3%, 14.5%).
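The percentages and confidence limits in Figure 10 can be checked by hand. The Python sketch below is illustrative only and is not part of Epi Info. It computes the percent, cumulative percent, and a 95% Wilson score interval for each level of CHD. The choice of the Wilson method is an assumption (this document does not state which formula Epi Info uses), and the function name wilson_interval is hypothetical.

```python
import math

def wilson_interval(count, total, z=1.96):
    """Return (lower, upper) Wilson score limits for the proportion count/total."""
    p = count / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

# Counts taken from Figure 10 (FREQ CHD)
counts = {"Yes": 71, "No": 538}
total = sum(counts.values())

cumulative = 0.0
for level, count in counts.items():
    percent = 100 * count / total
    cumulative += percent
    lower, upper = wilson_interval(count, total)
    print(f"{level}: n={count}  {percent:.1f}%  cum {cumulative:.1f}%  "
          f"95% CL {100 * lower:.1f}% to {100 * upper:.1f}%")
```

Other interval methods (for example, an exact binomial interval) would give slightly different limits, so a close match with Figure 10 should not be read as confirmation of the method Epi Info uses.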


Means

For a single variable, the Means command provides information similar to that of the Frequencies command. Three differences between the Frequencies command and the Means command are:
1. The Frequencies command provides 95% confidence intervals – the Means command does not
2. The Means command works only with numeric, date, and time fields, and will not work with text fields, whereas the Frequencies command will present the frequency of each level of a text field
3. The Means command provides summary statistics

The dialog box for the Means command is shown in Figure 11. An abbreviated example of the Means command for the variable CHL (cholesterol) is shown in Figure 12. The summary statistics results include the total number of observations (“Obs”); the sum of all observations (“Total”); the mean, variance, and standard deviation (“Std Dev”) of the observations; minimum and maximum values; 25th, 50th (“median”), and 75th percentiles; and the mode. Note that if the variable has two or more modal values, the smallest modal value will be presented in the Epi Info output.

Figure 11. Dialog box for the Means command, Epi Info

Figure 12. Output from Means command for a single variable, viewEvansCounty data, Epi Info

MEANS CHL

CHL      Frequency   Percent   Cum Percent
94       1           0.2%      0.2%
113      2           0.3%      0.5%
…        …           …         …
336      2           0.3%      99.8%
357      1           0.2%      100.0%
Total    609         100.0%    100.0%

Obs    Total          Mean        Variance     Std Dev
609    128949.0000    211.7389    1586.1834    39.8269

Minimum    25%         Median      75%         Maximum     Mode
94.0000    184.0000    209.0000    234.0000    357.0000    211.0000

Note that the cholesterol levels between 113 and 336 are not presented to save space.
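The summary statistics in Figure 12 are related in simple ways: the mean is the Total divided by Obs, and the standard deviation is the square root of the variance. The short Python sketch below, which is illustrative and not part of Epi Info, checks these relationships using the values printed in Figure 12. It also shows the "smallest modal value" rule on a small made-up list; the helper smallest_mode and the list example are hypothetical and are not taken from the Evans County data.

```python
import math
from statistics import multimode

# Values reported in Figure 12 (MEANS CHL)
obs = 609
total = 128949.0
variance = 1586.1834

mean = total / obs              # Figure 12 reports a mean of 211.7389
std_dev = math.sqrt(variance)   # Figure 12 reports a Std Dev of 39.8269

def smallest_mode(values):
    """Return the smallest of the most frequent values (ties broken downward)."""
    return min(multimode(values))

# Hypothetical values with two modes (2 and 5); not from the Evans County data
example = [5, 2, 5, 2, 7]

print(f"mean = {mean:.4f}, std dev = {std_dev:.4f}")
print(f"smallest mode of {example} = {smallest_mode(example)}")
```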


The Means command can also compare two or more means. When comparing two means, an independent t-test is performed; when comparing more than two means, a one-way analysis of variance (ANOVA) is performed. The independent t-test and ANOVA can be used if the following assumptions are met:
• The outcome variable is normally distributed in each group
• The underlying variances are the same in each group

Epi Info does not provide a statistical test to determine if the data are normally distributed; the data could be graphed to see if they visually appear to be normally distributed. Epi Info does perform a test, called “Bartlett’s Test”, to determine if the second assumption above is met. If the p-value is large (say >0.05), this would suggest the variances are approximately equal; if the p-value from Bartlett’s test is small (say <0.05), this would suggest that the underlying variances are not the same and therefore the t-test and ANOVA results may not be appropriate for the data.

In the example in Figure 13, the cholesterol levels (CHL) of those who develop coronary heart disease (CHD) are compared to those without disease. The variances in Figure 13 can be assumed to be similar because the Bartlett’s test p-value is 0.9838. What should you do if the variances are not equal? One option is to transform the data, such as taking the log of the outcome variable. Another option is to use a nonparametric test, described in more detail below. Epi Info does not provide an independent t-test assuming unequal variances; such a test can be found in other programs such as SAS, SPSS, and OpenEpi.

In the example in Figure 13, since the variances can be assumed equal, we may wish to see if the mean cholesterol differs between those who developed CHD and those who did not. In this example, the mean cholesterol level in those who developed CHD was 222 mg/100mL compared to 210 mg/100mL in those who did not develop CHD. The t-test has a p-value of 0.0215, suggesting that those who developed CHD had a significantly higher mean cholesterol level. The nonparametric test in the “Mann-Whitney/Wilcoxon …” section has a p-value (p=0.0196) similar in value to the t-test p-value. If more than two means are compared, a p-value for this comparison will be presented based on the F-test for a one-way ANOVA. Note that the Mann-Whitney/Wilcoxon two-sample test is the nonparametric equivalent of the independent t-test and the Kruskal-Wallis test is the nonparametric equivalent of the one-way ANOVA. Epi Info does not perform multiple comparison tests.
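To make the link between the descriptive statistics and the test statistics in Figure 13 concrete, the following Python sketch recomputes the pooled variance, the equal-variance t statistic, the ANOVA F statistic, and Bartlett's chi-square from the group sizes, means, and variances printed in Figure 13. It is a sketch using the standard textbook formulas, which are assumed (not confirmed here) to be the ones Epi Info applies. Because the inputs are rounded to four decimals, results may differ from Figure 13 in the last digit. P-values are not computed, since they require the t and F distributions.

```python
import math

# Summary statistics from Figure 13 (MEANS CHL CHD): (n, mean, variance)
groups = {
    "CHD=Yes": (71, 221.9437, 1580.1111),
    "CHD=No": (538, 210.3922, 1574.3431),
}

(n1, mean1, var1), (n2, mean2, var2) = groups.values()
df = n1 + n2 - 2  # degrees of freedom for the pooled t-test

# Pooled variance; this corresponds to the "Within" mean square in the ANOVA table
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df

# Independent t-test assuming equal variances
t_stat = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# With exactly two groups, the one-way ANOVA F statistic equals t squared
f_stat = t_stat ** 2

# Bartlett's chi-square for homogeneity of variances (k = 2 groups)
k = 2
numerator = df * math.log(pooled_var) - (
    (n1 - 1) * math.log(var1) + (n2 - 1) * math.log(var2)
)
correction = 1 + (1 / (n1 - 1) + 1 / (n2 - 1) - 1 / df) / (3 * (k - 1))
bartlett = numerator / correction

print(f"pooled variance     = {pooled_var:.4f}")   # Figure 13: Within MS 1575.0083
print(f"t statistic         = {t_stat:.4f}")       # Figure 13: 2.3052
print(f"F statistic         = {f_stat:.4f}")       # Figure 13: 5.3139
print(f"Bartlett chi-square = {bartlett:.4f}")     # Figure 13: 0.0004
```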

Tables

The Tables command is used to compare two categorical variables, such as an exposure variable (exposed vs. unexposed) and an outcome variable (disease vs. no disease). To have the odds ratio (OR), risk ratio (RR), and risk difference (RD) calculated correctly, it is important that the table be set up as shown in Table 1.

Table 1. Table setup for Epi Info to correctly calculate the odds ratio, risk/prevalence ratio, and risk/prevalence difference.

               Disease   No Disease   Total
Exposed        a         b            a + b
Not Exposed    c         d            c + d
Total          a + c     b + d        n

The OR, RR, and RD are calculated as:

OR = (a x d) / (b x c)
RR = [a / (a + b)] / [c / (c + d)]
RD = [a / (a + b)] - [c / (c + d)]

The dialog box for the Tables command is shown in Figure 14. As an example using a dichotomous exposure and disease variable, using the viewEvansCounty data, select CAT as the Exposure variable and CHD as the Outcome variable; the output is presented in Figure 15.
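The three formulas above translate directly into code. The Python sketch below is illustrative only (the function name two_by_two_measures is hypothetical) and applies them to the CAT by CHD cell counts that appear in Figure 15: a = 27, b = 95, c = 44, d = 443.

```python
def two_by_two_measures(a, b, c, d):
    """Return (odds ratio, risk ratio, risk difference) for a 2x2 table.

    a = exposed with disease,     b = exposed without disease
    c = not exposed with disease, d = not exposed without disease
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    odds_ratio = (a * d) / (b * c)
    risk_ratio = risk_exposed / risk_unexposed
    risk_difference = risk_exposed - risk_unexposed
    return odds_ratio, risk_ratio, risk_difference

# CAT (exposure) by CHD (outcome) counts from Figure 15
odds_ratio, risk_ratio, risk_difference = two_by_two_measures(27, 95, 44, 443)

print(f"OR  = {odds_ratio:.4f}")             # Figure 15: 2.8615
print(f"RR  = {risk_ratio:.4f}")             # Figure 15: 2.4495
print(f"RD% = {100 * risk_difference:.4f}")  # Figure 15: 13.0962
```

If the rows or columns are swapped (for example, unexposed listed first), the same formulas return the inverse ratios and a risk difference of opposite sign, which is why the layout in Table 1 matters.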


Figure 13. Output from Means command to compare two means, viewEvansCounty data, Epi Info.

MEANS CHL CHD

             CHD
CHL          Yes      No       TOTAL
94           0        1        1
  Row %      0.0      100.0    100.0
  Col %      0.0      0.2      0.2
113          0        2        2
  Row %      0.0      100.0    100.0
  Col %      0.0      0.4      0.3
…            …        …        …
336          0        2        2
  Row %      0.0      100.0    100.0
  Col %      0.0      0.4      0.3
357          1        0        1
  Row %      100.0    0.0      100.0
  Col %      1.4      0.0      0.2
TOTAL        71       538      609
  Row %      11.7     88.3     100.0
  Col %      100.0    100.0    100.0

Note that the cholesterol levels between 113 and 336 are not presented to save space.

Descriptive Statistics for Each Value of Crosstab Variable

       Obs    Total          Mean        Variance     Std Dev
Yes    71     15758.0000     221.9437    1580.1111    39.7506
No     538    113191.0000    210.3922    1574.3431    39.6780

       Minimum     25%         Median      75%         Maximum     Mode
Yes    145.0000    195.0000    216.0000    242.0000    357.0000    228.0000
No     94.0000     182.0000    206.5000    232.0000    336.0000    211.0000

ANOVA, a Parametric Test for Inequality of Population Means
(For normally distributed data only)

Variation    SS             df     MS           F statistic
Between      8369.4658      1      8369.4658    5.3139
Within       956030.0219    607    1575.0083
Total        964399.4877    608

T Statistic = 2.3052
P-value = 0.0215

Bartlett's Test for Inequality of Population Variances
Bartlett's chi square = 0.0004    df = 1    P value = 0.9838

A small p-value (e.g., less than 0.05) suggests that the variances are not homogeneous and that the ANOVA may not be appropriate.

Mann-Whitney/Wilcoxon Two-Sample Test (Kruskal-Wallis test for two groups)
Kruskal-Wallis H (equivalent to Chi square) = 5.4504
Degrees of freedom = 1
P value = 0.0196


Figure 14. Dialog box for the Tables command, Epi Info.

Figure 15 presents a table that contains the number of observations in each cell, the row percent (“Row %”), and the column percent (“Col %”). A graphic to the right of the table depicts how the observations are distributed within the table: the larger the box, the larger the number of observations. Beneath the table is the Single Table Analysis, which provides parameter estimates with confidence intervals and statistical tests.

The first set of parameter estimates is based on the odds, with an odds ratio based on the cross product [i.e., (a x d) / (b x c)] and one based on the maximum likelihood estimation approach (MLE). Three different confidence intervals are provided: the Taylor series, the mid-P exact, and the Fisher exact. Which one should you use? Our preference is the mid-P exact method. Note that the odds ratio calculated is an unmatched odds ratio; if the study used a matched case-control design, the Match command should be used. Next are the risk-based estimates, the risk ratio and the risk difference, with their confidence intervals. Note that if the outcome variable is based on prevalent disease, then substitute the terms “prevalence ratio” and “prevalence difference” for “risk ratio” and “risk difference”, respectively.

Finally, a number of statistical test results are provided: three different chi square tests and two exact tests. The chi square tests are presented as two-sided p-values (although you could divide the two-sided p-value by 2 to calculate a one-sided p-value), and the exact tests are presented as one-sided p-values (you could multiply the one-sided p-values by 2 to get a two-sided p-value). Note that Epi Info refers to these as “1-tailed” and “2-tailed” p-values rather than “1-sided” and “2-sided” p-values.

In the example in Figure 15, the conclusion would be that there is a statistically significant association between exposure and disease (p<0.001), with individuals with high catecholamine levels (CAT=Yes) having a significantly higher risk of disease than those with “normal” catecholamine levels (CAT=No): 22.1% and 9.0%, respectively. For 2x2 tables, Epi Info will calculate the information as presented in Figure 15. If the table is larger than 2x2, only an overall chi square test is provided. The Analysis module in Epi Info does not calculate a test-for-trend (i.e., when there are more than two exposure levels), but one can be performed in OpenEpi.
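Of the interval methods named above, the Taylor series limits are simple enough to compute by hand. The Python sketch below uses the usual large-sample (Taylor series) standard errors for ln(OR), ln(RR), and RD with a multiplier of 1.96. These formulas are the standard textbook versions and are assumed, not confirmed, to match what Epi Info implements; the function name taylor_series_limits is hypothetical. The mid-P and Fisher exact limits require iterative calculations and are not shown.

```python
import math

Z = 1.96  # approximate standard normal quantile for a 95% confidence interval

def taylor_series_limits(a, b, c, d):
    """Return 95% Taylor series limits for the OR, RR, and RD of a 2x2 table."""
    p1 = a / (a + b)   # risk in the exposed
    p0 = c / (c + d)   # risk in the unexposed
    odds_ratio = (a * d) / (b * c)
    risk_ratio = p1 / p0
    risk_diff = p1 - p0

    # Large-sample standard errors
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    se_log_rr = math.sqrt((1 - p1) / a + (1 - p0) / c)
    se_rd = math.sqrt(p1 * (1 - p1) / (a + b) + p0 * (1 - p0) / (c + d))

    return {
        "OR": (odds_ratio * math.exp(-Z * se_log_or),
               odds_ratio * math.exp(Z * se_log_or)),
        "RR": (risk_ratio * math.exp(-Z * se_log_rr),
               risk_ratio * math.exp(Z * se_log_rr)),
        "RD": (risk_diff - Z * se_rd, risk_diff + Z * se_rd),
    }

# CAT by CHD counts from Figure 15
limits = taylor_series_limits(27, 95, 44, 443)
for name, (lower, upper) in limits.items():
    print(f"{name}: {lower:.4f} to {upper:.4f}")
```

The RD limits are printed as proportions; multiply by 100 to compare them with the RD% row of Figure 15.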


Figure 15. Example output from Tables command for single 2x2 table, viewEvansCounty data, Epi Info.

TABLES CAT CHD

             CHD
CAT          Yes      No       TOTAL
Yes          27       95       122
  Row %      22.1     77.9     100.0
  Col %      38.0     17.7     20.0
No           44       443      487
  Row %      9.0      91.0     100.0
  Col %      62.0     82.3     80.0
TOTAL        71       538      609
  Row %      11.7     88.3     100.0
  Col %      100.0    100.0    100.0

Single Table Analysis

                                Point       95% Confidence Interval
                                Estimate    Lower      Upper
PARAMETERS: Odds-based
  Odds Ratio (cross product)    2.8615      1.6878     4.8514 (T)
  Odds Ratio (MLE)              2.8554      1.6690     4.8350 (M)
                                            1.6148     4.9853 (F)
PARAMETERS: Risk-based
  Risk Ratio (RR)               2.4495      1.5837     3.7887 (T)
  Risk Difference (RD%)         13.0962     5.3021     20.8903 (T)
(T=Taylor series; C=Cornfield; M=Mid-P; F=Fisher Exact)

STATISTICAL TESTS                   Chi-square    1-tailed p      2-tailed p
  Chi square - uncorrected          16.2465                       0.0000567826
  Chi square - Mantel-Haenszel      16.2198                       0.0000575712
  Chi square - corrected (Yates)    14.9998                       0.0001086935
  Mid-p exact                                     0.0000911051
  Fisher exact                                    0.0001374257
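The three chi square statistics in Figure 15 can be reproduced from the four cell counts. The Python sketch below uses the standard formulas for the uncorrected, Mantel-Haenszel, and Yates-corrected chi square for a 2x2 table, which are assumed to be the ones Epi Info applies; the function names are hypothetical. The two-tailed p-value for 1 degree of freedom is obtained from the complementary error function. P-values computed this way should be close to, but need not exactly match, those printed in Figure 15, since the numerical method Epi Info uses for the chi square distribution is not stated in this document.

```python
import math

def chi_square_tests(a, b, c, d):
    """Return the uncorrected, Mantel-Haenszel, and Yates chi squares for a 2x2 table."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)  # product of the four margins
    diff = abs(a * d - b * c)
    return {
        "uncorrected": n * diff ** 2 / margins,
        "mantel_haenszel": (n - 1) * diff ** 2 / margins,
        "yates": n * max(diff - n / 2, 0) ** 2 / margins,
    }

def p_value_1df(chi_square):
    """Two-tailed p-value for a chi square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))

# CAT by CHD counts from Figure 15
tests = chi_square_tests(27, 95, 44, 443)
for name, value in tests.items():
    print(f"{name}: chi square = {value:.4f}, 2-tailed p = {p_value_1df(value):.10f}")
```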

To perform a stratified analysis, provide an exposure variable, an outcome variable, and one or more stratifying variables. For example, using the CAT and CHD variables from the previous example, stratify on cholesterol group (CHLG). The risk ratio for the CAT → CHD relationship is 12.1 in the high cholesterol group vs. 1.8 in the low cholesterol group (results not shown), indicating a much stronger exposure-disease relationship in the former group. The summary information is depicted in Figure 16.


Figure 16. Example output from Tables command for a stratified 2x2 table, summary information only, viewEvansCounty data, Epi Info.

SUMMARY INFORMATION

                              Point       95% Confidence Interval
Parameters                    Estimate    Lower, Upper
Odds Ratio Estimates
  Crude OR (cross product)    2.8615      1.6878, 4.8514 (T)
  Crude OR (MLE)              2.8554      1.6690, 4.8350 (M)
                                          1.6148, 4.9853 (F)
  Adjusted OR (MH)            2.8716      1.6994, 4.8524 (R)
  Adjusted OR (MLE)           3.0375      1.7551, 5.2156 (M)
                                          1.6962, 5.3868 (F)
Risk Ratios (RR)
  Crude Risk Ratio (RR)       2.4495      1.5837, 3.7887
  Adjusted RR (MH)            2.4648      1.6173, 3.7564
(T=Taylor series; R=RGB; M=Exact mid-P; F=Fisher exact)

STATISTICAL TESTS (overall assoc)    Chi-square    1-tailed p    2-tailed p
  MH Chi square - uncorrected        17.4807                     0.0000
  MH Chi square - corrected          16.1659                     0.0001
  Mid-p exact                                      0.0001
  Fisher exact                                     0.0001

In the following two tests, low p values suggest that ratios differ by stratum
                                                                 Chi-square    p
  Chi-square for differing Odds Ratios by stratum (interaction)  10.3638       0.0013
  Chi-square for differing Risk Ratios by stratum                16.3645       0.0001

Epi Info presents both the crude odds ratio (which combines the strata into a single 2x2 table) and two different adjusted odds ratios (which “adjust” or “control” for the stratifying variable), one based on the Mantel-Haenszel method (MH) and one based on the maximum likelihood estimation method (MLE). Crude and adjusted risk ratios are also presented. Note that in Figure 16 the test for interaction for the risk ratio has a p-value of 0.0001, indicating that there is statistically significant interaction; therefore, the stratum-specific measures should be presented when describing the CAT → CHD association rather than the crude or adjusted risk ratio estimates. An example of confounding, not presented here, can be seen by stratifying the CAT → CHD example by age group (AGEG1).

The general approach to stratified analyses is to first determine if a variable modifies an exposure-disease relationship (i.e., assess for interaction). The statistical test for effect modification or interaction is shown at the bottom of the output (see Figure 16). If it is determined that a stratifying variable does not modify the exposure-disease relationship, then the next question is whether the variable confounds the relationship. See the section on Logistic Regression for an example of how to assess confounding. If interest is in the risk difference, attributable fraction, or prevented fraction, these analyses can be performed using OpenEpi.

Match

The Match command is for use with matched case-control studies. This command can be used with R:1 matching, that is, each case is allowed to have one or more matched controls, and the number of controls can vary from case to case. The dialog box for the Match command is shown in Figure 17. To use the Match command, you need to specify an Exposure Variable, the Outcome Variable (i.e., case vs. control), and a Match Variable that links each case to its one or more controls.

An example matched case-control data set in the Sample.mdb file is called viewRely; please Read this data set. These data are from a matched case-control study of toxic shock syndrome in which each case had three controls matched on potential confounders, such as age (see Appendix 1 for more details on this file). The primary exposure was the use of Rely tampons. After Reading the data, view the data layout using the List command (see Figure 18). The first variable (i.e., first column) is ID, an identification number that links each case with her three controls; the second variable/column is CASE, which has the value Yes (a case) or No (a control). The third variable/column reflects use of Rely tampons (Yes or No). To run the MATCH command for the viewRely data, enter the following into the dialog box:

Exposure Variable: Rely
Outcome Variable: Case
Match Variables: ID

Figure 17. Dialog box for the Match command, Epi Info.

Figure 18. List of the viewRely data file, Epi Info.

Make sure that the box next to Matched Analysis in the dialog box has a check mark in it, and then press the OK button. The output is shown in Figure 19. In the output, one or more tables are presented to show the relationship between whether or not a case was exposed and the number of controls that were exposed. Next, odds ratio and risk ratio information is presented. In general, the only useful information from this part of the output is the adjusted odds ratios and their confidence intervals. In this example, the adjusted MH odds ratio is 7.7, indicating a strong association between toxic shock syndrome and use of Rely tampons. A number of statistical tests are provided at the bottom of the output; in this example, they indicate a statistically significant association.


Figure 19. Example output from the Match command, viewRely data, Epi Info.

RELY : CASE
Match variables: ID

Matched Analysis of Tables with Non-Zero Marginals
Matched Sets:12 Observations:48

Cases:1 Controls:3
                    Exposed Controls
Exposed Cases       3     2     1     0
      1             1     1     5     4
      0             0     1     1     1

SUMMARY INFORMATION

                                 Point       95% Confidence Interval
Parameters                       Estimate    Lower, Upper
Odds Ratio Estimates
  Crude OR (cross product)       13.0000     2.4125, 70.0530 (T)
  Crude OR (MLE)                 12.2196     2.4762, 94.3799 (M)
                                             2.0935, 133.9356 (F)
  Adjusted OR (MH)               7.6667      1.6061, 36.5973 (R)
  Adjusted OR (MLE)              8.3589      1.9281, 58.2541 (M)
                                             1.6672, 81.7998 (F)
Risk Ratios (RR)
  Crude Risk Ratio (RR)          7.0000      1.7166, 28.5455
  Adjusted RR (MH)               7.6667      1.6615, 35.3767

(T=Taylor series; R=RGB; M=Exact mid-P; F=Fisher exact)

STATISTICAL TESTS (overall assoc)  Chi-square  1-tailed p  2-tailed p
MH Chi square - uncorrected        9.5238                  0.0020
MH Chi square - corrected          7.7143                  0.0055
Mid-p exact                                    0.0014
Fisher exact                                   0.0026

A common question is “what would happen if the matching aspects of the study design were ignored in the analysis?” To ignore the matching of controls to each case, use the Tables command and provide Rely as the exposure variable and Case as the outcome variable. The odds ratio from the Tables command, ignoring the match, is 8.2 (results not shown), compared to a matched odds ratio of 7.7. In this particular example, ignoring the matching of cases and controls overestimates the odds ratio (i.e., a bias away from the null). [Note that the Match command is really a stratified tables approach with each case and their one or more controls forming each stratum.]
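
To make the "each matched set is a stratum" idea concrete, the sketch below recomputes both odds ratios from the matched-set table in Figure 19. It is illustrative Python rather than Epi Info code: the dictionary layout and function names are invented for this example, and the table is read as rows for case exposure (1 or 0) and columns for the number of exposed controls (3, 2, 1, 0).

```python
# Number of matched sets by (case exposed?, number of exposed controls),
# read from the table in Figure 19.
matched_sets = {
    (1, 3): 1, (1, 2): 1, (1, 1): 5, (1, 0): 4,
    (0, 3): 0, (0, 2): 1, (0, 1): 1, (0, 0): 1,
}
CONTROLS_PER_CASE = 3


def mantel_haenszel_or(sets, m):
    """MH odds ratio treating every matched set (1 case, m controls) as a stratum."""
    n = m + 1
    numerator = denominator = 0.0
    for (case_exposed, exposed_controls), count in sets.items():
        a, b = case_exposed, 1 - case_exposed             # exposed / unexposed case
        c, d = exposed_controls, m - exposed_controls     # exposed / unexposed controls
        numerator += count * a * d / n
        denominator += count * b * c / n
    return numerator / denominator


def unmatched_or(sets, m):
    """Cross-product odds ratio after pooling all sets into one 2x2 table."""
    exposed_cases = unexposed_cases = exposed_controls = unexposed_controls = 0
    for (case_exposed, k), count in sets.items():
        exposed_cases += count * case_exposed
        unexposed_cases += count * (1 - case_exposed)
        exposed_controls += count * k
        unexposed_controls += count * (m - k)
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


matched = mantel_haenszel_or(matched_sets, CONTROLS_PER_CASE)
pooled = unmatched_or(matched_sets, CONTROLS_PER_CASE)
print(f"Matched (MH) OR: {matched:.4f}")
print(f"Unmatched OR:    {pooled:.1f}")
```

Sets in which the case and all three controls share the same exposure status add zero to both sums of the Mantel-Haenszel estimate, so they do not influence the matched odds ratio; this appears consistent with the output reporting 12 matched sets with non-zero marginals although the table lists 14 sets.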

Summarize

The Summarize command creates a new table (i.e., a dataset) containing descriptive statistics from the current dataset (Figure 20 shows the dialog box). Available Aggregate functions are COUNT, MIN, MAX, SUM, FIRST, LAST, AVG, VARIANCE, and STANDARD DEVIATION. The basic principle is the same as that of the Output To Table option in the TABLES, FREQ, and MEANS commands, but Aggregate functions are more powerful. The Summarize command can create a table that contains results from more than one function (e.g., COUNT, MIN, MAX, ...) for a single variable of interest, or multiple functions for more than one variable.

As an example, using the viewEvansCounty data, to create a table that includes only the mean values and standard deviations of AGE and diastolic blood pressure (DBP) along with the total number of records, enter the following information into the Summarize dialog box:

Aggregate: Average (choose ‘average’ from the list of available functions, using the down-arrow key)
Variable: AGE (choose AGE from the list of available variables)
Into Variable: average_age (a name created by the user)

Then, click the Apply button. You can use the same principle for the standard deviation (SD) of AGE, and the mean and SD of DBP, as shown in the example (Figure 20). The Aggregate function Count was matched with the ‘AGE’ variable to get the overall number of records in the dataset. Then, name the table you want to create (in this example, summary_table); that table will be saved in the currently opened project (in this example, “C:\Epi_Info\Sample.mdb”). Now Read the file summary_table [note that you will need to click on All in the Show section of the Read dialog box (see Figure 4)]; the new table can then be viewed using the List command (Figure 21).

Figure 20. Dialog box for the Summarize command, viewEvansCounty data, Epi Info.

Figure 21. Example of the Summarize command using List, Epi Info.

average_age        average_dbp        number_records   std_age            std_dbp
53.7060755336617   91.1806239737274   609              9.25838769076145   14.4988731051949

Moreover, if we wish to create a table that summarizes the number of records and mean values of age and DBP stratifying on CAT (catecholamine) and coronary heart disease CHD (present/absent), we can follow the same procedure as above, and then place the variables CAT and CHD in the Group By box (Figure 22).


Figure 22. Dialog box for the Summarize command using the Group By option, viewEvansCounty data, Epi Info.

After clicking OK, Read the file stratified_table from the current project “C:\Epi_Info\Sample.mdb” and view the table using the List command, as described for the previous Summarize example (Figure 23).

Figure 23. Example of the Summarize command with the Group By option using List, Epi Info.

CAT   CHD   average_age        average_dbp        number_records
No    No    51.6433408577878   88.2121896162528   443
No    Yes   54.0909090909091   93.7272727272727   44
Yes   No    60.6736842105263   101.621052631579   95
Yes   Yes   62.4074074074074   99                 27

Please note that the Aggregate functions ‘FIRST’ and ‘LAST’ are based on the current sort order of the data set. For a numeric variable, the ‘Minimum’ and ‘Maximum’ Aggregate functions return the minimum and maximum values of that variable. For date variables, Minimum denotes the earliest date and Maximum the latest date.
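
As a consistency check between the two Summarize examples, the overall means in Figure 21 can be recovered from the Group By table in Figure 23 by weighting each group mean by its number of records. The following is a small Python sketch of that arithmetic; it is not Epi Info code, and the variable names are chosen for this illustration only.

```python
# Rows of the Group By table in Figure 23:
# (CAT, CHD, average_age, average_dbp, number_records)
strata = [
    ("No",  "No",  51.6433408577878, 88.2121896162528, 443),
    ("No",  "Yes", 54.0909090909091, 93.7272727272727, 44),
    ("Yes", "No",  60.6736842105263, 101.621052631579, 95),
    ("Yes", "Yes", 62.4074074074074, 99.0,             27),
]

# Total number of records across the four CAT x CHD groups.
total_records = sum(n for *_, n in strata)

# Record-weighted averages of the group means.
overall_age = sum(age * n for _, _, age, _, n in strata) / total_records
overall_dbp = sum(dbp * n for _, _, _, dbp, n in strata) / total_records

print(total_records)
print(round(overall_age, 4), round(overall_dbp, 4))
```

The results should match number_records, average_age, and average_dbp in Figure 21. Standard deviations cannot be combined this simply, because the spread between group means also contributes to the overall variance.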

Graph

The Graph module provides for a number of different types of graphs and only the most commonly used graphs and their options are presented in this document. The types of graphs include: Bar (vertical bars), Rotated Bar (horizontal bars), Histogram, Line, and Scatter XY. Figure 24 shows the dialog box for the Graph command.


Figure 24. Dialog box for the Graph command, Epi Info.

The Graph Type selected will affect the options displayed on the Graph dialog box. The types of graphs that can be made are, in alphabetical order:
• Area
• Bar
• Box-Whisker
• Hi-Low
• Histogram
• Line
• Moving Average
• Pareto
• Pie
• Points
• Polar
• Pyramid
• Rotated Bar
• Scatter 3D
• Scatter XY
• Spline
• Stacked Histogram
• Step

On the Graph dialog box the 1st Title | 2nd Title is optional and is printed at the top of the graph. Figure 25 shows an example of a Scatter XY graph. For a Scatter XY graph, the X-Axis presents data along the horizontal line and the Y-Axis along the vertical. The scatter plot in Figure 25 plots AGE on the X-Axis and CHL (cholesterol) on the Y-Axis (this and the following examples are based on the viewEvansCounty table). Graphs are presented full screen; to close the graph, click on the “x” in the upper right corner; the graph will then appear in the Output window. An example of a Bar graph is shown in Figure 26 with the variable AGE on the X-Axis and “Count %” on the Y-Axis.

To “customize” a graph, double click anywhere on the graph; the Customization... dialog box will appear on the screen as shown in Figure 27. (Note: other ways to view the Customization... dialog box are through the pull-down menu system (Edit→Launch Dialog Box) or by right clicking on the screen and then selecting the option Customize Dialog …). From this dialog box a number of options are available to modify or “customize” a graph, including:
• Specifying a Main and Sub Title
• Font types and size
• Colors
• Grid lines
• Shadows and 3D presentations
• Linear vs. Log plots
• Minimum and maximum values


Figure 25. Example of a Scatter XY graph from Analysis, viewEvansCounty data, Epi Info

Figure 26. Example of a Bar graph from Analysis, viewEvansCounty data, Epi Info

To change an axis label, left click on the label and a dialog box will appear which allows the user to specify the label text. A similar approach can be used to change data point labels. To alter features such as the font style, font size, numeric precision, right click on the item. The image files created by the graphing command are automatically included in the HTML output (discussed later). The Graph module of Epi Info can do much more than presented here. To become more familiar with the options, it is suggested that you experiment with the different types of graphs and options.


Figure 27. Customization dialog box in the Graph module, Epi Info.

Epi Info Exercise 1 – Use of Means, Tables, and Graph commands

Using the viewEvansCounty file, answer the following questions:

1. What is the mean hematocrit (HEM)?

2. Does hematocrit (HEM) appear to be normally distributed? (Note: use a graph to display the distribution of the values.)

3. Does the mean hematocrit (HEM) differ between younger individuals (<55 years of age, variable AGEG1=No) and older individuals (>55 years of age, variable AGEG1=Yes)?

4. What is the mean McGuire-White socioeconomic status score (SES)?

5. Does SES appear to be normally distributed? (Note: use a graph to display the distribution of the values.)

6. Does SES vary by the seven age group categories (variable AGEG2)?

7. What is the odds ratio and risk ratio when assessing the relationship between the cholesterol group variable (CHLG) and CHD? Is there a statistically significant association?

8. Assess whether the variables in the table below modify or confound the CAT-CHD relationship based on the odds ratio. “Modify” (also referred to as effect modification or interaction) is considered present for the odds ratio when the Chi-square for differing Odds Ratios by stratum (interaction) p-value is < 0.05. If there is no interaction, the assessment of confounding will be a 10% or greater difference between the crude and adjusted odds ratio:

$$\frac{OR_{crude} - OR_{adjusted}}{OR_{adjusted}} \times 100$$


Third Variable   Interaction p-value   Crude OR (1)   Adjusted OR (2)   Conclusion? (3)
ECG
MAR
SMK
AGEG1
QTIG
HPT

(1) Crude OR (cross-product)
(2) Adjusted OR (MH)
(3) Interaction, confounding, or neither
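
The decision rule described in question 8 can be written out as a short function. The sketch below is illustrative Python, not part of Epi Info; the function names are hypothetical, and taking the absolute value of the percent difference is an assumption about how "10% or greater difference" is meant. It is applied to the CHLG values already shown in Figure 16, not to the exercise variables.

```python
def percent_difference(crude_or, adjusted_or):
    """Percent difference between the crude and adjusted odds ratios."""
    return (crude_or - adjusted_or) / adjusted_or * 100


def classify(interaction_p, crude_or, adjusted_or, alpha=0.05, threshold=10.0):
    """Return 'interaction', 'confounding', or 'neither' using the exercise rule."""
    if interaction_p < alpha:
        return "interaction"
    # Assumption: a difference in either direction counts toward the threshold.
    if abs(percent_difference(crude_or, adjusted_or)) >= threshold:
        return "confounding"
    return "neither"


# Figure 16 (stratifying on CHLG): interaction p-value for the odds ratio,
# crude OR (cross product), adjusted OR (MH).
print(classify(0.0013, 2.8615, 2.8716))
print(round(percent_difference(2.8615, 2.8716), 2))
```

Because interaction is checked first, a small difference between the crude and adjusted odds ratios (as with CHLG) does not rule out an important third variable.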


IV. Navigating & Managing the Output & Program Editor Windows

In this section issues related to the use of the Output and Program Editor windows are described.

Output Window

As Epi Info’s Analyze Data module is used, the statistical output, graphs, and other information are presented in the Analysis Output window (see Figure 28). At this point, let’s review how the Analyze Data module saves output and the various options that are available for the Analysis Output window. If there is more than one screen of output, you can use the vertical scroll bar or the Page Up and Page Down keys to move up and down through the output.

Where is the output being written and can you review output from a previous day’s work? Analyze Data places output into .HTM files (Hypertext Markup Language); these files are numbered sequentially, starting with OUT1.HTM, OUT2.HTM, …, OUT10.HTM, OUT11.HTM, etc. By default these files are stored in the Epi_Info folder – more on this later. The location where the output is being stored and the name of the output file are presented in the upper left corner of the Analysis Output window (Figure 28). In this example, the output is on drive C:, in the Epi_Info folder, and the name of the output file is OUT227.HTM.

From the command list in the left Analysis window you can choose a name for the next output file with the RouteOut command; close an output file with the CloseOut command; or print an output file using the PrintOut command. Other options for working with output files are available through the buttons at the top of the Analysis Output window, as shown in Figure 28. Clicking on the Previous button will move to the top of the output from the previous command; clicking the Next button moves to the top of the output for the next command.

Figure 28. Output Window in Analysis, Epi Info.
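
When reviewing earlier output outside Epi Info, note that a plain alphabetical listing places OUT10.HTM before OUT2.HTM. The following Python sketch orders file names by their sequence number instead. It is an illustration only: the OUT prefix is the default described above, the function names are hypothetical, and the example list of file names is made up.

```python
import re


def output_sequence(filename, prefix="OUT"):
    """Return the sequence number in a name like OUT12.HTM, or None if it does not match."""
    match = re.fullmatch(
        rf"{re.escape(prefix)}(\d+)\.HTM", filename, flags=re.IGNORECASE
    )
    return int(match.group(1)) if match else None


def in_sequence_order(filenames, prefix="OUT"):
    """Keep only output files and sort them by sequence number, not alphabetically."""
    numbered = [(output_sequence(name, prefix), name) for name in filenames]
    return [name for seq, name in sorted(p for p in numbered if p[0] is not None)]


# Hypothetical folder listing.
names = ["OUT10.HTM", "OUT2.HTM", "OUT227.HTM", "OUT1.HTM", "notes.txt"]
print(in_sequence_order(names))
```

If the output file prefix has been changed through the Storing Output dialog box, pass that prefix instead of the default.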

The History button shows the history of the current output file, such as the commands used and the time the commands were run. Click on one of the commands previously submitted and you will be taken to the output for that specific command. You can open any output file by clicking on the Open button which will present a dialog box allowing you to select an output file. The output window can be maximized to fill the entire computer screen by clicking on the Maximize button (note that this button changes to Restore while in full screen mode to allow you to switch back to the usual setting).


Other aspects of Analysis Output can be controlled by commands listed in the Analysis command window on the left. For example, click on Storing Output and the dialog box shown in Figure 29 will be presented which allows users to specify certain aspects of the output files, such as output file prefix, sequence number, and folder where files are stored.

Figure 29. The Storing Output dialog box in Analysis, Epi Info.

Note the inconsistency between the command name (“Storing Output”) and the name on the dialog box (“Result Storage”).

Program Editor Window

The Analyze Data module can be operated in one of three ways:

1. By clicking on the commands in the Analysis command window on the left and completing dialog boxes (“dialog box” or “interactive” mode), the method most commonly presented in this document.

2. By typing commands directly into the Program Editor window and running the command (“text command interactive” mode).

3. By opening and running a file that contains Analysis commands (this file usually has a .PGM extension; “program” or “syntax” mode).

Note that each dialog box (such as the Read command and TABLES command) assists in writing commands; each dialog box creates commands that are written into the Program Editor and then Run or executed. For example, Figure 30 shows an example of the Program Editor. The first command line starting with READ was created by the Read dialog box. This command is followed by two LIST commands and then a FREQ command.

Figure 30. The Program Editor window in Analyze Data, Epi Info.

Most of the time it is easier to use the dialog boxes to create and run commands. However, with more complicated dialog boxes/command lines, sometimes it is easier to modify a command in the Program Editor. As a simple example, the last command in the window in Figure 30 is “FREQ AGE AGE2 CHD CHLG”. Say you want to add one more variable “SMK” to the list. Rather than using the dialog box, you could click on the command line after “FREQ AGE AGE2 CHD CHLG”, type “SMK”, and then click on the Run This Command button; this will Run the one line of code in which the cursor is placed.

For programs that need to be “executed” or “run” on a routine basis, such as a monthly report on the number of cases of diseases reported, the commands in the Program Editor can be saved in a program file and the entire file Run once a month, rather than using the dialog boxes each month (which could be very tedious and subject to error). Another reason for saving commands is to document the program code used to create new variables and perform specific analyses. The commands for dealing with programs are in the Program Editor menu system: File, Edit, View, Fonts, Run, and Help. Frequently used commands are also available as the buttons New, Open, Save, Print, Run, and Run This Command. Please see the Epi Info manual for more details on the use of the Program Editor window.


V. Data Manipulation Commands

At this point you have Read data, looked at the data as a “spreadsheet” using the List command, and performed some analyses using the Frequencies, Means, Tables, Summarize, and Match commands. Some basic elements of the Graph command were presented, and the use of the Output and Program Editor windows described. In this section we describe data manipulation in Analyze Data, such as how to sort data, select a subset of records, define new variables, assign values to new variables, and recode variables.

Sort/Cancel Sort

Usually the Sort command is used to reorder the records for the List command. When a data set is Listed, the records are presented in their order in the file, frequently the order in which data were entered. As an example of using the Sort command, say you want to order the data in the viewEvansCounty table from the youngest to the oldest. Click on the Sort command in the Analysis Command window and a dialog box will appear as shown in Figure 31. You can sort on one or more variables and can sort the records in Ascending order (from low to high) or Descending order (from high to low). To do this, double click on a variable in the left part of the dialog box titled Available Variables. This will place the variable in the right part of the dialog box under the area titled Sort Variables. A (++) after the variable name means that variable will be sorted in ascending order and a (--) means that variable will be sorted in descending order. If there is more than one variable in the Sort Variables listing, the sorting of the file will be based on the first variable listed; the second variable will be sorted within each level of the first variable, and so on.

Note that sorting of the file is not permanent; if you were to reRead the data, the records would be in their original order. If you would like to keep the file sorted, you could Write the data to a new table using the Write command. You can also “unsort” the file using the Cancel Sort command as shown in Figure 32.

Figure 31. Sort dialog box, Epi Info.
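
The way several Sort Variables interact (the first variable is the primary key, and later variables order records within each level of the earlier ones) can be sketched in Python as follows. This is an analogy rather than Epi Info's implementation; the records and the function name are hypothetical.

```python
# Hypothetical records; in Epi Info these would be rows of the current data set.
records = [
    {"ID": 1, "AGE": 56, "CHL": 270},
    {"ID": 2, "AGE": 40, "CHL": 190},
    {"ID": 3, "AGE": 56, "CHL": 310},
    {"ID": 4, "AGE": 40, "CHL": 220},
]


def sort_records(rows, sort_variables):
    """Sort on (name, ascending) pairs; the first pair is the primary sort variable."""
    result = list(rows)  # work on a copy so the original order is kept
    # Python's sort is stable, so applying the keys from last to first leaves
    # each later variable ordered within the levels of the earlier ones.
    for name, ascending in reversed(sort_variables):
        result.sort(key=lambda row: row[name], reverse=not ascending)
    return result


# AGE ascending (++), then CHL descending (--) within each age.
ordered = sort_records(records, [("AGE", True), ("CHL", False)])
print([row["ID"] for row in ordered])   # [4, 2, 3, 1]
print([row["ID"] for row in records])   # [1, 2, 3, 4]
```

Returning a sorted copy mirrors the point made above that sorting does not change the stored file: the original list keeps its order, much as rereading the data restores the original record order.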


Figure 32. Cancel Sort command dialog box, Epi Info.

Select/Cancel Select

Sometimes you may want to select a subset of records for analysis. For example, you might want to perform some analyses on males only, or those <50 years of age, or smokers. This can be done using the Select command. Click on the Select command in the Analysis command window and you will be presented with a dialog box as shown in Figure 33. Within the dialog box, beneath where it says Select Criteria, you would usually have statements such as:

SEX="M"
AGE<50
SMK= (+)
AGE<50 AND SMK= (+)
AGE<50 OR SMK= (+)

where SEX, AGE, and SMK are variables from the data file. The first example, SEX="M", shows that, for character fields, you must enclose the character(s) in double quotes. The second example would select a subset of individuals less than 50 years of age; because in this data AGE is a numeric variable, you would not use double quotes as in the first example, which selected on a character variable. The third example would select smokers [note that SMK is a Yes/No field; Yes and No are represented as (+) and (-), respectively, equivalent to the “Yes” and “No” buttons in Figure 33]. The fourth example would select individuals who were both less than 50 years of age and smokers. The fifth example would select those who are either less than 50 years of age or who smoke. All analyses performed after executing a Select command are limited to records that meet the selection criteria until the Select is cancelled or another file is Read.

Figure 33. Select command dialog box, Epi Info.

You can select variables from the Available Variables part of the dialog box in Figure 33. A number of arithmetic, comparison, and Boolean operators can be selected by clicking on the appropriate buttons (e.g., +, -, *, /, etc.). See Appendix 2 for more information on operators and functions. You can also click on the Functions button in the dialog box for help. To cancel a selection, use the Cancel Select command; a dialog box as shown in Figure 34 will be presented. Canceling a selection has the effect of selecting all records in a data file.


Figure 34. Cancel Select command dialog box, Epi Info.

Define/Undefine

A common task when analyzing data is to create or DEFINE new variables. These new variables may be a categorization of an existing variable (e.g., converting blood pressure information into hypertensive vs. not hypertensive), or a calculated field, such as calculating body mass index from weight and height. You can think of the Define command as creating a new column in a spreadsheet – there is a column heading (the variable name) and the column beneath it is blank. To place values in the column, you can use the Assign, Recode, or If commands described later. The dialog box for the Define command is shown in Figure 35. Just provide a name for the variable. There are some rules concerning variable names: no spaces in the name; do not use a variable name that already exists in the data file; and do not use a name that is the same as any of the commands, operators, or functions (e.g., do not attempt to use names like AND, OR, LIST, etc.).

Figure 35. Define command dialog box, Epi Info.

Unlike some programs, including Epi Info 6 (DOS version), you do not have to specify the type (e.g., numeric, character, or date) or length of the variable. However, as an option you can provide the type of variable (Date, Numeric, Text, or Yes-No) and a prompt. If you do not provide a variable type, the type and length of the variable will be determined by the program when you use subsequent commands to give values to the new variable. For the majority of situations, the Scope of the variable will be Standard; “power” Epi Info users may have a use for Global and Permanent variables. Using the Standard Scope variable, Defined variables are only temporary; if you reRead the file, the variable will not be part of the data set. If you want to make the newly Defined variable(s) a permanent part of the data file, you can use the Write command to create a file with the original data plus newly Defined variables. You can remove a Defined variable by using the Undefine command (see Figure 36).

Figure 36. Undefine command dialog box, Epi Info.


Assign

The Assign command is usually used after you have Defined a new variable. The dialog box for the Assign command is shown in Figure 37. (Note: for Epi 6 users, the Assign command is the same as the Let command; you can use Let rather than Assign in the Program Window of the Windows version of Epi Info.) For example, we might want to Define a new variable called agesq and Assign this new variable the value of age squared. The Assign Variable in the dialog box would be agesq, and the Expression would be:

age^2

The “^2” means to square the value of the variable; “^0.5” would be the square root of the value of the variable. See Appendix 2 for more information on operators and functions. As another example, we might want to Define a new variable chl_ln to be the natural log of the cholesterol value. The Expression would be:

LN(chl)

Again, see Appendix 2 for a listing of functions and operators or click on the Functions button in the dialog box.

Figure 37. Assign command dialog box, Epi Info.
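
The same calculations can be checked outside Epi Info. The sketch below is plain Python with two hypothetical records; Python writes exponentiation as `**` rather than `^`, and `math.log` returns the natural logarithm. The variable name age_root is invented for this illustration.

```python
import math

# Hypothetical records with age in years and cholesterol values.
records = [
    {"AGE": 40, "CHL": 200},
    {"AGE": 64, "CHL": 250},
]

for r in records:
    r["agesq"] = r["AGE"] ** 2         # Epi Info expression: age^2
    r["age_root"] = r["AGE"] ** 0.5    # Epi Info expression: age^0.5
    r["chl_ln"] = math.log(r["CHL"])   # Epi Info expression: LN(chl)

for r in records:
    print(r["AGE"], r["agesq"], round(r["age_root"], 3), round(r["chl_ln"], 4))
```

As with Define followed by Assign, the new values are computed for every record from that record's own values.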

Recode

Like Assign, Recode usually follows a Define command. The Recode command is frequently used to categorize an existing variable, such as recoding age in years into age groups. The dialog box for the Recode command is shown in Figure 38. For those who are familiar with the use of If statements, the Recode command can be thought of as a shorthand version of If statements.

As an example, let’s recode age into age groups using the viewEvansCounty file. First, Define a variable called agegroup. Next, click on the Recode command in the Analysis Command Window. The dialog box as shown in Figure 38 will be displayed. In the From box, select the variable Age; in the To box, select the variable agegroup. The three boxes below the From/To section are, from left to right, the lowest category value, the highest category value, and the new category name. One way to complete these boxes automatically is to click on the Fill Ranges button near the bottom left of the dialog box; clicking on this button will present another dialog box shown in Figure 39.

In Figure 39, the From variable is age and the To variable is the recently defined variable called agegroup. There are three boxes below with the words Start, End, and By. For the age to agegroup recode, Start would be the youngest age you want to categorize, End is the oldest age, and By is the interval, which in this example could be 5 or 10 or any year interval you desire. In the viewEvansCounty data, the youngest person was 40 and the oldest 76, so the Start could be 39, the End 80, and the By 10. Pressing the OK button will result in the completed dialog box as shown in Figure 40.

Why was 39 entered as the starting number? The Start value is treated as an exclusive lower limit (>39), so the first age group is 40-49, the second 50-59, etc. If you were to enter 40, the first age group would be 41-50, the second 51-60, etc.
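
The effect of the Start value can be illustrated with a few lines of Python. This is a sketch of the behavior described above (Start treated as "greater than"), not Epi Info's own code; the function name is hypothetical, ages are assumed to be whole numbers, and the End value is left out because the text does not describe how the final range is closed.

```python
def age_group(age, start, by):
    """Label for a whole-number age, treating `start` as an exclusive lower limit."""
    if age <= start:
        return None  # at or below Start: outside the generated ranges
    index = (age - start - 1) // by
    low = start + 1 + index * by
    return f"{low}-{low + by - 1}"


# Start = 39, By = 10 gives the intended decades.
print(age_group(40, start=39, by=10))   # 40-49
print(age_group(76, start=39, by=10))   # 70-79

# Start = 40 shifts every group by one year, and age 40 itself falls outside.
print(age_group(41, start=40, by=10))   # 41-50
print(age_group(40, start=40, by=10))   # None
```

This is why listing the From and To variables after a Recode is worthwhile: an off-by-one Start value silently moves every boundary.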


Figure 38. Recode command dialog box, Epi Info.

Figure 39. Recode command, Fill Ranges button dialog box, Epi Info.

You can double click in the boxes in Figure 40 to change values in the categories or the Recoded Value (i.e., the label for the category). Some notes on the use of the Recode command:
• Text must be enclosed in quotation marks.
• Numeric ranges are separated by a space, hyphen, and space, as in 1 - 5.
• Negative values are permitted, as in -9 - -8.
• The words LOVALUE and HIVALUE may be used to indicate the smallest and largest values for the variable (see Figure 40).
• The word ELSE may be used to indicate all values not falling in the preceding ranges.
• Recodes take place in the order stated; if two ranges overlap, the first in order will apply.
• In general, you cannot have more than 12 levels, although sometimes the command will work with more than 12 levels, even after receiving a warning message.

Whenever using Recode, it is recommended that you List the From and To variables to make sure the recoding worked as expected.
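
The ordering rules in these notes (ranges are checked in the order stated, the first match applies, and ELSE catches everything left over) can be sketched in Python. The function and range list below are hypothetical, and treating both ends of a typed range such as 1 - 5 as inclusive is an assumption made for this illustration.

```python
import math

# Stand-ins for the LOVALUE and HIVALUE keywords.
LOVALUE, HIVALUE = -math.inf, math.inf


def recode(value, ranges, else_label=None):
    """Return the label of the first range containing value, or else_label (ELSE)."""
    for low, high, label in ranges:
        if low <= value <= high:   # assumption: both limits are inclusive
            return label
    return else_label


# Hypothetical ranges; the second deliberately overlaps the first.
ranges = [
    (LOVALUE, 49, "<50"),
    (45, 59, "45-59"),
]

print(recode(47, ranges, else_label="other"))   # <50 (the first matching range applies)
print(recode(55, ranges, else_label="other"))   # 45-59
print(recode(70, ranges, else_label="other"))   # other (the ELSE case)
```

An age of 47 fits both ranges but receives the first label, which is the behavior to keep in mind when ranges are edited by hand.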


Figure 40. Recode command dialog box after completing the Fill Ranges dialog box, using the AGE and agegroup example, viewEvansCounty data, Epi Info.

If

As mentioned in the previous section, the Recode command can be used to categorize variables. Another way is through the use of the If command, which can do everything the Recode command can but is more powerful. On the other hand, it tends to require more time and effort and is more prone to coding errors by the user. The dialog box for the If command is shown in Figure 41.

Figure 41. IF command dialog box, Analyze Data, Epi Info.


As an example of the use of the If command, we will Define a variable called agegroup2 and then use the If dialog box to give values to our new variable agegroup2. (Note: the variable name agegroup2 is used so that it differs from the one used previously in the Recode command example.) For the youngest age group, the dialog box is as shown in Figure 42. You would need to complete one dialog box for each age group category, which gets tedious with three or more groups.

Figure 42. Example of the IF command dialog box, Analyze Data, Epi Info.

Another way to use the If command is in the Program Editor. You could enter an If/Then statement using the dialog box approach, and then use the Program Editor to make subsequent commands by copying, pasting, and editing the command.

    IF AGE<40 THEN
        agegroup2="<=39"
    END
    IF AGE>=40 AND AGE<50 THEN
        agegroup2="40-49.9"
    END

A Note to Epi Info DOS users: Note the change in the If/Then/Else command structure in the Windows version of Epi Info. In the DOS version of Epi Info, in the Analysis module, If/Then/Else commands were on a single line. In the Windows version of Epi Info, in the Program Editor window, the command is three or more lines, with the last line having the command END.

The If command can perform more complex recoding and mathematical calculations than the Recode command. For example, when using hemoglobin to define anemia status, adult females have a different cutoff value than males. An example of the code is below, where HB is the variable name for hemoglobin value:

    DEFINE anemic
    ASSIGN anemic= (-)
    IF HB<12 AND SEX="F" THEN
        anemic= (+)
    END
    IF HB<13 AND SEX="M" THEN
        anemic= (+)
    END
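For readers who want to check the logic of the anemia example outside Epi Info, the same rule can be sketched in Python. This is only an illustration: the function name is_anemic is hypothetical, True and False stand in for the Epi Info values (+) and (-), and missing hemoglobin or sex values are not handled.

```python
def is_anemic(hb, sex):
    """Mirror the example: start from not anemic, then apply the sex-specific cutoffs."""
    anemic = False              # corresponds to ASSIGN anemic = (-)
    if hb < 12 and sex == "F":  # cutoff for adult females in the example
        anemic = True
    if hb < 13 and sex == "M":  # cutoff for adult males in the example
        anemic = True
    return anemic


for hb, sex in [(11.5, "F"), (12.5, "F"), (12.5, "M"), (13.0, "M")]:
    print(hb, sex, is_anemic(hb, sex))
```

A hemoglobin value of 12.5 is classified differently for the two sexes, which is the kind of conditional rule that a single Recode of HB cannot express.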