PREPARING DATA FOR STATISTICAL ANALYSIS

Data Cleaning, Dataset Preparation, Documentation

9 September 2008

Beverly Musick

Indiana University

Raw Data Cleaning

For data that are stored in Access, Excel, or text files, data cleaning should begin with the original table, spreadsheet or file.

Back up the original data files.

Eliminate blank records and any records used for testing.

Locate duplicate records and resolve.

For numeric variables, identify outliers by sorting and reviewing the overall minimum and maximum. This is particularly useful for continuous variables such as dates, ages, weights, etc.

For categorical variables such as gender or travel time to clinic, sorting will reveal invalid response codes or use of mixed case (f, F, m, M for gender).

Can also assess the amount of missing data when records are sorted. Does it make sense that x records have no value for variable y?

Raw Data to SAS Datasets

Create a SAS program that converts the database file(s) to permanent SAS dataset(s).

For Access or Excel files can use ‘Proc Import’:

  PROC IMPORT OUT=WORK.demog
       DATATABLE="tblDEMOG"
       DBMS=ACCESS REPLACE;
    DATABASE="I:\Projects\Kenya\CFAR\cfar.mdb";
    DBPWD='password';
  RUN;

For text files can write specific input statement:

  data copd ;
    infile 'c:\kenya\hiv\copd.txt' ;
    /* read the 9-character patient id starting in column 1 of each record */
    input @1 patientid $9. ;
  run ;
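The steps above write to the temporary WORK library, so the result disappears when the session ends. A minimal sketch of storing a permanent dataset instead is shown here. The libref proj is hypothetical, and it is pointed at the WORK folder only so that the step runs on any machine; in practice it would point at the project's dataset folder. Inline records stand in for the text file.

```sas
/* proj is a hypothetical libref; replace the path with the project's dataset folder */
libname proj "%sysfunc(pathname(work))" ;

/* Inline records stand in for the raw text file so the step is self-contained */
data proj.copd ;
  input @1 patientid $9. ;
  datalines ;
1271BS-1
4709KT-6
;
run ;

/* Confirm that the dataset was written and review its variables */
proc contents data=proj.copd ;
run ;
```

The two-level name (libref.dataset) is what makes the dataset permanent; a one-level name such as copd is always written to WORK.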

Raw Data to SAS Datasets (cont.)

Merge or append (concatenate) tables as necessary.

Double-check the merging process by looking at the number of observations in each dataset before and after the merge.

  831  data visit ; set h.hivvisit2(keep=patientid apptdate age weight height bmi cd4) ;
  832  if patientid in ('1271BS-1','26277-4','3280CH-4','4709KT-6','625NT-5') ;
  833  run ;
  NOTE: There were 933654 observations read from the data set H.HIVVISIT2.
  NOTE: The data set WORK.VISIT has 71 observations and 7 variables.

  843  data vis2 ; set h.hivvisit2(keep=patientid apptdate clinic hgb sao2) ;
  844  if patientid in ('13836MT-4','4709KT-6','625NT-5') ;
  845  run ;
  NOTE: There were 933654 observations read from the data set H.HIVVISIT2.
  NOTE: The data set WORK.VIS2 has 46 observations and 5 variables.

  846  data bothvis ; merge visit vis2 ;
  847  by patientid apptdate ;
  848  run ;
  NOTE: There were 71 observations read from the data set WORK.VISIT.
  NOTE: There were 46 observations read from the data set WORK.VIS2.
  NOTE: The data set WORK.BOTHVIS has 83 observations and 10 variables.

The number of records is dependent on the overlap among the datasets. This relationship should be known in advance and the expected outcome confirmed.
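One way to confirm the overlap is to tag each merged record with its source using the IN= dataset option. The sketch below uses small made-up datasets named after those in the log; the variable source and the flag names in_visit and in_vis2 are illustrative, not part of the original program.

```sas
/* Small stand-in datasets; the real ones would come from h.hivvisit2 */
data visit ;
  input patientid :$9. apptdate :date9. weight ;
  format apptdate date9. ;
  datalines ;
4709KT-6 01JAN2008 60
4709KT-6 01FEB2008 61
625NT-5 15JAN2008 72
;
run ;

data vis2 ;
  input patientid :$9. apptdate :date9. hgb ;
  format apptdate date9. ;
  datalines ;
4709KT-6 01FEB2008 11.2
13836MT-4 20JAN2008 12.5
;
run ;

/* MERGE with BY requires both inputs sorted by the key fields */
proc sort data=visit ; by patientid apptdate ; run ;
proc sort data=vis2 ; by patientid apptdate ; run ;

data bothvis ;
  merge visit(in=in_visit) vis2(in=in_vis2) ;
  by patientid apptdate ;
  length source $10 ;
  /* record which input dataset(s) contributed to each merged record */
  if in_visit and in_vis2 then source = 'both' ;
  else if in_visit then source = 'visit only' ;
  else source = 'vis2 only' ;
run ;

/* The counts per source should match the overlap expected in advance */
proc freq data=bothvis ;
  tables source ;
run ;
```

When no BY values repeat within a dataset, the merged count equals the sum of the input counts minus the number of matched records. In this sketch that is 3 + 2 - 1 = 4.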

Confirm that the total number of variables in the merged dataset is correct.

The number should be the sum of all variables minus (number of key fields * (number of datasets in merge minus 1)).

In the previous example: 7 + 5 - 2*(2-1) = 10

If the number of variables is less than this, then a non-key variable with the same name appears in more than one of the datasets. This should be strictly avoided, because the merged dataset keeps only one copy of that variable.
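Shared non-key variables can be listed before merging by querying the DICTIONARY.COLUMNS metadata table. This is a sketch under assumptions: the two one-record datasets are made up, and weight is deliberately placed in both to show what the query reports.

```sas
/* Made-up datasets; weight is a non-key variable present in both */
data visit ;
  patientid = '625NT-5' ; apptdate = '01JAN2008'd ; weight = 72 ; cd4 = 350 ;
run ;

data vis2 ;
  patientid = '625NT-5' ; apptdate = '01JAN2008'd ; weight = 72 ; hgb = 11.2 ;
run ;

title 'Non-key variables present in both datasets' ;
proc sql ;
  /* libname and memname are stored in upper case in DICTIONARY.COLUMNS */
  select a.name
    from dictionary.columns as a,
         dictionary.columns as b
    where a.libname = 'WORK' and a.memname = 'VISIT'
      and b.libname = 'WORK' and b.memname = 'VIS2'
      and upcase(a.name) = upcase(b.name)
      and upcase(a.name) not in ('PATIENTID', 'APPTDATE') ;
quit ;
title ;
```

Here the formula gives 4 + 4 - 2*(2-1) = 6 expected variables, but a merge of these two datasets would have only 5 because weight is shared.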


Investigate messages such as:

"NOTE: MERGE statement has more than one data set with repeats of BY values."

“Variable _____ is uninitialized”

“Variable _____ has never been referenced”

“Character values have been converted to numeric…”

“Variable _____ has been defined as both character and numeric”

“Warning: Multiple lengths were specified for the BY variable _____ by input data sets. This may cause unexpected results.”


SAS Dataset Creation

To create permanent datasets for analysis:

Recode missing values used in the raw data tables/files to appropriate SAS missing values. For example, if 9's were used to indicate missing data for numeric fields in a data table then these should be converted to .'s.

Calculate appropriate summary scores (e.g., AUDIT-3, BMI).

Calculate differences between dates such as time from enrollment to ART initiation.

Label all calculated and created variables.

Attach formats to the variable values where necessary.
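The steps above can be sketched in a single data step. All dataset and variable names here (raw_demog, smoker, enrolldate, artdate, sexf) are hypothetical, as are the codes 1/2 for sex and 9 for missing; height is assumed to be recorded in centimeters and weight in kilograms.

```sas
/* Hypothetical value labels for a coded variable */
proc format ;
  value sexf 1 = 'Male'
             2 = 'Female' ;
run ;

/* Made-up raw records; 9 is the raw missing-value code for smoker */
data raw_demog ;
  input patientid $ sex smoker weight height enrolldate :date9. artdate :date9. ;
  datalines ;
A001 1 0 70 175 01JAN2008 15FEB2008
A002 2 9 55 160 10MAR2008 .
;
run ;

data analysis ;
  set raw_demog ;

  /* Recode the raw missing-value code to a SAS missing value */
  if smoker = 9 then smoker = . ;

  /* Summary score: BMI = weight (kg) / height (m) squared */
  if weight > 0 and height > 0 then bmi = weight / (height / 100)**2 ;

  /* SAS dates are day counts, so subtraction gives elapsed days */
  days_to_art = artdate - enrolldate ;

  /* Label every calculated variable */
  label bmi         = 'Body mass index (kg/m^2)'
        days_to_art = 'Days from enrollment to ART initiation' ;

  /* Attach formats to coded values and dates */
  format sex sexf. enrolldate artdate date9. ;
run ;

proc print data=analysis label ;
run ;
```

For the second record artdate is missing, so days_to_art is missing as well and SAS writes a note about missing values to the log.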

Cleaning Data in SAS

Create a cleanup program.

Generate frequencies, means, and univariates to better understand the dataset and to check for invalid data.

Plot the data.

For the numeric and date fields look at minimums and maximums to verify all values are within expected range.

Locate duplicate records and resolve.

Compare fields when appropriate (e.g., dob and age; confirm date of initial visit < date of follow-up).
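A cleanup program might begin along these lines. The records are made up and seeded with typical problems: a duplicated visit, mixed-case gender, an implausible age, and a missing weight.

```sas
/* Made-up visit records containing deliberate errors */
data visit ;
  input patientid :$9. apptdate :date9. age weight gender :$1. ;
  format apptdate date9. ;
  datalines ;
4709KT-6 01JAN2008 34 60 F
4709KT-6 01JAN2008 34 60 F
625NT-5 15JAN2008 29 72 m
1271BS-1 20JAN2008 290 . M
;
run ;

/* Frequencies expose invalid codes, mixed case, and missing values */
proc freq data=visit ;
  tables gender / missing ;
run ;

/* Minimum and maximum flag out-of-range values; NMISS counts missing data */
proc means data=visit n nmiss min max ;
  var age weight ;
run ;

/* Sort into a new dataset, then keep every record whose key is repeated */
proc sort data=visit out=visit_sorted ;
  by patientid apptdate ;
run ;

data dups ;
  set visit_sorted ;
  by patientid apptdate ;
  if not (first.apptdate and last.apptdate) ;
run ;

proc print data=dups ;
run ;
```

A record is unique on its key when it is both the first and the last in its BY group, so the subsetting IF keeps only the repeated ones.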

Cleaning Data in SAS (cont.)

Identify important fields such as summary scores and verify their values.

Merge all longitudinal datasets to identify date inconsistencies, variable format inconsistencies, and to locate missing questionnaires.

Merge cross-sectional (demographics) dataset with longitudinal datasets to identify subjects in one but not the other.
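Subjects that appear in only one of the two datasets can also be listed with the SQL EXCEPT operator. This is an alternative to a merge, shown here with made-up datasets named demog and visits.

```sas
/* Made-up cross-sectional and longitudinal datasets */
data demog ;
  input patientid :$9. gender :$1. ;
  datalines ;
4709KT-6 F
625NT-5 M
1271BS-1 M
;
run ;

data visits ;
  input patientid :$9. apptdate :date9. ;
  format apptdate date9. ;
  datalines ;
4709KT-6 01JAN2008
4709KT-6 01FEB2008
13836MT-4 20JAN2008
;
run ;

proc sql ;
  /* EXCEPT returns the distinct ids found in the first query only */
  title 'In demographics but with no visits' ;
  select patientid from demog
  except
  select patientid from visits ;

  title 'In visits but missing from demographics' ;
  select patientid from visits
  except
  select patientid from demog ;
quit ;
title ;
```

Each mismatch should be traced to its source: a subject who enrolled but has no visits yet may be legitimate, while a visit with no demographics record usually points to a data entry error in the id.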

SAS Program Files

Save all logs and outputs from SAS programs, especially when creating analysis datasets for publication.

Naming conventions: studyx.sas, studyx.log, studyx.lst

Only the program that generates the permanent dataset should overwrite it.

Never overwrite a permanent dataset (even with a proc sort) from any other program.
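PROC SORT replaces its input dataset unless OUT= is given, which is why even a sort can overwrite a permanent dataset. A minimal sketch follows, with sashelp.class standing in for a permanent analysis dataset and class_sorted as a hypothetical working copy.

```sas
/* Without OUT=, PROC SORT would try to replace the dataset named in DATA= */
proc sort data=sashelp.class out=work.class_sorted ;
  by age ;
run ;

/* Later steps read the sorted WORK copy; the permanent dataset is untouched */
proc print data=work.class_sorted ;
run ;
```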

Documentation

Internally document SAS programs. At minimum include file name, location, purpose, author, date, and revisions.

May be helpful to include the names of any permanent SAS datasets created within the program.

All SAS printouts should have at least one title, which includes the project name. (“title” statement)

It’s helpful to use the footnote statement to display the path and file name of the SAS program on the listing. [EX: footnote 'I:\alz\clin\cperm.sas' ; ]
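A program header with title and footnote statements might look like the sketch below. The file name and path are taken from the example above. The header entries in parentheses are placeholders, the project name is hypothetical, and sashelp.class stands in for a project dataset.

```sas
/*---------------------------------------------------------------
  File name: cperm.sas
  Location:  I:\alz\clin
  Purpose:   (what the program does)
  Author:    (name)
  Date:      (date written)
  Revisions: (date, initials, what changed)
  Creates:   (names of any permanent SAS datasets created)
---------------------------------------------------------------*/

/* Title carries the project name; footnote carries the program path */
title 'Project name' ;
footnote 'I:\alz\clin\cperm.sas' ;

proc print data=sashelp.class(obs=5) ;
run ;

/* Clear the title and footnote so they do not carry over to later output */
title ;
footnote ;
```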

Documentation (cont.)

If any variable values have been formatted, include a copy of the “proc format” section in the documentation.

Generate form keys.

Provide a description of any variables included in the datasets that are not found on the form keys.

Documentation (cont.)

Detailed algorithms of how summary scores are calculated should include the following:

a. which variables are used to calculate which summary scores

b. which variables (if any) are recoded and how

c. what is the minimum number of non-missing items needed to calculate the score

d. how are missing values addressed. Typically when calculating a total or sum score the mean should be imputed for missing data. If the summary score is a mean itself then the missing data can be ignored. In both of these cases it is essential that c. above is followed and that summary scores are coded as missing if there is insufficient data to calculate.

e. what is the meaning of the score and how is it scaled. Indicate the possible range and how a high score differs from a low score. For example include something like “Higher score indicates more depression”.
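Points c. and d. can be sketched as follows for a made-up five-item scale. The item names and the rule of at least 4 non-missing items are assumptions chosen for illustration; a real scale's scoring instructions should set the threshold.

```sas
data scores ;
  input id item1-item5 ;

  /* c. count the non-missing items for this subject */
  n_items = n(of item1-item5) ;

  /* Score only when enough items were answered (assumed rule: 4 of 5) */
  if n_items >= 4 then do ;
    /* MEAN ignores missing items */
    mean_score = mean(of item1-item5) ;
    /* d. a mean-imputed total equals the item mean times the number of items */
    total_score = mean_score * 5 ;
  end ;
  /* Otherwise both scores stay missing */

  label mean_score  = 'Mean item score (missing if fewer than 4 items)'
        total_score = 'Total score, mean-imputed (missing if fewer than 4 items)' ;
  datalines ;
1 1 2 3 4 5
2 1 2 . 4 5
3 1 . . 4 5
;
run ;

proc print data=scores ;
run ;
```

Subject 2 is missing one item and still receives a score; subject 3 is missing two and is coded as missing. Using the SUM function alone would silently treat the missing items as zero and understate the total.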

SAS General Notes

If the study is longitudinal, at least two datasets are needed: one containing the demographics and other information which does not change over time; and one containing the data for multiple time points.

Never put cross-sectional variables such as gender in the longitudinal dataset.

Format all date fields with 4-digit year (ddmmyy10. or date9.).

Choose data type numeric whenever possible.
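The two date formats named above can be compared with a short step that writes to the log. The variable name apptdate is only illustrative.

```sas
data _null_ ;
  apptdate = '09SEP2008'd ;   /* SAS date constant */
  put apptdate= ddmmyy10. ;   /* day/month/4-digit year: 09/09/2008 */
  put apptdate= date9. ;      /* day, month abbreviation, 4-digit year: 09SEP2008 */
run ;
```

Of the two, date9. spells out the month, so it cannot be confused with a month/day/year layout.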

Distributing SAS Datasets

After a senior data manager has reviewed the datasets and documentation, the statistician should be given READ ONLY access to:

The form keys

All appropriate SAS datasets (should have the extension .sas7bdat)

A description of any variables included in the datasets that are not found on the form keys

Notes on calculation of the summary scores

Proc format statements

Any other documents or notes which would further explain the data.

Distributing SAS Datasets (cont.)

Statisticians should not be given nor have access to:

Any Protected Health Information (PHI) such as study subject’s name, address, phone numbers, social security number, hospital id number. Date of birth should only be included if absolutely necessary. But usually age can be calculated and given instead.

Your SAS generation programs. These often contain PHI. If you must share SAS programs with the statisticians, please carefully review the programs and then copy to a separate folder to which they have read access rather than giving access to your main folder.
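A distribution dataset that supplies age in place of date of birth could be built as sketched below. The dataset and variable names are hypothetical and the records are made up. The 'AGE' basis of the YRDIF function is assumed to be available (SAS 9.3 or later).

```sas
/* Made-up source records containing PHI (name, date of birth) */
data full ;
  input patientid :$9. name :$20. dob :date9. enrolldate :date9. cd4 ;
  format dob enrolldate date9. ;
  datalines ;
A001 Subject_One 15MAR1970 01JAN2008 350
A002 Subject_Two 02FEB1985 10MAR2008 420
;
run ;

data dist ;
  set full ;
  /* Age in completed years at enrollment */
  age = floor(yrdif(dob, enrolldate, 'AGE')) ;
  label age = 'Age at enrollment (years)' ;
  /* Remove PHI before the dataset is distributed */
  drop name dob ;
run ;

/* Verify that no PHI variables remain */
proc contents data=dist ;
run ;
```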

Distributing SAS Datasets (cont.)

For your own records at minimum, you should have:

A copy of everything you give to the statistician and the date given.

A copy of the log of all the SAS programs, especially those that create any permanent SAS datasets which were passed along to others.

Grant protocols, meeting notes, scoring algorithms, instructions for data entry, corrections made, etc.

It may be helpful to maintain a subdirectory that exactly mirrors the subdirectory of the pc where the data is actually being entered. This subdirectory would include all the RDMS programs, format files, and tables.

For longitudinal studies in particular, it is important to archive datasets and SAS programs/logs which were used for analysis for abstracts, papers, grant proposals, and other publications.

Organizing Project Folders

Example of folder structure:

– I:\projects\studyname – contains raw data, documentation, SAS programs, etc.

– I:\projects\studyname\Datasets – stores datasets that have been approved for distribution. May also include the SAS formats in this folder. Statisticians should have READ ONLY access to this folder.

– I:\projects\studyname\Keys – stores the form keys, the scoring algorithms and other data documentation. Statisticians should have READ ONLY access to this folder.

– I:\projects\studyname\Grant – stores the original grant application, protocols, papers, etc. All data management staff and statisticians involved in this project should have full access to this folder.

DM Working with Biostatisticians

Attend study meetings

Date all documents and meeting notes

Comment on proposed study changes

Understand the statistical analysis plan

Review statistical reports (preferably before presented to research team)

Review and critique abstracts/manuscripts
