Data Handling I: Data Preparation and Data Cleaning Dr Yanzhong Wang Lecturer in Medical Statistics Division of Health and Social Care Research King's College London Email: [email protected] Drug Development Statistics & Data Management

Data Handling I: Data Preparation and Data Cleaning

Dr Yanzhong Wang
Lecturer in Medical Statistics
Division of Health and Social Care Research
King's College London
Email: [email protected]

Drug Development Statistics & Data Management


Session objectives

• Part 1:
– How to create a simple data file ready for analysis
– How to create data files for large scale studies
– Case study on data checking/cleaning
• Part 2:
– Advantages and disadvantages of various summary statistics
– Select appropriate summary statistics for categorical, binary, ordinal and continuous data.

Reading: Statistics at Square One, Chapter 2.


Outline of computerisation of data

• Plan – at protocol stage of study.
• Data entry.
• Data checking and editing of individual files.
• Merging and appending files if necessary.
• Cross-checking of merged files.
• Data analysis.


Planning stage

• Design of data collection forms, e.g. questionnaires, clinical data sheets.

• Coding instructions.
• Decide on data entry program.
• Decide on eventual data analysis program.
• Ensure compatibility between data to be entered in different files.
• Set up data entry.


Design of data collection forms
Example: European Community Respiratory Health Survey


Questionnaires - layout

• Readability and attractiveness to responder or interviewer.

• Readability and lack of ambiguity for data entry clerk.

• Collect date of birth and date of the occasion (e.g. interview or examination), not age; age can then be calculated exactly.
• Other issues later in course.
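As an illustration of why the two dates are preferable to a reported age, the sketch below derives age in completed years from a date of birth and the date of the occasion. The function name and the example dates are hypothetical and are not taken from the ECRHS forms.

```python
from datetime import date


def age_in_years(date_of_birth: date, occasion: date) -> int:
    """Completed years of age on the date of the occasion (e.g. interview)."""
    if occasion < date_of_birth:
        raise ValueError("occasion precedes date of birth")
    # Subtract one year if the birthday has not yet occurred in the occasion year.
    had_birthday = (occasion.month, occasion.day) >= (date_of_birth.month, date_of_birth.day)
    return occasion.year - date_of_birth.year - (0 if had_birthday else 1)


# Hypothetical respondent: born 15 June 1950, interviewed 24 November 2000.
print(age_in_years(date(1950, 6, 15), date(2000, 11, 24)))  # 50
```

A date entered in the wrong order (occasion before birth) raises an error, which doubles as a simple data check.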


Coding instructions

• Unique identification required for each individual in the study – must be included in each separate set of data.

• Assign unique NUMERIC code to all categorical/qualitative data, e.g. male – 1, female – 2.

• Codes may be printed on the questionnaire, applied at data entry, or assigned later by text-to-numeric conversion.

• Decide on code for ‘missing’ data – should be a number well away from possible data, e.g. 9 for gender, 999 for weight in kg (if ‘blank’ is used, be sure that it will be transferred as ‘missing’).


ECRHS coding instructions: General

ECRHS: European Community Respiratory Health Survey


ECRHS coding instructions: Specific


Data entry

• Excel – part of Microsoft Office package so almost always available.

• Access – part of Microsoft Office Professional.
• Stata and SPSS – statistical analysis programs, available at King’s.
• Epi-Data – freeware from http://www.epidata.dk/


Data entry - Epi-Data


Data entry programs

• Small amounts of data can be entered in Excel, Stata or SPSS.

• Verification/double entry? – necessary except for small amounts of data.

• Verification/double entry most easily carried out if data entered using Epi-Data.

• Microsoft Access – sophisticated data entry, but verification requires complex programming.
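To make the idea of verification by double entry concrete, here is a minimal sketch that compares two independent entry passes and lists every disagreement. It is not how Epi-Data implements verification; the function name, the `id` field and the records are hypothetical.

```python
def compare_entries(first, second, id_field="id"):
    """Return (id, field, value_1, value_2) for every disagreement between two entry passes."""
    second_by_id = {row[id_field]: row for row in second}
    first_ids = {row[id_field] for row in first}
    discrepancies = []
    for row in first:
        other = second_by_id.get(row[id_field])
        if other is None:
            # Record keyed in the first pass only.
            discrepancies.append((row[id_field], None, "present", "absent"))
            continue
        for field, value in row.items():
            if other.get(field) != value:
                discrepancies.append((row[id_field], field, value, other.get(field)))
    for row in second:
        if row[id_field] not in first_ids:
            # Record keyed in the second pass only.
            discrepancies.append((row[id_field], None, "absent", "present"))
    return discrepancies


# Hypothetical records: weight for id 1 was keyed as 72 and then as 27.
entry_1 = [{"id": 1, "gender": 1, "weight": 72}, {"id": 2, "gender": 2, "weight": 58}]
entry_2 = [{"id": 1, "gender": 1, "weight": 27}, {"id": 2, "gender": 2, "weight": 58}]
for discrepancy in compare_entries(entry_1, entry_2):
    print(discrepancy)  # (1, 'weight', 72, 27)
```

Each listed discrepancy would then be resolved against the original paper form rather than by guessing which pass is right.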


Data analysis programs

• Stata – popular with medical statisticians and epidemiologists.
– flexible, powerful, very few drawbacks
• SPSS – popular with sociologists and psychologists.
• SAS – popular with statisticians in pharmaceutical R&D.
• R or S-plus – popular with academic statisticians.
• Beware of little-known packages.
– Unknown, limited validation


File transfer between programs

• Excel can read and write delimited text files. (A delimited text file is one in which each line of text is a record and the fields are separated by a known character such as a comma or tab.)

• Stata, SPSS and most statistical packages can read and write delimited text files or Excel files.

• Data can be exported directly from Epi-Data to Stata or SPSS.

• Variable names/labels preserved in most cases.


Program formats

• Each program has its own special format.
• File extensions tell you the file format:
– Excel .xls
– Stata .dta
– SPSS .sav
– Access .mdb
– Epi-Data .rec
– Comma separated file .csv


Spreadsheet & comma-separated files

• Each spreadsheet has its own ‘format’, but it is possible to write a ‘comma-separated file’ which can be read by other programs.


Spreadsheet & comma-separated files

PID,CENTRE,REC DATE,BASE_SS,TPA,DOSE,AGE,TIME TO TREAT,DAY 90 OK,FINAL_DATE, SSS,CHANGE,
1,26620,24-Nov-00,30,0,0.1,81,3.42,Y,22-Feb-01,42,12,
3,26620,28-Nov-00,39,0,0.2,74,3.75,Y,01-Mar-01,58,19,
5,25224,29-Nov-00,39,0,0.1,51,4.5,Y,27-Feb-01,58,19,
7,30912,30-Nov-00,31,0,0.8,61,4.25,Y,27-Feb-01,52,21,
9,30969,05-Dec-00,40,0,0.2,96,5.25,Y,06-Mar-01,55,15,
11,27460,08-Dec-00,28,0,0,80,3.02,Y,08-Mar-01,58,30,
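A comma-separated file like the one above can be read by almost any program. The sketch below uses Python's standard `csv` module on the first two data rows, embedded as a string so that it runs on its own. It also cross-checks one field against the others: that CHANGE equals SSS minus BASE_SS is an assumption inferred from the numbers shown, not something the slide states.

```python
import csv
import io

# Header and first two data rows of the example file, embedded for a self-contained sketch.
text = """PID,CENTRE,REC DATE,BASE_SS,TPA,DOSE,AGE,TIME TO TREAT,DAY 90 OK,FINAL_DATE, SSS,CHANGE,
1,26620,24-Nov-00,30,0,0.1,81,3.42,Y,22-Feb-01,42,12,
3,26620,28-Nov-00,39,0,0.2,74,3.75,Y,01-Mar-01,58,19,
"""

reader = csv.reader(io.StringIO(text))
# Strip stray spaces from the names and drop the empty field left by the trailing comma.
header = [name.strip() for name in next(reader) if name.strip()]
records = [dict(zip(header, row)) for row in reader]

for rec in records:
    # Assumed relationship (inferred from the data): CHANGE = SSS - BASE_SS.
    consistent = int(rec["SSS"]) - int(rec["BASE_SS"]) == int(rec["CHANGE"])
    print(rec["PID"], rec["REC DATE"], consistent)
```

Every value arrives as text, so numeric fields have to be converted explicitly before any arithmetic or comparison.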


Steps for most studies

• Data entered in Excel (or Epi-Data).

• Data transferred (‘exported’) to Stata or SPSS.

• Data checking and editing.

• Data analysis.


Setting up data entry in Epi-Data

• Should correspond to questionnaire or other data collection form.

• Allowable data determined by coding instructions.

• Set ranges for quantitative data, e.g. Height.
• Data entry “clerk” should not be constantly checking.
• Decide how dates are to be handled.


Preliminary editing

• If data were entered as text codes, convert them to numeric codes.

• Text to numeric conversion is simple in Excel.
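The same text-to-numeric conversion can be sketched outside Excel. The example below uses the coding scheme given earlier in the coding instructions (male = 1, female = 2, 9 for missing); the function name and the entered values are hypothetical.

```python
# Coding scheme from the coding instructions: male = 1, female = 2, missing = 9.
GENDER_CODES = {"male": 1, "female": 2}
MISSING_GENDER = 9


def encode_gender(text):
    """Map a text entry to its numeric code; blank or unrecognised text becomes the missing code."""
    return GENDER_CODES.get(text.strip().lower(), MISSING_GENDER)


entered = ["Male", "female ", "", "FEMALE"]  # hypothetical entries
print([encode_gender(value) for value in entered])  # [1, 2, 9, 2]
```

Mapping unrecognised text straight to the missing code is a design choice made here for brevity; in practice it may be safer to list such entries for checking before recoding them.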


Data checking

• Data correspond with coding instructions.
• Data correspond with plausible/possible distribution.
• Graphs, tables can identify if there is a problem, e.g. outliers and missing values.
• Listing selected data required to identify where there is a problem.
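The last two points, spotting that a problem exists and then listing the records responsible, can be sketched as a simple range check. The limits, field names and records below are hypothetical; real limits would come from the coding instructions.

```python
# Hypothetical plausible range for adult height in cm.
HEIGHT_RANGE = (120, 220)


def out_of_range(records, field, low, high, missing_code=999):
    """List (id, value) for records whose value is outside [low, high], ignoring the missing code."""
    return [(r["id"], r[field]) for r in records
            if r[field] != missing_code and not (low <= r[field] <= high)]


records = [
    {"id": 101, "height": 172},
    {"id": 102, "height": 17},    # probably a keying error
    {"id": 103, "height": 999},   # coded as missing, so not flagged
    {"id": 104, "height": 265},
]
print(out_of_range(records, "height", *HEIGHT_RANGE))  # [(102, 17), (104, 265)]
```

The listing gives the identification numbers needed to go back to the original forms.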


Multiple files

• Data from different centres need to be appended – add more rows (more records).

• Data from different sources/questionnaires/time periods for the same individuals need to be merged (e.g. cohort studies and RCTs) – add more columns (more variables).

• Efficient to enter data in separate files if not all data apply to all individuals, e.g. special questionnaire for women.


Compatible files

• If files are to be appended they must contain the same variable names (columns) for different people (rows).

• If files are to be merged they should contain different data (columns) for the same individuals. The identification number(s) need to be the same in each file for each individual, and the identification variable name needs to be the same in each file.
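The difference between appending (more rows) and merging (more columns) can be shown with a small sketch. It represents each file as a list of records; the function names, the `id` variable and the data are hypothetical, and a statistical package's own append and merge commands would normally be used instead.

```python
def append_files(*files):
    """Stack rows from files that share the same variable names."""
    names = set(files[0][0])
    for rows in files:
        for row in rows:
            if set(row) != names:
                raise ValueError("files to be appended must have identical variable names")
    return [row for rows in files for row in rows]


def merge_files(left, right, id_field="id"):
    """Add the columns of `right` to the matching rows of `left`; also report unmatched ids."""
    right_by_id = {row[id_field]: row for row in right}
    merged = [{**row, **right_by_id.get(row[id_field], {})} for row in left]
    unmatched = sorted({r[id_field] for r in left} ^ set(right_by_id))
    return merged, unmatched


centre_a = [{"id": 1, "age": 81}, {"id": 3, "age": 74}]   # hypothetical centre files
centre_b = [{"id": 5, "age": 51}]
followup = [{"id": 1, "day90": "Y"}, {"id": 5, "day90": "Y"}]

everyone = append_files(centre_a, centre_b)        # more rows
merged, unmatched = merge_files(everyone, followup)  # more columns
print(merged)
print(unmatched)  # [3]
```

Reporting the unmatched identification numbers is a form of cross-checking the merged file: here individual 3 has no follow-up record.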


Graphs for checking data (1)

• Single continuous variable
– Histogram can detect ‘outliers’, e.g. in height (also dot plot, some box and whisker plots).


[Figures: Histogram of age of SLSR patients; Boxplot of age of SLSR patients]


Graphs for checking data (2)

• Two continuous variables
– Scatter plot can detect ‘outliers’, e.g. in weight for height.
• Follow with list of aberrant values.
• Graphs less useful for categorical data.


[Figure: Linear relationship between aortic pulse wave velocity (Ao-PWV) and ambulatory arterial stiffness index (AASI) in patients with type 2 diabetes, microalbuminuria and systolic hypertension at baseline. Scatter plot of BasePWV (y-axis, 8 to 20) against BaseAASI (x-axis, 0.2 to 0.8), with fitted line and 95% confidence interval.]


Tables for categorical data

Wheeze in last 12 months    Frequency (n)        %
No                                   1945     75.0
Yes                                   642     24.7
Not known                               8      0.3
Total                                2595    100.0


Tables for checking categorical data

. tab q1

         q1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1945       74.95       74.95
          2 |        642       24.74       99.69
          9 |          8        0.31      100.00
------------+-----------------------------------
      Total |       2595      100.00

. list area id if q1==9

        area     id
  640.   110    640
 1853.   110   1853
 3280.   110   3280
 3624.   110   3624
 3663.   110   3663
 4509.   110   4509
 4623.   110   4623
 4923.   110   4923
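For readers without Stata, the same tabulate-then-list pattern can be sketched in Python. The frequencies are those in the table above; the function name and the small list of records are hypothetical, and the reading of the codes (1 = no, 2 = yes, 9 = not known) is inferred from the matching counts in the wheeze table.

```python
from collections import Counter


def tabulate(values):
    """Return (value, frequency, percent) rows, like a one-way frequency table."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, n, round(100 * n / total, 2)) for v, n in sorted(counts.items())]


# Frequencies from the table above: 1945 coded 1, 642 coded 2, 8 coded 9.
q1 = [1] * 1945 + [2] * 642 + [9] * 8
for row in tabulate(q1):
    print(row)

# Hypothetical records, to show the follow-up listing of ids with the 'not known' code.
data = [{"id": 10, "q1": 1}, {"id": 11, "q1": 2}, {"id": 12, "q1": 9}, {"id": 13, "q1": 1}]
print([r["id"] for r in data if r["q1"] == 9])  # [12]
```

A code that appears in the table but not in the coding instructions would show up as an unexpected extra row.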


Missing data

• Convert to program missing value code before calculating summary statistics or plotting graphs

• E.g. in Stata
– mvdecode gender, mv(9)
– mvdecode weight, mv(999)
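A small sketch shows why the conversion matters. It is a rough Python analogue of the recoding, not a description of how Stata's `mvdecode` works internally; the function name and the records are hypothetical, and the codes 9 and 999 are those used in the slide.

```python
# Missing-value codes from the coding instructions: 9 for gender, 999 for weight in kg.
MISSING_CODES = {"gender": 9, "weight": 999}


def decode_missing(records, codes=MISSING_CODES):
    """Return copies of the records with each variable's missing code replaced by None."""
    return [{k: (None if codes.get(k) == v else v) for k, v in r.items()} for r in records]


records = [{"gender": 1, "weight": 72}, {"gender": 9, "weight": 999}, {"gender": 2, "weight": 58}]
clean = decode_missing(records)

weights = [r["weight"] for r in clean if r["weight"] is not None]
print(sum(weights) / len(weights))  # 65.0; leaving 999 in would give about 376.3
```

The code is matched per variable, so a genuine value of 9 in some other variable is left untouched.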


Case study: Scottish Family History Study Data

SFHS Data Quality Report: PCQ data

• Report description: Draft Summary Report
• Prepared by: Yanzhong Wang
• Last run on: 07/11/2008 by Yanzhong Wang
• Report file name: SFHS_PCQ_check_report.doc
• Created by program: //Rcb-file-2000/Filestore/Studies/SFHS/statistics/programs/PCQ_datacheck_prog/SFHS_PCQ_check_v7.R
• Created using software: R version 2.5.1 (2007-06-27) for Windows
• Checked by: The program for producing this report has not been checked by a second statistician


Overview of SFHS PCQ data

• Duplicate records

The analysis data set PCQ combines subjects from both version 1 and version 2 of the pre-clinic questionnaire. There are 6882 records in total in the PCQ data set, and each record has 415 variables. 6863 records have unique subject numbers; 19 subject numbers appear more than once, and none appears more than twice. The duplicated subject numbers are:

SFT0400662 SFT0400659 SFT0400656 SFT0400660 SFT0400985 SFT0434277
SFG9500780 SFT0441461 SFT0435134 SFT0441530 SFT0435427 SFT0435155
SFT0441473 SFT0435157 SFT0441602 SFT0435584 SFT0435438 SFG9501775
SFT0435630

The 19 duplicated records are omitted from further analysis, leaving a total of 6863 records with unique subject numbers.
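A duplicate check of this kind can be sketched as follows. This is not the R program that produced the report; the function names and the `sno` field are hypothetical, and keeping the first record for each subject number is only one possible rule for deciding which copy to omit.

```python
from collections import Counter


def find_duplicates(subject_numbers):
    """Return {subject_number: count} for numbers that appear more than once."""
    return {s: n for s, n in Counter(subject_numbers).items() if n > 1}


def drop_repeats(records, id_field="sno"):
    """Keep the first record for each subject number (an assumed rule, for illustration)."""
    seen, kept = set(), []
    for r in records:
        if r[id_field] not in seen:
            seen.add(r[id_field])
            kept.append(r)
    return kept


records = [{"sno": "A001"}, {"sno": "A002"}, {"sno": "A001"}]  # hypothetical subject numbers
print(find_duplicates(r["sno"] for r in records))  # {'A001': 2}
print(len(drop_repeats(records)))                  # 2
```

The counts also show whether any number appears more than twice, which the report states separately.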


Overview of SFHS PCQ data (continued)

• Blank variables

Five variables contain NA for every subject:

• "PCQCIV4" – “Prescribed injection / Suppository 4”
• "PCQCIV6" – “Prescribed injection / Suppository 6”
• "PCQEHB" – “Family Health / Breast cancer – brother (version 1)”
• "PCQEKS" – “Family Health / Prostate cancer – sister (version 1)”
• "PCQELB" – “Family Health / Hip fracture – brother (version 1)”
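Finding variables that are blank for every subject is straightforward to automate. The sketch below is a minimal illustration, not the report's own code; the variable names are borrowed from the slides but the values are hypothetical, with `None` standing in for NA.

```python
def blank_variables(records):
    """Return the names of variables whose value is missing (None) in every record."""
    names = records[0].keys()
    return [name for name in names if all(r.get(name) is None for r in records)]


# Hypothetical values for two subjects.
records = [
    {"PCQH1": 1, "PCQEHB": None, "PCQN1": 12},
    {"PCQH1": None, "PCQEHB": None, "PCQN1": 16},
]
print(blank_variables(records))  # ['PCQEHB']
```

A variable that is missing for some subjects only, such as PCQH1 here, is not reported.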


Overview of SFHS PCQ data (continued)

• Pre-processing data

All variables are annotated in the following format:

Variable name    Description                     Data type
e.g. PCQH1       Troubled by pain/discomfort?    Categorical

Variables are formatted into numeric, categorical and text data accordingly.

“NA” denotes a missing value.

Summary tables and histograms are produced based on the pre-processed data.


Methods

Summary tables

For each variable the type of data (numeric, categorical or text) and the number (%) of non-missing data are given. Additional summary statistics are given for numeric and categorical variables:

– Numeric variable: minimum, 1st quartile, median, mean, 3rd quartile, maximum, number of missing data points.
– Categorical variable: number (%) in each category.

Histogram

For each numeric variable, a histogram is used to show how the data are distributed (frequencies within specified intervals).

Bar chart

For each categorical variable, a bar chart is used for comparing values (counts or percentages) in different categories.
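The numeric summary described above can be sketched with Python's `statistics` module. The report itself was produced in R; the function name and data here are hypothetical, and the "inclusive" quantile method is used on the assumption that it interpolates in the same way as R's default.

```python
import statistics


def summarise_numeric(values):
    """Min, quartiles, median, mean, max and number missing (None) for one variable."""
    present = sorted(v for v in values if v is not None)
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {"Min.": present[0], "1st Qu.": q1, "Median": median,
            "Mean": statistics.mean(present), "3rd Qu.": q3,
            "Max.": present[-1], "NA's": len(values) - len(present)}


# Hypothetical ages at which six respondents started smoking; one did not answer.
ages = [14, 16, 18, 20, None, 12]
print(summarise_numeric(ages))
```

Printing the minimum and maximum next to the quartiles makes implausible values easy to spot, and the count of missing values shows how much of the variable was actually recorded.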


Output 2. Summary of variables for Chest pain (Angina). For each variable the type of data (numeric, categorical or text) and the number (%) of non-missing data are given. Additional summary statistics are given for numeric and categorical variables. Numeric: minimum, 1st quartile, median, mean, 3rd quartile, maximum, number of missing data points. Categorical: number (%) in each category. Produced by Yanzhong Wang on Fri Nov 07 16:18:52 2008

Variable | Data type | N (%) recorded data | Summary
Pre-clinic questionnaire date | Num | 6833 (99.6%) | Min. 2002-02-20; 1st Qu. 2007-03-07; Median 2007-09-18; Mean 2007-09-16; 3rd Qu. 2008-05-13; Max. 2009-09-01
PRESENCE OF ANGINA | Cat | 6863 (100.0%) | Yes: 134 (2.0%); No: 6729 (98.0%)
SEVERITY OF ANGINA | Cat | 6863 (100.0%) | Grade 1: 107 (1.6%); Grade 2: 24 (0.3%); NA: 6732 (98.1%)
PAIN OF POSSIBLE INFARCTION | Cat | 1725 (25.1%) | Yes: 256 (14.8%); No: 1469 (85.2%)


Output 5. Summary of variables for smoking history. For each variable the type of data (numeric, categorical or text) and the number (%) of non-missing data are given. Additional summary statistics are given for numeric and categorical variables. Numeric: minimum, 1st quartile, median, mean, 3rd quartile, maximum, number of missing data points. Categorical: number (%) in each category. Produced by Yanzhong Wang on Fri Nov 07 16:18:57 2008

Variable | Data type | N (%) recorded data | Summary
EVER SMOKED TOBACCO? | Cat | 6777 (98.7%) | 1: 1201 (17.7%); 2: 222 (3.3%); 3: 1748 (25.8%); 4: 3606 (53.2%)
AGE WHEN STARTED SMOKING | Num | 3165 (46.1%) | Min. 0; 1st Qu. 14; Median 16; Mean 16.78; 3rd Qu. 18; Max. 62; NA's 3698
CIGARETTES/WEEK (MAX) | Num | 2749 (40.1%) | Min. 0; 1st Qu. 35; Median 100; Mean 95.68; 3rd Qu. 140; Max. 560; NA's 4114
PACKETS OF TOBACCO/WEEK (MAX) | Num | 622 (9.1%) | Min. 0; 1st Qu. 0; Median 2; Mean 3.322; 3rd Qu. 6; Max. 81; NA's 6241
CIGARS/WEEK (MAX) | Num | 343 (5.0%) | Min. 0; 1st Qu. 0; Median 0; Mean 6.449; 3rd Qu. 3.5; Max. 99; NA's 6520
HOW LONG SINCE GIVING UP SMOKING? | Num | 1837 (26.8%) | Min. 0; 1st Qu. 4; Median 13; Mean 15.49; 3rd Qu. 25; Max. 92; NA's 5026
WHY DID YOU GIVE UP SMOKING? | Cat | 1804 (26.3%) | 1: 60 (3.3%); 2: 1641 (91.0%); 3: 103 (5.7%)


Education and occupation


Smoking history


PCQ data: outliers/extreme values report

PCQ data: outliers/extreme values report by Yanzhong Wang on Fri Nov 07 15:27:37 2008

Columns: Variable, Description, Low_limit, Outlier_low_SNO_value, High_limit, Outlier_high_SNO_value. Outliers are listed as subject number, value.

1. PCQN1: TOTAL YEARS IN FULL-TIME STUDY
Low_limit: 2
Outlier_low_SNO_value: SFG9500481, 2; SFT0434849, 0; SFG9500819, 1; SFG9501105, 0
High_limit: 25
Outlier_high_SNO_value: SFT0400530, 25; SFG9500408, 27; SFT0434350, 25; SFG9501237, 27; SFT0435564, 31; SFT0435386, 28; SFT0441832, 29; SFT0435764, 39; SFG9502609, 25; SFG9501887, 27; SFG9502363, 26; SFT0442041, 27; SFT0442200, 30

2. PCQN5III: HOURS/WEEK WORKING AT NIGHT 7PM-7AM
Low_limit: (blank)
Outlier_low_SNO_value: (blank)
High_limit: 50
Outlier_high_SNO_value: SFT0400539, 50; SFT0400564, 60; SFT0400595, 60; SFT0400627, 50; SFT0400777, 73; SFT0400793, 50; SFT0401048, 56; SFG9500441, 54; SFT0401182, 70; SFT0434230, 50; SFT0434367, 60; SFT0434391, 60; SFT0441084, 72; SFG9500487, 55; SFG9500466, 60; SFG9500845, 60; SFG9501128, 60; SFG9501326, 50; SFT0441418, 50; SFT0435323, 50; SFT0435389, 56; SFT0441788, 60; SFG9502202, 59; SFG9502156, 65; SFT0441972, 50; SFT0435854, 54

3. PCQO1: NO OF PEOPLE LIVE IN HOUSEHOLD?
Low_limit: 0
Outlier_low_SNO_value: SFG9500124, 0; SFT0401027, 0; SFT0434305, 0; SFT0434390, 0; SFT0434428, 0; SFT0441113, 0; SFT0434541, 0; SFT0434828, 0; SFG9500674, 0; SFG9500949, 0; SFT0435099, 0; SFT0435778, 0
High_limit: 8
Outlier_high_SNO_value: SFG9500474, 10; SFG9500998, 8; SFG9501593, 9; SFT0441511, 10; SFT0435430, 8; SFT0441995, 9


Lunch time

See you at 1pm for Part 2