
Post on 31-Mar-2015


Data Handling I: Data Preparation and Data Cleaning

Dr Yanzhong Wang
Lecturer in Medical Statistics
Division of Health and Social Care Research
King's College London
Email: yanzhong.wang@kcl.ac.uk

Drug Development Statistics & Data Management

Session objectives

• Part 1:
– How to create a simple data file ready for analysis
– How to create data files for large scale studies
– Case study on data checking/cleaning

• Part 2:
– Advantages and disadvantages of various summary statistics
– Select appropriate summary statistics for categorical, binary, ordinal and continuous data.

Reading: Statistics at Square One, Chapter 2.

Outline of computerisation of data

• Plan – at protocol stage of study.
• Data entry.
• Data checking and editing of individual files.
• Merging and appending files if necessary.
• Cross-checking of merged files.
• Data analysis.

Planning stage

• Design of data collection forms, e.g. questionnaires, clinical data sheets.
• Coding instructions.
• Decide on data entry program.
• Decide on eventual data analysis program.
• Ensure compatibility between data to be entered in different files.
• Set up data entry.

Design of data collection forms
Example: European Community Respiratory Health Survey

Questionnaires - layout

• Readability and attractiveness to responder or interviewer.

• Readability and lack of ambiguity for data entry clerk.

• Collect dates of birth and occasion, not age.
• Other issues later in course.
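One reason to record dates rather than age is that age can then be derived exactly for any occasion. The following is a minimal Python sketch of that derivation; the function name `age_in_years` and the example dates are illustrative and do not come from the slides.

```python
from datetime import date


def age_in_years(date_of_birth: date, occasion: date) -> int:
    """Completed years of age on the occasion date."""
    if occasion < date_of_birth:
        raise ValueError("occasion date is before date of birth")
    # Take off one year if the birthday has not yet been reached
    # in the year of the occasion.
    had_birthday = (occasion.month, occasion.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return occasion.year - date_of_birth.year - (0 if had_birthday else 1)


# Hypothetical respondent born 15 June 1950, seen on 7 November 2008.
print(age_in_years(date(1950, 6, 15), date(2008, 11, 7)))  # 58
```

Storing both dates also lets the same record give a different, correct age at each later visit.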


Coding instructions

• Unique identification required for each individual in the study – must be included in each separate set of data.

• Assign unique NUMERIC code to all categorical/qualitative data, e.g. male – 1, female – 2.

• Codes may be printed on questionnaire, implemented in data entry; or later text to numeric conversion.

• Decide on code for ‘missing’ data – should be a number well away from possible data, e.g. 9 for gender, 999 for weight in kg (if use ‘blank’ need to be sure that this will be transferred as ‘missing’).


ECRHS coding instructions
General

ECRHS: European Community Respiratory Health Survey

ECRHS coding instructions
Specific

Data entry

• Excel – part of Microsoft Office package so almost always available.
• Access – part of Microsoft Office Professional.
• Stata and SPSS – statistical analysis programs, available at King’s.
• Epi-Data – freeware from http://www.epidata.dk/

Data entry - Epi-Data


Data entry programs

• Small amounts of data can be entered in Excel, Stata or SPSS.

• Verification/double entry? – necessary except for small amounts of data.

• Verification/double entry most easily carried out if data entered using Epi-Data.

• Microsoft Access – sophisticated data entry, but verification requires complex programming.
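The idea behind verification by double entry can be sketched in a few lines of Python: the same forms are keyed twice, and every disagreement between the two files is listed for checking against the paper form. This is only an illustration of the principle, not how Epi-Data implements it; the function name, the field names and the example values are assumptions.

```python
def compare_entries(first, second, id_field="id"):
    """List every disagreement between two independent keyings.

    Each item is (id, field, value in first entry, value in second entry).
    A field of None means the whole record is missing from one of the files.
    """
    second_by_id = {row[id_field]: row for row in second}
    first_ids = {row[id_field] for row in first}
    discrepancies = []
    for row in first:
        other = second_by_id.get(row[id_field])
        if other is None:
            discrepancies.append((row[id_field], None, "present", "absent"))
            continue
        for field, value in row.items():
            if other.get(field) != value:
                discrepancies.append((row[id_field], field, value, other.get(field)))
    for row in second:
        if row[id_field] not in first_ids:
            discrepancies.append((row[id_field], None, "absent", "present"))
    return discrepancies


# Hypothetical data: weight for id 1 was keyed as 72 and then as 27.
entry_1 = [{"id": 1, "sex": 1, "weight": 72}, {"id": 2, "sex": 2, "weight": 58}]
entry_2 = [{"id": 1, "sex": 1, "weight": 27}, {"id": 2, "sex": 2, "weight": 58}]
print(compare_entries(entry_1, entry_2))  # [(1, 'weight', 72, 27)]
```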


Data analysis programs

• Stata – popular with medical statisticians and epidemiologists.
– flexible, powerful, very few drawbacks
• SPSS – popular with sociologists and psychologists.
• SAS – popular with statisticians in pharmaceutical R&D.
• R or S-plus – popular with academic statisticians.
• Beware of little-known packages.
– Unknown, limited validation

File transfer between programs

• Excel can read and write text delimited files. (A delimited text file is one in which each line of text is a record, and the fields are separated by a known character such as a comma or a tab.)

• Stata, SPSS and most statistical packages can read and write text delimited files or Excel files.

• Data can be exported directly from Epi-Data to Stata or SPSS.

• Variable names/labels preserved in most cases.


Program formats

• Each program has its own special format
• File extensions tell you the file format
– Excel .xls
– Stata .dta
– SPSS .sav
– Access .mdb
– Epi-Data .rec
– Comma separated file .csv

Spreadsheet & comma-separated files

• Each spreadsheet has its own ‘format’, but it is possible to write a ‘comma-separated file’ which can be read by other programs.


PID,CENTRE,REC DATE,BASE_SS,TPA,DOSE,AGE,TIME TO TREAT,DAY 90 OK,FINAL_DATE, SSS,CHANGE,
1,26620,24-Nov-00,30,0,0.1,81,3.42,Y,22-Feb-01,42,12,
3,26620,28-Nov-00,39,0,0.2,74,3.75,Y,01-Mar-01,58,19,
5,25224,29-Nov-00,39,0,0.1,51,4.5,Y,27-Feb-01,58,19,
7,30912,30-Nov-00,31,0,0.8,61,4.25,Y,27-Feb-01,52,21,
9,30969,05-Dec-00,40,0,0.2,96,5.25,Y,06-Mar-01,55,15,
11,27460,08-Dec-00,28,0,0,80,3.02,Y,08-Mar-01,58,30,
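As a rough illustration of how another program reads such a file, the sketch below parses the first two data rows with Python's standard `csv` module. The helper names `read_delimited` and `parse_date` are illustrative. Reading the two-digit years as 2000 and 2001 is an assumption, as is the default English locale needed for the month abbreviations.

```python
import csv
import io
from datetime import datetime

SAMPLE = """\
PID,CENTRE,REC DATE,BASE_SS,TPA,DOSE,AGE,TIME TO TREAT,DAY 90 OK,FINAL_DATE, SSS,CHANGE,
1,26620,24-Nov-00,30,0,0.1,81,3.42,Y,22-Feb-01,42,12,
3,26620,28-Nov-00,39,0,0.2,74,3.75,Y,01-Mar-01,58,19,
"""


def read_delimited(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by variable name."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    records = []
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        # The trailing comma on each line yields an empty variable name,
        # which is dropped here.
        records.append({name: cell.strip() for name, cell in zip(header, row) if name})
    return records


def parse_date(text):
    # Two-digit years are ambiguous; strptime maps 00 and 01 to 2000 and 2001.
    return datetime.strptime(text, "%d-%b-%y").date()


records = read_delimited(SAMPLE)
first = records[0]
follow_up_days = (parse_date(first["FINAL_DATE"]) - parse_date(first["REC DATE"])).days
print(first["PID"], first["SSS"], follow_up_days)  # 1 42 90
```

All values arrive as text, so numeric variables and dates still have to be converted before analysis.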


Steps for most studies

• Data entered in Excel (or Epi-Data).

• Data transferred (‘exported’) to Stata or SPSS.

• Data checking and editing.

• Data analysis.


Setting up data entry in Epi-Data

• Should correspond to questionnaire or other data collection form.

• Allowable data determined by coding instructions.

• Set ranges for quantitative data, e.g. Height.
• Data entry “clerk” should not be constantly checking.
• Decide how dates are to be handled.
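The kind of rule that a data entry screen enforces can be sketched as follows in Python. This is an illustration of allowable codes and ranges, not Epi-Data syntax. The missing codes 9 and 999 follow the coding instructions above; the variable names and the numeric limits are hypothetical.

```python
ALLOWED_CODES = {"sex": {1, 2, 9}}  # 1 male, 2 female, 9 missing
RANGES = {"height_cm": (100, 220), "weight_kg": (30, 250)}  # hypothetical limits
MISSING_CODES = {"weight_kg": 999}


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, codes in ALLOWED_CODES.items():
        if record.get(field) not in codes:
            problems.append(f"{field}: code {record.get(field)!r} is not allowed")
    for field, (low, high) in RANGES.items():
        value = record.get(field)
        if field in MISSING_CODES and value == MISSING_CODES[field]:
            continue  # the agreed missing code is always acceptable
        if value is None or not low <= value <= high:
            problems.append(f"{field}: value {value!r} is outside {low} to {high}")
    return problems


print(check_record({"sex": 1, "height_cm": 172, "weight_kg": 70}))  # []
# Height keyed in the wrong units and an undefined sex code are both reported.
print(check_record({"sex": 3, "height_cm": 17.2, "weight_kg": 999}))
```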


Preliminary editing

• If data entered as text codes convert to numeric codes.

• Text to numeric conversion simple in Excel.
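The same conversion can be sketched outside Excel. The Python example below uses the codes from the coding instructions (male 1, female 2, missing 9); the function name `text_to_code` is illustrative. Unrecognised text raises an error, on the assumption that it is safer to stop than to guess a code.

```python
SEX_CODES = {"male": 1, "female": 2}
MISSING_SEX = 9


def text_to_code(value, codes, missing_code):
    """Convert a text code to its numeric code; blank becomes the missing code."""
    key = (value or "").strip().lower()
    if key == "":
        return missing_code
    if key not in codes:
        # Do not guess: unrecognised text should go back for checking.
        raise ValueError(f"unrecognised text code: {value!r}")
    return codes[key]


print([text_to_code(v, SEX_CODES, MISSING_SEX) for v in ["Male", " female ", ""]])
# [1, 2, 9]
```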


Data checking

• Data correspond with coding instructions.
• Data correspond with plausible/possible distribution.
• Graphs, tables can identify if there is a problem, e.g. outliers and missing values.
• Listing selected data required to identify where there is a problem.

Multiple files

• Data from different centres need to be appended – add more rows (more records).

• Data from different sources/questionnaires/time periods for the same individuals need to be merged (e.g. cohort studies and RCTs) – add more columns (more variables).

• Efficient to enter data in separate files if not all data apply to all individuals, e.g. special questionnaire for women.


Compatible files

• If files are to be appended they must contain the same variable names (columns) for different people (rows).

• If files are to be merged they should contain different data (columns) for the same individuals. The identification number(s) needs to be the same on each file for each individual and the identification variable name needs to be the same in each file.
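The difference between appending and merging can be sketched in Python with each file held as a list of records. The function names, the `id` variable and the example values are assumptions for illustration. The merge also reports identification numbers found in only one file, which is a simple form of cross-checking.

```python
def append_files(first, second):
    """Add more rows: both files must have the same variable names."""
    if first and second and set(first[0]) != set(second[0]):
        raise ValueError("files to be appended must have the same variable names")
    return first + second


def merge_files(left, right, id_field="id"):
    """Add more columns: match records for the same individual on id_field."""
    right_by_id = {row[id_field]: row for row in right}
    left_ids = {row[id_field] for row in left}
    merged = [
        {**row, **right_by_id[row[id_field]]}
        for row in left
        if row[id_field] in right_by_id
    ]
    only_left = [row[id_field] for row in left if row[id_field] not in right_by_id]
    only_right = [row[id_field] for row in right if row[id_field] not in left_ids]
    return merged, only_left, only_right


# Hypothetical data: two centres (append) and two questionnaires (merge).
centre_a = [{"id": 1, "sex": 1}, {"id": 2, "sex": 2}]
centre_b = [{"id": 3, "sex": 2}]
weights = [{"id": 1, "weight": 72}, {"id": 4, "weight": 60}]

everyone = append_files(centre_a, centre_b)
merged, only_left, only_right = merge_files(everyone, weights)
print(len(everyone), merged, only_left, only_right)
# 3 [{'id': 1, 'sex': 1, 'weight': 72}] [2, 3] [4]
```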


Graphs for checking data (1)

• Single continuous variable
– Histogram can detect ‘outliers’, e.g. in height (also dot plot, some box and whisker plots).

[Figures: histogram of age of SLSR patients; boxplot of age of SLSR patients]

Graphs for checking data (2)

• Two continuous variables
– Scatter plot can detect ‘outliers’, e.g. in weight for height.
• Follow with list of aberrant values.
• Graphs less useful for categorical data

[Figure: the relationship between aortic pulse wave velocity (Ao-PWV) and ambulatory arterial stiffness index (AASI) in patients with type 2 diabetes, microalbuminuria and systolic hypertension at baseline. Scatter plot of BasePWV (y-axis) against BaseAASI (x-axis), with fitted line and 95% confidence interval.]

Tables for categorical data

Wheeze in last 12 months

             Frequency (n)       %
No                    1945    75.0
Yes                    642    24.7
Not known                8     0.3
Total                 2595   100.0

Tables for checking categorical data

. tab q1

         q1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1945       74.95       74.95
          2 |        642       24.74       99.69
          9 |          8        0.31      100.00
------------+-----------------------------------
      Total |       2595      100.00

. list area id if q1==9

        area     id
 640.    110    640
1853.    110   1853
3280.    110   3280
3624.    110   3624
3663.    110   3663
4509.    110   4509
4623.    110   4623
4923.    110   4923
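The two steps shown in the Stata output, tabulating the codes and then listing the records behind a suspicious code, can be sketched in Python as follows. The function names are illustrative. The counts reuse the frequencies from the table above, while the four individual records are hypothetical.

```python
from collections import Counter


def frequency_table(values):
    """Return (code, frequency, percent) for each code, sorted by code."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(code, n, round(100 * n / total, 2)) for code, n in sorted(counts.items())]


def list_ids_with_code(records, field, code, id_field="id"):
    """List the ids of records where field equals code, for follow-up."""
    return [record[id_field] for record in records if record[field] == code]


# Same frequencies as the q1 table: 1945 coded 1, 642 coded 2, 8 coded 9.
q1 = [1] * 1945 + [2] * 642 + [9] * 8
for code, n, percent in frequency_table(q1):
    print(code, n, percent)

# Hypothetical records, to show the listing step.
records = [
    {"area": 110, "id": 1, "q1": 1},
    {"area": 110, "id": 2, "q1": 9},
    {"area": 110, "id": 3, "q1": 2},
    {"area": 110, "id": 4, "q1": 9},
]
print(list_ids_with_code(records, "q1", 9))  # [2, 4]
```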


Missing data

• Convert to program missing value code before calculating summary statistics or plotting graphs

• E.g. in Stata
– mvdecode gender, mv(9)
– mvdecode weight, mv(999)
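The effect of skipping this step is easy to demonstrate. The Python sketch below replaces the missing codes with `None`, in the same spirit as the `mvdecode` commands, and compares the mean weight before and after. The three records are hypothetical and the function name `decode_missing` is illustrative.

```python
from statistics import mean

MISSING_CODES = {"gender": 9, "weight": 999}


def decode_missing(records, missing_codes):
    """Return copies of the records with missing-value codes replaced by None."""
    return [
        {
            field: None if field in missing_codes and value == missing_codes[field] else value
            for field, value in record.items()
        }
        for record in records
    ]


records = [
    {"gender": 1, "weight": 70},
    {"gender": 9, "weight": 80},
    {"gender": 2, "weight": 999},
]
cleaned = decode_missing(records, MISSING_CODES)

naive_mean = mean(r["weight"] for r in records)  # wrongly includes 999
clean_mean = mean(r["weight"] for r in cleaned if r["weight"] is not None)
print(naive_mean, clean_mean)  # 383 75
```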

Case study: Scottish Family History Study Data

SFHS Data Quality Report: PCQ data

• Report description: Draft Summary Report
• Prepared by: Yanzhong Wang
• Last run on: 07/11/2008 by Yanzhong Wang
• Report file name: SFHS_PCQ_check_report.doc
• Created by program: //Rcb-file-2000/Filestore/Studies/SFHS/statistics/programs/PCQ_datacheck_prog/SFHS_PCQ_check_v7.R
• Created using software: R version 2.5.1 (2007-06-27) for Windows
• Checked by: The program for producing this report has not been checked by a second statistician

Overview of SFHS PCQ data

• Duplicate records

The analysis data set PCQ combines subjects from both Pre-clinic questionnaire version 1 and version 2. There are 6882 records in total in the PCQ data set and each record has 415 variables. There are 6863 unique subject numbers, of which 19 appear more than once.

No subject number appears more than twice. The duplicate subject numbers are:

SFT0400662 SFT0400659 SFT0400656 SFT0400660 SFT0400985 SFT0434277
SFG9500780 SFT0441461 SFT0435134 SFT0441530 SFT0435427 SFT0435155
SFT0441473 SFT0435157 SFT0441602 SFT0435584 SFT0435438 SFG9501775
SFT0435630

The 19 duplicated records are omitted from further analysis, leaving a total of 6863 records with unique subject numbers.
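A duplicate check of this kind can be sketched in Python as below. The original report was produced in R, so this is not the report's own code. The variable name `SNO` for the subject number, the choice to keep the first record for each subject, and the example records are all assumptions; the report does not say which copy was kept.

```python
from collections import Counter


def duplicate_ids(records, id_field="SNO"):
    """Subject numbers that appear on more than one record."""
    counts = Counter(record[id_field] for record in records)
    return sorted(sno for sno, n in counts.items() if n > 1)


def drop_repeats(records, id_field="SNO"):
    """Keep the first record seen for each subject number (an assumption)."""
    seen = set()
    kept = []
    for record in records:
        if record[id_field] not in seen:
            seen.add(record[id_field])
            kept.append(record)
    return kept


# Hypothetical records: subject A001 was entered twice.
records = [
    {"SNO": "A001", "version": 1},
    {"SNO": "A002", "version": 1},
    {"SNO": "A001", "version": 2},
    {"SNO": "A003", "version": 2},
]
print(duplicate_ids(records))      # ['A001']
print(len(drop_repeats(records)))  # 3
```

In practice the duplicated pairs would be listed and inspected before any record is dropped, since the two copies may not agree.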

Overview of SFHS PCQ data (continued)

• Blank variables

There are 5 variables which contain NA for all subjects. They are:

– "PCQCIV4" – “Prescribed injection / Suppository 4”
– "PCQCIV6" – “Prescribed injection / Suppository 6”
– "PCQEHB" – “Family Health / Breast cancer – brother (version 1)”
– "PCQEKS" – “Family Health / Prostate cancer – sister (version 1)”
– "PCQELB" – “Family Health / Hip fracture – brother (version 1)”
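Finding variables that are blank for every subject is a simple scan over the columns. The Python sketch below is illustrative only: missing values are represented by `None`, and the records and their values are hypothetical, with two variable names borrowed from the report.

```python
def blank_variables(records):
    """Names of variables that are missing (None) on every record."""
    if not records:
        return []
    return [
        name
        for name in records[0]
        if all(record.get(name) is None for record in records)
    ]


# Hypothetical extract: PCQEHB is never filled in.
records = [
    {"SNO": "A001", "PCQH1": 1, "PCQEHB": None},
    {"SNO": "A002", "PCQH1": None, "PCQEHB": None},
    {"SNO": "A003", "PCQH1": 2, "PCQEHB": None},
]
print(blank_variables(records))  # ['PCQEHB']
```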

Overview of SFHS PCQ data (continued)

• Pre-processing data

All variables are annotated in the following format:

Variable name    Description                     Data type
e.g. PCQH1       Troubled by pain/discomfort?    Categorical

Variables are formatted into numeric, categorical and text data accordingly.

“NA” denotes a missing value.

Summary tables and histograms are produced based on the pre-processed data.

Methods

Summary tables

For each variable the type of data (numeric, categorical or text) and the number (%) of non-missing data are given. Additional summary statistics are given for numeric and categorical variables:

Numeric variable: minimum, 1st quartile, median, mean, 3rd quartile, maximum, number of missing data points.

Categorical variable: number (%) in each category.

Histogram

For each numeric variable, a histogram is used to show how the data are distributed (frequencies in specified categories).

Bar chart

For each categorical variable, a bar chart is used for comparing values (counts or percentages) in different categories.
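The numeric summary described above can be sketched with Python's `statistics` module. This is not the R program behind the report, and quartile definitions differ between packages, so the quartiles may not match another program's output exactly. The function name and the example values are hypothetical, and `None` stands for NA.

```python
from statistics import mean, median, quantiles


def summarise_numeric(values):
    """Summary of one numeric variable; None marks a missing value.

    Needs at least two non-missing values to compute quartiles.
    """
    present = sorted(v for v in values if v is not None)
    q1, _, q3 = quantiles(present, n=4, method="inclusive")
    return {
        "n_recorded": len(present),
        "percent_recorded": round(100 * len(present) / len(values), 1),
        "min": present[0],
        "q1": q1,
        "median": median(present),
        "mean": mean(present),
        "q3": q3,
        "max": present[-1],
        "n_missing": len(values) - len(present),
    }


# Hypothetical ages at starting smoking, with one missing value.
summary = summarise_numeric([14, 16, None, 18, 15, 17])
print(summary)
```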

Output 2. Summary of variables for Chest pain (Angina). For each variable the type of data (numeric, categorical or text) and the number (%) of non-missing data are given. Additional summary statistics are given for numeric and categorical variables. Numeric: minimum, 1st quartile, median, mean, 3rd quartile, maximum, number of missing data points. Categorical: number (%) in each category. Produced by Yanzhong Wang on Fri Nov 07 16:18:52 2008

Variable | Data type | N (%) recorded data | Summary
Pre-clinic questionnaire date | Num | 6833 (99.6%) | Min. 2002-02-20; 1st Qu. 2007-03-07; Median 2007-09-18; Mean 2007-09-16; 3rd Qu. 2008-05-13; Max. 2009-09-01
PRESENCE OF ANGINA | Cat | 6863 (100.0%) | Yes: 134 (2.0%); No: 6729 (98.0%)
SEVERITY OF ANGINA | Cat | 6863 (100.0%) | Grade 1: 107 (1.6%); Grade 2: 24 (0.3%); NA: 6732 (98.1%)
PAIN OF POSSIBLE INFARCTION | Cat | 1725 (25.1%) | Yes: 256 (14.8%); No: 1469 (85.2%)

Output 5. Summary of variables for smoking history. For each variable the type of data (numeric, categorical or text) and the number (%) of non-missing data are given. Additional summary statistics are given for numeric and categorical variables. Numeric: minimum, 1st quartile, median, mean, 3rd quartile, maximum, number of missing data points. Categorical: number (%) in each category. Produced by Yanzhong Wang on Fri Nov 07 16:18:57 2008

Variable | Data type | N (%) recorded data | Summary
EVER SMOKED TOBACCO? | Cat | 6777 (98.7%) | 1: 1201 (17.7%); 2: 222 (3.3%); 3: 1748 (25.8%); 4: 3606 (53.2%)
AGE WHEN STARTED SMOKING | Num | 3165 (46.1%) | Min. 0; 1st Qu. 14; Median 16; Mean 16.78; 3rd Qu. 18; Max. 62; NA's 3698
CIGARETTES/WEEK (MAX) | Num | 2749 (40.1%) | Min. 0; 1st Qu. 35; Median 100; Mean 95.68; 3rd Qu. 140; Max. 560; NA's 4114
PACKETS OF TOBACCO/WEEK (MAX) | Num | 622 (9.1%) | Min. 0; 1st Qu. 0; Median 2; Mean 3.322; 3rd Qu. 6; Max. 81; NA's 6241
CIGARS/WEEK (MAX) | Num | 343 (5.0%) | Min. 0; 1st Qu. 0; Median 0; Mean 6.449; 3rd Qu. 3.5; Max. 99; NA's 6520
HOW LONG SINCE GIVING UP SMOKING? | Num | 1837 (26.8%) | Min. 0; 1st Qu. 4; Median 13; Mean 15.49; 3rd Qu. 25; Max. 92; NA's 5026
WHY DID YOU GIVE UP SMOKING? | Cat | 1804 (26.3%) | 1: 60 (3.3%); 2: 1641 (91.0%); 3: 103 (5.7%)

Education and occupation

Smoking history

PCQ data: outliers/extreme values report

PCQ data: outliers/extreme values report by Yanzhong Wang on Fri Nov 07 15:27:37 2008

Columns: Variable, Description, Low_limit, Outlier_low_SNO_value, High_limit, Outlier_high_SNO_value. Each outlier is given as SNO, value.

1 PCQN1: TOTAL YEARS IN FULL-TIME STUDY
Low_limit: 2
Outlier_low_SNO_value: SFG9500481, 2; SFT0434849, 0; SFG9500819, 1; SFG9501105, 0
High_limit: 25
Outlier_high_SNO_value: SFT0400530, 25; SFG9500408, 27; SFT0434350, 25; SFG9501237, 27; SFT0435564, 31; SFT0435386, 28; SFT0441832, 29; SFT0435764, 39; SFG9502609, 25; SFG9501887, 27; SFG9502363, 26; SFT0442041, 27; SFT0442200, 30

2 PCQN5III: HOURS/WEEK WORKING AT NIGHT 7PM-7AM
High_limit: 50
Outlier_high_SNO_value: SFT0400539, 50; SFT0400564, 60; SFT0400595, 60; SFT0400627, 50; SFT0400777, 73; SFT0400793, 50; SFT0401048, 56; SFG9500441, 54; SFT0401182, 70; SFT0434230, 50; SFT0434367, 60; SFT0434391, 60; SFT0441084, 72; SFG9500487, 55; SFG9500466, 60; SFG9500845, 60; SFG9501128, 60; SFG9501326, 50; SFT0441418, 50; SFT0435323, 50; SFT0435389, 56; SFT0441788, 60; SFG9502202, 59; SFG9502156, 65; SFT0441972, 50; SFT0435854, 54

3 PCQO1: NO OF PEOPLE LIVE IN HOUSEHOLD?
Low_limit: 0
Outlier_low_SNO_value: SFG9500124, 0; SFT0401027, 0; SFT0434305, 0; SFT0434390, 0; SFT0434428, 0; SFT0441113, 0; SFT0434541, 0; SFT0434828, 0; SFG9500674, 0; SFG9500949, 0; SFT0435099, 0; SFT0435778, 0
High_limit: 8
Outlier_high_SNO_value: SFG9500474, 10; SFG9500998, 8; SFG9501593, 9; SFT0441511, 10; SFT0435430, 8; SFT0441995, 9
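A listing like this can be produced by comparing each value with agreed limits. The Python sketch below is an illustration, not the R program behind the report. It treats the limits as inclusive, which is inferred from the report (a value of 2 is listed against a low limit of 2, and 25 against a high limit of 25). The first four records reuse values from the PCQN1 row; the fifth record is hypothetical.

```python
def outlier_report(records, variable, low_limit=None, high_limit=None, id_field="SNO"):
    """Return (low, high): lists of (SNO, value) at or beyond each limit."""
    low, high = [], []
    for record in records:
        value = record.get(variable)
        if value is None:
            continue  # missing values are reported elsewhere
        if low_limit is not None and value <= low_limit:
            low.append((record[id_field], value))
        if high_limit is not None and value >= high_limit:
            high.append((record[id_field], value))
    return low, high


records = [
    {"SNO": "SFG9500481", "PCQN1": 2},
    {"SNO": "SFT0434849", "PCQN1": 0},
    {"SNO": "SFT0400530", "PCQN1": 25},
    {"SNO": "SFG9500408", "PCQN1": 27},
    {"SNO": "EXAMPLE0001", "PCQN1": 12},  # hypothetical in-range record
]
low, high = outlier_report(records, "PCQN1", low_limit=2, high_limit=25)
print(low)   # [('SFG9500481', 2), ('SFT0434849', 0)]
print(high)  # [('SFT0400530', 25), ('SFG9500408', 27)]
```

Flagged values are not necessarily errors; the listing only identifies records to check against the original questionnaire.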

Lunch time

See you at 1pm for Part 2
