data cleaning techniques

13
Data Cleaning Techniques Workshop on Emergency Information Management Neuhausen, Germany, 18-22 June, 2012 Christian Oxenbøll, Registration Officer, UNHCR Tips and tricks for data management in Excel

Upload: tia

Post on 18-Jan-2016

90 views

Category:

Documents


0 download

DESCRIPTION

Data Cleaning Techniques. Tips and tricks for data management in Excel. Workshop on Emergency Information Management Neuhausen , Germany, 18-22 June, 2012 Christian Oxenbøll, Registration Officer , UNHCR. Data Cleaning. Why is it important? Bad data leads to wrong results - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Data Cleaning Techniques

Data Cleaning Techniques

Workshop on Emergency Information ManagementNeuhausen, Germany, 18-22 June, 2012

Christian Oxenbøll, Registration Officer, UNHCR

Tips and tricks for data management in Excel

Page 2: Data Cleaning Techniques

Data Cleaning

Why is it important?

• Bad data leads to wrong results

• Operational and management decisions should not be based on wrong information

• Even “a few bad data” can make a whole dataset useless for statistics

Page 3: Data Cleaning Techniques

What is data cleaning?Existing data:

– Reviewing logic consistency of data– Reviewing reliability of data– Correction of wrong values– Deletion or suppression of erroneous values

Subsequent data cleaning can be reduced by proper design of data collection:

– Make a data management strategy

– Make sure you know how you will process collected data

– Ensure consistency in design

– Validation rules in Excel

Page 4: Data Cleaning Techniques

What are we looking for?Common errors include:

– 0 when it should be “N/A” (not available/not applicable)– Totals do not match underlying data– Typing errors (and use of different location names)– Wrong interpretation of questions– Mismatch of units (cases/persons, days/months, square

metres/hectares, pct/ratios, flow/stock, etc.)– Missing data– Percentages e.g. indicator values >100%– Date formats (12/01/06 or 01/12/06)

Page 5: Data Cleaning Techniques

How do you clean data?Think logic!

– Look at the data

– Reflect over whether it makes sense• Logical consistence (Mathematical/Statistical) e.g. Total

population vs. children < 18 years

• Meaningful (e.g. is it really true that refugees survive without water and the camp is 2 square meters?)

– Reliability of source• Ask the data source about how data was collected

• What is covered

• What was the methodology Note that logical consistency alone does not imply that data is correct. Always check if data is meaningful

Page 6: Data Cleaning Techniques

How do you clean data?Be creative!

– Use graphs• To spot outliers (high/low values)

– Pivot tables• To create summary tables of large datasets

– Filters• Easy to spot outliers (note the limit in Excel of

1,000 in drop-down list)

– Sorting• To spot outliers and spelling

– Conditional formatting• To spot invalid and dubious values or outliers

Page 7: Data Cleaning Techniques

Example from Uganda SIR

Percentage of refugee students enrolled in Grades 1-6: Uganda 2008

0%

100%

200%

300%

400%

500%

Nakivale

Oruch

inga

Kyangw

ali

Kyaka

II

Kiryand

ongo

Ikafe

Mad

i-Oko

llo

Imve

pi

Rhino-C

amp

Adjum

ani

Palorin

ya

Female

Male

Total

Page 8: Data Cleaning Techniques

How do you clean data?Be creative!

– Lookup functions• Easy to find non-existing codes (typos)

– Formulas• Check of mathematical and logic consistency

– Compare with other sources (Triangulation)• Validation of values/expected ranges (do we have

approximately the same)

– Compare with previous years• Validation of values/expected ranges (do we have

approximately the same)

Page 9: Data Cleaning Techniques

Useful Excel Tools

• Validation (allows only certain values)

• Auto filters

• Conditional Formatting

• Pivot Tables

• Formulas

Page 10: Data Cleaning Techniques

Some useful Excel functions

• Logic– And– Or– If – Not

• Mathematical/Statistical– Average– Count– CountA– CountBlank– CountIf– Dsum– SumIf– Rank

• Information– Trim– Concatenate– Left– Right– Mid– Len– Find– Proper– Lower– Upper– IsBlank– Vlookup– Yearfrac– Today

Use the help in Excel which gives guidance on the use of each formula

Page 11: Data Cleaning Techniques

Data cleaning: some tips

• Design good data collection forms

• Checking plausibility

• Outliers

• Trends analyses

• Using graphical views

• Triangulation

• Using filters, functions and formulas

Page 12: Data Cleaning Techniques

Useful websites

Google your questions

Microsoft Online Help

www.functionx.com/excel

www.pivottableguru.com

Page 13: Data Cleaning Techniques

Exercises

Open the file: “Excel Training.xls”

Follow the instructions. Ask your neighbour or the facilitators if you need assistance.

In the file: “Excel Training Result.xls” you will find the result of the exercises including the formulas.