SADC Course in Statistics Exploratory Data Analysis (EDA) in the data analysis process Module B2 Session 13

Upload: gavin-weaver

Post on 28-Mar-2015



SADC Course in Statistics

Exploratory Data Analysis (EDA) in the data analysis process

Module B2 Session 13


Learning Objectives

Students should be able to:

• Construct a dot plot for a numeric variable, split by a categorical variable

• Apply EDA concepts to a large dataset

• Explain the use of Excel’s pivot tables and filters in the EDA process

• Explain the importance of EDA for data checking and at the start of the analysis

• Relate EDA to the principles of official statistics …


EDA with small and large data sets

• Session 12:
  • Stressed the importance of EDA
  • Introduced 2 new tools (dot and stem)
  • Practiced with small data sets

• In this session we scale up
  • Look at large data sets
  • The tools do not scale up easily
  • But the concepts do scale up
  • EDA becomes even more crucial

• Most data sets are large!
  • at least compared with teaching examples


The essence of a stem and leaf plot

[Figure: the same data shown as a stem and leaf plot and as a stacked dot plot]

The “leaf” shows the next digit. This can be useful in the exploration phase.

Data: 5.3, 5.4, 6.0, …, 11.1, 11.9
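As a rough illustration of what a stem and leaf plot keeps, the Python sketch below builds one for non-negative values recorded to one decimal place. Only the first three and last two values come from the slide; the middle values are invented for illustration, and the function name stem_and_leaf is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return 'stem | leaves' lines for non-negative values with one decimal place."""
    stems = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # work in whole tenths to avoid float noise
        stem, leaf = divmod(tenths, 10)   # stem = whole part, leaf = next digit
        stems[stem].append(str(leaf))
    lines = []
    # Walk every stem from smallest to largest so that gaps stay visible
    for stem in range(min(stems), max(stems) + 1):
        lines.append(f"{stem:>2} | {''.join(stems.get(stem, []))}")
    return lines

# Middle values are invented for illustration only
sample = [5.3, 5.4, 6.0, 7.2, 7.2, 7.9, 8.5, 9.1, 11.1, 11.9]
for line in stem_and_leaf(sample):
    print(line)
```

Each row can be read back into the original numbers, which is the property that distinguishes it from a dot plot.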


What are the key points?

• We look at individual data points
  • not summaries at this stage
  • this is general for EDA

• The stem and leaf plot in particular
  • keeps the actual numbers as far as possible
  • This can be important

• An example uses the Tanzania survey


Tanzania agriculture survey

This is the variable we wish to explore. It is a value between 0 and 100


The data in Excel

The variable to explore before analysis


How to explore this value

• Can we do a stem and leaf plot?
  • “By hand” in Excel – but there are 16628 values!

• Even if automated, that is too many!

• The essence of a stem and leaf plot is to look at all the possible values

• Try a pivot table
  • a powerful feature in Excel
  • used previously on categorical data
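The idea behind using a pivot table here, counting how often each distinct value occurs, can also be sketched outside Excel. The Python below is a minimal illustration rather than the Excel procedure itself; the percentages are invented and value_counts is a hypothetical helper name.

```python
from collections import Counter

def value_counts(values):
    """Count how often each distinct value occurs, sorted by value."""
    return sorted(Counter(values).items())

# Invented percentages between 0 and 100; not the survey data
percentages = [0, 0, 2, 5, 10, 10, 10, 25, 33, 50, 50, 50, 75, 100, 100]

for value, count in value_counts(percentages):
    # One row per distinct value, with a bar of stars as a crude dot plot
    print(f"{value:>3}%  {count:>3}  {'*' * count}")
```

In this invented list the counts pile up at 10 and 50, which is the kind of pattern such a table is meant to expose.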


The pivot table


Some results


What do you deduce?

• There are oddities in “rounding”
  • Perhaps enumerator differences
  • Can this question be answered to 1%?

• So what should be done before analysis?

• First – look further at the data

• Excel can help – it can “drill down” to examine individual records

• The concept:
  • Use the table to look for oddities
  • Then examine them in more detail


Drilling down – an example

Make the 6 corresponding to 2% the active cell

Then double click to give the detail

4 of these values are from the same village – so same enumerator
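Drilling down amounts to pulling out the records behind one cell of the table and then asking what they have in common. A minimal Python sketch follows; the records are invented (arranged to echo the 6 records at 2% described above), and the record layout and function names are assumptions.

```python
from collections import Counter

# Invented records: (household id, village, percentage). Not the survey data.
records = [
    (1, "Village A", 2), (2, "Village A", 2), (3, "Village A", 2),
    (4, "Village A", 2), (5, "Village B", 2), (6, "Village C", 2),
    (7, "Village B", 50), (8, "Village C", 100),
]

def drill_down(rows, value):
    """Return the individual records behind one cell of the frequency table."""
    return [row for row in rows if row[2] == value]

def count_by_village(rows):
    """Count records per village, most frequent first."""
    return Counter(village for _, village, _ in rows).most_common()

detail = drill_down(records, 2)
print(len(detail), "records with the value 2%")
print(count_by_village(detail))
```

If one village dominates the detail, as here, the enumerator for that village is the natural next thing to check.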


What do you conclude – technique/results

• Technique
  • Stem and leaf plots when looking at small datasets
  • Pivot tables when datasets are large
  • But the principle is general
    • Numbers must be looked at carefully!
    • The principle can be adapted for the data and explored effectively in Excel

• Results
  • Did enumerators have different interpretations of the “precision” required in the percentages?
  • This needs further exploration, and the analysis needs to take account of this


Another new element in this session

• Exploratory analysis includes looking for oddities in the data

• Unexplained oddities cause variation that can make it difficult to detect the pattern, because they add unnecessary noise to the data

• How do you “tame the variation”?

• One way is to examine related variables

• This is important in the analysis
  • the next slide is a repeat from Session 3

• It is also a key weapon in data exploration and is covered in the practical


Slide from Module B2 Session 3

• To do good statistics you must fight the curse of variation

• Two main strategies to overcome variation

• 1. Take enough observations
  • In the Tanzania survey there were 3223 households just from this one region

• 2. Measure characteristics that explain variation

• Variation itself is not necessarily the problem
  • Variation you do not understand is the problem

• Here we start understanding variation at the exploration stage


Practical – three parts

• Tanzania data
  • practice what has been done in these slides

• Dot plots – split by a factor
  • demonstration and practice

• Swaziland data
  • apply the concepts
  • checking factors as well as numeric columns

• Then the key points are reviewed


Points for review after the practical

• Looking for individual problems
  • And surprising patterns

• Exploratory graphics
  • need to help the analyst and data checker
  • see dot plots on next slide

• Tables are also useful
  • especially with the facility to drill down

• Look at individual variables
  • and at records as a whole

• Trust your common sense
  • It is useful to estimate results
  • And question the computer if they are very different


Dot plots - yield by variety

Outliers (typing errors) are clear, but only because of the 2nd variable

They are not outliers overall
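The point that a value can be an outlier within its own group without being an outlier overall can be illustrated with a small Python sketch. It flags values that lie far from their own group's median, measured in units of that group's median absolute deviation (MAD). This is one possible rule chosen for illustration, not the method used in the course; the yields, variety names and threshold are all invented.

```python
from collections import defaultdict
from statistics import median

def flag_within_groups(pairs, threshold=5.0):
    """Flag (group, value) pairs that lie far from their own group's median.

    Distance is measured in units of the group's median absolute deviation.
    The threshold of 5 is an arbitrary choice for this sketch.
    """
    groups = defaultdict(list)
    for group, value in pairs:
        groups[group].append(value)
    flagged = []
    for group, values in groups.items():
        centre = median(values)
        mad = median(abs(v - centre) for v in values)
        for v in values:
            if mad > 0 and abs(v - centre) / mad > threshold:
                flagged.append((group, v))
    return flagged

# Invented yields by variety; 25 and 52 stand in for typing errors
yields = [
    ("new", 50), ("new", 52), ("new", 55), ("new", 48), ("new", 51), ("new", 25),
    ("old", 20), ("old", 22), ("old", 25), ("old", 24), ("old", 21), ("old", 52),
]

all_values = [v for _, v in yields]
print("overall range:", min(all_values), "to", max(all_values))
print("odd within their variety:", flag_within_groups(yields))
```

Both flagged values sit inside the overall range, so a single dot plot of all yields would not separate them from the rest; splitting by variety does.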


EDA is a continuous process

• EDA effectively is a continuation of the data checking process

• The example on the previous slide shows
  • how some oddities only become clear once the analysis is undertaken

• This continues into the formal analysis
  • where it involves looking at the “residuals”

• They are the unexplained variation
  • As discussed in Session 3!

• So analysis is not just a set of rules
  • It is a thoughtful process
  • Where you become the data detective!


Swaziland data was for checking


Investigating the column called Presence

What does 0 mean?

Why are there blanks?

Next steps:

1. Look at the questionnaire

2. Select these records

You are becoming detectives!


Codes for the column

Seems clear enough. Zeros and blanks still a puzzle


Selecting the blank records

i.e. serious problems with the whole record

Missing also

Too young and all the same

Crop code not recognised

Areas too large


Dot plot of area by Presence

Odd crop areas were ALL associated with odd codes for the column PRESENCE

It was found to be a data transfer problem with one byte missing in these records
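Selecting the records with odd codes can be sketched in Python as a simple filter. The valid codes, field names and records below are hypothetical, since the slides do not list them; the sketch treats blanks and any value outside the assumed code list as odd.

```python
# Assumed codebook for the Presence column (hypothetical; check the questionnaire)
VALID_CODES = {"1", "2", "3"}

# Invented records, as they might look after reading a text export
rows = [
    {"id": 1, "presence": "1", "area": "0.5"},
    {"id": 2, "presence": "",  "area": "950.0"},
    {"id": 3, "presence": "0", "area": "1.2"},
    {"id": 4, "presence": "2", "area": "0.8"},
    {"id": 5, "presence": " ", "area": "720.0"},
]

def odd_presence(records, valid_codes):
    """Return records whose 'presence' value is blank or not in the codebook."""
    odd = []
    for record in records:
        code = (record.get("presence") or "").strip()
        if code not in valid_codes:
            odd.append(record)
    return odd

for record in odd_presence(rows, VALID_CODES):
    # Print the whole record: other columns may be wrong as well
    print(record)
```

Printing the whole record rather than the single column matters here, because the problem described above affected other columns of the same records.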


Checking data quality and EDA

Where              Why                                   How                     By Whom
Before data entry  To ensure complete data set received  Manual check            Supervisor
During data entry  To highlight anomalies                Filter, dot plots etc.  Supervisor and helpers
Before analysis    Double check                          As above                Analyst/statistician
During analysis    Remain critical                       Residuals               Analyst/statistician


Importance – principles of official statistics

• Principle 2: Professional standards
  • It is unprofessional to analyse the data and report results without exploring critically at all stages

• Principle 4: Prevention of misuse
  • We risk misusing the data unless we explore the data critically

• Principle 5: Sources of statistics
  • Includes a requirement to avoid undue burden on respondents
  • We must process the data fully and effectively. This needs EDA
  • Otherwise the burden imposed on respondents is to some extent wasted


Can you now:

• Apply EDA concepts to a large dataset

• Explain the importance of EDA for data checking and at the start of the analysis

• Relate EDA to the principles of official statistics


Now you can organise the data for analysis

And then do an exploratory analysis

We show next how the analysis is easy IF your objectives are clear