SADC Course in Statistics
Exploratory Data Analysis (EDA) in the data analysis process
Module B2 Session 13
2To put your footer here go to View > Header and Footer
Learning Objectives
Students should be able to:
• Construct a dot plot for a numeric variable, split by a categorical variable
• Apply EDA concepts to a large dataset
• Explain the use of Excel’s pivot tables and filters in the EDA process
• Explain the importance of EDA for data checking and at the start of the analysis
• Relate EDA to the principles of official statistics
EDA with small and large data sets
• Session 12:
  • Stressed the importance of EDA
  • Introduced 2 new tools (dot plots and stem and leaf plots)
  • Practised with small data sets
• In this session we scale up
  • Look at large data sets
  • The tools do not scale up easily
  • But the concepts do scale up
  • EDA becomes even more crucial
• Most data sets are large!
  • at least compared with teaching examples
The essence of a stem and leaf plot
Figure panels: stem and leaf plot; stacked dot plot
The “leaf” shows the next digit. This can be useful in the exploration phase.
Data: 5.3, 5.4, 6.0, ….., 11.1, 11.9
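As a rough illustration of how the leaf keeps the next digit, here is a minimal Python sketch of a stem and leaf plot. The function name `stem_and_leaf` and the sample values are assumptions made for this example; they are not the full data from the slide, which is elided there.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return (stem, leaves) rows for values recorded to one decimal place."""
    rows = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # 5.3 -> 53, avoids floating-point noise
        stem, leaf = divmod(tenths, 10)   # stem = whole part, leaf = next digit
        rows[stem].append(leaf)
    # Include empty stems so that gaps in the data stay visible
    return [(s, rows.get(s, [])) for s in range(min(rows), max(rows) + 1)]

# Hypothetical values for illustration only (not the data on the slide)
sample = [5.3, 5.4, 6.0, 6.2, 6.2, 7.1, 7.8, 9.5, 11.1, 11.9]

result = stem_and_leaf(sample)
for stem, leaves in result:
    print(f"{stem:>2} | {''.join(str(d) for d in leaves)}")
```

Each row keeps the actual digits, so a repeated value such as 6.2 shows up as a repeated leaf rather than disappearing into a summary.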
What are the key points?
• We look at individual data points
  • not summaries at this stage
  • this is general for EDA
• The stem and leaf plot in particular
  • keeps the actual numbers as far as possible
  • This can be important
• An example uses the Tanzania survey
Tanzania agriculture survey
This is the variable we wish to explore. It is a value between 0 and 100
The data in Excel
The variable to explore before analysis
How to explore this value
• Can we do a stem and leaf plot?
  • “By hand” in Excel – but there are 16628 values!
• Even if automated, that is too many!
• The essence of a stem and leaf plot is to look at all the possible values
• Try a pivot table
  • a powerful feature in Excel
  • used previously on categorical data
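The same idea can be sketched outside Excel. The Python below counts how often each distinct value occurs, which is what a pivot table of counts on a numeric column gives. The list `percent` holds made-up values standing in for the survey column; it is not the Tanzania data.

```python
from collections import Counter

# Hypothetical percentages (0-100) standing in for the survey column
percent = [0, 10, 10, 25, 50, 50, 50, 2, 33, 100, 100, 75, 50, 10, 0]

counts = Counter(percent)

# One row per distinct value, in numeric order, like a pivot table of counts
table = sorted(counts.items())
for value, n in table:
    print(f"{value:>3}%  {n:>3}  {'*' * n}")
```

Because every distinct value gets its own row, unusual values such as 2% or 33% stay visible however many records there are.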
The pivot table
Some results
What do you deduce?
• There are oddities in “rounding”
  • Perhaps enumerator differences
  • Can this question be answered to 1%?
• So what should be done before analysis?
• First – look further at the data
• Excel can help – it can “drill down” to examine individual records
• The concept:
  • Use the table to look for oddities
  • Then examine them in more detail
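One possible way to look further at the rounding is to classify each value by the coarsest unit it fits (a multiple of 10, a multiple of 5, or neither) and tabulate that by enumerator. This is only a sketch: the enumerator labels, the values and the function `rounding_class` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def rounding_class(value):
    """Classify a whole-number percentage by the coarsest unit it fits."""
    if value % 10 == 0:
        return "multiple of 10"
    if value % 5 == 0:
        return "multiple of 5"
    return "other"

# Hypothetical (enumerator, percentage) pairs, for illustration only
records = [
    ("A", 10), ("A", 50), ("A", 30), ("A", 100), ("A", 20),
    ("B", 2), ("B", 37), ("B", 50), ("B", 1), ("B", 64),
    ("C", 25), ("C", 75), ("C", 50), ("C", 15), ("C", 40),
]

by_enumerator = defaultdict(Counter)
for enumerator, pct in records:
    by_enumerator[enumerator][rounding_class(pct)] += 1

for enumerator in sorted(by_enumerator):
    print(enumerator, dict(by_enumerator[enumerator]))
```

If one enumerator reports only multiples of 10 while another reports values to 1%, that suggests they interpreted the required precision differently.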
Drilling down – an example
Make the cell containing the 6 (the count for the value 2%) the active cell.
Then double-click it to show the detail.
4 of these values are from the same village, so they come from the same enumerator.
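Drilling down amounts to listing the full records behind one cell of the table. A minimal Python sketch follows, assuming each record is a dictionary. The field names `household`, `village` and `percent`, the function `drill_down` and all the values are hypothetical.

```python
from collections import Counter

# Hypothetical records; field names are assumptions, not the survey's own
records = [
    {"household": 101, "village": "V1", "percent": 2},
    {"household": 102, "village": "V1", "percent": 2},
    {"household": 103, "village": "V1", "percent": 50},
    {"household": 104, "village": "V2", "percent": 2},
    {"household": 105, "village": "V1", "percent": 2},
    {"household": 106, "village": "V3", "percent": 25},
    {"household": 107, "village": "V1", "percent": 2},
    {"household": 108, "village": "V4", "percent": 2},
]

def drill_down(rows, column, value):
    """Return the full records behind one cell of a frequency table."""
    return [r for r in rows if r[column] == value]

detail = drill_down(records, "percent", 2)
villages = Counter(r["village"] for r in detail)

for r in detail:
    print(r)
print(villages.most_common())
```

Counting the villages among the selected records shows at once whether an odd value is concentrated in one place.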
What do you conclude – technique/results
• Technique
  • Stem and leaf plots when looking at small datasets
  • Pivot tables when datasets are large
  • But the principle is general
    • Numbers must be looked at carefully!
    • The principle can be adapted for the data and explored effectively in Excel
• Results
  • Did enumerators have different interpretations of the “precision” required in the percentages?
    • This needs further exploration
    • and the analysis needs to take account of this
Another new element in this session
• Exploratory analysis includes looking for oddities in the data
• Unexplained oddities cause variation that can make it difficult to detect the pattern, because they add unnecessary noise to the data
• How do you “tame the variation”?
• One way is to examine related variables
• This is important in the analysis
  • the next slide is a repeat from Session 3
• It is also a key weapon in data exploration
  • and is covered in the practical
Slide from Module B2 Session 3
• To do good statistics you must fight the curse of variation
• Two main strategies to overcome variation
• 1. Take enough observations
  • In the Tanzania survey there were 3223 households just from this one region
• 2. Measure characteristics that explain variation
• Variation itself is not necessarily the problem
  • Variation you do not understand is the problem
• Here we start understanding variation at the exploration stage
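The second strategy can be illustrated with a small Python sketch. It compares the spread of a variable when a related characteristic is ignored with the spread that remains once each value is measured from its own group mean. The yields and the variety names are invented for the example.

```python
from statistics import mean, pstdev

# Hypothetical yields grouped by a characteristic (variety); illustration only
yields = {
    "local":    [1.0, 1.2, 0.9, 1.1, 1.0],
    "improved": [2.0, 2.3, 1.9, 2.1, 2.2],
}

all_values = [y for group in yields.values() for y in group]
overall_sd = pstdev(all_values)

# Residuals: each value minus the mean of its own group
residuals = []
for group in yields.values():
    m = mean(group)
    residuals.extend(y - m for y in group)
within_sd = pstdev(residuals)

print(f"SD ignoring variety:      {overall_sd:.2f}")
print(f"SD after allowing for it: {within_sd:.2f}")
```

The drop from the first figure to the second is the part of the variation that the measured characteristic explains; what remains is the variation still not understood.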
Practical – three parts
• Tanzania data
  • practise what has been done in these slides
• Dot plots – split by a factor
  • demonstration and practice
• Swaziland data
  • apply the concepts
  • checking factors
  • as well as numeric columns
• Then the key points are reviewed
Points for review after the practical
• Looking for individual problems
  • And surprising patterns
• Exploratory graphics
  • need to help the analyst and data checker
  • see dot plots on next slide
• Tables are also useful
  • especially with the facility to drill down
• Look at individual variables
  • and at records as a whole
• Trust your common sense
  • It is useful to estimate results
  • And question the computer if they are very different
Dot plots - yield by variety
Outliers (typing errors) are clear, but only because of the 2nd variable
They are not outliers overall
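A small Python sketch of the same effect, using text rows in place of a graph. The yields, the variety names and the cut-off `gap=0.8` are all assumptions chosen for illustration; the rule "far from the group median" is one simple choice, not the method used for the slide.

```python
from statistics import median

# Hypothetical yields by variety; illustration only
yields = {
    "local":    [1.0, 1.2, 0.9, 1.1, 2.1, 1.0],   # 2.1 might be a typing error
    "improved": [2.0, 2.3, 1.9, 2.1, 2.2, 1.0],   # 1.0 likewise
}

def dot_row(values, lo, hi, step=0.1):
    """One text row: the count of values in each bin from lo to hi."""
    bins = [0] * (round((hi - lo) / step) + 1)
    for v in values:
        bins[round((v - lo) / step)] += 1
    return "".join(str(n) if n else "." for n in bins)

def far_from_median(values, gap=0.8):
    """Values further than `gap` from the median (gap is an arbitrary choice)."""
    m = median(values)
    return [v for v in values if abs(v - m) > gap]

all_values = [v for group in yields.values() for v in group]
lo, hi = min(all_values), max(all_values)

print(f"{'all':>9} {dot_row(all_values, lo, hi)}")
for variety, values in yields.items():
    print(f"{variety:>9} {dot_row(values, lo, hi)}")

flagged = {variety: far_from_median(v) for variety, v in yields.items()}
print(flagged, far_from_median(all_values))
```

In the combined row the suspect values sit inside the overall range, so nothing is flagged. Once the rows are split by variety, each one stands apart from the rest of its group.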
EDA is a continuous process
• EDA effectively is a continuation of the data checking process
• The example on the previous slide shows
  • how some oddities only become clear once the analysis is undertaken
• This continues into the formal analysis
  • where it involves looking at the “residuals”
• They are the unexplained variation
  • As discussed in Session 3!
• So analysis is not just a set of rules
  • It is a thoughtful process
  • Where you become the data detective!
Swaziland data was for checking
Investigating the column called Presence
What does 0 mean?
Why are there blanks?
Next steps:
1. Look at the questionnaire
2. Select these records
You are becoming detectives!
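The second step, selecting the puzzling records, might look like this in Python. The rows, the field names and the set `documented_codes` are placeholders; the real codes have to come from the questionnaire.

```python
from collections import Counter

# Hypothetical rows; codes 1-3 are placeholders, not the questionnaire's codes
rows = [
    {"id": 1, "presence": 1}, {"id": 2, "presence": 2},
    {"id": 3, "presence": None}, {"id": 4, "presence": 0},
    {"id": 5, "presence": 1}, {"id": 6, "presence": None},
    {"id": 7, "presence": 3}, {"id": 8, "presence": 1},
]
documented_codes = {1, 2, 3}   # assumed; take the real set from the questionnaire

# Count every value, showing blanks explicitly rather than dropping them
counts = Counter("blank" if r["presence"] is None else r["presence"] for r in rows)
print(counts)

# Select the records that need a closer look
suspect = [r for r in rows if r["presence"] not in documented_codes]
for r in suspect:
    print(r)
```

Keeping blanks as their own category in the count matters here, since many summaries drop missing values silently.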
Codes for the column
Seems clear enough. Zeros and blanks still a puzzle
Selecting the blank records
i.e. serious problems with the whole record
Missing also
Too young and all the same
Crop code not recognised
Areas too large
Dot plot of area by Presence
Odd crop areas were ALL associated with odd codes for the column PRESENCE
It was found to be a data transfer problem with one byte missing in these records
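To see why a single missing byte can damage several columns at once, consider a fixed-width record. The sketch below is purely illustrative: the slides do not say what file format was used, so the fixed-width assumption, the layout in `LAYOUT`, the field widths and the record contents are all invented.

```python
# Hypothetical fixed-width layout: (field, width). Not the survey's real layout.
LAYOUT = [("household", 4), ("presence", 1), ("crop", 2), ("area", 4)]

def parse(line):
    """Split a fixed-width line into fields according to LAYOUT."""
    fields, pos = {}, 0
    for name, width in LAYOUT:
        fields[name] = line[pos:pos + width]
        pos += width
    return fields

good = "0123" + "1" + "07" + "0025"
damaged = good[:3] + good[4:]     # the same record with one byte lost early on

print(parse(good))
print(parse(damaged))
```

After the lost byte every later field is read one position too early, so the code column and the numeric columns go wrong together. That is the kind of pattern in which odd values in one column line up with odd values in another.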
Checking data quality and EDA
Where              Why                                   How                     By Whom
Before data entry  To ensure complete data set received  Manual check            Supervisor
During data entry  To highlight anomalies                Filter, dot plots etc.  Supervisor and helpers
Before analysis    Double check                          As above                Analyst/statistician
During analysis    Remain critical                       Residuals               Analyst/statistician
Importance – principles of official statistics
• Principle 2: Professional standards
  • It is unprofessional to analyse the data and report results without exploring critically at all stages
• Principle 4: Prevention of misuse
  • We risk misusing the data unless we explore the data critically
• Principle 5: Sources of statistics
  • Includes a requirement to avoid undue burden on respondents
  • We must process the data fully and effectively. This needs EDA
  • Otherwise the burden imposed on respondents is to some extent wasted
Can you now:
• Apply EDA concepts to a large dataset
• Explain the importance of EDA for data checking and at the start of the analysis
• Relate EDA to the principles of official statistics
Now you can organise the data for analysis
And then do an exploratory analysis
We show next how the analysis is easy IF your objectives are clear