Exploratory data analysis
Outliers
Correlations
Dr Kaz Negishi, MD, PhD, FACC, FESC
Menzies Research Institute Tasmania
‘Look before you leap’
When you’ve finished putting all the data into your spreadsheet…
What will you do next?
Why is this important??
x      y
10.34  10.13
11.57  8.79
10.01  11.38
10.49  9.95
9.38   6.14
10.51  9.87
7.54   11.79
8.39   13.15
7.61   10.6
8.02   10.14
10.14  9.84
10.41  10.68
12.3   11.5
10.7   10.93
11.39  10.06
13.68  12.15
10.68  11.73
11.39  7.59
7.41   9.69
10.82  9.67
Mean   10.14  10.29
SD     1.62   1.54
Correlation  -0.02
x      y
10.34  10.13
11.57  8.79
10.01  11.38
10.49  9.95
9.38   6.14
10.51  9.87
7.54   11.79
8.39   13.15
7.61   10.6
8.02   10.14
10.14  9.84
10.41  10.68
12.3   11.5
10.7   10.93
11.39  10.06
13.68  22.15
10.68  11.73
11.39  7.59
7.41   9.69
10.82  9.67
Mean   10.14  10.29
SD     1.62   1.54
Correlation  0.36
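To see how much a single value can move the correlation, here is a minimal Python sketch that recomputes Pearson's r for the 20 pairs listed above, once as given and once with the single changed y value (12.15 becoming 22.15). The function name pearson_r and the variable names are mine, and reading the changed value as a data-entry slip is an assumption, not something the slides state.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# The 20 (x, y) pairs from the slides
x = [10.34, 11.57, 10.01, 10.49, 9.38, 10.51, 7.54, 8.39, 7.61, 8.02,
     10.14, 10.41, 12.3, 10.7, 11.39, 13.68, 10.68, 11.39, 7.41, 10.82]
y = [10.13, 8.79, 11.38, 9.95, 6.14, 9.87, 11.79, 13.15, 10.6, 10.14,
     9.84, 10.68, 11.5, 10.93, 10.06, 12.15, 11.73, 7.59, 9.69, 9.67]

# Same data with the 16th y value changed from 12.15 to 22.15
# (assumed here to be a data-entry slip)
y_changed = list(y)
y_changed[15] = 22.15

r_original = pearson_r(x, y)
r_changed = pearson_r(x, y_changed)
print(f"r, original data:  {r_original:.2f}")
print(f"r, one value off:  {r_changed:.2f}")
```

The two results should land close to the -0.02 and 0.36 quoted on the slides: one changed cell out of 40 is enough to turn "no correlation" into a "weak correlation".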
Examples of outliers
What is an outlier?
Definition of Hawkins [Hawkins 1980]:
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism”
Hawkins D. Identification of Outliers. Chapman and Hall, London 1980
Statistical methods to detect outliers
• Normal distribution-based
– Smirnov-Grubbs test
– Dixon test
– Thompson test
– Tietjen-Moore test
– Chi-squared test
– Cochran test
• Distance-based
– Mahalanobis distance
– MSD (Modified Stahel-Donoho)
– LOF (Local Outlier Factor)
– Nearest Neighbour
• Clustering-based
• Density ratio estimation
– SVM (Support Vector Machine)
– One-class SVM
– Kernel
– KLIEP
– LSIF
– uLSIF
Outlier
• Error
– Typo?
– Measurement error?
• True
• By chance
If you have a very nice friend…
=IF(Sheet1!A1=Sheet2!A1,"",Sheet2!A1)
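The same double-entry check can be done outside Excel. Below is a minimal Python sketch, assuming the two people's entries are available as two lists of rows with the same shape; find_mismatches, sheet1 and sheet2 are hypothetical names, and the numbers are made up for illustration.

```python
def find_mismatches(first_entry, second_entry):
    """Return (row, column, first_value, second_value) for every differing cell.

    Assumes both entries have the same number of rows and columns;
    zip() silently ignores any extra rows or cells.
    """
    mismatches = []
    for row_idx, (row_a, row_b) in enumerate(zip(first_entry, second_entry)):
        for col_idx, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                mismatches.append((row_idx, col_idx, a, b))
    return mismatches

# Hypothetical example: the same three records typed in by two people
sheet1 = [[10.34, 10.13], [13.68, 12.15], [10.82, 9.67]]
sheet2 = [[10.34, 10.13], [13.68, 22.15], [10.82, 9.67]]

mismatches = find_mismatches(sheet1, sheet2)
for row, col, a, b in mismatches:
    print(f"row {row}, column {col}: {a} vs {b}")
```

Like the Excel formula, this only tells you where the two entries disagree; someone still has to go back to the source record to see which one is right.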
My old friend…
What can we do??
What can we do effectively??
Exploratory data analysis
• Summarise the data
– Five-number summary
• Visualize the data
– Scatter plots
– Histograms
In Excel
However…
• This is doable, but a bit painful, tedious and boring.
• When you perform advanced analyses in statistical software, this can be a source of confusion, because the summary values sit below the actual data.
Statistical software has handy functions
• “summary”
• “sum”
> summary(SBP)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   65.0   114.2   124.5   127.9   138.0   200.0       1
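For readers without a statistics package to hand, a similar summary can be sketched with Python's standard library. This is only an illustration: five_number_summary is a hypothetical helper, the blood pressure readings are made up, and other software may use slightly different rules for the quartiles.

```python
import statistics

def five_number_summary(values):
    """Min, quartiles and max, plus mean and a count of missing values (None)."""
    data = sorted(v for v in values if v is not None)
    # method="inclusive" interpolates between the observed data points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "mean": statistics.mean(data),
        "q3": q3,
        "max": data[-1],
        "missing": len(values) - len(data),
    }

# Hypothetical systolic blood pressure readings (mmHg), one of them missing
sbp = [110, 120, 130, 140, 150, None]
summary = five_number_summary(sbp)
for name, value in summary.items():
    print(f"{name:>8}: {value}")
```

An implausible minimum or maximum, or an unexpected number of missing values, is usually the first thing this kind of summary gives away.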
Ex. SPSS: Analyze -> Descriptive Statistics -> Frequencies
Output
Ex. SPSS: Analyze -> Descriptive Statistics -> Descriptives
Tips: Matrix Scatter
Boxplot (or box‐whisker plot)
[Figure: boxplot annotated with max, Q3 (75%), median, Q1 (25%), min, the IQR, and an outlier (more than 1.5 IQR below Q1)]
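The boxplot's outlier rule can also be applied directly to the numbers. The sketch below is a minimal Python version, assuming the usual symmetric form of the rule (more than 1.5 IQR below Q1 or above Q3); iqr_outliers is a hypothetical helper, and the data are the 20 y values from the earlier example with 22.15 in place of 12.15.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in data if v < lower_fence or v > upper_fence]

# y values from the earlier example, including the suspicious 22.15
y = [10.13, 8.79, 11.38, 9.95, 6.14, 9.87, 11.79, 13.15, 10.6, 10.14,
     9.84, 10.68, 11.5, 10.93, 10.06, 22.15, 11.73, 7.59, 9.69, 9.67]

outliers = iqr_outliers(y)
print(outliers)
```

With these data the rule flags 22.15 and also the low value 6.14. A flag is only a prompt to go back and check the record, not proof of an error.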
Correlation
• Pearson’s product-moment correlation coefficient (r)
• Spearman’s rank correlation coefficient (ρ) (= rho)
• Kendall tau rank correlation (τ) (= tau)
Data distribution         Methods for correlation
Normal distribution       Pearson’s r
Non-normal distribution   Spearman’s ρ
                          Kendall tau τ
Can I always use Pearson’s?
Pearson’s r = 0.88
Spearman’s rho = 1
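Why the two coefficients can disagree: Pearson's r measures how close the points are to a straight line, while Spearman's rho only asks whether the ordering is preserved. The Python sketch below illustrates this with made-up data (a strictly increasing but curved relationship, not the data behind the slide); pearson_r, ranks and spearman_rho are hypothetical helpers, and the ranking assumes no tied values.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rho computed as Pearson's r on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical data: y always increases with x, but along a curve
x = list(range(1, 11))
y = [math.exp(v) for v in x]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson's r    = {r:.2f}")
print(f"Spearman's rho = {rho:.2f}")
```

Here rho is 1 because the ordering is perfectly preserved, while r falls short of 1 because the points do not lie on a straight line.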
Regression and Correlation
Both of the scatter plots below give you “Y=X”, but they differ markedly in how close the relationship is.
Y=X
Spearman or Kendall??
Each dot comes from a dataset with a sample size of 30
Results from a simulation exercise (x1000).
Each dot comes from a dataset with a sample size of 10
Guilford’s Rule of Thumb
Rule of Thumb for Interpreting the Size of a Correlation Coefficient
Size of correlation (absolute value)   Interpretation
.90 to 1.00                            Very strong correlation
.70 to .89                             Strong correlation
.40 to .69                             Moderate correlation
.20 to .39                             Weak correlation
.00 to .19                             No or negligible correlation
Guilford, 1956
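The rule of thumb above is easy to turn into a small lookup. This is a minimal Python sketch: interpret_correlation is a hypothetical helper, and treating the lower bound of each band as the cut-off (so that a value such as 0.895 counts as "strong") is my assumption, since the table leaves small gaps between bands.

```python
def interpret_correlation(r):
    """Label a correlation coefficient using the rule-of-thumb bands above."""
    size = abs(r)
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    # Assumed cut-offs: the lower bound of each band in the table
    if size >= 0.90:
        return "Very strong correlation"
    if size >= 0.70:
        return "Strong correlation"
    if size >= 0.40:
        return "Moderate correlation"
    if size >= 0.20:
        return "Weak correlation"
    return "No or negligible correlation"

for value in (-0.02, 0.36, 0.88, 1.0):
    print(f"{value:+.2f}: {interpret_correlation(value)}")
```

The sign is ignored, so -0.75 and 0.75 get the same label; the direction of the relationship has to be read separately.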
Take home messages
Before you run fancy stats…
• Check that your data were entered correctly.
• Run exploratory analysis (incl. summary stats).
• Make a habit of always drawing scattergrams (as well as matrix plots, histograms and boxplots):
– it is important and useful for knowing what your data look like,
– and it is also a handy method for detecting outliers.
FIN