exploratory data analysis outliers correlations

Exploratory data analysis Outliers Correlations Dr Kaz Negishi, MD, PhD, FACC, FESC Menzies Research Institute Tasmania

Post on 03-Oct-2021

TRANSCRIPT

Page 1: Exploratory data analysis Outliers Correlations

Exploratory data analysis
Outliers
Correlations

Dr Kaz Negishi, MD, PhD, FACC, FESC

Menzies Research Institute Tasmania

Page 2: Exploratory data analysis Outliers Correlations

‘Look before you leap’ 

When you’ve finished putting all the data into your spreadsheet…

What will you do next?

Page 3: Exploratory data analysis Outliers Correlations

Why is this important??

                  x       y
              10.34   10.13
              11.57    8.79
              10.01   11.38
              10.49    9.95
               9.38    6.14
              10.51    9.87
               7.54   11.79
               8.39   13.15
               7.61   10.6
               8.02   10.14
              10.14    9.84
              10.41   10.68
              12.3    11.5
              10.7    10.93
              11.39   10.06
              13.68   12.15
              10.68   11.73
              11.39    7.59
               7.41    9.69
              10.82    9.67
Mean          10.14   10.29
SD             1.62    1.54
Correlation   -0.02

                  x       y
              10.34   10.13
              11.57    8.79
              10.01   11.38
              10.49    9.95
               9.38    6.14
              10.51    9.87
               7.54   11.79
               8.39   13.15
               7.61   10.6
               8.02   10.14
              10.14    9.84
              10.41   10.68
              12.3    11.5
              10.7    10.93
              11.39   10.06
              13.68   22.15
              10.68   11.73
              11.39    7.59
               7.41    9.69
              10.82    9.67
Mean          10.14   10.29
SD             1.62    1.54
Correlation    0.36
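The effect of that single changed value can be checked directly. The sketch below is a minimal illustration, not the author's own calculation: it takes the 20 (x, y) pairs listed in the first table, builds the second table by changing only the y-value paired with x = 13.68 (12.15 becomes 22.15), and computes Pearson's r for both. The helper name `pearson_r` is introduced here for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# The 20 (x, y) pairs from the first table.
x = [10.34, 11.57, 10.01, 10.49, 9.38, 10.51, 7.54, 8.39, 7.61, 8.02,
     10.14, 10.41, 12.3, 10.7, 11.39, 13.68, 10.68, 11.39, 7.41, 10.82]
y_clean = [10.13, 8.79, 11.38, 9.95, 6.14, 9.87, 11.79, 13.15, 10.6, 10.14,
           9.84, 10.68, 11.5, 10.93, 10.06, 12.15, 11.73, 7.59, 9.69, 9.67]

# Second table: identical except for the y-value paired with x = 13.68.
y_typo = list(y_clean)
y_typo[x.index(13.68)] = 22.15

r_clean = pearson_r(x, y_clean)
r_typo = pearson_r(x, y_typo)
print(f"without the changed value: r = {r_clean:.2f}")
print(f"with the changed value:    r = {r_typo:.2f}")
```

One changed digit in one cell is enough to move the coefficient from essentially zero to a value that would be read as a weak positive association.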

Page 4: Exploratory data analysis Outliers Correlations

Examples of outliers 

Page 5: Exploratory data analysis Outliers Correlations

What is an Outlier ??

Definition of Hawkins [Hawkins 1980]:

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism”

Hawkins D. Identification of Outliers. Chapman and Hall, London 1980

Page 6: Exploratory data analysis Outliers Correlations

Statistical methods to detect outliers

• Normal distribution-based
  – Smirnov-Grubbs test
  – Dixon test
  – Thompson test
  – Tietjen-Moore test
  – Chi-squared test
  – Cochran test

• Distance-based
  – Mahalanobis distance
  – MSD (Modified Stahel-Donoho)
  – LOF (Local Outlier Factor)
  – Nearest Neighbour

• Clustering-based

• Density ratio estimation
  – SVM (Support Vector Machine)
  – One-class SVM
  – Kernel
  – KLIEP
  – LSIF
  – uLSIF

Page 7: Exploratory data analysis Outliers Correlations

Outlier

• Error
  – Typo?
  – Measurement error?
• True
• By chance

Page 8: Exploratory data analysis Outliers Correlations

If you have a very nice friend…

=IF(Sheet1!A1=Sheet2!A1,"",Sheet2!A1)
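The formula leaves a cell blank when the two sheets agree and shows the Sheet2 value when they differ, which suits a double-entry check where a second person types in the same data independently. A minimal Python sketch of the same idea follows. The function name `find_mismatches` and the sample rows are hypothetical, and the sketch assumes both entries have the same number of rows and columns.

```python
def find_mismatches(entry_1, entry_2):
    """Return (row, column, value_1, value_2) for every cell that differs.

    Mirrors the spreadsheet formula: cells that agree are skipped,
    cells that disagree are reported. Rows and columns are numbered from 1.
    Assumes both entries have the same shape.
    """
    mismatches = []
    for r, (row_1, row_2) in enumerate(zip(entry_1, entry_2), start=1):
        for c, (v1, v2) in enumerate(zip(row_1, row_2), start=1):
            if v1 != v2:
                mismatches.append((r, c, v1, v2))
    return mismatches

# Hypothetical example: the same three records typed in by two people.
sheet1 = [[10.34, 10.13], [13.68, 12.15], [10.82, 9.67]]
sheet2 = [[10.34, 10.13], [13.68, 22.15], [10.82, 9.67]]

mismatches = find_mismatches(sheet1, sheet2)
for row, col, v1, v2 in mismatches:
    print(f"row {row}, column {col}: {v1} vs {v2}")
```

Each reported cell still has to be checked against the source record, because the comparison only shows that the two entries disagree, not which one is right.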

Page 9: Exploratory data analysis Outliers Correlations

My old friend…

Page 10: Exploratory data analysis Outliers Correlations

What can we do??

Page 11: Exploratory data analysis Outliers Correlations

What can we do effectively??

Page 12: Exploratory data analysis Outliers Correlations

Exploratory data analysis

• Summarise the data
  – Five-number summary

• Visualize the data
  – Scatter plots
  – Histograms
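A five-number summary is the minimum, first quartile, median, third quartile and maximum. The sketch below uses Python's standard `statistics.quantiles`. The blood-pressure-like values are made up for illustration, and the `"inclusive"` quartile method is an assumption: packages differ in how they interpolate quartiles, so results can vary slightly between tools.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return min, Q1, median, Q3 and max of a list of numbers."""
    data = sorted(values)
    # n=4 splits the data into quartiles and returns the three cut points.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}

# Hypothetical systolic blood pressure readings (mmHg).
readings = [65, 110, 112, 115, 118, 120, 122, 125, 128, 130, 200]

summary = five_number_summary(readings)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

A minimum or maximum that sits far from the quartiles, as 65 and 200 do here, is the first hint that a value deserves a second look.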

Page 13: Exploratory data analysis Outliers Correlations

In Excel

Page 14: Exploratory data analysis Outliers Correlations

However…

• This is doable, but a bit painful, tedious, or boring.
• When you perform advanced analyses in statistical software, these summary statistics can be a source of confusion, because they sit below the actual data.

Page 15: Exploratory data analysis Outliers Correlations

Statistical software has handy functions

• “summary”

• “sum”

> summary(SBP)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   65.0   114.2   124.5   127.9   138.0   200.0       1

Page 16: Exploratory data analysis Outliers Correlations

Ex. SPSS Analysis‐> Frequencies

Page 17: Exploratory data analysis Outliers Correlations

Output

Page 18: Exploratory data analysis Outliers Correlations

Ex. SPSS Analysis‐> Descriptive

Page 19: Exploratory data analysis Outliers Correlations

Tips: Matrix Scatter

Page 20: Exploratory data analysis Outliers Correlations

Boxplot (or box-whisker plot)

[Figure: boxplot with labels for min, Q1 (25%), median, Q3 (75%), max, the IQR, and an outlier (> 1.5 IQR below Q1)]
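The boxplot rule can also be applied without drawing anything. The sketch below flags values more than 1.5 IQR below Q1, as labelled on the plot, and also values more than 1.5 IQR above Q3, which is the usual companion rule and is an addition here. The function name `tukey_outliers`, the sample readings and the `"inclusive"` quartile method are illustrative assumptions.

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical systolic blood pressure readings (mmHg).
sbp_values = [65, 110, 112, 115, 118, 120, 122, 125, 128, 130, 200]

flagged = tukey_outliers(sbp_values)
print("flagged for checking:", flagged)
```

A flagged value is only a candidate: whether it is a typo, a measurement error or a true extreme still has to be decided by going back to the source.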

Page 21: Exploratory data analysis Outliers Correlations

Correlation

• Pearson’s product-moment correlation coefficient (r)
• Spearman’s rank correlation coefficient (ρ) (= rho)
• Kendall tau rank correlation (τ) (= tau)

Data distribution          Methods for correlation
Normal distribution        Pearson’s r
Non-normal distribution    Spearman’s ρ, Kendall tau τ

Page 22: Exploratory data analysis Outliers Correlations

Can I use Pearson’s always?

Pearson’s r = 0.88

Spearman’s rho = 1
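A gap like this appears whenever y rises steadily with x but not along a straight line: the ranks agree perfectly, so Spearman's rho is 1, while Pearson's r falls short of 1. The sketch below illustrates this with made-up data (y = x to the fourth power). It is not the data behind the slide, and the helper names are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n in ascending order. Assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rho computed as Pearson's r on the ranks (no ties)."""
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical monotonic but non-linear relationship.
x = list(range(1, 11))
y = [value ** 4 for value in x]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson's r    = {r:.2f}")
print(f"Spearman's rho = {rho:.2f}")
```

Pearson's r measures how well a straight line describes the data, so a scatter plot should be checked before relying on it.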

Page 23: Exploratory data analysis Outliers Correlations

Regression and Correlation

Both of the scatter plots below give you “Y=X”, but they differ markedly in how closely the points follow that line.

Y=X
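The distinction can be reproduced with two constructed datasets. In the sketch below, every x value appears twice, once with y shifted up by d and once shifted down by d, so the least-squares line is Y = X for any d while the correlation depends on the size of d. The data and the function name `fit_and_correlate` are illustrative assumptions, not the data behind the plots.

```python
from math import sqrt

def fit_and_correlate(xs, ys):
    """Return (slope, intercept, r) for the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy / sqrt(sxx * syy)

# Each x from 1 to 10 appears twice.
xs = [x for x in range(1, 11) for _ in range(2)]

def scattered(d):
    """y = x + d for the first copy of each x and y = x - d for the second."""
    return [x + d if i % 2 == 0 else x - d for i, x in enumerate(xs)]

tight = fit_and_correlate(xs, scattered(0.2))
loose = fit_and_correlate(xs, scattered(3.0))

for label, (slope, intercept, r) in (("tight", tight), ("loose", loose)):
    print(f"{label}: Y = {slope:.2f} X + {intercept:.2f}, r = {r:.2f}")
```

The regression line describes the average relationship, and the correlation coefficient describes how tightly the points cluster around it, so both are worth reporting.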

Page 24: Exploratory data analysis Outliers Correlations

Spearman or Kendall??

Each dot comes from a dataset with a sample size of 30

Results from a simulation exercise (×1000).

Each dot comes from a dataset with a sample size of 10
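The slides do not describe how the simulation was set up, so the sketch below is only one possible design, with every modelling choice an assumption: x is drawn from a standard normal distribution, y is x plus standard normal noise, and Spearman's rho and Kendall's tau (tau-a, no ties) are computed for each of 1000 simulated datasets at sample sizes 30 and 10.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n in ascending order. Assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rho computed as Pearson's r on the ranks (no ties)."""
    return pearson_r(ranks(xs), ranks(ys))

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs / all pairs."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if product > 0:
                score += 1      # concordant pair
            elif product < 0:
                score -= 1      # discordant pair
    return score / (n * (n - 1) / 2)

def simulate(sample_size, n_datasets=1000, seed=0):
    """Return one (rho, tau) pair per simulated dataset."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    results = []
    for _ in range(n_datasets):
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [x + rng.gauss(0, 1) for x in xs]   # assumed data model
        results.append((spearman_rho(xs, ys), kendall_tau(xs, ys)))
    return results

results_30 = simulate(30)
results_10 = simulate(10)
for size, results in ((30, results_30), (10, results_10)):
    mean_rho = sum(rho for rho, _ in results) / len(results)
    mean_tau = sum(tau for _, tau in results) / len(results)
    print(f"n = {size}: mean rho = {mean_rho:.2f}, mean tau = {mean_tau:.2f}")
```

Plotting each (rho, tau) pair as one dot gives a picture comparable in kind to the slide, although the exact pattern depends on the assumed data model.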

Page 25: Exploratory data analysis Outliers Correlations

Guilford’s Rule of Thumb

Rule of Thumb for Interpreting the Size of a Correlation Coefficient

Size of Correlation (absolute value)    Interpretation
.90 to 1.00                             Very strong correlation
.70 to .89                              Strong correlation
.40 to .69                              Moderate correlation
.20 to .39                              Weak correlation
.00 to .19                              No or negligible

Guilford, 1956
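The table above translates directly into a small lookup. The sketch below is a convenience helper, not part of the original rule. The function name `interpret_correlation` is hypothetical, and because the table lists bands to two decimals, the code assumes each band runs up to the start of the next one (so 0.895 counts as "Strong").

```python
def interpret_correlation(r):
    """Map a correlation coefficient to the rule-of-thumb label in the table."""
    size = abs(r)   # the table is defined on the absolute value
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if size >= 0.90:
        return "Very strong correlation"
    if size >= 0.70:
        return "Strong correlation"
    if size >= 0.40:
        return "Moderate correlation"
    if size >= 0.20:
        return "Weak correlation"
    return "No or negligible"

for value in (-0.02, 0.36, 0.88, 1.0):
    print(f"{value:+.2f}: {interpret_correlation(value)}")
```

The labels describe only the size of the coefficient. They say nothing about whether the relationship is linear or whether an outlier is driving it.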

Page 26: Exploratory data analysis Outliers Correlations

Take home messages

Before you run fancy stats…
• Check that your data were entered correctly.
• Run exploratory analysis (incl. summary stats).
• Make a habit of always drawing scattergrams (including matrix plots, histograms and boxplots):
  – it is an important and useful way to get to know your data,
  – and it is also a handy method for detecting outliers.

Page 27: Exploratory data analysis Outliers Correlations

FIN