exploratory data analysis outliers correlations

Exploratory data analysis
Outliers
Correlations

Dr Kaz Negishi, MD, PhD, FACC, FESC

Menzies Research Institute Tasmania

‘Look before you leap’

When you’ve finished putting all the data into your spreadsheet…

What will you do next?

Why is this important??

                 x       y
             10.34   10.13
             11.57    8.79
             10.01   11.38
             10.49    9.95
              9.38    6.14
             10.51    9.87
              7.54   11.79
              8.39   13.15
              7.61   10.60
              8.02   10.14
             10.14    9.84
             10.41   10.68
             12.30   11.50
             10.70   10.93
             11.39   10.06
             13.68   12.15
             10.68   11.73
             11.39    7.59
              7.41    9.69
             10.82    9.67
Mean         10.14   10.29
SD            1.62    1.54
Correlation  -0.02

                 x       y
             10.34   10.13
             11.57    8.79
             10.01   11.38
             10.49    9.95
              9.38    6.14
             10.51    9.87
              7.54   11.79
              8.39   13.15
              7.61   10.60
              8.02   10.14
             10.14    9.84
             10.41   10.68
             12.30   11.50
             10.70   10.93
             11.39   10.06
             13.68   22.15
             10.68   11.73
             11.39    7.59
              7.41    9.69
             10.82    9.67
Mean         10.14   10.79
SD            1.62    3.00
Correlation   0.36
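The effect of that single changed value can be checked directly. The sketch below uses Python with only the standard library (a choice made here; the slides themselves show Excel, R and SPSS) and recomputes Pearson's r for the 20 pairs, first as listed and then with the 16th y value set to 22.15 instead of 12.15. The helper name pearson_r is illustrative.

```python
from math import sqrt

# The 20 (x, y) pairs from the slide.
x = [10.34, 11.57, 10.01, 10.49, 9.38, 10.51, 7.54, 8.39, 7.61, 8.02,
     10.14, 10.41, 12.30, 10.70, 11.39, 13.68, 10.68, 11.39, 7.41, 10.82]
y = [10.13, 8.79, 11.38, 9.95, 6.14, 9.87, 11.79, 13.15, 10.60, 10.14,
     9.84, 10.68, 11.50, 10.93, 10.06, 12.15, 11.73, 7.59, 9.69, 9.67]


def pearson_r(a, b):
    """Pearson's product-moment correlation of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    s_ab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    s_aa = sum((p - mean_a) ** 2 for p in a)
    s_bb = sum((q - mean_b) ** 2 for q in b)
    return s_ab / sqrt(s_aa * s_bb)


# Same data, but the 16th y value is 22.15 instead of 12.15.
y_outlier = y.copy()
y_outlier[15] = 22.15

r_clean = pearson_r(x, y)
r_outlier = pearson_r(x, y_outlier)
print(f"r without the outlier: {r_clean:.2f}")
print(f"r with the outlier:    {r_outlier:.2f}")
```

One value out of 20 is enough to move the coefficient from about -0.02 to about 0.36, the two figures shown in the tables.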

Examples of outliers

What is an outlier?

Definition of Hawkins [Hawkins 1980]:

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism”

Hawkins D. Identification of Outliers. Chapman and Hall, London 1980

Statistical methods to detect outliers

• Normal distribution-based
  – Smirnov-Grubbs test
  – Dixon test
  – Thompson test
  – Tietjen-Moore test
  – Chi-squared test
  – Cochran test
• Distance-based
  – Mahalanobis distance
  – MSD (Modified Stahel-Donoho)
  – LOF (Local Outlier Factor)
  – Nearest Neighbour
• Clustering-based
• Density ratio estimation
  – SVM (Support Vector Machine)
  – One-class SVM
  – Kernel
  – KLIEP
  – LSIF
  – uLSIF
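As an illustration of one distance-based method from this list, the sketch below computes the squared Mahalanobis distance of each (x, y) pair from the data centre, using the 20 pairs from the earlier table that contains the value 22.15. It is a minimal two-variable version written in plain Python; the function name and the use of the sample covariance (divisor n - 1) are assumptions made here, and no cut-off for declaring an outlier is implied.

```python
# Slide data with the 16th y value equal to 22.15 (the outlier example).
x = [10.34, 11.57, 10.01, 10.49, 9.38, 10.51, 7.54, 8.39, 7.61, 8.02,
     10.14, 10.41, 12.30, 10.70, 11.39, 13.68, 10.68, 11.39, 7.41, 10.82]
y = [10.13, 8.79, 11.38, 9.95, 6.14, 9.87, 11.79, 13.15, 10.60, 10.14,
     9.84, 10.68, 11.50, 10.93, 10.06, 22.15, 11.73, 7.59, 9.69, 9.67]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Entries of the 2 x 2 sample covariance matrix (divisor n - 1).
s_xx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
s_yy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
det = s_xx * s_yy - s_xy ** 2


def mahalanobis_sq(a, b):
    """Squared Mahalanobis distance of the point (a, b) from the mean."""
    dx, dy = a - mean_x, b - mean_y
    # Closed form of [dx dy] * inverse(covariance) * [dx dy]^T for 2 x 2.
    return (s_yy * dx * dx - 2 * s_xy * dx * dy + s_xx * dy * dy) / det


distances = [mahalanobis_sq(a, b) for a, b in zip(x, y)]
most_extreme = max(range(n), key=lambda i: distances[i])
print(f"Most extreme pair: row {most_extreme + 1}, "
      f"(x, y) = ({x[most_extreme]}, {y[most_extreme]})")
```

With these data the pair in row 16 stands furthest from the centre, which is the row holding 22.15.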

Outlier

• Error
  – Typo?
  – Measurement error?
• True
• By chance

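If you have a very nice friend…

A second copy of the data (Sheet2), for example one typed in independently by your friend, can be compared with yours (Sheet1) cell by cell. The formula leaves a cell blank where the two entries agree and shows the Sheet2 value where they differ:

=IF(Sheet1!A1=Sheet2!A1,"",Sheet2!A1)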
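Assuming the formula is meant for double data entry, the same cell-by-cell check can be sketched outside the spreadsheet. The Python below is a minimal illustration; the function name find_mismatches and the three example rows are hypothetical.

```python
def find_mismatches(first_entry, second_entry):
    """Return (row, column, first value, second value) for each differing cell."""
    mismatches = []
    for r, (row_a, row_b) in enumerate(zip(first_entry, second_entry), start=1):
        for c, (a, b) in enumerate(zip(row_a, row_b), start=1):
            if a != b:
                mismatches.append((r, c, a, b))
    return mismatches


# Hypothetical example: the same three rows typed in twice.
sheet1 = [[10.34, 10.13], [13.68, 22.15], [10.82, 9.67]]
sheet2 = [[10.34, 10.13], [13.68, 12.15], [10.82, 9.67]]

for row, col, a, b in find_mismatches(sheet1, sheet2):
    print(f"row {row}, column {col}: {a} vs {b}")
```

A mismatch only says that the two entries disagree; the original record still has to be consulted to see which one is right.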

My old friend…

What can we do??

What can we do effectively??

Exploratory data analysis

• Summarise the data
  – Five-number summary
• Visualize the data
  – Scatter plots
  – Histograms
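A five-number summary (minimum, Q1, median, Q3, maximum) takes only a few lines of code. The sketch below uses Python's standard statistics module on the y column of the first table on the earlier slide; the function name is illustrative, and the quartile method is one of several definitions in use, so other software may report slightly different Q1 and Q3 values.

```python
from statistics import quantiles

# The y column from the first table on the slides.
y = [10.13, 8.79, 11.38, 9.95, 6.14, 9.87, 11.79, 13.15, 10.60, 10.14,
     9.84, 10.68, 11.50, 10.93, 10.06, 12.15, 11.73, 7.59, 9.69, 9.67]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), ignoring missing values (None)."""
    data = sorted(v for v in values if v is not None)
    # method="inclusive" interpolates between data points; other packages
    # may use a different quartile definition.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


summary = five_number_summary(y)
print("Min, Q1, Median, Q3, Max:", summary)
```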

In Excel

However…

• This is doable, but a bit painful, tedious or boring.
• When you perform advanced analyses in statistical software, this can be a source of confusion, because the summary rows sit below the actual data.

Statistical software has handy functions

• “summary”

• “sum”

> summary(SBP)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   65.0   114.2   124.5   127.9   138.0   200.0       1

Ex. SPSS: Analyze -> Descriptive Statistics -> Frequencies

Output

Ex. SPSS: Analyze -> Descriptive Statistics -> Descriptives

Tips: Matrix Scatter

Boxplot (or box-whisker plot)

• max
• Q3 (75%)
• median
• Q1 (25%)
• min
• IQR (the box, from Q1 to Q3)
• Outlier (more than 1.5 × IQR below Q1 or above Q3)
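The 1.5 × IQR rule that a boxplot uses to mark outliers can be applied without drawing the plot. The sketch below is a minimal Python version using the standard statistics module; the function name iqr_outliers and the choice of quartile method are assumptions made here, and the data are the y column from the earlier table containing 22.15.

```python
from statistics import quantiles

# y column from the slides, with the 16th value equal to 22.15.
y = [10.13, 8.79, 11.38, 9.95, 6.14, 9.87, 11.79, 13.15, 10.60, 10.14,
     9.84, 10.68, 11.50, 10.93, 10.06, 22.15, 11.73, 7.59, 9.69, 9.67]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


flagged = iqr_outliers(y)
print("Flagged values:", flagged)
```

With these data the rule flags 22.15 and also 6.14. A flagged value is a prompt to go back and check the record, not proof that it is an error.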

Correlation

• Pearson’s product-moment correlation coefficient (r)
• Spearman’s rank correlation coefficient (ρ) (= rho)
• Kendall tau rank correlation (τ) (= tau)

Data distribution          Methods for correlation
Normal distribution        Pearson’s r
Non-normal distribution    Spearman’s ρ
                           Kendall tau τ

Can I always use Pearson’s?

Pearson’s r = 0.88

Spearman’s rho = 1
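A rho of 1 together with an r below 1 means the relationship is perfectly monotonic but not a straight line. The sketch below reproduces that pattern in plain Python on hypothetical data (y = x cubed); it does not use the data behind the slide's figure, so the Pearson value it prints is not the 0.88 shown above. The helpers assume no tied values, and the Kendall version is the simple tau-a.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson's product-moment correlation of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    s_ab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    s_aa = sum((p - mean_a) ** 2 for p in a)
    s_bb = sum((q - mean_b) ** 2 for q in b)
    return s_ab / sqrt(s_aa * s_bb)


def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(a, b):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(ranks(a), ranks(b))


def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(a)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (a[i] - a[j]) * (b[i] - b[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical data: y always increases with x, but along a curve.
x = list(range(1, 11))
y = [v ** 3 for v in x]

print(f"Pearson r    = {pearson_r(x, y):.2f}")
print(f"Spearman rho = {spearman_rho(x, y):.2f}")
print(f"Kendall tau  = {kendall_tau(x, y):.2f}")
```

Both rank-based coefficients reach 1 because they only use the ordering of the values, while Pearson's r is pulled below 1 by the curvature.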

Regression and Correlation

Both of the scatter plots below give you “Y = X”, but they differ markedly in how close the relationship is.

Y=X
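The same point can be made numerically. In the sketch below, two small hypothetical datasets (not the ones plotted on the slide) have exactly the same least-squares line, Y = X, yet very different correlation coefficients. The function name fit_and_correlate is illustrative.

```python
from math import sqrt


def fit_and_correlate(x, y):
    """Return (slope, intercept, r) for the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    r = s_xy / sqrt(s_xx * s_yy)
    return slope, intercept, r


x = [1, 2, 3, 4, 5]
y_tight = [1, 2, 3, 4, 5]  # every point lies exactly on Y = X
y_loose = [2, 0, 3, 6, 4]  # points scattered around Y = X

for label, y in (("tight", y_tight), ("loose", y_loose)):
    slope, intercept, r = fit_and_correlate(x, y)
    print(f"{label}: Y = {intercept:.2f} + {slope:.2f} X, r = {r:.2f}")
```

The regression line describes the average relationship; the correlation coefficient describes how tightly the points cluster around it.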

Spearman or Kendall??

Results from a simulation exercise (×1000)

Each dot comes from a dataset with a sample size of 30

Each dot comes from a dataset with a sample size of 10

Guilford’s Rule of Thumb

Rule of Thumb for Interpreting the Size of a Correlation Coefficient

Size of correlation (absolute value)    Interpretation
.90 to 1.00                             Very strong correlation
.70 to .89                              Strong correlation
.40 to .69                              Moderate correlation
.20 to .39                              Weak correlation
.00 to .19                              No or negligible correlation

Guilford, 1956
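The table translates directly into a small lookup. The Python sketch below is illustrative only: the function name is hypothetical, and values that fall between two printed bands (for example 0.895) are assigned using each band's lower bound, which is an assumption the table itself does not settle.

```python
def interpret_correlation(r):
    """Label a correlation coefficient using the rule-of-thumb bands."""
    size = abs(r)
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if size >= 0.90:
        return "Very strong correlation"
    if size >= 0.70:
        return "Strong correlation"
    if size >= 0.40:
        return "Moderate correlation"
    if size >= 0.20:
        return "Weak correlation"
    return "No or negligible correlation"


# Coefficients that appear elsewhere in these slides.
for r in (-0.02, 0.36, 0.88, 1.0):
    print(f"{r:5.2f}: {interpret_correlation(r)}")
```

The labels describe size only; they say nothing about whether a correlation is statistically significant or driven by a single outlier.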

Take home messages

Before you run fancy stats…
• Check that your data were entered correctly.
• Run an exploratory analysis (incl. summary stats).
• Make a habit of always drawing scattergrams (including matrix plots, histograms and boxplots):
  – it is an important and useful way to get to know your data,
  – and it is also a handy method for detecting outliers.
