Statistics 1: tests and linear models

TRANSCRIPT

Page 1

Statistics 1: tests and linear models

Page 2

How to get started?

• Exploring data graphically:

– Scatterplot
– Histogram
– Boxplot
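A minimal R sketch of these three plot types, using the built-in cars dataset (speed and stopping distance of 50 cars) purely as example data; substitute your own data frame and variables.

  data(cars)                         # example data: speed and stopping distance
  plot(cars$speed, cars$dist,        # scatterplot of two numeric variables
       xlab = "Speed", ylab = "Stopping distance")
  hist(cars$dist,                    # histogram: distribution of a single variable
       main = "Stopping distances", xlab = "Distance")
  boxplot(cars$dist,                 # boxplot: median, quartiles, possible outliers
          ylab = "Stopping distance")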

Page 3

Important things to check

• Are all the variables in the correct format?
• Do there seem to be outliers?

– Mistake in data coding?

Initial structure of the analyses

• What is the response variable?
• What are the explanatory variables?
• Explore patterns visually (see the sketch below)

– Correlations?
– Differences between groups?
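As a rough sketch of these checks in R (mydata and y are hypothetical names standing in for your own data frame and response variable):

  str(mydata)         # are factors really factors, numbers really numeric?
  summary(mydata)     # implausible minima/maxima often reveal coding mistakes
  boxplot(mydata$y)   # extreme values show up as isolated points
  pairs(mydata)       # scatterplot matrix: correlations and group differences at a glance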

Page 4

Summary statistics

• summary(data), summary(x)
• mean(x), median(x)
• range(x)
• var(x), sd(x)
• min(x), max(x)
• quantile(x,p)
• tapply(), table()
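A few of these in action, again on built-in example data (cars for the numeric summaries, chickwts for the grouped ones; the choice of datasets is only for illustration):

  summary(cars)                                  # per-column summary of a data frame
  mean(cars$dist); median(cars$dist)
  range(cars$dist); var(cars$dist); sd(cars$dist)
  quantile(cars$dist, 0.9)                       # 90% quantile
  tapply(chickwts$weight, chickwts$feed, mean)   # group means of weight by feed type
  table(chickwts$feed)                           # counts per factor level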

Page 5

Tests

• Test for normality
– Shapiro-Wilk test: shapiro.test()
– QQ plot: qqnorm(), qqline()

• Homogeneity of variance
– var.test() (for two groups)
– bartlett.test() (for several groups)
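A small sketch of these tests on simulated data (rnorm() draws, so the exact output will vary from run to run):

  x <- rnorm(30)                   # simulated group 1
  y <- rnorm(30, sd = 2)           # simulated group 2, larger spread
  shapiro.test(x)                  # Shapiro-Wilk: H0 = data come from a normal distribution
  qqnorm(x); qqline(x)             # points close to the line suggest normality
  var.test(x, y)                   # F test: equal variances in two groups?
  vals <- c(rnorm(10), rnorm(10, sd = 2), rnorm(10, sd = 3))
  grp  <- gl(3, 10)                # factor with three groups of 10
  bartlett.test(vals, grp)         # equal variances across several groups?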

Page 6

Tests for differences in means

• Student’s t-test: t.test()
– One- or two-sample test
• Testing whether a sample mean differs from e.g. 0
• Testing whether the sample means of two groups differ
– Paired / non-paired
• Are pairs of measurements associated?
– Variance homogeneous / non-homogeneous
– Assumes normally distributed data

• Wilcoxon’s test: wilcox.test()
– Normality not required
– Paired / non-paired

DEMO 1
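A minimal sketch of these tests with simulated data (group names and values are invented for the example):

  a <- rnorm(20, mean = 5)
  b <- rnorm(20, mean = 6)
  t.test(a, mu = 0)                # one-sample: does the mean of a differ from 0?
  t.test(a, b)                     # two-sample (Welch: variances not assumed equal)
  t.test(a, b, var.equal = TRUE)   # classic two-sample t-test with equal variances
  t.test(a, b, paired = TRUE)      # paired test: a[i] and b[i] from the same unit
  wilcox.test(a, b)                # non-parametric alternative, normality not required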

Page 7

Correlation

• cor(x,y) calculates the correlation coefficient between two numeric variables
– close to 0: no correlation
– close to -1 or 1: strong (negative or positive) correlation

• Is the correlation significant?
– cor.test(y,x)
– Note: also check graphically!
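For example, with the built-in cars data (used here only as convenient example data):

  cor(cars$speed, cars$dist)        # Pearson correlation coefficient
  cor.test(cars$dist, cars$speed)   # is the correlation significantly different from 0?
  plot(cars$speed, cars$dist)       # always inspect the relationship graphically as well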

Page 8

Confidence intervals and standard errors

• Typical ways of describing uncertainty in a parameter estimate (e.g. the mean)
– Standard error: the SE of the mean is sqrt(var(xx)/n)
– Confidence interval (95%)
• The range within which the value lies with 95% probability
• Normal approximation: ±1.96*SE, so that the 95% CI for mean(xx) is

[mean(xx) - 1.96*SE(xx), mean(xx) + 1.96*SE(xx)]

• If the data are not normally distributed, bootstrapping can be helpful
– Let’s assume we have measured the age at death for 100 rats

The 95% CI for the mean age at death can then be derived as follows:

» 1. Take a sample of 100 rats with replacement from the original data

» 2. Calculate the mean

» 3. Repeat steps 1 & 2 e.g. 1000 times, always recording the mean

» 4. The 2.5% and 97.5% quantiles of the recorded means give the 95% CI for the mean
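A base-R sketch of this recipe, with simulated ages standing in for the 100 rats (the data, the seed, and the 1000 replicates are assumptions made only for the example):

  set.seed(1)
  age <- rexp(100, rate = 1/24)            # invented 'age at death' data for 100 rats
  boot_means <- replicate(1000, {          # steps 1-3: resample and record the mean
    mean(sample(age, replace = TRUE))
  })
  quantile(boot_means, c(0.025, 0.975))    # step 4: 2.5% and 97.5% quantiles = 95% CI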

EXERCISE TOMORROW!

Page 9

Linear model and regression

• Models the response variable through additive effects of explanatory variables
– E.g. how does the stopping distance of a car depend on its speed?
– Or how does the weight of an animal depend on its length?

Page 10

The formula

Y = a + b1x1 + … + bnxn + ε

where
– Y is the response variable
– a is the intercept
– x1, …, xn are the explanatory variables (with coefficients b1, …, bn)
– ε is a normally distributed error term, i.e. ‘random noise’

Regression, ANOVA or ANCOVA?

Page 11

How to interpret…

• Intercept:
– Baseline value for Y
– The value that Y is expected to take if all the predictors are 0
– If one or more of the predictors are factors, then this is the value predicted for the reference levels of those factors

• Coefficients bn

– If xn is a numeric variable, then increasing xn by one unit increases the value of Y by bn

– If xn is a factor, then the parameter bn takes a different value for each factor level, so that Y increases by the value of bn corresponding to the level of xn

• Note: the reference level of x is included in the intercept
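To see this concretely, here is a small example with R's built-in iris data, where Petal.Width is numeric and Species is a factor (the dataset is chosen only for illustration):

  m <- lm(Petal.Length ~ Petal.Width + Species, data = iris)
  summary(m)$coefficients
  # (Intercept)        predicted Petal.Length when Petal.Width = 0,
  #                    for the reference level of Species ("setosa")
  # Petal.Width        change in Petal.Length per one-unit increase in Petal.Width
  # Speciesversicolor  shift relative to the reference level "setosa"
  # Speciesvirginica   shift relative to the reference level "setosa"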

Page 12

Fitting the model in R

• lm(y~x, data=mydata), where mydata is your data frame
• Formula:

y~x      intercept + the effect of x
y~x-1    no intercept
y~x+z    multiple regression with main effects
y~x*z    multiple regression with main effects and interactions

• Exploring the model: summary(model), anova(model), plot(model)
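Putting the notation together on the built-in cars data (a sketch; swap in your own data frame and variables; d, y, x and z below are hypothetical names):

  fit <- lm(dist ~ speed, data = cars)   # intercept + the effect of speed
  summary(fit)                           # coefficients, standard errors, R-squared
  anova(fit)                             # ANOVA table for the fitted model
  # with two predictors in some data frame d:
  # lm(y ~ x + z, data = d)              # main effects only
  # lm(y ~ x * z, data = d)              # main effects plus their interaction
  # lm(y ~ x - 1, data = d)              # drop the intercept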

Page 13

plot() command for lm objects

Produces four figures:
1. Residuals against fitted values

2. QQ plot of the residuals

3. Scale-location plot (standardized residuals against fitted values)

4. Residuals against leverage (‘influence’): identifies influential outliers

Residuals should be normally distributed and not show any systematic trends. If not OK, then:

-> transformation of the response: sqrt(), log(), …

-> transformations of explanatory variables

-> should generalized linear model be used?
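For example, with the model fitted on the cars data above (or any other lm object):

  fit <- lm(dist ~ speed, data = cars)
  par(mfrow = c(2, 2))                   # show all four diagnostic plots at once
  plot(fit)
  # if the residuals look skewed or fan out, a transformation may help, e.g.:
  fit_sqrt <- lm(sqrt(dist) ~ speed, data = cars)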

Page 14

How to predict?

Y = a + b1x1 + … + bnxn

where Y is the expected value of the response, x1, …, xn are the values of the predictors, and a, b1, …, bn are the estimated model parameters.

In R, use the predict() function.
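A short sketch, again using the cars model (the new speed values are invented for the example):

  fit <- lm(dist ~ speed, data = cars)
  new <- data.frame(speed = c(10, 20, 30))              # predictor values to predict for
  predict(fit, newdata = new)                           # expected stopping distances
  predict(fit, newdata = new, interval = "confidence")  # with 95% confidence intervals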

Page 15

Briefly about model selection

• The aim: the simplest adequate model
– Few parameters preferred over many
– Main effects preferred over interactions
– Untransformed variables preferred over transformed ones
– But the model should not be oversimplified

• Simplifying a model
– Are the effects of the explanatory variables significant?
– Does deletion of a term increase the residual variation significantly?

• Model selection tools:
– anova(): tests the difference in residual variation between alternative models
– step(): stepwise model selection based on AIC values
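A rough sketch of both tools, fitting two candidate models to R's built-in mtcars data (chosen only for illustration):

  full <- lm(mpg ~ wt * hp, data = mtcars)   # main effects + interaction
  main <- lm(mpg ~ wt + hp, data = mtcars)   # main effects only
  anova(main, full)   # does dropping the interaction increase residual variation significantly?
  step(full)          # stepwise simplification guided by AIC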

DEMO 2