Statistics 1: tests and linear models
How to get started?
• Exploring data graphically:
– Scatterplot
– Histogram
– Boxplot
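A minimal sketch of these three plot types, using R's built-in `cars` dataset as a stand-in for your own data:

```r
# Exploratory plots with the built-in 'cars' dataset (a stand-in for your data)
plot(cars$speed, cars$dist,
     xlab = "speed", ylab = "stopping distance")  # scatterplot: two variables
hist(cars$dist)      # histogram: distribution of one variable
boxplot(cars$dist)   # boxplot: median, quartiles, potential outliers
```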
Important things to check
• Are all the variables in the correct format?
• Do there seem to be outliers?
– Mistake in data coding?
Initial structure of the analyses
• What is the response variable?
• What are the explanatory variables?
• Explore patterns visually
– Correlations?
– Differences between groups?
Summary statistics
• summary(data), summary(x)
• mean(x), median(x)
• range(x)
• var(x), sd(x)
• min(x), max(x)
• quantile(x, p)
• tapply(), table()
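The listed functions in action, again using the built-in `cars` and `iris` datasets as illustrative data:

```r
x <- cars$dist
summary(x)            # min, quartiles, median, mean, max in one call
mean(x); median(x)
range(x)              # c(min(x), max(x))
var(x); sd(x)         # sd(x) is sqrt(var(x))
quantile(x, 0.9)      # the 90% quantile
# group-wise summaries and counts require a grouping factor (iris example)
tapply(iris$Sepal.Length, iris$Species, mean)  # mean per species
table(iris$Species)                            # counts per factor level
```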
Tests
• Test for normality
– Shapiro-Wilk test: shapiro.test()
– QQ plot: qqnorm(), qqline()
• Homogeneity of variance
– var.test() (for two groups)
– bartlett.test() (for several groups)
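A short sketch of these tests; the `sleep` and `iris` datasets here are illustrative choices, not from the slides:

```r
x <- cars$dist
shapiro.test(x)        # small p-value suggests deviation from normality
qqnorm(x); qqline(x)   # points near the line suggest approximate normality
# variance homogeneity for two groups ('extra' by 'group' in sleep data)
var.test(extra ~ group, data = sleep)
# variance homogeneity for several groups
bartlett.test(Sepal.Length ~ Species, data = iris)
```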
Tests for differences in means
• Student’s t-test: t.test()
– One- or two-sample test
• Testing whether a sample mean differs e.g. from 0
• Testing whether the sample means of two groups differ
– Paired / non-paired
• Are pairs of measurements associated?
– Variance homogeneous / non-homogeneous
– Assumes normally distributed data
• Wilcoxon’s test: wilcox.test()
– Normality not required
– Paired / non-paired

DEMO 1
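The variants above, sketched on the built-in `sleep` dataset (extra sleep for the same 10 subjects under two drugs; an illustrative choice):

```r
# split the 'extra' measurements by drug group
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]
t.test(g1, mu = 0)                 # one-sample: does the mean differ from 0?
t.test(g1, g2)                     # two-sample, Welch (unequal variances)
t.test(g1, g2, var.equal = TRUE)   # two-sample, assuming equal variances
t.test(g1, g2, paired = TRUE)      # paired: same subjects measured twice
wilcox.test(g1, g2, paired = TRUE) # non-parametric alternative
```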
Correlation
• cor(x, y) calculates the correlation coefficient between two numeric variables
– close to 0: no correlation
– close to +1 or -1: strong (positive or negative) correlation
• Is the correlation significant?
– cor.test(y, x)
– Note: also check graphically!
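A minimal example with the built-in `cars` data, where speed and stopping distance are strongly positively correlated:

```r
cor(cars$speed, cars$dist)       # correlation coefficient, about 0.81
cor.test(cars$dist, cars$speed)  # is it significantly different from 0?
plot(cars$speed, cars$dist)      # always confirm the pattern visually
```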
Confidence intervals and standard errors
• Typical ways of describing uncertainty in a parameter estimate (e.g. the mean)
– Standard error (SE of the mean is sqrt(var(xx)/n))
– Confidence interval (95%)
• The range within which the value lies with 95% probability
• Normal approximation: 1.96*SE, so that the 95% CI for mean(xx) is
[mean(xx) - 1.96*SE, mean(xx) + 1.96*SE]
• If the data are not normally distributed, bootstrapping can be helpful
– Suppose we have measured age at death for 100 rats. A 95% CI for the mean age at death can be derived as follows:
1. Take a sample of 100 rats with replacement from the original data
2. Calculate the mean
3. Repeat steps 1 & 2 e.g. 1000 times, always recording the mean
4. The 2.5% and 97.5% quantiles of the recorded means give the 95% CI for the mean
EXERCISE TOMORROW!
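The four bootstrap steps above can be sketched in a few lines; `cars$dist` stands in for the hypothetical rat ages at death:

```r
set.seed(1)                      # for reproducibility
x <- cars$dist                   # stand-in for the measured ages
n <- length(x)
# steps 1-3: resample with replacement, record the mean, repeat 1000 times
boot_means <- replicate(1000, mean(sample(x, n, replace = TRUE)))
# step 4: the 2.5% and 97.5% quantiles give the bootstrap 95% CI
quantile(boot_means, c(0.025, 0.975))
# normal-approximation CI for comparison
se <- sqrt(var(x) / n)
mean(x) + c(-1.96, 1.96) * se
```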
Linear model and regression
• Models the response variable through additive effects of explanatory variables
– E.g. how does the stopping distance of a car depend on its speed?
– Or how does the weight of an animal depend on its length?
The formula
Y = a + b1*x1 + … + bn*xn + ε
– Y: response variable
– a: intercept
– x1 … xn: explanatory variables
– ε: normally distributed error term, i.e. ‘random noise’
Regression, ANOVA or ANCOVA?
How to interpret…
• Intercept:
– Baseline value for Y
– The value Y is expected to take if all the predictors are 0
– If one or more of the predictors are factors, this is the value predicted for the reference levels of those factors
• Coefficients bn
– If xn is a numeric variable, then increasing xn by one unit increases Y by bn
– If xn is a factor, then bn takes a different value for each factor level, so that Y increases by the bn corresponding to the level of xn
• Note: the reference level of x is included in the intercept
Fitting the model in R
• lm(y ~ x, data = mydata), where mydata is your dataset
• Formula syntax:
– y ~ x : intercept + the effect of x
– y ~ x - 1 : no intercept
– y ~ x + z : multiple regression with main effects
– y ~ x * z : multiple regression with main effects and interactions
• Exploring the model: summary(), anova(), plot(model)
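The stopping-distance example mentioned earlier is available as the built-in `cars` dataset, so the whole workflow can be sketched as:

```r
# stopping distance as a function of speed
m <- lm(dist ~ speed, data = cars)
summary(m)    # coefficients, their p-values, R-squared
anova(m)      # ANOVA table for the model terms
coef(m)       # intercept a and slope b
par(mfrow = c(2, 2)); plot(m)   # the four diagnostic plots
```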
plot() command in lm
Produces four diagnostic figures:
1. Residuals against fitted values
2. QQ plot of the residuals
3. Standardized residuals against fitted values (scale-location plot)
4. Residuals against leverage (‘influence’): identifies outliers

Residuals should be normally distributed and show no systematic trends. If not OK, then:
-> transform the response: sqrt(), log(), …
-> transform the explanatory variables
-> or consider a generalized linear model
How to predict?
Y = a + b1*x1 + … + bn*xn
– Y: expected value of the response
– x1 … xn: values of the predictors
– a, b1 … bn: estimated model parameters

In R: the predict() function.
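Continuing with the `cars` example, prediction for new speed values looks like this:

```r
m <- lm(dist ~ speed, data = cars)
# expected stopping distance at speeds 10 and 20
predict(m, newdata = data.frame(speed = c(10, 20)))
# with a 95% confidence interval around the expected value
predict(m, newdata = data.frame(speed = 15), interval = "confidence")
```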
Briefly about model selection
• The aim: the simplest adequate model
– Few parameters preferred over many
– Main effects preferred over interactions
– Untransformed variables preferred over transformed
– The model should still not be oversimplified
• Simplifying a model
– Are the effects of the explanatory variables significant?
– Does deleting a term significantly increase the residual variation?
• Model selection tools:
– anova(): tests the difference in residual variation between alternative models
– step(): stepwise model selection based on AIC values
DEMO 2
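Both tools can be sketched on a pair of nested models; the `iris` variables here are an illustrative choice:

```r
# a main-effects model and the same model with an interaction added
m1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
m2 <- lm(Sepal.Length ~ Sepal.Width * Petal.Length, data = iris)
anova(m1, m2)        # F-test: does the interaction reduce residual variation?
step(m2, trace = 0)  # stepwise simplification by AIC, silently
```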