Post on 28-Mar-2015
Some statistical basics
Marian Scott
Why bother with Statistics
We need statistical skills to: make sense of numerical information, summarise data, present results (graphically), test hypotheses, construct models.
Variables: number and type
Univariate: there is one variable of interest measured on the individuals in the sample. We may ask:
What is the distribution of results? This may be further resolved into questions concerning the mean or average value of the variable and the scatter or variability in the results.
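As a minimal sketch of the two univariate questions above (location and scatter), the following uses Python's standard `statistics` module. The `blood_lead` values are invented purely for illustration and do not come from the slides.

```python
import statistics

# Hypothetical sample: blood lead levels for eight individuals (made-up values).
blood_lead = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.3]

mean = statistics.mean(blood_lead)      # location: the average value
median = statistics.median(blood_lead)  # location: the middle value
stdev = statistics.stdev(blood_lead)    # scatter: sample standard deviation (n - 1 divisor)

print(f"mean = {mean:.2f}, median = {median:.2f}, stdev = {stdev:.2f}")
```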
Bivariate
Bivariate: two variables of interest are measured on each member of the sample. We may ask:
How are the two variables related?
If one variable is time, how does the other variable change?
How can we model the dependence of one variable on the other?
Multivariate
Multivariate: many variables of interest are measured on the individuals in the sample. We might ask:
What relationships exist between the variables?
Is it possible to reduce the number of variables, but still retain 'all' the information?
Can we identify any grouping of the individuals on the basis of the variables?
Data types
Numerical: a variable may be either continuous or discrete.
For a discrete variable, the values taken are whole numbers (e.g. number of chromosome abnormalities, numbers of eggs).
For a continuous variable, values taken are real numbers (positive or negative and including fractional parts) (e.g. blood lead level, alkalinity, weight, temperature).
Categorical
Categorical: a limited number of categories or classes exist, and each member of the sample belongs to one and only one of the classes, e.g. sex is categorical.
Sex is a nominal categorical variable since the categories are unordered.
Dose of a drug or level of diluent (e.g. recorded as low, medium, high) would be an ordinal categorical variable, since the different classes are ordered.
Inference and Statistical Significance
[Diagram: Sample and Population, linked by inference]
Is the sample representative? Is the population homogeneous?
Since only a sample has been taken from the population, we cannot be 100% certain.
Significance testing
Hypothesis Testing II
Null hypothesis: usually ‘no effect’
Alternative hypothesis: ‘effect’
Make a decision based on the evidence (the data)
There is a risk of getting it wrong!
Two types of error:
- reject null when we shouldn’t: Type I
- don’t reject null when we should: Type II
Significance Levels
We cannot reduce probabilities of both Type I and Type II errors to zero.
So we control the probability of a Type I error.
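This is referred to as the significance level. It is fixed in advance; the p-value is calculated from the data and compared with it.
Generally a significance level of 0.05 is considered a reasonable risk of a Type I error (beyond reasonable doubt), so the null hypothesis is rejected when the p-value is < 0.05.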
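To illustrate what controlling the Type I error means, here is a small simulation sketch. It assumes a two-sided z-test with known standard deviation, and the sample size, number of simulations and seed are arbitrary choices, not values from the slides. When the null hypothesis is true, roughly 5% of experiments should still reject it at the 0.05 level.

```python
import random
from statistics import NormalDist

random.seed(1)
alpha = 0.05     # significance level: the Type I error rate we are willing to accept
n = 20           # observations per simulated experiment (assumed)
n_sims = 5000    # number of simulated experiments (assumed)

# Two-sided critical value of the standard Normal distribution (about 1.96).
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

rejections = 0
for _ in range(n_sims):
    # The null hypothesis is true here: data come from Normal(mean 0, sd 1).
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) / (1 / n ** 0.5)  # z statistic with known sd = 1
    if abs(z) > z_crit:
        rejections += 1                      # a Type I error

type1_rate = rejections / n_sims
print(f"Observed Type I error rate: {type1_rate:.3f}")
```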
Statistical Significance vs. Practical Importance
Statistical significance is concerned with the ability to discriminate between treatments given the background variation.
Practical importance relates to the scientific domain and is concerned with scientific discovery and explanation.
Power
Power is related to Type II error:
power = 1 - probability of making a Type II error
Aim: to keep power as high as possible
Sample size calculations
What is the objective of the experiment?
How much of a difference is it important to be able to detect (the effect size)?
At what significance level do you want to conduct the test? (Decreasing the significance level reduces power.)
What is the power of the experiment (what is the probability that you will detect such a difference when it actually exists)?
How variable is the population? Greater variation needs a larger sample size to achieve the same power.
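The questions above can be tied together in a rough calculation. The sketch below assumes a two-sided comparison of two group means using the Normal (z) approximation with a known, common standard deviation; the function names and the example numbers are hypothetical. A real design would normally use a t-based calculation in a statistics package.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def power_two_sample(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power to detect a true difference `delta` between two means."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = delta / (sigma * sqrt(2 / n_per_group))  # difference in standard-error units
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)

def sample_size_per_group(delta, sigma, power=0.8, alpha=0.05):
    """Smallest n per group giving at least the requested power (z approximation)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_pow = Z.inv_cdf(power)
    return ceil(2 * ((z_crit + z_pow) * sigma / delta) ** 2)

# Hypothetical design: detect a difference of 1 unit when the sd is 2 units.
n_needed = sample_size_per_group(delta=1.0, sigma=2.0)
print("n per group:", n_needed)
print("power at that n:", round(power_two_sample(1.0, 2.0, n_needed), 3))
```

Evaluating `power_two_sample` over a range of `n_per_group` values gives the points of a power curve; lowering `alpha` or raising `sigma` in the call lowers the power, as the list above states.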
Power Curves
Modelling continuous variables: checking Normality
Normal density function and histogram
Check for symmetry
Other possibility: Normal probability plot
[Figure: Histogram of C1 with fitted Normal density (x-axis: C1, y-axis: Frequency). Mean 0.1211, StDev 1.015, N 100]
Modelling continuous variables: checking Normality
Normal probability plot
Should show a straight line
p-value of test is also reported (null: data are Normally distributed)
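A Normal probability plot pairs each sorted observation with the Normal quantile it would be expected to match. As a rough sketch of the "straight line" idea, the code below computes those pairs and summarises straightness by their correlation. The plotting positions `(i - 0.5) / n` are one common convention assumed here, and the correlation summary is an illustrative stand-in rather than the test whose p-value the slide refers to.

```python
import random
from math import sqrt
from statistics import NormalDist

def normal_plot_points(data):
    """Return (theoretical Normal quantiles, sorted observations) for plotting."""
    n = len(data)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, sorted(data)

def straightness(data):
    """Correlation between theoretical and observed quantiles (1 = perfectly straight)."""
    q, x = normal_plot_points(data)
    n = len(q)
    mq, mx = sum(q) / n, sum(x) / n
    sqx = sum((a - mq) * (b - mx) for a, b in zip(q, x))
    sqq = sum((a - mq) ** 2 for a in q)
    sxx = sum((b - mx) ** 2 for b in x)
    return sqx / sqrt(sqq * sxx)

random.seed(42)
normal_sample = [random.gauss(0, 1) for _ in range(100)]        # simulated Normal data
skewed_sample = [random.expovariate(1.0) for _ in range(100)]   # simulated skewed data

r_normal = straightness(normal_sample)
r_skewed = straightness(skewed_sample)
print(f"Normal data: {r_normal:.3f}, skewed data: {r_skewed:.3f}")
```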
[Figure: Normal probability plot of C1 (x-axis: C1, y-axis: Percent). Mean 0.1211, StDev 1.015, N 100, AD 0.361, P-Value 0.439]
Statistical inference
Hypothesis testing and the p-value, Statistical significance vs real-world importance, Confidence intervals
Confidence intervals: an alternative to hypothesis testing
A confidence interval is a range of credible values for the population parameter. The confidence coefficient is the percentage of times that the method will in the long run capture the true population parameter.
A common form is: sample estimate ± 2 × estimated standard error (approximately a 95% interval).
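The "long run" interpretation of the confidence coefficient can be checked by simulation. The sketch below repeatedly draws samples from a population with a known mean, builds the estimate ± 2 standard errors interval each time, and counts how often the interval captures the true mean. The population values, sample size and seed are arbitrary assumptions for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(7)
true_mean, true_sd = 10.0, 2.0   # hypothetical population parameters
n, n_sims = 30, 2000             # sample size and number of repeated samples (assumed)

covered = 0
for _ in range(n_sims):
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    estimate = mean(sample)
    se = stdev(sample) / sqrt(n)                # estimated standard error of the mean
    lower, upper = estimate - 2 * se, estimate + 2 * se
    if lower <= true_mean <= upper:
        covered += 1                            # this interval captured the true mean

coverage = covered / n_sims
print(f"Proportion of intervals containing the true mean: {coverage:.3f}")
```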
Statistical models
Outcomes or Responses: these are the results of the practical work and are sometimes referred to as ‘dependent variables’.
Causes or Explanations: these are the conditions or environment within which the outcomes or responses have been observed and are sometimes referred to as ‘independent variables’, but more commonly known as covariates.
Statistical models
In experiments many of the covariates have been determined by the experimenter but some may be aspects that the experimenter has no control over but that are relevant to the outcomes or responses.
In observational studies, these are usually not under the control of the experimenter but are recorded as possible explanations of the outcomes or responses.
Specifying a statistical model
Models specify the way in which outcomes and causes link together, e.g.
Metabolite = Temperature
The = sign does not indicate equality in a mathematical sense, and there should be an additional item on the right hand side, giving the formula:
Metabolite = Temperature + Error
statistical model interpretation
Metabolite = Temperature + Error
The outcome Metabolite is explained by Temperature and other things that we have not recorded which we call Error.
The task that we then have in terms of data analysis is simply to find out if the effect that Temperature has is ‘large’ in comparison to that which Error has so that we can say whether or not the Metabolite that we observe is explained by Temperature.
Correlations and linear relationships
Strength of linear relationship
Simple indicator lying between –1 and +1
Check your plots for linearity
gene correlations
[Figure: four scatterplots of gene expression pairs. RAG1Spl against mBadSpl, corr 0.9; mBadSpl against mBcl2Sp, corr 0.5; RAG1Spl against mBclxLN, corr 0.03; RAG1Spl against mBadLN, corr -0.56]
Interpreting correlations
The correlation coefficient is a measure of the strength of the linear association between two variables. If the relationship is non-linear, the coefficient can still be evaluated and may appear sensible, so beware: plot the data first.
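A small sketch of why plotting matters: the function below computes the usual (Pearson) correlation coefficient by hand, and the two invented data sets show that a perfect but curved relationship can give a correlation of zero.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # exact straight line
curved = [x ** 2 for x in xs]      # exact, but non-linear, relationship

r_linear = pearson(xs, linear)     # perfect linear association
r_curved = pearson(xs, curved)     # no linear association, despite perfect dependence
print(round(r_linear, 6), round(r_curved, 6))
```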
Simple regression model
The basic regression model assumes:
The average value of the response x is linearly related to the explanatory t,
The spread of the response x about the average is the SAME for all values of t,
The VARIABILITY of the response x about the average follows a NORMAL distribution for each value of t.
Simple regression model
Model is typically fitted using least squares
Goodness of fit of model assessed based on residual sum of squares and R2
Assumptions checked using residual plots
Inference about model parameters carried out using hypothesis tests or confidence intervals
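The steps above can be sketched with a hand-written least squares fit. The data values below are invented for illustration; `t` is the explanatory variable and `x` the response, following the notation of the regression slide. The residuals returned are what would be plotted to check the assumptions.

```python
def fit_line(t, x):
    """Least squares fit of x = intercept + slope * t; returns fit summaries."""
    n = len(t)
    t_bar, x_bar = sum(t) / n, sum(x) / n
    s_tt = sum((ti - t_bar) ** 2 for ti in t)
    s_tx = sum((ti - t_bar) * (xi - x_bar) for ti, xi in zip(t, x))
    slope = s_tx / s_tt
    intercept = x_bar - slope * t_bar
    residuals = [xi - (intercept + slope * ti) for ti, xi in zip(t, x)]
    rss = sum(r ** 2 for r in residuals)        # residual sum of squares
    tss = sum((xi - x_bar) ** 2 for xi in x)    # total sum of squares
    r_squared = 1 - rss / tss
    return slope, intercept, residuals, rss, r_squared

# Hypothetical measurements (not from the slides).
t = [1, 2, 3, 4, 5, 6]
x = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

slope, intercept, residuals, rss, r_squared = fit_line(t, x)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R2 = {r_squared:.4f}")
```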
statistical model interpretation
The traditional ‘statistical tests’ such as t-tests, ANOVA, ANCOVA and regression are each special cases of a more general type of model, making a number of assumptions:
t-tests work where there are two groups,
ANOVA works with categorical explanatory variables,
regression assumes that explanatory variables are continuous.
Our explanatory variables are not like this: they are mixtures of continuous and categorical, so we need a more flexible approach, the G(eneral) L(inear) M(odel).
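One way to see that the traditional tests are special cases of a single model: a pooled two-sample t-test gives exactly the same t statistic as a regression of the response on a 0/1 group indicator. The sketch below checks this by hand on two invented groups; the data and function names are hypothetical.

```python
from math import sqrt

def pooled_t(a, b):
    """Classical two-sample t statistic with pooled variance (mean of b minus mean of a)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    sp2 = ss / (na + nb - 2)                     # pooled variance estimate
    return (mb - ma) / sqrt(sp2 * (1 / na + 1 / nb))

def regression_slope_t(x, y):
    """Slope and its t statistic from a simple least squares regression of y on x."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = yb - slope * xb
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = sqrt(rss / (n - 2) / sxx)
    return slope, slope / se_slope

# Hypothetical responses under two treatments.
group_a = [5.1, 4.8, 5.6, 5.0, 4.7]
group_b = [6.2, 5.9, 6.8, 6.1, 6.5, 6.0]

indicator = [0] * len(group_a) + [1] * len(group_b)   # 0 = group a, 1 = group b
response = group_a + group_b

t_two_sample = pooled_t(group_a, group_b)
slope, t_regression = regression_slope_t(indicator, response)
print(round(t_two_sample, 4), round(t_regression, 4))
```

The fitted slope is the difference between the two group means, which is why the two approaches agree.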
General linear models
General Linear Models (GLMs) are a comprehensive set of techniques that cover a wide range of analyses. Problems that make use of a number of specific techniques may be specified as GLM problems using a unified specification called a Model Syntax. The form of the Model Syntax varies a little from statistics package to statistics package, but is essentially just a way of unambiguously specifying what the relationship is between variables (categorical or continuous).
Examples
Example: Comparing the effect of burning and clipping on bracken
Traditional test: Two sample t-test
GLM word equation: SHOOTS = MANAGEMENT

Example: Comparing the effect of two different drugs with a placebo
Traditional test: One-way analysis of variance
GLM word equation: EFFECT = DRUG

Example: Comparing the yield between fertilisers, conducting the experiment in several fields
Traditional test: One-way analysis of variance with blocking
GLM word equation: YIELD = FIELD + FERTILISER

Example: Investigating the relationship between height and weight in people
Traditional test: Regression
GLM word equation: WEIGHT = HEIGHT

Example: Investigating the relationship between oxygen consumption and weight in scampi, taking level of activity into account
Traditional test: Analysis of covariance, with emphasis on regression
GLM word equation: OXYGEN = WEIGHT + ACTIVITY, or under different assumptions (an interaction between the terms) OXYGEN = WEIGHT | ACTIVITY
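Behind a word equation, a package turns each term into columns of numbers. The sketch below shows one plausible coding for the scampi example, assuming ACTIVITY has two levels coded as a 0/1 dummy and reading the `|` form as "both terms plus their interaction". The records, the level names and the `design_row` helper are all hypothetical, and real packages may code the terms differently.

```python
# Hypothetical scampi records: (weight, activity level); values are invented.
records = [(12.0, "low"), (15.5, "high"), (9.8, "low"), (14.1, "high")]

def design_row(weight, activity, interaction=False):
    """One row of the design matrix: intercept, WEIGHT, ACTIVITY dummy[, product]."""
    high = 1.0 if activity == "high" else 0.0   # dummy coding, 'low' is the baseline
    row = [1.0, weight, high]
    if interaction:
        row.append(weight * high)               # lets the WEIGHT slope differ by ACTIVITY
    return row

# OXYGEN = WEIGHT + ACTIVITY: parallel lines, one per activity level.
additive = [design_row(w, a) for w, a in records]

# OXYGEN = WEIGHT | ACTIVITY: separate slopes as well as separate intercepts.
crossed = [design_row(w, a, interaction=True) for w, a in records]

for row in crossed:
    print(row)
```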
summary
hypothesis tests and confidence intervals are used to make inferences
we build statistical models to explore relationships and explain variation
the modelling framework is a general one – general linear models, generalised additive models
assumptions should be checked.