Some statistical basics

Marian Scott

Why bother with Statistics

We need statistical skills to: make sense of numerical information, summarise data, present results (graphically), test hypotheses, and construct models.

Variables: number and type

Univariate: there is one variable of interest measured on the individuals in the sample. We may ask:

What is the distribution of results? This may be further resolved into questions concerning the mean or average value of the variable and the scatter or variability in the results.

Bivariate

Bivariate: two variables of interest are measured on each member of the sample. We may ask:

How are the two variables related?

If one variable is time, how does the other variable change?

How can we model the dependence of one variable on the other?

Multivariate

Multivariate: many variables of interest are measured on the individuals in the sample. We might ask:

What relationships exist between the variables?

Is it possible to reduce the number of variables, but still retain 'all' the information?

Can we identify any grouping of the individuals on the basis of the variables?

Data types

Numerical: a variable may be either continuous or discrete.

For a discrete variable, the values taken are whole numbers (e.g. number of chromosome abnormalities, numbers of eggs).

For a continuous variable, values taken are real numbers (positive or negative and including fractional parts) (e.g. blood lead level, alkalinity, weight, temperature).

Categorical

Categorical: a limited number of categories or classes exist, and each member of the sample belongs to one and only one of the classes, e.g. sex is categorical.

Sex is a nominal categorical variable since the categories are unordered.

Dose of a drug or level of diluent (e.g. recorded as low, medium, high) would be an ordinal categorical variable since the different classes are ordered.

Inference and Statistical Significance

[Diagram: Sample and Population, linked by inference]

Is the sample representative? Is the population homogeneous?

Since only a sample has been taken from the population, we cannot be 100% certain of any conclusion.

Significance testing

Hypothesis Testing II

Null hypothesis: usually ‘no effect’

Alternative hypothesis: ‘effect’

Make a decision based on the evidence (the data)

There is a risk of getting it wrong!

Two types of error:

- Type I: reject the null when we shouldn’t

- Type II: don’t reject the null when we should

Significance Levels

We cannot reduce probabilities of both Type I and Type II errors to zero.

So we control the probability of a Type I error.

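This probability is referred to as the significance level. The p-value calculated from the data is compared with it.

Generally a significance level of 0.05 is considered a reasonable risk of a Type I error (beyond reasonable doubt), so a p-value of <0.05 leads us to reject the null.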
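The meaning of a 0.05 significance level can be checked by simulation. The sketch below is illustrative only: it assumes a simple two-sided z-test on Normal data with known standard deviation 1, and the function name, sample size and number of simulations are arbitrary choices rather than anything from the slides. The null hypothesis is true in every simulated experiment, so every rejection is a Type I error.

```python
import random
from statistics import NormalDist, mean


def type_one_error_rate(n=20, alpha=0.05, n_sim=4000, seed=1):
    """Fraction of simulated experiments that reject a TRUE null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(n_sim):
        # The null is true by construction: population mean 0, known sd 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = mean(sample) / (1.0 / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1  # a Type I error
    return rejections / n_sim


rate = type_one_error_rate()
print(f"Observed Type I error rate: {rate:.3f}")
```

The observed rate should sit close to the chosen significance level, which is the sense in which the test "controls" the Type I error.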

Statistical Significance vs. Practical Importance

Statistical significance is concerned with the ability to discriminate between treatments given the background variation.

Practical importance relates to the scientific domain and is concerned with scientific discovery and explanation.

Power is related to Type II error

power = 1 - probability of making a Type II error

We aim to keep power as high as possible

Sample size calculations

What is the objective of the experiment?

How much of a difference is it important to be able to detect (the effect size)?

At what significance level do you want to conduct the test? (Decreasing the significance level reduces power.)

What is the power of the experiment (what is the probability that you will detect such a difference when it actually exists)?

How variable is the population? Greater variation needs a larger sample size to achieve the same power.
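These questions map directly onto the inputs of a sample size formula. As a rough sketch, assuming a two-sided comparison of two group means with a common standard deviation and using the Normal approximation, the number needed per group is 2 × ((z for the significance level + z for the power) × sd / effect size)². The function name and the example inputs are hypothetical, and an exact t-based calculation would give slightly larger numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Uses the Normal approximation; effect is the difference worth detecting
    and sd is the assumed common standard deviation.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)


# Hypothetical inputs: detect a difference of 1 unit when the sd is 2 units.
n_base = sample_size_per_group(effect=1.0, sd=2.0)
n_more_variable = sample_size_per_group(effect=1.0, sd=3.0)
n_stricter_level = sample_size_per_group(effect=1.0, sd=2.0, alpha=0.01)
print(n_base, n_more_variable, n_stricter_level)
```

Greater variability and a stricter significance level both increase the required sample size, matching the points listed above.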

Power Curves

Modelling continuous variables: checking Normality

Normal density function and histogram

Check for symmetry. Other possibility: Normal probability plot

[Figure: Histogram of C1 with fitted Normal density; vertical axis Frequency; Mean 0.1211, StDev 1.015, N 100]

Modelling continuous variables: checking Normality

Normal probability plot

Should show a straight line

p-value of test is also reported (null: data are Normally distributed)

[Figure: Normal probability plot of C1; Mean 0.1211, StDev 1.015, N 100, AD 0.361]
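The coordinates behind a Normal probability plot are easy to compute by hand. The sketch below pairs each ordered observation with a standard Normal quantile and then measures how straight the resulting points are using their correlation. This is not the Anderson-Darling test reported on the slide; the plotting positions (i - 0.5)/n, the function names and the simulated data are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean


def normal_plot_points(data):
    """Pair each ordered observation with its standard Normal quantile."""
    n = len(data)
    std_normal = NormalDist()
    ordered = sorted(data)
    # Plotting positions (i - 0.5) / n are one common convention (assumed here).
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


def straightness(points):
    """Correlation of the plotted points; near 1 means close to a straight line."""
    qs = [q for q, _ in points]
    ys = [y for _, y in points]
    q_bar, y_bar = mean(qs), mean(ys)
    s_qy = sum((q - q_bar) * (y - y_bar) for q, y in points)
    s_qq = sum((q - q_bar) ** 2 for q in qs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_qy / (s_qq * s_yy) ** 0.5


rng = random.Random(42)
normal_data = [rng.gauss(0.0, 1.0) for _ in range(200)]      # simulated Normal sample
skewed_data = [rng.expovariate(1.0) for _ in range(200)]     # simulated skewed sample

r_normal = straightness(normal_plot_points(normal_data))
r_skewed = straightness(normal_plot_points(skewed_data))
print(f"Normal sample: r = {r_normal:.3f}")
print(f"Skewed sample: r = {r_skewed:.3f}")
```

A skewed sample bends away from the straight line, so its correlation is noticeably lower than that of the Normal sample.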

Statistical inference

Hypothesis testing and the p-value

Statistical significance vs real-world importance

Confidence intervals

Confidence intervals: an alternative to hypothesis testing

A confidence interval is a range of credible values for the population parameter. The confidence coefficient is the percentage of times that the method will in the long run capture the true population parameter.

A common form is: sample estimate ± 2 × estimated standard error
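The "long run" interpretation can be illustrated by simulation. The sketch below builds the interval as the sample mean plus or minus 2 estimated standard errors and counts how often it captures a known population mean. The population values, sample size and function names are hypothetical, and the multiplier 2 is an approximation to the exact 95% value.

```python
import random
from statistics import mean, stdev


def approx_95_interval(sample):
    """Sample mean plus or minus 2 estimated standard errors."""
    estimate = mean(sample)
    std_error = stdev(sample) / len(sample) ** 0.5
    return estimate - 2 * std_error, estimate + 2 * std_error


def coverage(true_mean=10.0, true_sd=2.0, n=30, n_sim=2000, seed=7):
    """Fraction of simulated intervals that contain the true population mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        lower, upper = approx_95_interval(sample)
        if lower <= true_mean <= upper:
            hits += 1
    return hits / n_sim


observed_coverage = coverage()
print(f"Proportion of intervals capturing the true mean: {observed_coverage:.3f}")
```

The proportion should be close to 95%, which is the confidence coefficient of the method rather than a statement about any single interval.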

Statistical models

Outcomes or Responses: these are the results of the practical work and are sometimes referred to as ‘dependent variables’.

Causes or Explanations: these are the conditions or environment within which the outcomes or responses have been observed and are sometimes referred to as ‘independent variables’, but more commonly known as covariates.

Statistical models

In experiments many of the covariates have been determined by the experimenter but some may be aspects that the experimenter has no control over but that are relevant to the outcomes or responses.

In observational studies, these are usually not under the control of the experimenter but are recorded as possible explanations of the outcomes or responses.

Specifying a statistical model

Models specify the way in which outcomes and causes link together, e.g.

Metabolite = Temperature

The = sign does not indicate equality in a mathematical sense, and there should be an additional item on the right-hand side giving a formula:

Metabolite = Temperature + Error

Statistical model interpretation

Metabolite = Temperature + Error

The outcome Metabolite is explained by Temperature and other things that we have not recorded which we call Error.

The task that we then have in terms of data analysis is simply to find out if the effect that Temperature has is ‘large’ in comparison to that which Error has so that we can say whether or not the Metabolite that we observe is explained by Temperature.

Correlations and linear relationships

Strength of linear relationship

Simple indicator lying between –1 and +1

Check your plots for linearity

Gene correlations

[Figure: gene correlations, four scatter plots with axis labels and correlations: mBadSpl (corr 0.9), mBcl2Sp (corr 0.5), mBclxLNR (corr 0.03), mBadLN (corr -0.56)]

Interpreting correlations

The correlation coefficient is a measure of the strength of the linear association between two variables. If the relationship is non-linear, the coefficient can still be evaluated and may appear sensible, so beware: plot the data first.
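A small numerical example shows why plotting first matters. In the sketch below the data are invented for illustration: one variable depends exactly on the other through a curve, yet the correlation coefficient is either zero or misleadingly high. The helper name pearson is an arbitrary choice.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5


# Invented data: exact relationships with no noise at all.
x_sym = list(range(-5, 6))
r_linear = pearson(x_sym, [2 * x + 1 for x in x_sym])      # straight line
r_quadratic = pearson(x_sym, [x ** 2 for x in x_sym])      # symmetric curve

x_pos = list(range(0, 11))
r_cubic = pearson(x_pos, [x ** 3 for x in x_pos])          # monotone curve

print(f"linear: {r_linear:.3f}, quadratic: {r_quadratic:.3f}, cubic: {r_cubic:.3f}")
```

The quadratic case gives a correlation of essentially zero despite perfect dependence, while the cubic case gives a high value even though a straight line is the wrong description. Only a plot reveals either problem.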

Simple regression model

The basic regression model assumes:

The average value of the response x is linearly related to the explanatory t,

The spread of the response x about the average is the SAME for all values of t,

The VARIABILITY of the response x about the average follows a NORMAL distribution for each value of t.

Simple regression model

Model is typically fitted using least squares

Goodness of fit of the model is assessed based on the residual sum of squares and R²

Assumptions are checked using residual plots

Inference about model parameters is carried out using hypothesis tests or confidence intervals
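The least squares fit and the goodness-of-fit summaries can be written out in a few lines. The sketch below keeps the notation used above (response x, explanatory t); the data values and the function name fit_line are invented for illustration, and a statistics package would normally be used in practice.

```python
def fit_line(t, x):
    """Least squares fit of x = intercept + slope * t.

    Returns the intercept, slope, residuals and R-squared.
    """
    n = len(t)
    t_bar = sum(t) / n
    x_bar = sum(x) / n
    s_tt = sum((ti - t_bar) ** 2 for ti in t)
    s_tx = sum((ti - t_bar) * (xi - x_bar) for ti, xi in zip(t, x))
    slope = s_tx / s_tt
    intercept = x_bar - slope * t_bar
    residuals = [xi - (intercept + slope * ti) for ti, xi in zip(t, x)]
    rss = sum(r ** 2 for r in residuals)           # residual sum of squares
    tss = sum((xi - x_bar) ** 2 for xi in x)       # total sum of squares
    r_squared = 1 - rss / tss
    return intercept, slope, residuals, r_squared


# Invented example data.
t = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

intercept, slope, residuals, r_squared = fit_line(t, x)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}, R-squared = {r_squared:.3f}")
```

Plotting the residuals against t is the usual check on the constant-spread and Normality assumptions: they should scatter evenly around zero with no pattern.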

Statistical model interpretation

The traditional ‘statistical tests’ such as t-tests, ANOVA, ANCOVA and regression are each special cases of a more general type of model, each making a number of assumptions:

t-tests work where there are two groups,

ANOVA works with categorical explanatory variables,

regression assumes that explanatory variables are continuous.

Our explanatory variables are not like this: they are mixtures of continuous and categorical, so we need a more flexible approach, the G(eneral) L(inear) M(odel).

General linear models

General Linear Models (GLMs) are a comprehensive set of techniques that cover a wide range of analyses. Problems that make use of a number of specific techniques may be specified as GLM problems using a unified specification called a Model Syntax. The form of the Model Syntax varies a little from statistics package to statistics package, but is essentially just a way of unambiguously specifying what the relationship is between variables (categorical or continuous).

Examples

Each example below gives the traditional test and the GLM word equation.

Example: Comparing the effect of burning and clipping on bracken
Traditional test: Two-sample t-test
GLM word equation: SHOOTS = MANAGEMENT

Example: Comparing the effect of two different drugs with a placebo
Traditional test: One-way analysis of variance
GLM word equation: EFFECT = DRUG

Example: Comparing the yield between fertilisers, conducting the experiment in several fields
Traditional test: One-way analysis of variance with blocking
GLM word equation: YIELD = FIELD + FERTILISER

Example: Investigating the relationship between height and weight in people
Traditional test: Regression
GLM word equation: WEIGHT = HEIGHT

Example: Investigating the relationship between oxygen consumption and weight in scampi, taking level of activity into account
Traditional test: Analysis of covariance, with emphasis on regression
GLM word equation: OXYGEN = WEIGHT + ACTIVITY
or, under different assumptions (an interaction between the terms): OXYGEN = WEIGHT | ACTIVITY
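The claim that a t-test is a special case of a linear model can be seen with the bracken word equation SHOOTS = MANAGEMENT. In the sketch below the shoot counts are invented, and MANAGEMENT is coded as a 0/1 indicator (an assumed coding; packages may use a different one). Fitting a least squares line to the indicator recovers the two group means.

```python
from statistics import mean

# Hypothetical shoot counts, invented purely for illustration.
burned = [34.0, 41.0, 38.0, 45.0, 39.0]
clipped = [52.0, 47.0, 55.0, 49.0, 51.0]

# Code MANAGEMENT as a 0/1 indicator: 0 = burned (reference), 1 = clipped.
management = [0.0] * len(burned) + [1.0] * len(clipped)
shoots = burned + clipped

# Least squares fit of SHOOTS = intercept + slope * MANAGEMENT.
m_bar = mean(management)
s_bar = mean(shoots)
s_ms = sum((m - m_bar) * (s - s_bar) for m, s in zip(management, shoots))
s_mm = sum((m - m_bar) ** 2 for m in management)
slope = s_ms / s_mm
intercept = s_bar - slope * m_bar

print(f"intercept = {intercept:.2f} (mean of burned plots: {mean(burned):.2f})")
print(f"slope = {slope:.2f} (difference in means: {mean(clipped) - mean(burned):.2f})")
```

The intercept is the mean of the reference group and the slope is the difference between the group means, which is exactly the quantity the two-sample t-test examines.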

Summary

Hypothesis tests and confidence intervals are used to make inferences.

We build statistical models to explore relationships and explain variation.

The modelling framework is a general one: general linear models, generalised additive models.

Assumptions should be checked.
