principles of statistics

Upload: abdul-sukur-kamsir

Post on 10-Apr-2015

Page 1: Principles of Statistics

PRINCIPLES OF STATISTICS

IPTHO

ABDUL SUKUR KAMSIR

Page 2: Principles of Statistics

OUTLINE

• Introduction and Definitions
• Sampling
• External and Internal Validity
• Sources of error
• Normal distribution
• Standard error
• Binomial probabilities
• Poisson distribution
• Statistical tests
• Confidence intervals

Page 3: Principles of Statistics

What is statistics?

A science in which inferences are made about specific random phenomena on the basis of relatively limited sample material.

The word “statistics” can also mean the analytical tools used in this science, i.e. the figures calculated from the data collected.

Page 4: Principles of Statistics

Two Main Areas

• Mathematical statistics – concerned with the development of new methods of statistical inference; requires strong mathematical knowledge
• Applied statistics – applies the methods of mathematical statistics to specific subject areas; it is called BIOSTATISTICS when applied to biological or medical problems

Page 5: Principles of Statistics

Definitions

• Statistics - Collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions.
• Statistic - Characteristic or measure obtained from a sample, e.g. mean, variance, Chi-square statistic, t-test statistic etc.

Page 6: Principles of Statistics

Definitions

• Inferential statistics - Generalizing from samples to populations using probabilities. Includes performing hypothesis testing, determining relationships between variables, and making predictions.
• Descriptive statistics - The process of organizing and summarising collected information (data) to study the properties of a variable.

Page 7: Principles of Statistics

Definitions

• Population - All subjects possessing a common characteristic that is being studied.
• Sample - A subgroup or subset of the population.
• Parameter - Characteristic or measure obtained from a population.
• Statistic - Characteristic or measure obtained from a sample.
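The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population of heights is simulated, and the names `population`, `sample`, `population_mean` and `sample_mean` are invented for the example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: 10,000 simulated heights in cm (invented data).
population = [random.gauss(170, 10) for _ in range(10_000)]

# Parameter: a measure computed from the whole population.
population_mean = statistics.fmean(population)

# Statistic: the same measure computed from a sample (a subset of the population).
sample = random.sample(population, 50)
sample_mean = statistics.fmean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

In practice the parameter is usually unknown, and the statistic is used to estimate it.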

Page 8: Principles of Statistics

Definitions

Variable - Characteristic or attribute that can assume different values. It is the fundamental element of statistical analysis: something that is measured, counted or identified.

Page 9: Principles of Statistics

Variables

• Qualitative - Variables which assume non-numerical values.
• Quantitative - Variables which assume numerical values.
• Discrete - Variables which assume a finite or countable number of possible values. Usually obtained by counting.
• Continuous - Variables which can assume any value within a range, so the number of possible values is infinite. Usually obtained by measurement.

Page 10: Principles of Statistics

Variables (qualitative/categorical)

• Nominal Level - Level of measurement which classifies data into mutually exclusive, all-inclusive categories in which no order or ranking can be imposed on the data.
• Ordinal Level - Level of measurement which classifies data into categories that can be ranked. Differences between the ranks are not meaningful.

Page 11: Principles of Statistics

Variables (quantitative/numerical)

Interval Level - Level of measurement which classifies data that can be ranked and in which differences are meaningful. However, there is no meaningful zero, so ratios are meaningless. (temperature in Celsius, Fahrenheit etc.)

Ratio Level - Level of measurement which classifies data that can be ranked, in which differences are meaningful, and which has a true zero. True ratios exist between the different units of measure. (temperature in Kelvin)

Page 12: Principles of Statistics

SAMPLING

• Random - Sampling in which the data is collected using chance methods or random numbers.
• Systematic - Sampling in which data is obtained by selecting every k-th object.
• Stratified - Sampling in which the population is divided into groups (called strata) according to some characteristic. Each of these strata is then sampled using one of the other sampling techniques.
• Cluster - Sampling in which the population is divided into groups (usually geographically). Some of these groups are randomly selected, and then all of the elements in those groups are selected.

(For all of the above methods, the probability of selection is known.)

• Convenience - Sampling in which readily available data is used; the probability of being selected for the sample is unknown.
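The four probability sampling methods above can be sketched as follows. This is a minimal illustration, not a survey-grade implementation: the sampling frame is invented, and the function names (`simple_random`, `systematic`, `stratified`, `cluster`) are hypothetical.

```python
import random

random.seed(42)

# Hypothetical sampling frame: 100 subjects, each tagged with a village and a sex.
population = [
    {"id": i, "village": f"V{i % 5}", "sex": "F" if i % 2 else "M"}
    for i in range(100)
]

def simple_random(pop, n):
    """Random sampling: every subject has the same chance of selection."""
    return random.sample(pop, n)

def systematic(pop, k):
    """Systematic sampling: random start, then every k-th subject."""
    start = random.randrange(k)
    return pop[start::k]

def stratified(pop, key, n_per_stratum):
    """Stratified sampling: split into strata, then sample within each one."""
    strata = {}
    for subject in pop:
        strata.setdefault(subject[key], []).append(subject)
    chosen = []
    for members in strata.values():
        chosen.extend(random.sample(members, n_per_stratum))
    return chosen

def cluster(pop, key, n_clusters):
    """Cluster sampling: pick whole groups at random and keep all their members."""
    names = sorted({subject[key] for subject in pop})
    picked = set(random.sample(names, n_clusters))
    return [subject for subject in pop if subject[key] in picked]

srs = simple_random(population, 10)
sys_sample = systematic(population, 10)
strat = stratified(population, "sex", 5)
clus = cluster(population, "village", 2)
```

Here the strata are sampled by simple random sampling; any of the other techniques could be used within each stratum instead.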

Page 13: Principles of Statistics

Why sample?

• Cheaper than getting data from everyone!
• Internally valid study because:
  - easier to manage
  - standardised methods
  - easier to conduct
  - fewer people involved
• A good sampling method may ensure external validity!

Page 14: Principles of Statistics

Beware when sampling

Page 15: Principles of Statistics

• Two samples have been taken at random from the same population

• By chance, sample 1 contains a group of relatively large fish while sample 2 contains relatively small fish

• You might mistakenly conclude that the two samples come from very different populations

* Even a random sample may not be a good representation of the population from which it has been taken
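A small simulation can make this point concrete. The sketch below assumes a made-up population of fish lengths (the mean of 30 cm and standard deviation of 6 cm are invented) and draws two small random samples from it.

```python
import random
import statistics

random.seed(7)

# Hypothetical population of 5,000 fish lengths in cm (simulated values).
population = [random.gauss(30, 6) for _ in range(5_000)]

# Two small random samples drawn from the SAME population.
sample_1 = random.sample(population, 8)
sample_2 = random.sample(population, 8)

mean_1 = statistics.fmean(sample_1)
mean_2 = statistics.fmean(sample_2)

# The gap between the sample means reflects chance alone,
# because both samples come from one population.
print(f"sample 1 mean: {mean_1:.1f} cm")
print(f"sample 2 mean: {mean_2:.1f} cm")
print(f"difference:    {abs(mean_1 - mean_2):.1f} cm")
```

Rerunning with different seeds shows how much the gap varies when the samples are this small.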

Page 16: Principles of Statistics

• Samples selected at random from very different populations may not be different.

• Simply by chance sample 1 and sample 2 are similar

• Even if two populations are very different, samples from each may be similar

• This gives the misleading impression that the populations are similar

Page 17: Principles of Statistics

• Two samples of equal-size fish were taken from the same population

• One group was fed a vitamin-supplemented diet for 300 days and the other was an untreated control group

• The supplemented diet caused a 10% increase in length, but this difference is small compared with the variation in growth among individuals, which may obscure any effect of the treatment

* Natural variation among individuals within a sample may obscure any effect of an experimental treatment

Page 18: Principles of Statistics

Because of the natural variability among living species:

A ‘true’ difference may not be apparent

The effect of treatment may not be apparent after a clinical trial

Page 19: Principles of Statistics

How to solve this unavoidable problem in Life Sciences?

Researchers need to know how to sample so that they have a good representative sample of their population.

They also need a good understanding of experimental design, because a good design will take natural variation into account.

They need to know how to minimise additional unwanted variation introduced by the experimental procedure itself.

They need to take accurate and precise measurements to minimise other sources of error.

Page 20: Principles of Statistics

EXTERNAL VALIDITY

External validity is the extent to which the results of a study are applicable to OTHER populations

“Can my results be extrapolated to others?”

Page 21: Principles of Statistics

INTERNAL VALIDITY

Internal validity is the extent to which the results of an investigation accurately reflect the true situation of the study population

“The ability to measure what it sets out to measure”

Avoids BIAS or SYSTEMATIC ERRORS

Page 22: Principles of Statistics

Sources of ‘errors’

Bias – a systematic error that can lead to a distortion of the results; “deviation from truth”

Confounding

Chance

Page 23: Principles of Statistics

BIAS

Selection bias (non-random sample, healthy worker effect etc.)

Information bias (measurement inaccuracy, misclassification, recall bias, interviewer bias etc.)

Performance bias may occur in multicentre studies

Page 24: Principles of Statistics

Confounding

Mixing of the effect of an extraneous variable with the effects of the exposure and disease of interest

For a variable to be considered a confounder, it must satisfy two conditions: (a) it is associated with the outcome of interest, and (b) it is also independently associated with the exposure (NOT a result of being exposed)

Page 25: Principles of Statistics

Confounding

Occurs when the groups being compared differ with regard to important risk or prognostic factors other than the factor under investigation

Certain study designs are prone to confounding, e.g. case-control studies

Page 26: Principles of Statistics

Mann JI et al (1968). Oral contraceptive use in older women and fatal myocardial infarction. Br Med J 2: 193-199

• 153 women with myocardial infarction (MI)
• 178 controls
• Past exposure to oral contraceptives (ocp) was investigated
• Second table is stratified according to age
• Note that the OR became higher in each age stratum
• The confounder ‘age’ weakened the relationship between MI and ocp

Table 1 (all ages)

|          | User | Non-user |
|----------|------|----------|
| Cases    | 39   | 114      |
| Controls | 24   | 154      |

O.R. = 2.2

Table 2 (stratified by age)

|          | Age <40: User | Age <40: Non-user | Age 40-44: User | Age 40-44: Non-user |
|----------|---------------|-------------------|-----------------|---------------------|
| Cases    | 21            | 26                | 18              | 88                  |
| Controls | 17            | 59                | 7               | 95                  |
| O.R.     | 2.8           |                   | 2.8             |                     |
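The odds ratios quoted on this slide can be checked from the counts. The sketch below computes the crude and age-specific odds ratios; the pooled Mantel-Haenszel estimate at the end is an addition that does not appear on the slide, and the function names are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table.

    a, b = exposed and unexposed cases; c, d = exposed and unexposed controls.
    """
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Pooled OR across strata, each given as (a, b, c, d). Not on the slide."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

crude = odds_ratio(39, 114, 24, 154)       # all ages combined
under_40 = odds_ratio(21, 26, 17, 59)      # age < 40
age_40_44 = odds_ratio(18, 88, 7, 95)      # age 40-44
pooled = mantel_haenszel_or([(21, 26, 17, 59), (18, 88, 7, 95)])

print(f"crude OR:     {crude:.1f}")
print(f"OR, age <40:  {under_40:.1f}")
print(f"OR, 40-44:    {age_40_44:.1f}")
print(f"pooled OR:    {pooled:.1f}")
```

Both stratum-specific estimates are higher than the crude estimate, which is the pattern the slide attributes to confounding by age.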

Page 27: Principles of Statistics

COULD IT BE DUE TO CHANCE?

Type I and Type II errors

(will be explained in other lectures / slides)

Page 28: Principles of Statistics

Could errors have been introduced?

Susceptibility (?differences in basic characteristics)

Performance (e.g. differences in proficiencies of treatment)

Detection (differences in measurement of outcome)

Transfer (differential losses to follow-up)

Page 29: Principles of Statistics

The Normal Distribution

• Theoretical distribution that has the shape of a bell-shaped curve
• Perfectly symmetrical about its centre (mean = median = mode)
• Standard deviation reflects the spread of individual observations; about 68% of the observations lie within 1 standard deviation of the mean
• We can thus estimate the area under the curve for any value of the variable once we know the mean and standard deviation of the distribution
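As a rough check of the 68% figure, the area under a normal curve can be computed with Python's standard library. The mean of 120 and standard deviation of 10 below are invented values; any mean and standard deviation give the same proportions.

```python
from statistics import NormalDist

# Hypothetical distribution, e.g. a measurement with mean 120 and SD 10.
mean, sd = 120.0, 10.0
dist = NormalDist(mu=mean, sigma=sd)

def prob_between(low, high):
    """Area under the normal curve between low and high."""
    return dist.cdf(high) - dist.cdf(low)

within_1_sd = prob_between(mean - sd, mean + sd)
within_2_sd = prob_between(mean - 2 * sd, mean + 2 * sd)

print(f"within 1 SD: {within_1_sd:.3f}")
print(f"within 2 SD: {within_2_sd:.3f}")
```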

Page 30: Principles of Statistics

Normal distribution curves

Page 31: Principles of Statistics

Normal distribution

• Many other distributions, e.g. the binomial and Poisson distributions, can be approximated by the normal distribution under certain conditions
• The advantage of this normal approximation is that standard probability tables for the normal distribution can be used for binomial or Poisson problems

Page 32: Principles of Statistics

Standard Error

• The spread of observations in one experiment yields a single mean and standard deviation
• Repeated sampling from the same population will result in an approximately normal distribution of means, with a ‘grand’ mean (mean of means) and a spread called the STANDARD ERROR
• Standard error = standard deviation ÷ √n, where n is the sample size
• Sample size and variability of measurements determine the magnitude of the standard error
• Used to construct confidence intervals
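The relationship between repeated sampling and the standard error can be sketched with a simulation. Everything here is assumed for illustration: a simulated population with mean 50 and standard deviation 12, a sample size of 36, and 2,000 repeated samples.

```python
import math
import random
import statistics

random.seed(3)

# Hypothetical population (simulated): mean 50, standard deviation 12.
population = [random.gauss(50, 12) for _ in range(20_000)]
n = 36  # sample size

# One experiment: a single sample gives one mean, one SD, and an estimated SE.
sample = random.sample(population, n)
sd = statistics.stdev(sample)
se = sd / math.sqrt(n)  # standard error = standard deviation / sqrt(n)

# Repeated sampling: the means of many samples form their own distribution.
means = [statistics.fmean(random.sample(population, n)) for _ in range(2_000)]
grand_mean = statistics.fmean(means)
spread_of_means = statistics.stdev(means)

print(f"SE estimated from one sample:  {se:.2f}")
print(f"SD of 2,000 sample means:      {spread_of_means:.2f}")
print(f"grand mean (mean of means):    {grand_mean:.2f}")
```

With these settings the theoretical standard error is 12 ÷ √36 = 2, and both the single-sample estimate and the spread of the simulated means should land near that value.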

Page 33: Principles of Statistics

Binomial probabilities

The distribution curve has a mean (M) = np and standard deviation (S) = √(npq), where n is the number of trials, p is the probability of outcome A and q is the probability of outcome B.

Refers to situations where there are 2 alternatives (success/failure; black/white; heads/tails; alive/dead etc.), i.e. p + q = 1

Used to determine whether results observed in trials/experiments would have occurred randomly

Page 34: Principles of Statistics

Normal approximation to the binomial probability distribution

Large n (number of trials)

p not close to 0 or 1 (the outcome is neither rare nor almost certain)

Product np > 5 (and likewise nq > 5)
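To see how well the approximation works when these conditions hold, the exact binomial probability can be compared with the normal one. The values n = 100 and p = 0.3 are invented, and the continuity correction (adding 0.5) is a common refinement that is not on the slide.

```python
import math
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for a binomial variable with n trials and probability p."""
    return sum(math.comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(k + 1))

n, p = 100, 0.3  # hypothetical trial: large n, p not extreme
q = 1 - p
conditions_met = n * p > 5 and n * q > 5  # rule-of-thumb check

mean = n * p                  # M = np
sd = math.sqrt(n * p * q)     # S = sqrt(npq)

exact = binomial_cdf(25, n, p)
# Normal approximation to P(X <= 25); the +0.5 is a continuity correction.
approx = NormalDist(mean, sd).cdf(25 + 0.5)

print(f"conditions met: {conditions_met}")
print(f"exact:  {exact:.4f}")
print(f"approx: {approx:.4f}")
```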

Page 35: Principles of Statistics

Example of a problem to be solved by using binomial theory / method:

• What is the probability of the number of successes (As) in the sample of n trials?
• Formula to calculate the probability:

$P(r) = {}^{n}C_{r}\; p^{r}\,(1-p)^{n-r}$

where r is the number of As (successes), n is the number of trials, p is the probability of a success and ${}^{n}C_{r}$ is known as the binomial coefficient.

• Description of binomial problem in a later lecture.
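The binomial formula, together with the mean np and standard deviation √(npq) given earlier, can be evaluated directly. The sketch below uses an invented example of 10 coin tosses (n = 10, p = 0.5); `binomial_pmf` is a hypothetical helper name.

```python
import math

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n trials: nCr * p^r * (1-p)^(n-r)."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 10, 0.5  # hypothetical example: 10 tosses of a fair coin

prob_3 = binomial_pmf(3, n, p)            # exactly 3 heads
total = sum(binomial_pmf(r, n, p) for r in range(n + 1))

mean = n * p                               # M = np
sd = math.sqrt(n * p * (1 - p))            # S = sqrt(npq)

print(f"P(3 successes) = {prob_3:.4f}")
print(f"sum over all r = {total:.4f}")
print(f"mean = {mean}, sd = {sd:.3f}")
```

The probabilities over all possible values of r sum to 1, which is a quick sanity check on the formula.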

Page 36: Principles of Statistics

Poisson Distribution

Useful for calculating probabilities of rare events

No a priori estimate can be made of the probability that the event will occur

$\Pr(n) = \dfrac{e^{-m}\, m^{n}}{n!}$

where Pr(n) is the probability of occurrence of n such events, m is the mean number of events, e is a mathematical constant (approximately 2.718) and n! is the factorial of n

The mean and the variance of a Poisson distribution are both equal to m, so the standard deviation is √m
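The Poisson formula can be evaluated in the same way. The sketch assumes an invented mean of m = 2 events and checks numerically that the mean and variance both come out at m, consistent with the standard deviation of √m used in the normal approximation.

```python
import math

def poisson_pmf(n, m):
    """Probability of exactly n events when the mean number of events is m."""
    return math.exp(-m) * m**n / math.factorial(n)

m = 2.0  # hypothetical mean number of events

p0 = poisson_pmf(0, m)  # probability of no events
p3 = poisson_pmf(3, m)  # probability of exactly 3 events

# Numerical check of the mean and variance (terms beyond n = 59 are negligible).
counts = range(60)
mean = sum(n * poisson_pmf(n, m) for n in counts)
variance = sum((n - mean) ** 2 * poisson_pmf(n, m) for n in counts)

print(f"Pr(0) = {p0:.4f}")
print(f"Pr(3) = {p3:.4f}")
print(f"mean = {mean:.3f}, variance = {variance:.3f}, sd = {math.sqrt(variance):.3f}")
```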

Page 37: Principles of Statistics

Poisson Probability Distribution

Must meet 4 conditions, i.e.

Discontinuous (discrete) data

Chance of a result is SMALL

Chance of a result is independent of previous results

A large number of tests can be performed

Approaches normal with mean m and standard deviation √m for mean values > 30

Page 38: Principles of Statistics

Statistical tests

Used for testing of statistical significance

To decide whether to reject, or fail to reject, your null hypothesis

The type of test you use depends on the type of data as well as whether the data approximates the normal curve

Page 39: Principles of Statistics

General guide to selecting an appropriate statistical test in univariate analysis

| Number of groups | Independent variable | Dependent variable | Parametric test | Non-parametric test |
|---|---|---|---|---|
| two (independent) | Categorical (e.g. smokers and non-smokers) | Categorical (e.g. CHD and no CHD) | - | Chi-square test, Fisher’s Exact test |
| two (independent) | Categorical (e.g. smokers and non-smokers) | Categorical (e.g. CHD and no CHD) | - | Mantel-Haenszel test (if a third variable is controlled for, e.g. age group) |
| two (independent) | Categorical (e.g. smokers and non-smokers) | Numerical (e.g. PEFR level) | Independent t test | Mann-Whitney test (Wilcoxon Rank Sum test) |
| two (dependent) | Categorical (e.g. pre-intervention and post-intervention) | Categorical (e.g. behavioral changes) | - | McNemar test |
| two (dependent) | Categorical (e.g. pre-treatment and post-treatment) | Numerical (e.g. blood pressure) | Paired t test | Wilcoxon Signed Rank test |
| >two (independent) | Categorical (e.g. race) | Categorical (e.g. diabetic and non-diabetic) | - | Chi-square test |
| >two (independent) | Categorical (e.g. race) | Numerical (e.g. blood sugar level) | One-way ANOVA | Kruskal-Wallis test |
| two | Numerical (e.g. height) | Numerical (e.g. weight) | Pearson correlation & Simple Linear Regression | Spearman correlation |

Page 40: Principles of Statistics

Confidence Intervals (CI)

• A CI estimates the range within which the “true” population parameter is believed to lie, at a given level of confidence (95%, 99% or more)
• Parameters of interest are usually means, proportions, differences between means or proportions, regression coefficients, correlation coefficients, relative risks
• CIs are extremely useful in assessing the clinical significance of a given result
• 95% CI = sample estimate ± 1.96 x SE
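The 95% CI formula can be applied to a sample mean as follows. The ten blood pressure readings are invented for illustration. The 1.96 multiplier assumes the normal approximation; with a sample this small, a t-distribution multiplier would normally be preferred.

```python
import math
import statistics

# Hypothetical sample: 10 systolic blood pressure readings in mmHg (invented data).
sample = [118, 125, 130, 122, 135, 128, 121, 127, 133, 124]

mean = statistics.mean(sample)          # sample estimate
sd = statistics.stdev(sample)           # sample standard deviation
se = sd / math.sqrt(len(sample))        # standard error = SD / sqrt(n)

# 95% CI = sample estimate +/- 1.96 x SE
lower = mean - 1.96 * se
upper = mean + 1.96 * se

print(f"mean = {mean:.1f}, SE = {se:.2f}")
print(f"95% CI: {lower:.1f} to {upper:.1f}")
```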

Page 41: Principles of Statistics

THANK YOU