Introduction to Distributions and Probability
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Statistics for Health Research
Overview
•Distributions
•History of probability
•Definitions of probability
•Random variable
•Probability density function
•Normal, Binomial and Poisson distributions
Introduction to Probability Density Functions
•Normal Distribution / Gaussian / Bell curve
•Poisson named after French Mathematician
•Binomial related to binary factors (Bernoulli Trials)
Early use of Normal Distribution
•Gauss was a German mathematician who solved the mystery of where Ceres would appear after it disappeared behind the Sun.
•He assumed the errors formed a Normal distribution and managed to accurately predict the orbit of Ceres
What is the relationship between the Normal or Gaussian distribution and probability?
Probability
“The probable is what usually happens”
Aristotle
“I cannot believe that God plays dice with the cosmos”
Albert Einstein
Origins of Probability
• Early interest in permutations: Vedic literature, 400 BC
• Distinguished origins in betting and gambling!
• Pascal and Fermat studied division of stakes in gambling (1654)
• Enlightenment – seen as helping public policy, social equity
• Astronomy – Gauss (1801)
• Social and genetic – Galton (1885)
• Experimental design – Fisher (1936)
Types of Probability
Two basic definitions:
1) Frequentist (Classical): proportion of times an event occurs in a long series of ‘trials’
2) Subjectivist (Bayesian): strength of belief in event happening
Frequentists vs. Bayesians
•Two entrenched camps
•Scientists tend to use the frequentist approach
•Bayesians gaining ground
•Most scientists use frequentist methods but incorrectly interpret results in a Bayesian way!
Frequentists
•Consider tossing a fair coin
• In any trial, event may be a ‘head’ or ‘tail’ i.e. binary
•Repeated tossing gives series of ‘events’
• In long run prob of heads=0.5
THTTHHHHTHHHTHHHTTHTTTHHTTHTTHHHTTTHHTHHHTTTTTHHH
Running proportion of heads: 0.6, 0.56, 0.52
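The long-run idea can be illustrated with a short simulation. The sketch below is an illustration only and not part of the original slides: it assumes Python's standard `random` module, an arbitrary seed, and a hypothetical helper name `running_proportion`.

```python
import random

def running_proportion(tosses):
    """Return the proportion of heads after each toss (heads coded as 1)."""
    heads = 0
    proportions = []
    for i, toss in enumerate(tosses, start=1):
        heads += toss
        proportions.append(heads / i)
    return proportions

# Simulate a fair coin; the seed is arbitrary and only makes the run repeatable.
random.seed(1)
tosses = [random.randint(0, 1) for _ in range(10_000)]
props = running_proportion(tosses)

for n in (10, 25, 50, 1_000, 10_000):
    print(f"after {n:>6} tosses: proportion of heads = {props[n - 1]:.3f}")
```

Early proportions can wander some way from 0.5; it is only as the series gets long that the proportion settles close to the long-run probability.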
Frequentist Probability
• Note the difference between ‘long run’ probability and an individual trial
• In an individual trial a head either occurs (X=1) or does not occur (X=0)
• Patient either survives or dies following an MI
• Prob of dying after MI ≈ 30% based on a previous long series from a population of individuals who experienced MI
Subjective Probability
•Based on strength of belief
•But more akin to thinking of clinician making a diagnosis
•Faced with patient with chest pain, based on past experience, believes prob of heart disease is 20%
•Person tossing coin believes prob of head is 1/2
Comparison of definitions of Probability
• Problems of subjective probability
• Probability for same patient can vary even with same clinician
• Person can believe prob of head is 0.1 even if it is a fair coin
• Subjectivists argue they are more realistic
• This course sticks to ‘frequentist’ and ‘model-based’ methods of probability
Random Variable
•Consider rolling 2 dice and we want to summarise the probabilities of all possible outcomes
•We call the outcome (the total of the two dice) a random variable X, which can take any whole-number value, in this case from 2 to 12
•Enumerate all probabilities in sample space S
•P(2) = 1/6 x 1/6 = 1/36, P(3) = 2/36, P(4) = 3/36, etc.
Probability Density Function for rolling two dice
X:     2     3     4     5     6     7     8     9     10    11    12
P(X):  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
[Figure: bar chart of P(X) against X, rising from 1/36 at X=2 to 6/36 at X=7 and falling back to 1/36 at X=12]
What is the probability of getting 12? Answer: 1/36
What is the probability of getting more than 8? Answer: 10/36
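The two-dice probabilities and both answers can be checked by enumerating the sample space. This is a minimal sketch using only the Python standard library; the variable names are illustrative and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of two fair six-sided dice.
counts = {}
for a, b in product(range(1, 7), repeat=2):
    counts[a + b] = counts.get(a + b, 0) + 1

# Probability of each total, kept as exact fractions.
pmf = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

p_12 = pmf[12]
p_more_than_8 = sum(p for total, p in pmf.items() if total > 8)

print(p_12)            # 1/36
print(p_more_than_8)   # 5/18, i.e. 10/36
```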
Probability Density Function for continuous variable
Consider distribution of weight in kg; all values possible, not just discrete
[Figure: probability density curve; x-axis: Weight in kilograms (20 to 120); y-axis: Probability]
Probability Density Function in SPSS
Use Analyze / Descriptive Statistics / Frequencies, select no table, and use the Charts box
Data from ‘LDL Data.sav’ of baseline LDL cholesterol
Normal Distribution
Note that a Normal or Gaussian curve is defined by two parameters:
Mean µ and Standard Deviation σ
And often written as N(µ, σ)
Hence any Normal distribution has the mathematical form
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Cannot be integrated in closed form, so the area under the curve is obtained by numerical integration and tabulated!
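As a rough sketch of what "numerical integration" means here, the area under a Normal curve can be approximated with the trapezium rule. The function names, the step count and the choice of rule are assumptions made for illustration; published tables and statistical packages use more refined methods.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def area_under_curve(a, b, mu=0.0, sigma=1.0, steps=10_000):
    """Approximate the area between a and b with the trapezium rule."""
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, mu, sigma)
    return total * h

# Area from the mean up to 1.96 SDs above it, as found in printed tables.
print(round(area_under_curve(0.0, 1.96), 4))   # 0.475
```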
Normal Distribution
As noted earlier the curve is symmetrical about the mean and so p(x > mean) = 0.5 or 50%
And p(x < mean) = 0.5 or 50%
And p(a < x < b) = p(x < b) – p(x < a)
Normal Distribution and Probabilities
So we now have a way of working out the probability of any value or range of values of a variable IF a Normal distribution is a reasonable fit to the data
p(a < x < b) = p(x < b) – p(x < a), which is the area under the curve between a and b
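The rule that the probability between a and b is a difference of two cumulative probabilities can be tried out with Python's `statistics.NormalDist`. The weight distribution below (mean 70 kg, SD 10 kg) is a hypothetical example rather than data from the course, and `prob_between` is an illustrative helper name.

```python
from statistics import NormalDist

# Hypothetical example: weights assumed to follow a Normal with mean 70 kg, SD 10 kg.
weight = NormalDist(mu=70, sigma=10)

def prob_between(dist, a, b):
    """P(a < X < b) as the difference of two cumulative probabilities."""
    return dist.cdf(b) - dist.cdf(a)

print(round(prob_between(weight, 60, 85), 3))   # area between 60 kg and 85 kg
print(round(1 - weight.cdf(70), 3))             # P(X > mean) = 0.5
```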
Normal Distribution
Most of area lies between +1 and -1 SD (68%)
The large majority lie between +2 and -2 SDs (95%)
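These coverage figures can be reproduced from the error function in Python's `math` module. This is a small check rather than anything from the original slides; `coverage_within` is a hypothetical helper name.

```python
import math

def coverage_within(k):
    """Proportion of a Normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 1.96, 2):
    print(f"within ±{k} SD: {coverage_within(k):.1%}")
```

The exact 95% point is at 1.96 SDs; 2 SDs is the usual rounded rule of thumb.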
Probability Density Function (PDF) = Normal Distribution
How well does my data fit a Normal Distribution?
Note median and mean virtually the same
Skewness = 0.039, close to zero
Skewness is measure of symmetry (0=perfect symmetry)
Eyeball test - fitted normal curve looks good!
Statistics: Baseline LDL
N Valid                   1383
N Missing                 0
Mean                      3.454363
Median                    3.506214
Std. Deviation            .9889157
Skewness                  .039
Std. Error of Skewness    .066
Minimum                   .3345
Maximum                   7.5650
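For readers who want to see what sits behind the skewness figures, the sketch below uses the widely used adjusted sample skewness and its standard error under Normality. It is an assumption that this is exactly the estimator behind the output above, although the standard error formula does reproduce the reported .066 for N = 1383.

```python
import math

def skewness(values):
    """Adjusted Fisher-Pearson sample skewness (0 for perfect symmetry)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    g1 = m3 / m2 ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)

def skewness_standard_error(n):
    """Standard error of the sample skewness under Normality."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

print(round(skewness_standard_error(1383), 3))   # 0.066
print(skewness([1, 2, 3, 4, 5]))                  # 0.0 for symmetric data
```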
Try Q-Q plot in Analyze / Descriptive Statistics / Q-Q plot
Plot compares Expected Normal distribution with real data and if data lies on line y = x then the Normal Distribution is a good fit
Note still an eyeball test!
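The points on a Q-Q plot can be computed by hand as a way of seeing what the plot compares. The sketch below pairs each sorted observation with the value expected under a fitted Normal distribution; the plotting positions (i - 0.5)/n are one common convention and may differ from the package default, and the sample is made up for illustration.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each sorted observation with its expected Normal value."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Plotting positions (i - 0.5) / n are an assumed convention.
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(data, expected))

# Hypothetical small sample, not the LDL data.
sample = [2.1, 2.9, 3.2, 3.4, 3.5, 3.7, 4.0, 4.6]
for observed, expected in qq_points(sample):
    print(f"observed {observed:.2f}   expected Normal {expected:.2f}")
```

If the observed and expected columns track each other closely, the points fall near the line y = x.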
Is this a good fit?
I used to be Normal until I discovered Kolmogorov-Smirnov!
Eyeball Test indicates distribution is approximately Normal but K-S test is significant indicating discrepancy compared to Normal
WARNING: DO NOT RELY ON THIS TEST
One-Sample Kolmogorov-Smirnov Test: Baseline LDL
N                                              1383
Normal Parameters (a, b)    Mean               3.454363
                            Std. Deviation     .9889157
Most Extreme Differences    Absolute           .043
                            Positive           .043
                            Negative           -.043
Kolmogorov-Smirnov Z                           1.617
Asymp. Sig. (2-tailed)                         .011
a. Test distribution is Normal.
b. Calculated from data.
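To make the K-S output less of a black box, the sketch below computes the largest gap D between the empirical and reference cumulative distributions, and the asymptotic two-sided p-value for Z = sqrt(n) x D from the limiting Kolmogorov series. The function names and the cut-off for very small Z are illustrative choices, and no adjustment is made for estimating the mean and SD from the same data.

```python
import math
from statistics import NormalDist

def ks_statistic(values, dist):
    """Largest gap D between the empirical CDF and the reference CDF."""
    data = sorted(values)
    n = len(data)
    d = 0.0
    for i, x in enumerate(data, start=1):
        f = dist.cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def asymptotic_p_value(z, terms=100):
    """Two-sided p-value from the limiting Kolmogorov distribution."""
    if z < 0.2:
        return 1.0  # the series converges slowly here and the p-value is effectively 1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

# Using the Z value reported for baseline LDL.
print(round(asymptotic_p_value(1.617), 3))   # 0.011
```

With N = 1383 even a maximum gap of about 0.043 gives a small p-value, which is one reason the slide warns against relying on this test.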
Consider the distribution of survival times following surgery for colorectal cancer
Note median=835 days and mean=848
Skewness = 2.081, very skewed (> 1.0)
Strong tail to right! Approximately Normal?
Statistics: Time from Surgery
N Valid                   476
N Missing                 0
Mean                      848.3908
Median                    835.5000
Std. Deviation            582.39657
Skewness                  2.081
Std. Error of Skewness    .112
Minimum                   14.00
Maximum                   5763.00
Try a log transformation for right (positive) skewed data?
Better, but now skewed to left (skewness = -1.504)!
Statistics: logtime
N Valid                   476
N Missing                 0
Mean                      6.4346
Median                    6.7286
Std. Deviation            .95059
Skewness                  -1.504
Std. Error of Skewness    .112
Minimum                   2.67
Maximum                   8.66
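The effect of a log transformation on right-skewed data can be tried on simulated values. The sketch below generates hypothetical 'survival times' from a lognormal distribution (not the colorectal data, and with arbitrary parameters and seed) and compares a simple moment-based skewness before and after taking logs.

```python
import math
import random

def skewness(values):
    """Moment-based sample skewness (0 for perfect symmetry)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Simulated right-skewed times in days (hypothetical parameters and seed).
random.seed(42)
times = [random.lognormvariate(6.5, 0.8) for _ in range(476)]
log_times = [math.log(t) for t in times]

print(f"skewness before log: {skewness(times):.2f}")
print(f"skewness after log:  {skewness(log_times):.2f}")
```

Simulated lognormal data become close to symmetric after the transformation; real data, as in the table above, may over-correct and end up skewed the other way.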
Examples of skewed distributions in Health Research
Discrete random variables – hospital admissions, cigarettes smoked, alcohol consumption, costs
Continuous RV – BMI, cholesterol, BP
The Binomial Distribution
• ‘Binomial’ means ‘two numbers’.
• Outcomes of health research are often measured by whether they have occurred or not.
• For example, recovered from disease, admitted to hospital, died, etc.
• May be modelled by assuming that the number of events in n trials has a binomial distribution with a fixed probability of event p
The Binomial Distribution
• Based on work of Jakob Bernoulli, a Swiss mathematician
• Refused a church appointment and instead studied mathematics
• Early use was for games of chance but now used in every human endeavour
• When n = 1 this is called a Bernoulli trial
• Binomial distribution is distribution for a series of Bernoulli trials
The Binomial Distribution
• Binomial distribution written as B(n, p) where n is the total number of trials and p = prob of an event in each trial
• This is a Binomial Distribution with p=0.25 and n=20
[Figure: Binomial distribution with p=0.25 and n=20; x-axis: Successes (0 to 20); y-axis: Probability of R Successes (0.00 to 0.20)]
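The probabilities plotted for p=0.25 and n=20 can be recomputed from the standard binomial formula, P(R) = C(n, R) p^R (1 - p)^(n - R). The formula is not printed on the slides, so treat this as a supporting sketch; the text bar chart is only a rough stand-in for the figure.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 20, 0.25
pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]

# One row per number of successes, with a crude bar of '#' characters.
for r, prob in enumerate(pmf):
    print(f"{r:2d} {prob:.3f} " + "#" * round(prob * 100))

print("most likely number of successes:", pmf.index(max(pmf)))   # 5
```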
The Poisson Distribution
Poisson distribution (1838), named after its inventor Siméon Poisson who was a French mathematician. He found that if we have a rare event (i.e. p is small) and we know the expected or mean (µ) number of occurrences, the probabilities of R = 0, 1, 2 ... events are given by:
$P(R) = \frac{e^{-\mu}\,\mu^{R}}{R!}$
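The Poisson formula translates directly into a few lines of code. In this sketch the mean of 2 events is a hypothetical value chosen for illustration, and `poisson_pmf` is an illustrative function name.

```python
import math

def poisson_pmf(r, mu):
    """P(R = r) for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu ** r / math.factorial(r)

# Hypothetical mean of 2 events per period.
mu = 2.0
for r in range(6):
    print(f"P({r}) = {poisson_pmf(r, mu):.4f}")
```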
The Poisson Distribution
Note similarity to Binomial
In fact when p is small and n is large B(n, p) ~ P(µ = np)
Also for large values of µ: P(µ) ~ N(µ, µ), i.e. Normal with mean µ and variance µ (SD = √µ)
Hence if n and p not known could use Poisson instead
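Both approximations can be checked numerically. The values n = 1000, p = 0.002 and µ = 100 below are hypothetical choices for illustration, and the 0.5 added to the Normal cut-off is the usual continuity correction, which the slides do not mention.

```python
import math
from statistics import NormalDist

def binomial_pmf(r, n, p):
    """Probability of exactly r events in n trials."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)

def poisson_pmf(r, mu):
    """Probability of exactly r events for a Poisson with mean mu."""
    return math.exp(-mu) * mu ** r / math.factorial(r)

# Rare event, many trials: B(1000, 0.002) against Poisson with mean np = 2.
n, p = 1000, 0.002
for r in range(5):
    print(r, round(binomial_pmf(r, n, p), 4), round(poisson_pmf(r, n * p), 4))

# Large mean: Poisson(100) against a Normal with mean 100 and variance 100 (SD 10).
mu = 100
normal = NormalDist(mu, math.sqrt(mu))
poisson_cdf_110 = sum(poisson_pmf(r, mu) for r in range(111))
print(round(poisson_cdf_110, 3), round(normal.cdf(110.5), 3))  # 110.5: continuity correction
```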
The Poisson Distribution
In health research often used to model the number of events assumed to be random:
Number of hip replacement failures,
Number of cases of C. diff infection,
Diagnoses of leukaemia around nuclear power stations,
Number of H1N1 cases in Scotland,
Etc.
Summary
•Many of the variables measured in Health Research form distributions which approximate to common distributions with known mathematical properties
•Normal, Poisson, Binomial, etc.
•Note a relationship for all, centred around the exponential function, where e ≈ 2.718
•All belong to the Exponential Family of distributions
•These probability distributions are critical to applying statistical methods
[Figure: histogram of RANNORM; Mean = -.04, Std. Dev = .96, N = 501]
SPSS Practical
• Read in data file ‘LDL Data.sav’
• Consider adherence to statins, baseline LDL, min Chol achieved, BMI, duration of statin use
• Assess distributions for normality
• If non-normal consider a transformation
• Try to carry out Q-Q plots