Description of measurement data
Prof. Yi-xiong Lei (1021305)
1. Distribution of frequency
A frequency table or graph is the usual way to summarize data or describe their frequency distribution. The types of distribution:
Normal distribution
Skewed distribution
Positively skewed distribution
Negatively skewed distribution
When faced with a large amount of data, we want to describe its most important features concisely. Usually we describe the data from two aspects:
Measures of location: central tendency (central position)
Measures of spread: tendency of dispersion (variation)
* Normal distribution
Table 2-1. Frequency Distribution of Red Blood Cells (×10^12/L) among 130 Normal Male Adults in Some District

RBCs          Mark                Frequency
3.70~         ||                      2
3.90~         ||||                    4
4.10~         正 ||||                 9
4.30~         正正正 |               16
4.50~         正正正正 ||            22
4.70~         正正正正正             25
4.90~         正正正正 |             21
5.10~         正正正 ||              17
5.30~         正 ||||                 9
5.50~         ||||                    4
5.70~5.90     |                       1
Total         -                     130

(Tally marks: each 正 stands for 5 and each | for 1.)
* Positively skewed distribution
Table 2-2. Frequency Distribution of Hair Hg Value (μg/g) among 238 Normal Adults

Hair Hg value   Frequency   Cumulative frequency   Cumulative frequency (%)
(1)             (2)         (3)                    (4) = (3)/238
0.3~            20          20                     8.4
0.7~            66          86                     36.1
1.1~            60          146                    61.3
1.5~            48          194                    81.5
1.9~            18          212                    89.1
2.3~            16          228                    95.8
2.7~            6           234                    98.3
3.1~            1           235                    98.7
3.5~            0           235                    98.7
3.9~            3           238                    100.0
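The cumulative columns of Table 2-2 can be rebuilt from the frequencies alone. The following is a minimal sketch in Python; the variable names are illustrative and not part of the original slides.

```python
from itertools import accumulate

# Class lower limits and frequencies copied from Table 2-2 (hair Hg, ug/g).
lower_limits = [0.3, 0.7, 1.1, 1.5, 1.9, 2.3, 2.7, 3.1, 3.5, 3.9]
freq = [20, 66, 60, 48, 18, 16, 6, 1, 0, 3]

n = sum(freq)                                       # total number of adults
cum_freq = list(accumulate(freq))                   # column (3)
cum_pct = [round(100 * c / n, 1) for c in cum_freq]  # column (4) = (3)/n

for lo, f, c, p in zip(lower_limits, freq, cum_freq, cum_pct):
    print(f"{lo:.1f}~  {f:>3d}  {c:>3d}  {p:>5.1f}")
```

Most of the cases fall in the low classes and a long tail stretches to the right, which is what makes this distribution positively skewed.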
* Negatively skewed distribution
Table 2-3. Frequency Distribution of Patients Who Died of Malignant Tumors in Some Year and District

Age (yr.)   No. of deaths   Cumulative frequency   Cumulative frequency (%)
0~          5               5                      0.42
10~         12              17                     1.41
20~         15              32                     2.66
30~         76              108                    8.98
40~         189             297                    24.69
50~         234             531                    44.14
60~         386             917                    76.23
70~         286             1203                   100.00
Figure 2-1. Frequency Distribution of Serum Cholesterol (mg/dl)
among 200 Normal Adults in Some District
2. Average
What measure is used to describe the central tendency? It is the average, which includes the mean, the geometric mean and the median.
Mean, symbolized by $\bar{x}$ (x-bar)
Geometric mean, symbolized by G
Median, symbolized by M
1). Mean is suitable for data that follow a normal distribution, or at least a symmetric distribution.
Formula (1): $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x}{n}$
Formula (2): $\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_k x_k}{f_1 + f_2 + \dots + f_k} = \frac{\sum f x}{n}$, where $n = \sum f$
Formula (1) is for original data (direct method).
Formula (2) is for a frequency table (weighted method).
Table 2-4. Frequency Distribution of Red Blood Cells (×10^12/L) among 140 Normal Male Adults in Some District

RBCs          Middle value (X)   Frequency (f)   fX
3.80~         3.90               2               7.8
4.00~         4.10               6               24.6
4.20~         4.30               11              47.3
4.40~         4.50               25              112.5
4.60~         4.70               32              150.4
4.80~         4.90               27              132.3
5.00~         5.10               17              86.7
5.20~         5.30               13              68.9
5.40~         5.50               4               22.0
5.60~         5.70               2               11.4
5.80~6.00     5.90               1               5.9
Total         -                  140 (∑f)        669.8 (∑fX)

$\bar{x} = \frac{\sum f X}{\sum f} = \frac{669.8}{140} = 4.78$ (×10^12/L)
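The weighted method of Formula (2) applied to Table 2-4 can be sketched in a few lines of Python. The list names are illustrative; the class midpoints stand in for the individual observations, as they do in the table.

```python
# Class midpoints (X) and frequencies (f) copied from Table 2-4.
midpoints = [3.90, 4.10, 4.30, 4.50, 4.70, 4.90, 5.10, 5.30, 5.50, 5.70, 5.90]
freq = [2, 6, 11, 25, 32, 27, 17, 13, 4, 2, 1]

n = sum(freq)                                        # sum of f
sum_fx = sum(f * x for f, x in zip(freq, midpoints))  # sum of f*X
mean = sum_fx / n                                    # weighted mean

print(n, round(sum_fx, 1), round(mean, 2))
```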
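2). Geometric mean is suitable for data that follow a positively skewed distribution or a log-normal distribution.
Formula (1): $G = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} = \lg^{-1}\left(\frac{\lg x_1 + \lg x_2 + \dots + \lg x_n}{n}\right) = \lg^{-1}\left(\frac{\sum \lg x}{n}\right)$
Formula (2): $G = \lg^{-1}\left(\frac{f_1 \lg x_1 + f_2 \lg x_2 + \dots + f_k \lg x_k}{\sum f}\right) = \lg^{-1}\left(\frac{\sum f \lg x}{\sum f}\right)$
Formula (1) is for original data (direct method).
Formula (2) is for a frequency table (weighted method).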
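There are 5 serum antibody samples whose concentrations are 1:10, 1:100, 1:1000, 1:10000 and 1:100000. What is the average concentration?
Wrong (arithmetic mean of the reciprocals):
$\bar{x} = \frac{\sum x}{n} = \frac{10 + 100 + \dots + 100000}{5} = 22222$
Right (geometric mean):
$G = \lg^{-1}\left(\frac{\lg x_1 + \lg x_2 + \dots + \lg x_n}{n}\right) = \lg^{-1}\left(\frac{\sum \lg x}{n}\right) = \lg^{-1}\left(\frac{1 + 2 + 3 + 4 + 5}{5}\right) = \lg^{-1} 3 = 1000$, i.e. 1:1000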
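The contrast between the two averages in this example can be checked numerically. This is a small Python sketch with illustrative variable names; it works on the reciprocals of the titers, as the slide does.

```python
import math

# Reciprocals of the five titers 1:10 ... 1:100000.
reciprocals = [10, 100, 1000, 10000, 100000]
n = len(reciprocals)

# Arithmetic mean: dominated by the largest value.
arithmetic_mean = sum(reciprocals) / n

# Geometric mean: antilog of the mean of the base-10 logarithms.
mean_log = sum(math.log10(x) for x in reciprocals) / n
geometric_mean = 10 ** mean_log

print(arithmetic_mean, geometric_mean)
```

The arithmetic mean is pulled toward the single largest titer, while the geometric mean sits at the middle dilution step, which is why it is the appropriate average for data spaced in multiples.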
Table 2-5. Specific serum antibody concentrations of 52 susceptible children one month after immunization with measles vaccine

Ab concentration   Children (f)   Reciprocal (x)   lg x    f lg x
1:40               3              40               1.602   4.81
1:80               22             80               1.903   41.87
1:160              17             160              2.204   37.47
1:320              9              320              2.505   22.55
1:640              0              640              2.806   0.00
1:1280             1              1280             3.107   3.11
Total              52 (∑f)        -                -       109.79

Average antibody concentration:
$G = \lg^{-1}\left(\frac{\sum f \lg x}{\sum f}\right) = \lg^{-1}\left(\frac{109.79}{52}\right) = 129.2$, i.e. 1:129
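The weighted geometric mean of Table 2-5 can be reproduced with the sketch below (Python, illustrative names). It uses full-precision logarithms, so the intermediate sum and the result differ in the last digit from the slide's 109.79 and 129.2, which were obtained from logarithms rounded to three decimals.

```python
import math

# Reciprocal titers (x) and number of children (f) copied from Table 2-5.
reciprocals = [40, 80, 160, 320, 640, 1280]
children = [3, 22, 17, 9, 0, 1]

n = sum(children)  # sum of f
sum_f_lgx = sum(f * math.log10(x) for f, x in zip(children, reciprocals))
geometric_mean = 10 ** (sum_f_lgx / n)  # antilog of the weighted mean log

print(n, round(sum_f_lgx, 2), round(geometric_mean, 1))
```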
3). Median is suitable for all kinds of data, but it lends itself less well to further analysis than the mean.
For original data (direct method), sort the observations in ascending order and use:
$M = X_{(n+1)/2}$ (n is odd)
$M = \frac{1}{2}\left(X_{n/2} + X_{n/2+1}\right)$ (n is even)
For example:
There are 9 cases whose latent periods are 2, 3, 3, 3, 4, 5, 6, 9, 16 days. Calculate their average latent period.
$M = X_{(n+1)/2} = X_{(9+1)/2} = X_5 = 4$ (days)
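The direct method for the median can be written as a short function. This is a sketch in Python; the function name `median` is chosen here for illustration, and the code uses 0-based indexing where the formulas use 1-based positions.

```python
def median(values):
    """Median of raw data by the direct method (odd and even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)-th ordered value (1-based).
        return ordered[mid]
    # n even: average of the (n/2)-th and (n/2 + 1)-th ordered values.
    return (ordered[mid - 1] + ordered[mid]) / 2

latent_days = [2, 3, 3, 3, 4, 5, 6, 9, 16]
print(median(latent_days))
```

The extreme value of 16 days does not move the median, whereas it would raise the arithmetic mean of the same data.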
4). Median and percentile for data from a frequency table
From a frequency table we do not know the exact value of the median, so we use the following formula for the median or any percentile (frequency table method or percentile method):
$P_x = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right)$
x: the percentile;
L: the lower limit of the group in which the percentile is located;
i: the class interval;
$f_x$: the frequency of that group;
n: the total number of cases;
$\sum f_L$: the cumulative frequency below L.
For the median, x = 50 ($P_{50} = M$), and the formula becomes:
$M = L + \frac{i}{f}\left(\frac{n}{2} - \sum f_L\right)$
Table 2-6. Calculation of the median and percentiles of the latent period of food poisoning among 164 cases

Latent period (hours)   Cases (f)   Cumulative frequency (Σf)   Cumulative frequency (%)
0~                      25          25                          15.2
12~                     58          83                          50.6
24~                     40          123                         75.0
36~                     23          146                         89.0
48~                     12          158                         96.3
60~                     5           163                         99.4
72~84                   1           164                         100.0
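Median calculation: from Table 2-6, the cumulative frequency of 50% falls within the group "12~", so L = 12, i = 12, f = 58, $\sum f_L$ = 25, n = 164.
$M = L + \frac{i}{f}\left(\frac{n}{2} - \sum f_L\right) = 12 + \frac{12}{58}\left(\frac{164}{2} - 25\right) = 23.8$ (hrs)
Percentile ($P_x$) calculation: for $P_{95}$, x = 95 and the cumulative frequency of 95% falls within the group "48~", so L = 48, i = 12, $f_x$ = 12, $\sum f_L$ = 146, n = 164.
$P_{95} = 48 + \frac{12}{12}\left(164 \times 95\% - 146\right) = 57.8$ (hrs)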
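The percentile formula can be turned into a small routine that locates the group and interpolates within it. The sketch below is in Python; the function name `grouped_percentile` and its parameters are illustrative, and it assumes equal class widths, as in Table 2-6.

```python
def grouped_percentile(lower_limits, width, freq, pct):
    """P_x = L + i / f_x * (n * x% - cumulative frequency below L)."""
    n = sum(freq)
    target = n * pct / 100        # n * x%
    cum_below = 0                 # cumulative frequency below L
    for lower, f in zip(lower_limits, freq):
        # The percentile lies in the first group whose cumulative
        # frequency reaches the target count.
        if f > 0 and cum_below + f >= target:
            return lower + width / f * (target - cum_below)
        cum_below += f
    raise ValueError("pct must lie between 0 and 100")

# Latent period of food poisoning, Table 2-6 (class width 12 hours).
lower_limits = [0, 12, 24, 36, 48, 60, 72]
cases = [25, 58, 40, 23, 12, 5, 1]

median_hours = grouped_percentile(lower_limits, 12, cases, 50)
p95_hours = grouped_percentile(lower_limits, 12, cases, 95)
print(round(median_hours, 1), round(p95_hours, 1))
```

The interpolation assumes that cases are spread evenly within each group, which is the assumption built into the formula itself.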
3. Measures of Spread
Several features describe the distribution of a data set. Two that we are commonly interested in are:
> What is the typical (average) value of a variable (what is its location)?
> How much variability is there in the data (how much does it spread out)?
The common measures of variation are the following:
Range, symbolized by R
Interval of quartile (interquartile range), symbolized by Q
Variance, symbolized by $\sigma^2$, $S^2$
Standard deviation, symbolized by $\sigma$, S
Coefficient of variation, symbolized by CV
1) Range is suitable for all kinds of data, but it is a poor measure of variability because it is based on only the two extreme observations.
$R = X_{max} - X_{min}$
2) Interval of quartile (Q), also called the interquartile range, is the spread from the 25th percentile ($P_{25}$) to the 75th percentile ($P_{75}$).
It is suitable for all kinds of data, especially skewed data, where it is a better measure than the range. The quartiles are calculated with the percentile formula:
$P_x = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right)$
For the example above (Table 2-6):
$Q = Q_U - Q_L = P_{75} - P_{25} = 36.0 - 15.3 = 20.7$ (hrs)
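The two quartiles behind Q = 20.7 hours can be recomputed from Table 2-6 with the grouped-percentile formula. The following Python sketch repeats a small helper so that it runs on its own; the helper name and parameters are illustrative.

```python
def grouped_percentile(lower_limits, width, freq, pct):
    """Percentile from a frequency table with equal class widths."""
    target = sum(freq) * pct / 100
    cum_below = 0
    for lower, f in zip(lower_limits, freq):
        if f > 0 and cum_below + f >= target:
            return lower + width / f * (target - cum_below)
        cum_below += f
    raise ValueError("pct must lie between 0 and 100")

# Table 2-6: latent period in hours, class width 12.
lower_limits = [0, 12, 24, 36, 48, 60, 72]
cases = [25, 58, 40, 23, 12, 5, 1]

p25 = grouped_percentile(lower_limits, 12, cases, 25)  # lower quartile
p75 = grouped_percentile(lower_limits, 12, cases, 75)  # upper quartile
q = p75 - p25                                          # interval of quartile
print(round(p25, 1), round(p75, 1), round(q, 1))
```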
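3) Variance ($\sigma^2$, $S^2$) and standard deviation (SD or S)
They are the most important measures of variability and are suitable for normally distributed data.
Population: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$, $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
Sample: $S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$, $S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
The formulas for the standard deviation:
Direct method: $S = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n - 1}}$
Weighted method: $S = \sqrt{\frac{\sum f X^2 - (\sum f X)^2 / n}{n - 1}}$, where $n = \sum f$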
For example, the diastolic blood pressures of 5 persons are 162, 145, 178, 142, 186 (mmHg).
$\sum X = 813$
$\sum X^2 = 133713$
$S = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n - 1}} = \sqrt{\frac{133713 - 813^2 / 5}{5 - 1}} = 19.49$ mmHg
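The direct-method formula can be checked against the five readings with a few lines of Python. This is a sketch with illustrative names; `statistics.stdev` from the standard library is used only as an independent cross-check. Computed from the readings, the sum of squares is 133713, the value that yields S = 19.49.

```python
import math
import statistics

pressures = [162, 145, 178, 142, 186]  # mmHg, from the example

n = len(pressures)
sum_x = sum(pressures)                  # sum of X
sum_x2 = sum(x * x for x in pressures)  # sum of X squared

# Direct method: S = sqrt((sum X^2 - (sum X)^2 / n) / (n - 1))
s = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(sum_x, sum_x2, round(s, 2))
print(round(statistics.stdev(pressures), 2))  # same sample SD
```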
Table 2-7. Frequency Distribution of Red Blood Cells (×10^12/L) among 140 Normal Male Adults in Some District

RBCs          Middle value (X)   Frequency (f)   fX      fX²
3.80~         3.90               2               7.8     30.42
4.00~         4.10               6               24.6    100.86
4.20~         4.30               11              47.3    203.39
4.40~         4.50               25              112.5   506.25
4.60~         4.70               32              150.4   706.88
4.80~         4.90               27              132.3   648.27
5.00~         5.10               17              86.7    442.17
5.20~         5.30               13              68.9    365.17
5.40~         5.50               4               22.0    121.00
5.60~         5.70               2               11.4    64.98
5.80~6.00     5.90               1               5.9     34.81
Total (∑)     -                  140             669.8   3224.20

$S = \sqrt{\frac{\sum f X^2 - (\sum f X)^2 / n}{n - 1}} = \sqrt{\frac{3224.20 - 669.8^2 / 140}{140 - 1}} = 0.38$ (×10^12/L)
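The weighted method for the standard deviation can be applied to the midpoints and frequencies of Table 2-7 as follows. This is a Python sketch with illustrative names; it recomputes the fX and fX² columns rather than copying them.

```python
import math

# Class midpoints (X) and frequencies (f) copied from Table 2-7.
midpoints = [3.90, 4.10, 4.30, 4.50, 4.70, 4.90, 5.10, 5.30, 5.50, 5.70, 5.90]
freq = [2, 6, 11, 25, 32, 27, 17, 13, 4, 2, 1]

n = sum(freq)
sum_fx = sum(f * x for f, x in zip(freq, midpoints))       # sum of f*X
sum_fx2 = sum(f * x * x for f, x in zip(freq, midpoints))  # sum of f*X^2

# Weighted method: S = sqrt((sum fX^2 - (sum fX)^2 / n) / (n - 1))
s = math.sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))

print(n, round(sum_fx, 1), round(sum_fx2, 2), round(s, 2))
```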
4) Coefficient of variation (CV) is the standard deviation divided by the mean. Because it has no unit, it can be compared between different data sets.
To compare the variability of two or more groups whose units of measurement differ, or whose means differ markedly, calculate their CV:
$CV = \frac{s}{\bar{x}} \times 100\%$
For example: the heights (cm) and weights (kg) of 110 healthy male students aged 20 were measured at random in a city in 2004. Compare the variability of height and weight.
For heights: $\bar{x}$ = 172.73 (cm), S = 4.09 (cm)
For weights: $\bar{x}$ = 55.04 (kg), S = 4.10 (kg)
For heights, CV = 4.09 / 172.73 × 100% = 2.37%
For weights, CV = 4.10 / 55.04 × 100% = 7.45%
This indicates that height is the more stable index.
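The comparison of the two coefficients of variation can be reproduced from the summary statistics alone. This is a minimal Python sketch; the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: standard deviation relative to the mean."""
    return sd / mean * 100

cv_height = coefficient_of_variation(4.09, 172.73)  # cm / cm
cv_weight = coefficient_of_variation(4.10, 55.04)   # kg / kg

print(f"height CV = {cv_height:.2f}%, weight CV = {cv_weight:.2f}%")
```

The two standard deviations are almost equal in number (4.09 and 4.10) but carry different units and refer to very different means, so only the unit-free CV shows that weight varies about three times as much as height.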
5) Applications of standard deviation
(A) Showing the variability (spread) of observations;
(B) Describing the features of normal distribution of data when combined with mean;
(C) Estimating the medical reference range when combined with mean;
(D) Calculating the standard error of mean when combined with sample size (n).
4. Normal Distribution and Application
(1) What the normal distribution means
[Figure 2-1. Frequency distribution and its curve of heights among 120 healthy boys at the age of 12. Horizontal axis: heights (cm), 125 to 161; vertical axis: frequency (f).]
[Figure: normal distribution curve over -∞ to +∞, labelled F(X) and f(X).]
The normal distribution is defined by the function f(X):
$f(X) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(X - \mu)^2}{2\sigma^2}}, \quad -\infty < X < +\infty$
(2) The attributes of normal distribution
A. The curve is bell-shaped and symmetric.
B. The peak is located at the center (mean, median).
C. There are two parameters, μ and σ, and the distribution is written N(μ, σ).
D. There is a rule for estimating areas under the curve; the total area under the curve is 1, or 100%.
(3) The area rule of normal distribution
μ ± 1σ covers 68.27% of the area
μ ± 1.96σ covers 95.00% of the area
μ ± 2.58σ covers 99.00% of the area
If n > 100, μ may be replaced by $\bar{x}$ and σ by s.
[Figure: normal distribution curve marked at -2.58, -1.96, -1, +1, +1.96 and +2.58; each tail beyond ±1.96 holds 2.5% and each tail beyond ±2.58 holds 0.5%.]
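(4) Standard normal distribution
If μ = 0 and σ = 1, the normal distribution (Nd) becomes the standard normal distribution (Snd).
If u = (X - μ)/σ, then u follows the Snd, N(0, 1).
So the standard normal distribution is also called the u-distribution.
[Figure: standard normal distribution curve over -∞ to +∞, labelled φ(u) and Φ(u) and marked at -2.58, -1.96, -1, 0, +1, +1.96 and +2.58.]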
(5) The area rule of Snd
-1< u < +1, the area rule is 68.27%
-1.96 < u < +1.96, the area rule is 95.0%
-2.58 < u < +2.58, the area rule is 99.0%
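The three areas can be verified numerically. The area of the standard normal curve between -k and +k equals erf(k/√2), where erf is the error function; the Python sketch below uses `math.erf` from the standard library, and the function name `central_area` is illustrative.

```python
import math

def central_area(k):
    """Area under the standard normal curve between -k and +k."""
    return math.erf(k / math.sqrt(2))

for k in (1.0, 1.96, 2.58):
    print(f"-{k} < u < +{k}: {100 * central_area(k):.2f}%")
```

The value 2.58 is a rounded form of 2.576, so the computed area for it is marginally above 99%.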
(6) Application of normal distribution
(A) Estimating frequency distribution
The weights of 130 newborns: $\bar{x}$ = 3200 g, s = 350 g.
Estimate the proportion with low birth weight (the standard for low weight: X = 2500 g).
u = (x - μ)/σ = (2500 - 3200)/350 = -2
See Table 2-11 in the book (p. 39):
Φ(-2) = 0.0228 = 2.28% (the proportion with low weight)
130 × 2.28% = 2.96 ≈ 3 (the number of newborns with low weight)
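Instead of a printed table, the area below u = -2 can be obtained from `statistics.NormalDist` in the Python standard library (Python 3.8 or later). This is a sketch under the example's own assumption that birth weights are normally distributed with the sample mean and standard deviation standing in for μ and σ; the variable names are illustrative.

```python
from statistics import NormalDist

mean_g, sd_g, n_newborns = 3200, 350, 130
cutoff_g = 2500  # low birth weight threshold

u = (cutoff_g - mean_g) / sd_g               # standardized value
p_low = NormalDist(mean_g, sd_g).cdf(cutoff_g)  # area below the cutoff
expected_low = n_newborns * p_low            # expected number of newborns

print(u, round(p_low, 4), round(expected_low, 2))
```

The expected count comes out slightly below the slide's 2.96 because the slide multiplies by the rounded 2.28%; both round to 3 newborns.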
(B) The estimation of a reference range
A reference range, also called the "normal range", is the range of values covering most normal individuals ("most" means 80%, 90%, 95% or 99%).
[Figure: distributions of normal individuals and patients with the upper limit (95%) marked, showing the false negative and false positive regions.]
For example:
Red blood cells (RBC): 3.5~5.0 (×10^12/L)
White blood cells (WBC): 4~10 (×10^9/L)
Cholesterol in blood: 3.1~5.7 mmol/L
Lead in urine: < 0.08 mg/L
Two methods to estimate a reference range:
(A) Method of normal distribution
(B) Method of percentiles
If the frequency distribution is close to the normal distribution, we may estimate the reference range by either the normal distribution method or the percentile method. By the normal distribution method:
$\bar{x} \pm u \cdot s$ (two-sided 1 - α range)
$\bar{x} + u \cdot s$ or $\bar{x} - u \cdot s$ (one-sided 1 - α range)

Table 2-8. Common u-values
Reference range   Two-sided   One-sided
80%               1.282       0.842
90%               1.645       1.282
95%               1.960       1.645
99%               2.576       2.326
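The u-values in Table 2-8 are quantiles of the standard normal distribution: a two-sided range leaves half of the excluded fraction in each tail, while a one-sided range leaves all of it in one tail. The Python sketch below derives them with `statistics.NormalDist.inv_cdf` (Python 3.8 or later); the function name `u_values` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def u_values(coverage):
    """Return (two-sided u, one-sided u) for a given coverage, e.g. 0.95."""
    alpha = 1 - coverage
    two_sided = std_normal.inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
    one_sided = std_normal.inv_cdf(1 - alpha)      # alpha in one tail
    return round(two_sided, 3), round(one_sided, 3)

for coverage in (0.80, 0.90, 0.95, 0.99):
    print(f"{coverage:.0%}", *u_values(coverage))
```

This also explains why the one-sided value for 95% (1.645) equals the two-sided value for 90%.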
For example: the heights (cm) of 110 healthy male students aged 20 were measured at random in a city in 2004. For heights, $\bar{x}$ = 172.73 (cm) and S = 4.09 (cm). Calculate the reference range of the height.
Calculating the two-sided 95% reference range:
$\bar{x} \pm u \cdot s = \bar{x} \pm 1.96 s = 172.73 \pm 1.96 \times 4.09$
The 95% reference range of the students' heights is 164.71~180.75 (cm).
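The two-sided range for the heights can be computed directly from the mean, the standard deviation and the tabulated u-value. This is a minimal Python sketch; the function name `reference_range` is illustrative, and it assumes the data are close to normally distributed, as the method requires.

```python
def reference_range(mean, sd, u=1.96):
    """Two-sided reference range: mean - u*sd to mean + u*sd."""
    half_width = u * sd
    return mean - half_width, mean + half_width

lower, upper = reference_range(172.73, 4.09)  # heights in cm, u for 95%
print(f"{lower:.2f} ~ {upper:.2f} cm")
```

Passing u = 2.576 instead would give the wider two-sided 99% range from the same summary statistics.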
If the frequency distribution is skewed, we may estimate the reference range by percentiles.
95% two-sided reference range: P2.5 ~ P97.5
95% one-sided range with an upper limit: < P95
95% one-sided range with a lower limit: > P5
(C) The control of data quality
$\bar{x} \pm 3s$