Description of measurement data

Prof. Yi-xiong Lei (1021305)

1. Distribution of frequency

To summarize data or describe their frequency distribution, a frequency table or graph is the usual tool. The types of distribution are:

Normal distribution

Skewed distribution

Positively skewed distribution

Negatively skewed distribution

When faced with a large amount of data, we want to describe its most important features concisely. Usually we describe the data from two aspects:

Measures of Location

Central tendency (central position)

Measures of Spread

Tendency of dispersion (variation)

* Normal distribution

Table 2-1. Frequency distribution of red blood cells (×10^12/L) among 130 normal male adults in some district

RBCs          Mark (正 = 5 tallies)     Frequency
3.70~         ||                        2
3.90~         ||||                      4
4.10~         正||||                    9
4.30~         正正正|                   16
4.50~         正正正正||                22
4.70~         正正正正正                25
4.90~         正正正正|                 21
5.10~         正正正||                  17
5.30~         正||||                    9
5.50~         ||||                      4
5.70~5.90     |                         1
Total         -                         130

* Positively skewed distribution

Table 2-2. Frequency distribution of hair Hg value (μg/g) among 238 normal adults

Value of hair Hg   Frequency   Accumulative frequency   AF (%)
(1)                (2)         (3)                      (4) = (3)/238
0.3~               20          20                       8.4
0.7~               66          86                       36.1
1.1~               60          146                      61.3
1.5~               48          194                      81.5
1.9~               18          212                      89.1
2.3~               16          228                      95.8
2.7~               6           234                      98.3
3.1~               1           235                      98.7
3.5~               0           235                      98.7
3.9~               3           238                      100.0
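The accumulative columns of Table 2-2 follow mechanically from the frequency column. A minimal Python sketch, assuming the frequencies are typed in by hand (the variable names are illustrative, not from the slides):

```python
from itertools import accumulate

# Frequencies from column (2) of Table 2-2 (hair Hg, 238 normal adults).
freqs = [20, 66, 60, 48, 18, 16, 6, 1, 0, 3]

# Column (3): running total of the frequencies.
cumulative = list(accumulate(freqs))
n = cumulative[-1]

# Column (4): accumulative frequency as a percentage of n, one decimal place.
cumulative_pct = [round(100 * c / n, 1) for c in cumulative]

for f, c, pct in zip(freqs, cumulative, cumulative_pct):
    print(f, c, pct)
```

The last running total doubles as a check that the frequencies add up to the 238 subjects in the caption.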

* Negatively skewed distribution

Table 2-3. Frequency distribution of patients who died of malignant tumors in some year and district

Age (yr.)   No. of deaths   Accumulative frequency   AF (%)
0~          5               5                        0.42
10~         12              17                       1.41
20~         15              32                       2.66
30~         76              108                      8.98
40~         189             297                      24.69
50~         234             531                      44.14
60~         386             917                      76.23
70~         286             1203                     100.00

Figure 2-1. Frequency Distribution of Serum Cholesterol (mg/dl)

among 200 Normal Adults in Some District

2. Average

What measure describes the central tendency? The average, which includes the mean, the geometric mean and the median.

Average:

Mean, symbolized by μ (population) or x̄ (sample)

Geometric mean, symbolized by G

Median, symbolized by M

1) Mean: suitable for data that follow a normal, or at least symmetric, distribution.

Formula (1):

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$$

Formula (2):

$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum f x}{n}$$

Formula (1) is for original data (direct method).

Formula (2) is for a frequency table (weighted method).

Table 2-4. Frequency distribution of red blood cells (×10^12/L) among 140 normal male adults in some district

RBCs          Middle value (X)   Frequency (f)   fX
3.80~         3.90               2               7.8
4.00~         4.10               6               24.6
4.20~         4.30               11              47.3
4.40~         4.50               25              112.5
4.60~         4.70               32              150.4
4.80~         4.90               27              132.3
5.00~         5.10               17              86.7
5.20~         5.30               13              68.9
5.40~         5.50               4               22.0
5.60~         5.70               2               11.4
5.80~6.00     5.90               1               5.9
Total         -                  140 (∑f)        669.8 (∑fX)

$$\bar{x} = \frac{\sum f x}{\sum f} = \frac{669.8}{140} = 4.78 \ (\times 10^{12}/\text{L})$$
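The weighted-method calculation for Table 2-4 can be sketched in a few lines of Python. The function name `weighted_mean` is illustrative; the midpoints and frequencies are copied from the table.

```python
# Class midpoints (X) and frequencies (f) from Table 2-4.
midpoints = [3.90, 4.10, 4.30, 4.50, 4.70, 4.90, 5.10, 5.30, 5.50, 5.70, 5.90]
freqs = [2, 6, 11, 25, 32, 27, 17, 13, 4, 2, 1]

def weighted_mean(xs, fs):
    """Formula (2): sum(f * x) divided by sum(f)."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)

mean_rbc = weighted_mean(midpoints, freqs)
print(round(mean_rbc, 2))  # 4.78, in units of 10^12/L
```

Each class is represented by its midpoint, so the result approximates the mean that the 140 original observations would give.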

2) Geometric mean: suitable for data with a positively skewed or log-normal distribution.

Formula (1):

$$G = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$$

$$G = \lg^{-1}\left(\frac{\lg x_1 + \lg x_2 + \cdots + \lg x_n}{n}\right) = \lg^{-1}\left(\frac{\sum \lg x}{n}\right)$$

Formula (2):

$$G = \lg^{-1}\left(\frac{f_1 \lg x_1 + f_2 \lg x_2 + \cdots + f_k \lg x_k}{\sum f}\right) = \lg^{-1}\left(\frac{\sum f \lg x}{\sum f}\right)$$

Formula (1) is for original data (direct method).

Formula (2) is for a frequency table (weighted method).

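Example: five serum antibody samples have concentrations of 1:10, 1:100, 1:1000, 1:10000 and 1:100000. What is the average concentration?

Wrong (arithmetic mean):

$$\bar{x} = \frac{\sum x}{n} = \frac{10 + 100 + \cdots + 100000}{5} = 22222$$

Right (geometric mean):

$$G = \lg^{-1}\left(\frac{\lg x_1 + \lg x_2 + \cdots + \lg x_n}{n}\right) = \lg^{-1}\left(\frac{\sum \lg x}{n}\right) = \lg^{-1}\left(\frac{1 + 2 + 3 + 4 + 5}{5}\right) = \lg^{-1} 3 = 1000$$

The average concentration is therefore 1:1000.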

Table 2-5. Specific serum antibody concentrations in 52 susceptible children one month after immunization with measles vaccine

Ab concentration   Children (f)   Reciprocal (x)   lg x     f lg x
1:40               3              40               1.602    4.81
1:80               22             80               1.903    41.87
1:160              17             160              2.204    37.47
1:320              9              320              2.505    22.55
1:640              0              640              2.806    0.00
1:1280             1              1280             3.107    3.11
Total              52 (∑f)        -                -        109.79

$$G = \lg^{-1}\left(\frac{\sum f \lg x}{\sum f}\right) = \lg^{-1}\left(\frac{109.79}{52}\right) = 129.2$$

The average antibody concentration is 1:129.
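Both geometric-mean examples can be reproduced with one small function. This is a sketch under the assumption that base-10 logarithms are used, as in the slides; `geometric_mean` is an illustrative name.

```python
import math

def geometric_mean(xs, fs=None):
    """G = 10 ** (sum(f * lg x) / sum(f)); each f defaults to 1 (direct method)."""
    if fs is None:
        fs = [1] * len(xs)
    mean_log = sum(f * math.log10(x) for x, f in zip(xs, fs)) / sum(fs)
    return 10 ** mean_log

# Direct method: reciprocals of the five antibody concentrations.
titers = [10, 100, 1000, 10000, 100000]
g_direct = geometric_mean(titers)

# Weighted method: reciprocals and numbers of children from Table 2-5.
reciprocals = [40, 80, 160, 320, 640, 1280]
children = [3, 22, 17, 9, 0, 1]
g_table = geometric_mean(reciprocals, children)

print(round(g_direct), round(g_table))
```

With full-precision logarithms the Table 2-5 result lands within about 0.1 of the 129.2 obtained from three-decimal logs, so the rounded concentration is still 1:129.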

3) Median: suitable for all kinds of data, but less useful than the mean for further analysis.

The formulas for original data (direct method):

$$M = X_{\frac{n+1}{2}} \quad (n \text{ is odd})$$

$$M = \frac{1}{2}\left(X_{\frac{n}{2}} + X_{\frac{n}{2}+1}\right) \quad (n \text{ is even})$$

For example:

There are 9 cases with latent periods of 2, 3, 3, 3, 4, 5, 6, 9 and 16 days. Calculate their average latent period.

$$M = X_{\frac{n+1}{2}} = X_{\frac{9+1}{2}} = X_5 = 4 \text{ (days)}$$
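A minimal Python sketch of the direct method follows. The function name is illustrative, and the 1-based positions in the formulas are shifted to Python's 0-based indexing.

```python
def median_direct(values):
    """Median of raw data: middle value (odd n) or mean of the two middle values (even n)."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                 # X_((n+1)/2) in 1-based terms
    return (xs[mid - 1] + xs[mid]) / 2  # (X_(n/2) + X_(n/2+1)) / 2

latent_days = [2, 3, 3, 3, 4, 5, 6, 9, 16]
print(median_direct(latent_days))  # 4
```

The extreme value of 16 days does not move the median, which is why the median suits skewed data such as latent periods.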

4) Median and percentile

For data in a frequency table we do not know the exact value of the median, so we use the following formula for the median or any percentile (frequency table method, or percentile method):

$$P_x = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right)$$

x: the percentile;

L: the lower limit of the group in which the percentile is located;

i: the group interval;

f_x: the frequency in that group;

n: the total number of cases;

ΣfL: the accumulative frequency below L.

If x = 50, then P50 = M, and the formula becomes:

$$M = L + \frac{i}{f}\left(\frac{n}{2} - \sum f_L\right)$$

Table 2-6. Calculation of the median and percentiles of the latent period of food poisoning among 164 cases

Latent period (hours)   Cases (f)   Accumulative frequency (Σf)   Accumulative frequency (%)
0~                      25          25                            15.2
12~                     58          83                            50.6
24~                     40          123                           75.0
36~                     23          146                           89.0
48~                     12          158                           96.3
60~                     5           163                           99.4
72~84                   1           164                           100.0

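Median calculation: from Table 2-6, the accumulative frequency of 50% falls within the group “12~”, so L = 12, i = 12, f = 58, ΣfL = 25, n = 164.

$$M = L + \frac{i}{f}\left(\frac{n}{2} - \sum f_L\right) = 12 + \frac{12}{58}\left(\frac{164}{2} - 25\right) = 23.8 \text{ (hrs)}$$

Percentile (Px) calculation: for P95, x = 95. The accumulative frequency of 95% falls within the group “48~”, so L = 48, i = 12, fx = 12, ΣfL = 146, n = 164.

$$P_{95} = 48 + \frac{12}{12}\left(164 \times 95\% - 146\right) = 57.8 \text{ (hrs)}$$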
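The percentile formula can be sketched in Python by scanning the groups until the accumulative frequency reaches n·x%. The function name and the assumption of equal group widths are illustrative choices; the data are from Table 2-6.

```python
def grouped_percentile(lower_limits, freqs, width, x):
    """P_x = L + (i / f_x) * (n * x% - accumulative frequency below L)."""
    n = sum(freqs)
    target = n * x / 100
    below = 0  # accumulative frequency below the current group
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and below + f >= target:
            return lower + width / f * (target - below)
        below += f
    raise ValueError("x must lie between 0 and 100")

# Latent period of food poisoning, Table 2-6 (all groups are 12 hours wide).
lower_limits = [0, 12, 24, 36, 48, 60, 72]
cases = [25, 58, 40, 23, 12, 5, 1]

median = grouped_percentile(lower_limits, cases, 12, 50)
p95 = grouped_percentile(lower_limits, cases, 12, 95)
print(round(median, 1), round(p95, 1))  # 23.8 57.8
```

The formula interpolates linearly inside the group, which assumes the cases are spread evenly across each 12-hour interval.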

3. Measures of Spread

Several features describe the distribution of a set of data. Two common features we might be interested in are:

> What is the typical (average) value of a variable (what is its location)?

> How much variability is there in the data (how much does it spread out)?

The common measures of variation are the following:

Range, symbolized by R

Interquartile range, symbolized by Q

Variance, symbolized by σ², S²

Standard deviation, symbolized by σ, S

Coefficient of variation, symbolized by CV

1) Range: suitable for all kinds of data, but it is a poor measure of variability because it is based on only the two extreme observations.

R = Xmax - Xmin

2) Interquartile range (Q): the extent of variation from the 25th percentile (P25) to the 75th percentile (P75).

The interquartile range is suitable for all kinds of data, especially skewed data, where it works better than the range. The quartiles are calculated with the percentile formula:

$$P_x = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right)$$

For the example above (Table 2-6):

$$Q = Q_U - Q_L = P_{75} - P_{25} = 36.0 - 15.3 = 20.7 \text{ (hrs)}$$
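The quartiles quoted here can be checked against Table 2-6 with the same grouped-percentile interpolation. The sketch below is self-contained; the helper name is illustrative and equal 12-hour groups are assumed.

```python
def grouped_percentile(lower_limits, freqs, width, x):
    """P_x = L + (i / f_x) * (n * x% - accumulative frequency below L)."""
    target = sum(freqs) * x / 100
    below = 0
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and below + f >= target:
            return lower + width / f * (target - below)
        below += f
    raise ValueError("x must lie between 0 and 100")

# Latent period groups (hours) and case counts from Table 2-6.
lower_limits = [0, 12, 24, 36, 48, 60, 72]
cases = [25, 58, 40, 23, 12, 5, 1]

p25 = grouped_percentile(lower_limits, cases, 12, 25)
p75 = grouped_percentile(lower_limits, cases, 12, 75)
iqr = p75 - p25
print(round(p25, 1), round(p75, 1), round(iqr, 1))  # 15.3 36.0 20.7
```

P75 comes out at exactly the upper boundary of the “24~” group because the accumulative frequency there is exactly 75% of the 164 cases.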

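3) Variance (σ², S²) and standard deviation (SD or S)

These are the important measures of variability and are suitable for normally distributed data.

Population:

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}, \qquad \sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

Sample:

$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}, \qquad S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

The formulas for the standard deviation:

Direct method:

$$S = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n - 1}}$$

Weighted method:

$$S = \sqrt{\frac{\sum f X^2 - (\sum f X)^2 / n}{n - 1}}, \qquad n = \sum f$$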

For example, the diastolic blood pressures of 5 persons are 162, 145, 178, 142 and 186 (mmHg).

∑X = 813

∑X² = 133713

$$S = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n - 1}} = \sqrt{\frac{133713 - (813)^2 / 5}{5 - 1}} = 19.49 \text{ mmHg}$$
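The direct-method formula can be checked in Python against the standard library's `statistics.stdev`, which also uses the n - 1 denominator. The function name `sd_direct` is illustrative.

```python
import math
import statistics

def sd_direct(xs):
    """S = sqrt((sum(X^2) - (sum X)^2 / n) / (n - 1))."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

pressures = [162, 145, 178, 142, 186]  # mmHg, from the example
s = sd_direct(pressures)
print(sum(pressures), round(s, 2))  # 813 19.49

# Cross-check against the library's sample standard deviation.
assert abs(s - statistics.stdev(pressures)) < 1e-9
```

The shortcut form avoids computing each deviation from the mean, which mattered for hand calculation; on a computer the two forms agree to rounding error for data of this size.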

Table 2-7. Frequency distribution of red blood cells (×10^12/L) among 140 normal male adults in some district

RBCs          Middle value (X)   Frequency (f)   fX       fX²
3.80~         3.90               2               7.8      30.42
4.00~         4.10               6               24.6     100.86
4.20~         4.30               11              47.3     203.39
4.40~         4.50               25              112.5    506.25
4.60~         4.70               32              150.4    706.88
4.80~         4.90               27              132.3    648.27
5.00~         5.10               17              86.7     442.17
5.20~         5.30               13              68.9     365.17
5.40~         5.50               4               22.0     121.00
5.60~         5.70               2               11.4     64.98
5.80~6.00     5.90               1               5.9      34.81
Total (∑)     -                  140             669.8    3224.20

$$S = \sqrt{\frac{\sum f X^2 - (\sum f X)^2 / n}{n - 1}} = \sqrt{\frac{3224.20 - (669.8)^2 / 140}{140 - 1}} = 0.38$$
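The weighted-method standard deviation for the red blood cell table can be sketched as follows. The function name is illustrative; the midpoints and frequencies are taken from the table.

```python
import math

# Class midpoints (X) and frequencies (f) from the red blood cell table.
midpoints = [3.90, 4.10, 4.30, 4.50, 4.70, 4.90, 5.10, 5.30, 5.50, 5.70, 5.90]
freqs = [2, 6, 11, 25, 32, 27, 17, 13, 4, 2, 1]

def sd_weighted(xs, fs):
    """S = sqrt((sum(f X^2) - (sum(f X))^2 / n) / (n - 1)), with n = sum(f)."""
    n = sum(fs)
    sum_fx = sum(f * x for x, f in zip(xs, fs))
    sum_fx2 = sum(f * x * x for x, f in zip(xs, fs))
    return math.sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))

sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
s_rbc = sd_weighted(midpoints, freqs)
print(round(sum_fx2, 2), round(s_rbc, 2))  # 3224.2 0.38
```

Recomputing ∑fX² from the midpoints is a quick way to verify the column total of 3224.20.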

4) Coefficient of variation (CV): the standard deviation divided by the mean, which makes variability comparable between different sets of data.

When comparing the variability of two or more groups whose units of measurement differ, or whose means differ markedly, calculate their CV:

$$CV = \frac{s}{\bar{x}} \times 100\%$$

For example: the heights (cm) and weights (kg) of 110 randomly selected healthy male students aged 20 were measured in a city in 2004. Compare the variability of the heights and the weights.

For heights: x̄ = 172.73 (cm), S = 4.09 (cm)

For weights: x̄ = 55.04 (kg), S = 4.10 (kg)

For heights, CV = 4.09 / 172.73 × 100% = 2.37%

For weights, CV = 4.10 / 55.04 × 100% = 7.45%

This indicates that height is the more stable index.
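The comparison reduces to one line of arithmetic per variable. A short Python sketch, with `cv_percent` as an illustrative helper name:

```python
def cv_percent(sd, mean):
    """Coefficient of variation: standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv_height = cv_percent(4.09, 172.73)  # heights in cm
cv_weight = cv_percent(4.10, 55.04)   # weights in kg
print(round(cv_height, 2), round(cv_weight, 2))  # 2.37 7.45
```

The two standard deviations are nearly equal in raw numbers (4.09 and 4.10), yet the CVs differ about threefold, which is why the raw values cannot be compared across different units.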

5) Applications of standard deviation

(A) Showing the variability (spread) of observations;

(B) Describing the features of normal distribution of data when combined with mean;

(C) Estimating the medical reference range when combined with mean;

(D) Calculating the standard error of mean when combined with sample size (n).

4. Normal Distribution and Application

(1) What the normal distribution means

[Figure 2-1. Frequency distribution, with its curve, of heights among 120 healthy boys at the age of 12. Horizontal axis: heights (cm), 125 to 161; vertical axis: frequency (f).]

[Figure: normal distribution curve, showing the density f(X) and the cumulative area F(X) between -∞ and +∞.]

The normal distribution is defined by the function:

$$f(X) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(X - \mu)^2}{2\sigma^2}}, \qquad -\infty < X < +\infty$$

(2) The attributes of normal distribution

A. The curve is bell-shaped and symmetric.

B. The peak lies at the center (mean, median).

C. There are two parameters, μ and σ, written N(μ, σ²).

D. The area follows a fixed rule: the total area under the curve is 1, or 100%.

(3) The area rule of normal distribution

μ ± 1σ covers 68.27% of the area

μ ± 1.96σ covers 95.00% of the area

μ ± 2.58σ covers 99.00% of the area

If n > 100, μ may be replaced by x̄ and σ by s.

[Figure: normal distribution curve marked at -2.58, -1.96, -1, +1, +1.96 and +2.58 standard deviations from the mean, with tail areas of 2.5% and 0.5%.]

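(4) Standard normal distribution

If μ = 0 and σ = 1, the normal distribution (Nd) becomes the standard normal distribution (Snd).

If u = (X - μ)/σ, then u follows the standard normal distribution N(0, 1).

So the standard normal distribution is also called the u-distribution.

[Figure: standard normal distribution curve φ(u), with cumulative area Φ(u), from -∞ to +∞; the u-axis is marked at -2.58, -1.96, -1, 0, +1, +1.96 and +2.58.]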

(5) The area rule of the Snd

-1 < u < +1 covers 68.27% of the area

-1.96 < u < +1.96 covers 95.0% of the area

-2.58 < u < +2.58 covers 99.0% of the area
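These three areas can be verified numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); `central_area` is an illustrative helper name.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # the standard normal distribution N(0, 1)

def central_area(u):
    """Area under the standard normal curve between -u and +u."""
    return std_normal.cdf(u) - std_normal.cdf(-u)

for u in (1, 1.96, 2.58):
    print(u, round(100 * central_area(u), 2))
```

The value 2.58 is a rounded form of 2.576, so its central area comes out slightly above 99%.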

(6) Application of normal distribution

(A) Estimating frequency distribution

The weights of 130 newborns have x̄ = 3200 g and s = 350 g.

Estimate the proportion of newborns with low weight.

(The standard of low weight: X = 2500 g)

$$u = \frac{x - \mu}{\sigma} = \frac{2500 - 3200}{350} = -2$$

See table 2-11 in the book (p39):

Φ(-2) = 0.0228 = 2.28% (the proportion with low weight)

130 × 2.28% = 2.96 ≈ 3 (the number of newborns with low weight)
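The table lookup can be replaced by a direct evaluation of the normal cumulative distribution. This sketch assumes, as the example does, that the sample mean and standard deviation stand in for μ and σ; the variable names are illustrative.

```python
from statistics import NormalDist

# Newborn weights modeled as N(3200 g, 350 g), using the sample estimates.
newborn = NormalDist(mu=3200, sigma=350)

u = (2500 - 3200) / 350            # standardized cutoff, -2.0
p_low = newborn.cdf(2500)          # proportion below 2500 g, i.e. Phi(-2)
expected_cases = 130 * p_low       # expected number among 130 newborns

print(u, round(p_low, 4), round(expected_cases, 2))
```

Evaluating the cumulative distribution at 2500 g directly gives the same area as standardizing first and looking up Φ(-2).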

(B) Estimating a reference range

The reference range, also called the “normal range”, is the range of values found in most normal individuals.

(“Most” means 80%, 90%, 95% or 99%.)

[Figure: distributions of normal individuals and patients, with the 95% upper limit and the false negative and false positive regions labeled.]

For example:

Red blood cells (RBC): 3.5~5.0 (×10^12/L)

White blood cells (WBC): 4~10 (×10^9/L)

Cholesterol in blood: 3.1~5.7 mmol/L

Lead in urine: < 0.08 mg/L

Two methods to estimate a reference range:

(A) Method of normal distribution

(B) Method of percentiles

If the frequency distribution is close to the normal distribution, we may estimate the reference range by either the normal distribution method or the percentile method.

$\bar{x} \pm u \cdot s$ (two-sided $1-\alpha$ range)

$\bar{x} + u \cdot s$ or $\bar{x} - u \cdot s$ (one-sided $1-\alpha$ range)

Table 2-8. Common u-values

Reference range   Two-sided   One-sided
80%               1.282       0.842
90%               1.645       1.282
95%               1.960       1.645
99%               2.576       2.326
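The u-values in Table 2-8 are quantiles of the standard normal distribution, so they can be regenerated rather than memorized. A sketch using `NormalDist.inv_cdf` from the Python standard library; the function name `u_value` and its parameters are illustrative.

```python
from statistics import NormalDist

def u_value(coverage, two_sided=True):
    """Standard normal quantile for a reference range with the given coverage."""
    alpha = 1 - coverage
    tail = alpha / 2 if two_sided else alpha  # area left in each excluded tail
    return NormalDist().inv_cdf(1 - tail)

for coverage in (0.80, 0.90, 0.95, 0.99):
    print(coverage, round(u_value(coverage), 3), round(u_value(coverage, two_sided=False), 3))
```

A two-sided range splits the excluded area between both tails, which is why the two-sided 80% value equals the one-sided 90% value.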

For example: the heights (cm) of 110 randomly selected healthy male students aged 20 were measured in a city in 2004. For heights, x̄ = 172.73 (cm) and S = 4.09 (cm). Calculate the reference range of the height.

Calculating the two-sided 95% reference range:

$$\bar{x} \pm u \cdot s = \bar{x} \pm 1.96 s = 172.73 \pm 1.96 \times 4.09$$

The 95% reference range of the students' heights is 164.71~180.75 (cm).
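The height example is a single application of mean ± u·s. A minimal Python sketch, with `reference_range` as an illustrative name and 1.96 as the two-sided 95% u-value:

```python
def reference_range(mean, sd, u=1.96):
    """Two-sided reference range: (mean - u * sd, mean + u * sd)."""
    return mean - u * sd, mean + u * sd

low, high = reference_range(172.73, 4.09)
print(round(low, 2), round(high, 2))  # 164.71 180.75
```

Passing a different `u`, such as 2.576, would give the two-sided 99% range under the same normality assumption.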

If the frequency distribution is skewed, we may estimate the reference range by percentiles.

95% two-sided reference range: P2.5 ~ P97.5

95% one-sided range with an upper limit: < P95

95% one-sided range with a lower limit: > P5

(C) The control of data quality

x̄ ± 3s
