1 wp.31 effects of rounding on data quality jay j. kim, lawrence h. cox, myron katzoff, joe fred...
DESCRIPTION
3 Purpose: Evaluate the effects of four rounding methods on data quality and utility in two ways : (1) bias and variance; (2) effects on the underlying distribution of the data determined by a distance measure.TRANSCRIPT
![Page 1: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/1.jpg)
1
WP.31 Effects of Rounding on Data Quality
Jay J. Kim , Lawrence H. Cox,Myron Katzoff, Joe Fred Gonzalez, Jr.
U.S. National Center for Health Statistics
![Page 2: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/2.jpg)
2
I. IntroductionReasons for rounding
• Rounding noninteger values to integer values for statistical purposes;
• To enhance readability of the data;
• To protect confidentiality of records in the file;
• To keep the important digits only.
![Page 3: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/3.jpg)
3
• Purpose:
Evaluate the effects of four rounding methods on data quality and utility in two ways:
• (1) bias and variance;
• (2) effects on the underlying distribution of the data determined by a distance measure.
![Page 4: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/4.jpg)
4
B : Base
: Quotient
: Remainder
Types of rounding:
• Unbiased rounding: E[R(r)|r] = r
• Sum-unbiased rounding: E[R(r)] = E(r)
x xx q B r
xq
xr( ) ( )x xR x q B R r
![Page 5: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/5.jpg)
5
II. Four rounding rules
1. Conventional rounding Suppose r = 0, 1, 2, . . . ,9 (=B-1).
If r (B/2), round r up to 10 (=B) else round down r to zero (0).
If B is odd, round r up when r
5
12
B
![Page 6: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/6.jpg)
6
r is assumed to follow a discrete uniform distribution
2. Modified Conventional rounding
Same as conventional rounding, exceptrounding 5 (B/2) up or down with probability ½.
3. Zero-restricted 50/50 rounding
Except zero (0), round r up or down with probability ½.
![Page 7: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/7.jpg)
7
4. Unbiased rounding rule
Round r up with probability r/B
Round r down with probability 1 - r/B
III. Mean and varianceIII.1 Mean and variance of unrounded number
r = 0, 1, 2, 3, . . . B-1.
![Page 8: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/8.jpg)
8
In general,
1( )2
BE r
2 1( )12
BV r
( ) ( | ) ( | )V x V E x q E V x q
![Page 9: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/9.jpg)
9
III. 2 Conventional rounding when B is even
for unrounded number.
1( )2
BE r
22 1
( ) ( )12
BV x B V q
[ ]2BE R r
![Page 10: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/10.jpg)
10
2
[ ]4
BV R r
2 1( ) .12
BV r for unrounded number
2
2( ) ( ) ,4
BV R x B V q and
2
2 1( ) ( ) .4
BMSE R x B V q
![Page 11: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/11.jpg)
11
III.3 Conventional rounding when B is odd
for unrounded number
1[ ] ,2
BE R r
2 1[ ( )] .4
BV R r
2 1( )12
BV r
[ ( )] 3 ( )V R r V r
![Page 12: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/12.jpg)
12
Modified conventional rounding,
50/50 rounding and
unbiased rounding
have the same mean, variance and MSE
as the conventional rounding with odd B.
![Page 13: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/13.jpg)
13
IV. Distance measure
Define
2
2
[ ( ) ]
[ ( ) ]x x
R x xUx
R r rx
1, ,0, .
xx
if r is rounded upotherwise
![Page 14: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/14.jpg)
14
Reexpressing the numerator of U, we have
With conventional rounding with B=10,
Then we have
2( )x xB r
21
0
( )( | ) | ( ).
x
x
x xx
B rE U x x P
x
1 5, 6, . . , 9x xwith r
![Page 15: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/15.jpg)
15
Expected value of U
We define
2( )
( | ) [ | ] ( ) ( ) ( ).x x x
x xq r x x x
q r
B rE E E U x x P P r P q
x
21
10
( )( | ) | [ | ] ( ) ( ).x x
x xr x x x
r x x
B rU E E U x q q P P rq B r
![Page 16: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/16.jpg)
16
IV.1 Conventional rounding with even B
which can be reexpressed as
2 2/ 2 1 1
1 / 21
( ) 1[ ]
x x
B Bx x
x x x xr r B
r B rU
q B r q B r B
2/ 2 1 1
11 / 2
( ) 1[ ]
x x
B Bx
xr r B x
B rU r
r B
2 2/ 2 1 1
1 / 2
( ) 1[ ]
x x
B Bx x
x x x xr r B
r B rq B r q B r B
![Page 17: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/17.jpg)
17
The upper and lower bounds for harmonic series are
The upper bound for the first term of
2/ 2 1 1
11 / 2
( ) 1[ ]
x x
B Bx
xr r xB
B rU r
r B
2 2/ 2 1 1
1 / 2
( ) 1[ ]x x
B Bx x
x xr r B
r B rq B q B B
ln( 1) 1 ln( )nn H n
1U
1ln[
2
2( 1)]
2B
BB
B
1 1 11 2 3 . .nH n
![Page 18: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/18.jpg)
18
The second term of
Note the second term of E(U) is,
IV.2 Modified conventional rounding with even B
Has the same E(U) as the conventional rounding.
1U
/ 2 1 12 2
21 / 2
1 1[ ][ ( ) ]
x
x x
B B
q x xx r r B
E r B rq B
2 212
BB
![Page 19: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/19.jpg)
19
IV.3 50/50 rounding
The first term of
The second term
IV.4 Unbiased rounding
The first term of
1U
22 3 16
B BB
1ln ( 1)2 2B B
1U 12
B
![Page 20: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/20.jpg)
20
The second term:
IV.5 Comparisons of three rounding rules
Conv 50/50 Unbiased
1st
Term
2nd
Term
1ln[
2
2( 1)]
2B
BB
B
1ln ( 1)2 2B B
12
B
2 212
1
x
BB
Eq
22 3 1 16 x
B B EB q
22 3 1 16 x
B B EB q
2 16
BB
![Page 21: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/21.jpg)
21
Comparisons of three rounding rules
B=10 Conv 50/50 Unbiased
1st term 2.61 11.49 (4.4) 4.5 (1.7)
2nd term .85 2.85 1.65 1
x
Eq
1
x
Eq
1
x
Eq
![Page 22: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/22.jpg)
22
Comparisons of three rounding rules
B=1,000 Conv 50/50 Unbiased
1st term 193.65 3,453.88 (18) 499.5 (2.6)
2nd term 83.33 322.83 166.67
1
x
Eq
1
x
Eq
1
x
Eq
![Page 23: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/23.jpg)
23
IV.6 E(1/q) for log-normal distribution
Lety = ln x
Then, x has a lognormal distribution, i.e.,
2( , )y N
212
ln( )2 1( | , ) ( 2 ) , 0x
f x x e x
![Page 24: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/24.jpg)
24
Let
Then
IV.6 E(1/q) for Pareto distribution of 2nd kind
The Pareto distribution of the second kind is,
21
( | , )c f x dx
21221 1( | 1, , ) [1 ( )]E q e
q c
1( ) , 0, 0a
a
akf q a q kq
![Page 25: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/25.jpg)
25
where k is the minimum value of q and c is the cumulative probability from 1.
IV.7 Upper limit for E(1/q) for multinomial distribution
The multinomial distribution has the form = 0,1,2,
11( )( 1)a kE
q c a
1 2 1 21
1
!( , , . . | , , . . )!
ik
qk k ik
ii
i
nf q q q p p p pq
iq
![Page 26: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/26.jpg)
26
Note,
for all i.
Let be the size of the category i and
1 1 3( ) ( ) [ ]
1 ( 1)( 2)i i i i
E E Eq q q q
in
1
1
j
ii
n n n
1
1
1
1
2
11 22
5( 1)( 2) 2( 2) 61( ) [ ]. .
2( 1)( 2) (1 )
i
jj
i
jj
n
n
ki i i
ik
i i
n n p r n pEq q q
n n p r
![Page 27: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/27.jpg)
27
V. Concluding comments
Various methods of rounding and in some applications various choices for rounding base B are available.
The question becomes: which method and/or base is expected to perform best in terms of data quality and preserving distributional properties of original data and, quantitatively, what is the expected distortion due to rounding?
This paper provides a preliminary analysis toward answering these questions
![Page 28: 1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics](https://reader036.vdocuments.mx/reader036/viewer/2022062906/5a4d1b717f8b9ab0599b566a/html5/thumbnails/28.jpg)
28
ReferencesGrab, E.L & Savage, I.R. (1954), Tables of the Expected Value of 1/X for Positive Bernoulli and Poisson Variables, Journal of the American Statistical Association 49, 169-177.N.L. Johnson & S. Kotz (1969). Distributions in Statistics, Discrete Distributions, Boston: Houghton Mifflin Company.N.L. Johnson & S. Kotz (1970). Distributions in Statistics, Continuous Univariate Distributions-1, New York: John Wiley and Sons, Inc.Kim, Jay J., Cox, L.H., Gonzalez, J.F. & Katzoff, M.J. (2004), Effects of Rounding Continuous Data Using Rounding Rules, Proceedings of the American Statistical Association, Survey Research Methods Section, Alexandria, VA, 3803-3807 (available on CD).Vasek Chvatal. Harmonic Numbers, Natural Logarithm and the Euler-Mascheroni Constant. See www.cs.rutgers.edu/~chvatal/notes/harmonic.html