on invariant post randomization for statistical …...on invariant post randomization for...
TRANSCRIPT
RESEARCH REPORT SERIES
(Disclosure Avoidance #2014-01)
On Invariant Post Randomization for Statistical
Disclosure Control
Tapan K. Nayak and Samson A. Adeshiyan
Center for Disclosure Avoidance Research
U.S. Census Bureau Washington DC 20233
Report Issued: September 11, 2014
Disclaimer: This report is released to inform interested parties of ongoing research and to encourage discussion of work in progress. The views expressed are those of the author and not necessarily those of the U.S. Census Bureau.
On Invariant Post Randomization for Statistical
Disclosure Control
Tapan K. Nayak∗ and Samson A. Adeshiyan†‡
Abstract
In this paper, we investigate certain operational and inferential aspects of invariant
PRAM (post randomization method) as a tool for disclosure limitation of categorical data.
Invariant PRAMs preserve unbiasedness of certain estimators, but inflate their variances and
distort other attributes. We introduce the concept of strongly invariant PRAM, which does
not affect data utility or the properties of any statistical method. However, the procedure
seems feasible in limited situations. We review methods for constructing invariant PRAM
matrices and prove that a conditional approach, which can preserve the original data on any
subset of variables, is an invariant PRAM. For multinomial sampling, we derive expressions
for variance inflation due to invariant PRAMing and variances of certain estimators of the
cell probabilities and also their tight upper bounds. We discuss estimation of these quantities
and thereby assessing statistical efficiency loss due to invariant PRAMing. We find a connec-
tion between invariant PRAM and creating partially synthetic data using a nonparametric
∗Center for Disclosure Avoidance Research, U.S. Census Bureau, Washington, DC 20233 and Department of
Statistics, George Washington University, Washington, DC 20052.†U.S. Energy Information Administration, Washington, DC 20585.‡The views expressed in this article are those of the authors and not necessarily those of the U.S. Census
Bureau. The analysis and conclusions contained in this paper are those of the authors and do not represent the
official position of the U.S. Energy Information Administration (EIA) or the U.S. Department of Energy (DOE).
1
approach, and compare estimation variance under the two approaches. Finally, we discuss
some aspects of invariant PRAM in a general survey context.
Key words and Phrases: Categorical data; randomized response; sampling design; synthetic
data; unbiased estimation; variance inflation.
2
1. Introduction
The Post Randomization Method (PRAM), introduced by Kooiman et al. (1997) and further
discussed by Gouweleeuw et al. (1998), De Wolf et al. (1998), Willenborg and De Waal (2001),
Ronning (2005), Van den Hout and Elamir (2006), Van den Hout and Kooiman (2006), Cruyff et
al. (2008) and others, is an important technique for categorical data perturbation for confiden-
tiality protection. PRAM stochastically transforms each record in a data set using pre-selected
probabilities. This deliberate misclassification of the original responses introduces uncertainty
about the true category of any respondent. On the other hand, since the misclassification prob-
abilities are known, valid statistical inferences can be derived from PRAMed data. Let X be a
categorical variable with categories c1, . . . , ck. The basic ideas of PRAM are to (i) select a tran-
sition probability matrix P = ((pij)), where∑i pij = 1 for j = 1, . . . , k, and then (ii) randomly
change any original category cj to ci with probability pij (i, j = 1, . . . , k). The randomization
step is performed for each record in the data set, independently of all other records. We shall
denote the transformed variable by Z, in which case, pij = P (Z = ci|X = cj). In practice,
data agencies should release PRAMed data along with the transition probability matrix P , also
called the PRAM matrix, to the public, so that users can derive valid statistical inferences from
the released data.
As noted by Gouweleeuw et al. (1998) and Van den Hout and Van der Heijden (2002),
mathematically, PRAM is similar to randomized response (RR) surveys (e.g., Warner, 1965;
Chaudhuri and Mukerjee, 1988; Nayak, 1994; Nayak and Adeshiyan, 2009). Thus, many ideas
and mathematical results developed for RR surveys can be applied to PRAM. Both methods deal
with the dual goals of protecting privacy and preserving statistical information. One difference
is that in RR surveys, each respondent randomizes his response at data gathering stage, whereas
in PRAM, randomization is carried out by the surveyor after the data are collected. Another
divergence, which has not been noted well earlier, is that in RR surveys the transition probability
3
matrix P is prespecified and hence fixed, but in PRAM it may depend on the data and hence
be random. This causes certain differences in the variance inflation due to the two procedures,
which we discuss in Section 4.
Let πi = P [X = ci], i = 1, . . . , k, and π = (π1, . . . , πk)′. Let n denote the sample size
and Ti denote the frequency of category ci in the original sample. Most papers on PRAM
assume multinomial sampling, i.e., the original data are collected by random sampling from an
infinite population or by simple random sampling with replacement (SRSWR) if the population
is finite. We shall assume multinomial sampling, except in Section 6, where we consider general
probability sampling. Under multinomial sampling, T = (T1, . . . , Tk)′ ∼ Mult(n, π), but note
that T1, . . . , Tk cannot be ascertained from PRAMed data. Let, Si denote the frequency of
category ci after PRAMing, λi = P (Z = ci), i = 1, . . . , k and λ = (λ1, . . . , λk)′. Then, S =
(S1, . . . , Sk)′ ∼Mult(n, λ), where
λ = Pπ. (1.1)
From (1.1), it follows that for nonsingular P , any unbiased estimator λ̃ of λ yields an unbiased
estimator of π, given by π̃ = P−1λ̃. The MLE (and UMVUE) of λ is λ̂ = S/n, which yields the
following estimator of π:
π̂ = P−1λ̂ = P−1(S/n). (1.2)
It can be seen that π̂ is an unbiased estimator of π and
V ar(π̂) =(Dπ − ππ′)
n+
[P−1Dλ(P−1)′ −Dπ]n
, (1.3)
whereDπ is a diagonal matrix with diagonal elements being π1, . . . , πk andDλ is defined similarly
(see Chaudhuri and Mukerjee, 1988, p. 43). The first term on the right side of (1.3) is the
variance under no randomization and the last term is the additional variance due to PRAMing.
Without PRAMing, i.e., based on the original data, the MLE (and the UMVUE) of π is
π̂0 = T/n. Note that E[S|T] = PT and hence E[π̂|T] = π̂0. Thus, the estimator π̂ in (1.2) can
4
be regarded as an unbiasedly recovered version of π̂0. However, if E[S|T] = T, i.e.,
PT = T or equivalently Pπ̂0 = π̂0, (1.4)
then S/n is also an unbiased estimator of π, which can be calculated without using P or its
inverse. Also, S/n is always a probability vector, while π̂ in (1.2) may not be so. Observing these
points, Gouweleeuw et al. (1998) defined a PRAM to be an invariant PRAM if P satisfies (1.4).
The main goal of this paper is to examine certain practical and inferential sides of invariant
PRAM and present some results to advance it further as a disclosure control method.
PRAM can be applied to more than one categorical variable, independently or jointly on the
cross-classification of all variables (see, Gouweleeuw et al., 1998). Conceptually, any PRAM can
be regarded as being applied to the compound variable created by cross-classifying all variables,
and that view is both convenient and appropriate for estimating the joint probabilities. When
there are several variables and they are PRAMed independently and invariantly, the marginal
probabilities can be estimated from PRAMed data without any adjustment for PRAMing, but
not the joint probabilities. This is because, independent invariant PRAMing of individual vari-
ables does not imply invariant PRAMing of the compound variable, unless the variables are
independently distributed. So, for estimating the joint probabilities, one will need to obtain the
transition probability matrix for the combined variable and use its inverse. Unbiased estimation
of all cell probabilities without knowing the PRAM matrix requires an invariant PRAM of the
combined variable.
In the next section, we introduce and discuss a stronger version of invariant PRAM that has
no effect on statistical inferences. Choosing a suitable P satisfying (1.4) is central to devising
an invariant PRAM. This requires methods for solving (1.4) for P and assessing the efficacy
of any given P for confidentiality protection and also its effect on variance of estimators. In
Section 3, we review two existing methods for solving equations (1.4) and (2.1) and formalize an
approach that permits PRAMing some of the variables while keeping others unchanged. This
5
is useful when there are several variables and the total number of cells is large. In Section 4,
we derive the variance covariance matrix V (π̂∗) of π̂∗ = S/n under multinomial sampling, and
discuss how a data agency can estimate it. However, a user cannot estimate V (π̂∗), unless P
is released along with PRAMed data, which is undesirable from confidentiality perspective. We
derive a tight upper bound for V (π̂∗), which can be estimated from PRAMed data. In Section
5, we relate one type of synthetic data to invariant PRAM. Specifically, we show that creating
nonparametric synthetic data is equivalent to applying a specific invariant PRAM, which also
induces the maximum variance inflation among all invariant PRAMs. In Section 6, we discuss
invariant PRAM for data coming from a probability sample. We give an expression for variance
inflation that can be used by data agencies. However, it is difficult for a user to estimate the
variance inflation or variances of estimators. We state some concluding remarks in Section 7.
2. Strongly Invariant PRAM
We note that (1.4) implies a specific invariance property, viz., unbiasedness of the observed
relative frequency vector (S/n) for estimating π, but not necessarily of other estimators. Even
under (1.4), the distributions of T and S are different (unless P = I), and consequently (i)
the covariance matrices of (T/n) and (S/n) are not the same and (ii) unbiased estimation of
nonlinear functions of π (e.g.,∑π2i ) may not be possible if P is not given. Thus, invariant
PRAMs preserve only certain properties of some estimators.
For stronger implications, we may wish to choose P so that statistical properties of all infer-
ential methods remain invariant under PRAMing, i.e., the validity of any inferential procedure
would not depend on whether it is applied to the original data or a PRAMed version of it. In
such cases, for making inferences about π, data users would not need to know or use the matrix
P and can use any procedure that is valid for the original data. The question of whether it is
possible to find such P or not was left open for future research in Gouweleeuw et al. (1998). Note
6
that invariance of all inferential methods essentially requires the distributions of any statistic
calculated from original and PRAMed data, respectively, to be the same. Under multinomial
sampling, this amounts to requiring the frequency counts from original and PRAMed data, i.e.,
(T1, . . . , Tk) and (S1, . . . , Sk) respectively, to have the same distribution, viz., Mult(n, π). This
holds if and only if λ = π or equivalently P satisfies the condition
Pπ = π. (2.1)
We define a PRAM to be a strongly invariant PRAM if P satisfies (2.1). One appealing
property of a strongly invariant PRAM is that it does not affect either data utility or data
analysis; there is no loss of statistical information and PRAMed data can be analyzed without
any adjustment for post-randomization. However, such a strong outcome is usually unattainable
in practice because typically π is unknown and so (2.1) cannot be solved for P (except for P = I).
One situation where the true π is known is when the data come from a census. Likewise, when
the sample size is large, π can be fairly well estimated from the original data. Thus, a strongly
invariant PRAM is applicable when releasing a random subset of census data (or of a large data
set) for public use. For example, the Public Use Microdata Sample (PUMS) files released by
the U.S. Census Bureau are subsets of census or large survey data. Thus, strongly invariant
PRAMing may serve as another tool for confidentiality protection in creating PUMS files.
3. Constructing Invariant PRAMs
To construct a strongly invariant PRAM, one needs to solve (2.1) for P (for given π) and choose
a solution. Similarly, constructing an invariant PRAM requires solving (1.4). However, (1.4)
and (2.1) are identical, as mathematical equations, and so we shall examine only (2.1). Note
that (2.1) has many solutions for P , with the identity matrix being an obvious solution, and
if P and P∗ satisfy (2.1), then aP + (1 − a)P∗ also satisfies (2.1) for all 0 ≤ a ≤ 1. Thus, the
7
solution space of (2.1) is a non-empty convex set. The solution set is also closed under matrix
multiplication, viz. if P and P∗ are strongly invariant transition probability matrices, then so is
the product PP∗.
In many practical applications, X is a compound variable with some structurally empty cells,
i.e., cells with zero probability (see Gouweleeuw et al., 1998). We note some practical points
about such situations. For notational simplicity, suppose π = (π′+, π′0)′, where all components
of π+ are positive and π0 is a vector of zeros. Then, partitioning P analogously, (2.1) reduces to P11 P12
P21 P22
π+
0
=
π+
0
. (3.1)
Clearly, (3.1) can hold only if P21 = 0, and P12 and P22 have no bearing on the validity of (3.1).
In particular, one may take P12 = 0 and P22 = I. Essentially, (2.1) reduces to P11π+ = π+,
and thus one needs to consider only the structurally non-zero cells (which also remain so under
any strongly invariant PRAM). Similarly, for invariant PRAM, i.e., for (1.4), only the cells with
positive frequencies are relevant. Any cell that has zero frequency for the original data will also
have zero frequency after invariant PRAMing. In contrast, a cell with original count of zero and a
positive probability may get a positive count after strongly invariant PRAMing. Thus, strongly
invariant PRAMs may better protect confidentiality than invariant PRAMs. One concern in
micro data perturbation is satisfying edit constraints (see De Wolf et al., 1998; Shlomo and De
Waal, 2008). The preceding observations show that invariant (or strongly invariant) PRAMs
of the cross classification of all variables automatically satisfy all natural edit constraints. For
notational simplicity, we shall continue to use (2.1) considering only structurally non-empty
cells, instead of switching to the equation P11π+ = π+.
We shall now briefly review, partly for later reference, two methods for solving (2.1), pre-
sented in Gouweleeuw et al. (1998). To discuss one class of solutions, assume for notational
simplicity that 0 < πk ≤ πi for i = 1, . . . , k. Then, letting pii = 1 − θ(πkπi
) and pij = θ[ πk(k−1)πj
]
8
for i 6= j, it can be seen that P = ((pij)) satisfies (2.1) for all 0 ≤ θ ≤ 1. These {pij},
with any 0 ≤ θ ≤ 1, constitute a strongly invariant PRAM matrix. Here, {pij} can be in-
terpreted nicely. Note that P (Z 6= cj |X = cj) = θ(πkπj
), j = 1, . . . , k, which implies that θ is
the probability of changing a true response of ck to another category. Any other response cj
(j 6= k), changes to another category with probability θ(πkπj
). Moreover, it can be seen that for
i 6= j, P (Z = ci|X 6= Z) = 1/(k − 1), which means that when a true response is changed, the
perturbed value is selected at random from one of the remaining (k − 1) categories.
Another method for solving (2.1) uses a two step process. Let R = ((rij)) be any transition
probability matrix such that all components of Rπ are positive. Let qij = [rjiπi]/[∑l rjlπl],
Q = ((qij)) and P = QR. Then, it can be verified that P satisfies (2.1). This also shows that
any PRAMed data set (with a known transition probability matrix R) can be PRAMed again,
with a suitable transition probability matrix Q, to make the final outcome a strongly invariant
PRAM. In this context, we note that R need not be a square matrix, i.e., X and Z need not
have the same number of cells. For any transition probability matrix R of order m × k with
all components of Rπ being positive, the final matrix P is a strongly invariant PRAM matrix.
However, P would be singular, if m < k or rank(R) < k.
When there are several variables, applying strongly invariant PRAM to each variable sepa-
rately (and independently) does not amount to using a strongly invariant PRAM on the com-
pound variable, unless all variables are independently distributed. So, the two methods discussed
above need to be applied to the cross-classification of all variables, which is unwieldy when the
total number of cells is large. Gouweleeuw et al. (1998) and De Wolf et al. (1998) discuss several
practical issues, such as how to decide which variables are to be PRAMed and how to choose P .
They suggest to group the variables such that any two variables falling in two different groups
are nearly independent. Then, consider all variables in each group as a compound variable and
partition its categories into clusters in such a way that any category can be replaced only by
9
categories within its cluster. This approach leads to a block diagonal structure of P for each
compound variable. Finally, choose the block matrices suitably to achieve confidentiality pro-
tection goals. Shlomo and De Waal (2008) applied these recommendations to an Israel Census
sample data set. However, as the compound variables (for the groups) are PRAMed separately,
the overall process may not be an invariant PRAM of the cross-classification of all variables.
Next, we describe another mechanism for constructing strongly invariant PRAMs of several
variables. The procedure uses a conditional approach and has been used by Shlomo and De
Waal (2008). The following discussion gives the procedure a formal footing and helps to bring
out a connection between invariant PRAM and partially synthetic data, presented in Section 5.
Theorem 3.1. Consider two categorical variables X and Y . Suppose X is kept unchanged
and within each category of X, Y is strongly invariantly PRAMed into Z. Then, the stochastic
transformation from (X,Y ) to (X,Z) is a strongly invariant PRAM of (X,Y ).
Proof. Suppose X has k categories, denoted by c1, . . . , ck, and Y and Z have m categories
labeled d1, . . . , dm. Since Y is strongly invariantly PRAMed within each category of X, we have
P (Z = dj |X = ci) = P (Y = dj |X = ci) for all i = 1, . . . , k and j = 1, . . . ,m. Then considering
the joint probabilities, we get
P (X = ci, Z = dj) = P (X = ci)P (Z = dj |X = ci)
= P (X = ci)P (Y = dj |X = ci)
= P (X = ci, Y = dj)
for all i = 1, . . . , k and j = 1, . . . ,m, which completes the proof.
It can be seen similarly that a parallel version of Theorem 3.1 holds for invariant PRAM,
i.e., invariant PRAMing of Y within each X category is an invariant PRAM of (X,Y ). Thus,
the following mechanism can be used for invariant PRAMing. First, divide all variables into two
10
groups and let X and Y denote the corresponding compound variables; each variable in the data
set must be a part of either X or Y . Then, for each category of X, apply an invariant PRAM to
Y . One advantage of the conditional approach is that it fully preserves the original data on X,
which may be regarded as a control variable. This is useful for preserving information on some
of the variables that are deemed important to users. For example, we may PRAM zipcodes
within each county to preserve county level counts. We should also note that the conditional
approach requires choosing and using a transition probability matrix for each X category, which
may be burdensome, but that also offers the flexibility to perturb Y in varying degree across the
categories of X. This is helpful for mitigating different disclosure risks for units in different X
categories. When the total number of cells is large and many cells have small counts, invariant
PRAMing may not be convenient or adequate for privacy protection. In such cases, invariant
PRAMing after cell collapsing could be a practical and effective approach.
We believe, the procedure proposed above, viz., dividing all variables into two compound
variables X and Y and then invariantly PRAMing Y within each category of X, would suffice
in most practical situations. However, we may note that repeated applications this process also
result in an invariant PRAM. This follows from the fact that if P and P∗ are invariant transition
probability matrices (i.e., satisfy (1.4)), then PP∗ is also so. In this context, one may put any of
the variables in the control set in each step. For example, for a data set with three categorical
variables X1, X2 and X3, we may first invariantly PRAM X2 and X3 within X1 categories,
creating (X1, X′2, X
′3) and then invariantly PRAM X1 and X ′2 within X3 to form (X ′1, X
′′2 , X
′3)
and so on. Such a perturbation process is equivalent to an invariant PRAM.
4. Estimation from Invariantly PRAMed Data
As noted in Section 2, under multinomial sampling, all analytical methods for original data
are equally valid for strongly invariantly PRAMed data. Thus, strongly invariant PRAMs do
11
not raise any new inferential issues. Also, a strongly invariant PRAM cannot be constructed
in many practical situations, as π is unknown. So, here we shall discuss estimation based on
invariantly PRAM data, as defined by (1.4). Let P = [P1 : . . . : Pk] be an invariant PRAM
matrix and rewrite (1.4) ask∑i=1
TiPi = T. (4.1)
Note that P depends on T and hence is a random matrix. Let Fij denote the number of units
with original category ci and perturbed category cj , and let Fi = (Fi1, . . . , Fik)′. Then, the
frequency vector (S) from PRAMed data can be expressed as S =∑ki Fi, where given T and
P , Fi ∼ Mult(Ti, Pi), i = 1, . . . , k, and they are independent. We shall find this representation
helpful for deriving formulas for variance inflation and overall estimation variance.
First, we note a subtle point that it makes sense to talk about unconditional distributions of
Fi and S only when a probability model for P is given. Note that while P depends on T, it need
not be a function of T, as (1.4) has many solutions. So, for discussing distributional properties
of relevant statistics, we shall assume that an algorithm for choosing P from T, via (1.4), is
given, i.e., there is a probability model for P that is fully determined by T. The algorithm for
choosing P may be probabilistic or deterministic, and in the latter case the distribution of P
given T is degenerate. However, for our discussion, we mostly do not need to know the actual
probability model for P ; it suffices to know that it exists and is determined by T.
In the above setting, we investigate statistical properties of π̂∗ = S/n as an estimator of π.
Using S =∑ki=1 Fi and above mentioned conditional distributions of Fi, i = 1, . . . , k, we get
E(π̂∗|T, P ) =1n
k∑i=1
E[Fi|T, P ] =1n
k∑i=1
TiPi =1nT = π̂0, (4.2)
by (4.1), and
V (π̂∗|T, P ) =1n2
k∑i=1
Ti[DPi − PiP ′i ] =1n
[Dπ̂0 −k∑i=1
(Tin
)PiP ′i ], (4.3)
12
which represents the added variation due to PRAMing. We believe, data agencies should examine
V (π̂∗|T, P ) when selecting a suitable P . Note that for any given P , the data agency can calculate
(4.3), based on the original data, before applying PRAM.
Since T ∼ Mult(n, π) and π̂0 = T/n, we get E(π̂0) = π and V (π̂0) = [Dπ − ππ′]/n. From
these and (4.2) and (4.3) it follows that E(π̂∗) = π (for any distribution of P ) and
V (π̂∗) = V [E(π̂∗|T, P )] + E[V (π̂∗|T, P )]
= V (π̂0) +1n
[Dπ − E{k∑i=1
(Tin
)PiP ′i}]. (4.4)
Thus, the relative frequency vector π̂∗ from invariantly PRAMed data is an unbiased estimator of
π and the last term of (4.4) is the variance inflation matrix, which is different from the last term
of (1.3). This difference arises from the fact that (1.3) holds when P is fixed and prespecified,
but in invariant PARM, P is data dependent and random. We can also express (4.4) as
V (π̂∗) =1n
[2Dπ − ππ′]−1nE{
k∑i=1
(Tin
)PiP ′i}]. (4.5)
The expectation in (4.5) may be difficult to derive, depending on the distribution of P . The
data agency knows P and the values of T1, . . . , Tk and hence can estimate (4.5) by
V̂1(π̂∗) =1n
[2Dπ̂0 − π̂0π̂′0]−
k∑i=1
(Tin2
)PiP ′i ].
If P is reported along with the perturbed data, a data user may estimate V (π̂∗) (without knowing
T1, . . . , Tk) by
V̂2(π̂∗) =1n
[2Dπ̂∗ − π̂∗π̂′∗]−k∑i=1
(Sin2
)PiP ′i ].
However, the reporting of P is problematic from disclosure perspective. As Gouweleeuw et al.
(1998) noted, T is an eigenvector of P corresponding to the eigenvalue 1 and one might be able
to deduce the original frequencies from P without any error. If P is not published, estimating
V (π̂∗) by data users is challenging. However, a user can apply the following result to obtain an
upper bound for the variance of π̂∗.
13
Theorem 4.1. (a) An upper bound of V (π̂∗) is
Vmax(π̂∗) = (2− 1n
)[Dπ − ππ′
n] (4.6)
in the sense that [Vmax(π̂∗)− V (π̂∗)] is non-negative definite for any invariant PRAM.
(b) The upper bound in (4.6) is tight, i.e., there exists an invariant PRAM for which V (π̂∗) =
Vmax(π̂∗).
Proof. For any given vector a of dimension k, consider
a′V (π̂∗)a =1na′[2Dπ − ππ′]a−
1n2E{
k∑i=1
Ti(a′Pi)2}. (4.7)
Suppose, without loss of generality, that only the first m(m ≤ k) components of T are nonzero.
Then, using (4.1) and the Cauchy-Schwarz inequality, (x′y)2 ≤ (x′Ax)(y′A−1y), with x′ =
(T1, . . . , Tm), y′ = (a′P1, . . . , a′Pm) and A = diag(1/T1, . . . , 1/Tm), we get
(a′T)2 = [m∑i=1
Ti(a′Pi)]2 ≤ n[m∑i=1
Ti(a′Pi)2] = n[k∑i=1
Ti(a′Pi)2],
which implies that
1n2E{
k∑i=1
Ti(a′Pi)2} ≥1n3E(a′T)2
=1na′[E(π̂0π̂
′0)]a
=1na′[(
Dπ − ππ′
n) + ππ′]a. (4.8)
Next, using (4.8) in (4.7), we get
a′V (π̂∗)a ≤ (2− 1n
)a′[Dπ − ππ′
n]a. (4.9)
Part (a) of the theorem now follows readily as (4.9) holds for all a.
14
For part (b), we show that V (π̂∗) equals Vmax(π̂∗) if the transition probability matrix is
P∗ =1n
T1 T1 · · · T1
T2 T2 · · · T2
...... · · ·
...
Tk Tk · · · Tk
.
Note that P∗ satisfies (1.4) and hence it is an invariant PRAM matrix. As each column of P∗ is
π̂0, the sum in (4.4) equals π̂0π̂′0 and hence V (π̂∗) reduces to
V (π̂∗) = V (π̂0) +1n
[Dπ − E(π̂0π̂′0)]
= V (π̂0) +1n
[Dπ − {V (π̂0) + ππ′}]
= (2− 1n
)V (π̂0),
which is the same as Vmax(π̂∗) in (4.6).
Estimating linear combinations of π, say a′π, is a common problem. One special case is
estimating the total probability of a set of categories, which can be expressed as a′π, where
ai = 1 if category ci is included in the set and ai = 0 otherwise. A natural estimator of a′π is
a′π̂∗ and its variance is a′V (π̂∗)a, for which (4.9) gives an upper bound. Data users can estimate
the upper bounds in (4.6) and (4.9) by replacing π by π̂∗. An obvious lower bound for V (a′π̂∗)
is V (a′π̂0) = a′[{Dπ−ππ′}/n]a. Recall that π̂0 and π̂∗ are estimators of π based on original and
invariantly PRAMed data, respectively, and from (4.6), Vmax(π̂∗) = (2 − 1/n)V (π̂0) ≈ 2V (π̂0);
a similar comparison holds for V (a′π̂∗) and V (a′π̂0). As the upper bounds in (4.6) and (4.9) are
tight, the variance of an estimator under an invariant PRAM may be almost twice the variance
of a corresponding estimator based on the original data.
15
5. Invariant PRAM and Synthetic Data
One significant approach to disclosure avoidance is to release fully or partially synthetic data;
see e.g., Rubin (1993), Raghunathan et al. (2003), Little (1993), Reiter (2002, 2003), An and
Little (2007), Hawala (2008) and Slavkovic and Lee (2010). The basic idea in creating partially
synthetic data is to replace the observed values of some selected variables by simulated data from
pertinent predictive distributions. A synthetic data set consisting of original X and synthetic Y
values, where both X and Y may be vectors, is created by replacing the observed Y of each unit
by a simulated value from a predictive distribution f(y|x), where x is the unit’s observed value
of X. Usually, f(y|x) is a posterior predictive distribution, but one may also use f(y|x, θ̂), where
θ̂ is some estimate (such as MLE) of the parameter(s) of an assumed model; see Hawala (2008)
and Reiter and Kinney (2012). In the following, we present a link between invariant PRAM and
nonparametric partially synthetic data and compare the variance of estimators under the two
approaches.
Suppose a data set has only two categorical variables X and Y (both can be compounded
variables) with categories labeled as in Theorem 3.1. Let Tij denote the frequency of (X =
ci, Y = dj), i = 1, . . . k, j = 1, . . .m, in the original data and let Ti. denote the marginal fre-
quency of (X = ci). Let pj|i = Tij/Ti.. Here, pj|i is a natural and nonparametric estimate of
f(y|x = ci). So, a nonparametric approach to creating partially synthetic data set would be
as follows. For any unit with X = ci, replace its observed Y category by one of d1, . . . , dm,
with respective probabilities p1|i, . . . , pm|i. However, this is equivalent to ordering the cells as
c1d1, . . . , c1dm, c2d1, . . . c2dm, . . . , ckd1, . . . ckdm and then applying a PRAM with a block diago-
nal PRAM matrix, whose ith (i = 1, . . . , k) block, say P i, is an m×m matrix with each column
being (p1|i, . . . , pm|i)′. It is easy to see that P i is an invariant PRAM matrix for PRAMing Y
within the category ci of X. The whole process is an invariant PRAM, in view of Theorem 3.1.
In summary, for categorical variables, a nonparametric synthetic data set can be viewed as an
16
invariantly PRAMed data set.
Our results of Section 4 can be extended to the case where Y is invariantly PRAMed within
each X category. Here, the PRAMing process divides all cells into k groups, one for each category
of X, and then perturbs the frequencies of the cells in each group; it does not alter the marginal
frequencies of X categories. To see how the method affects the frequencies in each group, let
us consider the cells within X = ci. Let ~Ti = (Ti1, . . . , Tim)′ and let P i = (P i1 : · · · : P im) be an
invariant PRAM matrix for this group, i.e.,
P i ~Ti = ~Ti orm∑j=1
TijPij = ~Ti. (5.1)
Let Sij denote the frequency of the (i, j)th cell based on PRAMed data and let ~Si = (Si1, . . . , Sim)′.
Here, ~Si is a perturbed version of ~Ti and as in Section 4, we have ~Si = F i1 + · · · + F im, where
F ij ∼ mult(Tij , P ij ), j = 1, . . . ,m, are independent. Then, for any P i satisfying (5.1), it follows
that E(~Si|~Ti, P i) = ~Ti and
V (~Si|~Ti, P i) = D~Ti−
m∑j=1
TijPij (P
ij )′. (5.2)
If P ij = (1/Ti.)(Ti1, . . . , Tim)′, j = 1, . . . ,m, which yields nonparametric partial synthetic data,
the sum in (5.2) reduces to (1/Ti.)~Ti ~T ′i . Moreover, it can be seen, as in the proof of Theorem
4.1, that [∑mj=1 TijP
ij (P
ij )′−(1/Ti.)~Ti ~T ′i ] is non-negative definite for all P i satisfying (5.1), which
leads to the following:
Theorem 5.1. The induced variation of the cell frequencies due to replacing observed Y by
nonparametric synthetic values is at least as large as the variation induced by any invariant
PRAM of Y within the categories of X.
The preceding discussions show that generating synthetic Y in a nonparametric way is equiv-
alent to invariantly PRAMing Y within X categories, with automatically using the most variance
inflationary PRAM matrices. In practice, this extreme invariant PRAMing may be unduly per-
turbative. We believe, data agencies would prefer to invariantly PRAM Y within X categories,
17
with more flexible and judicious choice of the PRAM matrices. The general approach also
permits to apply different amount of perturbation to data in different categories of X. Oper-
ationally, one needs to choose and use a solution P i of (5.1), for each i = 1, . . . , k. Here, the
first method that we reviewed in Section 3, due to Gouweleeuw et al. (1998), may be useful
as it requires just one number, viz. choosing a value of θ, for each X category. Recall that θ
represents the probability of changing any value of Y that has the lowest frequency. Usually, the
units falling in cells with small counts have high disclosure risk and require greater protection.
So, the value of θ for category ci of X should decrease as minj{Tij} increases. In particular, if
all Tij , j = 1, . . . ,m are fairly large, θ should be close to 0 and P i ≈ I.
The preceding discussion should not be viewed as a sweeping criticism of disclosure limitation
via synthetic data. Clearly, our observations and Theorem 5.1 are about a specific synthetic
approach and they do not apply to synthetic data based on models and priors, which may serve
well in some applications. However, the nonparametric approach has a particular appeal as
models and priors may lead to much biases when they are not correct. Obviously, PRAM is
only for categorical variables whereas synthetic data can be created for both categorical and
quantitative variables. Also, synthetic data based on multiple imputation involves additional
inferential and disclosure control issues. For a more comprehensive comparison of PRAM with
synthetic data one should examine both estimation error and disclosure control, which we leave
for future research.
6. Estimation Under Probability Sampling
Practical surveys rarely use simple random sampling with replacement (i.e, multinomial sam-
pling). Most surveys use a combination of stratified, cluster, systematic and multi-stage sam-
pling. Also, nonresponse is a common phenomenon. Consequently, one uses a weighted com-
bination of the data for estimating the population probability vector π. Thus, it is important
18
to examine properties of PRAM when applied to data coming from a general finite population
survey. Let N denote the population size and consider a general survey design p(s), where s is
a subset of the population and p(s) is the probability of selecting s. As in Section 1, let X be a
categorical variable with k categories c1, . . . , ck and let π denote the corresponding probability
vector. Let Ui (i = 1, . . . , N) be a k-dimensional indicator vector representing the X category
of unit i. Specifically, if unit i falls in category cj , then the jth component of Ui is 1 and the
remaining components are 0. A similar indicator vector Vi will represent the category of unit i
after PRAMing. Suppose
π̂w =∑i∈s
wsiUi (6.1)
is an estimator of π based on the original data, where the sampled units are weighted by {wsi}.
Suppose the data are PRAMed using a fixed non-singular PRAM matrix P = [P1 : · · · : Pk].
Then, one can use
π̂w∗ =∑i∈s
wsiP−1Vi = P−1(
∑i∈s
wsiVi)
to estimate π based on PRAMed data. Note that the conditional distribution of Vi given Ui is
mult(1, PUi), which shows that E[π̂w∗|s] = π̂w and
V [π̂w∗|s] = P−1[ ∑i∈s
w2si{DPUi − PUiU ′iP ′}
](P−1)′
= P−1[ ∑i∈s
w2siDPUi
](P−1)′ −
∑i∈s
w2siUiU
′i . (6.2)
From these, we see that π̂w∗ unbiasedly recovers π̂w and
V (π̂w∗) = Vp(π̂w) + Ep{V [π̂w∗|s]}, (6.3)
where Vp and Ep are with respect to the sampling design. The last term of (6.3) is variance
inflation due to PRAMing, which the data agency can unbiasedly estimate by (6.2). The data
agency should examine the size of the matrix in (6.2), measured for example by its trace or
determinant, for judging the suitability of P from data utility perspective.
19
Next consider using
π̃w =∑i∈s
wsiVi
for estimating π, i.e., applying the estimator (6.1) for the original data to PRAMed data. Here,
E[π̃w|s] =∑i∈s
wsiPUi = P [∑i∈s
wsiUi] = Pπ̂w,
which coincides with π̂w if and only if
Pπ̂w = π̂w. (6.4)
Based on this observation, Kooiman et al. (1997) and Gouweleeuw et al. (1998) defined invariant
PRAM by (6.4) when the survey weights are unequal. We note some properties of these invariant
PRAMs in the following.
We shall assume that π̂w is a proper probability vector, i.e., all of its components are non-
negative and add to 1. Then, any of the methods discussed in Section 3 can be used to find
solutions of (6.4). For using the conditional approach, we would use relevant conditional proba-
bilities implied by π̂w. We can still keep data on some variables (X) unchanged and invariantly
PRAM the remaining variables (Y ) within the categories of X. This is quite important when the
survey weights depend on some of the variables. PRAMing of those variables would introduce
randomness in the survey weights and hence the above treatment of E[π̃w|s] would not be cor-
rect. To define invariant PRAM via (6.4) it is necessary to assume that all variables determining
the survey weights are kept unchanged.
As before, choosing a P satisfying (6.4) depends on the data, via π̂w, and the algorithm for
calculating P . Thus, the distribution of {Vi} also depends on the procedure for selecting one
of many solutions of (6.4). Conditionally given s and P = [P1 : · · · : Pk], {Vi} are independent
mult(1, PUi), which implies that E[π̃w|s, P ] = π̂w and
V [π̃w|s, P ] =∑i∈s
w2si{DPUi − PUiU ′iP ′}
20
=∑i∈s
w2siDPUi − P
[ ∑i∈s
w2siUiU
′i
]P ′. (6.5)
Consequently, if π̂w is an unbiased estimator of π, then so is π̃w and
V (π̃w) = V (π̂w) + E{V [π̃w|s, P ]},
where the expectation is with respect to both the distribution of P given s and the sampling
design, which a user cannot evaluate if the mechanism for generating P is not released. However,
the data agency can calculate (6.5) and use it to assess the level of variance inflation.
7 Discussion
Of the few methods that are available for perturbing categorical microdata for disclosure control,
invariant PRAM is appealing because it allows unbiased estimation of all cell probabilities, with-
out any adjustment for data perturbation. We believe, our results and observations contribute
to its further development as a useful tool. We noted that preserving all properties of all esti-
mators require a stronger version of invariant PRAM, which is applicable in limited situations.
We gave further reasons for using the conditional approach, i.e., invariantly PRAMing some
variables within the categories of the remaining variables, for devising an invariant PRAM. It
results in a proper invariant PRAM and allows the data agency to selectively retain the original
data on some variables, which is particularly important when some of the variables determine
the survey weights. We examined the effect of invariant PRAM on the variance of estimators,
which previously had not been appreciated well. Our results are useful to data agencies for
assessing variance inflation and selecting suitable PRAM matrices. We found a connection be-
tween invariant PRAM and nonparametric synthetic data and observed that the former is more
flexible and gives greater control to data agencies to balance disclosure control and data utility.
In this paper, we focused mainly on constructing invariant PRAM matrices and effect of
invariant PRAMing on the variance of estimators of cell probabilities. We did not go into
21
measuring disclosure risk. Some general approaches and related issues have been discussed
by Duncan and Lambert (1986, 1989), Greenberg and Zayatz (1992), Lambert (1993), Reiter
(2005), Skinner and Elliott (2002), Shlomo and Skinner (2010, 2012) and others. Invariant
PRAM (or more generally PRAM) is a significant disclosure control technique, but we believe
that additional research and directives are needed for it to become a common method. In
particular, further guidance on how to choose the PRAM matrix will be helpful. In principle, the
choice should achieve disclosure control goals at minimum loss of information, one consequence
of which is variance inflation. De Wolf et al. (1998), Gouweleeuw et al. (1998), Skinner and
Elliott (2002), Shlomo (2010), Shlomo and De Waal (2008), Shlomo and Skinner (2010, 2012),
Van den Hout and Elamir (2006), Willenborg (2000), Yancey et al. (2002) and others have
made significant contribution to disclosure risk assessment and the choice of a PRAM matrix.
Additional empirical and theoretical research on effects of invariant PRAM on both information
loss and disclosure risk will help expand its the scope of applications for disclosure control.
References
[1] An, D., and Little, R.J.A. (2007). Multiple imputation: an alternative to top coding for
statistical disclosure control. J. Royal Statist. Soc., Ser. A, 170, 923-940.
[2] Chaudhuri, A. and Mukerjee, R. (1988). Randomized Response: Theory and Techniques.
New York: Marcel Dekker.
[3] Cruyff, M.J.L.F., Van den Hout, A. and Van der Heijden, P.G.M. (2008). The analysis
of randomized response sum score variables. J. Royal Statist. Soc., Ser. B, 70, 21-30.
[4] De Wolf, P.-P., Gouweleeuw, J.M., Kooiman, P., Willenborg, L. (1998). Reflections on
PRAM. In: Statistical Data Protection. Office for Official Publications of the European
Communities, Luxembourg, pp. 337-349
22
[5] Duncan, G.T. and Lambert, D. (1986). Disclosure-limited data dissemination. J. Amer.
Statist. Assoc., 81, 10-28.
[6] Duncan, G.T. and Lambert, D. (1989). The risk of disclosure for microdata. J. Business
& Econ. Statist, 7, 207-217.
[7] Gouweleeuw, J.M., Kooiman, P., Willenborg, L.C.R.J. and De Wolf, P.-P. (1998). Post
randomisation for statistical disclosure control: Theory and implementation. J. Official
Statist., 14, 463–478.
[8] Greenberg, B. and Zayatz, L. (1992). Strategies for measuring risk in public use microdata
files. Statist. Neerland., 46, 33-48.
[9] Hawala, S. (2008). Partially synthetic data via expert knowledge, modeling, and match-
ing. Proceedings of the Survey Methods Section, SSC Annual Meeting.
[10] Kooiman, P., Willenborg, L., and Gouweleeuw, J. (1997). A method for disclosure limi-
tation of microdata. Research paper 9705, Statistics Netherlands, Voorburg.
[11] Lambert, D. (1993). Measure of disclosure risk and harm. J. Official Statist., 9, 313-331.
[12] Little, R.J.A. (1993). Statistical analysis of masked data. J. Official Statist., 9, 407-426.
[13] Nayak, T.K. (1994). On randomized response surveys for estimating a proportion. Com-
mun. Statist.- Theory Meth., 23, 3303-3321.
[14] Nayak, T.K. and Adeshiyan, S.A. (2009). A unified framework for analysis and compari-
son of randomized response surveys of binary characteristics. J. Statist. Plann. Inference,
139, 2757-2766.
[15] Raghunathan, T.E., Reiter, J.P., and Rubin, D.B. (2003). Multiple imputation for sta-
tistical disclosure limitation. J. Official Statist., 19, 1-16.
23
[16] Reiter, J.P. (2002). Satisfying disclosure restrictions with synthetic datasets. J. Official
Statist., 18, 531-544.
[17] Reiter, J.P. (2003). Inference for partially synthetic, public use microdatasets. Survey
Methodology, 29, 181-188.
[18] Reiter, J.P. (2005). Estimating identification risk in microdata. J. Amer. Statist. Assoc.,
100, 1101-1113.
[19] Reiter, J. P. and Kinney, S. K. (2012). Inferentially valid, partially synthetic data: Gener-
ating from posterior predictive distributions not necessary. J. Official Statist., 28, 583-590.
[20] Ronning, G. (2005). Randomized response and the binary probit model. Economics Let-
ters, 86, 221-228.
[21] Rubin, D.B. (1993). Discussion: Statistical disclosure limitation. J. Official Statist., 9,
462-468.
[22] Skinner, C.J. and Elliot, M.J. (2002). A measure of disclosure risk for microdata. J. R.
Statist. Soc., Ser. B, 64, 855-867.
[23] Shlomo, N. (2010). Releasing microdata: Disclosure risk estimation, data masking and
assessing utility. J. Privacy & Confidentiality, 2, 73-91.
[24] Shlomo, N. and De Waal, T. (2008). Protection of micro-data subject to edit constraints
against statistical disclosure. J. Official Statist., 24, 229-253
[25] Shlomo, N. and Skinner, C. (2010). Assessing the protection provided by misclassification-
based disclosure limitation methods for survey microdata. Ann. Appl. Statist., 4, 1291-
1310.
24
[26] Shlomo, N. and Skinner, C.J. (2012). Privacy protection from sampling and perturbation
in survey microdata. J. Privacy & Confidentiality, 4, 155-169
[27] Slavkovic, A.B. and Lee, J. (2010). Synthetic two-way contingency tables that preserve
conditional frequencies. Statistical Methodology, 7, 225-239
[28] Van den Hout, A. and Van der Heijden, P.G.M. (2002). Randomized response, statistical
disclosure control and misclassification: A review. Internat. Statist. Rev., 70, 269-288.
[29] Van den Hout, A. and Elamir, E.A.H. (2006). Statistical disclosure control using post
randomisation: Variants and measures for disclosure risk. J. Official Statist., 22, 711731.
[30] Van den Hout, A. and Kooiman, P.(2006). Estimating the linear regression model with
categorical covariates subject to randomized response. Computational Statistics & Data
Analysis, 50, 3311-3323.
[31] Warner, S.L. (1965). Randomized response: A survey technique for eliminating evasive
answer bias. J. Amer. Statist. Assoc., 60, 63-69.
[32] Willenborg, L. (2000). Optimality models for PRAM. In Proceedings in Computational
Statistics, Eds. J.G. Bethlehem and P.G.M. van der Heijden. Heidelberg: Physica-Verlag,
pp. 505-510.
[33] Willenborg, L.C.R.J. and De Waal, T. (2001). Elements of Statistical Disclosure Control.
New York: Springer.
[34] Yancey, W., Winkler, W., and Creecy, R. (2002). Disclosure risk assessment in pertur-
bative micro-data protection. In J. Domingo-Ferrer, ed., Inference Control in Statistical
Databases, vol. 2316 of LNCS. New York: Springer. 135-151.
25