SLICE: generalised software for statistical data editing and imputation

Ton de Waal

Department of Statistical Methods, Statistics Netherlands, PO Box 4000, 2270 JM Voorburg, Netherlands
Abstract. Statistical offices have to face the problem that data collected by surveys or obtained from administrative registers generally contain errors. Another problem they have to face is that values in data sets obtained from these sources may be missing. To handle such errors and missing data efficiently, Statistics Netherlands is currently developing a software package, called SLICE (Statistical Localisation, Imputation and Correction of Errors). SLICE will contain several edit and imputation modules. Examples are a module for automatic editing and a module for imputation based on tree-models. In this paper I describe SLICE, focussing in particular on the above-mentioned modules.

Keywords. automatic data editing, computer-assisted editing, Fellegi-Holt paradigm, imputation, statistical data editing
1 Introduction
Statistical offices have to face the problem that data collected by surveys or
obtained from administrative registers generally contain errors. Errors in statistical
data sets may arise for a variety of reasons. For instance, the respondent may have
misunderstood a question, the respondent may have made a mistake when giving
the answer, or the respondent's database may contain errors. Alternatively, errors
may have been introduced at the statistical office, e.g., while transferring data from
a paper questionnaire to the computer system. Another problem that has to be faced
by statistical offices is that values in data sets obtained from surveys or
administrative registers may be missing. A value may be missing in a data set, e.g.,
because the respondent did not understand the corresponding question, the
respondent forgot to answer the question, or the respondent did not know the
answer. A value may also be missing because at the statistical office this value was
not entered when transferring the data from a paper questionnaire to the computer
system.
Incorrect data and missing data in a data set have to be treated in such a way that
statistical analyses based on this data set give reliable results. Incorrect data can be
treated by first identifying the errors and subsequently correcting these errors. This process is referred to as data editing. Missing values can be treated by estimating
these values and filling in these estimates in the data set. This is referred to as
imputation.
To handle errors and missing values in data sets efficiently, Statistics
Netherlands is currently developing a software package called SLICE (Statistical
Localisation, Imputation and Correction of Errors). SLICE will contain several edit and imputation modules. Examples of such modules are a module for automatic data editing and a module for imputation based on tree-models. SLICE itself is planned to become a module of Blaise, the integrated survey processing system developed by Statistics Netherlands (see Blaise Reference Manual, 1998, and the Blaise Developer's Guide, 1998).
The aim of the present paper is to give a description of the automatic editing module and the imputation module of SLICE. For an overview of other editing software I refer to Pierzchala (1995), and for an overview of other imputation software to Hox (1999).
The remainder of the paper is organised as follows. Section 2 discusses automatic data editing and SLICE. Section 3 describes the present imputation module in SLICE, and the future imputation module based on so-called WAID tree-models. Section 4 concludes the paper with a short discussion.
2 Automatic data editing and SLICE
Traditionally, incorrect data in a data set are edited 'manually'. That is, each faulty record is modified manually, although the record may have been checked either manually or by means of a computer program. In the latter case, where records are checked by means of a computer program, one also says that the data are edited in a computer-assisted manner. Data are modified, e.g., on the basis of subject-matter knowledge, by contacting the respondent again, by comparing the respondent's data to his data from previous years, or by comparing the respondent's data to data of similar respondents. Correcting data 'manually' is a time-consuming and costly process.
It has long been recognised, however, that it is not necessary to correct all data in every detail (see, e.g., Granquist, 1995, and Granquist and Kovar, 1997). A reason why it is not necessary that all errors are removed from the data set is that published results are usually aggregated data, such as totals or means. Small errors will often cancel out when they are aggregated. Moreover, when the data have been obtained by means of a sample of the population there will always be a small error in the results, even when all collected data are completely correct. This implies that for data obtained by means of a sample an error in the results due to 'noise' in the data is acceptable as long as this error is negligible in comparison to the sampling error. These observations allow one to automate the editing process to some extent.
Three steps can be distinguished when editing data automatically. The first step is to identify the erroneous records. Once a record has been identified as incorrect, its erroneous, or more precisely implausible, values have to be identified. These erroneous values are set to missing. The final step is to impute plausible values for the missing values, both for the values that were actually missing in the original data set as well as for the fields that were set to missing because their original values were considered incorrect.
To establish whether a record is erroneous, edit checks have to be specified. A record is considered erroneous if one or more edit checks specified by subject-matter specialists are violated. If all specified edit checks are satisfied for a particular record, this record is considered correct. If records are edited
automatically, a record that is considered correct is not modified, whereas a record
that is considered erroneous is always modified in such a way that all edit checks
are satisfied.
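As a minimal sketch of how edit checks and records might be represented in software (the field names and checks below are hypothetical illustrations, not part of SLICE), each check can be stored as a predicate over a record, and a record is flagged as erroneous as soon as any predicate fails:

```python
# Minimal sketch: edit checks as predicates over a record.
# The field names and checks are hypothetical illustrations, not SLICE itself.

def check_balance(record):
    # Components must add up to the reported total.
    return record["employees_male"] + record["employees_female"] == record["employees_total"]

def check_nonnegative(record):
    # Turnover may not be negative.
    return record["turnover"] >= 0

EDIT_CHECKS = [check_balance, check_nonnegative]

def is_erroneous(record):
    # A record is considered erroneous if one or more edit checks are violated.
    return any(not check(record) for check in EDIT_CHECKS)

if __name__ == "__main__":
    record = {"employees_male": 12, "employees_female": 9,
              "employees_total": 20, "turnover": 150.0}
    print(is_erroneous(record))  # True: 12 + 9 != 20, so the balance check fails
```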
A well-known paradigm for identifying erroneous fields in an erroneous record has been proposed in an influential paper by Fellegi and Holt (1976). According to this Fellegi-Holt paradigm the data of a record should be made to satisfy all edit checks by changing the values of the fewest possible number of variables. Over the years this paradigm has been generalised slightly, because it was realised that the values of some variables are more trustworthy than the values of other variables. Therefore, to each variable a weight, the so-called confidence weight, can be assigned. Such a confidence weight is a positive number that expresses the confidence one has in the correctness of the values of this variable. A high confidence weight corresponds to a variable of which the values are considered trustworthy, a low confidence weight to a variable of which the values are considered not so trustworthy. According to the generalised Fellegi-Holt paradigm the data of a record should be made to satisfy all edit checks by changing the values of the variables with the smallest possible sum of confidence weights.
To illustrate the Fellegi-Holt paradigm we assume that the following two edit checks have been specified by subject-matter specialists:

P + C = T    (1)

and

C / T ≥ 0.6    (2)

where P is the profit of an enterprise, C its costs and T its turnover. Suppose furthermore that a record contains the following data:

P = 755, C = 125 and T = 200.
This record violates edit check (1). Hence, the record is considered erroneous. Now, suppose that the confidence weights of P, C and T are all equal to 1. According to the Fellegi-Holt paradigm the data of the record should be made to satisfy all edit checks by changing the values of the fewest possible number of variables. Preferably, only one value should be changed. By trial and error we can find in our example that there is only one way to modify a single value such that all edit checks are satisfied. Namely, the value of P should be modified to 75.
In this example the Fellegi-Holt paradigm yields exactly one optimal solution, i.e. a set of variables with a minimum sum of confidence weights, to modify the data. In other cases, however, application of the (generalised) Fellegi-Holt paradigm may yield several optimal solutions to modify the data. In such cases additional criteria may be used to select the best option.
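To make the search concrete, the following sketch mimics the 'trial and error' step from the example above: for each candidate variable it solves edit check (1) for that variable, keeps the other values fixed, and then tests edit check (2). It is an illustration under the stated edit checks and equal confidence weights, not the algorithm implemented in SLICE.

```python
# Illustrative sketch of the (generalised) Fellegi-Holt paradigm applied to the
# worked example above. With equal confidence weights, the cheapest repair is a
# single-variable change; edit (1) is solved for the candidate variable and
# edit (2) is then checked. A toy search, not the SLICE algorithm.

RECORD = {"P": 755.0, "C": 125.0, "T": 200.0}
WEIGHTS = {"P": 1.0, "C": 1.0, "T": 1.0}  # confidence weights (all equal here)

def satisfies_edits(r):
    # Edit (1): P + C = T; edit (2): C / T >= 0.6.
    return abs(r["P"] + r["C"] - r["T"]) < 1e-9 and r["T"] > 0 and r["C"] / r["T"] >= 0.6

def repair_single(record, var):
    # Change only `var`, solving edit (1) for it; return the repaired record or None.
    r = dict(record)
    if var == "P":
        r["P"] = r["T"] - r["C"]
    elif var == "C":
        r["C"] = r["T"] - r["P"]
    else:  # var == "T"
        r["T"] = r["P"] + r["C"]
    return r if satisfies_edits(r) else None

# Try candidate variables in order of increasing confidence weight.
for var in sorted(RECORD, key=WEIGHTS.get):
    fixed = repair_single(RECORD, var)
    if fixed is not None:
        print(var, fixed)  # only changing P works: P becomes 75.0
```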
The (generalised) Fellegi-Holt paradigm is a very natural principle to determine implausible fields. However, the resulting problem of finding those fields with the smallest possible sum of confidence weights that can be modified in such a way that all edit checks are satisfied is quite complicated. This so-called error localisation problem can be formulated as a mathematical programming problem. In fact, the error localisation problem is so complicated that most algorithms have
either been designed to solve this problem for categorical data only or for
numerical data only.
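As a rough sketch of such a mathematical programming formulation for numerical data with linear edit checks (one common way of writing the problem, given here as an illustration rather than the exact formulation used at Statistics Netherlands): introduce for each variable j a binary indicator d_j that equals 1 if variable j may be changed, and let x_j denote the possibly changed value, x_j^0 the original value and w_j the confidence weight. The error localisation problem then reads

minimise    w_1 d_1 + ... + w_n d_n
subject to  the modified values x_1, ..., x_n satisfying every edit check,
            |x_j − x_j^0| ≤ M d_j  and  d_j ∈ {0, 1}  for every variable j,

where M is a sufficiently large constant. The variables with d_j = 1 in an optimal solution form a set of fields with minimal total confidence weight whose modification can make the record satisfy all edit checks.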
In the current prototype version of SLICE a module for automatic editing of numerical data only is implemented. This module, called CherryPi, was developed in the spirit of GEIS, the Generalized Edit and Imputation System of Statistics Canada (see Kovar and Whitridge, 1990). The basic principle to identify the
implausible fields in an erroneous record is the generalised Fellegi-Holt paradigm
that has been described above.
In the (near) future the implemented algorithm is likely to be replaced by an
alternative algorithm that is currently being developed at Statistics Netherlands.
This new algorithm is based on searching a large binary tree for optimal solutions
to the error localisation problem. The new algorithm has several advantages in
comparison to the old one. First, the new algorithm can be extended to solve the
error localisation problem for a mix of numerical data and categorical data. At the moment
of writing this paper two similar algorithms exist: one for numerical data (cf. Quere, 2000) and one for categorical data (cf. Daalmans, 2000). We hope to
combine these two algorithms into a single algorithm that can handle both types of
data simultaneously within the next few months. Second, the new algorithm is
easier to understand and implement than the old one. Third, for numerical data the
new algorithm is clearly faster than the old algorithm (cf. Quere, 2000).
3 Imputation and SLICE
Considering the limited size of the present paper and the vast literature on
imputation, it is impossible to give even a brief overview of imputation methods
that have been developed. Instead, I only discuss the present imputation module of
SLICE and the imputation module that is planned to be implemented in SLICE in
the near future. For two overview papers on imputation techniques I refer to Kalton and Kasprzyk (1986) and Kovar and Whitridge (1995).
The present imputation module of SLICE is an improved version of the
imputation module of CherryPi. In CherryPi regression imputation is used to
impute for missing values. For each variable a subject-matter specialist has to
select a predictor variable. The subject-matter specialist also has to determine the
constant term and the regression coefficient, e.g. by means of statistical analysis. If, after regression imputation, a resultant record still fails one or more edit checks, the imputed values are modified slightly in order to satisfy all edit checks. See De Waal and Wings (1999) for more information on this regression imputation module. In SLICE a similar regression imputation module has been implemented. The main difference is that the number of predictor variables is not limited to only one, as in CherryPi. Instead, SLICE allows the use of an arbitrary number of
predictor variables.
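As a rough illustration of regression imputation with several predictor variables of the kind described here (a sketch only: the SLICE module lets the subject-matter specialist supply the constant term and coefficients, whereas this example simply estimates them by least squares on the complete records, and the subsequent adjustment of imputed values to satisfy edit checks is not shown):

```python
# Illustrative sketch of multi-predictor regression imputation.
# Not the SLICE implementation: coefficients are estimated here by ordinary
# least squares on complete records instead of being supplied by a specialist.

import numpy as np

def regression_impute(y, X):
    """Impute missing entries of y (marked np.nan) from predictor matrix X."""
    complete = ~np.isnan(y)
    # Design matrix with a constant term followed by the predictor variables.
    design = np.column_stack([np.ones(len(y)), X])
    # Estimate constant and regression coefficients on the complete records.
    beta, *_ = np.linalg.lstsq(design[complete], y[complete], rcond=None)
    y_imputed = y.copy()
    missing = ~complete
    y_imputed[missing] = design[missing] @ beta
    return y_imputed

if __name__ == "__main__":
    # Toy data: a response imputed from two predictor variables.
    X = np.array([[10.0, 200.0], [12.0, 260.0], [8.0, 150.0], [11.0, 240.0]])
    y = np.array([500.0, 610.0, 380.0, np.nan])
    print(regression_impute(y, X))
```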
At Statistics Netherlands we are currently developing a new, more advanced, imputation module together with the University of Southampton, the Office for National Statistics in the UK, Statistics Finland and the Instituto Nacional de Estatistica de Portugal. These institutes all participate in a European project called AUTIMP. This project is partly financed by the 4th Framework Programme of the European Commission. One of the aims of AUTIMP is the development of imputation software based on automatic interaction detection (AID) trees, cf. Sonquist, Baker
and Morgan (1971). The developed imputation software is planned to become a
module in SLICE. The algorithm to construct AID trees has been developed by
the University of Southampton. Because the algorithm gives lower weights to outliers while constructing the tree, the technique is referred to as weighted automatic interaction detection (WAID).
In general, a tree-based model is a set of classification rules defined in terms of the values of a set of (typically categorical) predictor variables. It is typically constructed by successively splitting a training data set into subsets that are increasingly more homogeneous with respect to a response variable of interest until a stopping criterion is met.
The WAID technique that is developed is suitable for both categorical and
numerical data. To use the WAID methodology for imputing missing values in a
data set, first the missing data patterns in this data set have to be determined. For
certain missing data patterns, or parts of missing data patterns, predictor variables are selected, and WAID trees are grown using a complete data set. The 'leaves' of these trees form homogeneous subsets of records. After generation of the trees, and hence the homogeneous clusters of records, the data set with missing values is supplied to the computer program.
To impute a record with missing values, we first determine which WAID tree corresponds to the missing data pattern of this record. Subsequently, we determine the homogeneous leaf corresponding to this record by examining the values of the predictor variables. The records from that leaf are used to impute the missing data in the record under consideration. The imputation software will support several imputation methods, e.g., a random donor record can be selected from the homogeneous leaf, or (in case of numerical data) a mean of the records in the homogeneous leaf may be imputed. For the case where only numerical predictor variables are used we will also support nearest neighbour hot deck imputation.
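The following sketch illustrates the general idea of such leaf-based imputation. It uses an ordinary regression tree from scikit-learn as a stand-in for a WAID tree (the WAID algorithm additionally down-weights outliers while constructing the tree, which is not reproduced here), assigns records with a missing response to a leaf, and imputes either the leaf mean or a random donor from the same leaf. It is an illustration under those assumptions, not the AUTIMP software.

```python
# Illustrative sketch of tree-based imputation: grow a tree on complete records,
# send records with a missing response to a leaf, and impute from that leaf.
# An ordinary regression tree stands in for a WAID tree here.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def tree_impute(y, X, method="leaf_mean", min_leaf=5):
    """Impute missing entries of y (marked np.nan) using leaves of a tree grown on X."""
    complete = ~np.isnan(y)
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf, random_state=0)
    tree.fit(X[complete], y[complete])

    leaf_of = tree.apply(X)  # leaf index for every record
    y_imputed = y.copy()
    for i in np.flatnonzero(~complete):
        # Complete records in the same homogeneous leaf act as donors.
        donors = y[complete & (leaf_of == leaf_of[i])]
        if method == "leaf_mean":
            y_imputed[i] = donors.mean()
        else:  # random donor record from the homogeneous leaf
            y_imputed[i] = rng.choice(donors)
    return y_imputed

if __name__ == "__main__":
    X = rng.normal(size=(200, 3))
    y = X[:, 0] * 10 + rng.normal(size=200)
    y[rng.choice(200, size=20, replace=False)] = np.nan
    print(tree_impute(y, X)[:10])
```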
4 Discussion
In this paper I have described the present and planned automatic editing and imputation modules of SLICE. The fact that I have only described automatic editing and imputation may give the false impression that Statistics Netherlands plans to rely solely on these techniques for checking and correcting its statistical data. I want to emphasise here that this definitely is not the case. In particular, at Statistics Netherlands we consider automatic editing to be just one step in the editing process, not the entire editing process. In our view the ideal edit strategy consists of a combination of selective editing (where only the most suspicious and influential records are selected for computer-assisted editing), automatic editing and macro-editing (where records are only selected for computer-assisted editing when aggregated totals based on these records are suspicious, or when they are outliers in the distribution of all records). In this setup automatic editing is only applied to the records in the so-called non-critical stream, i.e. the records with only non-influential errors. Influential errors always have to be corrected by subject-matter specialists.
In comparison with the traditional computer-assisted approach, the new combined approach requires substantially fewer resources for cleaning the data, while the quality of the data can be maintained. Moreover, the timeliness of publication of
the statistical data is clearly improved. For more information on the view of Statistics Netherlands on the ideal editing strategy I refer to De Waal, Renssen and Van de Pol (2000).
References

Blaise Reference Manual (1998). Department of Statistical Informatics, Statistics Netherlands, Heerlen.
Blaise Developer's Guide (1998). Department of Statistical Informatics, Statistics Netherlands, Heerlen.
Daalmans, J. (2000). Automatic editing of categorical data (working title). Statistics Netherlands, Voorburg.
De Waal, T. and H. Wings (1999). From CherryPi to SLICE. Report, Statistics Netherlands, Voorburg.
De Waal, T., R. Renssen and F. Van de Pol (2000). Graphical macro-editing: possibilities and pitfalls. Report, Statistics Netherlands, Voorburg.
Fellegi, I.P. and D. Holt (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71, pp. 17-35.
Granquist, L. (1995). Improving the traditional editing process. In: Business Survey Methods (ed. Cox, Binder, Chinnappa, Christianson and Kott), John Wiley & Sons, Inc., pp. 385-401.
Granquist, L. and J. Kovar (1997). Editing of survey data: how much is enough? In: Survey Measurement and Process Quality (ed. Lyberg, Biemer, Collins, De Leeuw, Dippo, Schwartz and Trewin), John Wiley & Sons, Inc., pp. 415-435.
Hox, J.J. (1999). A review of current software for handling missing data. Kwantitatieve Methoden, 62, pp. 123-138.
Kalton, G. and D. Kasprzyk (1986). The treatment of missing survey data. Survey Methodology, 12, pp. 1-16.
Kovar, J. and P. Whitridge (1990). Generalized edit and imputation system; overview and applications. Revista Brasiliera de Estadistica, 51, pp. 85-100.
Kovar, J. and P.J. Whitridge (1995). Imputation of business survey data. In: Business Survey Methods (ed. Cox, Binder, Chinnappa, Christianson and Kott), John Wiley & Sons, Inc., pp. 403-423.
Pierzchala, M. (1995). Editing systems and software. In: Business Survey Methods (ed. Cox, Binder, Chinnappa, Christianson and Kott), John Wiley & Sons, Inc., pp. 425-441.
Quere, R. (2000). Automatic editing of numerical data. Report, Statistics Netherlands, Voorburg.
Sonquist, J.N., E.L. Baker and J.A. Morgan (1971). Searching for structure. Institute for Social Research, University of Michigan.