SLICE: generalised software for statistical data editing and imputation

Ton de Waal

Department of Statistical Methods, Statistics Netherlands, PO Box 4000, 2270 JM Voorburg, Netherlands
Abstract. Statistical offices have to face the problem that data collected by surveys or obtained from administrative registers generally contain errors. Another problem they have to face is that values in data sets obtained from these sources may be missing. To handle such errors and missing data efficiently, Statistics Netherlands is currently developing a software package, called SLICE (Statistical Localisation, Imputation and Correction of Errors). SLICE will contain several edit and imputation modules. Examples are a module for automatic editing and a module for imputation based on tree-models. In this paper I describe SLICE, focussing in particular on the above-mentioned modules.

Keywords. automatic data editing, computer-assisted editing, Fellegi-Holt paradigm, imputation, statistical data editing
1 Introduction
Statistical offices have to face the problem that data collected by surveys or
obtained from administrative registers generally contain errors. Errors in statistical
data sets may arise for a variety of reasons. For instance, the respondent may have
misunderstood a question, the respondent may have made a mistake when giving
the answer, or the respondent's database may contain errors. Alternatively, errors
may have been introduced at the statistical office, e.g., while transferring data from
a paper questionnaire to the computer system. Another problem that has to be faced
by statistical offices is that values in data sets obtained from surveys or
administrative registers may be missing. A value may be missing in a data set, e.g.,
because the respondent did not understand the corresponding question, the
respondent forgot to answer the question, or the respondent did not know the
answer. A value may also be missing because at the statistical office this value was
not entered when transferring the data from a paper questionnaire to the computer
system.
Incorrect data and missing data in a data set have to be treated in such a way that
statistical analyses based on this data set give reliable results. Incorrect data can be
treated by first identifying the errors and subsequently correcting these errors. This process is referred to as data editing. Missing values can be treated by estimating
these values and filling in these estimates in the data set. This is referred to as
imputation.
To handle errors and missing values in data sets efficiently, Statistics
Netherlands is currently developing a software package called SLICE (Statistical
Localisation, Imputation and Correction of Errors). SLICE will contain several edit and imputation modules. Examples of such modules are a module for automatic data editing and a module for imputation based on tree-models. SLICE itself is planned to become a module of Blaise, the integrated survey processing system developed by Statistics Netherlands (see Blaise Reference Manual, 1998, and the Blaise Developer's Guide, 1998).
The aim of the present paper is to give a description of the automatic editing module and the imputation module of SLICE. For an overview of other editing software I refer to Pierzchala (1995), and for an overview of other imputation software to Hox (1999).
The remainder of the paper is organised as follows. Section 2 discusses automatic data editing and SLICE. Section 3 describes the present imputation module in SLICE, and the future imputation module based on so-called WAID tree-models. Section 4 concludes the paper with a short discussion.
2 Automatic data editing and SLICE
Traditionally, incorrect data in a data set are edited 'manually'. That is, each faulty record is modified manually, although the record may have been checked either manually or by means of a computer program. In the latter case, where records are checked by means of a computer program, one also says that the data are edited in a computer-assisted manner. Data are modified, e.g., on the basis of subject-matter knowledge, by contacting the respondent again, by comparing the respondent's data to his data from previous years, or by comparing the respondent's data to data of similar respondents. Correcting data 'manually' is a time-consuming and costly process.
It has long been recognised, however, that it is not necessary to correct all data in every detail (see, e.g., Granquist, 1995, and Granquist and Kovar, 1997). A reason why it is not necessary that all errors are removed from the data set is that published results are usually aggregated data, such as totals or means. Small errors will often cancel out when they are aggregated. Moreover, when the data have been obtained by means of a sample of the population there will always be a small error in the results, even when all collected data are completely correct. This implies that for data obtained by means of a sample an error in the results due to 'noise' in the data is acceptable as long as this error is negligible in comparison to the sampling error. These observations allow one to automate the editing process to some extent.
Three steps can be distinguished when editing data automatically. The first step is to identify the erroneous records. Once a record has been identified as incorrect, its erroneous, or more precisely implausible, values have to be identified. These erroneous values are set to missing. The final step is to impute plausible values for the missing values, both for the values that were actually missing in the original data set as well as for the fields that were set to missing because their original values were considered incorrect.
To establish whether a record is erroneous, edit checks have to be specified. A record is considered erroneous if one or more edit checks specified by subject-matter specialists are violated. If all specified edit checks are satisfied for a particular record, this record is considered correct. If records are edited
automatically, a record that is considered correct is not modified, whereas a record
that is considered erroneous is always modified in such a way that all edit checks
are satisfied.
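As a minimal sketch of how edit checks and records might be represented in software (the field names and checks below are hypothetical illustrations, not part of SLICE), each check can be stored as a predicate over a record, and a record is flagged as erroneous as soon as any predicate fails:

```python
# Minimal sketch: edit checks as predicates over a record.
# The field names and checks are hypothetical illustrations, not SLICE itself.

def check_balance(record):
    # Components must add up to the reported total.
    return record["employees_male"] + record["employees_female"] == record["employees_total"]

def check_nonnegative(record):
    # Turnover may not be negative.
    return record["turnover"] >= 0

EDIT_CHECKS = [check_balance, check_nonnegative]

def is_erroneous(record):
    # A record is considered erroneous if one or more edit checks are violated.
    return any(not check(record) for check in EDIT_CHECKS)

if __name__ == "__main__":
    record = {"employees_male": 12, "employees_female": 9,
              "employees_total": 20, "turnover": 150.0}
    print(is_erroneous(record))  # True: 12 + 9 != 20, so the balance check fails
```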
A well-known paradigm for identifying erroneous fields in an erroneous record has been proposed in an influential paper by Fellegi and Holt (1976). According to this Fellegi-Holt paradigm the data of a record should be made to satisfy all edit checks by changing the values of the fewest possible number of variables. Over the years this paradigm has been generalised slightly, because it was realised that the values of some variables are more trustworthy than the values of other variables. Therefore, to each variable a weight, the so-called confidence weight, can be assigned. Such a confidence weight is a positive number that expresses the confidence one has in the correctness of the values of this variable. A high confidence weight corresponds to a variable of which the values are considered trustworthy, a low confidence weight to a variable of which the values are considered not so trustworthy. According to the generalised Fellegi-Holt paradigm the data of a record should be made to satisfy all edit checks by changing the values of the variables with the smallest possible sum of confidence weights.
To illustrate the Fellegi-Holt paradigm we assume that the following two edit checks have been specified by subject-matter specialists:

P + C = T    (1)

and

C / T ≥ 0.6    (2)

where P is the profit of an enterprise, C its costs and T its turnover. Suppose furthermore that a record contains the following data:

P = 755, C = 125 and T = 200.
This record violates edit check (1). Hence, the record is considered erroneous. Now, suppose that the confidence weights of P, C and T are all equal to 1. According to the Fellegi-Holt paradigm the data of the record should be made to satisfy all edit checks by changing the values of the fewest possible number of variables. Preferably, only one value should be changed. By trial and error we can find in our example that there is only one way to modify a single value such that all edit checks are satisfied. Namely, the value of P should be modified to 75.
In this example the Fellegi-Holt paradigm yields exactly one optimal solution, i.e. a set of variables with a minimum sum of confidence weights, to modify the data. In other cases, however, application of the (generalised) Fellegi-Holt paradigm may yield several optimal solutions to modify the data. In such cases additional criteria may be used to select the best option.
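To make the search concrete, the following sketch mimics the 'trial and error' step from the example above: for each candidate variable it solves edit check (1) for that variable, keeps the other values fixed, and then tests edit check (2). It is an illustration under the stated edit checks and equal confidence weights, not the algorithm implemented in SLICE.

```python
# Illustrative sketch of the (generalised) Fellegi-Holt paradigm applied to the
# worked example above. With equal confidence weights, the cheapest repair is a
# single-variable change; edit (1) is solved for the candidate variable and
# edit (2) is then checked. A toy search, not the SLICE algorithm.

RECORD = {"P": 755.0, "C": 125.0, "T": 200.0}
WEIGHTS = {"P": 1.0, "C": 1.0, "T": 1.0}  # confidence weights (all equal here)

def satisfies_edits(r):
    # Edit (1): P + C = T; edit (2): C / T >= 0.6.
    return abs(r["P"] + r["C"] - r["T"]) < 1e-9 and r["T"] > 0 and r["C"] / r["T"] >= 0.6

def repair_single(record, var):
    # Change only `var`, solving edit (1) for it; return the repaired record or None.
    r = dict(record)
    if var == "P":
        r["P"] = r["T"] - r["C"]
    elif var == "C":
        r["C"] = r["T"] - r["P"]
    else:  # var == "T"
        r["T"] = r["P"] + r["C"]
    return r if satisfies_edits(r) else None

# Try candidate variables in order of increasing confidence weight.
for var in sorted(RECORD, key=WEIGHTS.get):
    fixed = repair_single(RECORD, var)
    if fixed is not None:
        print(var, fixed)  # only changing P works: P becomes 75.0
```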
The (generalised) Fellegi-Holt paradigm is a very natural principle to determine implausible fields. However, the resulting problem of finding those fields with the smallest possible sum of confidence weights that can be modified in such a way that all edit checks are satisfied is quite complicated. This so-called error localisation problem can be formulated as a mathematical programming problem. In fact, the error localisation problem is so complicated that most algorithms have
either been designed to solve this problem for categorical data only or for
numerical data only.
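As a rough sketch of such a mathematical programming formulation for numerical data with linear edit checks (one common way of writing the problem, given here as an illustration rather than the exact formulation used at Statistics Netherlands): introduce for each variable j a binary indicator d_j that equals 1 if variable j may be changed, and let x_j denote the possibly changed value, x_j^0 the original value and w_j the confidence weight. The error localisation problem then reads

minimise    w_1 d_1 + ... + w_n d_n
subject to  the modified values x_1, ..., x_n satisfying every edit check,
            |x_j − x_j^0| ≤ M d_j  and  d_j ∈ {0, 1}  for every variable j,

where M is a sufficiently large constant. The variables with d_j = 1 in an optimal solution form a set of fields with minimal total confidence weight whose modification can make the record satisfy all edit checks.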
In the current prototype version of SLICE a module for automatic editing of numerical data only is implemented. This module, called CherryPi, was developed in the spirit of GEIS, the Generalized Edit and Imputation System of Statistics Canada (see Kovar and Whitridge, 1990). The basic principle to identify the
implausible fields in an erroneous record is the generalised Fellegi-Holt paradigm
that has been described above.
In the (near) future the implemented algorithm is likely to be replaced by an
alternative algorithm that is currently being developed at Statistics Netherlands.
This new algorithm is based on searching a large binary tree for optimal solutions
to the error localisation problem. The new algorithm has several advantages in
comparison to the old one. First, the new algorithm can be extended to solve the
error localisation problem for a mix of numerical data and categorical data. At the moment
of writing this paper two similar algorithms exist: one for numerical data (cf. Quere, 2000) and one for categorical data (cf. Daalmans, 2000). We hope to
combine these two algorithms into a single algorithm that can handle both types of
data simultaneously within the next few months. Second, the new algorithm is
easier to understand and implement than the old one. Third, for numerical data the
new algorithm is clearly faster than the old algorithm (cf. Quere, 2000).
3 Imputation and SLICE
Considering the limited size of the present paper and the vast literature on
imputation, it is impossible to give even a brief overview of imputation methods
that have been developed. Instead, I only discuss the present imputation module of
SLICE and the imputation module that is planned to be implemented in SLICE in
the near future. For two overview papers on imputation techniques I refer to Kalton and Kasprzyk (1986) and Kovar and Whitridge (1995).
The present imputation module of SLICE is an improved version of the
imputation module of CherryPi. In CherryPi regression imputation is used to
impute for missing values. For each variable a subject-matter specialist has to
select a predictor variable. The subject-matter specialist also has to determine the
constant term and the regression coefficient, e.g. by means of statistical analysis. If, after regression imputation, a resultant record still fails one or more edit checks, the imputed values are modified slightly in order to satisfy all edit checks. See De Waal and Wings (1999) for more information on this regression imputation module. In SLICE a similar regression imputation module has been implemented. The main difference is that the number of predictor variables is not limited to only one, as in CherryPi. Instead, SLICE allows the use of an arbitrary number of
predictor variables.
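As a rough illustration of regression imputation with several predictor variables of the kind described here (a sketch only: the SLICE module lets the subject-matter specialist supply the constant term and coefficients, whereas this example simply estimates them by least squares on the complete records, and the subsequent adjustment of imputed values to satisfy edit checks is not shown):

```python
# Illustrative sketch of multi-predictor regression imputation.
# Not the SLICE implementation: coefficients are estimated here by ordinary
# least squares on complete records instead of being supplied by a specialist.

import numpy as np

def regression_impute(y, X):
    """Impute missing entries of y (marked np.nan) from predictor matrix X."""
    complete = ~np.isnan(y)
    # Design matrix with a constant term followed by the predictor variables.
    design = np.column_stack([np.ones(len(y)), X])
    # Estimate constant and regression coefficients on the complete records.
    beta, *_ = np.linalg.lstsq(design[complete], y[complete], rcond=None)
    y_imputed = y.copy()
    missing = ~complete
    y_imputed[missing] = design[missing] @ beta
    return y_imputed

if __name__ == "__main__":
    # Toy data: a response imputed from two predictor variables.
    X = np.array([[10.0, 200.0], [12.0, 260.0], [8.0, 150.0], [11.0, 240.0]])
    y = np.array([500.0, 610.0, 380.0, np.nan])
    print(regression_impute(y, X))
```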
At Statistics Netherlands we are currently developing a new, more advanced, imputation module together with the University of Southampton, the Office for National Statistics in the UK, Statistics Finland and the Instituto Nacional de Estatistica de Portugal. These institutes all participate in a European project called AUTIMP. This project is partly financed by the 4th Framework Programme of the European Commission. One of the aims of AUTIMP is the development of imputation software based on automatic interaction detection (AID) trees, cf. Sonquist, Baker
and Morgan (1971). The developed imputation software is planned to become a
module in SLICE. The algorithm to construct AID trees has been developed by
the University of Southampton. Because the algorithm gives lower weights to outliers while constructing the tree, the technique is referred to as weighted automatic interaction detection (WAID).
In general, a tree-based model is a set of classification rules defined in terms of the values of a set of (typically categorical) predictor variables. It is typically constructed by successively splitting a training data set into subsets that are increasingly more homogeneous with respect to a response variable of interest until a stopping criterion is met.
The WAID technique that is developed is suitable for both categorical and
numerical data. To use the WAID methodology for imputing missing values in a
data set, first the missing data patterns in this data set have to be determined. For
certain missing data patterns, or parts of missing data patterns, predictor variables are selected, and WAID trees are grown using a complete data set. The 'leaves' of these trees form homogeneous subsets of records. After generation of the trees, and hence the homogeneous clusters of records, the data set with missing values is supplied to the computer program.
To impute a record with missing values, we first determine which WAID tree corresponds to the missing data pattern of this record. Subsequently, we determine the homogeneous leaf corresponding to this record by examining the values of the predictor variables. The records from that leaf are used to impute the missing data in the record under consideration. The imputation software will support several imputation methods, e.g., a random donor record can be selected from the homogeneous leaf, or (in case of numerical data) a mean of the records in the homogeneous leaf may be imputed. For the case where only numerical predictor variables are used we will also support nearest neighbour hot deck imputation.
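The following sketch illustrates the general idea of such leaf-based imputation. It uses an ordinary regression tree from scikit-learn as a stand-in for a WAID tree (the WAID algorithm additionally down-weights outliers while constructing the tree, which is not reproduced here), assigns records with a missing response to a leaf, and imputes either the leaf mean or a random donor from the same leaf. It is an illustration under those assumptions, not the AUTIMP software.

```python
# Illustrative sketch of tree-based imputation: grow a tree on complete records,
# send records with a missing response to a leaf, and impute from that leaf.
# An ordinary regression tree stands in for a WAID tree here.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def tree_impute(y, X, method="leaf_mean", min_leaf=5):
    """Impute missing entries of y (marked np.nan) using leaves of a tree grown on X."""
    complete = ~np.isnan(y)
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf, random_state=0)
    tree.fit(X[complete], y[complete])

    leaf_of = tree.apply(X)  # leaf index for every record
    y_imputed = y.copy()
    for i in np.flatnonzero(~complete):
        # Complete records in the same homogeneous leaf act as donors.
        donors = y[complete & (leaf_of == leaf_of[i])]
        if method == "leaf_mean":
            y_imputed[i] = donors.mean()
        else:  # random donor record from the homogeneous leaf
            y_imputed[i] = rng.choice(donors)
    return y_imputed

if __name__ == "__main__":
    X = rng.normal(size=(200, 3))
    y = X[:, 0] * 10 + rng.normal(size=200)
    y[rng.choice(200, size=20, replace=False)] = np.nan
    print(tree_impute(y, X)[:10])
```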
4 Discussion
In this paper I have described the present and planned automatic editing and imputation modules of SLICE. The fact that I have only described automatic editing and imputation may give the false impression that Statistics Netherlands plans to rely solely on these techniques for checking and correcting its statistical data. I want to emphasise here that this definitely is not the case. In particular, at Statistics Netherlands we consider automatic editing to be just one step in the editing process, not the entire editing process. In our view the ideal edit strategy consists of a combination of selective editing (where only the most suspicious and influential records are selected for computer-assisted editing), automatic editing and macro-editing (where records are only selected for computer-assisted editing when aggregated totals based on these records are suspicious, or when they are outliers in the distribution of all records). In this setup automatic editing is only applied to the records in the so-called non-critical stream, i.e. the records with only non-influential errors. Influential errors always have to be corrected by subject-matter specialists.
In comparison with the traditional computer-assisted approach, the new combined approach requires substantially fewer resources for cleaning the data, while the quality of the data can be maintained. Moreover, the timeliness of publication of
the statistical data is clearly improved. For more information on the view of Statistics Netherlands on the ideal editing strategy I refer to De Waal, Renssen and Van de Pol (2000).
References

Blaise Reference Manual (1998). Department of Statistical Informatics, Statistics Netherlands, Heerlen.
Blaise Developer's Guide (1998). Department of Statistical Informatics, Statistics Netherlands, Heerlen.
Daalmans, J. (2000). Automatic editing of categorical data (working title). Statistics Netherlands, Voorburg.
De Waal, T. and H. Wings (1999). From CherryPi to SLICE. Report, Statistics Netherlands, Voorburg.
De Waal, T., R. Renssen and F. Van de Pol (2000). Graphical macro-editing: possibilities and pitfalls. Report, Statistics Netherlands, Voorburg.
Fellegi, I.P. and D. Holt (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71, pp. 17-35.
Granquist, L. (1995). Improving the traditional editing process. In: Business Survey Methods (ed. Cox, Binder, Chinnappa, Christianson and Kott), John Wiley & Sons, Inc., pp. 385-401.
Granquist, L. and J. Kovar (1997). Editing of survey data: how much is enough? In: Survey Measurement and Process Quality (ed. Lyberg, Biemer, Collins, De Leeuw, Dippo, Schwartz and Trewin), John Wiley & Sons, Inc., pp. 415-435.
Hox, J.J. (1999). A review of current software for handling missing data. Kwantitatieve Methoden, 62, pp. 123-138.
Kalton, G. and D. Kasprzyk (1986). The treatment of missing survey data. Survey Methodology, 12, pp. 1-16.
Kovar, J. and P. Whitridge (1990). Generalized edit and imputation system; overview and applications. Revista Brasiliera de Estadistica, 51, pp. 85-100.
Kovar, J. and P.J. Whitridge (1995). Imputation of business survey data. In: Business Survey Methods (ed. Cox, Binder, Chinnappa, Christianson and Kott), John Wiley & Sons, Inc., pp. 403-423.
Pierzchala, M. (1995). Editing systems and software. In: Business Survey Methods (ed. Cox, Binder, Chinnappa, Christianson and Kott), John Wiley & Sons, Inc., pp. 425-441.
Quere, R. (2000). Automatic editing of numerical data. Report, Statistics Netherlands, Voorburg.
Sonquist, J.N., E.L. Baker and J.A. Morgan (1971). Searching for structure. Institute for Social Research, University of Michigan.