
Automatic Editing with Hard and Soft Edits – Some First Experiences

Sander Scholtus

Sevinç Göksen

(Statistics Netherlands)


Introduction

• Error localisation problem:
  • Try to identify variables with erroneous/missing values

• Edits:
  • Constraints that should be satisfied by the data
  • Hard (fatal) – e.g. Turnover – Costs = Profit
  • Soft (query) – e.g. Profit / Turnover ≤ 0.6

• Manual editing: hard and soft edits
• Automatic editing: only hard edits


Error localisation (1)

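• Fellegi and Holt (1976):
  • Find the smallest (weighted) number of variables that can be imputed so that all edits are satisfied
  • Minimise

    $$D_{FH} = \sum_j w_j y_j$$

    so that all edits are satisfied, where w_j is the weight of variable j and y_j = 1 if variable j is imputed (0 otherwise)
  • No room for soft edits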
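The Fellegi-Holt criterion above can be illustrated with a brute-force search. This is a minimal sketch, not the prototype from the talk: the record, the weights and the small integer domain are hypothetical, and the finite domain is assumed only so that feasibility can be checked by enumeration rather than by linear programming.

```python
from itertools import combinations, product

# Hypothetical toy record; it violates the hard edit Turnover - Costs = Profit.
record = {"turnover": 10, "costs": 4, "profit": 2}
domain = range(0, 11)  # assumed finite domain for every variable
weights = {"turnover": 2.0, "costs": 1.0, "profit": 1.0}  # weights w_j (made up)

# Hard edits as predicates on a complete record.
hard_edits = [lambda r: r["turnover"] - r["costs"] == r["profit"]]

def satisfiable(rec, free_vars):
    """True if some choice of domain values for free_vars satisfies all hard edits."""
    free_vars = list(free_vars)
    for values in product(domain, repeat=len(free_vars)):
        candidate = dict(rec, **dict(zip(free_vars, values)))
        if all(edit(candidate) for edit in hard_edits):
            return True
    return False

def fellegi_holt(rec):
    """Return the minimal D_FH and every set of variables that attains it."""
    names = list(rec)
    feasible = []
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            if satisfiable(rec, subset):
                feasible.append((sum(weights[v] for v in subset), subset))
    best = min(cost for cost, _ in feasible)
    return best, [subset for cost, subset in feasible if cost == best]

best_cost, solutions = fellegi_holt(record)
print(best_cost, solutions)
```

In this toy case imputing either costs or profit repairs the record at the same cost, so the hard edits alone cannot choose between the two solutions.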


Error localisation (2)

• Alternative approach:
  • Choose a function Dsoft that measures the degree of suspicion associated with particular soft edit failures
  • Minimise

    $$D = \lambda D_{FH} + (1 - \lambda) D_{soft}$$

    so that all hard edits are satisfied

• Prototype algorithm in R (based on editrules)


Simulation study (1)

• Two data sets:
  • Dutch SBS 2007, medium-sized wholesale businesses
  • Raw and manually edited data available
  • One half used as test data, one half as reference data

• Test data set 1:
  • 728 records, 12 variables, 16 hard edits, 10 soft edits
  • Synthetic errors

• Test data set 2:
  • 580 records, 10 variables, 17 hard edits, 24 soft edits
  • Real errors



% records with perfect solution:

editing approach (choice of Dsoft)            data set 1   data set 2
no soft edits, only hard edits                40.2%        58.4%
all edits as hard edits                       36.8%        n/a


Choices for Dsoft – fixed weights (1)

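• Fixed failure weights:

    $$D_{soft} = \sum_k s_k z_k$$

  where z_k = 1 if soft edit k is failed (0 otherwise)

• Resulting target function to be minimised:

    $$D = \lambda \sum_j w_j y_j + (1 - \lambda) \sum_k s_k z_k$$

• Higher failure weight ⇒ ‘harder’ soft edit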
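A minimal sketch of how the combined target function might be minimised by enumeration. Everything here is illustrative: the record, the weights w_j, the failure weight s_k, the value of λ (`lam`) and the small integer domain are assumptions, and this is not the R prototype. The soft edit is the example Profit / Turnover ≤ 0.6, written without division.

```python
from itertools import combinations, product

record = {"turnover": 10, "costs": 1, "profit": 2}   # hypothetical toy record
domain = range(0, 11)                                # assumed finite domain
w = {"turnover": 2.0, "costs": 1.5, "profit": 1.0}   # weights w_j (made up)

hard_edits = [lambda r: r["turnover"] - r["costs"] == r["profit"]]
# Each soft edit is paired with its fixed failure weight s_k.
soft_edits = [(lambda r: 10 * r["profit"] <= 6 * r["turnover"], 1.0)]

def d_soft(r):
    """Sum of failure weights s_k over the soft edits that r fails (z_k = 1)."""
    return sum(s for edit, s in soft_edits if not edit(r))

def localise(rec, lam):
    """Minimise lam * D_FH + (1 - lam) * D_soft subject to all hard edits."""
    names = list(rec)
    best = None
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            d_fh = sum(w[v] for v in subset)
            for values in product(domain, repeat=size):
                cand = dict(rec, **dict(zip(subset, values)))
                if not all(edit(cand) for edit in hard_edits):
                    continue
                d = lam * d_fh + (1 - lam) * d_soft(cand)
                if best is None or d < best[0]:
                    best = (d, subset, cand)
    return best

print(localise(record, 1.0))   # hard edits only (Fellegi-Holt)
print(localise(record, 0.5))   # hard and soft edits combined
```

With `lam = 1.0` the cheapest repair changes profit, which leaves a record that fails the soft edit; with `lam = 0.5` the soft edit failure is penalised and the search prefers to change costs instead.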


Choices for Dsoft – fixed weights (2)

• Possible choices for s_k:

A. All failure weights equal to 1

B. Proportion of records that satisfy edit k in manually edited reference data

Interpretation: P(edited record satisfies edit k)

C. P(edited record satisfies edit k | raw record fails edit k)

• Alternative: categorised versions of B and C
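Choices B and C can be estimated from paired raw and manually edited reference records. The sketch below assumes records stored as dictionaries and uses four made-up record pairs; the function names `weight_b` and `weight_c` are hypothetical.

```python
def weight_b(edited, edit):
    """Choice B: share of edited reference records that satisfy the edit."""
    return sum(edit(r) for r in edited) / len(edited)

def weight_c(raw, edited, edit):
    """Choice C: share of edited records satisfying the edit, among the
    pairs whose raw record fails it. Returns None if no raw record fails."""
    pairs = [e for r, e in zip(raw, edited) if not edit(r)]
    if not pairs:
        return None
    return sum(edit(e) for e in pairs) / len(pairs)

# Soft edit Profit / Turnover <= 0.6, written without division.
def soft_edit(r):
    return 10 * r["profit"] <= 6 * r["turnover"]

# Hypothetical reference data: raw records and their manually edited versions.
raw = [
    {"turnover": 10, "profit": 9},   # fails, corrected by the editor
    {"turnover": 10, "profit": 8},   # fails, left unchanged by the editor
    {"turnover": 10, "profit": 3},
    {"turnover": 20, "profit": 4},
]
edited = [
    {"turnover": 10, "profit": 5},
    {"turnover": 10, "profit": 8},
    {"turnover": 10, "profit": 3},
    {"turnover": 20, "profit": 4},
]

s_b = weight_b(edited, soft_edit)
s_c = weight_c(raw, edited, soft_edit)
print(s_b, s_c)
```

Choice C conditions on the raw record failing the edit, so it is typically lower than B when editors often leave a failed soft edit as it is.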






editing approach (choice of Dsoft)            data set 1   data set 2
all edits, using soft edits as hard edits     36.8%        n/a
sum of fixed failure weights A                47.3%        63.4%
sum of fixed failure weights B                52.1%        60.9%
sum of fixed failure weights C                43.3%        60.7%
sum of fixed failure weights B(cat)           50.0%        64.5%
sum of fixed failure weights C(cat)           43.1%        64.5%


Choices for Dsoft – quantile edits (1)

• Drawback of fixed failure weights: no difference between large and small edit failures

• Trick: quantile edits



• Idea: use different versions of the same edit by varying one of the constants

• Choose values for this constant based on the fraction of reference data records that fail the resulting edit (e.g. 1%, 5%, 10%)



• Example: ratio edit x1 / x3 ≥ c

% records failed in ref. data   c      quantile edit      s_k   cumul. s_k
10%                             0.75   x1 / x3 ≥ 0.75     1     1
5%                              0.60   x1 / x3 ≥ 0.60     1     2
1%                              0.10   x1 / x3 ≥ 0.10     1     3
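The quantile-edit construction can be sketched as follows. The reference ratios are synthetic and the simple order-statistic rule for picking c is an assumption; the thresholds and weights in the second part are the ones from the example table above.

```python
def quantile_thresholds(ref_ratios, fail_pcts):
    """For each percentage, pick c so that about that share of the
    reference ratios lies strictly below c (i.e. fails 'ratio >= c')."""
    ordered = sorted(ref_ratios)
    n = len(ordered)
    return [ordered[pct * n // 100] for pct in fail_pcts]

def quantile_d_soft(ratio, thresholds, weights):
    """Sum the weights s_k of all quantile edits 'ratio >= c' that fail."""
    return sum(s for c, s in zip(thresholds, weights) if ratio < c)

# Synthetic reference ratios 0.01, 0.02, ..., 1.00 (not from the study).
ref_ratios = [i / 100 for i in range(1, 101)]
thresholds = quantile_thresholds(ref_ratios, [10, 5, 1])
fail_shares = [sum(r < c for r in ref_ratios) / len(ref_ratios)
               for c in thresholds]
print(thresholds, fail_shares)

# Thresholds c and weights s_k from the example table.
table_c = [0.75, 0.60, 0.10]
table_s = [1, 1, 1]
for ratio in (0.80, 0.50, 0.05):
    print(ratio, quantile_d_soft(ratio, table_c, table_s))
```

A record that fails a stricter quantile edit also fails all looser ones, so its contribution to Dsoft grows stepwise with the size of the failure, which is the cumulative s_k column in the table.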












editing approach (choice of Dsoft)                     data set 1   data set 2
10-5-1%-quantile edits, weights 0.33-0.33-0.33         54.4%        63.4%



Choices for Dsoft – dynamic expressions

• Size of edit failure: e_k
  • Linear equality edit: $a_{k1}x_1 + \dots + a_{kp}x_p + b_k = 0$

    Take: $e_k = |a_{k1}x_1 + \dots + a_{kp}x_p + b_k|$

  • Linear inequality edit: $a_{k1}x_1 + \dots + a_{kp}x_p + b_k \geq 0$

    Take: $e_k = \max\{0, -(a_{k1}x_1 + \dots + a_{kp}x_p + b_k)\}$

• Use reference data to standardise:
  • Linear sum:

    $$D_{soft} = \sum_k \frac{e_k}{\hat{\sigma}_k}$$

  • Mahalanobis distance:

    $$D_{soft} = D_M(e, 0) = \sqrt{e^{\top} \hat{\Sigma}^{-1} e}$$
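A small numerical sketch of the dynamic expressions. It assumes that the standard deviations and the covariance matrix are estimated from the failure sizes e_k observed in the reference data (population formulas, two soft edits only); the reference failure vectors and the helper names are made up for illustration.

```python
import math

def failure_size(coeffs, const, x, kind):
    """e_k for a linear edit a.x + b = 0 (kind='eq') or a.x + b >= 0 (kind='ineq')."""
    value = sum(a * xi for a, xi in zip(coeffs, x)) + const
    return abs(value) if kind == "eq" else max(0.0, -value)

def covariance(ref):
    """Population covariance matrix of two-dimensional failure vectors."""
    n = len(ref)
    m1 = sum(e[0] for e in ref) / n
    m2 = sum(e[1] for e in ref) / n
    s11 = sum((e[0] - m1) ** 2 for e in ref) / n
    s22 = sum((e[1] - m2) ** 2 for e in ref) / n
    s12 = sum((e[0] - m1) * (e[1] - m2) for e in ref) / n
    return [[s11, s12], [s12, s22]]

def standardised_sum(e, cov):
    """Linear sum: each e_k divided by its estimated standard deviation."""
    return sum(e[k] / math.sqrt(cov[k][k]) for k in range(2))

def mahalanobis(e, cov):
    """Distance of e from 0, using the closed-form inverse of a 2x2 matrix."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    return math.sqrt((d * e[0] ** 2 - 2 * b * e[0] * e[1] + a * e[1] ** 2) / det)

# x = (turnover, costs, profit); edits written as a.x + b = 0 or a.x + b >= 0.
e_eq = failure_size((1, -1, -1), 0, (10, 4, 2), "eq")       # turnover - costs - profit = 0
e_ineq = failure_size((0.6, 0, -1), 0, (10, 4, 8), "ineq")  # 0.6 * turnover - profit >= 0

# Hypothetical failure sizes of two soft edits in the reference data.
ref_e = [(0.0, 0.0), (2.0, 2.0), (0.0, 2.0), (2.0, 4.0)]
cov = covariance(ref_e)
e = (1.0, 2.0)
print(e_eq, e_ineq, standardised_sum(e, cov), mahalanobis(e, cov))
```

Unlike the linear sum, the Mahalanobis distance also accounts for correlation between soft edit failures, so two edits that tend to fail together are not counted twice in full.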














editing approach (choice of Dsoft)                     data set 1   data set 2
sum of standardised soft edit failures                 49.2%        ?
Mahalanobis distance of soft edit failures             46.8%        ?


Conclusion

• Using soft edits improved error localisation

• Choice of Dsoft:
  • Results not unequivocal
  • Quantile edits seem to work well
  • Room for improvement

• Future work:
  • Extended simulation study with mixed data/edits
