Automatic Editing with Hard and Soft Edits – Some First Experiences
Sander Scholtus
Sevinç Göksen
(Statistics Netherlands)
Automatic Editing with Hard and Soft Edits - Some First Experiences 2
Introduction
• Error localisation problem:
  • Try to identify variables with erroneous/missing values
• Edits:
  • Constraints that should be satisfied by the data
  • Hard (fatal) – e.g. Turnover – Costs = Profit
  • Soft (query) – e.g. Profit / Turnover ≤ 0.6
• Manual editing: hard and soft edits
• Automatic editing: only hard edits
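The distinction between the two kinds of edits can be made concrete with a minimal Python sketch. The record, the variable names and the helper `failed_edits` are hypothetical; only the two edit rules are taken from the slide. The ratio edit is written without division (assuming non-negative turnover) so that a zero turnover does not raise an error.

```python
# Toy record; variable names and values are hypothetical.
record = {"turnover": 100, "costs": 60, "profit": 90}

# Hard edit from the slide: Turnover - Costs = Profit.
def hard_balance(r):
    return r["turnover"] - r["costs"] == r["profit"]

# Soft edit from the slide: Profit / Turnover <= 0.6,
# rewritten as 10 * Profit <= 6 * Turnover to avoid division.
def soft_ratio(r):
    return 10 * r["profit"] <= 6 * r["turnover"]

def failed_edits(r, edits):
    """Return the names of the edits that record r violates."""
    return [name for name, check in edits.items() if not check(r)]

hard_edits = {"balance": hard_balance}
soft_edits = {"profit_ratio": soft_ratio}

hard_failures = failed_edits(record, hard_edits)
soft_failures = failed_edits(record, soft_edits)
```

A hard failure means the record is certainly wrong; a soft failure only makes it suspicious, which is why it cannot simply be enforced as a constraint.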
Error localisation (1)
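• Fellegi and Holt (1976):
  • Find the smallest (weighted) number of variables that can be imputed so that all edits are satisfied
  • Minimise
      $D_{FH} = \sum_j w_j y_j$
    so that all edits are satisfied, where $w_j$ is the weight of variable $j$ and $y_j = 1$ if variable $j$ is imputed ($y_j = 0$ otherwise)
  • No room for soft edits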
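As an illustration of the Fellegi-Holt criterion, the following Python sketch solves a toy instance by brute force. It is not the prototype mentioned in these slides: the record, the weights and the grid `DOMAIN` of admissible imputed values are assumptions made so that feasibility can be checked by plain enumeration.

```python
from itertools import combinations, product

# Hypothetical toy record and reliability weights w_j.
record = {"turnover": 100, "costs": 60, "profit": 90}
weights = {"turnover": 2, "costs": 1, "profit": 1}

# Assumption: imputed values come from a small grid, so that
# feasibility can be checked by enumeration.
DOMAIN = range(0, 201, 10)

def satisfies_edits(r):
    # Single hard edit: Turnover - Costs = Profit.
    return r["turnover"] - r["costs"] == r["profit"]

def can_be_imputed(r, free_vars):
    """True if some values for free_vars make r satisfy all edits."""
    for values in product(DOMAIN, repeat=len(free_vars)):
        candidate = dict(r, **dict(zip(free_vars, values)))
        if satisfies_edits(candidate):
            return True
    return False

def fellegi_holt(r, w):
    """Return (minimal D_FH, all variable sets that attain it)."""
    best, solutions = None, []
    names = list(r)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            if not can_be_imputed(r, subset):
                continue
            d_fh = sum(w[j] for j in subset)  # weighted number of imputed variables
            if best is None or d_fh < best:
                best, solutions = d_fh, [subset]
            elif d_fh == best:
                solutions.append(subset)
    return best, solutions

d_fh, solutions = fellegi_holt(record, weights)
```

In this toy case both `("costs",)` and `("profit",)` attain the minimum: the hard edit alone cannot tell which of the two variables is in error.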
Error localisation (2)
• Alternative approach:
  • Choose a function $D_{soft}$ that measures the degree of suspicion associated with particular soft edit failures
  • Minimise
      $D = \lambda D_{FH} + (1 - \lambda) D_{soft}$
    so that all hard edits are satisfied
• Prototype algorithm in R (based on editrules)
Simulation study (1)
• Two data sets:
  • Dutch SBS 2007, medium-sized wholesale businesses
  • Raw and manually edited data available
  • One half used as test data, one half as reference data
• Test data set 1:
  • 728 records, 12 variables, 16 hard edits, 10 soft edits
  • Synthetic errors
• Test data set 2:
  • 580 records, 10 variables, 17 hard edits, 24 soft edits
  • Real errors
Simulation study (2)
                                                    % records with perfect solution
editing approach (choice of Dsoft)                  data set 1   data set 2
no soft edits, only hard edits                      40.2%        58.4%
all edits as hard edits                             36.8%        n/a
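The slides do not define "perfect solution". The Python sketch below assumes it means that the set of variables flagged by the algorithm coincides exactly with the set of variables that are really in error; the function name and the example sets are hypothetical.

```python
def percent_perfect(flagged, true_errors):
    """Percentage of records whose flagged set equals the true error set.

    Assumed definition of a 'perfect solution'; not taken from the slides.
    """
    hits = sum(set(f) == set(t) for f, t in zip(flagged, true_errors))
    return 100 * hits / len(true_errors)

# Hypothetical outcome for four records.
flagged = [{"profit"}, {"costs"}, set(), {"turnover", "profit"}]
true_errors = [{"profit"}, {"profit"}, set(), {"turnover"}]

score = percent_perfect(flagged, true_errors)
```

Under this definition a record counts as a miss both when a true error is overlooked and when a correct value is flagged unnecessarily.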
Choices for Dsoft – fixed weights (1)
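• Fixed failure weights:
    $D_{soft} = \sum_k s_k z_k$
  ($z_k = 1$ if soft edit $k$ is failed, $z_k = 0$ otherwise)
• Resulting target function to be minimised:
    $D = \lambda \sum_j w_j y_j + (1 - \lambda) \sum_k s_k z_k$
• Higher failure weight ⇒ ‘harder’ soft edit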
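The effect of adding fixed failure weights to the target function can be sketched on a toy record in Python. This is an illustrative brute-force search, not the R prototype: it assumes the target function is a weighted combination of the Fellegi-Holt term and the soft-edit term with a mixing parameter (`LAMBDA`, value chosen arbitrarily), and the record, weights and value grid `DOMAIN` are hypothetical.

```python
from itertools import combinations, product

record = {"turnover": 100, "costs": 60, "profit": 90}  # hypothetical record
w = {"turnover": 2, "costs": 1, "profit": 1}           # reliability weights w_j
s = {"profit_ratio": 1.0}                              # failure weights s_k
LAMBDA = 0.5                                           # assumed mixing parameter
DOMAIN = range(0, 201, 10)                             # assumed grid of imputable values

def hard_ok(r):
    # Hard edit: Turnover - Costs = Profit.
    return r["turnover"] - r["costs"] == r["profit"]

soft_edits = {
    # Soft edit Profit / Turnover <= 0.6, in integer arithmetic.
    "profit_ratio": lambda r: 10 * r["profit"] <= 6 * r["turnover"],
}

def objective(changed, r):
    """Weighted sum of imputed variables and failed soft edits."""
    d_fh = sum(w[j] for j in changed)
    d_soft = sum(s[k] for k, ok in soft_edits.items() if not ok(r))
    return LAMBDA * d_fh + (1 - LAMBDA) * d_soft

def localise(r):
    """Return (D, changed variables, imputed record) with minimal D."""
    best = None
    names = list(r)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            for values in product(DOMAIN, repeat=size):
                cand = dict(r, **dict(zip(subset, values)))
                if not hard_ok(cand):
                    continue  # hard edits must always hold
                d = objective(subset, cand)
                if best is None or d < best[0]:
                    best = (d, subset, cand)
    return best

d, changed, imputed = localise(record)
```

With hard edits only, changing `costs` or changing `profit` would be equally good. Changing `costs` to 10 leaves the ratio edit failed, so the soft-edit term breaks the tie in favour of `profit`.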
Choices for Dsoft – fixed weights (2)
• Possible choices for $s_k$:
A. All failure weights equal to 1
B. Proportion of records that satisfy edit k in manually edited reference data
Interpretation: P(edited record satisfies edit k)
C. P(edited record satisfies edit k | raw record fails edit k)
• Alternative: categorised versions of B and C
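Choices B and C can both be estimated from paired raw and manually edited reference records. The Python sketch below shows one way to do this; the function name `failure_weights` and the four reference records are hypothetical.

```python
def failure_weights(raw, edited, edit):
    """Estimate failure weights B and C for one soft edit.

    raw, edited: lists of records (dicts), aligned by position.
    edit: function returning True if a record satisfies the edit.
    """
    # B: share of edited records that satisfy the edit.
    weight_b = sum(edit(e) for e in edited) / len(edited)

    # C: the same share, restricted to records whose raw version fails.
    subset = [e for r, e in zip(raw, edited) if not edit(r)]
    if subset:
        weight_c = sum(edit(e) for e in subset) / len(subset)
    else:
        weight_c = None  # the edit never fails in the raw data
    return weight_b, weight_c

def soft_ratio(r):
    # Profit / Turnover <= 0.6, in integer arithmetic.
    return 10 * r["profit"] <= 6 * r["turnover"]

# Hypothetical reference data: four records, raw and manually edited.
raw = [
    {"turnover": 100, "profit": 90},
    {"turnover": 100, "profit": 70},
    {"turnover": 100, "profit": 30},
    {"turnover": 100, "profit": 50},
]
edited = [
    {"turnover": 100, "profit": 40},  # failure corrected by the editor
    {"turnover": 100, "profit": 70},  # failure left in place by the editor
    {"turnover": 100, "profit": 30},
    {"turnover": 100, "profit": 50},
]

weight_b, weight_c = failure_weights(raw, edited, soft_ratio)
```

Weight C looks only at raw failures, so it measures how often editors actually resolve a failure of this edit rather than how often the edit holds overall.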
Simulation study (3)
                                                    % records with perfect solution
editing approach (choice of Dsoft)                  data set 1   data set 2
no soft edits, only hard edits                      40.2%        58.4%
all edits, using soft edits as hard edits           36.8%        n/a
sum of fixed failure weights A                      47.3%        63.4%
sum of fixed failure weights B                      52.1%        60.9%
sum of fixed failure weights C                      43.3%        60.7%
sum of fixed failure weights B(cat)                 50.0%        64.5%
sum of fixed failure weights C(cat)                 43.1%        64.5%
Choices for Dsoft – quantile edits (1)
• Drawback of fixed failure weights: no difference between large and small edit failures
• Trick: quantile edits
Choices for Dsoft – quantile edits (2)
• Idea: use different versions of the same edit by varying one of the constants
• Choose values for this constant based on the fraction of reference data records that fail the resulting edit (e.g. 1%, 5%, 10%)
Choices for Dsoft – quantile edits (3)
• Example: ratio edit x1 / x3 ≥ c

% records failed in ref. data   c      quantile edit     sk   cumul. sk
10%                             0.75   x1 / x3 ≥ 0.75    1    1
5%                              0.60   x1 / x3 ≥ 0.60    1    2
1%                              0.10   x1 / x3 ≥ 0.10    1    3
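The cumulative weights in this example can be reproduced with a few lines of Python. The thresholds and weights are those of the table; the function name and the test ratios are hypothetical, and the sketch assumes x3 is non-zero.

```python
# Quantile edits for the ratio edit x1 / x3 >= c, taken from the table:
# pairs (c, failure weight s_k).
QUANTILE_EDITS = [(0.75, 1), (0.60, 1), (0.10, 1)]

def quantile_failure_weight(x1, x3, quantile_edits=QUANTILE_EDITS):
    """Sum the weights of all quantile edits that the record fails."""
    ratio = x1 / x3  # assumes x3 != 0
    return sum(s_k for c, s_k in quantile_edits if ratio < c)

# Ratios 0.80, 0.70, 0.50 and 0.05 fail zero, one, two and three edits.
weights = [quantile_failure_weight(x1, 100) for x1 in (80, 70, 50, 5)]
```

A record that misses the threshold by a wide margin fails more of the nested edits and so collects a larger total weight, which is how fixed weights are made sensitive to the size of the failure.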
Simulation study (4)
                                                    % records with perfect solution
editing approach (choice of Dsoft)                  data set 1   data set 2
no soft edits, only hard edits                      40.2%        58.4%
all edits, using soft edits as hard edits           36.8%        n/a
sum of fixed failure weights A                      47.3%        63.4%
sum of fixed failure weights B                      52.1%        60.9%
sum of fixed failure weights C                      43.3%        60.7%
sum of fixed failure weights B(cat)                 50.0%        64.5%
sum of fixed failure weights C(cat)                 43.1%        64.5%
10-5-1%-quantile edits, weights 0.33-0.33-0.33      54.4%        63.4%
10-5-1%-quantile edits, weights 0.90-0.05-0.05      56.5%        63.8%
Choices for Dsoft – dynamic expressions
• Size of edit failure: $e_k$
  • Linear equality edit: $a_{k1}x_1 + \dots + a_{kp}x_p + b_k = 0$
    Take: $e_k = | a_{k1}x_1 + \dots + a_{kp}x_p + b_k |$
  • Linear inequality edit: $a_{k1}x_1 + \dots + a_{kp}x_p + b_k \geq 0$
    Take: $e_k = \max\{ 0, -(a_{k1}x_1 + \dots + a_{kp}x_p + b_k) \}$
• Use reference data to standardise:
  • Linear sum:
      $D_{soft} = \sum_k e_k / \hat{\sigma}_k$
  • Mahalanobis distance:
      $D_{soft} = D_M(e, 0) = \sqrt{e^{\top} \hat{\Sigma}^{-1} e}$
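The failure sizes and the standardised linear sum can be sketched in Python as follows. The record, the edit coefficients and the scale values in `scales` are hypothetical; in practice the scales would be estimated from the reference data. The Mahalanobis variant is left out because it needs a matrix inverse.

```python
def linear_form(a, b, x):
    """Value of a_1 x_1 + ... + a_p x_p + b."""
    return sum(a_j * x_j for a_j, x_j in zip(a, x)) + b

def equality_failure(a, b, x):
    """Failure size |a.x + b| for a linear equality edit a.x + b = 0."""
    return abs(linear_form(a, b, x))

def inequality_failure(a, b, x):
    """Failure size max(0, -(a.x + b)) for an inequality edit a.x + b >= 0."""
    return max(0, -linear_form(a, b, x))

def standardised_sum(failures, scales):
    """Linear sum of failure sizes, each divided by its scale estimate."""
    return sum(e_k / sigma_k for e_k, sigma_k in zip(failures, scales))

# Hypothetical record x = (turnover, costs, profit).
x = (100, 60, 90)

# Equality edit: turnover - costs - profit = 0.
e_eq = equality_failure((1, -1, -1), 0, x)

# Inequality edit: 6 * turnover - 10 * profit >= 0 (profit / turnover <= 0.6).
e_ineq = inequality_failure((6, 0, -10), 0, x)

# Hypothetical scale estimates, one per edit.
scales = (25, 150)
d_soft = standardised_sum((e_eq, e_ineq), scales)
```

The raw failure size depends on how an edit happens to be scaled (multiplying all coefficients by 10 multiplies it by 10), which is one reason to standardise before summing.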
Simulation study (5)
                                                    % records with perfect solution
editing approach (choice of Dsoft)                  data set 1   data set 2
no soft edits, only hard edits                      40.2%        58.4%
all edits, using soft edits as hard edits           36.8%        n/a
sum of fixed failure weights A                      47.3%        63.4%
sum of fixed failure weights B                      52.1%        60.9%
sum of fixed failure weights C                      43.3%        60.7%
sum of fixed failure weights B(cat)                 50.0%        64.5%
sum of fixed failure weights C(cat)                 43.1%        64.5%
10-5-1%-quantile edits, weights 0.33-0.33-0.33      54.4%        63.4%
10-5-1%-quantile edits, weights 0.90-0.05-0.05      56.5%        63.8%
sum of standardised soft edit failures              49.2%        ?
Mahalanobis distance of soft edit failures          46.8%        ?
Conclusion
• Using soft edits improved error localisation
• Choice of Dsoft:
  • Results not unequivocal
  • Quantile edits seem to work well
  • Room for improvement
• Future work:
  • Extended simulation study with mixed data/edits