Automatic Editing with Hard and Soft Edits – Some First Experiences Sander Scholtus Sevinç Göksen (Statistics Netherlands)

Post on 13-Jan-2016

Introduction

• Error localisation problem:
  • Try to identify variables with erroneous/missing values

• Edits:
  • Constraints that should be satisfied by the data
  • Hard (fatal) – e.g. Turnover – Costs = Profit
  • Soft (query) – e.g. Profit / Turnover ≤ 0.6

• Manual editing: hard and soft edits
• Automatic editing: only hard edits
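The distinction between the two kinds of edits can be illustrated with a minimal Python sketch. The record values and the helper `failed_edits` are assumptions made up for this illustration; the two edits are the examples from the slide, with the ratio edit rewritten as Profit ≤ 0.6 × Turnover (equivalent when Turnover > 0) to avoid a division by zero.

```python
# Illustrative record; the values are invented for this sketch.
record = {"turnover": 1000.0, "costs": 250.0, "profit": 750.0}

# Hard (fatal) edits: must hold for every valid record.
hard_edits = {
    "balance": lambda r: abs(r["turnover"] - r["costs"] - r["profit"]) < 1e-9,
}

# Soft (query) edits: a failure is suspicious but not necessarily an error.
# Profit / Turnover <= 0.6 is written without division (assumes turnover > 0).
soft_edits = {
    "profit_ratio": lambda r: r["profit"] <= 0.6 * r["turnover"],
}

def failed_edits(rec, edits):
    """Return the names of the edits that the record violates."""
    return [name for name, check in edits.items() if not check(rec)]

hard_failures = failed_edits(record, hard_edits)
soft_failures = failed_edits(record, soft_edits)
print("hard failures:", hard_failures)
print("soft failures:", soft_failures)
```

In this example the record passes the hard balance edit but fails the soft ratio edit, which is exactly the kind of information that automatic editing with only hard edits ignores.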

Error localisation (1)

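• Fellegi and Holt (1976):
  • Find the smallest (weighted) number of variables that can be imputed so that all edits are satisfied
  • Minimise

    $D_{FH} = \sum_j w_j y_j$

    so that all edits are satisfied, where $y_j = 1$ if variable j is imputed and $y_j = 0$ otherwise, and $w_j$ is the weight of variable j
  • No room for soft edits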
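A toy Python sketch of the Fellegi-Holt criterion follows. It is not the prototype mentioned in these slides and does not use the editrules package: it simply enumerates subsets of variables in order of increasing total weight and, as a stand-in for a proper feasibility check, tries every combination of values from a small candidate grid. The record, the weights, the edits and the grid are all assumptions chosen for illustration.

```python
from itertools import combinations, product

def satisfies_all(rec, edits):
    """True if the record satisfies every edit."""
    return all(check(rec) for check in edits)

def localise_errors(record, weights, edits, candidates):
    """Return (variables to impute, imputed record, D_FH) with minimal D_FH.

    Feasibility is checked by brute force over `candidates`; this only
    works for tiny examples and is a placeholder for a real solver.
    """
    names = list(record)
    subsets = [s for size in range(len(names) + 1)
               for s in combinations(names, size)]
    # Try subsets in order of increasing D_FH = sum of weights of imputed variables.
    subsets.sort(key=lambda s: sum(weights[v] for v in s))
    for subset in subsets:
        for values in product(candidates, repeat=len(subset)):
            trial = dict(record, **dict(zip(subset, values)))
            if satisfies_all(trial, edits):
                return subset, trial, sum(weights[v] for v in subset)
    return None

# Made-up record that violates Turnover - Costs = Profit.
record = {"turnover": 100, "costs": 60, "profit": 400}
weights = {"turnover": 2, "costs": 1, "profit": 1}
edits = [
    lambda r: r["turnover"] - r["costs"] == r["profit"],
    lambda r: r["turnover"] >= 0,
    lambda r: r["costs"] >= 0,
]

solution = localise_errors(record, weights, edits, range(-500, 501))
print(solution)
```

Changing only Costs cannot work here because it would require a negative value, so the sketch selects Profit as the variable to impute. Every edit is treated as a hard constraint, which is why this formulation leaves no room for soft edits.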

Error localisation (2)

• Alternative approach:
  • Choose a function Dsoft that measures the degree of suspicion associated with particular soft edit failures
  • Minimise

    $D = \lambda D_{FH} + (1 - \lambda) D_{soft}$

    so that all hard edits are satisfied

• Prototype algorithm in R (based on editrules)
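The combined target function, a weighted sum of the Fellegi-Holt term and a soft edit term, can be sketched in Python as follows. This is an illustration only, not the R prototype: the candidate solutions are listed by hand (each already satisfies the hard edit Turnover – Costs = Profit), the weights and the value 0.5 for the mixing parameter are assumptions, and the default Dsoft simply counts failed soft edits as a placeholder.

```python
def d_fh(changed, weights):
    """Fellegi-Holt term: total weight of the imputed variables."""
    return sum(weights[v] for v in changed)

def count_soft_failures(rec, soft_edits):
    """Placeholder D_soft: number of soft edits the imputed record fails."""
    return sum(1 for check in soft_edits if not check(rec))

def objective(changed, rec, weights, soft_edits, lam, d_soft=count_soft_failures):
    """D = lam * D_FH + (1 - lam) * D_soft for one candidate solution."""
    return lam * d_fh(changed, weights) + (1 - lam) * d_soft(rec, soft_edits)

# Raw record (made up): turnover=1000, costs=100, profit=90 fails the balance edit.
# Each candidate changes one variable so that the hard edit holds again.
candidates = [
    (("turnover",), {"turnover": 190.0, "costs": 100.0, "profit": 90.0}),
    (("costs",), {"turnover": 1000.0, "costs": 910.0, "profit": 90.0}),
    (("profit",), {"turnover": 1000.0, "costs": 100.0, "profit": 900.0}),
]
weights = {"turnover": 3, "costs": 1, "profit": 1}
soft_edits = [lambda r: r["profit"] <= 0.6 * r["turnover"]]

scores = {changed: objective(changed, rec, weights, soft_edits, lam=0.5)
          for changed, rec in candidates}
best = min(scores, key=scores.get)
print(scores, best)
```

With hard edits alone, changing Costs and changing Profit are tied. The soft ratio edit breaks the tie, because imputing Profit = 900 produces a record that fails Profit / Turnover ≤ 0.6.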

Simulation study (1)

• Two data sets:
  • Dutch SBS 2007, medium-sized wholesale businesses
  • Raw and manually edited data available
  • One half used as test data, one half as reference data

• Test data set 1:
  • 728 records, 12 variables, 16 hard edits, 10 soft edits
  • Synthetic errors

• Test data set 2:
  • 580 records, 10 variables, 17 hard edits, 24 soft edits
  • Real errors

Simulation study (2)

editing approach (choice of Dsoft)                % records with perfect solution
                                                  data set 1    data set 2
no soft edits, only hard edits                    40.2%         58.4%
all edits as hard edits                           36.8%         n/a

Choices for Dsoft – fixed weights (1)

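• Fixed failure weights:

    $D_{soft} = \sum_k s_k z_k$

  where $z_k = 1$ if soft edit k is failed and $z_k = 0$ otherwise, and $s_k$ is the failure weight of soft edit k

• Resulting target function to be minimised:

    $D = \lambda \sum_j w_j y_j + (1 - \lambda) \sum_k s_k z_k$

• Higher failure weight ⇒ ‘harder’ soft edit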
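A small worked calculation may help to read the target function. All numbers are assumptions chosen for illustration and are not taken from the study: the mixing parameter is 0.5, candidate solution (a) imputes two variables with weight 1 each and leaves one soft edit with failure weight 0.8 failed, and candidate solution (b) imputes three variables with weight 1 each and fails no soft edit.

```latex
D^{(a)} = 0.5 \cdot (1 + 1) + (1 - 0.5) \cdot 0.8 = 1.0 + 0.4 = 1.4
\qquad
D^{(b)} = 0.5 \cdot (1 + 1 + 1) + (1 - 0.5) \cdot 0 = 1.5
```

Under these assumptions solution (a) is preferred. Raising the failure weight of the failed soft edit above 1 would reverse the ranking, which is the sense in which a higher failure weight makes a soft edit 'harder'.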

Choices for Dsoft – fixed weights (2)

• Possible choices for s_k:
  A. All failure weights equal to 1
  B. Proportion of records that satisfy edit k in manually edited reference data
     Interpretation: P(edited record satisfies edit k)
  C. P(edited record satisfies edit k | raw record fails edit k)

• Alternative: categorised versions of B and C
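Choices A, B and C can be estimated from paired raw and manually edited reference records. The Python sketch below assumes that the two lists are aligned record by record; the function name `failure_weights` and the five toy records are invented for this illustration.

```python
def failure_weights(raw, edited, edit):
    """Return the fixed failure weights (A, B, C) for one soft edit.

    raw and edited are lists of records in the same order.
    C is None when no raw record fails the edit.
    """
    satisfied_edited = [edit(r) for r in edited]
    failed_raw = [not edit(r) for r in raw]

    weight_a = 1.0
    # B: share of edited records that satisfy the edit.
    weight_b = sum(satisfied_edited) / len(edited)
    # C: same share, restricted to records whose raw version failed the edit.
    n_failed_raw = sum(failed_raw)
    if n_failed_raw:
        weight_c = sum(s for s, f in zip(satisfied_edited, failed_raw) if f) / n_failed_raw
    else:
        weight_c = None
    return weight_a, weight_b, weight_c

# Soft edit from the introduction, written without division.
ratio_edit = lambda r: r["profit"] <= 0.6 * r["turnover"]

# Made-up raw and edited values of profit; turnover is 100 throughout.
raw = [{"turnover": 100, "profit": p} for p in (50, 80, 90, 70, 30)]
edited = [{"turnover": 100, "profit": p} for p in (50, 40, 90, 35, 30)]

a, b, c = failure_weights(raw, edited, ratio_edit)
print(a, b, c)
```

In the toy data three raw records fail the edit and two of them satisfy it after manual editing, so C is 2/3, while B is 4/5 because four of the five edited records satisfy the edit.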

Simulation study (3)

editing approach (choice of Dsoft)                % records with perfect solution
                                                  data set 1    data set 2
no soft edits, only hard edits                    40.2%         58.4%
all edits, using soft edits as hard edits         36.8%         n/a
sum of fixed failure weights A                    47.3%         63.4%
sum of fixed failure weights B                    52.1%         60.9%
sum of fixed failure weights C                    43.3%         60.7%
sum of fixed failure weights B(cat)               50.0%         64.5%
sum of fixed failure weights C(cat)               43.1%         64.5%

Choices for Dsoft – quantile edits (1)

• Drawback of fixed failure weights: no difference between large and small edit failures

• Trick: quantile edits

Choices for Dsoft – quantile edits (2)

• Idea: use different versions of the same edit by varying one of the constants

• Choose values for this constant based on the fraction of reference data records that fail the resulting edit (e.g. 1%, 5%, 10%)
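One way to derive such constants is to read them off the sorted reference data. The Python sketch below does this for an edit of the form ratio ≥ c; the function name `quantile_thresholds`, the simple order-statistic rule and the artificial reference ratios are assumptions for illustration, and ties in the data would make the realised failure fractions smaller than requested.

```python
def quantile_thresholds(ratios, fail_percentages):
    """For an edit 'ratio >= c', choose c per failure percentage.

    With distinct reference ratios, exactly n * pct // 100 reference
    records lie strictly below the returned threshold and fail the edit.
    Percentages must be below 100.
    """
    ordered = sorted(ratios)
    n = len(ordered)
    thresholds = {}
    for pct in fail_percentages:
        k = n * pct // 100  # number of reference records allowed to fail
        thresholds[pct] = ordered[k]
    return thresholds

# Artificial reference ratios 0.01, 0.02, ..., 1.00.
reference_ratios = [i / 100 for i in range(1, 101)]

thresholds = quantile_thresholds(reference_ratios, [10, 5, 1])
for pct, c in thresholds.items():
    n_fail = sum(r < c for r in reference_ratios)
    print(f"{pct}% quantile edit: ratio >= {c} ({n_fail} reference records fail)")
```

Each percentage yields one version of the same edit, and the versions are nested: a record that fails the 1% edit also fails the 5% and 10% edits.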

Choices for Dsoft – quantile edits (3)

• Example: ratio edit x1 / x3 ≥ c

% records failed in ref. data    c       quantile edit      s_k    cumul. s_k
10%                              0.75    x1 / x3 ≥ 0.75     1      1
5%                               0.60    x1 / x3 ≥ 0.60     1      2
1%                               0.10    x1 / x3 ≥ 0.10     1      3
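The cumulative weights in the table can be reproduced with a few lines of Python. The thresholds and unit weights are those of the example; the function name `d_soft_quantile` is an assumption, and the sketch treats the ratio x1 / x3 as already computed.

```python
# (threshold c, failure weight s_k) for the 10%, 5% and 1% quantile edits.
quantile_edits = [(0.75, 1.0), (0.60, 1.0), (0.10, 1.0)]

def d_soft_quantile(ratio, edits):
    """Sum the failure weights of all quantile edits 'ratio >= c' that fail."""
    return sum(s for c, s in edits if ratio < c)

for ratio in (0.80, 0.70, 0.50, 0.05):
    print(ratio, d_soft_quantile(ratio, quantile_edits))
```

A ratio just below 0.75 incurs a penalty of 1, whereas a ratio below 0.10 fails all three versions and incurs 3, so larger edit failures count more heavily even though each individual weight is fixed.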

Simulation study (4)

editing approach (choice of Dsoft)                % records with perfect solution
                                                  data set 1    data set 2
no soft edits, only hard edits                    40.2%         58.4%
all edits, using soft edits as hard edits         36.8%         n/a
sum of fixed failure weights A                    47.3%         63.4%
sum of fixed failure weights B                    52.1%         60.9%
sum of fixed failure weights C                    43.3%         60.7%
sum of fixed failure weights B(cat)               50.0%         64.5%
sum of fixed failure weights C(cat)               43.1%         64.5%
10-5-1%-quantile edits, weights 0.33-0.33-0.33    54.4%         63.4%
10-5-1%-quantile edits, weights 0.90-0.05-0.05    56.5%         63.8%

Choices for Dsoft – dynamic expressions

• Size of edit failure: $e_k$
  • Linear equality edit: $a_{k1} x_1 + \dots + a_{kp} x_p + b_k = 0$
    Take: $e_k = | a_{k1} x_1 + \dots + a_{kp} x_p + b_k |$
  • Linear inequality edit: $a_{k1} x_1 + \dots + a_{kp} x_p + b_k \geq 0$
    Take: $e_k = \max\{ 0, -(a_{k1} x_1 + \dots + a_{kp} x_p + b_k) \}$

• Use reference data to standardise:
  • Linear sum:

    $D_{soft} = \sum_k e_k / \hat{\sigma}_k$

  • Mahalanobis distance:

    $D_{soft} = D_M(e, 0) = \sqrt{e^{\top} \hat{\Sigma}^{-1} e}$
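The size-based measures can be sketched in Python as follows. Everything specific here is an assumption for illustration: the two soft inequality edits, the five reference records, the use of the sample standard deviation and sample covariance of the failure sizes in the reference data as standardising quantities, and the textbook form of the Mahalanobis distance with a square root. The covariance matrix is inverted by hand, so the sketch only covers two soft edits.

```python
from math import sqrt
from statistics import mean, stdev

def failure_size(coeffs, const, rec, equality=False):
    """Size e_k of the failure of a linear edit sum_j a_kj * x_j + b_k (= or >=) 0."""
    value = sum(a * rec[name] for name, a in coeffs.items()) + const
    return abs(value) if equality else max(0.0, -value)

# Two made-up soft inequality edits as (coefficients, constant).
soft_edits = [
    ({"turnover": 0.6, "profit": -1.0}, 0.0),  # profit <= 0.6 * turnover
    ({"costs": 1.0, "turnover": -0.1}, 0.0),   # costs >= 0.1 * turnover
]

def failure_vector(rec):
    return [failure_size(coeffs, const, rec) for coeffs, const in soft_edits]

# Made-up reference records (all satisfy turnover - costs = profit).
reference = [
    {"turnover": 100, "costs": 50, "profit": 50},
    {"turnover": 100, "costs": 30, "profit": 70},
    {"turnover": 200, "costs": 10, "profit": 190},
    {"turnover": 100, "costs": 5, "profit": 95},
    {"turnover": 100, "costs": 60, "profit": 40},
]
ref_e1, ref_e2 = zip(*(failure_vector(r) for r in reference))

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def d_soft_linear(rec):
    """Sum of failure sizes, each divided by its reference standard deviation."""
    e1, e2 = failure_vector(rec)
    return e1 / stdev(ref_e1) + e2 / stdev(ref_e2)

def d_soft_mahalanobis(rec):
    """sqrt(e' S^-1 e) with S the 2x2 reference covariance matrix of (e1, e2)."""
    e1, e2 = failure_vector(rec)
    s11, s22 = covariance(ref_e1, ref_e1), covariance(ref_e2, ref_e2)
    s12 = covariance(ref_e1, ref_e2)
    det = s11 * s22 - s12 * s12
    quad = (s22 * e1 * e1 - 2 * s12 * e1 * e2 + s11 * e2 * e2) / det
    return sqrt(quad)

suspicious = {"turnover": 100, "costs": 0, "profit": 100}
print(d_soft_linear(suspicious), d_soft_mahalanobis(suspicious))
```

Unlike fixed failure weights, both measures grow with the size of the violation, and a record that satisfies every soft edit scores zero on both.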

Simulation study (5)

editing approach (choice of Dsoft)                % records with perfect solution
                                                  data set 1    data set 2
no soft edits, only hard edits                    40.2%         58.4%
all edits, using soft edits as hard edits         36.8%         n/a
sum of fixed failure weights A                    47.3%         63.4%
sum of fixed failure weights B                    52.1%         60.9%
sum of fixed failure weights C                    43.3%         60.7%
sum of fixed failure weights B(cat)               50.0%         64.5%
sum of fixed failure weights C(cat)               43.1%         64.5%
10-5-1%-quantile edits, weights 0.33-0.33-0.33    54.4%         63.4%
10-5-1%-quantile edits, weights 0.90-0.05-0.05    56.5%         63.8%
sum of standardised soft edit failures            49.2%         ?
Mahalanobis distance of soft edit failures        46.8%         ?

Conclusion

• Using soft edits improved error localisation

• Choice of Dsoft:
  • Results not unequivocal
  • Quantile edits seem to work well
  • Room for improvement

• Future work:
  • Extended simulation study with mixed data/edits