data cleaning in financial modules

10
Data Cleaning in Financial Modules Workshop in Frankfurt Mario Schnalzenberger

Upload: thora

Post on 07-Jan-2016

24 views

Category:

Documents


0 download

DESCRIPTION

Data Cleaning in Financial Modules. Workshop in Frankfurt Mario Schnalzenberger. Which Modules exactly. EP Pensions and Benefits (ep078*, ep094*) Income from various sources ( ep201, ep045, ep314, ep041, ep205, ep207, ep305) AS - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Data Cleaning in Financial Modules

Data Cleaning in Financial Modules

Workshop in Frankfurt

Mario Schnalzenberger

Page 2: Data Cleaning in Financial Modules

Which Modules exactly• EP

– Pensions and Benefits (ep078*, ep094*)– Income from various sources (ep201, ep045, ep314, ep041, ep205,

ep207, ep305)• AS

– Financial assets and dividends from them (as003, as005, as007, as009, as011, as015, as017, as058, as021, as024, as027, as030, as032*, as034*, as042, as049, as051)

– Total debts (as051)– Savings of other HH members (as069)

• HC – out of pocket payments (hc045, hc047, hc049, hc051)• FT – financial transfers (ft004*, ft011*, ft018)• HO – rents and value of real property

– (ho003, ho005, ho008, ho015, ho020, ho024, ho027, ho030)• HH – household income (hh002, hh011, hh017)• CO – amout spent (co002, co003, co004, co011)

Page 3: Data Cleaning in Financial Modules

1. Check for outliers

• Merge all relevant sections (include also IWER – id from dataset origin, makes it easier for the agency)

• Recode „Don‘t Know“, „Refuse“ as missing– (eg. use „mvdecode _all, mv(9e+20=.a \ 8e+20=.b)” in

STATA)• Use graphical methods (e.g. histograms if that is

easier for you)• Search for outliers in each section: use

percentiles, min, max or other automatic procedures (see hadimvo)

Page 4: Data Cleaning in Financial Modules

02

.0e-

04

4.0

e-0

46

.0e-

04

8.0

e-0

4.0

01

Den

sity

0 1000 2000 3000 4000typical payment of pension in last year

Graphical check

• Here you see that without correcting the outliers, all data is concentrated in few bins

• Corrected data is distributed nicely

05

.0e-

05

1.0

e-0

41

.5e-

04

2.0

e-0

42

.5e-

04

Den

sity

0 20000 40000 60000 80000 100000typical payment of pension in last year

Page 5: Data Cleaning in Financial Modules

Automatic check for the FT section

foreach var of varlist ft004_1 ft004_2 ft004_3 ft011_1 ft011_2 ft018_1 {

noi disp "`var'" noi sum `var', detail hadimvo `var' if `var'>0, gen(g_`var') sort laptop date1 time1 noi list laptop sampid respid `var' if g_`var'==1}

egen g_ft=rsum(g_ft*)sort laptop date1 time1noi disp "Rows with more than one outlier in FT"noi list laptop sampid respid g_ft if g_ft>1

Page 6: Data Cleaning in Financial Modules

Look at those outputs

• Find real outliers & typos

• Find currency problems (this may be hard in various countries)

• Make a list (eg. Excel – File) for the agency

• Those issues should be easy to fix.

Page 7: Data Cleaning in Financial Modules

In addition to ask the agency

• Check things which are duplicate in the data• Gross and net income of same source (but

Attention: sometimes apply to different years)!• Income (ep078* ep094* and others) + hh002 +

hh011 ~= hh017

• It often occured that hh002 was set to the partners income (exactly), even though this is wrong (see question text)

Page 8: Data Cleaning in Financial Modules

Find absolute brackets

• Maximum pensions (maybe per type, in Austria 3850 Euro per month)

• Find some useful brackets for assets (we used 100000 Euro for many of them)

• Look for proportions of large values in the whole dataset (reasonable?)

• Look for wealthy households

Page 9: Data Cleaning in Financial Modules

Don‘t forget

• Usually we have a skewed distribution (monetary numbers)

• Therefore outliers are more easily found on the maximum side!

• No outliers will be usually found on the minimum side.

• Check also the minimal numbers (e.g. we had houses worth 0, 2 or 5 Euro, penisons of 0 or 1 Euro a month, …) – see also Dimitri‘s presentation

Page 10: Data Cleaning in Financial Modules

Other issues

• Periodicity of payments (e.g. in Austria ALL(!) pensions are paid monthly only)

• Sometimes the respondents told they got additional payments (last month, i.e. ep201 is higher than in usual month), but those payments (ep314) are equal to or even higher than ep201

• This would mean they have no or even negative regular income payment (obviously wrong)