Data Cleaning in Financial Modules
Workshop in Frankfurt
Mario Schnalzenberger
Which Modules exactly
• EP
– Pensions and Benefits (ep078*, ep094*)
– Income from various sources (ep201, ep045, ep314, ep041, ep205, ep207, ep305)
• AS
– Financial assets and dividends from them (as003, as005, as007, as009, as011, as015, as017, as058, as021, as024, as027, as030, as032*, as034*, as042, as049, as051)
– Total debts (as051)
– Savings of other HH members (as069)
• HC – out of pocket payments (hc045, hc047, hc049, hc051)
• FT – financial transfers (ft004*, ft011*, ft018)
• HO – rents and value of real property (ho003, ho005, ho008, ho015, ho020, ho024, ho027, ho030)
• HH – household income (hh002, hh011, hh017)
• CO – amount spent (co002, co003, co004, co011)
1. Check for outliers
• Merge all relevant sections (also include the IWER id from the dataset of origin; this makes it easier for the agency)
• Recode "Don't Know" and "Refuse" as missing
– (e.g. use "mvdecode _all, mv(9e+20=.a \ 8e+20=.b)" in STATA)
• Use graphical methods (e.g. histograms, if that is easier for you)
• Search for outliers in each section: use percentiles, min, max or other automatic procedures (see hadimvo)
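The steps above can be sketched for a single variable as follows. This is only a minimal illustration, assuming the merged file is already in memory. It uses ft004_1 and the identifiers laptop, sampid and respid from the FT example in this presentation. The 99th percentile cutoff and the flag name hi_ft004_1 are assumptions made for the example, not recommendations from the workshop.

```stata
* Recode "Don't Know" / "Refuse" codes as extended missing values
mvdecode _all, mv(9e+20=.a \ 8e+20=.b)

* Inspect percentiles, min and max
summarize ft004_1, detail

* Flag values above the 99th percentile for manual review
* (summarize, detail leaves the 99th percentile in r(p99))
generate byte hi_ft004_1 = ft004_1 > r(p99) & !missing(ft004_1)
list laptop sampid respid ft004_1 if hi_ft004_1 == 1

* Quick graphical check of the positive amounts
histogram ft004_1 if ft004_1 > 0
```

The explicit !missing() condition matters because Stata treats missing values as larger than any number, so they would otherwise be flagged as outliers.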
[Figure: density histogram of "typical payment of pension in last year", x-axis from 0 to 4000]
Graphical check
• Here you see that without correcting the outliers, all data is concentrated in few bins
• Corrected data is distributed nicely
[Figure: density histogram of "typical payment of pension in last year", x-axis from 0 to 100000]
Automatic check for the FT section
foreach var of varlist ft004_1 ft004_2 ft004_3 ft011_1 ft011_2 ft018_1 {
    noi disp "`var'"
    noi sum `var', detail
    * Hadi outlier detection on positive values; g_`var' is 1 for flagged rows
    hadimvo `var' if `var'>0, gen(g_`var')
    sort laptop date1 time1
    noi list laptop sampid respid `var' if g_`var'==1
}
* Count flagged variables per row (rowtotal is the current name of rsum)
egen g_ft=rowtotal(g_ft*)
sort laptop date1 time1
noi disp "Rows with more than one outlier in FT"
noi list laptop sampid respid g_ft if g_ft>1
Look at those outputs
• Find real outliers & typos
• Find currency problems (this may be hard in various countries)
• Make a list (e.g. an Excel file) for the agency
• Those issues should be easy to fix.
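One possible way to produce such a list is sketched below. It assumes the flag g_ft from the automatic FT check already exists and that a Stata version with export excel (12 or later) is available. The file name ft_outliers.xlsx is hypothetical.

```stata
* Work on a copy so the full dataset is restored afterwards
preserve

* Keep only rows with at least one flagged FT value
keep if g_ft >= 1
keep laptop sampid respid ft004_1 ft004_2 ft004_3 ft011_1 ft011_2 ft018_1 g_ft

* Write the list for the agency; the file name is a placeholder
export excel using "ft_outliers.xlsx", firstrow(variables) replace

restore
```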
In addition to asking the agency
• Check items that appear more than once in the data
• Gross and net income from the same source (but attention: they sometimes apply to different years!)
• Income (ep078*, ep094* and others) + hh002 + hh011 should roughly equal hh017
• It often occurred that hh002 was set to exactly the partner's income, even though this is wrong (see question text)
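The income comparison could be sketched roughly as follows. Several things here are assumptions. All components are taken to be already converted to the same period and currency and available at household level. The wildcards ep078* and ep094* are assumed to match only numeric amount variables. The 50% tolerance and the names inc_sum and inc_gap are made up for illustration.

```stata
* Sum the income components; rowtotal treats missing values as 0
egen double inc_sum = rowtotal(ep078* ep094* hh002 hh011)

* Relative gap to the reported total household income hh017
generate double inc_gap = abs(inc_sum - hh017) / hh017 if hh017 > 0 & !missing(hh017)

* List cases where the two figures differ by more than the assumed 50%
list laptop sampid respid inc_sum hh017 if inc_gap > 0.5 & !missing(inc_gap)
```

The result is a list of candidates for manual review, not a list of errors, since gross and net figures or different reference years can legitimately produce a gap.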
Find absolute brackets
• Maximum pensions (maybe per type, in Austria 3850 Euro per month)
• Find some useful brackets for assets (we used 100000 Euro for many of them)
• Look for proportions of large values in the whole dataset (reasonable?)
• Look for wealthy households
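A minimal sketch of such bracket checks, using the two thresholds mentioned above. The pension bound assumes the amounts are monthly and in Euro. The asset variables shown are an arbitrary subset of the AS list, and ep078* is assumed to match only numeric amount variables. The names big_* and n_big_as are hypothetical.

```stata
* Upper bounds from the slide: 3850 Euro per month for pensions (Austria)
* and 100000 Euro for assets; adjust per country and per type
local max_pension 3850
local max_asset 100000

foreach var of varlist ep078* {
    generate byte big_`var' = `var' > `max_pension' & !missing(`var')
}
foreach var of varlist as003 as007 as011 as017 {
    generate byte big_`var' = `var' > `max_asset' & !missing(`var')
}

* The mean of each 0/1 flag is the share of large values: is it reasonable?
summarize big_*

* Households with several large asset values may simply be wealthy
egen n_big_as = rowtotal(big_as*)
list laptop sampid respid n_big_as if n_big_as > 1
```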
Don't forget
• Monetary amounts usually have a skewed distribution
• Therefore outliers are more easily found on the maximum side!
• Usually no outliers will be found on the minimum side.
• Also check the smallest values (e.g. we had houses worth 0, 2 or 5 Euro, pensions of 0 or 1 Euro a month, …) – see also Dimitri's presentation
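Because automatic procedures rarely flag the lower tail, very small values can be listed explicitly. The sketch below assumes ho024 holds a property value and that ep078* matches only numeric pension amounts. The cutoff of 10 Euro is an arbitrary choice for the example.

```stata
* List implausibly small amounts (cutoff of 10 Euro is an assumption)
foreach var of varlist ho024 ep078* {
    display "`var'"
    list laptop sampid respid `var' if `var' < 10 & !missing(`var')
}
```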
Other issues
• Periodicity of payments (e.g. in Austria ALL(!) pensions are paid monthly only)
• Sometimes respondents reported additional payments (so that last month's income, ep201, is higher than in a usual month), but those additional payments (ep314) are equal to or even higher than ep201
• This would mean they have no or even negative regular income (obviously wrong)
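The last check can be sketched as below, assuming ep201 and ep314 are single numeric variables in the same currency and refer to the same month. The variable name regular_pay is hypothetical.

```stata
* Implied regular payment = last month's income minus additional payments
generate double regular_pay = ep201 - ep314 if !missing(ep201, ep314)

* Zero or negative implied regular income is implausible
list laptop sampid respid ep201 ep314 if regular_pay <= 0 & !missing(regular_pay)
```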