Data Cleaning in Financial Modules


Post on 07-Jan-2016








  • Data Cleaning in Financial Modules
    Workshop in Frankfurt
    Mario Schnalzenberger

  • Which modules exactly?
    EP: pensions and benefits (ep078*, ep094*); income from various sources (ep201, ep045, ep314, ep041, ep205, ep207, ep305)
    AS: financial assets and dividends from them (as003, as005, as007, as009, as011, as015, as017, as058, as021, as024, as027, as030, as032*, as034*, as042, as049, as051); total debts (as051); savings of other HH members (as069)
    HC: out-of-pocket payments (hc045, hc047, hc049, hc051)
    FT: financial transfers (ft004*, ft011*, ft018)
    HO: rents and value of real property (ho003, ho005, ho008, ho015, ho020, ho024, ho027, ho030)
    HH: household income (hh002, hh011, hh017)
    CO: amount spent (co002, co003, co004, co011)

  • 1. Check for outliers
    Merge all relevant sections (also include the IWER id from the original dataset; this makes it easier for the agency).
    Recode "Don't know" and "Refuse" as missing (e.g. use mvdecode _all, mv(9e+20=.a \ 8e+20=.b) in Stata).
    Use graphical methods (e.g. histograms, if that is easier for you).
    Search for outliers in each section: use percentiles, min, max or other automatic procedures (see hadimvo).
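    The recoding and percentile steps might look roughly like the sketch below. The data are invented, only ft004_1 and the two special codes are taken from the slides, and the "10 times the median" cut-off is an arbitrary assumption for illustration, not a rule from the workshop.

```stata
* Toy data: respid and the amounts are invented for illustration.
* ft004_1 is stored as double so that the special codes compare exactly.
clear
input long respid double ft004_1
1 500
2 1200
3 9e+20
4 8e+20
5 750
6 250000
end

* Recode the "don't know" / "refuse" codes to extended missing values
* (same command as on the slide).
mvdecode _all, mv(9e+20=.a \ 8e+20=.b)

* Percentiles, min and max in one table.
summarize ft004_1, detail
display "p50 = " r(p50) "  p99 = " r(p99) "  max = " r(max)

* Assumed rule of thumb: flag anything above 10 times the median.
* Missing values count as "very large" in Stata, so exclude them explicitly.
generate byte far_out = ft004_1 > 10 * r(p50) & !missing(ft004_1)
list respid ft004_1 if far_out
```

    A histogram of the same variable (histogram ft004_1) before and after correction gives the graphical view of the same check.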

  • Graphical check
    Without correcting the outliers, all data is concentrated in a few bins.
    The corrected data is distributed nicely.

  • Automatic check for the FT section
    * Run hadimvo on each FT amount and list the flagged interviews
    foreach var of varlist ft004_1 ft004_2 ft004_3 ft011_1 ft011_2 ft018_1 {
        noi disp "`var'"
        noi sum `var', detail
        hadimvo `var' if `var'>0, gen(g_`var')
        sort laptop date1 time1
        noi list laptop sampid respid `var' if g_`var'==1
    }

    * Count the outlier flags per row (rsum() is the older name of egen's rowtotal())
    egen g_ft=rsum(g_ft*)
    sort laptop date1 time1
    noi disp "Rows with more than one outlier in FT"
    noi list laptop sampid respid g_ft if g_ft>1
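    To see what the row count does, here is a small sketch on invented flag variables. The g_* names follow the pattern from the loop above, but the 0/1 values are made up, and n_out is a hypothetical name chosen for this example. It uses rowtotal(), the current name of the rsum() function.

```stata
* Toy outlier flags (1 = flagged by the automatic procedure); values invented.
clear
input respid g_ft004_1 g_ft004_2 g_ft011_1
1 0 0 0
2 1 0 1
3 0 1 0
4 1 1 1
end

* Number of flagged FT variables per row.
egen n_out = rowtotal(g_ft004_1 g_ft004_2 g_ft011_1)

* Rows with more than one outlier deserve a closer look first.
list respid n_out if n_out > 1
```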

  • Look at those outputs
    Find real outliers and typos.
    Find currency problems (this may be hard in various countries).
    Make a list (e.g. an Excel file) for the agency.
    Those issues should be easy to fix.

  • In addition to asking the agency
    Check items that are recorded more than once in the data:
    Gross and net income from the same source (attention: they sometimes refer to different years!)
    Income (ep078*, ep094* and others) + hh002 + hh011 ~= hh017

    It often occurred that hh002 was set to exactly the partner's income, even though this is wrong (see the question text).
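    A minimal sketch of both checks, on invented household data. Here ep_income stands for the already summed income items and partner_income for the partner's income; both are hypothetical helper variables, and the 25% tolerance is an assumption, not a value from the workshop.

```stata
* Toy household data; all amounts are invented.
clear
input hhid ep_income partner_income hh002 hh011 hh017
1 12000  8000  3000 500 15500
2 20000 15000 15000   0 35000
3  9000     0  1000   0 40000
end

* Components should add up to roughly the reported total hh017.
generate double inc_sum = ep_income + hh002 + hh011
generate double rel_gap = abs(inc_sum - hh017) / hh017 if hh017 > 0 & !missing(hh017)

* Assumed tolerance: flag gaps of more than 25% of hh017.
generate byte flag_sum = rel_gap > 0.25 & !missing(rel_gap)

* hh002 exactly equal to the partner's income is suspicious.
generate byte flag_hh002 = hh002 == partner_income & hh002 > 0 & !missing(hh002)

list hhid inc_sum hh017 flag_sum flag_hh002
```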

  • Find absolute brackets
    Maximum pensions (maybe per type; in Austria 3850 Euro per month)
    Find some useful brackets for assets (we used 100000 Euro for many of them)
    Look at the proportion of large values in the whole dataset (is it reasonable?)
    Look for wealthy households
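    Such brackets translate into simple flag variables. The sketch below uses the two limits quoted on the slide; the data are invented, and treating ep078_1 as a monthly pension amount and as003 as an asset amount is an assumption made for the example.

```stata
* Toy data; amounts are invented.
clear
input respid ep078_1 as003
1  1200    5000
2  3850   80000
3 12000  150000
4   900 2000000
end

* Limits from the slide: 3850 Euro per month (Austrian pensions),
* 100000 Euro for many asset items.
generate byte over_pension = ep078_1 > 3850 & !missing(ep078_1)
generate byte over_asset   = as003 > 100000 & !missing(as003)

* The mean of a 0/1 flag is the share of large values in the dataset.
summarize over_pension over_asset

list respid ep078_1 as003 if over_pension | over_asset
```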

  • Don't forget
    We usually have a skewed distribution (monetary amounts).
    Therefore outliers are found more easily on the maximum side!
    Usually no outliers will be found on the minimum side.
    Check the smallest values as well (e.g. we had houses worth 0, 2 or 5 Euro and pensions of 0 or 1 Euro a month); see also Dimitris' presentation.
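    Because automatic procedures rarely flag the lower tail, a fixed floor is one way to catch such values. In this sketch the data are invented, ho024 merely stands in for a house-value item, and the floors of 1000 and 10 Euro are assumptions chosen for illustration.

```stata
* Toy data; amounts are invented.
clear
input respid ho024 ep078_1
1 250000  900
2      2 1200
3      0    1
4 180000    0
end

* Assumed floors: a house below 1000 Euro or a monthly pension
* below 10 Euro is implausible and should be reviewed.
generate byte low_house   = ho024 < 1000 & !missing(ho024)
generate byte low_pension = ep078_1 < 10 & !missing(ep078_1)

list respid ho024 ep078_1 if low_house | low_pension
```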

  • Other issues
    Periodicity of payments (e.g. in Austria ALL(!) pensions are paid monthly only)

    Sometimes respondents reported that they got additional payments last month (i.e. ep201 is higher than in a usual month), but those payments (ep314) are equal to or even higher than ep201. This would mean they have no regular income payment, or even a negative one, which is obviously wrong.
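    This inconsistency is easy to flag. A minimal sketch on invented data, assuming ep201 is last month's payment including extras and ep314 is the additional payment, as described above; regular and flag_extra are hypothetical names.

```stata
* Toy data; amounts are invented.
clear
input respid ep201 ep314
1 1500  300
2 1200 1200
3 1000 1400
4 1800    .
end

* Implied regular payment: last month's payment minus the extras.
generate double regular = ep201 - ep314

* Extras that match or exceed the whole payment leave nothing regular.
generate byte flag_extra = ep314 >= ep201 & !missing(ep201, ep314)

list respid ep201 ep314 regular if flag_extra
```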

