All the answers? Statistics New Zealand’s Integrated Data Infrastructure. Paper by Felibel Zabala, Rodney Jer, Jamas Enright and Allyson Seyb. Presented by Felibel Zabala, Sept 2012.


Page 1

All the answers? Statistics New Zealand’s Integrated Data Infrastructure

Paper by Felibel Zabala, Rodney Jer, Jamas Enright and Allyson Seyb

Presented by Felibel Zabala

Sept 2012

Page 2

Statistics New Zealand’s Integrated Data Infrastructure (IDI)

Merges data from different suppliers, including Statistics NZ

Quality varies across the datasets, both within and between them

Page 3

Statistics New Zealand’s Integrated Data Infrastructure (IDI)

Linking clean datasets is not easy; linking datasets of variable quality is much more difficult

Importance of an effective and efficient editing strategy

Page 4

Main objective

Present some of the issues with, and solutions for, any linked administrative dataset, with a focus on one of Statistics NZ’s first integrated datasets, the Linked Employer-Employee Data (LEED)

Page 5

LEED

Provides the backbone of the IDI prototype

Links longitudinal business data from Statistics NZ’s Business Frame to a longitudinal series of payroll tax data from Inland Revenue (IRD)

Used to produce quarterly statistics that measure labour market dynamics at various levels, eg filled jobs, worker flows, and total earnings
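
As a rough illustration of how linked employer-employee records could yield quarterly measures like these, the sketch below counts filled jobs, worker flows, and total earnings from a handful of made-up job records. The record layout, the identifiers, and the definitions of accession and separation (a job present in one quarter but absent from the adjacent one) are assumptions for illustration, not LEED's official definitions.

```python
from collections import defaultdict

# Hypothetical linked records: (quarter, employer_id, employee_id, earnings).
# A "job" is assumed to be an employer-employee pair with earnings in a quarter.
records = [
    ("2012Q1", "E1", "P1", 12000.0),
    ("2012Q1", "E1", "P2", 9000.0),
    ("2012Q2", "E1", "P1", 12500.0),
    ("2012Q2", "E2", "P2", 9500.0),
    ("2012Q2", "E2", "P3", 7000.0),
]

def quarterly_summary(records):
    """Return {quarter: filled jobs, accessions, separations, total earnings}."""
    jobs = defaultdict(set)        # quarter -> set of (employer, employee) pairs
    earnings = defaultdict(float)  # quarter -> total earnings
    for quarter, employer, employee, amount in records:
        jobs[quarter].add((employer, employee))
        earnings[quarter] += amount

    summary = {}
    previous = None
    for quarter in sorted(jobs):  # "YYYYQn" strings sort chronologically
        current = jobs[quarter]
        summary[quarter] = {
            "filled_jobs": len(current),
            # Accessions: jobs present now but not in the previous quarter.
            "accessions": len(current - previous) if previous is not None else None,
            # Separations: jobs from the previous quarter that are gone now.
            "separations": len(previous - current) if previous is not None else None,
            "total_earnings": earnings[quarter],
        }
        previous = current
    return summary

summary = quarterly_summary(records)
```

In this toy data the second quarter has three filled jobs, two accessions, and one separation, because employee P2 moves from employer E1 to E2 and P3 starts a new job. Flows are left undefined for the first quarter since there is no earlier quarter to compare against.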

Page 6

LEED payroll data

Collected from employers for New Zealand’s taxation system through IRD’s Employer Monthly Schedule (EMS)

Information available from EMS:

Employer and employee names and IRD numbers

Taxable earnings for work performed, taxed at source

Income tax deductions (pay-as-you-earn or PAYE, withholding tax, child support payment, student loan indicator amount)

Start and finish dates of employment

Page 7

LEED – additional details

Also includes payments made to beneficiaries by the government

Contains a subset of the self-employed

Page 8

LEED – additional details (cont’d)

Collection unit: the legal entity that files the EMS return

Statistical unit (the ‘employer’ in LEED): the geographical or physical location of the business

Page 9

Methods of integration in LEED

Figure 1. Unit record links in LEED

Page 10

Figure 1. Unit record links in LEED

Linking employer to enterprise

Page 11

Figure 1. Unit record links in LEED

Linking employer longitudinally

Page 12

Figure 1. Unit record links in LEED

Linking enterprise and geo longitudinally

Page 13

Figure 1. Unit record links in LEED

Linking employee longitudinally

Page 14

Variables edited in LEED

IRD numbers

Gross earnings

Date of birth

Sex

Workplace of an employee

Start and end dates of employment

Editing strategy: Do not replace any IRD data unless there is strong evidence that it is in error

Page 15

Variables edited in LEED (cont’d)

IRD numbers

Imputation of sex

Imputation of start and end dates of employment
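
The slides do not say how IRD numbers are checked. One check that could flag mistyped numbers is the modulus-11 check digit that Inland Revenue publishes for IRD numbers. The sketch below assumes that published scheme (primary and secondary weights as shown) and omits the accompanying valid-range test; it is not taken from the paper, and the function names are hypothetical.

```python
PRIMARY_WEIGHTS = (3, 2, 7, 6, 5, 4, 3, 2)
SECONDARY_WEIGHTS = (7, 4, 3, 2, 5, 2, 7, 6)

def _check_value(base_digits, weights):
    """Return the modulus-11 check value (0 to 10) for eight base digits."""
    remainder = sum(d * w for d, w in zip(base_digits, weights)) % 11
    return 0 if remainder == 0 else 11 - remainder

def ird_check_digit_ok(ird_number):
    """True if the last digit matches the check digit computed from the rest."""
    text = str(ird_number).replace("-", "").strip()
    if not (text.isascii() and text.isdigit()) or not 8 <= len(text) <= 9:
        return False
    base = [int(c) for c in text[:-1].zfill(8)]  # pad 7-digit bases to 8 digits
    expected = _check_value(base, PRIMARY_WEIGHTS)
    if expected == 10:
        # 10 cannot be a single digit, so the secondary weights are tried.
        expected = _check_value(base, SECONDARY_WEIGHTS)
        if expected == 10:
            return False
    return expected == int(text[-1])
```

A failed check only flags the number for review. In keeping with the editing strategy of not replacing IRD data without strong evidence, the sketch does not attempt to correct anything.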

Page 16

Variables edited in LEED (cont’d)

Gross earnings: presence of systematic errors; detection method (use of a ratio edit, PAYE/gross earnings); imputation method

Date of birth: presence of systematic errors; detection method (edit rules based on an employee’s age against some events); imputation method
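
The two detection methods named on this slide can be sketched as simple flagging functions. The acceptance bounds for the PAYE/gross earnings ratio, the choice of job start as the "event", and the age limits are all hypothetical values picked for illustration; the slides do not give the actual rules.

```python
from datetime import date

# Hypothetical acceptance bounds for PAYE / gross earnings; real bounds would
# be chosen from the tax schedule and the observed data.
RATIO_LOW, RATIO_HIGH = 0.05, 0.45

def fails_ratio_edit(paye, gross, low=RATIO_LOW, high=RATIO_HIGH):
    """Flag a record whose PAYE-to-gross ratio falls outside [low, high]."""
    if gross <= 0:
        return paye > 0  # tax deducted without positive earnings is suspicious
    return not (low <= paye / gross <= high)

def age_on(birth, event):
    """Completed years of age on the event date."""
    before_birthday = (event.month, event.day) < (birth.month, birth.day)
    return event.year - birth.year - before_birthday

def fails_age_edit(birth, job_start, min_age=10, max_age=100):
    """Flag a date of birth implying an implausible age at the start of a job."""
    return not (min_age <= age_on(birth, job_start) <= max_age)
```

For example, `fails_ratio_edit(200, 10)` flags a record where the tax deducted exceeds the earnings reported, the kind of pattern a systematic error in gross earnings might produce, while `fails_age_edit(date(2011, 1, 1), date(2012, 3, 1))` flags an employee who would have been one year old when the job began.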

Page 17

Variables edited in LEED (cont’d)

Imputation of workplace of an employee

Uses the transportation method, where the imputed workplace of an employee is the geo that minimises the distance from the employee’s home address to the geo, subject to the constraints that each employee is assigned to a geo, and the total number of employees allocated to a geo equals the number of employees expected from that geo
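
The allocation described above can be sketched on a toy example. The version below assumes the objective is the total home-to-geo distance over all employees, uses straight-line distance between made-up coordinates, and finds the optimum by brute force, which is only workable for a handful of employees. A production system would use a proper transportation or linear-programming solver; the function name and data layout are hypothetical.

```python
from itertools import permutations
from math import dist

def allocate_workplaces(homes, geos, expected):
    """Assign each employee to a geo, minimising total home-to-geo distance.

    homes:    {employee_id: (x, y)}
    geos:     {geo_id: (x, y)}
    expected: {geo_id: number of employees the geo should receive}
    Brute force over all slot orderings, so only suitable for tiny inputs.
    """
    employees = sorted(homes)
    # One "slot" per expected employee at each geo.
    slots = [g for g in sorted(geos) for _ in range(expected[g])]
    if len(slots) != len(employees):
        raise ValueError("expected counts must sum to the number of employees")

    best_cost, best_plan = float("inf"), None
    for ordering in permutations(slots):
        cost = sum(dist(homes[e], geos[g]) for e, g in zip(employees, ordering))
        if cost < best_cost:
            best_cost, best_plan = cost, dict(zip(employees, ordering))
    return best_plan, best_cost

homes = {"e1": (0, 0), "e2": (1, 0), "e3": (9, 0), "e4": (4, 0)}
geos = {"A": (0, 0), "B": (10, 0)}
plan, cost = allocate_workplaces(homes, geos, {"A": 2, "B": 2})
```

Here employee e4 lives closer to geo A than to geo B, but is allocated to B because A's two expected places go to e1 and e2. This shows how the count constraint can override the nearest-geo choice for an individual.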

Page 18

The IDI prototype

Datasets linked to LEED

Benefit data

Tertiary education data

Administrative tertiary education data and student loans and allowances data

Statistics NZ’s Household Labour Force Survey (HLFS) and its supplementary surveys

Page 19

The IDI prototype (cont’d)

Other linked dataset in the IDI

The Longitudinal Business Database (LBD) prototype includes information on business demographics, financial data, employment, goods exports, government assistance, and management practices

Page 20

The IDI prototype (cont’d)

Figure 2. Linking in the IDI prototype

Page 21

Issues in linking in the IDI

Lack of a common identifier across datasets

Main variables in the Central Linking Concordance (CLC): IRD numbers, passport numbers, and student IDs, where available

Use of demographic variables as partial identifiers
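
One way to picture linking without a common identifier is a two-stage rule: link on a shared identifier when both records carry one, and otherwise fall back on agreement between demographic variables. The sketch below is an illustrative assumption, not the IDI's linking method; the field names, the rule that a conflicting identifier blocks a link, and the agreement threshold are all hypothetical.

```python
ID_FIELDS = ("ird_number", "passport_number", "student_id")
PARTIAL_FIELDS = ("surname", "first_name", "sex", "date_of_birth")

def link_decision(rec_a, rec_b, min_partial_agreement=4):
    """Return 'id', 'partial', or None for a pair of person records (dicts)."""
    for field in ID_FIELDS:
        a, b = rec_a.get(field), rec_b.get(field)
        if a and b:
            # Both sources hold this identifier: agreement links, conflict blocks.
            return "id" if a == b else None
    # No shared identifier, so fall back on demographic agreement.
    agreements = sum(
        1 for field in PARTIAL_FIELDS
        if rec_a.get(field) and rec_a.get(field) == rec_b.get(field)
    )
    return "partial" if agreements >= min_partial_agreement else None

tax = {"ird_number": "49091850", "surname": "SMITH", "first_name": "ANA",
       "sex": "F", "date_of_birth": "1990-06-15"}
student = {"student_id": "S123", "surname": "SMITH", "first_name": "ANA",
           "sex": "F", "date_of_birth": "1990-06-15"}
decision = link_decision(tax, student)
```

The tax and student records share no identifier, so the decision rests on the four demographic fields agreeing. Real record linkage would normally weight each field and allow for near-matches in names rather than require exact equality.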

Page 22

Issues in linking in the IDI (cont’d)

Need for standard software for automated data linkage that is robust to data changes

Timing of receipt of data

Page 23

Editing strategy in the IDI

Focus on ensuring high-quality linking variables are used in linking. Examples:

Validity rules were used to edit names across data sources

Sex and date of birth are reformatted to ensure common coding is used across data sources

Where inconsistencies occur in records linked from two different data sources, it is important to know which of the two data sources is more reliable
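
Reformatting sex and date of birth to a common coding might look like the sketch below. The source codings, the target codes, and the accepted date formats (with slash-separated dates read as day first) are hypothetical, since the slides do not list the codings used by each supplier.

```python
from datetime import datetime

# Hypothetical source codings mapped onto one common code set.
SEX_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "2": "F"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

def standardise_sex(value):
    """Map a source-specific sex code to 'M' or 'F'; None if unrecognised."""
    return SEX_CODES.get(str(value).strip().lower())

def standardise_dob(value):
    """Parse a date written in any listed format and return ISO 'YYYY-MM-DD'."""
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # left missing rather than guessed
```

With these mappings, `standardise_dob("15/06/1990")` and `standardise_dob("19900615")` both give `"1990-06-15"`, so records from sources that store the same birth date differently can be compared directly. Unrecognised values are returned as missing rather than forced into a code.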

Page 24

Editing strategy in the IDI (cont’d)

Process to resolve inconsistencies in personal details:

Most common value present in the datasets should be kept

Prioritise the data sources to determine the order of retaining their values
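
The two rules on this slide can be combined in more than one way. The sketch below assumes the most common value wins and source priority is used only to break ties; both that ordering and the source names in the priority list are assumptions for illustration.

```python
from collections import Counter

# Hypothetical reliability ranking, most trusted source first.
SOURCE_PRIORITY = ("tax", "education", "benefit")

def resolve(values_by_source, priority=SOURCE_PRIORITY):
    """Pick one value: the most common, with ties broken by source priority."""
    present = {s: v for s, v in values_by_source.items() if v is not None}
    if not present:
        return None
    counts = Counter(present.values())
    top = max(counts.values())
    candidates = {v for v, n in counts.items() if n == top}
    if len(candidates) == 1:
        return candidates.pop()
    for source in priority:
        if present.get(source) in candidates:
            return present[source]
    return None  # tie among sources outside the priority list

dob = resolve({"tax": "1990-06-15", "education": "1990-06-15",
               "benefit": "1990-05-16"})
sex = resolve({"tax": "F", "education": "M"})
```

In the first call two of the three sources agree, so their date of birth is kept. In the second call the sources disagree one against one, so the value from the higher-priority source is retained.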

Page 25

Editing strategy in the IDI (cont’d)

Editing strategy should be able to:

Edit inconsistencies for the same unit across different sources

Treat erroneous and missing variables in a record

Ensure consistency in variables across a record for a time period and over time

Page 26

Next steps

Build of the IDI with a focus on improving the linking methodology

Determine standard quality measures for outputs produced using administrative data

Page 27

Next steps (cont’d)

Redevelopment of LEED and SLA systems

Investigate the use of geospatial information to improve the employee allocation method

Review of the editing of gross earnings

Investigate the use of Banff