
Unleash the power of ABS statistics through methodological innovation

By Dr Siu-Ming Tam, Chief Methodologist

March 2017

Views expressed in this talk are those of the author and do not necessarily represent those of the Australian Bureau of Statistics


Outline

• ABS transformation

• Methodology transformation

– Big Data challenges

– Data integration and data fusion

– Data access

• Methodological innovation

– Borrowing strength over time

– Measurement error model


Drivers of change in ABS statistics

• Need for faster decisions

• Data deluge

• More evidence-based decision making

• Ageing systems and manual processes

• Growing expectations

• New statistical possibilities and opportunities

ABS Transformation - Who are we transforming for?

Our partners

• Greater responsiveness

• Improved collaboration

• Quicker to market

• Less red tape

Our community

• Improved data matching

• Informed use of statistics

• Evidence based policy and programs

• Less burden on households and businesses

Our organisation

• Ongoing sustainability

• Greater influence and reach

• More dynamic – able to respond to future challenges

Our people

• Greater flexibility

• More satisfying work

• New skills and opportunities

• More diverse and engaged culture

Six dimensions of transformation to achieve ABS goals


Methodology Architecture (MA) - Tam (2014)

• Being part of EA, MA is a transformation plan for methodology

• Vision for ABS MA

– To provide a set of methods that underpins the products and process vision of the ABS Transformation program

• MA is supported by 5 key “rules of engagement”

– Innovate

– Industrialise

– Build capability

– Contemporise

– Build support


Methodology Transformation


Transformational change in methodologies

• From classical to contemporary statistical methods

– from Designed data to Found data

– from direct measurements to modelled data

– from single source to multiple sources

– from siloed data sets to integrated data sets

• From limited access to unit record files (URFs) to more liberal access



Data Deluge - Big Data opportunities


Big Data and Big Challenges - (Tam and Clarke, 2015)

• ABS objective

– Harness Big Data sources to create a richer, more dynamic and focused statistical picture of Australia for better informed decision-making

• Challenges

– Business benefit

– Privacy and public trust

– Technological feasibility

– Data acquisition

– Data integrity

– Methodological soundness

– How to make valid statistical inferences (Tam, 2015)


Big Data = Big Sources, but not entirely foreign to official statisticians, e.g. Administrative Records, Scanner Data

• Behaviour metrics and online opinion – potentially large inherent statistical biases

• Administrative Records

– Tax Records

– Medical Records

– Bank Records

• Commercial Transactions

– Credit Card Transactions

– Scanner Transactions

– Online Purchases

• Sensor Data

– Satellite Imagery

– Ground Sensor Data

– Location Data

• Behaviour Metrics

– Search Engine Queries

– Web Page Views and Navigation

– Media Subscriptions

• Online Opinion

– Social Media Comments

– Twitter Feeds

Data Analytics – What problems are they trying to solve?

Machine Learning methods


Statistical methods


Big Inference – One possible approach - (Tam, 1987, 2015)

Using Big Data

• Use a sample to calibrate the Big Data (treated as “covariates”) using ground truths

• Calibrate using a linear model (with time varying coefficients – Dynamic Model)

• Estimate parameters (using Frequentist/Bayesian approaches)

• Predict the non-sampled values using the covariates

• Or use the Generalised Regression Estimation (GREG) framework for estimation (parameters estimated using design-based methods)

The simple case – no missing data or missing covariates
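The calibrate-then-predict steps listed on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of Tam (1987, 2015): the population, the single covariate `x`, the static linear model and the simple random sample are all hypothetical, and the time-varying (dynamic) coefficients mentioned on the slide are left out.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: x is a Big Data covariate known for every unit,
# y is the "ground truth" that is only observed for sampled units.
N, n = 10_000, 400
x = rng.gamma(shape=2.0, scale=5.0, size=N)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=N)

# Simple random sample without replacement; every design weight is N / n.
sample = rng.choice(N, size=n, replace=False)
in_sample = np.zeros(N, dtype=bool)
in_sample[sample] = True
w = N / n

# Calibrate: regress the sampled ground truth on the covariate.
X = np.column_stack([np.ones(N), x])
beta, *_ = np.linalg.lstsq(X[in_sample], y[in_sample], rcond=None)

# Prediction approach: observed y for sampled units, predicted y for the rest.
total_pred = y[in_sample].sum() + (X[~in_sample] @ beta).sum()

# GREG: population sum of predictions plus design-weighted sample residuals.
# With equal weights and an intercept the residuals sum to zero, so the two
# totals coincide in this simple setting.
resid = y[in_sample] - X[in_sample] @ beta
total_greg = (X @ beta).sum() + w * resid.sum()

# Expansion estimator that ignores the covariate, for comparison.
total_ht = w * y[in_sample].sum()

print(f"true total      : {y.sum():,.0f}")
print(f"prediction total: {total_pred:,.0f}")
print(f"GREG total      : {total_greg:,.0f}")
print(f"expansion total : {total_ht:,.0f}")
```

Under an unequal-probability design the regression would be fitted with the design weights, and the two totals would generally differ.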


An ABS Pilot Study

To determine the feasibility of using Earth Observations data to:

• Distinguish crop types

• Estimate the area of land under each crop

• Predict crop yield

Barley or Wheat?

Summary of Indicative Results

| | Region | Average Proportion Correctly Classified |
| Crop classification | SE QLD | 78.5% |
| Crop presence | Mallee | 83% |

Survey Process Augmented by Big Data*

*Big Data process augmented by survey data – Paul Biemer (2016)

Frame Population Sample

Randomization Observations

Data Integration and Processing

Modelling & Adjustment

Estimation Statistical Inference

Validation

Missing Covariate, and Missing Data challenges



Data integration vis-a-vis fusion

• Integration – Fellegi and Sunter (1969)

• Fusion – Kim et al (2016)


One file – Sample A ∪ B

Note – ABS uses the EM algorithm to estimate the “m” and “u” probabilities (Samuels, 2012)
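To make the note about the EM algorithm concrete, here is a minimal sketch of EM for the “m” and “u” probabilities of the Fellegi-Sunter model. It assumes binary agree/disagree comparisons that are conditionally independent across fields, and it runs on simulated record pairs; the function name, starting values and simulated parameters are hypothetical, and this is not the ABS implementation described in Samuels (2012).

```python
import numpy as np

def em_fellegi_sunter(gamma, m, u, p, n_iter=200):
    """Estimate m = P(agree | match), u = P(agree | non-match) and the
    match proportion p from a (pairs x fields) 0/1 agreement array."""
    gamma = np.asarray(gamma, dtype=float)
    m = np.array(m, dtype=float)
    u = np.array(u, dtype=float)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        like_m = p * np.prod(m**gamma * (1 - m) ** (1 - gamma), axis=1)
        like_u = (1 - p) * np.prod(u**gamma * (1 - u) ** (1 - gamma), axis=1)
        g = like_m / (like_m + like_u)
        # M-step: agreement rates weighted by the posterior match probabilities.
        m = np.clip(g @ gamma / g.sum(), 1e-6, 1 - 1e-6)
        u = np.clip((1 - g) @ gamma / (1 - g).sum(), 1e-6, 1 - 1e-6)
        p = g.mean()
    return m, u, p

# Simulated comparison data with known (hypothetical) parameters.
rng = np.random.default_rng(0)
true_m = np.array([0.95, 0.90, 0.85, 0.90])
true_u = np.array([0.05, 0.10, 0.20, 0.02])
true_p = 0.10
n_pairs = 20_000
is_match = rng.random(n_pairs) < true_p
agree_prob = np.where(is_match[:, None], true_m, true_u)
gamma = (rng.random((n_pairs, 4)) < agree_prob).astype(int)

# Start with m above u so the "match" class keeps its label.
m_hat, u_hat, p_hat = em_fellegi_sunter(gamma, m=[0.8] * 4, u=[0.2] * 4, p=0.05)
print("m:", m_hat.round(3))
print("u:", u_hat.round(3))
print("p:", round(float(p_hat), 3))
```

The estimated m and u give each field's agreement and disagreement weights, log(m/u) and log((1-m)/(1-u)), which are summed to score a candidate pair.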


Data Utility versus Disclosure Risk for Unit Record Files

Disclosure Risk

• Spontaneous Recognition

• Matching risk

• Higher risk for unit record than aggregated data

Data Utility

• Ability to use the data to draw valid conclusions

Protections

• Perturbation

• Cell Suppression

• Collapsing of Categories

• Sampling

• Record masking

• Substitution of Values

From “4 Safes” to the “Five Safes” Framework - (Ritchie, 2014)

• Safe people – Can the person be trusted to use the data appropriately?

• Safe project – Is the specific use of the data appropriate?

• Safe setting – How does the mode of access limit the risk of disclosure?

• Safe data – How much protection is to be applied to the data?

• Safe output – How much control is applied to ensure the output is non-disclosive?

A multidimensional approach to disclosure risk assessment

Key Equation: Pr(D) = Pr(D|A)Pr(A)
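The slide does not define D and A. One reading, offered here as an assumption only, is that D is a disclosure and A is an event that must happen before a disclosure can occur (for example, an attempt). Under that reading the key equation is the law of total probability with the second term set to zero:

```latex
\Pr(D) = \Pr(D \mid A)\,\Pr(A) + \Pr(D \mid A^{c})\,\Pr(A^{c})
       = \Pr(D \mid A)\,\Pr(A)
       \quad \text{when } \Pr(D \mid A^{c}) = 0 .
```

The factorisation shows that risk can be lowered through either factor, which fits a multidimensional assessment: some controls reduce Pr(A), others reduce Pr(D|A).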



Temporal modelling

• The shape of the curve is not constant over time

• Temporal modelling would seem appropriate

– Dynamic linear model for production data

– Dynamic logistic regression for binary data

• Modelling the “beta” in GREG over time

– State Transition Equation
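As a sketch of what modelling “beta” over time can look like, the code below runs a scalar Kalman filter for a regression coefficient that follows a random walk. The random-walk state transition equation, the known variances and the simulated data are assumptions made for illustration; the slides do not specify the transition equation used.

```python
import numpy as np

def filter_time_varying_beta(y, x, obs_var, state_var, beta0=0.0, p0=1e6):
    """Kalman filter for y_t = x_t * beta_t + e_t with the state
    transition equation beta_t = beta_{t-1} + eta_t (a random walk)."""
    beta, p = beta0, p0
    filtered = np.empty(len(y))
    for t in range(len(y)):
        p = p + state_var                   # predict: random walk adds variance
        innovation = y[t] - x[t] * beta     # one-step-ahead prediction error
        f = x[t] ** 2 * p + obs_var         # variance of the prediction error
        k = p * x[t] / f                    # Kalman gain
        beta = beta + k * innovation        # update the state estimate
        p = (1 - k * x[t]) * p              # update the state variance
        filtered[t] = beta
    return filtered

# Simulated data with a slowly drifting coefficient (hypothetical values).
rng = np.random.default_rng(1)
T = 200
beta_true = 2.0 + np.cumsum(rng.normal(scale=0.05, size=T))
x = rng.uniform(1.0, 3.0, size=T)
y = x * beta_true + rng.normal(scale=0.5, size=T)

beta_filtered = filter_time_varying_beta(y, x, obs_var=0.25, state_var=0.0025)
print("mean absolute error:", np.abs(beta_filtered - beta_true).mean().round(3))
```

A large `p0` gives a vague prior, so the first observations dominate the early estimates. Setting `state_var=0` recovers a constant coefficient estimated recursively.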


State Space Modelling for Satellite Imagery data – Tam (1987, 2015)



Where do survey sampling errors go if they are not removed?


Better signal extraction using Structural Time Series model

• Explicit modelling of sampling error for survey estimates

– $y_t = \vartheta_t + u_t$; $\vartheta_t = T_t + S_t + I_t$

– Modelling for trend (e.g. local linear trend), seasonal effects (e.g. dummy seasonal)

• Option 1 – put $\hat{\vartheta}_{t|t}$ through the linear filters (e.g. X13 ARIMA) to decompose into trend, seasonal effects and irregular (ABS option)

• Option 2 – use $\hat{T}_{t|t}$ and $\hat{S}_{t|t}$ as trend and seasonal effects estimates

• Benefit for seasonally adjusted estimates for areas with relatively small sample sizes
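For reference, the model on this slide can be written out using the textbook forms of the local linear trend and the dummy seasonal. The disturbance symbols $\eta_t$, $\zeta_t$, $\omega_t$ (zero-mean and mutually independent) and the number of seasons $s$ are notation assumed here; the slides do not state the exact specification used by the ABS. Here $u_t$ is the survey sampling error and $\nu_t$ is the trend slope.

```latex
\begin{aligned}
y_t   &= \vartheta_t + u_t, \qquad \vartheta_t = T_t + S_t + I_t \\
T_t   &= T_{t-1} + \nu_{t-1} + \eta_t \\
\nu_t &= \nu_{t-1} + \zeta_t \\
S_t   &= -\sum_{j=1}^{s-1} S_{t-j} + \omega_t
\end{aligned}
```

Because $u_t$ has its own term, the filter can separate sampling error from the signal $\vartheta_t$ instead of letting it leak into the trend, seasonal or irregular components.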


ACT Employment and Unemployment Estimates

Small Domain Estimation


SDE with repeated surveys


Multiple data sources to improve survey estimates

• Borrowing strength from multiple sources (Harvey and Chung, 2000; Zhang and Honchar, 2016)

• Using Unemployment Benefit Claimant Counts to improve the LFS estimates

– Exploit the correlation in the error covariance matrix

– Bivariate State Space Model (aka Seemingly Unrelated Time Series Equations model - SUTSE)

• Quality assurance tool for ABS LFS unemployment estimates
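The gain from exploiting the correlation in the error covariance matrix can be illustrated with a bivariate normal calculation. This is only a sketch of the underlying idea, not the SUTSE model itself: the standard deviations, the correlation and the observed value below are hypothetical numbers.

```python
import numpy as np

# Hypothetical disturbance standard deviations and correlation for the
# LFS series (first) and the claimant count series (second).
sigma_lfs, sigma_cc, rho = 0.02, 0.03, 0.9
cov = np.array([
    [sigma_lfs**2,               rho * sigma_lfs * sigma_cc],
    [rho * sigma_lfs * sigma_cc, sigma_cc**2],
])

def condition_on_second(cov, observed_second):
    """Mean and variance of the first zero-mean normal disturbance
    given the observed value of the second."""
    gain = cov[0, 1] / cov[1, 1]
    cond_mean = gain * observed_second
    cond_var = cov[0, 0] - gain * cov[0, 1]
    return cond_mean, cond_var

cond_mean, cond_var = condition_on_second(cov, observed_second=0.03)
variance_reduction = 1 - cond_var / cov[0, 0]   # equals rho**2

print("conditional mean  :", round(float(cond_mean), 4))
print("variance reduction:", round(float(variance_reduction), 4))
```

The proportional reduction in variance equals the squared correlation, so the second series only helps when its disturbances are strongly correlated with those of the first.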


Seemingly Unrelated Time Series Equations Model (SUTSE)

Case study of Unemployment – LFS estimates vs benefit claimant count

[Figure: two time-series panels, L_lfs_unemp_o-Slope and L_cc_total-Slope, with time axes running from 1996 to 2016]

Smoothed estimates of LFS unemployment trend slope ($\nu_{1,t|T}$) vs Claimant Count trend slope ($\nu_{2,t|T}$)

Case study of Unemployment (cont.)

[Figure: chart for March 2016 with series “LFS” and “SSM predicted”, data labels 761.1 and 763.3, 95% low and high bounds, and a vertical axis from 0.0 to 800.0]

Prediction of total unemployment using known CC data

Concluding remarks

• Methodology innovation is fundamental to support ABS transformation

• Methodological transformational change

– More measured use of statistical models

– Reforming data access

• Methodological innovation

– Use measurement error model for aggregate stats

– Use of SSM for a number of applications involving time

• Change programs

– Need to build support and buy-in from subject matter colleagues

Selected References

Biemer, P. (2016). Keynote address to the 2016 International Survey Error Workshop.

Fellegi, I. and Sunter, A.B. (1969). A theory of record linkage. Journal of the American Statistical Association, 64, 1183-1210.

Harvey, A. and Chung, C.H. (2000). Estimating the underlying change in unemployment in the UK. Journal of the Royal Statistical Society, Series A, 3, 303-339.

Kim, J.K., Berg, E. and Park, T. (2016). Statistical matching using fractional imputation. Survey Methodology, 40, 19-40.

Ritchie, F. (2014). Access to Sensitive Data: Satisfying Objectives Rather than Constraints. Journal of Official Statistics, 30, 533-545.

Samuels, C. (2012). Using the EM algorithm to estimate the parameters of the Fellegi-Sunter model for data linking.

Tam, S-M. (1987). Analysis of a repeated survey using a dynamic linear model. International Statistical Review, 55, 63-73.

Tam, S-M. (2014). Methodology architecture – a roadmap for new methodological directions in the Australian Bureau of Statistics. Journal of Official Statistics, 30, 371-375.

Tam, S-M. (2015). A Statistical Framework for Analysing Big Data. Survey Statistician, 72, 36-51.

Tam, S-M. and Clarke, F. (2015). Big Data, Official Statistics and Some Initiatives by the ABS. International Statistical Review, 83, 436-448.

Thomsen, I.B. (1973). A note on the efficiency of weighting subclass means to remove the effects of non-response when analysing survey data. Statistics Norway. Unpublished manuscript.

Zhang, M. and Honchar, O. (2016). Predicting survey estimates by state space models using multiple data sources. Australian Bureau of Statistics. Unpublished manuscript.


Questions?


Siu-Ming.Tam@abs.gov.au
