Transcript
Page 1: USING MACHINE LEARNING ALGORITHMS …

Using Machine Learning Algorithms …

… to reduce clerical effort and improve the quality of administrative data sources.

Workshop on Access to Administrative Data Sources

Brussels, 13-14 September 2016

© Federal Statistical Office of Germany | Department E 105

Page 2: USING MACHINE LEARNING ALGORITHMS …

Crafts Trades – What's that?

Sources: Deutsche Fotothek; Marketing Handwerk

Page 3: USING MACHINE LEARNING ALGORITHMS …

Crafts Trades in Official Statistics?

Official statistics compile short-term and structural results for

Turnover

Employees

Crafts statistics are compiled entirely from the Business Register (BR).

Page 4: USING MACHINE LEARNING ALGORITHMS …

Business Register

Page 5: USING MACHINE LEARNING ALGORITHMS …

Administrative Data Sources

Page 6: USING MACHINE LEARNING ALGORITHMS …

Linking the Crafts Property to BR-Units

Page 7: USING MACHINE LEARNING ALGORITHMS …

The Problem – Irrelevant Cases

Page 8: USING MACHINE LEARNING ALGORITHMS …

The Problem – Irrelevant Cases

Every year about 40 000 new or significantly changed matches need to be checked for relevance.

This means a lot of clerical review.

Page 9: USING MACHINE LEARNING ALGORITHMS …

The Problem – Irrelevant Cases

[Chart: Crafts in the BR; share of irrelevant cases. Categories: cases, turnover; series: relevant, irrelevant; scale: 0% to 100%.]

Page 10: USING MACHINE LEARNING ALGORITHMS …

The Problem – Irrelevant Cases

Page 11: USING MACHINE LEARNING ALGORITHMS …

The Problem – Irrelevant Cases

[Figure: Density of irrelevant cases in crafts by crafts trade and NACE rev. 2. Axes: NACE rev. 2 (01 to 99) and crafts trades (A 01 to B 50); density scale: 0.0 to 1.0.]

Page 12: USING MACHINE LEARNING ALGORITHMS …

The Problem – Irrelevant Cases

[Figure: Density of irrelevant turnover in crafts by crafts trade and NACE rev. 2. Axes: NACE rev. 2 (01 to 99) and crafts trades (A 01 to B 50); density scale: 0.0 to 1.0.]

Page 13: USING MACHINE LEARNING ALGORITHMS …

Solution – Machine Learning?

Since results of the clerical checking from prior periods exist, is it possible to train a Support Vector Machine to recognise the editing patterns and apply them to a new data set?

[Figure: Density of irrelevant turnover in crafts by crafts trade and NACE rev. 2. Axes: NACE rev. 2 (01 to 99) and crafts trades (A 01 to B 50); density scale: 0.0 to 1.0.]

Page 14: USING MACHINE LEARNING ALGORITHMS …

Support Vector Machines – What's that?

Support Vector Machines (SVM) find the separating hyperplane with the widest margin between two groups in a dataset with known classification …

Page 15: USING MACHINE LEARNING ALGORITHMS …

Support Vector Machines – What's that?

… and apply the patterns found to a dataset with unknown classification.
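
As a rough illustration of the two steps described on this slide and the previous one (fit on data with known classification, then apply to data with unknown classification), here is a minimal sketch of a linear soft-margin SVM trained by plain subgradient descent. The slides do not state which kernel, software or parameters were used; the toy data, the function names and the values of `lam`, `lr` and `epochs` are assumptions made for this example only.

```python
import math

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the soft-margin objective
    lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (w . x_i + b)))."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularisation term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point lies inside the margin or on the wrong side
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Side of the hyperplane w . x + b = 0 on which x falls (+1 or -1)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical toy data with known classification: +1 = relevant, -1 = irrelevant.
X_known = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
y_known = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X_known, y_known)

# Apply the learned hyperplane to points with unknown classification.
X_unknown = [(1, 1), (-1, -2)]
predictions = [predict(w, b, x) for x in X_unknown]

# Width of the margin implied by the learned weights: 2 / ||w||.
margin_width = 2 / math.sqrt(sum(wj * wj for wj in w))
```

A real application would more likely rely on an established implementation such as libsvm, which the Further Reading slide points to through the e1071 package.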

Page 16: USING MACHINE LEARNING ALGORITHMS …

Support Vector Machines – What’s that?

PROs: SVM operate on very few assumptions

CONs: SVM tend to require large amounts of computing power

SVM (as well as other machine learning algorithms) are prone to overfitting

Keep in Mind: SVM have to be tuned to give useful results

Avoid overfitting by cross-validating models
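
The cross-validation advice above can be sketched as follows: the data are split into k folds, a model is fitted on k-1 folds and scored on the held-out fold, and the scores are averaged. This is only a sketch of the general technique; the slides do not say which validation scheme or tuning grid was used. The threshold classifier is a deliberately simple stand-in (not an SVM), and all names and data are hypothetical.

```python
from statistics import mean

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds (no shuffling, so runs are reproducible)."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_val_accuracy(fit, predict, X, y, k=5):
    """Mean accuracy on the held-out fold, averaged over k folds."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return mean(scores)

# Stand-in model (NOT an SVM): a threshold halfway between the two class means.
def fit_threshold(X, y):
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    return (mean(pos) + mean(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

# Hypothetical one-dimensional data with two well-separated classes.
X = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
score = cross_val_accuracy(fit_threshold, predict_threshold, X, y, k=5)
```

For tuning, the same function would be called once per candidate parameter setting, and the setting with the best held-out score would be kept.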

Page 17: USING MACHINE LEARNING ALGORITHMS …

SVM Applied – Training Data

We train the SVM on a Business Register subset, containing about 650 000 units with crafts property and known relevance.

Roughly 13 000 (~2%) of them are marked irrelevant for the crafts statistics.
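
The class imbalance in these figures affects how any accuracy number should be read. The following arithmetic sketch uses only the rounded figures from this slide; the "always relevant" baseline is an illustration added here and is not taken from the slides.

```python
total_units = 650_000       # units with crafts property and known relevance (rounded, from the slide)
irrelevant_units = 13_000   # units marked irrelevant (rounded, from the slide)

share_irrelevant = irrelevant_units / total_units

# A hypothetical baseline that labels every unit "relevant" is wrong
# only on the irrelevant units, so its accuracy by cases is 1 - share.
baseline_accuracy = 1 - share_irrelevant

print(f"share irrelevant: {share_irrelevant:.1%}")
print(f"always-relevant baseline accuracy: {baseline_accuracy:.1%}")
```

A classifier therefore has to be judged on how well it finds the small irrelevant group, not only on its overall share of correct cases.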

Page 18: USING MACHINE LEARNING ALGORITHMS …

SVM Applied – Training Data

We train on the variables:

Crafts Trade

NACE rev. 2

Turnover

Employees
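
Two of these variables are categorical codes and two are numeric, so they need to be turned into one numeric vector before an SVM can use them. The slides do not describe the encoding; the sketch below assumes one-hot encoding for the codes and a log transform for the numeric variables, a common choice because SVMs are sensitive to feature scale. Field names and code lists are hypothetical.

```python
import math

def encode_unit(unit, trades, nace_divisions):
    """One-hot encode the two categorical codes and log-scale the two numeric variables."""
    trade_part = [1.0 if unit["crafts_trade"] == t else 0.0 for t in trades]
    nace_part = [1.0 if unit["nace"] == n else 0.0 for n in nace_divisions]
    # log1p(x) = log(1 + x) keeps zero turnover or zero employees well defined.
    numeric_part = [math.log1p(unit["turnover"]), math.log1p(unit["employees"])]
    return trade_part + nace_part + numeric_part

trades = ["A 01", "A 05", "B 01"]   # hypothetical subset of crafts trade codes
nace_divisions = ["43", "47"]       # hypothetical subset of NACE rev. 2 divisions

unit = {"crafts_trade": "A 05", "nace": "43", "turnover": 250_000.0, "employees": 4}
vector = encode_unit(unit, trades, nace_divisions)
```

The resulting vector has one position per known code plus two numeric positions, here seven in total.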

Page 19: USING MACHINE LEARNING ALGORITHMS …

SVM Applied – Training

[Flowchart]

cases with known relevance → filter trivial cases → trivial cases, nontrivial cases

trivial cases → model for trivial cases

nontrivial cases → reduce to most important variables → nontrivial cases with reduced variables → train SVM → model for nontrivial cases
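
The split into trivial and nontrivial cases can be sketched as below. The slides do not define what makes a case trivial or how the model for trivial cases is built. This sketch assumes that a case is trivial when every known case with the same combination of crafts trade and NACE code has the same relevance, and that the model for trivial cases is a plain lookup table. All names and data are hypothetical.

```python
from collections import defaultdict

def split_trivial(cases):
    """Return (trivial_model, nontrivial_cases).

    A (crafts_trade, nace) cell counts as trivial here if all of its
    known cases carry the same relevance label (an assumption)."""
    labels_by_cell = defaultdict(set)
    for case in cases:
        labels_by_cell[(case["crafts_trade"], case["nace"])].add(case["relevant"])
    # Lookup table: one fixed answer per unanimous cell.
    trivial_model = {
        cell: next(iter(labels))
        for cell, labels in labels_by_cell.items()
        if len(labels) == 1
    }
    nontrivial_cases = [
        c for c in cases if (c["crafts_trade"], c["nace"]) not in trivial_model
    ]
    return trivial_model, nontrivial_cases

known_cases = [
    {"crafts_trade": "A 01", "nace": "43", "relevant": True},
    {"crafts_trade": "A 01", "nace": "43", "relevant": True},
    {"crafts_trade": "A 05", "nace": "41", "relevant": True},
    {"crafts_trade": "A 05", "nace": "41", "relevant": False},
    {"crafts_trade": "B 05", "nace": "47", "relevant": False},
]
trivial_model, nontrivial_cases = split_trivial(known_cases)
```

Only `nontrivial_cases` would then go on to variable reduction and SVM training.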

Page 20: USING MACHINE LEARNING ALGORITHMS …

SVM Applied – Prediction

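[Flowchart]

cases with unknown relevance → trivial cases, nontrivial cases with unknown relevance

trivial cases → predict trivial cases (model for trivial cases) → trivial cases with predicted relevance

nontrivial cases with unknown relevance → predict nontrivial cases (model for nontrivial cases) → nontrivial cases with predicted relevance

trivial and nontrivial cases with predicted relevance → cases with predicted relevance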
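
The prediction flow routes each new case to one of the two models and then merges the results. The sketch below assumes that the model for trivial cases is a lookup keyed by crafts trade and NACE code and that the model for nontrivial cases is any callable classifier. Both are hypothetical stand-ins; in particular the placeholder rule is not a trained SVM.

```python
def predict_relevance(case, trivial_model, nontrivial_model):
    """Use the lookup for trivial cells; fall back to the classifier otherwise."""
    cell = (case["crafts_trade"], case["nace"])
    if cell in trivial_model:
        return trivial_model[cell]
    return nontrivial_model(case)

# Hypothetical model for trivial cases: fixed answer per (crafts trade, NACE) cell.
trivial_model = {("A 01", "43"): True, ("B 05", "47"): False}

def nontrivial_model(case):
    # Placeholder rule, NOT a trained SVM: treats units with any turnover as relevant.
    return case["turnover"] > 0

new_cases = [
    {"crafts_trade": "A 01", "nace": "43", "turnover": 120.0},
    {"crafts_trade": "B 05", "nace": "47", "turnover": 80.0},
    {"crafts_trade": "A 05", "nace": "41", "turnover": 0.0},
]
predicted = [predict_relevance(c, trivial_model, nontrivial_model) for c in new_cases]
```

The first two cases are answered by the lookup, the third by the fallback classifier.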

Page 21: USING MACHINE LEARNING ALGORITHMS …

Results

Classification algorithms: Random Forest, SVM

                            predicted      original        Cases    Turnover

classified correctly:       relevant       relevant       93.1 %     65.2 %
                            not relevant   not relevant    3.2 %     33.1 %
                            total                         96.3 %     98.2 %
not correctly classified:   not relevant   relevant        0.3 %      0.3 %
                            relevant       not relevant    3.4 %      1.5 %
                            total                          3.7 %      1.8 %

classified correctly:       relevant       relevant       93.0 %     62.8 %
                            not relevant   not relevant    1.4 %     20.5 %
                            total                         94.4 %     83.3 %
not correctly classified:   not relevant   relevant        0.4 %      2.7 %
                            relevant       not relevant    5.2 %     14.1 %
                            total                          5.6 %     16.7 %
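
The results are reported twice, once as a share of cases and once as a share of turnover, and the two can differ considerably. The sketch below shows how such shares can be computed from unit-level predictions and how a single large misclassified unit dominates the turnover view. The four units are invented for illustration and are unrelated to the figures on the slide.

```python
from collections import Counter

def classification_shares(units):
    """Share of cases and share of turnover per (predicted, original) pair."""
    case_counts = Counter()
    turnover_sums = Counter()
    for u in units:
        key = (u["predicted"], u["original"])
        case_counts[key] += 1
        turnover_sums[key] += u["turnover"]
    n_cases = len(units)
    total_turnover = sum(turnover_sums.values())
    return {
        key: (case_counts[key] / n_cases, turnover_sums[key] / total_turnover)
        for key in case_counts
    }

# Hypothetical units: three small ones and one large, misclassified one.
units = [
    {"predicted": "relevant", "original": "relevant", "turnover": 1},
    {"predicted": "relevant", "original": "relevant", "turnover": 1},
    {"predicted": "not relevant", "original": "not relevant", "turnover": 2},
    {"predicted": "relevant", "original": "not relevant", "turnover": 4},
]
shares = classification_shares(units)

# Totals for the correctly classified pairs (predicted == original).
correct_cases = sum(c for (p, o), (c, t) in shares.items() if p == o)
correct_turnover = sum(t for (p, o), (c, t) in shares.items() if p == o)
```

In this toy data three of four cases are classified correctly, but only half of the turnover is.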

Page 22: USING MACHINE LEARNING ALGORITHMS …

Discussion

We consider the results sufficiently precise to classify the bulk of the cases automatically and without clerical editing.

Cases with a large impact on the results will still be passed to clerical review.

Furthermore, the SVM models can be used to check the results of the clerical review process.
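
The two uses named above, sending high-impact cases to clerical review and cross-checking clerical decisions against the model, can be sketched as follows. The slides do not say how "large impact" is measured; this sketch assumes a turnover threshold, and the threshold value, field names and data are all hypothetical.

```python
def route_for_review(cases, turnover_threshold):
    """Split predicted cases into automatic acceptance and clerical review by turnover."""
    automatic = [c for c in cases if c["turnover"] < turnover_threshold]
    clerical = [c for c in cases if c["turnover"] >= turnover_threshold]
    return automatic, clerical

def flag_disagreements(reviewed_cases):
    """Cases where the clerical decision differs from the model prediction."""
    return [c for c in reviewed_cases if c["clerical"] != c["predicted"]]

cases = [
    {"id": 1, "turnover": 50_000, "predicted": True},
    {"id": 2, "turnover": 9_000_000, "predicted": False},
    {"id": 3, "turnover": 120_000, "predicted": True},
]
automatic, clerical = route_for_review(cases, turnover_threshold=1_000_000)

reviewed = [
    {"id": 2, "predicted": False, "clerical": False},
    {"id": 4, "predicted": True, "clerical": False},
]
to_recheck = flag_disagreements(reviewed)
```

Disagreements are only candidates for a second look; either the model or the clerical decision may be the one that is wrong.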

Page 23: USING MACHINE LEARNING ALGORITHMS …

Jörg Feuerhake

Phone: +49/(0) 611 / 75 41 16

[email protected]

www.destatis.de

Page 24: USING MACHINE LEARNING ALGORITHMS …

Further Reading

Applied Multivariate Statistical Analysis; Härdle, Simar; 2007

http://books.google.de/books/about/Applied_Multivariate_Statistical_Analysi.html?id=6nSF2PTi9bkC&redir_esc=y

The Elements of Statistical Learning; Hastie, Tibshirani, Friedman; 2009

http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

A Practical Guide to Support Vector Classification; Hsu, Chang, and Lin; 2003-2010

http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

Support Vector Machines - The Interface to libsvm in Package e1071; Meyer; 2014

http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf

Page 25: USING MACHINE LEARNING ALGORITHMS …

Presentations

A Simple Introduction to Support Vector Machines; Martin Law

http://www.cise.ufl.edu/class/cis4930sp11dtm/notes/intro_svm_new.pdf

Klassifikation mit Support Vector Machines; Florian Markowetz (German)

http://lectures.molgen.mpg.de/statistik03/docs/Kapitel_16.pdf

Support Vector Machines; Andrew W. Moore

http://www.autonlab.org/tutorials/svm15.pdf

Page 26: USING MACHINE LEARNING ALGORITHMS …

Videos

Artificial Intelligence; Lecture 16: Learning: Support Vector Machines (49’34”)

https://www.youtube.com/watch?v=_PwhiWxHK8o

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/index.htm

Introduction to Support Vector Machines (51’54”, 36’11”)

http://videolectures.net/epsrcws08_campbell_isvm/

Artificial Intelligence | Machine Learning; Lectures 6 (73’09”) and 7 (75’45”)

www.youtube.com/watch?v=qyyJKd-zXRE

www.youtube.com/watch?v=s8B4A5ubw6c

http://see.stanford.edu/see/courseinfo.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1

Pattern Recognition Class (40’31”)

https://www.youtube.com/watch?v=rjIac3NxAYA

