USING MACHINE LEARNING ALGORITHMS

Post on 13-Feb-2017

TRANSCRIPT

  • Using Machine Learning Algorithms

    to reduce clerical effort and improve the quality of

    administrative data sources.

    Workshop on Access to Administrative Data Sources

    Brussels, 13-14 September 2016

    Federal Statistical Office of Germany | Department E 105

  • Crafts Trades: What's that?

    Image sources: Deutsche Fotothek; Marketing Handwerk

  • Crafts Trades in Official Statistics?

    Official statistics compile short-term and structural results for turnover and employees.

    Crafts statistics are compiled entirely from the Business Register (BR).

  • Business Register

  • Administrative Data Sources

  • Linking the Crafts Property to BR-Units

  • The Problem: Irrelevant Cases

    Every year about 40 000 new or significantly changed matches need to be checked for relevance.

    This means a lot of clerical review.

  • The Problem: Irrelevant Cases

    [Figure: Crafts in the BR, share of irrelevant cases. Relevant and irrelevant shares (0% to 100%) shown for cases and for turnover.]

  • The Problem: Irrelevant Cases

    [Figure: Density of irrelevant cases in crafts, by crafts trade (A 01 to B 50) and NACE Rev. 2 division (01 to 99); density scale from 0.0 to 1.0.]

  • The Problem: Irrelevant Cases

    [Figure: Density of irrelevant turnover in crafts, by crafts trade (A 01 to B 50) and NACE Rev. 2 division (01 to 99); density scale from 0.0 to 1.0.]

  • Solution: Machine Learning?

    Since results of the clerical checking from prior periods exist, is it possible to train a Support Vector Machine to recognise the editing patterns and apply them to a new data set?

  • Support Vector Machines: What's that?

    Support Vector Machines (SVM) find the widest separating hyperplane between two groups in a dataset with known classification and apply the patterns found to a dataset with unknown classification.
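
    To make "widest separating hyperplane" concrete, here is a minimal sketch in plain Python. It does not train an SVM; it only compares two hand-picked hyperplanes by their geometric margin and applies the wider one to a new point. The toy points, the two candidate hyperplanes and all function names are assumptions made for illustration.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted group."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labelled point to the hyperplane.

    The value is positive only if every point lies on its correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in zip(points, labels))

def predict(w, b, x):
    """Assign a point with unknown classification to group +1 or -1."""
    return 1 if decision_value(w, b, x) > 0 else -1

# Hypothetical toy data with known classification (two separable groups).
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, 1, 1]

# Two hand-picked hyperplanes (w, b) that both separate the groups.
candidates = {
    "vertical": ((1.0, 0.0), -3.0),   # x1 = 3
    "diagonal": ((1.0, 1.0), -5.5),   # x1 + x2 = 5.5
}
margins = {name: geometric_margin(w, b, points, labels)
           for name, (w, b) in candidates.items()}
wider = max(margins, key=margins.get)

# Apply the wider hyperplane to a point with unknown classification.
w, b = candidates[wider]
new_label = predict(w, b, (4.5, 3.0))
```

    Both candidates classify the toy data correctly, but the diagonal one keeps the nearest points further away. An SVM searches over all separating hyperplanes for the one with the largest such margin, rather than choosing between two fixed candidates.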

  • Support Vector Machines: What's that?

    Pros (+): SVM operate on very few assumptions.

    Cons (-): SVM tend to require large amounts of computing power. SVM (as well as other machine learning algorithms) are prone to overfitting.

    Keep in mind (!): SVM have to be tuned to give useful results. Avoid overfitting by cross-validating models.
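
    Tuning and cross-validation can be sketched together. The block below is a dependency-free illustration, not the procedure used in the project: a k-nearest-neighbour vote stands in for the SVM, and its `n_neighbors` setting plays the role an SVM's tuning parameters would play. The data, the grid and all names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train_x, train_y, x, n_neighbors):
    """Stand-in classifier: majority vote of the nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = sum(train_y[i] for i in order[:n_neighbors])
    return 1 if votes > 0 else -1

def accuracy(train_x, train_y, test_x, test_y, n_neighbors):
    """Share of test points whose predicted label matches the known label."""
    hits = sum(knn_predict(train_x, train_y, x, n_neighbors) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

def cv_accuracy(xs, ys, n_neighbors, k=5):
    """Mean accuracy on the held-out fold, averaged over k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        scores.append(accuracy([xs[i] for i in train], [ys[i] for i in train],
                               [xs[i] for i in fold], [ys[i] for i in fold],
                               n_neighbors))
    return sum(scores) / len(scores)

# Hypothetical 1-D data: label +1 above 3.0, with three deliberately flipped labels.
xs = [i / 10 for i in range(60)]
ys = [1 if x > 3.0 else -1 for x in xs]
for i in (10, 25, 50):
    ys[i] = -ys[i]

grid = (1, 5, 15)  # odd values avoid tied votes
train_scores = {m: accuracy(xs, ys, xs, ys, m) for m in grid}
cv_scores = {m: cv_accuracy(xs, ys, m) for m in grid}
best = max(grid, key=cv_scores.get)
```

    With `n_neighbors=1` the stand-in model reproduces its own training data perfectly, flipped labels included, yet it misses those points once they are held out. That gap between `train_scores` and `cv_scores` is the overfitting that cross-validation exposes, which is why the setting is chosen on the cross-validated score.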

  • SVM Applied: Training Data

    We train the SVM on a Business Register subset containing about 650 000 units with crafts property and known relevance.

    Roughly 13 000 (~2%) of them are marked irrelevant for the crafts statistics.
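
    The two figures on this slide imply a strongly imbalanced training set. The short calculation below spells that out. The "always relevant" baseline and the balanced class weights are illustrative assumptions about how one might handle the imbalance; the slides do not say whether any weighting was used.

```python
# Figures taken from the slide.
n_units = 650_000       # units with crafts property and known relevance
n_irrelevant = 13_000   # units marked irrelevant
n_relevant = n_units - n_irrelevant

share_irrelevant = n_irrelevant / n_units

# A classifier that always answers "relevant" is right for every relevant unit.
baseline_accuracy = n_relevant / n_units

# Assumed remedy (not from the slides): "balanced" class weights,
# computed as n_units / (number_of_classes * class_size).
class_weights = {
    "relevant": n_units / (2 * n_relevant),
    "irrelevant": n_units / (2 * n_irrelevant),
}
```

    Because the trivial "always relevant" rule already scores 98% on such data, overall accuracy alone says little about how well the irrelevant cases are found.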

  • SVM Applied: Training Data

    We train on the variables: Crafts Trade, NACE Rev. 2, Turnover, Employees.
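
    An SVM works on numeric vectors, so the two categorical variables need an encoding and the two size variables usually benefit from rescaling. The sketch below shows one common way to do this. It is an assumption, not the encoding used in the project: the field names, the example codes, the one-hot scheme and the log transform are all hypothetical.

```python
import math

def one_hot(value, categories):
    """Vector with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]

def encode_unit(unit, trades, nace_divisions):
    """Turn one register unit into a numeric feature vector."""
    return (
        one_hot(unit["crafts_trade"], trades)
        + one_hot(unit["nace"], nace_divisions)
        # log1p compresses the long right tail of size variables (assumed choice)
        + [math.log1p(unit["turnover"]), math.log1p(unit["employees"])]
    )

# Hypothetical category lists and example unit.
trades = ["A 01", "A 05", "B 01"]
nace_divisions = ["41", "43", "47"]
unit = {"crafts_trade": "A 01", "nace": "43", "turnover": 250_000.0, "employees": 4}

vector = encode_unit(unit, trades, nace_divisions)
```

    The resulting vector has one position per crafts trade, one per NACE division, and two numeric positions for turnover and employees.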

  • SVM Applied: Training

    1. Start from the cases with known relevance.

    2. Filter trivial cases, which splits the data into trivial cases and nontrivial cases.

    3. Trivial cases: derive the model for trivial cases.

    4. Nontrivial cases: reduce to the most important variables, giving nontrivial cases with reduced variables.

    5. Train the SVM on these, giving the model for nontrivial cases.

  • SVM Applied: Prediction

    1. Start from the cases with unknown relevance and split them into trivial cases and nontrivial cases with unknown relevance.

    2. Predict trivial cases with the model for trivial cases, giving trivial cases with predicted relevance.

    3. Predict nontrivial cases with the model for nontrivial cases, giving nontrivial cases with predicted relevance.

    4. Combine both into the cases with predicted relevance.
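
    The two-model flow for training and prediction can be sketched as follows. The slides do not define "trivial", so this sketch assumes a case is trivial when all known cases with the same combination of crafts trade and NACE code share one label; that definition, the lookup-table model, the field names and the stand-in rule for nontrivial cases are all hypothetical.

```python
from collections import defaultdict

def fit_trivial_model(known_cases):
    """Lookup table for cells whose known cases all carry the same label.

    Assumption: a "cell" is a (crafts trade, NACE) combination.
    """
    labels_by_cell = defaultdict(set)
    for case in known_cases:
        labels_by_cell[(case["trade"], case["nace"])].add(case["relevant"])
    return {cell: next(iter(labels))
            for cell, labels in labels_by_cell.items() if len(labels) == 1}

def split_cases(cases, trivial_model):
    """Separate cases covered by the lookup table from the rest."""
    trivial = [c for c in cases if (c["trade"], c["nace"]) in trivial_model]
    nontrivial = [c for c in cases if (c["trade"], c["nace"]) not in trivial_model]
    return trivial, nontrivial

def predict_all(cases, trivial_model, predict_nontrivial):
    """Route each case to the lookup table or to the nontrivial model."""
    predictions = []
    for case in cases:
        cell = (case["trade"], case["nace"])
        if cell in trivial_model:
            predictions.append(trivial_model[cell])
        else:
            predictions.append(predict_nontrivial(case))
    return predictions

# Hypothetical cases with known relevance.
known = [
    {"trade": "A 01", "nace": "43", "turnover": 100, "relevant": True},
    {"trade": "A 01", "nace": "43", "turnover": 200, "relevant": True},
    {"trade": "B 05", "nace": "47", "turnover": 300, "relevant": True},
    {"trade": "B 05", "nace": "47", "turnover": 5, "relevant": False},
]
trivial_model = fit_trivial_model(known)
trivial, nontrivial = split_cases(known, trivial_model)

# Stand-in for the trained SVM: any callable that maps a case to True/False.
def stand_in_model(case):
    return case["turnover"] >= 50

new_cases = [
    {"trade": "A 01", "nace": "43", "turnover": 1},
    {"trade": "B 05", "nace": "47", "turnover": 400},
    {"trade": "B 05", "nace": "47", "turnover": 2},
]
predicted = predict_all(new_cases, trivial_model, stand_in_model)
```

    In this sketch a cell that never occurred among the known cases is treated as nontrivial and handed to the second model. Only the `nontrivial` subset would be passed on to variable reduction and SVM training.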

  • Results

    Classification algorithm   Predicted      Original        Cases   Turnover

    SVM                        relevant       relevant        93.1%   65.2%
                               not relevant   not relevant     3.2%   33.1%
                               classified correctly:          96.3%   98.2%
                               not relevant   relevant         0.3%    0.3%
                               relevant       not relevant     3.4%    1.5%
                               not correctly classified:       3.7%    1.8%

    Random Forest              relevant       relevant        93.0%   62.8%
                               not relevant   not relevant     1.4%   20.5%
                               classified correctly:          94.4%   83.3%
                               not relevant   relevant         0.4%    2.7%
                               relevant       not relevant     5.2%   14.1%
                               not correctly classified:       5.6%   16.7%
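
    The results are reported twice, once as a share of cases and once as a share of turnover. The sketch below shows how both views can be computed from the same predictions: every case counts once in the first view and with its turnover in the second. The four example cases and all names are hypothetical and unrelated to the figures on the slide.

```python
def confusion_shares(cases, weight=lambda case: 1.0):
    """Share of total weight in each (predicted, original) combination."""
    totals = {}
    for case in cases:
        key = (case["predicted"], case["original"])
        totals[key] = totals.get(key, 0.0) + weight(case)
    grand_total = sum(totals.values())
    return {key: value / grand_total for key, value in totals.items()}

def share_correct(shares):
    """Total share where the prediction equals the original label."""
    return sum(v for (predicted, original), v in shares.items()
               if predicted == original)

# Hypothetical evaluation cases.
cases = [
    {"predicted": "relevant", "original": "relevant", "turnover": 700.0},
    {"predicted": "not relevant", "original": "not relevant", "turnover": 100.0},
    {"predicted": "relevant", "original": "not relevant", "turnover": 150.0},
    {"predicted": "relevant", "original": "relevant", "turnover": 50.0},
]

by_cases = confusion_shares(cases)
by_turnover = confusion_shares(cases, weight=lambda case: case["turnover"])

correct_by_cases = share_correct(by_cases)
correct_by_turnover = share_correct(by_turnover)
```

    The two views can differ noticeably: a single misclassified unit with high turnover lowers the turnover-weighted share far more than the case-based share.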

  • Discussion

    We consider the results sufficiently precise to classify the bulk of the cases automatically, without clerical editing.

    Cases with a large impact on the results will still be sent to clerical review.

    Furthermore, the SVM models can be used to check the results of the clerical review process.
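
    Both uses named in the discussion can be sketched in a few lines. The slides do not say how "large impact" is measured, so this sketch assumes a simple turnover threshold; the threshold value, the field names and the example cases are hypothetical.

```python
def route(cases, turnover_threshold):
    """Split cases into automatic classification and clerical review.

    Assumption: impact on the results is approximated by turnover.
    """
    automatic = [c for c in cases if c["turnover"] < turnover_threshold]
    review = [c for c in cases if c["turnover"] >= turnover_threshold]
    return automatic, review

def disagreements(reviewed_cases):
    """IDs of cases where the clerical decision differs from the model."""
    return [c["id"] for c in reviewed_cases if c["clerical"] != c["predicted"]]

# Hypothetical new cases with model predictions.
new_cases = [
    {"id": 1, "turnover": 40_000.0, "predicted": True},
    {"id": 2, "turnover": 9_500_000.0, "predicted": True},
    {"id": 3, "turnover": 120_000.0, "predicted": False},
]
automatic, review = route(new_cases, turnover_threshold=1_000_000.0)

# Hypothetical clerically reviewed cases, compared with the model afterwards.
reviewed = [
    {"id": 2, "predicted": True, "clerical": True},
    {"id": 7, "predicted": False, "clerical": True},
]
to_recheck = disagreements(reviewed)
```

    Here case 2 goes to clerical review because of its turnover, and case 7 is flagged for a second look because the reviewer and the model disagree.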

  • Jörg Feuerhake

    Phone: +49/(0) 611 / 75 41 16

    joerg.feuerhake@destatis.de

    www.destatis.de

  • Further Reading

    Applied Multivariate Statistical Analysis; Härdle, Simar; 2007

    http://books.google.de/books/about/Applied_Multivariate_Statistical_Analysi.html?id=6nSF2PTi9bkC&redir_esc=y

    The Elements of Statistical Learning; Hastie, Tibshirani, Friedman; 2009

    http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

    A Practical Guide to Support Vector Classification; Hsu, Chang, and Lin; 2003-2010;

    http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

    Support Vector Machines - The Interface to libsvm in Package e1071; Meyer; 2014

    http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf

  • Presentations

    A Simple Introduction to Support Vector Machines; Martin Law

    http://www.cise.ufl.edu/class/cis4930sp11dtm/notes/intro_svm_new.pdf

    Klassifikation mit Support Vector Machines; Florian Markowetz (German)

    http://lectures.molgen.mpg.de/statistik03/docs/Kapitel_16.pdf

    Support Vector Machines; Andrew W. Moore

    http://www.autonlab.org/tutorials/svm15.pdf

  • Videos

    Artificial Intelligence; Lecture 16: Learning: Support Vector Machines (49:34)

    https://www.youtube.com/watch?v=_PwhiWxHK8o

    http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/index.htm

    Introduction to Support Vector Machines (51:54, 36:11)

    http://videolectures.net/epsrcws08_campbell_isvm/

    Artificial Intelligence | Machine Learning; Lecture 6 (73:09) and 7 (75:45)

    www.youtube.com/watch?v=qyyJKd-zXRE

    www.youtube.com/watch?v=s8B4A5ubw6c

    http://see.stanford.edu/see/courseinfo.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1

    Pattern Recognition Class (40:31)

    https://www.youtube.com/watch?v=rjIac3NxAYA
