ESSnet on BD II WPC:
Results and alignment to BREAL
Implementation Track Meeting
9-10 December 2019
Wien, Austria
WPC: Overview
Actors: AT, BG (Coordinator), DE, FI, IR, IT, NL, PL and UK
Main objectives:
• Improve or update existing information (SBR)
• Maximize the quality and quantity of the statistical outputs (ICT survey)
• Achieve important economies of scale: opportunities for sharing resources at ESS level
5 tasks:
• ESS web-scraping policy
• Reference Methodological Framework (RMF)
• Experimental Statistics
• Starter Kit for NSIs
• Quality template for statistical outputs
Deliverables until the end of October 2019
• ESS web-scraping policy Template
• Reference Methodological Framework (RMF), ver. 1.0
• Functional production Prototypes
• Experimental Statistics
ESS web-scraping policy Template
Purpose and scope
8 main sub-sections: Preamble, Background, Scope, Principles, Practices (Implementation guidelines), Roles and responsibilities, Governance and Glossary
Valuable contribution from Eurostat
Current status
Need to discuss further
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/0/0a/WPC_ESS_webscraping_policy_template.pdf
Reference Methodological Framework, v. 1.0
Purposes
• to describe a complete OBEC statistics processing pipeline across the main big data life cycle phases
• to be a reference guide and template for NSIs within the ESS
• to be a relevant document for NSIs during the implementation process
Six chapters
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/1/1f/WPC_Deliverable_C2_Reference_Methodological_Framework_v1.0.pdf
Reference Methodological Framework, v. 1.0
Four use cases are defined:
1. URLs Inventory
2. Variables in the ICT survey
3. Data driven discovery of emergent enterprise classifications
4. Experimental language statistics
Implementable Proof-of-concepts
Statistical products
Main concept
• online based enterprise characteristics (OBECs): any attributes/characteristics, linked to businesses, that have been extracted from webpages (e.g. enterprise’s URL).
At input level
• statistical unit: enterprises and/or webpages
• target population: enterprises included in the target population of ICT survey or just a sample thereof
• observation variables: variables are observed by using search engines, APIs and/or web scraping software
Statistical products
At output level
• periodicity: at least once a year or in accordance with the observation period of the ICT survey
• statistical indicators
Rate of enterprises having websites
Rate of enterprises engaged in web sales on their website
Rate of enterprises that are present on social media
Rate of enterprises using Twitter for a specific purpose
Rate of enterprises having specific features of the website
Rate of enterprises working on upcoming/new phenomena, specifically AI and ML
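Each indicator above is a share of the target population. A minimal sketch of how such rates could be computed from enterprise-level flags follows. The field names (`has_website`, `web_sales`, `social_media`) and the sample records are hypothetical and are not taken from the WPC deliverables.

```python
from typing import Dict, List

# Hypothetical microdata: one record per enterprise in the target population.
enterprises: List[Dict[str, bool]] = [
    {"has_website": True, "web_sales": True, "social_media": True},
    {"has_website": True, "web_sales": False, "social_media": True},
    {"has_website": False, "web_sales": False, "social_media": False},
    {"has_website": True, "web_sales": False, "social_media": False},
]


def rate(records: List[Dict[str, bool]], flag: str) -> float:
    """Share (in percent) of enterprises for which `flag` is true."""
    if not records:
        raise ValueError("empty population")
    return 100.0 * sum(1 for r in records if r[flag]) / len(records)


# One rate per indicator, unweighted for simplicity.
rates = {
    flag: rate(enterprises, flag)
    for flag in ("has_website", "web_sales", "social_media")
}
print(rates)
```

The sketch is unweighted; a production estimate would presumably need to account for the sampling design of the ICT survey.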
Big data processing life cycle on OBECs
High level view on enterprise characteristics web scraping process
GSBPM phases recognized in the big data lifecycle
GSBPM 4 Collect: Acquisition/Recording OBECs
• a list of companies for which data will be collected is identified (target population)
• a list of potential website addresses is built
• a partial crawling data collection is done on potential websites
• the “first-best” website is chosen for each enterprise
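The selection of a “first-best” website can be illustrated with a small sketch. The scoring rule used here (counting how many known enterprise identifiers appear in the partially crawled text) is an assumption made for illustration, not the method documented by WPC; all names, URLs and sample data are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    url: str
    page_text: str  # text collected by a partial crawl of the candidate site


def score(candidate: Candidate, identifiers: List[str]) -> int:
    """Count how many known enterprise identifiers occur in the crawled text."""
    text = candidate.page_text.lower()
    return sum(1 for ident in identifiers if ident.lower() in text)


def first_best(candidates: List[Candidate], identifiers: List[str]) -> Optional[Candidate]:
    """Return the highest-scoring candidate, or None if nothing matches."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score(c, identifiers))
    return best if score(best, identifiers) > 0 else None


# Hypothetical identifiers taken from a business register entry.
identifiers = ["Example Trading Ltd", "BG123456789", "Sofia"]
candidates = [
    Candidate("https://directory.example/listing/42", "Business listings in Sofia"),
    Candidate("https://example-trading.example", "Example Trading Ltd, VAT BG123456789, Sofia"),
]
chosen = first_best(candidates, identifiers)
print(chosen.url if chosen else "no website found")
```

Returning `None` when no candidate matches keeps enterprises without a plausible website out of the URLs inventory instead of forcing a weak match.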
GSBPM phases recognized in the big data lifecycle
GSBPM 5 Process: Extraction, cleaning and annotation, integration, aggregation and representation, etc.
• pre-processing the raw dataset (including tag identification)
• processing data into machine readable format (including data cleansing and text mining methods)
• data evaluation and improvement (including imputation of missing data/data linkability)
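The pre-processing and cleansing steps can be sketched with the Python standard library only. This is an illustrative assumption about what "tag identification" and "machine readable format" might involve (dropping markup, skipping script and style content, lower-casing and tokenizing), not the WPC implementation; the sample HTML is made up.

```python
import re
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_tokens(html: str) -> List[str]:
    """Strip markup, lower-case the visible text and split it into word tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return re.findall(r"\w+", text)


raw = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Online Shop</h1><p>Add to cart &amp; pay.</p>"
    "<script>var x=1;</script></body></html>"
)
tokens = html_to_tokens(raw)
print(tokens)
```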
GSBPM phases recognized in the big data lifecycle
GSBPM 6 Analyse: Modelling and interpretation
• validate/reject candidate URLs
• calculation of enterprise characteristics through modelling and interpretation
• microdata is aggregated before publishing (e.g. by NACE, NUTS, number of employees)
GSBPM 7 Disseminate
• not significantly different from the traditional dissemination processes
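The aggregation step under GSBPM 6 can be sketched as a group-by over enterprise-level results. The grouping variable (a NACE section letter) follows the slide; the records and the `rate_by_group` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical microdata: (NACE section, engaged in web sales) per enterprise.
microdata: List[Tuple[str, bool]] = [
    ("C", True), ("C", False),
    ("G", True), ("G", True), ("G", False),
    ("J", True),
]


def rate_by_group(records: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Percentage of enterprises with the characteristic, per group."""
    totals: Dict[str, int] = defaultdict(int)
    positives: Dict[str, int] = defaultdict(int)
    for group, flag in records:
        totals[group] += 1
        if flag:
            positives[group] += 1
    return {g: round(100.0 * positives[g] / totals[g], 1) for g in sorted(totals)}


web_sales_by_nace = rate_by_group(microdata)
print(web_sales_by_nace)
```

Only the aggregated rates would be published; the enterprise-level rows stay internal.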
Functional production prototypes
URLs Inventory of enterprises
• the methods used are described
• the software used is available on the wiki WPC GitHub
• Python, Java, PHP, Node and R are the main web scraping languages
• 2 procedures were adopted: one using Java, Solr and R, and one using PHP and MySQL
• the process was tested several times in some WPC countries with success and is suitable for integration into real statistical production
• the result can be used to retrieve information from the enterprise websites on variables in the ICT survey, new variables, validation of the SBR or NACE classification
Functional production prototypes
Variables in the ICT usage in enterprise survey
• five different indicators have been prepared
• two sets of methods are used to provide output data for the indicators
• the software used is available on the wiki WPC GitHub
• the procedure for using the prototype is described
• the process was tested in some WPC countries and no significant proportion of errors occurred
• statistical conclusions can be drawn based on the data from websites
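As an illustration of how one ICT-survey-style variable might be derived from a scraped page, the sketch below detects social media presence from outgoing links. The platform list, the regular expressions and the sample HTML are assumptions for illustration only; the actual WPC prototypes may work differently.

```python
import re
from typing import Set

# Hypothetical mapping of platforms to link patterns.
SOCIAL_PATTERNS = {
    "facebook": re.compile(r"https?://(?:www\.)?facebook\.com/", re.I),
    "twitter": re.compile(r"https?://(?:www\.)?twitter\.com/", re.I),
    "linkedin": re.compile(r"https?://(?:www\.)?linkedin\.com/", re.I),
}
# Simplified href extraction; a real scraper would use a proper HTML parser.
HREF = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.I)


def social_media_presence(html: str) -> Set[str]:
    """Return the platforms that the page links to."""
    links = HREF.findall(html)
    return {
        name
        for name, pattern in SOCIAL_PATTERNS.items()
        if any(pattern.match(link) for link in links)
    }


page = (
    '<a href="https://www.facebook.com/example">FB</a> '
    '<a href="/contact">Contact</a> '
    "<a href='https://twitter.com/example'>Tw</a>"
)
found = social_media_presence(page)
print(sorted(found))
```

A non-empty result would set the "present on social media" flag for that enterprise.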
Proof-of-concept prototype
Data driven discovery of emergent enterprise classifications
• NLPT is used to discover new data driven classifications of enterprises
• expected outcome is one or more new enterprise classifications and the corresponding distributions of the scraped enterprises
• the 2-phase process (data acquisition and data analysis) is described
• the software used is available on the wiki WPC GitHub
• obtained results did not match the expectations
• additional experiments are needed to improve the quality of the obtained results
Proof-of-concept prototype
Experimental language statistics
• clustering enterprises/website owners by descriptions of their business or sustainability activities
• expected outcomes are: Business Activity Cluster and Sustainable Activities Cluster
• outline of processing pipeline is presented
• procedure and software used are described in detail
• methodology developed should be considered a work in progress
• pipeline improvements needed
• explore bias/coverage in the dataset
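To make the idea of clustering enterprises by their descriptions concrete, here is a deliberately simple sketch. It swaps in a basic technique (Jaccard similarity on word sets with greedy, single-pass cluster assignment) purely for illustration; it is not the pipeline developed by WPC, and the threshold, function names and descriptions are hypothetical.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_clusters(descriptions: Dict[str, str], threshold: float = 0.3) -> List[List[str]]:
    """Assign each enterprise to the first cluster whose seed is similar enough."""
    clusters: List[List[str]] = []
    seeds: List[Set[str]] = []  # token set of the first member of each cluster
    for name, text in descriptions.items():
        toks = tokenize(text)
        for i, seed in enumerate(seeds):
            if jaccard(toks, seed) >= threshold:
                clusters[i].append(name)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([name])
            seeds.append(toks)
    return clusters


descriptions = {
    "ent_a": "organic food shop and delivery",
    "ent_b": "online organic food delivery",
    "ent_c": "software development and cloud consulting",
    "ent_d": "cloud software consulting services",
}
clusters = greedy_clusters(descriptions)
print(clusters)
```

The result depends on input order and on the threshold, which is one reason such a sketch would need the bias and coverage checks mentioned above before any statistical use.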
Experimental statistics: OBECs
Dissemination of statistical outputs
• Results: calculation of the statistical indicators defined in the RMF 1.0
• Integration of OBECs information at macro and/or micro level
• Methodology: explanation of how the results were produced
• Experimental section on the wiki
WPC Application architecture
WPC Information architecture
Business process: URLs Inventory
Business Process: OBECs
2 "P. Volov" Str., 1038 Sofia, Bulgaria, tel. +359 2 9857 729
What comes next?
Start to develop the Starter Kit for NSIs
Describing the reference architecture for OBEC data (RMF, Ver. 2.0)
Defining the implementation requirements at national level and at ESS level (RMF, Ver. 2.0)
Quality Report for OBEC outputs based on SIMS 2.0 (WPK deliverable)
Experimental statistics 2020
THANK YOU FOR YOUR ATTENTION!
Galya Stateva
Bulgarian National Statistical Institute