
ESSnet Big Data

Specific Grant Agreement No 1 (SGA-2)

https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata

http://www.cros-portal.eu/

Framework Partnership Agreement Number 11104.2015.006-2015.720

Specific Grant Agreement Number 11104.2016.010-2016.756

Work Package 1

Web scraping / Job vacancies

Deliverable 2.2

Final Technical Report (SGA-2)

Version 2018-04-13

Prepared by:

Nigel Swier, Frantisek Hajnovic (ONS, UK)

Thomas Declite (StatBel, Belgium)

Martina Rengers, Chris-Gabriel Islam (DESTATIS, Germany)

Ingegerd Jansson, Dan Wu, Suad Elezovic (SCB, Sweden)

Crt Grahonja (SURS, Slovenia)

Christina Pierrakou, Eleni Bisotti (ELSTAT, Greece)

Maxime Bergat, Alexis Eidelman (DARES, France)

Rui Alves, Maria-Jose Fernandes (INE, Portugal)

ESSnet co-ordinator:

Peter Struijs (CBS, Netherlands)

[email protected]

telephone: +31 45 570 7441

mobile phone: +31 6 5248 7775

Contents

Executive Summary
1. Introduction
 1.1 Participation
 1.2 Format of the report
 1.3 Facts about OJV data
2. OJV Data Use Cases
 2.1 Improving current job vacancy statistics
 2.2 Classifying data from text descriptions
 2.3 Measuring OJV coverage
 2.4 Time series analysis
 2.5 Data driven analysis
 2.6 Other potential use cases
3. Data Access
 3.1 Introduction
 3.2 Direct web scraping
 3.3 Arranged access
 3.4 Summary
4. Data Handling and IT
 4.1 Data storage and data handling software
 4.2 Data cleaning and de-duplication
 4.3 Text Analysis and Classification
 4.4 Flow to Stock transformation
 4.5 Conclusion
5. Methodology
 5.1 Definitions
 5.2 Quality Assessment Frameworks
 5.3 Measuring Coverage
 5.4 Matching and linking
 5.5 Time series analysis
6. Statistical Outputs
 6.1 Estimates of on-line job vacancies
 6.2 Indicators or nowcasts of labour market activity based on OJV data
 6.3 Geographic indicators
 6.4 Concluding Remarks
7. Future Perspectives
References
Annex A: Belgium
Annex B: France
Annex C: Germany
Annex D: Greece
Annex E: Portugal
Annex F: Slovenia
Annex G: Sweden
Annex H: United Kingdom

Executive Summary:

Nine National Statistics Institutes (NSIs) within the European Statistical System (ESS) have been investigating the feasibility of using online job vacancy (OJV) data in the production of official statistics. OJV data contain information that is not generally collected by job vacancy surveys (JVSs), such as the occupations of advertised jobs, associated skills and their location. OJV data also offer the possibility of more frequent and near real-time data on the labour market. However, there are some important limitations of OJV data:

Not all job vacancies are advertised on-line and some types of jobs are more likely to be advertised than others.

There is no definitive source of OJV data. It is generated and managed by various and mostly commercial actors.

Data about on-line job ads usually contain a mix of structured and non-structured elements, but the specific structure and variables may vary between sources.

Some job ads are out of scope of the official definition of a job vacancy (e.g. student internships, international jobs, ghost vacancies).

The official definition of a job vacancy does not correspond directly to the concept of a live job ad. Critically, a vacancy will usually persist after the advertisement closes.

The specific OJV data landscape varies considerably between countries, for example in terms of the number and type of portals and the use of on-line platforms. There may also be differences in the role of the National Employment Agency and in what type of information is contained in job ads. There may also be legal differences, and finally, processing will often require language-specific solutions.

In summary, OJV data is not representative of the overall labour market and there are various definitional issues that make it difficult to compare directly with official statistics.

There are different routes for accessing data. Broadly, these are direct web scraping (of either job portals or enterprise websites) or arranged access (e.g. with a job portal, the National Employment Agency, or commercial suppliers). There may be good reasons for direct web scraping, depending on the specific aims of a project. However, in general the approach should be to arrange access to data that has already been collected. Apart from the technical and legal issues of web scraping, there is also the problem that it will take a long time to generate a sufficient time series to properly evaluate the data. Acquiring data directly from data owners helps circumvent these problems.

This work package has developed a close working relationship with the European Centre for the Development of Vocational Training (CEDEFOP). CEDEFOP is developing a web scraping system covering all Member States, and there is agreement that this should also aim to serve the long-term needs of the ESS. NSIs should therefore generally avoid investing heavily in developing web scraping approaches, since OJV data is expected to become widely available to EU member states via CEDEFOP by the end of 2020.

Several partners have had some success in using machine learning to derive structured variables (e.g. NACE and ESCO codes) from either structured variables (e.g. from advertised job titles) or the whole text of the job advertisement. However, these methods are imperfect and often only work well for some categories.

Various comparisons have been made between OJV and JVS data, including: total vacancies, comparisons by industry sector (by NACE), and comparisons of vacancy counts by enterprise. The results of these analyses have been mixed, with some comparisons working reasonably well and others showing only a very loose relationship between the OJV data and the JVS.

Slovenia has come closest to producing an end-to-end pipeline for producing estimates of OJV ads that can be approximately compared with official estimates. This suggests that only about 40% of all Slovenian job vacancies are published on-line. Although total on-line coverage may be better in other countries, there are issues that would make the Slovenian approach difficult to replicate for larger countries. One issue is the greater number of important portals. An even greater problem is that various matching tasks (e.g. deduplication, and matching of OJV data to business register or survey data) become more difficult at a larger scale.

One area that shows some promise is to use the time series properties of OJV data to improve existing statistics. The pilot has had modest success in predicting survey values using OJV data, so these data could be used for producing flash estimates. It may also be possible to use these time series properties to produce more frequent estimates, or even possibly reduce the frequency of the survey. Other possibilities include statistics on occupations, required skills and labour demand in local areas.

We conclude that OJV data cannot be used to directly replace existing surveys. Indeed, the quality issues are such that it is not clear whether these data could be integrated in a way that would enable them to meet the standards expected of official statistics. On the other hand, it is clear that OJV data can provide insight that official estimates do not. This means that, as well as continued methodological development, there is a need to address the presentational challenge of how OJV data should be interpreted and used together with official figures.

1. Introduction

This report is an update on the work of the Big Data ESSnet WP1 (Web Scraping for Job Vacancy Statistics) for the SGA-2 period, which ran from August 2017 to the end of May 2018. The previous phase of the project (SGA-1) ran from February 2016 to July 2017 and delivered the following:

i. Qualitative Assessment of Job Portals (delivered July 2016)[footnoteRef:1] [1: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/File:Deliverable_1_1_final.docx]

ii. Interim SGA-1 Technical Report (delivered December 2016)[footnoteRef:2] [2: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/6/64/WP1_Deliverable_1_2_final.pdf]

iii. Final SGA-1 Technical Report (delivered July 2017)[footnoteRef:3] [3: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/File:Deliverable_1_3_main_report_final_1.0.pdf]

1.1 Participation

Six countries participated in the work package for SGA-1:

Germany (DESTATIS)

Greece (ELSTAT)

Italy (ISTAT)

Slovenia (SURS)

Sweden (SCB)

United Kingdom (ONS)

Four countries joined the work package for SGA-2:

Belgium (Statbel)

Denmark (DST)

France (DARES)

Portugal (INE)

Denmark had to withdraw soon after the start of SGA-2 due to a lack of staff. Italy's involvement throughout WP1 has been limited to collaborating on the use of the methods developed by a separate ESSnet work package (WP2 - Web scraping for enterprise statistics), led by ISTAT. This work package is of interest to WP1 because of the potential to use these techniques to collect data about jobs advertised on enterprise websites. There has also been some collaboration between WP1 and WP2 on legal issues around web scraping, which was published as a WP2 deliverable[footnoteRef:4]. [4: Stateva et al (2016), Legal aspects related to web scraping enterprise websites (Section 4, p.17)]

1.2 Format of the report

The purpose of this report is two-fold. The first is to report to Eurostat on the work achieved during SGA-2. The second is to provide information to assist future projects within the ESS and the wider official statistics community that aim to use on-line job vacancy (OJV) data. Two main factors have guided the presentation of this report.

First, OJV data is highly heterogeneous and the data landscape varies considerably between countries. Some countries have much larger and better developed on-line channels than others. Also, while official job vacancy statistics are subject to EU regulation, there are some differences between countries in terms of the range of variables, the frequency of the survey, and the availability of microdata. This means that approaches to data validation and integration may need to differ. In addition, legal barriers constraining an NSI in one country may not be an issue in others. Finally, a lot of the research in this pilot involves text processing, which is language-specific and constrains the reusability of prototype solutions. Consequently, each country in this pilot has needed to find its own path, and the work of each country naturally forms a distinct case study.

The second main factor influencing the structure of this report is that the work undertaken for SGA-2 is to a large extent a continuation of the work done during SGA-1. This creates a certain challenge: the additional work undertaken in SGA-2 needs to be presented in a way that is coherent yet avoids unnecessary repetition.

For these reasons, the main part of this report is written in the form of a summary and guide. It attempts to summarise the findings of the country-based studies along with recommendations and advice for NSIs wishing to do similar work. The guide then links to the relevant country case studies in a series of annexes, which provide more detail. Links are also provided to relevant material in the SGA-1 reports and other studies where necessary to minimise repetition.

As part of this introduction, we set out some basic facts about OJV data that need to be understood before starting any project.

1.3 Facts about OJV data:

i) Not all job vacancies are advertised on-line:

Although there is a general trend towards more job vacancies being advertised on-line, many continue to be filled through traditional channels, such as newspapers, employment agencies (who may or may not advertise on-line), noticeboards, or personal contacts (Carnevale, 2014). The results of the Slovenian study within the WP1 pilot suggest that only about 40% of all job vacancies in Slovenia are advertised on-line. A similar figure has been reported for France, although the proportion may be higher in other countries. In addition, some types of jobs are more likely to be advertised on-line than others. This means that OJV data is not only missing many jobs, but is also not representative of the overall job market.

ii) There is no definitive source of OJV data

Although the situation varies between countries, OJV data is generally characterised by multiple job portals, with different business models and complex interplay between them. Job boards only publish original ads uploaded by employers. Job search engines republish ads from other portals. Hybrid job portals are a combination of both. Some portals advertise many different types of jobs while others specialise in specific sectors. In addition, new job portals may appear while existing portals may decline in importance. All these factors mean that it is very difficult to capture all jobs advertised on-line and to reliably measure labour market trends in the real world. It also means that the complete set of all jobs advertised in a country will contain a lot of duplicates.
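Since the same ad is often republished across several portals, de-duplication is usually an early processing step. A minimal sketch in Python, which collapses ads on a fingerprint of a few normalised fields (the field names are illustrative, and real pipelines would also compare the full ad text, since republished ads are often edited):

```python
import hashlib

def ad_fingerprint(title, employer, location):
    # Normalise case and whitespace so trivial variants collapse to one key.
    key = "|".join(" ".join(str(field).lower().split())
                   for field in (title, employer, location))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def deduplicate(ads):
    # Keep the first ad seen for each fingerprint.
    seen, unique = set(), []
    for ad in ads:
        fp = ad_fingerprint(ad["title"], ad["employer"], ad["location"])
        if fp not in seen:
            seen.add(fp)
            unique.append(ad)
    return unique
```

Exact fingerprinting misses near-duplicates with reworded titles; those require fuzzy matching or text-similarity measures.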

iii) Data about on-line job ads usually contain a mix of structured and non-structured elements

Most on-line job advertisements have some structured elements that are separate from the full text of the job ad: typically the job title, job location and employer name. Further information, such as skills and education requirements, is contained in the full text of the job description, but needs to be extracted and converted into structured elements using natural language processing (NLP) and classification algorithms. Variables such as occupation and industry code will usually also be derived using text analysis and machine learning. Data derived in this manner will inevitably contain some processing errors.

iv) A job ad is only a proxy measure for the existence of a job vacancy within a company

There are additional factors that make it challenging to relate on-line job ads to established statistical concepts and definitions, such as those used by the Job Vacancy Survey (JVS):

Some ads may not represent an in-scope job vacancy. These include non-existent vacancies (referred to as ghost vacancies), international jobs, and non-paying student internships.

Some ads may be advertising more than one vacancy. The number of vacancies may sometimes be specified in the job ad, but often it is not. Even when the number is explicit, this is usually contained within the unstructured part of the job ad and is difficult to extract.

The Job Vacancy Survey (JVS) is a stock estimate of the number of vacancies for which businesses are actively seeking recruits from outside their organisation. Online job ads represent a flow of new vacancies but usually do not contain explicit information on when the recruitment activity will be concluded. Therefore, direct comparisons between these sources require some assumptions about how long a company will be actively recruiting once an advertisement is published.

Different levels of information will be available for different vacancies. For example, vacancies advertised through agencies will usually not have the name of the employing business. This will both affect the quality of industry coding of job ads as well as the quality of any linkage with survey units.
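The stock/flow mismatch described above can be made concrete with a toy calculation: if one assumes that every vacancy remains open for a fixed number of days after its ad is published, the stock on a reference day is simply the count of ads posted within that window. A sketch in Python (the fixed-duration assumption is a strong simplification of real ad lifetimes):

```python
from datetime import timedelta

def stock_on_day(posting_dates, reference_day, assumed_duration_days=30):
    # Count ads posted within the assumed-open window ending on the
    # reference day; each ad is treated as exactly one vacancy.
    window_start = reference_day - timedelta(days=assumed_duration_days)
    return sum(1 for d in posting_dates if window_start < d <= reference_day)
```

In practice the duration assumption drives the result, so a sensitivity analysis over plausible durations is advisable.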

General recommendations:

All these factors together make it very challenging to use OJV data to produce estimates about job vacancies with known error characteristics. Indeed, a key conclusion of this work package is that OJV data does not on its own provide a complete picture of labour demand. It may be possible to use these data to measure trends. However, even this requires caution as there is no easy way of separating underlying trends in the labour market from the trends in how jobs are being advertised. Therefore, any analysis of OJV data needs to be sense checked and where possible, validated against other data sources (including other OJV sources).

Since the specific characteristics of the on-line job market vary between countries, it is recommended that NSIs start with a landscaping exercise. This should involve the collection of information about the national job portal market such as the total number of portals, a more detailed survey of the largest portals, the role of the National Employment Agency (NEA), and any other relevant information about the national on-line job market.

NSIs within the EU should first contact the European Centre for the Development of Vocational Training[footnoteRef:5] (CEDEFOP). They have already undertaken a comprehensive landscaping exercise for all EU member states and will be willing to share this information. They should also be able to provide the contact names of the national experts involved in these landscaping activities. [5: http://www.cedefop.europa.eu/en/about-cedefop/contact-us]

2. OJV Data Use Cases:

2.1 Improving current job vacancy statistics:

In considering applications of OJV data, an obvious starting point is to consider how these data could be used to improve current job vacancy statistics. Official job vacancy statistics are subject to EC regulation No. 453/2008 and are collected primarily for the purposes of calculating the job vacancy rate, a key measure of labour market tightness. This harmonised approach enables these statistics to be compared between EU member states. Further details about the regulation and its definitions are provided in Section 5.1.

The key differences between what is required for official job vacancy statistics for EU member states and what is available within OJV data are summarised in Table 1.

Table 1: Differences between official survey-based job vacancy statistics and OJV data

Dimension | Official estimates (JVS based) | OJV data
Frequency | Quarterly | Real-time
Industry sector | Yes | Yes
Enterprise size | Yes | Yes
Job title / Occupation / Skills[footnoteRef:6] | No | Yes
Sub-national | No | Yes
National totals (estimates) | Yes | No

[6: A survey of employer skills for all EU member states was carried out in 2014. Some member states have their own skills surveys, but these are usually infrequent.]

The EC regulation requires estimates to be published every quarter and there is typically a lag of several weeks or months between the reference day and the publication date. In contrast, OJV data could potentially be used to produce high frequency statistics in near real-time. In terms of variables, the JVS collects data about industry sector and enterprise size, both of which can be derived (at least to some extent) from OJV data.

A key advantage of OJV data is that it contains information about job vacancies that are not mandated by the regulation, yet are often requested by users. This includes data on occupations in demand, information about skills as well as location information which could be used to gain insight into local labour markets. However, as explained, while the JVS produces estimates based on representative survey samples, online job ads are a highly selective subset of all jobs advertised by employers. Since OJV data does not cover the full labour market, it cannot be used to directly replace the JVS.

The final SGA-1 technical report proposed a theoretical outline of how OJV and JVS data could be integrated by linking data at the enterprise level and then constraining to totals from the JVS (Swier et al, 2017, p.15). The aim would be to produce data that included more variables and granularity, but was also consistent with official estimates. However, the work undertaken during SGA-2 casts doubt on the feasibility of this approach since the job vacancy count trends between OJV and JVS data at the company level are often too volatile to produce reliable scaling factors.
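Linking at the enterprise level requires matching scraped employer names to business register names, which is itself error-prone, particularly at scale. A minimal similarity-based sketch in Python (the normalisation rule and threshold are illustrative; production record linkage would also use addresses and business identifiers):

```python
from difflib import SequenceMatcher

def best_register_match(employer_name, register_names, threshold=0.85):
    # Crude normalisation: lower-case, drop one common legal suffix,
    # collapse whitespace. Real pipelines use richer standardisation.
    def norm(name):
        return " ".join(name.lower().replace("ltd", "").split())

    # Score every register name and keep the best; reject weak matches.
    scored = [(SequenceMatcher(None, norm(employer_name), norm(r)).ratio(), r)
              for r in register_names]
    score, name = max(scored)
    return name if score >= threshold else None
```

The threshold trades false links against missed links, and agency-posted ads without an employer name cannot be linked at all, as noted above.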

One possibility could be to reduce the number of survey collections in order to cut costs. For example, the quarterly survey could become an annual survey, with OJV data being used to estimate the remaining three quarters. Belgium has done some concrete work in this area, but the findings were not ready in time for this report, so the feasibility remains unclear. The Belgian country report instead focuses on the feasibility of deriving NACE group codes from OJV data to replace the data collected from the survey (see Annex A).

Although OJV data has some distinct advantages, realising these improvements for official statistics purposes is very challenging for the reasons described in Section 1. There is not yet a clear pathway for using these data to produce statistics that meet the quality standards expected for official statistics. Slovenia have been able to produce some estimates of the number of on-line job vacancies, but these are not comparable with the statistics produced by the JVS.

For these reasons, it is important to be mindful of the limitations of OJV data and to have realistic objectives of what is possible. It is recommended that NSIs focus on specific problems that would help with the production of experimental type indicators, with integration into statistical production being viewed as only a possible longer-term goal. The rest of this section suggests some broad research areas and some more specific use cases.

2.2 Classifying data from text descriptions:

On-line job ads contain a combination of structured and unstructured text. Variables such as the job title, location and employer/agency name are usually stored as separate fields, while the full text description of the job ad will contain a wide range of different types of information. Another issue is that location information will usually not conform to standard geographical units. Therefore, the data needs to be classified in some way before it can be analysed. Usually, these classifications will take the form of a recognised nomenclature such as ESCO (for occupation and skills) or NACE (economic activity). However, OJV data could also be classified using unsupervised or semi-supervised clustering models, which do not use a pre-defined structure (e.g. Djumalieva et al., 2018). In all cases, this processing involves some combination of text pre-processing and classification methods, typically involving machine learning.
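As a concrete illustration of the supervised route, a classifier can be trained on a labelled sample of job titles and then applied to new titles. The sketch below scores occupation codes by simple token overlap with the training titles; the codes and titles are invented, and a real system would use proper tokenisation, feature weighting and a trained machine learning model:

```python
from collections import Counter, defaultdict

def train_title_classifier(labelled_titles):
    # Count how often each token appears in the titles of each code.
    token_counts = defaultdict(Counter)
    for title, code in labelled_titles:
        token_counts[code].update(title.lower().split())
    return token_counts

def classify_title(title, token_counts):
    # Score each code by the summed training frequency of the title's
    # tokens, and return the best-scoring code.
    tokens = title.lower().split()
    scores = {code: sum(counts[t] for t in tokens)
              for code, counts in token_counts.items()}
    return max(scores, key=scores.get)
```

Even a scheme this simple tends to work well for frequent, distinctive titles and poorly for rare or ambiguous ones, which mirrors the mixed per-category results reported by the partners.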

This is an aspect that has been explored in some considerable depth during SGA-2. Further details are provided in section 4.3.

2.3. Measuring OJV coverage:

The most important methodological challenge in using OJV data for official statistics is to understand the differences between what is represented in these data and what is measured by the JVS. WP1 has explored three different ways of doing this:

Micro-level comparisons with the JVS

Aggregate comparisons with the JVS

A survey of advertising channels

This is discussed further in Section 5.3.

2.4 Time series analysis:

Another strand of analysis involves using the time series properties of OJV data and exploring how they relate to official job vacancy estimates. In this ESSnet, Sweden and the UK have considered the potential of using the near real-time availability of OJV data to produce nowcasts (or flash estimates) of job vacancies. A time series approach might be particularly useful for predicting turning points in the economy. This is discussed further in Section 5.5.
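The basic idea behind such a nowcast can be sketched as a simple regression of the official quarterly estimate on a concurrent OJV count, fitted on past quarters and applied before the next survey result is available. A minimal illustration with invented numbers (a real model would also handle seasonality, changes in portal composition, and revisions):

```python
def fit_nowcast(ojv_counts, jvs_estimates):
    # Ordinary least squares for jvs ~ a + b * ojv, in closed form.
    n = len(ojv_counts)
    mean_x = sum(ojv_counts) / n
    mean_y = sum(jvs_estimates) / n
    b = (sum((x - mean_x) * (y - mean_y)
             for x, y in zip(ojv_counts, jvs_estimates))
         / sum((x - mean_x) ** 2 for x in ojv_counts))
    a = mean_y - b * mean_x
    return a, b

def nowcast(ojv_count, a, b):
    # Flash estimate for a quarter where only the OJV count is observed.
    return a + b * ojv_count
```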

2.5 Data driven analysis:

Traditionally, the development of official statistics has been driven by clearly defined user needs. Data is collected that meets these needs as closely as possible, either directly through a survey, indirectly through administrative data sources, or possibly some combination of both. In future, big data sources such as OJV data are expected to become integrated in some way into the activities of NSIs.

Although it is expected that official statistics will continue to be led by clearly defined needs, the complex nature of big data means that some analysis could be more data driven. Big data in combination with data science offers the opportunity to identify 'unknown unknowns': new insights of policy relevance, including into the quality of current statistics, that may not previously have been considered.

For example, analysis of OJV data shows that new job advertisements in the UK are less likely to be published on-line on a Friday than on other weekdays. This may be of some relevance because the survey is run monthly and the reference day is always the first Friday of the month. This kind of analysis could help inform the survey operation and estimation methodology.
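Exploratory checks of this kind are straightforward once ad posting dates are available. A sketch of the weekday tabulation in Python (the dates used in any real analysis would come from the scraped data):

```python
from collections import Counter

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def ads_by_weekday(posting_dates):
    # Tabulate how many new ads were first seen on each day of the week.
    counts = Counter(WEEKDAYS[d.weekday()] for d in posting_dates)
    return {day: counts.get(day, 0) for day in WEEKDAYS}
```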

2.6 Other potential use cases:

2.6.1 International Labour Market

There is potential to use OJV data to produce new statistics on aspects of the labour market that are excluded from current statistics. One example is international jobs, namely job advertisements for locations outside the country in which the job portal is based. This could itself be a useful measure of labour market tightness and could help identify types of vacancies that are difficult to fill. It may also be that advertisements of this nature are particularly likely to be advertised on-line.

2.6.2 Identification of new job titles

Occupation classification coding frames require regular maintenance to ensure that they capture new job titles, and OJV data is an excellent source of such information. It has already been used in the UK at a small scale to update the coding frame for the national occupation classification (UKSOC). There is also some consideration as to whether OJV data could be used to support maintenance of the classification itself by reflecting up-to-date information about job titles and skills in the labour market.

3. Data Access

3.1 Introduction:

When embarking on an OJV data project the options for accessing data need to be considered. The approaches to accessing on-line job advertisement data can be divided into two broad types: direct web scraping and arranged access. The most appropriate type of access depends on exactly how the data will be used. The following questions may help in guiding decisions:

How much data is needed?

Can the data be a one-off supply or will it be needed on an ongoing basis?

If required on an ongoing basis, is it required in real (or near real time)?

Is historical data needed?

Is the complete job advertisement required or just aggregated data?

Is the aim to combine the OJV data with other data (e.g. survey data)?

However, it is also important to consider practical issues around data access, such as:

Does your organisation permit web scraping and do your IT systems support it?

Does the project team have to develop, and if necessary maintain, a web scraping system?

How much project resource should be dedicated to web scraping compared to data processing and analysis?

How easy is it to access job vacancy data that is already available?

In general, obtaining sample data from one job portal through web scraping is relatively quick and easy. However, scaling up to include regular scraping of multiple websites and related pre-processing (e.g. de-duplication) can very quickly consume resources. Therefore, it is recommended that more substantial projects consider the feasibility of accessing OJV data that already exists from a job portal or other data provider.

3.2 Direct web scraping:

Direct web scraping involves using web scraping techniques to collect data from on-line sources without an explicit data access agreement with the website owner. Target websites may include either job portals or enterprise websites. The main advantage of direct web scraping is that samples of data can be captured and analysed quickly, using either simple point and click web scraping tools or programmable web scraping tools. Direct web scraping also offers a high degree of control over what data is collected and how, and provides an opportunity to produce data in near real-time.

3.2.1 Point and click web scraping:

Point and click web scraping tools have user interfaces that are designed to build web scraping robots without the user needing to write any code. A robot is built through the point and click actions of the user, who highlights and labels web page elements of interest to train the robot to recognise the page layout. The point and click tools used during the WP1 pilot include Import.IO[footnoteRef:7] and Content Grabber[footnoteRef:8]. These proved very easy to use and effective for small scale data collection. One problem encountered during the pilot was that Import.io changed their business model, so that functionality that was previously free of charge became a paid-for service. Another potential problem with Import.IO is that scraped data is physically held on their servers. However, the main limitation is that these tools are designed to stand alone and cannot be easily integrated into a production pipeline. Therefore, point and click tools are best suited for small scale experimentation only. [7: https://www.import.io/] [8: https://contentgrabber.com/]

3.2.2 Programmatic web scraping

Programmatic approaches to web scraping involve developing and deploying web scraping robots, usually using packages such as Python Scrapy[footnoteRef:9] or Apache Nutch. These packages require some programming skills, but offer much better control and pipeline integration. For example, a robot can be deployed to scrape on a regular basis and load new data into a database for further processing. While many web scraping projects will be able to meet their requirements with just one of these packages, scraping some websites may require additional tools. For example, on some websites additional content is loaded through scrollbar interaction, and accessing their full content requires web site automation tools (e.g. Selenium) and a more advanced knowledge of software development. In contrast, some job portals[footnoteRef:10] provide a public API, in which case job advertisement data can be accessed without any specific knowledge of web scraping. [9: This was the most widely used web scraping package within this pilot] [10: Examples in the UK include Adzuna and Universal Job Match.]
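Whatever package is chosen, the core task is extracting labelled fields from the page markup. A stripped-down illustration using only the Python standard library; the "job-title" class name is hypothetical and differs for every portal (frameworks such as Scrapy replace this boilerplate with CSS or XPath selectors):

```python
from html.parser import HTMLParser

class JobTitleParser(HTMLParser):
    # Collects the text of elements carrying the (hypothetical)
    # class="job-title" attribute from a portal's listing page.
    # Nested tags inside a title are not handled in this sketch.
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if ("class", "job-title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())
```

Calling `feed()` with the downloaded page HTML leaves the extracted titles in the parser's `titles` list.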

3.2.3 Web scraping enterprise websites

For the most part, job portals will be the target websites for scraping OJV data. However, as discussed in the introduction, a specific objective of WP1 SGA-2 was to further explore the feasibility of collecting job vacancy information from enterprise websites. The thinking is that this approach could have advantages over web scraping job portals as it would both avoid duplication and produce data that could be assumed to be the most accurate on-line measure of job vacancies by enterprise. The proposal was to explore the feasibility of using the overall framework for producing statistics from enterprise websites developed by WP2 and to apply it to this use case.

WP2 established that it is possible to crawl enterprise websites and identify those that advertise their own job vacancies. However, reliably creating structured data, including identifying individual advertisements, from the relevant web pages is much more challenging. Work by Slovenia shows that it is possible to identify individual job ads for some enterprise websites but the highly variable nature of website design makes it very difficult to do this on a more representative basis.

A more brute force approach for collecting data from enterprise websites was also explored by the UK. This involved manually developing a small army of mini-robots to capture counts of vacancies from selected enterprise websites, using a framework of reusable components that could be assembled and applied to different enterprise websites depending on the specific website design. A test involving 150 websites showed that this approach of using reusable components allowed robots to be assembled relatively quickly and used to scrape vacancy counts for just about any type of website. However, ultimately the overall effort required to create and maintain the robots does not make this a viable approach for large scale data collection.
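As an illustration of the reusable-component idea (the site names, patterns and counts below are hypothetical, not the UK implementation), each mini-robot can be assembled from a small library of generic extractors configured for one site's layout:

```python
import re

# Generic, reusable extractor components
def count_by_regex(pattern):
    """Count vacancies by counting occurrences of a site-specific HTML pattern."""
    return lambda page_html: len(re.findall(pattern, page_html))

def count_from_header(pattern):
    """Read a vacancy count printed directly on the page, e.g. '7 open positions'."""
    def extractor(page_html):
        match = re.search(pattern, page_html)
        return int(match.group(1)) if match else 0
    return extractor

# Each mini-robot is a component configured for one website's design
ROBOTS = {
    "site-a.example": count_by_regex(r'class="vacancy"'),
    "site-b.example": count_from_header(r"(\d+) open positions"),
}

print(ROBOTS["site-b.example"]("<h1>7 open positions</h1>"))  # 7
```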

3.2.4 Web Scraping Legal Issues

The relevant legal issues are well documented in the WP2 report on web scraping (Stateva et al., 2016[footnoteRef:11]). The main issues to be aware of are that many websites have restrictions on what content may be scraped, and many prohibit web scraping entirely. The specific legal risk is around sui generis database rights, which are a form of ownership right pertaining to data that apply when scraping substantial parts of a website[footnoteRef:12]. However, NSIs should also consider the ethical and reputational risks of web scraping. These risks can be managed by following principles of web scraping netiquette, such as respecting the robots.txt exclusion protocol. [11: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/6/67/WP2_Deliverable_2_1_2017_02_15.pdf] [12: ]
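Respecting the robots.txt exclusion protocol can be automated. A sketch using Python's standard urllib.robotparser module (the rules and the 'nsi-statbot' user-agent name are illustrative; the rules would normally be fetched from the site's /robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("nsi-statbot", "https://example.org/jobs/123"))   # True
print(rp.can_fetch("nsi-statbot", "https://example.org/private/x"))  # False
print(rp.crawl_delay("nsi-statbot"))  # 10 (seconds to wait between requests)
```

A well-behaved robot checks can_fetch() before every request and honours any crawl delay.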

Despite efforts to develop a consistent approach within the ESSNet, it is clear that legal departments in different NSIs have varying opinions about these matters. Ultimately, projects need to ensure that web scraping is compliant with their own organisational policies.

3.3 Arranged access:

This general approach to data access involves an explicitly agreed data access arrangement with either the website owner, or other organisations holding these data including job portals, employment agencies (government and private) and data aggregators.

There are several advantages of accessing data through an explicit agreement with owners of job vacancy data rather than through web scraping. The most important is that developing and maintaining web scraping robots can be a resource-intensive activity that ties up scarce data science resources. Another important advantage is that it removes any uncertainty over legal issues in accessing and using the data. An explicit agreement may also offer a route for accessing historical data, which are rarely available through web scraping since job ads are normally removed once they expire. Finally, the data owner may also be able to provide metadata or other insights into the methods used to collect and process the data.

3.3.1. National Employment Agencies (NEAs):

Several countries in this pilot (i.e. Sweden, France, Belgium, Germany, Slovenia) managed to gain access to OJV data from their NEA. These agencies are typically the largest single source of OJV data. In some countries certain types of jobs (e.g. in the public sector) are required by law to be advertised by the NEA. Often, NEA OJV systems use common enterprise identifiers, which greatly simplifies the process of linking to business registers and survey reporting units. Another important advantage is that access arrangements are less likely to be complicated by commercial considerations. Therefore, a recommended first step for NSIs looking to acquire data is to explore the possibility of gaining access to job vacancy data from the NEA.

3.3.2 Private Job Portal Owners

Several countries (i.e. Sweden, UK, Slovenia) managed to secure data supply arrangements with at least one private job portal owner. When approaching these organisations, it is important to consider what they might want out of a data supply arrangement and whether this is consistent with the policies of your organisation. Motivations could include payment for data, payment for data services or some in-kind benefit. An example of the latter could be an offer to link their data to JVS data to provide an aggregate report showing some new insight into the coverage of their data. The main motivation for private job portals agreeing to partner with the NSI seems to be part of corporate social responsibility, but they are also likely to be interested in the publicity benefits of their data being used by the NSI.

In the UK, procurement rules required an open tender process to ensure that all potential providers had an equal opportunity to reap these benefits. This was required even though the tender made it clear that ONS policy is not to pay for data. Multiple teams needed to be involved to support this process, including finance and procurement, and the press office. The UK has recently established a commercial data acquisition team, which has greatly helped with navigating the correct process. The process in Sweden and Slovenia was less formal; it is therefore important to establish what steps are required by your own organisation.

Slovenia also undertook some specific engagement with private employment agencies. Many job advertisements are placed by employment agencies where the actual employer is not identified. This is a major problem for any analysis involving the linking of OJV to survey data, and indeed is also a problem for correctly classifying job vacancies by economic activity. This engagement process resulted in some modest success in obtaining additional information. However, the sheer number of private employment agencies in some countries[footnoteRef:13] means that this is not always a practical option. [13: For example, the UK has over 14,000 employment agencies.]

3.3.3. Commercial suppliers of OJV data:

There are several international companies that scrape job advertisement data from the web, then process and enrich it to provide commercial data and analytical services (e.g. Textkernel, Burning Glass, CEB Talent Neuron). The advantage of these sources is that the time-consuming processing needed to prepare the data for analysis (e.g. deduplication, cleaning and enrichment) will have already been done. The disadvantages are that these methods will usually not be transparent and some form of payment will normally be expected. Another limitation is that these products are often only available in larger job markets (e.g. Germany and the UK). The UK pilot managed to negotiate access to UK Burning Glass data without direct payment, although the specific circumstances may make this difficult to replicate elsewhere. For this reason, this is not generally considered a viable approach.

3.3.4 CEDEFOP:

In 2015, the European Centre for the Development of Vocational Training (CEDEFOP) funded a web scraping pilot of selected job portals in five EU member states (Germany, Italy, UK, Ireland, Czech Republic). The pilot included the development of an experimental system to clean the data, remove duplicates and classify the ads for analysis. A detailed assessment of the 2015 pilot data is made by Germany (Annex C, Section 3).

In early 2017, CEDEFOP launched a second phase, which plans to develop a system for collecting on-line job advertisement data for all EU member states. In Spring 2018, a system of ongoing web scraping and processing will start, initially for eight countries, with the first data becoming available by the end of 2018. This is then expected to be extended to all member states by the end of 2020.

It is becoming clear that different public organisations, including different parts of the European Commission, have an interest in OJV data. It is also clear that it is not efficient to have duplicate systems to collect and process these data for different purposes. An agreement is now in place between CEDEFOP and Eurostat to facilitate collaboration and to ensure that the statistical requirements of the ESS are considered in the development of the CEDEFOP system. Expertise in the areas of quality and statistical measurement means that the ESS could play an important role in the appropriate use of these data for policy purposes. In March 2018, a joint validation workshop was held between representatives from the ESSNet, Eurostat, CEDEFOP and their contracting partners from the University of Milan to help achieve alignment with the requirements of the ESS.

It is important that any NSIs wanting to work with OJV data take account of these developments as part of their long-term planning. Specifically, it is recommended that NSIs avoid committing too many resources to the development of web scraping, data cleaning, deduplication and data enrichment methodologies, since in the long term CEDEFOP data is likely to become the main source of OJV data to support the statistical activities of the ESS.

3.4 Summary

In general, options for acquiring data directly from suppliers should be explored before direct web scraping. Exploring options for acquiring data from the NEA is a good place to start. Depending on the specific aims of the project, there may be good reasons for direct web scraping. Nevertheless, NSIs should avoid investing too heavily in web scraping and data processing, since processed OJV data is expected to become available to all EU member states via CEDEFOP by the end of 2020.

All countries in the WP1 pilot have managed to gain some form of access to job portal data, either through direct web scraping, agreed access or both. The various avenues to data access by each country are summarised in Table 2.

Table 2: Investigation of On-line Job Vacancy Sources by Country

| Country | Enterprise websites (direct web scraping) | Job portals (direct web scraping) | National Employment Agency (agreed access) | Private employment agencies (agreed access) | Other data aggregators (agreed access) |
|---|---|---|---|---|---|
| Germany | | Yes | Yes | | Yes (CEDEFOP) |
| Greece | | Yes | | | |
| Slovenia | Yes | Yes | Yes | Yes | Yes |
| Sweden | | | Yes | Yes | |
| United Kingdom | Yes | Yes | | Yes | Yes (CEDEFOP, Burning Glass) |
| France | | Yes | Yes | | |
| Belgium | | Yes | Yes | Yes | |
| Portugal | | Yes | | | |

4. Data Handling and IT

4.1 Data storage and data handling software

For most countries in this pilot, the volumes of data that have been handled so far are not very large and can be processed on a single machine. For this reason, big data IT solutions have not been explored in any depth. The UK pilot is using NoSQL data storage (i.e. MongoDB), mainly because it had been established for other web scraping projects. The main advantage of this approach is that it will provide scalability if greater volumes of data are required in future. The UK web scrapers are hosted on a Google compute platform, as this cannot currently be done from the main network.

Most countries used Python and related packages for data handling and machine learning although Belgium used R and related packages.

4.2 Data cleaning and de-duplication:

The raw data obtained from a job advertisement typically requires a lot of cleaning and pre-processing prior to analysis. For example, job titles often contain extraneous information, such as job location, key skills, and salary. This is because employers try to attract the attention of potential job seekers by stacking the job title field with other key information. In addition, OJV data will often include data that may be considered out of scope, for example, jobs based in another country or ads for unpaid student internships. These ads should be identified and removed where possible.

Duplication of job ads is a key quality issue when combining data from multiple job portals. It can also be a problem within portals, particularly for job search engines, which aggregate job ads from other portals. Job search engines may take steps to remove duplicates, but the effectiveness of these procedures seems to be variable. Some duplicates within job search engines can be identified easily because the URLs linking back to the original ad will be identical. However, duplicates where a job ad has been posted on more than one job board are more difficult to identify.

De-duplication methods were explored as part of a virtual sprint held during SGA-1[footnoteRef:14]. These focused on matching common fields, comparing text content and then calculating a similarity metric to establish the likelihood that two job advertisements are the same. The first step is to prepare and standardize the data fields that are common to all data sources and that can be compared (i.e. job title, location, company name, date posted, job description). This involves text normalisation procedures such as removal of white spaces, case standardisation and removal of stop words or other extraneous text, typically using regular expressions (regex). [14: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/WP1_2016_07_28-29_Virtual_Sprint_Notes. These are also documented within the WP1 - interim technical report]
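A minimal sketch of this kind of normalisation, using only the standard re module (the stop word list is illustrative):

```python
import re

STOPWORDS = {"and", "the", "a", "an", "in", "for", "to"}  # illustrative list

def normalise(text):
    """Lowercase, strip punctuation/symbols, collapse whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation and symbols to spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated white space
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(normalise("Senior Developer (London) - £45,000 p.a.!"))
# senior developer london 45 000 p
```

Applying the same normalisation to every source makes the common fields directly comparable in the matching step.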

The next step involves calculating a similarity metric to identify likely duplicate job ads. One approach involved using Python Dedupe, which is designed to identify duplicate records using supervised learning methods. This makes an initial match using logistic regression and then identifies marginal cases for clerical resolution. The decisions of this clerical process are then fed back into the machine learning algorithm, to be applied for automated removal of duplicates. Other de-duplication methods focused on unsupervised probabilistic matching based on the similarity of text strings. Methods explored included Levenshtein distance and longest common substring distance, with Jaccard similarity performing best.
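For example, the Jaccard similarity between the token sets of two normalised job ads can be computed as follows (the ads and the duplicate threshold are illustrative):

```python
def jaccard(a, b):
    """Jaccard similarity between the token sets of two (normalised) strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

ad1 = "data scientist london permanent full time"
ad2 = "data scientist london full time"
score = jaccard(ad1, ad2)
print(round(score, 2))  # 0.83 -- above an (illustrative) duplicate threshold of 0.8
```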

The initial focus was on the structured data fields rather than the unstructured content of the full job description. This was mainly because the full description is often difficult to scrape from websites and often only a snippet of a certain number of characters is readily available. However, the full description would be needed to achieve a good quality de-duplicated data set, especially where there are many records.

Slovenia uses three main sources of data: i) jobs that are advertised through the National Employment Agency (ESS), ii) deduplicated job vacancies from the two largest job portals and iii) data scraped directly from enterprise websites. When combining these data there is a further de-duplication step, which involves matching each source to the business register and then using the sources according to a priority principle. ESS data is used first. If an enterprise is found in the job portal data but not the ESS data, then the job portal data is used. Last of all, vacancies found on enterprise websites are included where those enterprises cannot be found in the other two sources.
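The priority principle can be sketched as follows (the enterprise identifiers and vacancy counts are illustrative, not Slovenian data):

```python
# Vacancy counts per source, keyed by enterprise ID
nea_data     = {"SI001": 3, "SI002": 1}   # highest priority (ESS)
portal_data  = {"SI002": 2, "SI003": 4}
website_data = {"SI003": 1, "SI004": 2}   # lowest priority

combined = {}
for source in (nea_data, portal_data, website_data):  # priority order
    for enterprise, count in source.items():
        combined.setdefault(enterprise, count)        # keep highest-priority value

print(combined)  # {'SI001': 3, 'SI002': 1, 'SI003': 4, 'SI004': 2}
```

Each enterprise contributes vacancies from exactly one source, so no vacancy is counted twice across sources.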

4.3 Text Analysis and Classification

Classification of OJV data is an important and complex topic with many different dimensions. OJV data contains information that can be used to derive occupation (e.g. ESCO), the economic activity of the employer (e.g. NACE) and standard geographic units. Classification can also involve different elements of the job ad for related but distinct purposes. For example, it can be applied to matching occupation to job title, but also to predicting occupation based on the text of the full job description. Finally, while official nomenclatures need to be used for official statistics purposes, it is possible to apply unsupervised clustering methods to gain new insights into the data.

Broadly, the methods for assigning classifications to OJV data comprise of two main steps:

i) Text pre-processing

ii) Classification methods[footnoteRef:15] [15: While the first step clearly relates to data handling, the second could be considered more about methodology, which is covered separately in Section 5. However, they are covered together here as they are so tightly integrated.]

In the context of the ESS there is additional complexity because these approaches require language specific lists, look-ups and training data. This is a particularly relevant issue in countries where there is more than one official language (e.g. Belgium) or where the official language uses a combination of Latin and non-Latin scripts (e.g. Greece). In many countries it is common for jobs to be advertised in English, even where it is not an official language.

4.3.1 Text Pre-processing

Machine learning of text data requires some text pre-processing to ensure any algorithms work effectively. This typically involves the combination of different sub-processes:

Text standardization: This includes conversion to lowercase, removal of punctuation, repeated white spaces and non-alpha-numeric characters. This is normally implemented using regular expressions (regex), which is supported by all common programming languages.

Stop word removal: Stop words are common words (e.g. and, the, a) that do not convey meaningful information in context.

White/black lists: These are lists of words that are applied as filters to be allowed or disallowed into the processed dataset.

Stemming: This involves removing the end or the beginning of the word using a list of common prefixes and suffixes that are relevant to the language of the text. For example, the word making would be transformed to the stem mak.

Lemmatization: This involves taking into consideration the morphological analysis of words and requires more detailed dictionaries. For example, the word making would be transformed to the lemma make. TreeTagger is an example of a lemmatization tool that supports a number of European languages. Some lemmatization tools (e.g. Morphalou) can take account of different meaning depending on context. For example:

Il conduit un bus (He drives a bus) -> Conduire bus

Je bus une vodka (I drank a vodka) -> Boire vodka
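The stemming step can be sketched as naive suffix stripping (the suffix list is illustrative; real pipelines would use an established Porter or Snowball stemmer):

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative English suffixes, longest first

def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["making", "advertised", "vacancies", "jobs"]])
# ['mak', 'advertis', 'vacanci', 'job']
```

Lemmatization would instead map making to make and vacancies to vacancy, at the cost of needing language-specific dictionaries.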

4.3.2 Text classification methods[footnoteRef:16] [16: ]

Two broad classification methods were explored. A phrase based classification (PBC) approach was explored by Greece, while other countries explored machine learning approaches. Phrase based classification involves the creation of rules which trigger an action when a certain phrase is encountered. The main advantages are a high degree of precision and results which are easily explainable; a disadvantage is poor scalability to more complex classification problems. Machine learning approaches exploit relationships between the features of a labelled dataset and its labels to build models that predict labels for unseen data. Machine learning is easier to apply to large-scale data, but these methods often lack transparency. A brief description of the various WP1 SGA-2 studies is given below:

Belgium: Machine learning approach used to classify NACE group codes based on the full job description.

France: Machine learning approach used to classify occupation codes based on the full job description; string matching was also explored.

Greece: Rules-based approach using phrase based classification to assign occupation codes to IT job advertisements in both Greek and English.

Germany: Machine learning approach used to classify NACE group codes based on the full job description.

Portugal: Machine learning approach used to classify occupation codes based on the job title only.

Further details are available in the relevant country Annexes.
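A phrase based classifier of the kind used by Greece can be sketched as an ordered list of trigger-phrase rules (the phrases and occupation codes here are illustrative examples, not the actual Greek rule set):

```python
# Ordered (phrase, occupation code) rules; first match wins
RULES = [
    ("web developer", "2513"),
    ("software developer", "2512"),
    ("systems analyst", "2511"),
]

def classify(job_title):
    """Return the occupation code of the first trigger phrase found in the title."""
    title = job_title.lower()
    for phrase, code in RULES:
        if phrase in title:
            return code
    return None  # unclassified: a candidate for manual review or a new rule

print(classify("Senior Web Developer (Athens)"))  # 2513
print(classify("Accountant"))                     # None
```

The high precision and explainability come from each assignment being traceable to a single rule; the scalability problem comes from the number of rules growing with the breadth of the classification.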

A general finding from these studies is that these machine learning classification methods usually work better for certain parts of a classification than others. For example, in the Belgian and German studies, the NACE categories I (Accommodation and Food Services) and O (Public Administration and Defence) were much more often classified correctly than L (Real Estate activities) and R (Other Services). For some other NACE categories there was considerable variation in the results between the two countries and so there may be language specific or other technical factors that need further investigation.

An interesting finding from the German study was that the data from the German Federal Employment Agency (FEA) contains many more structured variables (e.g. occupation code, number of vacancies, type of degree) than the ads from private job portals. This makes the FEA data a feature rich source of training data which could be applied to the full text of job descriptions from private portals. This would seem to be true for other NEAs, which further strengthens the case for prioritising access to this type of data.

4.4 Flow to Stock transformation

When combining or comparing OJV data to the JVS it is important to consider the definitional differences between these two types of data. The JVS is a stock measure of the number of vacancies that employers are taking active steps to fill at a specific time point. OJV data may be collected as either a flow of new job ads if collected continuously, or a stock of live job ads open on a specific date. Some job ads will have a closing date, and while some portals will remove jobs when they expire, this does not always occur. However, even with a known closing date the problem remains that an employer will typically still be taking active steps to fill a vacancy even after the job advertisement has closed. This may even include re-advertising the vacancy. These definitional problems are illustrated in Figure 1.

Figure 1: Job Vacancy Lifecycle:

To make OJV and JVS data more directly comparable, it is necessary to estimate the time between when a job is advertised and when it is filled, and then apply an adjustment to the OJV data[footnoteRef:17]. Slovenia has approached this problem by using data on public sector jobs advertised through their national employment agency (ESS). These data can be linked directly to compulsory health insurance applications, so the difference between the start of the job ad and when a person starts the job can be measured directly. This means a distribution can be created and then applied to the entire flow of new job ads to estimate the stock of unfilled vacancies at any time point. Various methods were tested to establish how the time periods between advertising and filling a vacancy are distributed. The best-fitting distribution was geometric, with a mean of 48 days, a median of 33 days and a standard deviation of 48 days. The distribution has a long tail because a small number of vacancies take a long time to fill. This is discussed extensively in Annex F, Section 3.3. [17: De Pedreza et al. (2017) take a different approach when comparing JVS and OJV data in the Netherlands. The JVS in Netherlands asks some additional questions about new vacancies created in the previous quarter. This means it is possible to directly compare new vacancies measured in the JVS with new vacancies in the OJV data source.]

For the UK Burning Glass data, a transformation step was essential because the data only had a start date. As there are no equivalent linked data such as those used in Slovenia, an approach was developed that systematically created a set of stock time series based on different assumptions about the average length of time to fill a vacancy. The time series with the best trend alignment between the transformed data and the JVS used an assumption of 36 days. This assumption is applied in all subsequent analysis of Burning Glass data to derive an estimated daily measure of job vacancy stocks. This is discussed further in Annex H, Section 3.1.
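The UK-style flow-to-stock transformation can be sketched as follows (the daily flows are illustrative; the 36-day fill-time assumption is the one selected above):

```python
DAYS_OPEN = 36  # assumed average time to fill a vacancy

def stock_on_day(day_index, daily_flows, days_open=DAYS_OPEN):
    """A vacancy posted on day d is assumed open on days d .. d + days_open - 1,
    so the stock on a given day is the sum of flows over the trailing window."""
    start = max(0, day_index - days_open + 1)
    return sum(daily_flows[start : day_index + 1])

new_ads = [10, 12, 8, 15, 11, 9, 14]          # newly posted ads per day
print(stock_on_day(6, new_ads))               # 79: all seven days' ads still open
print(stock_on_day(6, new_ads, days_open=3))  # 34: only the last three days count
```

The Slovenian variant would replace the fixed window with a survival probability derived from the fitted geometric distribution.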

Although these two approaches are not directly comparable, the average time periods are roughly similar, and intuitively seem about right. It should be clear that while using these assumptions to transform data in this way will improve comparability between OJV and JVS data, they are coarse assumptions (especially in the case of the UK) and will inevitably introduce some error. It might be possible to refine this period value for different industry sectors, but this is limited by the problems of reliably disaggregating OJV data by industry.

4.5 Conclusion:

The effort required to transform large amounts of raw data from job portals into a form ready for analysis should not be underestimated. Raw OJV data invariably requires a lot of cleaning, deduplication and data reduction. Key variables of interest such as occupation and economic activity need to be classified from text. Finally, there are important definitional differences between OJV and JVS data, and data may need to be transformed to make comparisons.

For these reasons, NSIs need to consider what is the best use of their time and resources. NSIs should consider options that will minimise the amount of data handling required. NSIs should also keep in mind that CEDEFOP is expected to become a key source of OJV data for all EU member states over the long term and many of these data handling processes will have already been applied to the data. However, as CEDEFOP is focused on meeting EU requirements, only harmonised EU classifications are likely to be available (i.e. NACE and ESCO). Therefore, countries with specific national requirements may wish to focus on developing methods that are specific to their classification systems.

5. Methodology

As explained in Sections 1, 2, and 4.4 there are fundamental issues around the quality of OJV data, what these represent, and how they compare with official estimates produced from the JVS. This provides some major methodological challenges in using these data to produce official statistics.

The section starts with a discussion of the definitions and target concepts used for official job vacancy statistics and how these differ from corresponding OJV data. This is followed by a discussion of data quality frameworks and some steps that have been taken to assess OJV data against these frameworks. Next is a section on approaches for coverage assessment, followed by data matching (which is one means of assessing coverage). The final section is a discussion of time series methods, which focuses more on how OJV and JVS data compare over time.

5.1 Definitions

There are differences between the target concept of official surveys and what can be practically measured from on-line job advertisements. Job vacancy statistics within the ESS are currently subject to EC regulation No 453/2008. This defines a job vacancy as:

a paid post that is newly created, unoccupied, or about to become vacant:

(a) for which the employer is taking active steps and is prepared to take further steps to find a suitable candidate from outside the enterprise concerned; and

(b) which the employer intends to fill either immediately or within a specific period of time. [footnoteRef:18] [18: Regulation (EC) No 453/2008 of the European Parliament and of the Council, http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2008:145:0234:0237:EN:PDF]

EC regulation 453/2008 has several mandatory elements:

Quarterly data that has been seasonally adjusted

Data broken down by economic activity (using NACE[footnoteRef:19]) [19: http://ec.europa.eu/competition/mergers/cases/index/nace_all.html]

Data is relevant and complete, accurate and comprehensive, timely, coherent, comparable, and readily accessible to users.

There are other elements that are optional, or subject to feasibility, including:

Job vacancies in the agriculture, forestry and fishing sectors

Job vacancies in public administration, defence and education

Data on businesses with less than 10 employees

Distinguishing between fixed term and permanent jobs.

Member states are granted considerable flexibility regarding the implementation of regulation 453/2008 in the national statistical systems. Some countries use stand-alone surveys, while others combine the job vacancy survey with other business surveys. Some collect the minimum information required by the regulation while others collect data for their own national purposes. Although the regulation states that the data shall be collected using business surveys, the use of administrative data is equally permitted under the condition that the data are appropriate in terms of quality (according to the quality criteria of the European Statistical System).

The official definition of a job vacancy does not correspond exactly to the concept of a live job ad. Critically, a vacancy will normally persist for a period after the closing date. In theory, this means that the stock of vacancies as measured by the JVS should generally be higher than in corresponding OJV data.

5.2 Quality Assessment Frameworks:

It should already be clear that there are many quality issues to consider when working with OJV data. During SGA-1, a virtual sprint was held by WP1 to consider two possible quality frameworks:

1. The Quality Assessment Framework used by Statistics New Zealand as a reporting tool for administrative data quality. The aim was to test the suitability of this framework for web scraping of on-line job advertisements. This was partly in response to some initial proposals put forward by WP8 for approaching big data quality.

2. The UNECE Framework for the Quality of Big Data, developed by the UNECE Big Data Quality Task Team.

A summary of this work is available in the final SGA-1 technical report[footnoteRef:20] (Section 4.2) and a more detailed report is available on the ESSNet wiki[footnoteRef:21]. The general conclusion was that the UNECE framework seems more intuitive and could therefore be a more useful starting point for documenting quality issues for on-line job advertisement data. However, the Statistics New Zealand Quality Framework is designed to support a total survey error approach, which could further deepen the accuracy and selectivity dimensions of the UNECE framework. These elements would become more important when considering how on-line job advertisements could be moved into statistical production. [20: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/2/20/Deliverable_1_3_main_report_final_1.0.pdf] [21: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/Virtual_Sprint_1_February_2017]

Further to this, a more comprehensive review against the UNECE framework was done in preparation for a March 2018 workshop to validate the CEDEFOP web scraping system and its suitability for official statistics purposes. The aim was to explore the feasibility of incorporating this, or a similar, quality framework into the CEDEFOP production system. One of the key outcomes of this process was identifying the importance of having metadata and, where possible, meaningful quality measures for all data processes. This would help enable official statisticians to make appropriate judgements about CEDEFOP data. This is also available as a separate report[footnoteRef:22]. [22: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/7/7d/WP1_Quality_Framework_v1.1.pdf]

5.3 Measuring Coverage:

As previously discussed, one of the fundamental quality issues in using on-line job advertisement data is that not all job vacancies are advertised on-line. Understanding these issues of coverage has been a key focus for this pilot and three distinct approaches have been identified for trying to better understand and measure these differences:

5.3.1 Micro-level comparisons:

This involves linking OJV data to either the reporting units of the JVS or the Business Register. The UK have linked JVS data to vacancy counts by company for a range of OJV sources. This has revealed that the pattern of job counts between the JVS and other sources by reporting unit/company is typically very messy and difficult to understand. Slovenia have also undertaken this kind of microdata analysis as part of the production of experimental estimates.

For more details see: Section 5.5; Annex H, Section 3.1 (UK); Annex F, Section 4 (Slovenia).

5.3.2 Aggregate comparisons:

This approach involves comparing JVS aggregates by industry sector (i.e. NACE) with the equivalent taxonomies from job portals. This approach has been used by Germany, who do not have access to JVS micro data. While this is quite a straightforward approach, private portals usually have their own taxonomies, which are often only approximately comparable with NACE.

For more details see: Annex C, Sections 2 and 3 (Germany) and Annex G, Section 4 (Sweden).

5.3.3 Measuring use of advertising channels via the JVS:

A third approach has involved surveying enterprises and asking specific questions about advertising channels. Surveys of this type have been carried out by both Germany and Slovenia. The German study found that large companies were more likely to advertise on-line, with small companies more likely to use other channels.

For more details see: SGA-1 Final Technical Report[footnoteRef:23], p.34 (Germany) and p.58 (Slovenia) [23: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/2/20/Deliverable_1_3_main_report_final_1.0.pdf]

5.4 Matching and linking:

Several country pilots have explored matching on-line job ads with their own JVS micro data or Business Register data as a means of understanding coverage issues. The results have been somewhat mixed. Linking at the enterprise level between the Swedish Employment Agency (PB) and the Business Register is straightforward, since both use a common enterprise identifier. However, there is no such identifier for local units, so linking data at this level requires probabilistic matching using variables such as enterprise name and location. This approach produces lower-quality results.

In the UK, enterprise level matching is done solely on company name, which has proven very problematic. Common problems include the use of abbreviated names, trading names rather than the legal enterprise name, and misalignment between company names and the JVS reporting unit. In Germany, there were difficulties in obtaining JVS microdata, since this is administered by another agency. In the Slovenian study, several enterprise matching steps are made as part of the final deduplication process. This involves matching jobs from the national employment agency, deduplicated jobs from job portals, and additional jobs found on enterprise websites.
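As an illustration of the kind of name normalisation and approximate matching these linking exercises involve, the sketch below matches portal company names against register names using only the Python standard library. The suffix list, threshold and company names are invented for illustration; they are not the actual rules used by any of the country pilots.

```python
import difflib
import re

# Legal suffixes that commonly cause mismatches between portal names and
# register names (illustrative list only, not any country's actual rules)
SUFFIXES = re.compile(r"\b(ltd|limited|plc|gmbh|ab)\b\.?", re.IGNORECASE)

def normalise(name):
    """Lower-case, strip legal suffixes and punctuation before matching."""
    name = SUFFIXES.sub("", name.lower())
    return re.sub(r"[^a-z0-9 ]", " ", name).strip()

def best_match(portal_name, register_names, threshold=0.85):
    """Return the closest register name, or None if below the threshold."""
    target = normalise(portal_name)
    scored = [(difflib.SequenceMatcher(None, target, normalise(r)).ratio(), r)
              for r in register_names]
    score, match = max(scored)
    return match if score >= threshold else None

register = ["Acme Widgets Limited", "Beta Retail PLC"]
print(best_match("ACME Widgets Ltd.", register))  # → Acme Widgets Limited
```

A threshold-based similarity score like this illustrates why name-only matching is fragile: trading names and abbreviations can fall below any sensible threshold, while unrelated firms with similar names can exceed it.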

A major issue in trying to match JVS reporting units to company names in job portals is that many jobs are advertised through private employment agencies, and the employer is not usually identified in the advertisement. In some cases, there may be clues in the job ad about the type of business or its location. Also, when matching counts between the JVS and direct employer counts from the on-line sources, any shortfalls in the on-line data may provide further clues as to which employers and what types of jobs are being advertised through employment agencies. Work undertaken by Slovenia reveals that slightly less than 10 per cent of job offers are made through employment agencies. However, this proportion may well be higher in other countries. In 2015, the UK had 14,280 enterprise units classified as activities of employment placement agencies[footnoteRef:24]. [24: https://www.ons.gov.uk/businessindustryandtrade/business/activitysizeandlocation/datasets/businessdemographyreferencetable]

5.5 Time series analysis:

Despite the many issues with OJV data, one promising approach is to exploit the time series relationships with the corresponding JVS estimates. A recent study using ten years of OJV data from a data aggregator in the Netherlands identified a clear relationship between macro-level trends in the OJV data and the official job vacancy estimates (de Pedraza et al, 2017). Although these relationships are not strong enough to suggest that OJV data could replace the survey, they do suggest that OJV data could be incorporated into some kind of modelling approach.

Sweden have a continuous source of on-line vacancy data from their NES, Platsbanken (PB), going back to 2007. While the overall levels of job openings are much higher than in the Swedish JVS, a standardized view of the data shows that it has similar time series properties, including a well-defined seasonal pattern (Figure 2).

Figure 2: Job openings from Swedish National Employment Agency (PB) and JVS (2007-2017)

Disaggregating these series into public and private sector jobs shows greater trend correspondence for the public sector (especially in recent years), with a correspondingly weaker pattern for the private sector. Further details are available in Annex G, Section 4.
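The kind of seasonal pattern referred to above can be summarised with a simple seasonal index: the mean for each quarter divided by the overall mean. The sketch below illustrates the idea; the counts are invented and are not the PB or JVS figures.

```python
# Made-up quarterly job-opening counts over three years (Q1..Q4 repeating)
series = [1200, 1500, 1100, 1300,
          1260, 1580, 1140, 1370,
          1330, 1650, 1210, 1440]
periods = 4  # quarterly data

# Seasonal index: mean of each quarter divided by the overall mean, so
# values above 1 indicate quarters with above-average job openings
overall = sum(series) / len(series)
index = [sum(series[q::periods]) / (len(series) // periods) / overall
         for q in range(periods)]
print([round(i, 2) for i in index])  # → [0.94, 1.18, 0.86, 1.02]
```

A stable index computed separately on the OJV and JVS series would give a simple check that the two sources share the same seasonal shape, even when their levels differ substantially.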

The UK have explored using the time series properties of multiple OJV sources in a machine learning approach to nowcast job vacancy counts at the level of the individual enterprise or reporting unit. The advantage of this highly disaggregated approach is that it removes the problem of sampling variance and uses the actual reported survey values as the training data. The model is a Long Short-Term Memory (LSTM) neural network trained on the JVS[footnoteRef:25]. The basic idea is that different OJV sources may work better for different companies, and that the model will choose the specific time series that works best. An example is shown in Figure 3. [25: The UK JVS runs monthly, which provides more data points than the quarterly time series available in most other Member States.]

Figure 3: Example of LSTM nowcasts for a specific company

The performance gave a modest improvement over a baseline persistence model (i.e. predicting the previous value), with the model able to predict the correct direction of the trend in about 70% of cases. The limitations of this kind of approach are the short time series available for most of the OJV data, and that it can only be applied to the roughly 25% of survey units that are always in the sample. However, it may be a plausible solution to the problem of a highly dynamic data environment in which different sources may be growing or declining in importance. Further details are available in Annex H, Section 4.
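The evaluation described above, comparing a model against a persistence baseline and measuring directional accuracy, can be sketched as follows. The series is invented and the LSTM itself is omitted; only the two benchmark measures are shown.

```python
# Illustrative monthly vacancy counts for one reporting unit (made-up numbers)
actual = [120, 130, 125, 140, 150, 145, 160]

# Persistence baseline: the forecast for next month is this month's value
baseline = actual[:-1]  # forecasts for months 2..7
targets = actual[1:]    # the observed values those forecasts try to predict

def mae(pred, true):
    """Mean absolute error of a forecast."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def directional_accuracy(pred, true, last_seen):
    """Share of steps where the forecast gets the direction of change right."""
    hits = sum((p >= s) == (t >= s) for p, s, t in zip(pred, last_seen, true))
    return hits / len(true)

print(mae(baseline, targets))  # → 10.0
print(round(directional_accuracy(baseline, targets, baseline), 2))
```

A candidate model only needs to beat the baseline's MAE and directional accuracy to demonstrate added value, which is the sense in which the improvement reported above is "modest".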

6. Statistical Outputs

Most of the results of this work package are intermediate analyses and cannot be classified as statistical outputs, or even experimental statistics. However, there are a few examples of the latter, as well as some other examples of the types of analysis that may be possible with this kind of data.

6.1. Estimates of on-line job vacancies:

Slovenia have come the closest to developing an approach for producing experimental statistics based on OJV data (Table 3). These figures are the fully deduplicated job ads detected from the National Employment Agency, the two largest job portals and enterprise websites. The detected job ads for the quarter include data from before the reference period, where a distribution has been applied to adjust for unfilled jobs after the jobs have closed (see Section 4.4). A comparison with the official job vacancy estimates shows that only about 40% of all Slovenian job vacancies can be found on-line.

Despite the large difference with the official estimate, the greater frequency of OJV data offers the possibility of variations on these statistics. These include job ads that are available (i.e. advertised) during the reference month and on the reference day, as well as new ads for the reference month and for the reference day. These are also shown in Table 3.

Table 3: Experimental on-line job vacancy statistics for Slovenia

Estimates                            Type    28 August 2017 (Q3)   30 November 2017 (Q4)
Detected job ads for quarter         Stock   6849                  6327
Official JVS estimate                Stock   17221                 15243
Available in reference month         Stock   3542                  4493
Available on reference day           Stock   1368                  1285
Newly available in reference month   Flow    1984                  2115
Newly available on reference day     Flow    123                   76

6.2 Indicators or nowcasts of labour market activity based on OJV data

Given the complexities of deduplication and of producing figures that are directly comparable with official JVS estimates, an alternative approach could be to produce a simple indicator or index. Where a number of different OJV sources are available, one very simple approach could be to produce an average. Figure 4a shows the monthly average of eleven UK job portals; Figure 4b shows a daily average of the same eleven portals. In both cases the series are scaled to the JVS. These show that the OJV sources can detect a seasonal pattern similar to that captured in the survey.

Figure 4a: Time series of the total JV counts, averaged per month (scaled to the JVS scale).

Figure 4b: Time series of the JVS and daily average of the online sources (scaled to the JVS)
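The averaging-and-scaling step behind such indicator series can be sketched as follows. The portal names, counts and JVS level below are invented, and the scaling simply anchors the indicator's mean to the JVS level while preserving the month-to-month movements of the on-line data.

```python
# Monthly vacancy counts from three hypothetical portals (made-up numbers)
sources = {
    "portal_a": [900, 950, 1000, 980],
    "portal_b": [400, 420, 460, 450],
    "portal_c": [1500, 1550, 1620, 1600],
}
jvs_level = 800_000  # official JVS stock estimate to scale against (invented)

# Step 1: simple average across sources for each month
months = len(next(iter(sources.values())))
avg = [sum(s[m] for s in sources.values()) / len(sources) for m in range(months)]

# Step 2: rescale so the indicator's mean matches the JVS level,
# keeping the on-line data's month-to-month movements intact
scale = jvs_level / (sum(avg) / len(avg))
indicator = [round(v * scale) for v in avg]
print(indicator)
```

Because only the level is borrowed from the survey, an indicator like this measures relative movement rather than the absolute number of vacancies, which matches how these series are presented in Figures 4a and 4b.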

The UK tried compiling the LSTM-based nowcasts by company to produce an aggregate nowcast indicator (also scaled to the JVS). Given the model's weak predictive power and the small sub-sample (100 companies), it is perhaps not surprising that this aggregate indicator did not produce very good results. However, very late in the study, a different nowcasting approach was explored based on the S-ARIMA-X time series model (Figure 5). The green dotted line represents the nowcasts based only on the JVS, while the red dotted line shows the (more precise) nowcasts that include the aggregate Burning Glass data as an exogenous variable. The shaded areas in the respective colours indicate the corresponding 95% confidence intervals. The inclusion of time-varying coefficients made a noticeable improvement to the model, so this seems a promising area worthy of further exploration.
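As a rough illustration of how OJV data enters such a model as an exogenous regressor, the sketch below fits a plain least-squares regression with a lagged JVS term and a contemporaneous OJV term. This is a deliberately simplified linear stand-in for the S-ARIMA-X model used in the study (no seasonal or moving-average components), and all numbers are invented.

```python
import numpy as np

# Made-up quarterly series: JVS vacancy stock and an aggregate OJV count
jvs = np.array([100, 104, 98, 110, 115, 108, 122, 128, 120, 135], dtype=float)
ojv = np.array([210, 220, 205, 235, 245, 228, 260, 275, 255, 290], dtype=float)

# Design matrix: intercept, lagged JVS (a crude AR term) and the
# contemporaneous OJV count as the exogenous regressor
X = np.column_stack([np.ones(len(jvs) - 1), jvs[:-1], ojv[1:]])
y = jvs[1:]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Nowcast the next quarter, assuming the OJV count is already observed
next_ojv = 300.0
nowcast = float(beta @ np.array([1.0, jvs[-1], next_ojv]))
print(round(nowcast, 1))
```

The key point the sketch captures is timing: the survey value for the current quarter is not yet available, but the scraped OJV count is, so the exogenous term lets the model exploit fresher information than the survey alone.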

This is discussed further in Annex H (UK), Section 5.4

Figure 5: Nowcasts based on the S-ARIMA-X time series model.

6.3 Geographic indicators:

Figure 6 shows an example of the type of geographic indicators that could be produced with OJV data. Two different sets of indicators are shown, based on the two UK data sources for which location information is available (i.e. Burning Glass and Adzuna). These indicators represent the number of on-line vacancies as a proportion of the working-age population in each local authority. Both sets of data show similar patterns, with higher rates in Central and South England and generally lower rates in Scotland, Wales, Northern Ireland and peripheral areas of England. However, there are also some differences between the two sources, which illustrates the problem of relying on any one source of OJV data. There is also the risk that these data give a distorted picture of how local labour markets are performing. For example, in predominantly rural local authorities, employers may rely less on on-line channels. For this reason, such indicators may be more useful for measuring change over time than absolute levels.

This is discussed further in Annex H (UK), Section 5.4

Figure 6: Number of job vacancies as a proportion of working age population[footnoteRef:26] [26: Data for London has been removed due to the distortive effect on the scales]

6.4 Concluding Remarks

The results from Slovenia raise some fundamental questions about the OJV data and whether and how it should be used for official statistics. While this has shown that it is feasible to produce estimates of on-line job vacancies, there is a big difference between these and the official job vacancy estimates. It is therefore clear that these could never replace the official estimates. Further, the benefits of these estimates for policy making are not clear as they only give a partial (and not easily defined) view of overall labour market demand.

It therefore seems that the role of OJV data within official statistics is more likely to be as the basis for producing supplementary indicators. These could include indicators of local labour market demand (as shown above) and/or indicators of occupation groups and associated skills. However, rather than measuring absolute levels, these would be more useful for measuring change over time. Using OJV data for nowcasting purposes is another promising application. We also cannot yet rule out the possibility of using OJV data in conjunction with the JVS to reduce the frequency of the survey or reduce the size of survey samples and thereby reduce sampling costs. Another possibility could be using these data for imputing non-response in the JVS. However, it is important to recognise the considerable differences between different countries in terms of the OJV landscape. Thus, what is feasible in one country may not necessarily be reproducible in others.

7. Future Perspectives

The ESS Big Data Steering committee has agreed that on-line job vacancies will be one of four implementation work packages as part of the second Big Data ESSNet starting in early 2019. Therefore, this will be the organisational framework for taking this work forward within the ESS over the next few years.

It is expected that the current CEDEFOP web-scraping project will form a common infrastructure supporting the adoption of OJV data across the ESS during the second ESSNet and beyond. This means that NSIs may be able to reduce their activities around data access and data handling and focus more on the challenges of further methodological development. The ESSNet will need to continue to work with CEDEFOP to ensure that the data meets the needs of official statistics as far as possible.

However, one important limiting factor is that this system will only hold data from 2018 onwards, so it will take at least several years to build up a reasonable time series. This will constrain what can be delivered within the timescales of the next ESSNet. This, coupled with the various challenges in using OJV data for official statistics, means that we need to be realistic about what is achievable in the second Big Data ESSNet. In addition, there is a need to clarify what is meant by implementation and what could reasonably satisfy this expectation within the stated timescales.

One final future perspective is to consider how technology and recruitment trends may affect OJV data. For example, there are now websites that allow businesses to create video job ads targeted at individuals based on their browsing history. If these were to start displacing traditional job portals, this would raise a whole new set of technical and methodological challenges.

References:

Carnevale, A., Jayasundera, T., Repnikov, D., 2014, Understanding on-line job ads data: A Technical Report, Georgetown University; Available at: https://cew.georgetown.edu/wpcontent/uploads/2014/11/OCLM.Tech_.Web_.pdf (Accessed 25 June 2017)

Djumalieva, J., Lima, A., Sleeman, C., 2018, Classifying occupations according to their skill requirements in job advertisements, NESTA; Available at: https://www.nesta.org.uk/sites/default/files/classifying_occupations_according_to_their_skill_requirements_in_job_advertisements_28-03-2018.pdf (Accessed 09 April 2018)

De Pedraza, P., Visintin, S., Tijdens, K., Kismihók, G., 2017, Survey vs scraped data: Comparing time series properties of web and survey vacancy data, Amsterdam Institute for Advanced Labour Studies, Working Paper Series; Available at: https://aias.s3-eu-central-1.amazonaws.com/website/uploads/1499760002407WP-175---de-Pedraza,-Visintin,-Tijdens,-Kismih%C3%B3k.pdf (Accessed 09 April 2018)

Stateva, G., ten Bosch, O., Maslankowski, J., Righi, A., Scannapieco, M., Greenaway, M., Swier, N., Jansson, I., Wu, D., 2016, Legal Aspect to Web Scraping of Enterprise Websites, Eurostat; Available at: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/a/a0/WP2_Deliverable_2_1_15_02_2017.pdf (Accessed 29 June 2017)

Kettner, A. and Vogler-Ludwig, K., 2010, The German Job Vacancy Survey: An Overview, in 1st and 2nd International Workshops on Methodologies for Job Vacancy Statistics, Proceedings, Eurostat; Available at: http://ec.europa.eu/eurostat/documents/3888793/5847769/KS-RA-10-027-EN.PDF/87d9c80c-f774-4659-87b4-ca76fcd5884d (Accessed 24 Oct 2016)

Körner, T., Rengers, M., Swier, N., Metcalfe, E., Jansson, I., Wu, D., Nikic, B., Pierrakou, C., 2016, Inventory and qualitative assessment of job portals, Eurostat; Available at: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/File:Deliverable_1_1_draft_v5.docx (Accessed 1 November 2016)

Swier, N., Metcalfe, E., Jansson, I., Wu, D., Nikic, B., Pierrakou, C., Körner, T., Rengers, M., 2016, Interim Technical Report (SGA-1), Eurostat; Available at: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/6/64/WP1_Deliverable_1_2_final.pdf (Accessed 31 July 2017)
