Measuring activity in big data: new estimates of big data employment in the UK market sector Omar Chebli, Peter Goodridge, Jonathan Haskel Discussion Paper 2015/04 July 2015


Measuring activity in big data: new estimates of big data employment in the

UK market sector

Omar Chebli, Peter Goodridge, Jonathan Haskel

Discussion Paper 2015/04

July 2015


Measuring activity in Big Data: New estimates of Big Data

employment in the UK market sector*

Omar Chebli

Imperial College Business School

Peter Goodridge

Imperial College Business School

Jonathan Haskel

Imperial College Business School; CEPR and IZA

May 2015

Abstract

Statements around the growth in data and associated analytical activity are widespread but metrics are

rare. In the UK, exceptions to this include estimates of employment in the field of big data. We

document those studies and produce our own estimates using a novel dataset. We find that in

2010, estimated ‘big data employment’ in the UK market sector was 190,000. We show how this

estimate relates to official measures of employment in other knowledge creation activities, such as

own-account (in-house) production of software and also business performance of R&D.

*Contacts: Peter Goodridge, Jonathan Haskel, Omar Chebli, Imperial College Business School, Imperial

College, London. SW7 2AZ. [email protected] [email protected], [email protected]. We are very

grateful for financial support for this research from EPSRC (EP/K039504/1 and EP/I038837/1). We also thank

e-skills UK, TechUK and industry participants at a TechUK forum for helpful discussions. This work contains

statistical data from ONS which is Crown copyright and reproduced with the permission of the controller of

HMSO and Queen's Printer for Scotland. The use of these data does not imply the endorsement of the data

owner or the UK Data Service at the UK Data Archive in relation to the interpretation or analysis of the data.

This work uses research datasets which may not exactly reproduce National Statistics aggregates. All errors and opinions

are of course our own.


1. Introduction

To date, much of what has been published on ‘Big Data’ and data analytics has focused on the sheer

volume, or growth in volume, of data available to firms, and how it is being, or could be, put to use.

On volume, Google’s Eric Schmidt is commonly quoted as stating that as much data/information is

being created every two days as was created from the dawn of civilisation to 2003 (Wong 2012).

However, aside from broad statements, in the UK at least, few hard metrics are available on the scale

or volume of big data activity.

Exceptions to this include work that has sought to produce labour market statistics on big data. In a

series of reports, e-skills UK (2013a; 2013b; 2014), the Information Technology Sector Skills Council

for the UK, document estimates of big data employment in UK firms, associated salaries, as well as

current and future estimates of demand for big data staff (vacancies). In other work, Mandel has

sought to estimate big data employment in both the US (2012; 2013), and also in work in conjunction

with NESTA, for the UK (Mandel and Scherer 2014).

There is however little consensus on numbers. E-skills UK estimate UK big data employment of

31,000 in 2013, whilst Mandel estimates 294,000 in 2014. To give some sense of scale, according to

the Business Expenditure on Research and Development (BERD) survey,1 in 2013 UK firms

employed 178,000 workers engaged in R&D and, according to the Annual Survey of Hours and

Earnings (ASHE), in 2010, 749,000 workers engaged in the writing of software.2 In light of this, in

this paper we produce our own estimate of big data employment for the UK market sector using a new

data source, namely the publicly available profiles of workers registered on an employment-based

social media network. We show how that estimate relates to the ONS measurement of employment in

occupations that produce own-account software (see Chamberlin, Clayton et al. (2007)). This is a

natural place to focus since, although it is mathematics and statistics that are the foundations of data

analytics, both data-building and data analytics require software programming skills so that

employment in these activities is very much related. In future work we will show how to relate our

estimate of employment to standard national accounting procedures for measuring investment in

intangible assets. This paper is therefore a first step to documenting the contribution that data and

data-based assets are making to UK growth.

The plan of the rest of this paper is as follows. Section two sets out an informal model of the activity

we are seeking to measure. Section three presents measures of big data employment documented in

1 Data available at http://www.ons.gov.uk/ons/rel/rdit1/bus-ent-res-and-dev/2013/index.html

2 Authors’ own estimates constructed from ASHE microdata held at the UK Data Archive (Office for National

Statistics).


other studies. Section four presents our new estimate of big data employment in the UK market sector

and compares it with estimates for other knowledge based employment such as that in software and

R&D. Finally, section five concludes.

2. An informal model of big data activity

Before setting out various measures of big data employment, it is first worth describing exactly what

activity we are seeking to measure. Figure 1 presents a simplified exposition of the big data process,

shown in three stages. Note that although represented linearly, various feedbacks likely exist between

stages. Also note that the three stages can either exist in-house, that is within the same vertically

integrated firm, or within distinct specialist firms.3 Employment estimates that follow are designed to

incorporate both these types of activity i.e. outsourced and in-house.

2.1.1. Data-Building (Transformation) (D)

Starting at the top of the diagram, we first consider the data-building or transformation (D) process,

which transforms raw records into data/information of a format ready for analysis. Raw records are

raw data of any source that require transformation into an analytical format. Data building may

involve digitising, structuring, formatting, and/or cleaning data. This process is sometimes referred to

as “data management”, “data acquisition” or “data warehousing”. The literature on data warehousing

and data analytics commonly describes this as the ETL process, an acronym for ‘Extract, Transform,

Load’. ‘Extract’ refers to the extraction of raw records; ‘Transform’ to the transformation of raw

records into data, often of improved quality, of a format ready for analysis; and ‘Load’ to the loading

of the data into the database or data warehouse. The linking, matching and aggregation of datasets

may take place in this stage, or later in the knowledge creation stage.
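The ETL steps described above can be sketched in a few lines of Python. This is a minimal illustration only, not a description of any firm's actual pipeline: the raw records, the `purchases` table and its column names are all hypothetical, and an in-memory SQLite database stands in for a data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw records: stray whitespace, inconsistent casing, a missing value.
RAW_RECORDS = """customer_id,product,amount
 101 ,Flights,250.00
102,hotel ,
103,FLIGHTS,99.50
"""


def extract(text):
    """Extract: read raw records from their source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean and structure raw records into an analysis-ready format."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:  # drop records with no usable amount
            continue
        cleaned.append(
            (
                int(row["customer_id"].strip()),
                row["product"].strip().lower(),
                float(amount),
            )
        )
    return cleaned


def load(rows, conn):
    """Load: write the transformed data into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(customer_id INTEGER, product TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_RECORDS)), conn)

# The loaded table is now ready for the knowledge creation (N) stage.
total_by_product = dict(
    conn.execute("SELECT product, SUM(amount) FROM purchases GROUP BY product")
)
```

In this sketch the cleaning rules (trimming, lower-casing, dropping incomplete rows) are arbitrary choices; in practice the transformation step is where most of the data-building labour is likely to be spent.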

3 Currently it is expected that the three stages predominantly exist in-house. However, as the field develops, it is

likely that more companies will specialise at different points in the chain/process (i.e. provision of raw records,

producers of information, producers of data-based knowledge, etc.). As an example, Google are a case where

all three stages exist in-house. As a by-product of providing search services, Google automatically generate raw

records on the search histories of users. They then employ labour and capital to manage, clean and transform

those data into an analytical format, producing information. Google then use that transformed data (i.e. it rents

from the Google stock of transformed data) to produce commercial knowledge. As a trivial example, this may

be the knowledge that users that search for product X (say, flights) also consume product Z (say, hotel

accommodation). In the downstream, Google sell advertising services to other firms. In doing so Google rents

from its stock of commercial knowledge (including data and algorithms) to sell advertising that can be targeted

at specific consumers e.g. in this example, hotels in a region advertise to those searching for aeroplane flights to

that area. Alternatively, consider a firm such as Experian. They operate in the knowledge creation stage,

buying or acquiring transformed data from numerous sources, and using that information to produce data-based

knowledge which they sell to other firms. The credit scores they sell to banks are just one example of the data-

based knowledge services they provide.


Figure 1: The Big Data Production Chain

Note to figure: Commercialisation is the embodiment of knowledge into the output of goods and services, which

may be sold for profit or made freely available. We therefore use the term commercialisation as our focus is on

the market sector, but note that the framework can also be applied to the non-market sector.

2.1.2. Knowledge creation (N)

The next stage is the knowledge creation process (N), more commonly referred to as “data analytics”.

This stage takes the output of the data-building stage, and uses that data/information to conduct

analysis. That analysis could take a number of forms. It will include activities commonly referred to

in the literature as “data science”, “data/text mining”, “knowledge recovery”, “business intelligence”

and “machine learning”, with the latter referring to the use of artificial intelligence to discover

correlations in data. Whatever the method, the output of the analytics process is a piece of

commercial knowledge formed from the analysis of information, and used to construct advice to be

implemented in the final production of goods and services.

2.1.3. Downstream production of final goods and services (Y)

The final stage incorporates the application of knowledge in the production of final goods and

services, in the downstream production (or operations) sector (Y). We emphasise that the


downstream is a pure operations stage, that does not conduct any activity in the creation of

information or data-based knowledge but rather just employs/rents labour and (tangible and

intangible) capital, including data-based knowledge, to deliver final goods and services.

We stress that we are seeking to measure employment in the first two stages presented in Figure 1,

that is, employment in the transformation or building of data and in the extraction of data-based

knowledge. In the final stage some workers will be involved in the implementation/use of data-based

knowledge, as well as the implementation of other forms of knowledge such as that from R&D,

market research etc. For instance, the data-based insight implemented in the downstream could be the

knowledge that the cross-promotion of goods results in increased sales, or alternatively the knowledge

to re-optimise downstream processes and improve productivity, derived from data emitted by sensors

embedded in machines (the “Internet of Things”). Similarly data and data-based knowledge may be

used in the generation of other types of knowledge, such as that created in the conduct of R&D or

market research. We are not seeking to measure activity in implementation here, and such “users” of

data-based knowledge are not intended to be included in the employment estimates that follow.

Also note that we do not seek to measure activity or employment in the generation of raw records. A

feature of ‘big data’ is that raw records are typically generated as a by-product of some other process,

for instance where data comes as exhaust data.4 Workers involved in the production of raw records

would therefore include those employees that work at the point where raw records are created,

including workers at the point of sale such as cashiers in supermarkets. We do not attempt to measure

this part of the process.

Rather we are seeking to measure employment in the transformation/building of data (information)

and in the use of that data to extract knowledge, thus including the kinds of occupations that are

receiving more and more attention, such as “data scientists”, “data engineers” and “business

intelligence analysts”. In the data-building stage, we would expect to find occupations that include

“data administrators”, “data managers”, “data engineers” and workers in “data control”. The

knowledge creation stage is more likely to contain workers with job titles that include “data

scientists”, “business intelligence” and “data/statistical analysts”.5 In practice, the roles of some

workers/occupations could include some aspects of both data-building and knowledge creation.

4 Typically unstructured data generated as a by-product of some online or digital process.

5 In a following section we document work by e-skills UK (2013b) which estimates employment in the

following occupations: “data engineers”, “data administrators”, “data analysts” and “data scientists”.


2.2. Big and Little data

It is worth making one other definitional point. It may have been noticed that we have not attempted

to formally define “big data”. Commonly used definitions of big data typically refer to the “3 V’s”,

that is the large volume, variety and velocity of data that is being created, largely as a result of the

spread of the digital economy. But in this paper, and future work, we are primarily concerned with

measurement of activity in data and data analytics that generates knowledge to be used in final

production. The volume, source, variety and type of data employed, or the speed with which it is

generated, is less of a concern. It therefore does not seem helpful to introduce a distinction between

“big” and “little” data. After all, both rest on the same foundations, that is mathematics,

statistics, computer science etc.

Further, data and data analytics have been making contributions to final production long before the

term “big data” became so widespread, even if some of the techniques, tools, technologies and

approaches are new. For example, the major supermarket chains have been collecting data on their

customers’ purchasing patterns and preferences for some time. That activity has just been made easier

and richer with the new types of data that are becoming available and to which they can link.

The same is true of insurance companies, which seek to create risk profiles of actual or potential customers, and

of banks, which use credit scores to assess customer applications for their products. We therefore see the

emergence of the field of big data analytics as growth in an activity that has long existed. The 3 V’s

mean that many more raw records are available, and that much more information can be created,

facilitating growth in the knowledge creation sector. Therefore in our measurement, we will not seek

to specifically exclude types of data and data analytics activity that do not meet particular strict

definitions of big data in terms of data type or the size of datasets, although we continue to refer to

“big data” for reasons of simplicity/shorthand.

2.3. Initial estimates

This framework suggests we can make an initial guess at the scale of UK big data employment. First,

we know from our discussions with industry and from the empirical work that follows that there is

some overlap with software. From the ASHE, we know that in 2010 there were around 289,000

workers in the UK market sector recorded under the occupation of “software professionals”. More

broadly, there were around 749,000 workers in IT occupations that the ONS consider are involved in

the writing of software (Office for National Statistics). Some proportion of these workers will be

engaged in the building of data and data-based assets. Second, there is also a potential overlap with

another category of knowledge workers, namely those in R&D. According to BERD, in 2010 there

were 154,000 workers engaged in R&D in UK firms, with 22,000 of those engaged in R&D in the

product field “Computer programming and information service activities”. Some proportion of these

workers may also be considered to be working in the sphere of big data.


3. Big data employment

How are we to measure big data employment? Looking at the diagram in Figure 1, were the stages in

this chain served by separate industries, then we could look at official industry data. The problems

with this approach are that, first, to the extent that these activities are provided by specialist industries,

the Standard Industrial Classification (SIC) is not currently detailed enough to separately identify

these firms. Second, much of the activity detailed in Figure 1 actually takes place in-house in

industries not classified as ‘data industries’, be that manufacturing, retail etc. Therefore we go to the

data on occupations. The obvious sources for data on employment by occupation are the Labour

Force Survey (LFS) or Annual Survey of Hours and Earnings (ASHE), categorised according to the

Standard Occupational Classification (SOC). However, inspection of the SOC shows that official

occupational classifications have also not kept pace with the new occupations emerging in and around

data and data analytics. This may not be surprising, with many of the job titles associated with this

field having emerged relatively recently. Whilst it is possible to identify the codes where workers in

data-building and data analytics are likely allocated, the codes are not exclusive so such workers are

mixed in with other occupations in rather broad groups. Therefore in our work and other studies,

some other source must be used instead.

3.1.1. Survey data: e-skills UK

Few studies have produced metrics of the resources devoted to data-building or data analytics.

Exceptions to this include a series of reports by e-skills UK (2013a; 2013b; 2014) which document

UK employment in big data activity. In conjunction with SAS, e-skills UK ran a survey of larger

market organisations asking firms about their adoption/use of data analytics, and questions on the

number of “big data staff” employed. They found that in 2013, 14% of firms with more than 100

employees had adopted big data analytics, and that 31,000 employees work in big data positions, with

32% (10,000) in IT-focused roles, 55% (17,000) in data-focused roles and 13% (4,000) in other roles.

E-skills estimates of big data employment are presented below in Table 1 .6

We note the following from Table 1. Of the data-focused roles, from our discussions with e-skills

UK, we consider the 3,000 Data Engineers and 1,000 Data Administrators to be likely employed in

the data-building (D) stage; and the 8,000 Data Analysts and 1,000 Data Scientists to be likely

employed in the knowledge creation (N) stage. With another 4,000 in undefined “other data-focused”

roles, these estimates suggest employment of 17,000 in the two stages combined, with 4-8,000 in

data-building/transformation, and 9-13,000 in knowledge creation. Alternatively we could use a

6 E-skills UK also conducted a survey of smaller organisations in conjunction with Experian. Of the 541 SMEs

they contacted, they concluded none had implemented big data analytics, suggesting the proportion of SMEs in

the UK population that had implemented is less than 0.2%.


broader definition that incorporates supporting IT-focused staff, implying big data employment of

31,000 in UK firms in 2013.

Table 1: Big data employment, 2013 (e-skills UK 2013b)

Big data employment by year:

  2012*     2013      2014*     2015*
  20,000    31,000    39,000    47,000

Employment by role, 2013:

  IT-focused roles:                 10,000
    Strategy/planning/design         2,000
    Development/Implementation       3,000
    Administration/operations        1,000
    Support                          1,000
    Other IT-focused                 2,000
  Data-focused roles:               17,000
    Data Engineers                   3,000
    Data Administrators              1,000
    Data Analysts                    8,000
    Data Scientists                  1,000
    Other data-focused               4,000
  Other roles:                       4,000

Source: Table 1 and Figure 9 in e-skills UK (2013b)

Note to table: Survey-based employment numbers for 2013. Numbers for 2012, and 2014-15 are

estimates/forecasts from e-skills UK/Experian (Figure 9 in e-skills UK (2013b)). 2013 survey was of firms, so

estimates relate to the UK market sector. IT-focused roles defined as “enabling roles focused on the design,

development, implementation, administration, maintenance and support of big data related systems and

applications”. Data-focused roles defined as “analytical roles focused on identifying, acquiring, managing,

manipulating, analysing, understanding, utilising and presenting big data and related inferences/propositions”.
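The ranges quoted in the text for the two upstream stages follow directly from the data-focused rows of Table 1. The short sketch below reproduces that arithmetic; the variable names are illustrative, and the allocation of roles to the data-building (D) and knowledge creation (N) stages is the one suggested in the text, with the undefined “other data-focused” roles treated as possibly belonging to either stage.

```python
# Data-focused roles in 2013, from Table 1 (e-skills UK 2013b).
data_roles = {
    "Data Engineers": 3_000,
    "Data Administrators": 1_000,
    "Data Analysts": 8_000,
    "Data Scientists": 1_000,
    "Other data-focused": 4_000,
}

# Allocation to stages as suggested in the text.
building = data_roles["Data Engineers"] + data_roles["Data Administrators"]
knowledge = data_roles["Data Analysts"] + data_roles["Data Scientists"]
undefined = data_roles["Other data-focused"]

# Lower bound: none of the undefined roles; upper bound: all of them.
building_range = (building, building + undefined)
knowledge_range = (knowledge, knowledge + undefined)

# Narrow definition: data-focused roles only.
narrow_total = building + knowledge + undefined

# Broad definition: add IT-focused roles (10,000) and other roles (4,000).
broad_total = narrow_total + 10_000 + 4_000
```

Note that the published IT-focused sub-rows sum to 9,000 rather than the 10,000 heading, presumably because of rounding, so the sketch uses the heading totals.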

3.2. NESTA: Mandel and Scherer (2014)

In work in conjunction with NESTA, Mandel and Scherer (2014) produce alternative estimates of UK

big data employment. They too note that the Standard Occupational Classification has not kept pace

with new and changing occupations in growing, innovative fields such as big data. Therefore they

turn to using data on the number of jobs advertised on the job aggregator website Indeed.co.uk as a

means of measuring employment. They assume that the number of job ads proxies the

number of gross hires, and that the number of gross hires has a strong correlation with employment,

and present evidence to support those assumptions. We note that job ads may have referred to jobs in

both the public and private sectors, so that final estimates reflect the UK whole economy rather than

just the market sector.

Specifically, the job descriptions and skill content contained within job ads are searched using a list of

14 keywords or phrases that include program names such as “Hadoop”, “MapReduce” and “Python”,

or job titles such as “data scientist”, “data engineer” or “data analyst”. In April 2014 such a search

returned 18,720 big data job advertisements. Then, in order to transform that estimate to an

employment number, it is multiplied by a “job/want ad multiplier” (of 15.7), based on the ratio of jobs


to ads in general IT occupations, derived using official employment data based on the SOC. This

calculation translates to an estimate for UK (whole economy) big data employment of 294,000 in

2014, as summarised below in Table 2.7

Table 2: UK big data employment (Mandel and Scherer 2014)

                Big data job ads    Job-want ad multiplier    Big data employment
2014 (April)    18,720              15.7                      294,000

Note to table: Data from Mandel and Scherer (2014). Snapshot for April 2014. Column 1 is number of big data

job ads identified from Indeed.co.uk. Column 2 is the ratio of jobs to job advertisements for general IT

occupations. Column 3 is estimated UK big data employment, calculated as column 1 times column 2.
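The two steps of this method, a keyword search over job ads followed by scaling with a multiplier, can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used by Mandel and Scherer: only six of their 14 keywords are named in the text, the sample ads are invented, and simple substring matching is assumed.

```python
# Six of the 14 keywords are named in the text; the remainder are not listed here.
KEYWORDS = [
    "hadoop",
    "mapreduce",
    "python",
    "data scientist",
    "data engineer",
    "data analyst",
]


def is_big_data_ad(ad_text, keywords=KEYWORDS):
    """Return True if any keyword appears in the ad (case-insensitive substring match)."""
    text = ad_text.lower()
    return any(keyword in text for keyword in keywords)


# Hypothetical job ads, for illustration only.
sample_ads = [
    "Data Scientist wanted; experience with Hadoop essential",
    "Retail store manager, full time",
    "Senior data analyst for insurance pricing team",
]
sample_count = sum(is_big_data_ad(ad) for ad in sample_ads)


def ads_to_employment(n_ads, multiplier):
    """Scale a count of job ads to an employment estimate."""
    return n_ads * multiplier


# Figures from Table 2: 18,720 ads and a multiplier of 15.7.
estimate = ads_to_employment(18_720, 15.7)
rounded_estimate = round(estimate / 1000) * 1000
```

With the Table 2 inputs, the product is 293,904, which rounds to the 294,000 reported. The sketch also makes plain that the employment estimate is linear in the multiplier, so any error in the multiplier carries through proportionally.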

We note that this estimate is far larger than that produced by e-skills UK, which ranged from 17,000

to 31,000 depending on the definition used (e-skills estimates also referred to 2013 rather than 2014,

but project a figure of 39,000 for 2014 using their broad definition), although we do note that the

focus of the Mandel and Scherer study is on the geographical distribution of innovative employment activity, rather than

absolute numbers in employment.

We consider there to be three predominant reasons for the large divergence between the estimates

produced by e-skills UK and Mandel/Scherer. First, as noted above, e-skills UK ran a survey of firms

so their estimates refer to the UK market sector, whereas results in Mandel/Scherer are based on an

aggregation of all job vacancies and so refer to the whole economy.

Second is the heavy dependence of the Mandel/Scherer result on the “job/want ad multiplier” of 15.7.

The multiplier used is based on the ratio of jobs to vacancies in general IT occupations. However, the

big data arena is one that is relatively new, and so we may not expect the ratio of jobs:vacancies to be

as high as in general IT. Further, from a survey of 45 data-focused companies, Bakhshi, Mateos-

Garcia et al. (2014) report that 80% of firms are struggling to hire the skilled labour they require,

stating that the supply of data skills is insufficient for current (let alone future) demand. In their

labour market assessments, e-skills UK (2013a; 2013b; 2014) similarly report that firms are struggling

to fill vacancies in this area, giving us extra reason to suspect that the appropriate multiplier for big

data is lower. In fact, based on the (narrowly-defined) employment estimates constructed by e-skills,

the true multiplier may actually be in the order of one (17,000 in employment compared to the 18,720

job ads identified by Mandel/Scherer). Alternatively, using the broader e-skills definition, e-skills UK

(2013b) reports a vacancy estimate of 3,790 in 2012, compared to employment of 20,000 in the same

year, which would suggest a multiplier of around 5, rather than the 15.7 in Mandel/Scherer.
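The implied multipliers in this paragraph are simple ratios of employment to job ads or vacancies. The sketch below reproduces them; the function name is illustrative, and the first ratio combines figures from different years and sources (2013 employment from e-skills UK, April 2014 job ads from Mandel/Scherer), so it should be read as indicative only.

```python
def implied_multiplier(employment, job_ads):
    """Ratio of jobs to job ads (or vacancies) implied by two estimates."""
    return employment / job_ads


# Narrow e-skills definition (17,000 employed) against the 18,720 job ads
# identified by Mandel/Scherer.
narrow_multiplier = implied_multiplier(17_000, 18_720)

# Broad e-skills definition: 20,000 employed and 3,790 vacancies, both for 2012.
broad_multiplier = implied_multiplier(20_000, 3_790)
```

The first ratio is roughly 0.9, that is in the order of one, and the second is roughly 5.3, both well below the 15.7 used for general IT occupations.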

7 For information, according to ONS data, in April-June 2014 the jobs to vacancies ratio for all vacancies was

41.67.


Third, e-skills UK, when contacting firms, restricted the definition of big data workers, whereas

estimates from Mandel/Scherer potentially include forms of data/analytics or business intelligence

activity that do not meet the stricter definition employed by e-skills UK.

To summarise, estimates of UK big data employment taken from work by e-skills UK and NESTA lie

in the rather large range of 17,000 to 294,000. In order to validate our final estimate, we must

therefore turn to some other source of information, which we do in the next section.

4. New estimates of big data employment: social media data

In this section we present our own estimates of big data employment derived from a novel dataset

built from the publicly available profiles of members of an employment-based social media network.

Before describing our method and results, it is first worth setting out some detail on estimates of

employment in a related investment activity in the national accounts, namely software or

‘computerised information’.

4.1. Employment in related occupations: “computerised information”

The System of National Accounts (SNA) (United Nations 2008) recommends the capitalisation of

expenditures on ‘computerised information’, comprised of software and databases, both purchased

and own-account (in-house). This means that statistical authorities gather data on employment in

software-writing occupations, in order to estimate in-house investments in creating software. From

Chamberlin, Clayton et al. (2007), the list of occupations used by the ONS is presented below in

Table 3: columns 1 and 2 present the seven occupational codes used in measurement (based on

SOC 2000), column 3 shows the approximate mapping to SOC 2010, column 4 provides typical

responsibilities and column 5 lists job titles related to each code. Some of the job titles considered

most relevant to data-building and data-based knowledge creation are highlighted in red. In column 6

we conjecture at which stage in the big data production chain these occupations are likely engaged.

From reading the associated responsibilities and related job titles in columns 4 and 5 it is clear that

workers allocated to these codes will include workers involved in the upstream stages shown in Figure

1,8 in particular software professionals which includes job titles such as “analyst-programmer”,

“systems analyst” and “data communications analyst” and whom we would expect to find in the

knowledge creation stage of our framework. Other workers with job titles such as “data processing

manager”, “data entry clerk” and “data processor” are more likely involved in the data-building stage.

8 The methodology for estimating investment in computerised information is based on a past vintage of the SOC

(2000). However, inspection of the latest revision to the SOC (2010) shows that the occupational coding is still

not sufficiently granular to separately identify the workers we are seeking to measure.


In practice some of these workers are likely involved in a mix of data-building and data analytics

activity.

Table 3: Occupations used in estimation of UK investment in OACI (own-account computerised information)

Columns: (1) SOC (2000) code; (2) Occupation; (3) Where in SOC (2010)?; (4) Responsibilities; (5) Related job titles included (SOC00 and SOC10); (6) Stage in Big Data Production Chain

(1) 1136
(2) Information and communication technology managers
(3) 1136: Information technology and telecommunications directors; 2133: IT Specialist Managers
(4) Planning, organising and directing work necessary to operate and provide ICT services, maintaining and developing associated network facilities and providing software and hardware support.
(5) computer manager, computer operations manager, data processing manager, IT manager, systems manager, telecom manager, IT director, technical director (computer services), telecommunications director, data centre manager, IT support manager, network operations manager (computer services), service delivery manager
(6) D

(1) 2131
(2) IT strategy and planning professionals
(3) 2134: IT project and programme managers
(4) Providing advice on the effective utilisation of information technology in order to solve business problems or to enhance the effectiveness of business functions.
(5) computer consultant, software consultant, IT consultant, implementation manager (computing), IT project manager, programme manager (computing), project leader (software design)
(6) D/N

(1) 2132
(2) Software professionals
(3) 2135: IT Business Analysts, Architects and Systems Designers; 2136: Programmers and software development professionals; 2137: Web design & development professionals
(4) All aspects of the design application and development and operation of software systems.
(5) analyst-programmer, computer programmer, software engineer, systems analyst, systems designer, business analyst (computing), data communications analyst, database developer, games programmer
(6) N

(1) 3131
(2) IT operations technicians
(3) 3131: IT operations technicians
(4) The day-to-day running of computer systems and networks, including the preparation of back-up systems, and performing regular checks to ensure the smooth functioning of such systems.
(5) computer operator, database manager, IT technician, network technician, systems administrator, web master, database administrator
(6) D

(1) 3132
(2) IT user support technicians
(3) 3132: IT user support technicians
(4) Providing technical support, advice and guidance for customers or IT users within an organisation, either directly or by telephone, e-mail or other network interaction.
(5) helpdesk operator, helpline operator (computing), IT helpline support officer, support technician (computing), systems support officer
(6) D/N

(1) 4136
(2) Database assistants/clerks
(3) 4131: Records clerks and assistants
(4) Creating, maintaining, preserving and updating information held in electronic databases, computer files, voice mailboxes and e-mail systems.
(5) computer clerk, data entry clerk, data processor, VDU operator.
(6) D

(1) 5245
(2) Computer engineers, installation and maintenance
(3) 5245: IT engineers
(4) Installing, maintaining and repairing personal computers, mainframe and other computer hardware.
(5) computer engineer, computer maintenance manager, computer service engineer, computer service technician, computer repairer, hardware engineer (computer), maintenance engineer (computer servicing)
(6) D/N

Source: Table 1 and Table 6 of Chamberlin, Clayton et al. (2007) modified with mapping to SOC 2010 and to

the big data production chain.

Notes to table: Column 1 is the official occupational code used to identify workers that produce assets in

computerised information, and column 2 the occupational title for that code. Since the methodology is based on

SOC 2000, column 3 maps to the latest revision of the SOC (2010). Column 4 lists typical responsibilities in the

role. Column 5 shows other job titles typically used for that occupation, taken from documentation for SOC

2000 and SOC 2010. Job titles most relevant to data-based activity are highlighted in red. Column 6 shows

the stage of the Big Data production chain in which these workers are likely engaged.

As can be seen from column 2, one of the occupations used in the measurement of own-account

computerised information (OACI) is “database assistants/clerks” (SOC00 4136). Whilst we might

expect such workers to be engaged in the data-building (D) stage of our framework, the detailed job

description and tasks in the SOC documentation are actually a better fit with administrative roles

rather than occupations typically associated with big data and data analytics. Indeed in the latest

revision to the SOC (SOC 2010), this occupational group maps to secretarial/administrative

occupations that are not associated with IT, or components of IT such as software and/or data and data

analytics.


Figure 2 shows how market sector employment in each of the occupations in Table 3 has changed

over time.

Figure 2: UK market sector employment in own-account software occupations, by occupation code

Note to figure: Each line represents the number of people employed in each occupational code in Table 3 in the

UK market sector. Market sector defined as UK economy excluding public administration & defence (O),

education (P) and health (Q). Constructed from ASHE microdata held in the Secure Data Service at the UK

Data Archive (Office for National Statistics).

From Figure 2 the largest employment group among these occupations is “Software professionals”

(2132) which included 290,000 workers in 2011. The next largest group is “Information and

communication technology managers” (1136) with 154,000 workers in the same year, followed by

“IT strategy and planning professionals” (2131) with 103,000 workers, “IT operations technicians”

(3131) at 83,000, “IT user support technicians” (3132) at 55,000, “Database assistants/clerks” (4136)

at 28,000 and “Computer engineers, installation and maintenance” (5245) at 13,000.

It is worth noting the steady decline in the number of “Database assistants/clerks”, further supporting

the idea that this occupational group does not include the types of workers we are searching for. We

also note the decline in the number of “IT operations technicians” (3131), which from Table 3

includes “database manager” as a related job title.

[Figure 2 plot. Vertical axis: number employed, 0 to 350,000. Horizontal axis: years 1997 to 2011. One line per occupation: 1136 Information and communication technology managers; 2131 IT strategy and planning professionals; 2132 Software professionals; 3131 IT operations technicians; 3132 IT user support technicians; 4136 Database assistants/clerks; 5245 Computer engineers, installation and maintenance.]


Much of the growth in employment in OACI occupations is driven by growth in the number of

“Software professionals” (2132). As outlined above, we suspect that some proportion of these

workers work in the D and N stages of the data supply chain, either in-house or in specialist firms.

However, they do not exclusively include D and N workers, and further, D and N workers also reside

in occupational codes outside this list, which we explore more below.

4.2. Other D and N workers in the Standard Occupational Classification (SOC)

As well as the occupations used by the ONS in measuring investment in OACI, inspection of the SOC

reveals additional occupations that will include workers involved in D and N stage activity. Table 4 is

laid out in the same format as Table 3 above, and highlighted in column 5 are job titles considered

most relevant to the activity we are seeking to measure.

Other occupational codes that may include workers in data-building and/or data-based knowledge

creation include: research professionals (232),9 management consultants, actuaries, economists and

statisticians (2423); and business and related associate professionals n.e.c. (3539). Other occupational

codes that we speculate may include big data workers include research and development managers

(1137) and science professionals (211). However, even if these occupations include those working on

big data, we still don’t know what fraction are working on big data and we cannot know that until the

official occupational codes become narrow enough to enumerate them separately e.g. data scientist.

To take the next step we therefore have to turn to some other data source.

9 Since we are focusing on the market sector, in our estimation we shall exclude workers in the education sector.

Thus researchers will exclude those working in universities but include researchers in market organisations.


Table 4: Occupations outside official software occupations that potentially include workers in data-building and/or data-based knowledge creation

| SOC (2000) | Occupation | Where in SOC (2010)? | Responsibilities/tasks include | Related job titles included (SOC00 and SOC10) |
|---|---|---|---|---|
| 1137 | Research and development managers | 2150: Research and development managers | Plan, organise, coordinate and direct resources to undertake the systematic investigation necessary for the development of new, or to enhance the performance of existing products and services. | director of research, laboratory manager, research manager, creative manager (research and development), design manager, market research manager, research manager (broadcasting) |
| 211 | Science professionals | 211: Natural and Social Science Professionals | Planning, directing and undertaking research and development, providing technical, advisory and consultancy services in the fields of chemistry, biological sciences, physics, geology and meteorology. | analytical chemist, chemist, development chemist, biomedical scientist, geologist, anthropologist, archaeologist, criminologist, epidemiologist, geographer, historian, political scientist, social scientist, geophysicist, medical physicist, meteorologist, oceanographer, physicist, seismologist, forensic scientist, horticulturist, microbiologist, pathologist, industrial chemist, physical chemist, research chemist, biochemist, biologist, botanist, medical laboratory scientific officer, microbiologist, pathologist, zoologist, geologist, mathematician, physicist, development chemist, bioinformatician, research scientist |
| 232 | Research professionals | 211: Natural and Social Science Professionals; 2426: Business and related research professionals | Planning, directing and undertaking scientific, qualitative and quantitative research through the application of theoretical principles and practical techniques in order to address a research objective | research assistant, research associate, researcher, university research fellow, crime analyst (police force), fellow (research), games researcher (broadcasting), inventor, postdoctoral researcher |
| 2423 | Management consultants, actuaries, economists and statisticians | 2423: Management consultants and business analysts; 2425: Actuaries, economists and statisticians | Advise industrial, commercial and other establishments on a variety of management, personnel, computing and technical matters, and apply theoretical principles and practical techniques to analyse/interpret data used to assist in formulation of financial, business and economic policies. | actuary, business analyst, economist, management consultant, management services officer, statistician, business adviser, business consultant, business continuity manager, financial risk analyst, actuarial consultant, statistical analyst |
| 3539 | Business and related associate professionals n.e.c. | 3539: Business and related associate professionals n.e.c. | Studies particular department or problem area and assesses its interrelationships with other activities; studies work methods and procedures by measuring work involved and computing standard times for specified activities, and produces report detailing suggestions for increasing efficiency and lowering costs | business systems analyst, data analyst, marine consultant, planning assistant, project administrator, project coordinator, conference coordinator, exhibition officer, management information officer, work study engineer, work study officer |

Note to table: Other occupational codes that may include workers in data-building and data-based knowledge

creation. Columns 1 and 2 are occupational groups in SOC 2000 and column 3 maps to occupations from SOC

2010. Column 4 summarises typical responsibilities or tasks. Column 5 shows related job titles.

4.3. Estimating UK big data employment using social media data

If we are to isolate those workers in the big data sphere (either those currently in computerised

information, or those not so classified e.g. economist/statistician) we need to decide how to allocate

them. We might for example undertake a detailed work-study of the occupations and allocate them in

this fashion. This is prohibitively expensive, so we proceeded using social media data. We gathered

data on UK (market sector) employees in 2010 using a snapshot of publicly available information


on job titles/descriptions and employee skills from an employment-based social media network, with

the dataset constructed in 2011.10

First, we classified employees according to their occupations transforming the job titles that workers

report on the particular platform to the SOC. Second, we then computed the fraction of workers in that

occupation with big data skills: those for example, who can use Hadoop, Python etc. or that report an

application of skills or job description that related to big data (e.g. data/text mining, data visualisation,

predictive analytics etc.). Our method therefore has similarities with that used by Mandel and Scherer

(2014). That is, we construct a list of keywords and search the profiles of members to estimate the

number of workers with skills in the production of (transformed) data and/or data-based knowledge.

Our list of keywords is provided in the Appendix. We believe the list to be relatively comprehensive,

although there will obviously be some terms/words we haven’t included. However, as noted by

Mandel and Scherer (2014), only one matching word is required to extract a relevant profile, meaning

that there are diminishing returns to an ever expanding list of keywords.
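The keyword search described above can be illustrated with a minimal sketch. The keyword subset, the profile records and the function name `has_big_data_skill` are hypothetical and chosen only for illustration; the keywords actually used are those listed in the Appendix, and the exact matching rules applied to the dataset are not reproduced here. A profile is flagged as soon as one keyword matches, which is why lengthening the list yields diminishing returns.

```python
import re

# Hypothetical subset of keywords; the full list is given in the Appendix.
KEYWORDS = ["hadoop", "python", "data mining", "text mining",
            "data visualisation", "predictive analytics"]

# One case-insensitive pattern; word boundaries avoid matches inside longer words.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def has_big_data_skill(profile):
    """Return True if any keyword appears in the listed skills or job description."""
    text = " ".join(profile.get("skills", [])) + " " + profile.get("description", "")
    return PATTERN.search(text) is not None

# Made-up profiles, for illustration only.
profiles = [
    {"title": "Software engineer", "skills": ["Java", "Hadoop"], "description": ""},
    {"title": "Statistician", "skills": [], "description": "Predictive analytics for retail"},
    {"title": "Helpdesk operator", "skills": ["Customer service"], "description": ""},
]

flagged = [p["title"] for p in profiles if has_big_data_skill(p)]
print(flagged)
```

Because a single match is enough, a profile listing both "Hadoop" and "data mining" is counted once, not twice.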

Third, we need a method to convert our sample to estimates of the population. We proceed as

follows. We take the share, described above, for each occupation with big data skills, and apply that

share to grossed-up estimates of employment (by SOC) from ASHE (Office for National Statistics).

In particular, we benchmark to the occupations used in the measurement of own-account

computerised information as detailed in Table 3, as well as other occupations where such workers

may reside as detailed in Table 4, and some additional occupations that we found reporting big data

skills in our preliminary analysis of the data.

There are of course a number of issues in this procedure. First, we necessarily assume that workers

with (big) data skills work in (big) data occupations. We note however that those registered with such

networks are very aware of the growing interest in such skills among employers, and so there may be

some bias to our estimates if members have enhanced or exaggerated their skill profile in response, or

if they are simply advertising their skills but not currently working in (big) data related roles.

Second, our dataset is a snapshot of member profiles in 2010, providing data on the job titles, job

descriptions, industry and skills of 43.6m members worldwide. Of those 43.6m, 3.6m are based in the

UK.11

Of those 3.6m UK members, around 2.4m report a job title, and of those, 1.5m work in the UK

market sector.12

10 Since we are estimating employment in the UK market sector, we exclude workers whose self-reported industry maps to public administration and defence (O), education (P) or health (Q) in SIC07.

11 ONS Labour Market Statistics, released in June 2014, show that in February to April 2014, UK employment was 30.54m. The corresponding figure for February to April 2010 was 28.84m.

12 We focus on the UK data but note that the larger worldwide sample shares very similar characteristics.

Of those market sector members that report a job title, 0.46m report at least one skill


on their profile. We are therefore working with a sample of employment, but note that sample may

not be representative. In particular we note that workers in data transformation and data analytics, as

well as other professional and technical occupations/industries, are on average more likely to be

represented on such networks, as are younger workers who may also be more likely to possess such

skills.

Third, the number of market sector workers that report a job title but do not list a skill is therefore

large, around 1m or 69%. An obvious concern worth noting is that this may introduce bias into our

analysis if (big) data workers are more or less likely to report a skill than the average member.

4.3.1. Results

We find that in total, 12,548 UK (market sector) members report either a big data skill, competency or

description (from those listed in Appendix Table A1). The results by occupation are set out in Table

5. The table is split into two panels. Panel 1 presents data for occupations used to estimate own-

account investment in computerised information and panel 2 for other occupations outside that list.

Columns 1, 2 and 3 are the occupational groups taken from SOC 2000 and SOC 2010 respectively.

Column 4 reports the number of members that fall under each occupation. Column 5 reports the

number identified for each occupation that also report (big) data skills, competencies or job

descriptions. Column 6 reports the ratio of column 5 to column 4. Column 7 reports market sector

employment for each occupation from ASHE (Office for National Statistics). Finally, column 8

reports our estimate of big data employment, by occupation, estimated as column 6 times column 7.
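The grossing-up step (column 6 times column 7) can be checked with a short sketch. The input figures are taken from three Panel 1 rows of Table 5; the function name `gross_up` is hypothetical, and the sketch assumes the ratio is applied unrounded, which the text does not state explicitly.

```python
# (members in occupation, members with big data skills, ASHE employment 2010),
# taken from Table 5 for three Panel 1 occupations.
rows = {
    "1136": (24143, 955, 159974),
    "2132": (13889, 3636, 289823),
    "3131": (7595, 2119, 93430),
}

def gross_up(members, big_data_members, ashe_employment):
    """Return the big data share and the share scaled to ASHE employment."""
    share = big_data_members / members  # assumption: ratio is not rounded first
    return share, round(share * ashe_employment)

estimates = {soc: gross_up(*values) for soc, values in rows.items()}
for soc, (share, workers) in estimates.items():
    print(f"{soc}: share {share:.1%}, big data employment {workers:,}")
```

Under that assumption the scaled figures for these three occupations are 6,328, 75,873 and 26,067, matching the final column of Table 5.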

Using our list of keywords, we identify 12,548 instances of ‘big data workers’ in the UK market

sector. Summing down column 5 of panels 1 and 2 shows that we allocate 9,942 (79%) of those to

occupations in the SOC, leaving 2,606 (21%) unallocated. Inspection of those 2,606 unallocated job

titles shows that the majority are: undergraduate or postgraduate students (particularly PhD students);

members that report themselves as owners, co-owners or founders; or members that report themselves

as freelance with no additional information. We therefore assume that students are not employed, and

we exclude owners/founders/freelancers as we are benchmarking to ASHE, a survey of employees

which does not include the self-employed.


Table 5: Social media data: Big data employment by SOC

Data building and Data analytics: Official occupation groups and job titles related with those occupations (Panel 1: ONS software occupations; Panel 2: Other occupations). Columns (A) to (C): Social media data (2010). Column (D): ASHE data (2010). Column (E): Final estimates.

| SOC (2000) | Occupation | SOC (2010) | (A): Number of people for each SOC | (B): Number of "Big Data workers" by SOC | (C): Ratio of "Big Data workers" to SOC = (B)/(A) | (D): ASHE Employment (UK market sector, 2010) | (E): Big Data Employment (scaled, UK market sector) = (C)*(D) |
|---|---|---|---|---|---|---|---|
| Panel 1: ONS software occupations | | | | | | | |
| 1136 | Information and communication technology managers | 1136: Information technology and telecommunications directors; 2133: IT Specialist Managers | 24,143 | 955 | 4.0% | 159,974 | 6,328 |
| 2131 | IT strategy and planning professionals | 2134: IT project and programme managers | 11,936 | 659 | 5.5% | 99,387 | 5,487 |
| 2132 | Software professionals | 2135: IT Business Analysts, Architects and Systems Designers; 2136: Programmers and software development professionals; 2137: Web design & development professionals | 13,889 | 3,636 | 26.2% | 289,823 | 75,873 |
| 3131 | IT operations technicians | 3131: IT operations technicians | 7,595 | 2,119 | 27.9% | 93,430 | 26,067 |
| 3132 | IT user support technicians | 3132: IT user support technicians | 2,186 | 241 | 11.0% | 61,860 | 6,820 |
| 4136 | Database assistants/clerks | 4131: Records clerks and assistants | 248 | 10 | 4.0% | 30,796 | 1,242 |
| 5245 | Computer engineers, installation and maintenance | 5245: IT engineers | 637 | 49 | 7.7% | 13,499 | 1,038 |
| | Subtotal - All "software occupations" | | 60,634 | 7,669 | 12.6% | 748,769 | 122,855 |
| Panel 2: Other occupations | | | | | | | |
| 1132 | Marketing and sales managers | 3545: Sales accounts and business development managers | 20,192 | 332 | 1.6% | 514,489 | 8,459 |
| 1134 | Advertising and public relations managers | 2472: Public relations professionals; 2473: Advertising accounts managers and creative directors | 2,991 | 71 | 2.4% | 30,252 | 718 |
| 1137 | Research and development managers | 2150: Research and development managers | 1,001 | 36 | 3.6% | 40,848 | 1,469 |
| 211 | Science professionals | 211: Natural and Social Science Professionals | 845 | 87 | 10.3% | 51,703 | 5,323 |
| 212 | Engineering professionals | 212: Engineering professionals | 7,412 | 413 | 5.6% | 396,375 | 22,086 |
| 232 | Research professionals | 211: Natural and Social Science Professionals; 2426: Business and related research professionals | 723 | 81 | 11.2% | 103,455 | 11,590 |
| 2423 | Management consultants, actuaries, economists and statisticians | 2423: Management consultants and business analysts; 2425: Actuaries, economists and statisticians | 17,095 | 1,144 | 6.7% | 120,311 | 8,051 |
| 342 | Design Associate Professionals | 342: Design Occupations | 498 | 27 | 5.4% | 59,625 | 3,233 |
| 3539 | Business and related associate professionals n.e.c. | 3539: Business and related associate professionals n.e.c. | 860 | 82 | 9.5% | 67,338 | 6,421 |
| | Subtotal - Other (non-software) occupations | | 51,617 | 2,273 | 4.4% | 1,384,396 | 67,351 |
| | Total - All occupations | | 112,251 | 9,942 | 8.9% | 2,133,165 | 190,206 |

Notes to table: Column 1 is the SOC code for the occupations used in the estimation of own-account investment in computerised information as well as some additional

occupations in which D and N workers reside. Column 2 is the occupation title for that code. Column 3 shows the mapping to SOC 2010. Column 4 is the number of

social network members identified for that occupation. Column 5 is a subset of column 4, and is the number of members who report big data skill(s) that fall under that

occupation. Column 6 is the ratio of column 5 to column 4. Column 7 is the number of UK jobs in those occupations in 2010, constructed from ASHE microdata held in the

Secure Data Service at the UK Data Archive (Office for National Statistics). Column 8 is estimated big data employment derived by applying the ratios in column 6 to

ASHE employment in column 7.


Looking at estimates by occupation, in the top line, for example, we see that 4% of workers whose

occupation is “Information and communication technology managers” (1136) report having big data

skills. For “IT strategy and planning professionals” (2131) we find the number to be 6%. For

“Software professionals” (2132) it is higher, at 26%, and for “IT operations technicians” (3131) it is

higher still at 28%. The fractions for “IT user support technicians” (3132), “Database

assistants/clerks” (4136) and “Computer engineers” (5245) are 11%, 4% and 8% respectively. The

bulk of such workers therefore seem to be in occupations like “software professionals” and “IT

operations technicians”. Looking at the OACI occupations as a group, we estimate that 12.6% of

workers in these software-related occupations are engaged in data and data analytics activity, which is

equivalent to 122,855 workers when grossed up to the UK market sector population.

Of the second group, which lie outside these IT occupations, we find that 11% of “Research

professionals” (232), 10% of “Science professionals” (211), 10% of “Business and related associate

professionals n.e.c.” (3539), 7% of “Management consultants, actuaries, economists and statisticians”

(2423), 6% of “Engineering professionals” (212), 5% of “Design Associate professionals” (342), 4%

of “Research and development managers” (1137), 2% of “Advertising and public relations managers”

(1134) and 2% of “Marketing and sales managers” (1132) are identified as having big data skills.

Grossed up to the UK market sector population, these estimates imply an additional 67,351 workers

not already counted in the measurement of OACI. Taken together, the results in panels one and two

provide an estimate of UK (market sector) big data employment of 190,000, of which two-thirds are

already counted in the measurement of OACI.

There are two ways to interpret our results. The first is that the professional and technical occupations

we are looking to identify are so well represented on this social media network that we are effectively

capturing the universe, or close to it, of UK D and N workers. Note that the 12,548 workers identified

is relatively close to the 17,000 estimate for data-focused roles collected in the e-skills survey, and the

18,720 job ads identified in Mandel and Scherer (2014). The second is that the identified workers are

only a sample and do not represent the universe of workers with big data skills. We take the second

view and gross up our results to the UK population. Alternatively, we may consider the estimates as

lower and upper bounds, at 12,548 and 190,206 respectively, with the latter lying in between

estimates from e-skills UK (2013b) and Mandel and Scherer (2014). We do note however that those

estimates from Mandel/Scherer are for 2014 and the whole economy, compared to ours for 2010 and

the market sector. Allowing for the non-market sector and some growth in activity between those

dates, the two estimates are fairly consistent.


From Table 5 we have that 122,855 out of the 748,769 recorded in official software (and databases)

employment have big data skills, about 16%. In future work we will show that the link between

employment in (big) data and software production has an important implication for measurement. For

now we just note that our results suggest that of the 190,206 identified big data workers, 65% of those

workers are already accounted for in the measurement of employment in OACI. The remaining 35%

are employed in outside occupations.

We therefore find that the majority of data workers are recorded in IT occupations. Similarly, Hawk,

Powers et al. (2015) find that nearly two-thirds of employment in data occupations is in the broad

categories of business/financial occupations and computer/mathematical occupations (34 percent and

31 percent, respectively), including management and market research analysts, software application

developers, computer user support specialists and computer systems analysts.

We note that our objective was to measure employment in the two upstream stages of Figure 1.

However, our second panel includes some occupations whose members, although they report big data

skills, may either be more likely involved in the implementation of (big) data-based knowledge in the

downstream, or alternatively in the use of big data insights upstream for the production of other

forms of knowledge-based capital such as R&D, branding or design e.g. marketing and sales

managers (1132), advertising and public relations managers (1134) and design associate professionals

(342). Excluding these three occupations results in an estimate of big data employment of 177,796, as

opposed to 190,206 with them included.
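The adjustment can be reproduced directly from the final column of Table 5. This is a small arithmetic sketch; the variable names are illustrative only.

```python
# Scaled big data employment from the final column of Table 5.
total = 190206
downstream = {
    "1132 Marketing and sales managers": 8459,
    "1134 Advertising and public relations managers": 718,
    "342 Design associate professionals": 3233,
}

excluded = sum(downstream.values())
upstream_only = total - excluded
print(f"Excluded: {excluded:,}; remaining estimate: {upstream_only:,}")
```

The three excluded occupations account for 12,410 workers, leaving the 177,796 quoted above.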

4.3.2. Potential overlap with employment in R&D

As outlined in Figure 1, we are seeking to measure employment in data-based knowledge creation as

well as in the production/transformation of data that feeds into that process. Therefore, as well as

software, there appears a clear link with employment in another knowledge creation activity, namely

R&D. The national accounts definition of R&D is taken direct from the Frascati Manual (OECD

2002) and is defined as comprising: “creative work undertaken on a systematic basis in order to

increase the stock of knowledge, including knowledge of man, culture and society, and the use of this

stock of knowledge to devise new applications”. Clearly the analysis of data and the creation of data-

based knowledge would appear to meet this rather broad definition.

Data from the Business Expenditure on R&D (BERD)13 survey report R&D employment by product

field, with one of those products being “computer programming and information service activities”.

13 Available at http://www.ons.gov.uk/ons/publications/re-reference-tables.html?edition=tcm%3A77-329762. Accessed on 18th September 2014.


Employment data are presented below in Table 6, and show 27,000 full-time equivalents (FTEs)

working on R&D in this product field in 2013, compared to 178,000 performing Business R&D in

total. Of those 27,000, 14,000 are scientists and engineers. We conjecture that some of those will

include workers deployed in the extraction of data-based knowledge, although some other part will be

working on the development of new or improved software. Where data is used in the production of

knowledge to be applied to some other good/process, such employment may also be recorded

elsewhere in the BERD data, for instance to general R&D or to the primary product of that industry

e.g. pharma. From our discussions with the ONS, we are aware that they consider that BERD data

will potentially include activity in data analytics provided it meets the Frascati definition above. We

also note that R&D employment in this product field grew strongly in 2003, and then remained stable

until 2009, before growing again by 35% between the years 2009 and 2013, possibly reflecting

growth in data analytics activity.
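The growth rates mentioned above can be recomputed from the rounded FTE figures in Table 6 for the product field “computer programming and information service activities”. This is a minimal sketch; the helper `growth` is hypothetical, and because the inputs are rounded to the nearest thousand the results are approximate.

```python
# FTEs (thousands) in the product field, from Table 6.
ftes = {2002: 13, 2003: 19, 2009: 20, 2013: 27}

def growth(start, end):
    """Proportional change in FTEs between two years."""
    return ftes[end] / ftes[start] - 1

growth_2003 = growth(2002, 2003)
growth_2009_2013 = growth(2009, 2013)
print(f"2002 to 2003: {growth_2003:.0%}")
print(f"2009 to 2013: {growth_2009_2013:.0%}")
```

On these rounded inputs the increase between 2009 and 2013 is 35%, as stated in the text, and the jump in 2003 is roughly 46%.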

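Table 6: UK BERD employment

Employment (FTEs, 000s). Columns headed "Total BERD" refer to UK BERD: Total; columns headed "Product" refer to UK BERD: Product: Computer programming and information service activities.

| Year | Total BERD: Total | Total BERD: Scientists and engineers | Total BERD: Technicians, laboratory assistants and draughtsmen | Total BERD: Administrative, clerical, industrial and other staff | Product: Total | Product: Scientists and engineers | Product: Technicians, laboratory assistants and draughtsmen | Product: Administrative, clerical, industrial and other staff |
|---|---|---|---|---|---|---|---|---|
| 2000 | 145 | 86 | 30 | 30 | 10 | 5 | 1 | 4 |
| 2001 | 152 | 93 | 28 | 31 | 11 | 6 | 1 | 3 |
| 2002 | 158 | 96 | 31 | 31 | 13 | 7 | 2 | 4 |
| 2003 | 155 | 99 | 28 | 29 | 19 | 12 | 2 | 5 |
| 2004 | 150 | 94 | 27 | 29 | 19 | 11 | 3 | 5 |
| 2005 | 146 | 94 | 25 | 26 | 19 | 12 | 4 | 4 |
| 2006 | 147 | 92 | 27 | 28 | 20 | 13 | 4 | 4 |
| 2007 | 158 | 90 | 35 | 33 | 21 | 13 | 5 | 3 |
| 2008 | 151 | 86 | 37 | 28 | 20 | 11 | 6 | 3 |
| 2009 | 151 | 85 | 40 | 26 | 20 | 11 | 7 | 2 |
| 2010 | 154 | 87 | 41 | 27 | 22 | 11 | 8 | 3 |
| 2011 | 159 | 90 | 42 | 27 | 23 | 12 | 9 | 2 |
| 2012 | 161 | 91 | 44 | 27 | 24 | 13 | 9 | 2 |
| 2013 | 178 | 98 | 52 | 28 | 27 | 14 | 10 | 2 |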

Source: archives of BERD data

Unfortunately however we have no official information or gauge on just how much data/analytics

activity may be included in the BERD data. The guidance notes to the BERD survey do state that

“consumer surveys, advertising and market research” and “general purpose or routine data collection”

are to be excluded from R&D figures provided, but the potential for some data analytics activity to be

included does remain.

From Table 5 we do have that 40,000 (or 21%) of our identified D and N workers are recorded under

the occupational codes science professionals (211), engineering professionals (212), research


professionals (232) and research and development managers (1137). Therefore, from total estimated

big data employment of 190,000, we estimate that 123,000 (65%) are already recorded in software

occupations, and 40,000 (21%) may be recorded in the measurement of R&D. Note that the 40,000

identified is higher than the 22,000 employed in R&D for “computer programming and information

service activities” in 2010, as reported in Table 6. We also conjecture above that 12,000 (6%) may be

involved in the creation of other forms of knowledge-based capital such as advertising, market

research or design (based on the 12,000 employed as marketing and sales managers (1132),

advertising and public relations managers (1134) and design associate professionals (342) in Table 5).
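The decomposition described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the headcounts are those quoted in the text, while the variable names, the grouping labels and the residual category are assumptions made for the example (in particular, the 12,000 workers in marketing, advertising and design occupations are assumed to sit within the residual).

```python
# Headcounts quoted in the text (drawn from Table 5); names are illustrative.
TOTAL_BIG_DATA = 190_000
RECORDED = {
    "software occupations": 123_000,
    "R&D-type occupations (211, 212, 232, 1137)": 40_000,
}
OTHER_KBC = 12_000  # marketing, advertising and design occupations (1132, 1134, 342)


def share(count, total=TOTAL_BIG_DATA):
    """Percentage of total big data employment, rounded to a whole number."""
    return round(100 * count / total)


# Residual: workers in neither software nor R&D-type occupations.
residual = TOTAL_BIG_DATA - sum(RECORDED.values())

for label, count in RECORDED.items():
    print(f"{label}: {count:,} ({share(count)}%)")
print(f"all other occupations: {residual:,} ({share(residual)}%)")
print(f"  of which other knowledge-based capital: {OTHER_KBC:,} ({share(OTHER_KBC)}%)")
```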

It is also worth making a broader general point about the comparison between these knowledge creation activities. BERD results state that in 2010, UK R&D employment was 154,000, consisting of 87,000 scientists and engineers, 41,000 technicians and 27,000 administrative staff. If we take the sum of the first two of those occupations, then the estimates from e-skills UK and our own suggest that big data employment lies in the range of (31,000/150,000=)21%[14] to (190,000/128,000=)148% of UK R&D employment. If we also incorporate clerical staff, the range is (31,000/178,000=)17% to (190,000/154,000=)123%. Considering the attention devoted to R&D, these are clearly significant estimates. We do note however that traditional R&D is largely concentrated in manufacturing,[15] whilst data activities are likely to be more dispersed across industries, potentially being a feature of any firm/industry that generates, or has access to, raw records or information.
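The ratios quoted in this comparison can be reproduced as follows. This is a minimal sketch using only the figures given in the text and in Table 6; the dictionary layout and function names are hypothetical, and the e-skills UK figure is taken to be the 31,000 estimate for 2013.

```python
# UK BERD employment (Table 6) and big data estimates quoted in the text.
RD_EMPLOYMENT = {
    2010: {"scientists_engineers": 87_000, "technicians": 41_000, "total": 154_000},
    2013: {"scientists_engineers": 98_000, "technicians": 52_000, "total": 178_000},
}
BIG_DATA = {
    "e-skills UK": (31_000, 2013),
    "this paper": (190_000, 2010),
}


def pct(numerator, denominator):
    """Ratio expressed as a whole-number percentage."""
    return round(100 * numerator / denominator)


def ratios_to_rd(big_data_jobs, year):
    """Return (ratio to scientists plus technicians, ratio to all R&D staff)."""
    rd = RD_EMPLOYMENT[year]
    technical = rd["scientists_engineers"] + rd["technicians"]
    return pct(big_data_jobs, technical), pct(big_data_jobs, rd["total"])


for source, (jobs, year) in BIG_DATA.items():
    technical_pct, total_pct = ratios_to_rd(jobs, year)
    print(f"{source} ({year}): {technical_pct}% of technical R&D staff, "
          f"{total_pct}% of all R&D staff")
```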

4.4. Alternative data sources: Employment in the D and N industries

So far we have presented total estimates of employment of D and N workers. As discussed, some of

those workers will be employed in specialist D and N firms in the D/N industry, and some will be

employed in-house in outside industries. Unfortunately, just as with the Standard Occupational

Classification (SOC), the Standard Industrial Classification (SIC) has not kept pace with this

emerging field, and is not yet sufficiently granular to separately identify economic activity in such

firms. Inspection of the 2007 SIC reveals two industries of particular interest, whose activities are

potentially relevant to data-building (transformation) or data analytics (knowledge creation).

Table 7 provides detail on the economic activities of two industries: Business and domestic software

development (62012) and Data processing, hosting and related activities (63110). The third column

lists the activities included in each industry and highlighted in red are the activities we consider

potentially part of either data-building (D) or knowledge creation (N) (indicated in final column).

[14] E-skills UK estimates are for 2013, so we use an estimate of R&D employment of 150,000 (98,000 scientists and engineers and 52,000 technicians) as in Table 6.

[15] Table 27 of the BERD release shows that, in 2012, of the £12.4bn of R&D that occurred outside of the R&D industry, £6.7bn (54%) took place in manufacturing.

Unfortunately, data are not available for each activity, column 3, in Table 7. Instead, the five-digit

level of the SIC (as in Column 1) is the lowest level of aggregation available. From Table 7, we can

assume that some part of the sales of industry 63110 relate to data-building. We can also assume that

some part of industry 62012 relates to knowledge creation, and another part to data-building. Industry

data for SICs 62012 and 63110 are presented in Table 8.

Table 7: SIC07 Industries whose activities might include data-building and/or data analytics

SIC (2007)  Industry                        Activity                                             Where in our framework?
62012       Business and domestic software  Business and domestic software development
            development                     Custom software development
                                            Data analysis consultancy services                   Knowledge Creation (N) sector
                                            Database structure and content design
                                            Designing of structure and content of business and
                                            domestic software database
                                            Made-to-order software
                                            Programming services
                                            Software house
                                            Software systems maintenance services
                                            System maintenance and support services
                                            Systems analysis (computer)                          Knowledge Creation (N) sector
                                            Web page design
63110       Data processing, hosting and    Batch processing
            related activities              Data conversion
                                            Data preparation services
                                            Data processing
                                            Data storage services
                                            Database running activities
                                            Tabulating service
                                            Time sharing services (computer)
                                            Web hosting
                                                                                                 Data-building (D) sector
                                                                                                 Data-building (D) sector

Note to table: Excerpt from the 2007 Standard Industrial Classification

Table 8: Annual Business Survey (ABS) data

SIC (Revised 2007)  Description                                      Year  Number of    Total      Approximate gross  Total      Total employment -  Total       Total net
Section, Division,                                                         enterprises  turnover   value added at     purchases  average during      employment  capital
Group, Class,                                                                                      basic prices                  the year (1)        costs       expenditure
Subclass                                                                   Number       £ million  £ million          £ million  Thousand            £ million   £ million
62.01/2             Business and domestic software development       2008  18,323       13,681     6,712              6,978      107                 4,614       220
                                                                     2009  11,197       12,899     6,928              6,126      82                  4,128       97
                                                                     2010  15,653       12,859     7,355              5,558      102                 3,888       158
                                                                     2011  22,085       14,889     8,877              6,017      108                 4,399       209
                                                                     2012  27,147       15,562     9,319              6,329      109                 4,771       273
63.110              Data processing, hosting and related activities  2008  2,856        5,059      3,270              1,789      38                  1,633       191
                                                                     2009  2,850        4,876      3,447              1,433      *                   1,409       *
                                                                     2010  2,783        5,640      3,711              1,882      *                   1,618       *
                                                                     2011  2,996        6,437      4,220              2,168      *                   1,621       228
                                                                     2012  3,038        6,676      4,221              2,438      *                   1,643       *

Source: ONS Annual Business Survey (ABS)

*indicates disclosive

Therefore, official data do not give us a precise estimate of employment in the D and N industries.

We do however have an additional point of information, namely the estimate in e-skills UK (2013b)

that of big data activities, 89% are conducted in-house, and 11% are purchased. Using that estimate

allows us to split the employment numbers from various sources into the part that we estimate work in

the specialist D and N industries, and the part that operates in-house in outside industries. Of those

employed in-house in outside industries, e-skills UK (2014) suggests that primary employers are those

in financial services, games, retail and marketing.

Estimates for employment in the data industry compared to in-house employment in outside industries

are summarised below in Table 9. In the final rows we present data for various memo items including

employment for the wider industry as defined by the SIC, big data vacancies, software employment

and R&D employment.

How do our estimates compare to the employment numbers in industries 62012 and 63110 reported in

Table 8? From there we have that industry 62012 (Business and domestic software development)

employed 102,000 people in the year 2010. The figure for 63110 (Data processing, hosting and

related activities) is disclosive for the year 2010 and other years, but that industry employed 38,000 in

the year 2008. Employment costs for 63110 in 2010 are very similar to those in 2008 implying that

the employment level is also similar. Therefore, in total, employment for these two industries in 2010

stood at around 140,000. From Table 9 we estimate that around 20,900, or (20,900/140,000=)15%, of those workers are in the D and N industries. The remainder will be employed in the production of either pre-packaged or custom software,[16] maintenance, consultancy, support, web page design and/or web hosting. Alternatively, using the estimates from e-skills (wide definition) would imply a figure of around (3,410/140,000=)2.4% and those from Mandel/Scherer a figure of (32,340/140,000=)23%.
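The comparison with SIC-based employment can be laid out as a short calculation. The sketch below is an illustration under the same assumption the text makes, namely that 2008 employment in SIC 63110 stands in for the disclosive 2010 figure; the names are hypothetical.

```python
# Employment in the two SIC industries (Table 8). The 63110 figure is for 2008
# and is used as a proxy for 2010, as assumed in the text.
SIC_62012_2010 = 102_000
SIC_63110_2008 = 38_000
SIC_EMPLOYMENT = SIC_62012_2010 + SIC_63110_2008

# Implied employment in the D and N industries (Table 9, second column).
IN_D_AND_N_INDUSTRIES = {
    "this paper": 20_900,
    "e-skills UK (wide definition)": 3_410,
    "Mandel/Scherer": 32_340,
}

# Share of SIC 62012 + 63110 employment, in percent to one decimal place.
shares = {
    source: round(100 * jobs / SIC_EMPLOYMENT, 1)
    for source, jobs in IN_D_AND_N_INDUSTRIES.items()
}

for source, value in shares.items():
    print(f"{source}: {value}% of {SIC_EMPLOYMENT:,}")
```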

[16] Note there will be an element of crossover here in the sense that software provision now includes the sale of software and business solutions that have analytics tools built in. In our framework, such software is capital that is used in the D and N stages of the data supply chain, but the labour that produces that software is not directly employed in D and N activity, i.e. in data transformation or data-based knowledge creation.

Table 9: Estimated workers in D and N industries and in-house in outside industries

Source                                  Year    Estimated D and N   Implied D and N        In-house D and N
                                                employment          employment in D and    employment in
                                                                    N industries           outside industries
e-skills UK, narrow definition          2013    17,000              1,870                  15,130
e-skills UK, wide definition            2013    31,000              3,410                  27,590
Mandel and Scherer / NESTA              2014    294,000             32,340                 261,660
Social media data (this paper)          2010    190,000             20,900                 169,100

Memo items:                                                         Employment in D, N
                                                                    and wider industry
ABI (ONS): SIC 62012 & 63110            2010*   -                   140,000                -
Big data vacancies (Mandel/Scherer)     2014    18,720
Software employment                     2010    748,769
R&D employment                          2010    154,000
  Of which: programming & info services         22,000

Note to table: Estimates of D and N workers located in D and N industries, and in outside industries, based on

the information that 11% of big data activities are outsourced/purchased (e-skills UK 2013b). Thus column 1 is

estimated employment, column 2 is 11% of estimated employment which we allocate to the D and N industries.

Column 3 is the remainder of employment, corresponding to in-house/own-account activity, and is column 1

minus column 2. Memo items include estimates of employment in wider industry that includes the D and N

industries as defined by the SIC. *Data for 2010 is partially disclosive, so employment partly based on 2008

data, but employment costs in 2010 similar to 2008 suggesting employment is also similar. Other memo items

are: big data vacancies as estimated in Mandel and Scherer (2014), employment in software occupations, and

R&D employment in general and also in the product field “computer programming and information service

activities”.
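The allocation described in the note to Table 9 can be sketched as below. This is an illustration only: the 11% outsourced share is the e-skills UK (2013b) figure quoted in the text, while the function and variable names are assumptions made for the example.

```python
OUTSOURCED_PCT = 11  # share of big data activity purchased (e-skills UK 2013b)

# Column 1 of Table 9: estimated D and N employment by source.
ESTIMATES = {
    "e-skills UK, narrow definition (2013)": 17_000,
    "e-skills UK, wide definition (2013)": 31_000,
    "Mandel and Scherer / NESTA (2014)": 294_000,
    "Social media data, this paper (2010)": 190_000,
}


def split_employment(total, outsourced_pct=OUTSOURCED_PCT):
    """Split total D and N employment into (D and N industries, in-house)."""
    in_industry = total * outsourced_pct // 100  # column 2: 11% of column 1
    in_house = total - in_industry               # column 3: column 1 minus column 2
    return in_industry, in_house


for source, total in ESTIMATES.items():
    in_industry, in_house = split_employment(total)
    print(f"{source}: {in_industry:,} in D and N industries, {in_house:,} in-house")
```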

5. Conclusions

Much has been published on the volume, and growth in volume, of data that is available to firms and

used to generate new knowledge via analytics. However, aside from broad statements, in the UK at

least, few hard metrics are available on the scale or volume of big data activity. In this paper we

document various estimates of UK big data employment and produce our own estimates using a novel

dataset. We estimate that in 2010, UK employment in the big data sphere stood at 190,000. Of those

190,000, 65% are measured as part of official measurement of employment in the own-account (in-house)

production of computerised information, 21% are potentially included in the measurement of

business R&D, and 14% are employed in other occupations. Of those other occupations we note the

potential overlap with other measures of knowledge-based capital such as advertising, market research

and design. In future work we will show how to relate our estimates of employment to standard

national accounting procedures for measuring investment in intangible assets. This paper is therefore

a first step to documenting the contribution that data and data-based assets are making to UK growth.

References

Bakhshi, H., J. Mateos-Garcia, et al. (2014). "Model Workers: how leading companies are recruiting and managing their data talent."

Chamberlin, G., T. Clayton, et al. (2007). "New measures of UK private sector software investment." Economic and Labour Market Review 1(5): 17-28.

e-skills UK (2013a). "Big Data Analytics: An assessment of demand for labour and skills, 2012-2017." Report for SAS.

e-skills UK (2013b). "Big Data Analytics: Adoption and Employment Trends, 2012-2017."

e-skills UK (2014). Big Data Analytics: Assessment of Demand for Labour and Skills 2013-2020. T. Partnership.

Hawk, W., R. Powers, et al. (2015). The Importance of Data Occupations in the U.S. Economy, US Department of Commerce, Economics and Statistics Administration.

Mandel, M. (2012). "Where the jobs are: The app economy." South Mountain Economics, LLC. Retrieved June 28: 2012.

Mandel, M. (2013). "Building a Digital City: The Growth and Impact of New York City's Tech/Information Sector." South Mountain Economics, LLC. Prepared for the Bloomberg Technology Summit (September 30, 2013).

Mandel, M. and J. Scherer (2014). "Using Want-Ad Data for Mapping of Jobs and Economic Activity Related to Innovative Technologies." Study funded by NESTA.

OECD (2002). Frascati Manual 2002: Proposed Standard Practice for Surveys on Research and Experimental Development, Paris: OECD.

Office for National Statistics "Annual Survey of Hours and Earnings, 1997-2011: Secure Data Service Access [computer file]. Colchester, Essex: UK Data Archive [distributor], April 2013. SN:6689."

United Nations (2008). "System of National Accounts 2008."

Wong, D. (2012). Data is the Next Frontier, Analytics the New Tool, London: Big Innovation Centre, November. Available at: http://www.biginnovationcentre.com/Publications/21/Data-is-the-nextfrontier-Analytics-the-new-tool.

Appendix 1

Appendix Table A1: ‘Big Data’ keywords

"big data",

"sparql",

"mongodb",

"neo4j",

"elasticsearch",

"lucene",

"nosql",

"cassandra",

"couchdb",

"node.js",

"scala",

"graph databases",

"titan",

"machine learning",

"mlaas",

"data mininig",

"text mining",

"text analytics",

"hbase",

"mapreduce",

"pig",

"web scale architecture",

"hadoop",

"hdfs",

"zookeeper",

"impala",

"datameer",

"riak",

"redis",

"couchbase",

"memcached",

"mysql",

"data science",

"python",

"ruby",

"rest",

"rdf",

"owl",

"semantic web",

"web ontology",

"pattern recognition",

"natural language processing",

"nlp",

"sentiment analysis",

"data visualization",

"predictive analytics",

"computational linguistics",

"informatica",

"predictive modeling",

"semantic technologies",

"hive",

"recommender systems",

"nodejs",

"grid computing",

"sentiment analysis",

"velocity",

"data warehouse architecture"

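One way a keyword list of this kind could be applied to free-text records is sketched below. It is a hypothetical illustration, not a description of the procedure used in this paper: the sample text, the subset of keywords and the function names are all assumptions. The sketch matches whole terms only, since short keywords such as "pig", "rest" or "owl" would otherwise match inside unrelated words.

```python
import re

# A small subset of the Appendix Table A1 keywords, for illustration only.
KEYWORDS = ["big data", "hadoop", "machine learning", "node.js", "pig", "rest"]


def compile_keyword_pattern(keywords):
    """Build one case-insensitive regex matching any keyword as a whole term."""
    # Longest alternatives first so longer phrases are tried before shorter ones.
    alternatives = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    # The lookarounds stop "pig" matching inside "pigment" or "rest" inside "restaurant".
    return re.compile(r"(?<!\w)(?:" + "|".join(alternatives) + r")(?!\w)", re.IGNORECASE)


def matched_keywords(text, pattern):
    """Return the set of distinct keywords (lower-cased) found in the text."""
    return {match.group(0).lower() for match in pattern.finditer(text)}


PATTERN = compile_keyword_pattern(KEYWORDS)

# Hypothetical free-text description, invented for this example.
profile = "Built Hadoop and Pig pipelines; some machine learning. Interested in pigment chemistry."
print(sorted(matched_keywords(profile, PATTERN)))
```

A record could then be flagged as big data related if the returned set is non-empty, although any such threshold is a design choice rather than something specified in the source.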
This paper has been produced by the Department of Management at Imperial College Business School

Copyright © the authors 2014 All rights reserved

ISSN: 1744-6783

Imperial College Business School

Tanaka Building South Kensington Campus London SW7 2AZ United Kingdom

T: +44 (0)20 7589 5111 F: +44 (0)20 7594 9184

www.imperial.ac.uk/business-school

This work is licensed under a Creative Commons Attribution 4.0 International License.