

Big Data:

An Investigation of the Big Data Phenomenon

and its Implications for Accuracy in Modelling

and Analysis

By Leila Al-Mqbali

Directed Research in Social Sciences: SCS 4150

Supervisor: Roman Meyerovich, Canada Revenue Agency

Program Director: Professor Kathleen Day

Disclaimer: Any views or opinions presented in this report are solely those of the student and do not

represent those of the Canada Revenue Agency.

April 23, 2014


Table of Contents

Introduction
What is Big Data?
    Volume
    Velocity
    Variety
The Rise of Big Data and Predictive Analytics
    Recording Data through the Ages: From Ancient to Modern
    Datafication vs. Digitization
Big Data: Inference Challenge
    Introducing “Messiness”
    Data Processing and Analysis: Making a Case for Sampling
    Random Error vs. Non-Random Error
    Precision vs. Accuracy
    Accuracy, Non-Random Error, and Validity
    Precision, Random Error, and Reliability
        Mathematical Indicators of Precision
        Precision and Sample Size
    Minimizing Random Errors and Systematic Errors
    Hypothesis Testing and Sampling Errors
    Big Data: The Heart of Messiness
Patterns and Context: Noise or Signal?
The Predictive Capacity of Big Data: Understanding the Human’s Role and Limitations in Predictive Analytics
    Models and Assumptions
    The Danger of Overfitting
    Personal Bias, Confidence, and Incentives
Big Data: IT Challenge
    The Big Data Stack
    Legacy Systems and System Development
    Building the Big Data Stack
        Storage
        Platform Infrastructure
        Data
        Application Code, Functions and Services
        Business View
        Presentation and Consumption
Big Data: Benefits and Value
    Unlocking Big Data’s Latent Value: Recycling Data
    Product and Service Innovation
        Competitive Advantage
        Improved Models and Subsequent Cost Reductions
        Improved Models and Subsequent Time Reductions
Big Data: Costs and Challenges
    Conceptual Issues: How to Measure the Value of Data
    Recycling Data: Does Data’s Value Diminish?
    Big Data and Implications for Privacy
Cautiously Looking to the Future
    Can N=All?
    Can Big Data Defy the Law of Diminishing Marginal Returns?
Final Remarks
Reference List


Introduction

As the volume and variety of available data continues to expand, many industries are

becoming increasingly fixated on harnessing data for their own advantage. Coupled with ever

advancing technology and predictive analytics, the accumulation of larger datasets allows

researchers to analyze and interpret information faster, and at a much lower cost than has ever

previously been viable. Undoubtedly, Big Data has many advantages when applied to a broad range of business applications, such as cost reductions, time reductions, and more informed decision making. However, Big Data also presents its own set of challenges, including a higher potential for privacy invasion, a higher level of imprecision, and the risk of mistaking noise for true insight.

This paper will address the following. First, we attempt to construct a unified and

comprehensive definition of Big Data, characterized by rapidly increasing volume, velocity, and

variety of data. Subsequently, we will discuss the progressions in analytical thinking which

prompted the emergence of Big Data and predictive analytics. In particular, increased processing

capacity and a willingness to permit “messiness” in datasets were instrumental factors in

facilitating the shift from “small” data to “big”. The nature of this “messiness” is explored

through a discussion of sampling, random error, and systematic error.

In addition, we will address the importance of considering correlations in context, in

order to discern noise from signal. The predictive capacity of Big Data is restricted by the

models used to infer testable hypotheses, and so we must consider the limitations and

shortcomings models introduce into the analysis through their underlying assumptions.

Specifically, we will assess the dangers of overfitting, personal bias, confidence, and incentives.


Big Data signifies major environmental changes in firm objectives, often requiring

considerable modifications to computer and processing systems. To address these changes, a new

architectural construct has emerged known as the “Big Data Stack”. The construct comprises several interlinked components, and we address each in turn in order to assess the effectiveness of the Big Data Stack in Big Data analytics.

As Big Data continues to develop, it is essential that we undertake careful examination of

the potential costs and benefits typically associated with its deployment. In particular, we will

discuss the benefits of Big Data in terms of data’s potential for re-use, as well as cost and

decision time reductions resulting from improved modelling techniques. With regards to the

potential challenges presented by Big Data, we discuss conceptual issues, privacy concerns, and

the loss of data utility over time.


What is Big Data?

Before discussing Big Data in depth, it is essential to establish a clear understanding of what Big Data is. Definitions are important, and a single accepted definition

across industries and sources would be ideal, as ambiguous definitions often lead to inarticulate

arguments, inaccurate cost/benefit analyses, and poor recommendations. Unfortunately,

following the review of various sources it is evident that providing a definitive and cohesive

definition of Big Data is perhaps not so simple. Some institutions view Big Data as a broad

process used to encompass the continuous expansion and openness of information; for example

Statistical Analysis System (SAS) characterizes Big Data as “a popular term used to describe the

exponential growth and availability of data, both unstructured and structured.” (“Big Data: What

it is and why it matters”, n.d.) Others focus more on the increased processing capabilities that the

use of Big Data necessitates in order to construct a definition. O’Reilly Strata, a conference series and publication focused on big data technology and business strategy, states that “Big Data is data that exceeds

the processing capacity of conventional database systems.” (Dumbill, 2012, para. 1). Yet further

ambiguity is introduced by other authorities who define Big Data in terms of its potential future, rather than current, use and value. For instance, Forbes magazine claims that Big Data is

“a collection of data from traditional and digital sources inside and outside your company that

represents a source for ongoing discovery and analysis” (Arthur, 2013, para. 7).

Is there a way forward, given such divergent opinions? Obviously, each of the definitions

captures an important conceptual element, worthy of note. Big data can be big in terms of

volume, and indeed it also requires advanced processing techniques and has important

implications for firm profitability and innovation. However, where all these definitions fall short


is in comprehending that Big Data must embody all three of these characteristics at once. Big

Data is more akin to a subject or discipline than a description of a single event or a process, and

therefore emphasizing individual characteristics is not enough to distinguish it as such. Big Data

is unique in that its elements of data volume, velocity, and variety, are increasing rapidly at

different rates and in different formats. Ultimately, this leads to challenges in data integration,

measurement, interpretation, and replicability of the results.

Volume

The volume of Big Data is increasing due to a combination of factors. First, as we move forward in time, the number of data points available to us increases. Moreover, where

previously all data were amassed internally by company employees’ direct interaction with

clients, innovation in other industries, such as the invention of cellular devices, has resulted in

more and more data being indirectly generated by machines and consumers. Cellular usage data

did not exist before the creation of the cell phone, and today millions upon millions of cellular

devices transmit usage data to various networks all over the world. Furthermore, the creation of

the internet has permitted consumers to play an active role in data generation, as they knowingly

and willingly provide information about themselves that is available to various third parties. The

internet and web logs have allowed for transactions-based data to be stored and retrieved, and

have facilitated the collection of data from social media sites and search engines. Thus, the

combination of a larger number of data sources with a larger quantity of data results in the

exponential growth of data volume, which is related to improved measurement. However, it is


important to stress that it is not greater volume in absolute terms that characterizes Big Data, but

rather a higher volume relative to some theoretical, final set of the clients’ data.

Velocity

With regards to velocity, as innovation continues to modify many industries, the flow of

data is increasing at an unparalleled speed. Ideally, data should be processed and analyzed as

quickly as possible in order to obtain accurate and relevant results. Analyzing information is not an issue as long as data arrives more slowly than it can be processed. In

this case, the information will still be relevant when we obtain the results. However, with Big

Data the velocity of information is so rapid that it undermines previous methods of data

processing and distillation, and new tools and techniques must be introduced to produce results

that are still relevant to decision-makers. Data is streaming in at an accelerated rate, while

simultaneously processing delays are decreasing at such a rate that the data arrival and

processing may eventually approach real time. However, such capabilities are not yet within reach, and the need for immediate reaction to incoming data poses an ongoing

challenge for many firms and industries.

Variety

Finally, the third central element of Big Data is its variety. Consisting of many different

forms, Big Data represents the mix of all types of data, both structured and unstructured.

McKinsey Global Institute (2011) defines structured data as “data that resides in fixed fields.


Examples of structured data include relational databases or data in spreadsheets” (Manyika,

Chui, Brown, Bughin, Dobbs, Roxburgh, & Byers, 2011, p.34). In contrast, unstructured data is

described as “data that do not reside in fixed fields. Examples include free-form text, (e.g. books,

articles, body of e-mail messages), untagged audio, image and video data” (Manyika et al., 2011,

p. 34). However, Big Data is trending towards less structured data and a greater variety of

formats (due to a rising number of applications). Where increased volume is related to improved

measurement, increased variety is associated with greater potential for innovation. Lacking

cohesion in input configuration, the effective management and reconciliation of the varying data

formats remains a persistent obstacle that organizations are attempting to overcome.

Having discussed the various components of Big Data, it is evident that articulating a

succinct and precise definition in a few simple sentences is challenging, if not impossible. Big

Data is not a series of discretely separable trends; it is rather a dynamic and multi-dimensional

phenomenon. In confining our definition to a few lines, we restrict our understanding and

introduce a haze of ambiguity and uncertainty. Instead, by focussing on Big Data as a

multidimensional process, we bring ourselves a step closer to a fuller and deeper understanding

of this new phenomenon.


The Rise of Big Data and Predictive Analytics

Previously, we defined Big Data as consisting of three intertwined dimensions: volume,

velocity, and variety. Now, we briefly look at changes in analytical thinking that took place over

a long period of time, and in the final analysis gave rise to Big Data. At their core are several

concurrent changes in the analysts’ mindset that support and reinforce each other. Firstly, there

was a move towards the capacity to process and analyze increasingly sizeable amounts of data

pertaining to a question of interest. Second, there was a readiness to permit messiness in datasets

rather than restricting our analysis to favour the utmost accuracy and precision.

Recording Data through the Ages: From Ancient to Modern

The emergence of Big Data is rooted in our natural desire to measure, record, and evaluate

information. Advances in technology and the introduction of the Internet have simply made

documentation easier and faster, and as a result we are now able to analyse progressively larger

datasets. In fact, the methods used to document history have been developing for millennia; from

Neanderthal cave art to early Sumerian pictograms, and finally to the digital era we know today.

Basic counting and an understanding of the passage of time are possibly the oldest conceptual

records known to us, but around 3500 BC, the early Mesopotamians made a discovery that

transformed the way information was transmitted through the generations and across regions.

Mesopotamians had discovered a method of record keeping (now known as Cuneiform) by

inscribing symbols onto clay tablets, which were used to communicate objects or ideas. It was

this – the invention of writing – that gave rise to the dawn of the information revolution,


permitting “news and ideas to be carried to distant places without having to rely on a messenger's

memory” (“Teacher Resource Center Ancient Mesopotamia: The Invention of Writing”, para. 3).

Cuneiform script formed the basis of future record keeping, and as records advanced to printed

text and then again to the digital world, Big Data emerged in its wake.

Essentially, the combination of “measuring and recording ... facilitated the creation of

data” (Mayer-Schonberger & Cukier, 2013, p. 78), which in turn had valuable effects on society.

Sumerians employed what is known today as descriptive analytics: they were able to draw

insight from the historical records they created. However, somewhere along the journey of

documenting information, a desire was born to use it. It was now possible for humanity to

reproduce past endeavours from documentation of their dimensions, and the process of recording

allowed for more methodical experimentation – one variable could be modified while holding

others constant. Moreover, industrial transactions could be calculated and recorded, aiding in

predicting events such as annual crop yield, and further developments in mathematics “gave new

meaning to data – it could now be analyzed, not just recorded and retrieved” (Mayer-Schonberger &

Cukier, 2013, p. 80). Thus, it is evident that developments in data documentation had significant

implications for civilization. Parallel to these advances, means of measurement were also

increasing dramatically in precision – allowing for more accurate predictions that could be

derived from the collected documentation.

Nurtured by the rapid growth in computer technology, the first corporate analytics group

was created in 1954 by UPS, marking the beginning of modern analytics. Characterized by a

relatively small volume of data (mostly structured data) from internal sources, analytics were

mainly descriptive and analysts were far removed from decision makers. Following the

turn of the millennium, internet-based companies such as Google began to exploit


online data and integrate Big Data-type analytics with internal decision making. Increasingly,

data was externally sourced and the “fast flow of data meant that it had to be stored and

processed rapidly” (Davenport & Dyché, 2013, p. 27).

Advancements in computers aided in cementing the transition from descriptive analytics to

predictive analytics, as efficiency was increased through faster computations and increased

storage capacity. Predictive analytics is defined by SAS as “a set of business intelligence (BI)

technologies that uncovers relationships and patterns within large volumes of data that can be

used to predict behavior and events” (Eckerson, 2007, p. 5). As the amount of data continues to

grow with technological developments, these relationships are being discovered at a much faster

speed and with greater accuracy than previously attainable. In addition, it is important to

distinguish between predictive analytics and forecasting. Forecasting entails predicting future

events, while predictive analytics adds a counter-factual by asking “questions regarding what

would have happened... given different conditions” (Waller & Fawcett, 2013, p. 80).

Furthermore, there is a growing interest in the field of behavioural analytics; consumers are

leaving behind “‘digital footprint(s)’ from online purchases ... and social media commentary

that’s resulting in part of the Big Data explosion” (Davenport & Dyché, 2013, p. 27). Effectively,

these communications are informing targeting strategies for various industries and advertisers1.

In sum, using larger quantities of information to inform and enrich various types of

business analytics was a fundamental factor in the shift to Big Data. Thus, as the volume of data

increased exponentially with the arrival of computers and the internet, so too did the variety of

the information and the potential value that could be extracted from it. Continuously developing

1 While promising tremendous benefits, behavioural analytics entails certain risks and challenges for society (such as implications for the role of free will) which must be addressed in a timely manner to avoid political and social backlash. These issues are beyond the scope of this paper.


computing technologies and software, combined with their increasingly widespread use,

facilitated the shift to Big Data.

Datafication vs. Digitization

One important technological development in the evolution of Big Data is what

Mayer-Schonberger and Cukier (2013) call “datafication”, distinct from the earlier invention of

digitization. Digitization refers to the process of converting data into a machine-readable digital

format. For example, a page from a printed book might be scanned to a computer and saved as a bitmap image file.

Datafication, on the other hand, involves taking something not previously perceived to

have informational worth beyond its original function, and transforming it into a “numerically

quantified format” (Mayer-Schonberger & Cukier, 2013, p. 76), so that it may then be charted

and analyzed. Data that has no informational worth beyond its original function is said to lack

stored value, as it cannot be held and retrieved for analytical purposes, and has no usefulness

other than what it presents at face value. Essentially, digitization is an initial step in the

datafication process. For example, consider Google Books: pages of text were scanned to

Google’s servers (digitized) so that they could be accessed by the public through use of the

internet. Retrieving this information was difficult as it required knowing the specific page

number and book title; one could not search for specific words or conduct textual information

analysis because the pages had not been datafied. Lacking datafication, the pages were simply

images that could only be converted into constructive information by the act of reading –

offering no value other than the narrative they described.


To add value, Google used advanced character-recognition software that had the ability to

distinguish individual letters and words: they had transformed the digital images into datafied text (Mayer-Schonberger & Cukier, 2013, p. 82). Possessing inherent value to readers and analysts

alike, this data allowed the uses of particular words or idioms to be charted over time, thus, as an

example, providing new insight on the progression of human philosophy. For instance, it was

able to show that “until 1900 the term ‘causality’ was more frequently used than ‘correlation,’

but then the ratio reversed” (Mayer-Schonberger & Cukier, 2013, p. 83). Combined with

advances in measurement techniques, the development of digital technology has further

increased our ability to analyze a larger volume of data.


Big Data: Inference Challenge

Introducing “Messiness”

Despite Big Data’s noted advances in technological sophistication, it has been argued that

“increasing the volume [and complexity of data] opens up the door to inexactitude” in results

(Mayer-Schonberger & Cukier, 2013, p. 32). This inexactitude has been referred to as Big Data

“messiness”, and the following sections will explore the nature of messiness and why it seems to

be unavoidable in Big Data analytical solutions. Furthermore, we will consider how sampling

errors and sources of data bias are impacted by the use of Big Data analytics.

Data Processing and Analysis: Making a Case for Sampling

Historically, data collection and processing was slow and costly. Attempts to use whole

population counts (i.e. census) produced outdated results that were consequently not of much use

in making meaningful inferences at the time they were needed. This divergence between growth

in data volume and advances in processing methods was only increasing over time, leading the

U.S. Census Bureau in the 1880s to contract inventor Herman Hollerith to develop new

processing methods for use in the 1890 census.

Remarkably, Hollerith was able to reduce the processing time by more than 88%, so that

the results could now be released in less than a year. Despite this feat, it was still so expensive

to acquire and collect the data that the Census Bureau could not justify running a

census more frequently than once every decade. The lag, however, was unhelpful because the

country was growing so rapidly that the census results were largely irrelevant by the time of their


release. Here lay the dilemma: should the Bureau use a sample as opposed to the population in order to help facilitate the development of speedier census procedures? (Mayer-Schonberger & Cukier, 2013, pp. 21-22).

Clearly, gathering data from an entire population is the ideal, as it affords the analyst far

more comprehensive results. However, using a sample is much more efficient in terms of time

and cost. The idea of sampling quickly took root, but with it emerged a new dilemma – how

should samples be chosen? And how does the choice of sample affect the results?

Random Error vs. Non-Random Error

The underlying assumption in sampling theory is that the units selected will be

representative of the population from which they are selected. In the design stage, significant

efforts are undertaken to ensure that, as far as possible, this is the case. Even when conceptually

correct processing methods for sampling selection are used, a sample cannot be exactly

representative of the entire population. Inevitably, errors will occur, and these are known as

sampling errors. True population parameters differ from observed sample values for two

reasons: random error and non-random error (also called systematic bias).2 Random error refers

to the “statistical fluctuations (in either direction) in the measured data due to the precision

limitations of the measurement” (Allain, n.d.). More specifically, random error comes as a result

of the chosen sampling method’s inability to cover the entire range of population variance

(random sampling error), the way estimates are measured, and the subject of the study.

2 Random and non-random errors are both types of sampling errors. Non-sampling errors will be discussed later.


On the other hand, systematic errors describe “reproducible inaccuracies that are

consistently in the same direction [and] are often due to a problem which persists throughout the

entire experiment3” (Allain, n.d.). For example, non-random error may result from systematic

overestimation or underestimation of the population (scale factor error), or from the failure of the

measuring instrument to read as zero when the measured quantity is in fact zero (zero error).

Non-random errors accumulate and cause bias in the final results. In order to evaluate the

impact these non-random errors have on results, we must first consider the concepts of accuracy

and precision.

Precision vs. Accuracy

Bennett (1996) defines accuracy as “the extent to which the values of a sampling

distribution for a statistic approach the population value of the statistic for the entire population”

(p. 135). If the difference between the sample statistic and the population statistic is small, the

result is said to be accurate (also referred to as unbiased), otherwise it is said to be inaccurate. It

is important to note that accuracy depends on the entire range of sample values, not a particular

estimate, and so we refer to the accuracy of a statistic as opposed to that of an estimate.

In contrast, precision reveals “the extent to which information in a sample represents

information in a population of interest” (Bennet, 1996, p. 136). An estimator is called precise if

the sample estimates it generates are not far from their collective average value. Note however,

that these estimates may all be very close together, and yet all may be far from the true

3 Note that human error or “mistakes” are not included in error analysis. Examples of such flaws include faults in calculation, and

misinterpretation of data or results.


population statistic. Therefore, we can observe results which are accurate but not precise, precise

yet not accurate, both, or neither. To put it differently, “precision does not necessarily imply

accuracy and accuracy does not necessarily imply precision” (Bennett, 1996, p.138). These

outcomes are illustrated in the figure below, where the true statistic is represented graphically by the bulls-eye:

[Figure: accuracy, stability, and precision examples for a marksman. Source: Vig (1992), Introduction to Quartz Frequency Standards.]

The first drawing is precise because the sample estimates are clustered close to one another. It is

not accurate, however, because they are far from the centre of the inner circle. The

interpretations of the other drawings follow similar analysis.
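
To make the four combinations concrete, the following minimal Python sketch simulates repeated estimates from four hypothetical estimators; the true value, biases, and spreads are arbitrary numbers chosen only for illustration. A systematic offset harms accuracy, while a large spread harms precision.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0        # the population statistic (the "bulls-eye")
n_estimates = 1000       # number of repeated sample estimates per estimator

# Hypothetical (bias, spread) pairs for the four combinations discussed above.
scenarios = {
    "precise, not accurate": (3.0, 0.1),   # small spread, large systematic offset
    "accurate, not precise": (0.0, 2.0),   # centred on the truth, large spread
    "accurate and precise":  (0.0, 0.1),
    "neither":               (3.0, 2.0),
}

for name, (bias, spread) in scenarios.items():
    estimates = true_value + bias + rng.normal(0.0, spread, n_estimates)
    print(f"{name:22s}  mean = {estimates.mean():6.2f}   spread (std) = {estimates.std():5.2f}")
```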

Accuracy, Non-Random Error, and Validity

The accuracy of a statistic is primarily affected by non-random error. For example, as

previously discussed, non-random error may result from estimates being scaled upwards or

downwards if the instrument persistently records changes in the variable to be greater or less

than the actual change in the observation. In this case, we might find the sample means of our

estimates – though clustered together – are persistently higher than the population mean by a

particular value or percentage, producing a consistent but wholly inaccurate set of results.



Moreover, Bennett (1996) notes that “probably the greatest threat to accuracy is failure to

properly represent some part of the population of interest in the set of units being selected or

measured” (p. 140).

For example, a mailed literacy survey that participants are invited to fill out and return will

result in gross inaccuracy, as it is bound to exclude those people who are illiterate. The concept

of accuracy is also closely linked to validity. Validity is the term used to indicate the degree to

which a variable measures the characteristic that it is designed to measure. Put differently, an

estimator is not valid when it “systematically misrepresents the concept or characteristic it is

supposed to represent” (Bennett, 1996, p.141). For example, taxable income may not be a valid

indicator of household income if particular types of income (such as welfare payments) are

excluded from the data. It is important to note that the validity of an estimator is largely

determined by non-random (systematic) errors in measurement and experimental design.

Therefore, eliminating a systematic error improves accuracy but does not alter precision, as an

increase in precision can only result from a decrease in random error.

Precision, Random Error, and Reliability

Mathematical Indicators of Precision

The extent of the random error present in an experiment determines the degree of precision

afforded to the analyst. In addition, precision and random error are also closely linked to the

perceived reliability of an estimator. Fundamentally, an estimated statistic is considered reliable

when it produces “the same results again and again when measured on similar subjects in similar


circumstances” (Bennett, 1996, p.144). Put differently, results which closely resemble one

another represent a more precise estimator and a lower degree of random error.

Recall that random error is the part of total error that varies between measurements, all else

held equal. The lower the degree of random error, the more precise our estimate will be. How

then do we measure the extent to which random error is present in experiments? Confidence

intervals are commonly used as an indicator of precision, as they measure the probability that a

population statistic will lie within the specified interval.

For example, a 95% confidence interval for a mean might run from 5.3 to 6.7. In effect, this means that if the sampling procedure were repeated many times, we would expect about 95% of the intervals constructed in this way to contain the true population mean. The narrower the confidence band, the more precise the

estimator. Moreover, the standard error of an estimate is also used to indicate precision.

Standard error is essentially the extent of the fluctuation from the population statistic due to pure

chance in sample estimates, and is calculated by dividing the sample variance by the sample size

and then taking the square root. An estimate with high precision (and thus small random error)

will have low standard error.
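
As a minimal illustration of these indicators (the sample below is simulated and its parameters are assumed purely for the example), the following sketch computes a sample mean, its standard error as the square root of the sample variance divided by the sample size, and an approximate 95% confidence interval using the conventional normal critical value of 1.96.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=6.0, scale=2.5, size=100)       # hypothetical sample of 100 observations

mean = sample.mean()
std_error = sample.std(ddof=1) / np.sqrt(len(sample))   # sqrt(sample variance / sample size)

# Approximate 95% confidence interval based on the normal critical value 1.96.
lower, upper = mean - 1.96 * std_error, mean + 1.96 * std_error

print(f"sample mean        = {mean:.2f}")
print(f"standard error     = {std_error:.2f}")
print(f"95% conf. interval = ({lower:.2f}, {upper:.2f})")
```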

Precision and Sample Size

Depending on the statistic under consideration, precision may be dependent on any number

of factors (such as the unit of measurement etc.). However, it is always dependent on sample

size. The explanation for this comes from the nature of random errors. As we have discussed, random errors can occur in any number of observations in an experiment, and each observation is not necessarily distorted in the same direction or to the same degree. Therefore, if we were to repeat a


test with random error and average the results, the precision of the estimate will increase. Also,

“the greater the variation in the scores of a variable or variables on which a statistic is based, the

greater the sample size necessary to adequately capture that variance” (Bennett, 1996, p.139).

Essentially, an experiment with higher random error necessitates a larger sample size to achieve

precision, and the estimate will become more precise the more times the experiment is repeated.

This result follows from the Central Limit Theorem, which states that as the sample size

increases, the sample distribution of a statistic approaches a normal distribution regardless of the

shape of the population distribution. Thus, the theorem demonstrates why sampling errors

decrease with larger samples.
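
The following simulation sketch illustrates this point; the skewed exponential population and the sample sizes are assumptions chosen only for demonstration. Even though the population is far from normal, the spread of the sample means shrinks roughly in proportion to the square root of the sample size.

```python
import numpy as np

rng = np.random.default_rng(2)
n_repeats = 500                     # number of repeated samples drawn at each sample size

for n in (10, 100, 1000, 10000):
    # Draw n_repeats samples of size n from a skewed population (exponential, mean 5)
    # and record the mean of each sample.
    sample_means = rng.exponential(scale=5.0, size=(n_repeats, n)).mean(axis=1)
    print(f"n = {n:6d}   spread of sample means (std) = {sample_means.std():.3f}")
```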

Minimizing Random Errors and Systematic Errors

While it is possible to minimize random errors by repeating the study and averaging the

results, non-random errors are more difficult to detect and can only be reduced by improvement

of the test itself. This is due to the fact that non-random errors systematically distort each

observation in the same direction, whereas random errors may irregularly distort observations in

either direction. To illustrate this more clearly, let us consider the following example. If the same

weight is put on the same scale several times and a different reading (slightly higher or lower) is

recorded with each measurement, then our experiment is said to demonstrate some degree of

random error. Repeating the experiment many times and averaging the result will increase the

precision. However, if the same weight is put on the same scale several times and the results are

persistently higher or persistently lower than the true statistic by a fixed ratio or amount, the


experiment is said to have systematic error. In this case, repeating the test will only reinforce the

false result, and so systematic errors are much more difficult to detect and rectify.

Hypothesis Testing and Sampling Errors

An important principle of sampling is that samples must be randomly selected in order to

establish the validity of the hypothesis test. Hypothesis testing is a method of statistical inference

used to determine the likelihood that a premise is true. A null hypothesis H0 is tested against an

alternate hypothesis H1 (H0 and H1 being mutually exclusive), and the null hypothesis is rejected if there

is strong evidence against it, or equivalently if there is strong evidence in favour of the alternate

hypothesis. It is important to note that failure to reject H0 therefore denotes a weak statement; it

does not necessarily imply that H0 is true, only that there did not exist sufficient evidence to

reject it.

As an example, imagine a simple court case: the null hypothesis is that a person is not

guilty, and that person will only be convicted if there is enough evidence to merit conviction. In

this case, failure to reject H0 merely implies there is inadequate evidence to call for a guilty

verdict – not that the person is innocent. Moreover, it is possible to repeat an experiment many

times under different null hypotheses and fail to reject any of them. Consider if we were to put

each person in the world on trial for a crime – we could hypothetically fail to find sufficient

evidence to convict anyone, even if someone did commit a crime. Therefore, the goal of

hypothesis testing should always be to reject the null hypothesis and in doing so support the

alternate, as it represents a much stronger statement than failure to reject the null.
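
As a rough illustration of these mechanics, the sketch below tests a hypothesized proportion using a simple normal approximation; the null value, observed rates, and sample sizes are invented for the example. The same observed rate that fails to produce sufficient evidence in a small sample can reject the null hypothesis decisively in a much larger one.

```python
import math

def proportion_z_test(successes: int, n: int, p0: float) -> float:
    """Two-sided p-value for H0: the true proportion equals p0 (normal approximation)."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / se
    # Standard normal tail probability computed via the error function.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Small sample: weak evidence, so we fail to reject H0 (a weak statement).
print(proportion_z_test(52, 100, p0=0.5))          # p-value well above 0.05

# Larger sample, same observed rate of 52%: H0 is now rejected.
print(proportion_z_test(52_000, 100_000, p0=0.5))  # p-value far below 0.05
```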


In the probabilistic universe, there is always some level of imprecision and inaccuracy,

however small. Occasionally, an innocent person will be convicted, and sometimes a guilty

person will walk free. Every hypothesis test is subject to error: we have imperfect empirical knowledge, and some data points are always missing, which affects measurement. Furthermore, every study has a level of “acceptable” error (typically denoted by

alpha), which is directly related to the probability that the results inferred will be inexact. For example, alpha = 0.05 indicates that we accept a 5% chance of wrongly rejecting a true null hypothesis – so if we repeated an experiment 1000 times when the null hypothesis is in fact true, we would expect roughly 50 of those repetitions to yield falsely significant results. Type I

error occurs as a result of random error, and results when one rejects the null hypothesis when it

is true. The probability of such an error is the level of significance (alpha) used to test the

hypothesis. Put differently, Type I error is a “false positive” result and a higher level of

acceptable error (i.e. a larger value of alpha) increases the likelihood of a false positive.

According to the Central Limit Theorem, larger samples result in lower Type I error. As

noted previously, Big Data is not only represented by bigger data sets (volume) but also by

different data types (variety). Therefore, in the case of Big Data, the probability that a Type I

error will occur is significantly higher than it would be in a “small” data problem, as the move to

Big Data involves increasing the value of alpha due to increased variety. Indeed, “the era of Big

Data only seems to be worsening the problems of false positive findings in the research

literature” (Silver, 2012, p.253).
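
The interpretation of alpha can be made concrete with a short simulation; the normally distributed data, the sample size, and the number of repetitions below are assumptions made for the example. Because the null hypothesis is true by construction, roughly 5% of the repeated experiments still come out “significant”.

```python
import numpy as np

rng = np.random.default_rng(4)
n_experiments = 1000
n_obs = 50

false_positives = 0
for _ in range(n_experiments):
    # The null hypothesis (mean = 0) is true by construction for every experiment.
    sample = rng.normal(loc=0.0, scale=1.0, size=n_obs)
    z = sample.mean() / (sample.std(ddof=1) / np.sqrt(n_obs))
    if abs(z) > 1.96:                 # reject H0 at alpha = 0.05
        false_positives += 1

print(f"false positives: {false_positives} of {n_experiments} (about 50 expected)")
```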

To lower the likelihood of Type I error, one lowers the level of acceptable error: one

tightens the restrictions regarding which data are permitted in the analysis, thereby reducing the

size of the sample to a small data problem. However, experiments making use of small data are

more prone to errors of Type II: accepting the null hypothesis when it is not true (the alternative


is true). In other words, Type II error refers to a situation where a study fails to find a difference

when in fact a difference exists (also referred to as a false negative result). Effectively,

committing a Type II error is caused by systematic error, and entails accepting a false

null hypothesis. This can negatively impact results, as adopting false beliefs (and drawing further

inferences from analyses under the assumption that your beliefs are correct) can result in further

erroneous conclusions. The possible outcomes for hypothesis testing are shown in the table

below:

Outcomes from Hypothesis Testing

                                              Reality: the null hypothesis    Reality: the alternative hypothesis
                                              is true (no difference)         is true (difference)

Research concludes no difference
(fails to reject the null hypothesis)         Accurate                        Type II Error

Research concludes a difference
(rejects the null hypothesis)                 Type I Error                    Accurate

Thus, for a given sample size the real problem is to choose alpha so as to achieve the greatest

benefit from the results; we consider which type of error we deem to be “more” acceptable. This

is not a simple question as the level of acceptable error is contingent upon the type of research

we are conducting.

For instance, if a potential benefactor refuses to fund a new business venture, they are

avoiding Type I error – which would result in a loss of finances. At the same time, however, they

open themselves to the possibility of a Type II error: that they may be bypassing a potential profit.


It is simply an issue of potential costs vs. potential benefits, and weighing the risk and

uncertainty. Risk is “something you can put a price on” (Knight, 1921, as cited by Nate Silver,

2012, p. 29), whereas uncertainty is “risk that is hard to measure” (Silver, 2012, p. 29). Whereas

risk is exact (e.g. odds of winning a lottery), uncertainty introduces imprecision. Silver (2012)

notes that, “you might have some vague awareness of the demons lurking out there. You might

even be acutely concerned about them. But you have no idea how many of them there are or

when they might strike” (p. 29). In the case of the potential backer, there was too much

uncertainty surrounding the outcome for him to feel comfortable financing the new business.

Like our hypothetical patron, many people are averse to uncertainty when making decisions – that is, they would prefer lower returns with known risks to higher returns with unknown risks – and are consequently more inclined to minimize Type I errors and accept Type II errors.

Consider a second example: results from cancer screening, where the null hypothesis is

that a patient is healthy. Type I error entails telling a patient they have cancer when they do not,

and Type II error involves failing to detect a cancer that is present. Here, the costs of the errors

seem to be much higher, as the patient’s life may be at stake. Type I error can lead to serious side

effects from unnecessary treatment and patient trauma; however, an error of Type II could result

in a patient dying from an undiagnosed disease which could have potentially been treated. In this

scenario, the cost of a Type II error seems to be much greater than that of a Type I error.

Therefore, in this scenario a false positive is more desirable than a false negative, and we seek to

minimize Type II errors. This is exactly the case with hypothesis tests which utilize Big Data; by


increasing the sample size, the power4 of the test is amplified, and thus Type II errors are

minimized.
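
The link between sample size, power, and Type II error can be illustrated by simulation; the effect size, noise level, and sample sizes below are assumptions chosen purely for demonstration. As the sample grows, the test rejects a false null hypothesis more and more reliably, so the Type II error rate falls.

```python
import numpy as np

rng = np.random.default_rng(5)
true_effect = 0.2        # the alternative hypothesis is true: the mean really is 0.2, not 0
n_trials = 2000          # simulated experiments per sample size

for n in (25, 100, 400, 1600):
    rejections = 0
    for _ in range(n_trials):
        sample = rng.normal(loc=true_effect, scale=1.0, size=n)
        z = sample.mean() / (sample.std(ddof=1) / np.sqrt(n))
        if abs(z) > 1.96:            # reject H0: mean = 0 at alpha = 0.05
            rejections += 1
    power = rejections / n_trials    # share of experiments that correctly reject H0
    print(f"n = {n:5d}   estimated power = {power:.2f}   Type II error rate = {1 - power:.2f}")
```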

Assessing the costs of different decisional errors, we can see that the choice of alpha (and

relative likelihood of Type I and Type II errors) must be made on a situational basis, and making

any decision will involve a trade-off between the two types. Furthermore, one cannot easily

make the argument that one type of error is always worse than the other; the gravity of Type I

and Type II errors can only be gauged in the context of the null hypothesis.

The discussion of sampling errors and other sources of bias has significant implications

for Big Data. Decisions regarding Type I and Type II errors introduce bias into datasets, as each

organization executes these decisions in order to fulfil their individual objectives. Typically, each

party is not obligated (or inclined) to share their decision making processes with other parties,

and therefore each organization has imperfect information regarding the data held by others. The

resulting set of Big Data employed by each organization represents an unknown combination of

decisions (biases) to all other organizations. Society’s continuing shift to Big Data implies the

costs of false positives are not perceived to be serious (or the costs of false negatives are

understood to be relatively more serious) for the types of issues being addressed in the

experiments.

4 The power of a test refers to the ability of a hypothesis test to reject the null hypothesis when the alternative

hypothesis is true.


Big Data: The Heart of Messiness

Consider a coin that is tossed 10 times and each observation recorded. We might find the

probability of heads to be 0.8 from our sample, and therefore we do not have sufficient evidence

to reject H0: P(heads)=0.75. In this case, we would be making a Type II error as we would fail to

reject the null when the alternative is true. However, as we increase our sample size to 10000 we

may find that the probability of heads is now 0.52 and so we may consider this sufficient to

reject the null hypothesis. Clearly, in this case, a bigger sample is better as it allows us to gather

more data which can be used as evidence. This result follows from the Law of Large Numbers,

which states that as sample size increases, the sample mean approaches the population mean.

However, it is important to note that this law is valid only for samples that are unbiased: a larger

biased sample will yield next to no improvement in accuracy. Bias is the tendency of the

observed result to fall more on one side of the population statistic than the other: it is a persistent

deviation to one side. With regards to our example, a coin is fair and unbiased in nature (unless it

has been tampered with). A coin toss is just as likely to come up tails as it is to come up heads,

and since there are only two possible outcomes the probability of either is 0.5. In other words,

the unbiased coin “has no favourites”. Thus, as the sample size of coin tosses increases, the

sample mean approaches the true population mean.

Let us now consider a biased experiment. For example, internet surveys are explicitly

(though not deliberately, by design) biased to include only those people who use the internet.

Increasing the number of participants in the survey will not make it any more representative of

the whole population as each time it is repeated it replicates the same bias against people who do

not use the internet. It is important, therefore, to note that increasing the size of a biased sample

is not likely to result in any increase in accuracy.
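
A small simulation may make this concrete. The population below is hypothetical (the 60% internet-usage rate and the “hours online” figures are invented for illustration): an unbiased random sample converges on the true population mean as it grows, whereas a sample drawn only from internet users settles on the wrong value no matter how large it becomes.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical population of 1,000,000 people: 60% use the internet.
# Internet users average about 10 "hours online" per week, non-users 0 (figures invented).
uses_internet = rng.random(1_000_000) < 0.60
hours = np.where(uses_internet, rng.normal(10.0, 2.0, 1_000_000), 0.0)

for n in (100, 10_000, 1_000_000):
    random_sample = rng.choice(hours, size=n)                 # unbiased sampling frame
    online_sample = rng.choice(hours[uses_internet], size=n)  # online-survey frame: internet users only
    print(f"n = {n:9,d}   unbiased estimate = {random_sample.mean():5.2f}   biased estimate = {online_sample.mean():5.2f}")

print(f"true population mean = {hours.mean():.2f}")
```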


Herein lies the crux of messiness. Following from the Central Limit Theorem, we previously discussed precision as increasing with sample size. However, the Law of Large Numbers offers no such guarantee for a biased sample: using a larger sample does not reduce the bias and

may even amplify it, thereby magnifying inaccuracies. Big Data samples may contain a number

of biases, such as self-reporting on social media sites, etc., making accuracy extremely unlikely

to increase with the larger sample size. Some data points are likely to be missing, and it can

never be known with complete certainty exactly what has been omitted. With additional volume

and variety in data points comes additional random errors and systematic bias. The problem lies

in our inability to discern which is increasing faster. This is the nature of messiness in Big Data.

Clearly, the total absence of error is unattainable, as there are always some data points

missing from the experiment. While in theory increasing sample size can increase precision, the

biases inherent in Big Data mean the increase in volume is unlikely to result in any meaningful

improvement.


Patterns and Context: Noise or Signal?

When dealing with Big Data solutions, it is important to distinguish between data and

knowledge so as not to mistake noise for true insight. Data “simply exists and has no

significance beyond its existence... it can exist in any form, usable or not” (Ackoff, 1989, as cited in Riley & Delic, 2010, p. 439). Stated differently, data is signified by a fact or statement of

event lacking an association to other facts or events. In contrast, knowledge is “the appropriate

collection of information, such that it's intent is to be useful. Knowledge is a deterministic

process” (Ackoff, 1989, as cited in Riley & Delic, 2010, p. 439). Therefore, knowledge involves

data which has been given context, and is more than a series of correlations; it typically imparts a

high degree of reliability as to events that will follow an expressed state. To put it differently,

knowledge has the potential to be useful, as it can be analyzed to reveal latent fundamental

principles. The table below provides examples of these related concepts:

Data vs. Knowledge:

Example 1. Data: 2, 4, 8, 16. Knowledge: knowing that this is equivalent to 2^1, 2^2, 2^3, 2^4, and being able to infer the next numbers in the sequence.

Example 2. Data: it is raining. Knowledge: the temperature dropped and then it started raining; inferring that a drop in temperature may be correlated with the incidence of rain.

Example 3. Data: the chair is broken. Knowledge: I set heavy items on a chair and it broke; inferring that the chair may not be able to withstand heavy weights.

Clearly, understanding entails synthesizing different pieces of knowledge to form new

knowledge. By understanding a set of correlations, we open the door to the possibility for the

prediction of future events in similar states. Fundamentally, Big Data embodies the progression


from data to understanding with the purpose of uncovering underlying fundamental principles.

Analysts can then exercise this newfound insight to promote more effective decision making.

Further to this intrinsic procedure from data to understanding, patterns often also emerge

from the manipulation and analysis of the data itself. For example, results demonstrate that there

are visible patterns and connections in data variability; trends in social media due to seasonal and

world events can disrupt the typical data load. Data flows have a tendency to vary in velocity and

variety during peak seasons and periods throughout any given year, and on a much more intricate

scale it is even possible to observe varying fluctuations in the data stream across particular times

of day.

This variability of data can often be challenging for analysts to manage (e.g. server crashes

due to unforeseen escalations in online activity) and furthermore it can significantly impact the

accuracy of the results. Inconsistencies can emerge which obscure other meaningful information

with noise; not all data should necessarily be included in all types of analysis as inaccurate

conclusions may result from the inclusion of superfluous data points. Therefore, it is necessary to

weigh the cost of permitting data with severe variability against the potential value the increase

in volume may provoke. Whereas the internal and patterned process from data to understanding

serves to provide us with insight, the patterns of data variability often present us with obstacles

to this insight.

Identifying a pattern is not enough. Almost any large dataset can reveal some patterns,

most of which are likely to be obvious or misleading. For example, late in the 2002 season, the Cleveland Cavaliers basketball team showed a consistent tendency to “go over” the total for the


game.5 Upon investigation, it was found that the reason behind this trend was that Ricky Davis’

contract was to expire at the end of the season, and so he was doing his utmost to improve his statistics and thereby render himself more marketable to other teams (Davis was the team’s point guard). Given that both the Cavaliers and many of their opponents were out of

contention for the playoffs and thus their only objective was to improve their statistics, a tacit

agreement was reached where both teams would play weak defence so that each team could

score more points.

The pattern of high scores in Cavalier games may seem to be easily explainable. However,

many bettors committed a serious error when setting the line; they failed to consider the context

under which these high scores were attained (Silver, 2012, pp. 239-240). Discerning a pattern is

easily done in a data-rich environment, but it is crucial to consider these patterns within their

context in order to ascertain whether they indicate noise or signal.

5 When assigning odds to basketball scores, bookmakers set an expected total for the game. This total refers to the number of points likely to be scored in the game. Thus, a tendency to “go over” this total refers to the fact that

consistently, in any given game, more points are being scored than expected.

The Predictive Capacity of Big Data: Understanding the

Human’s Role and Limitations in Predictive Analytics

Advocates of the Big Data movement argue that the substantial growth in volume, velocity,

and variety increases the potential gains from predictive analytics. According to them, the shift

towards Big Data should effectively afford the analyst greater capacity to infer accurate

predictions. More sceptical observers argue that in connecting the analyst’s subjective view of

reality with the objective facts about the universe, the possibility for more accurate predictions

hinges on a belief in an objective truth and an awareness that we can only perceive it imperfectly.

As human beings we have imperfect knowledge, and so “wherever there is human

judgement there is the potential for bias” (Silver, 2012, p.73). Forecasters rely on many different

methods when making predictions, but all of these methods are contingent upon specific

assumptions and inferences about the relevant states or events in question – assumptions that

may be wrong. Let us now further examine the limitations that assumptions introduce to the

analysis.

Models and Assumptions

Assumptions lie at the foundation of every model. A model is essentially a theoretical

construct which uses a simplified framework in order to infer testable hypotheses regarding a

question of interest. Dependent upon the analyst’s perceptions of reality, models guide selection

criteria regarding which data is to be included and how it is to be assembled. The analyst must

decide which variables are important and which relationships between these variables are

relevant. Ultimately, all models contain a certain degree of subjectivity, as they employ many

simplifying assumptions and thus capture only a slice of reality. However, the choice of

assumptions in data analysis is of critical importance, as varying assumptions often generate very

different results.

All models are tools to help us understand the intricate details of the universe, but they

must never be mistaken for a substitute for the universe. As Norbert Wiener famously put it, “the

best material model for a cat is another, or preferably the same cat.” (Rosenblueth & Wiener,

1945, p.320). In other words, every model omits some detail of the reality, as all models involve

some simplifications of the world. Moreover, “how pertinent that detail might be will depend on

exactly what problem we’re trying to solve and on how precise an answer we require” (Silver,

2012, p.230).

Again, this emphasizes the importance of constructing a model in such a way that its

design is consistent with appropriate assumptions and examines the relationship between all

relevant variables. As Big Data attracts increasing focus, we must not fail to recognize that the

predictions we infer from analysis are only as valid and reliable as the models they are founded

on.

For example, consider a situation where you are asked to provide a loan to a new company

which operates ten branches across the country. Each branch is determined to have a relatively

small (say 3%) chance of defaulting, and if one branch defaults, its debt to you will be spread

between the remaining branches. Thus, the only situation where you would not be paid back is

the situation in which all ten branches default. What is the likelihood that you will not be repaid?

In fact, the answer to this question depends on the assumptions you will make in your

calculations.

One common dilemma faced by analysts is whether or not to assume event independence.

Two events are said to be independent if the incidence of one event does not affect the likelihood

that the other will also occur. In our hypothetical scenario, if you were to assume that each

branch is independent of the other, the risk of the loan defaulting would be exceptionally small

(specifically, the chance that you would not be repaid is (0.03)^10, or roughly 6 × 10^-16). Even if nine branches were to

default, the probability that the tenth branch would also fail to repay the loan is still only 3%.

This assumption of independence may be reasonable if the branches were well diversified, and

each branch sold very distinct goods from all other branches. In this case, if one branch defaulted

due to low demand for the specific goods that they offered, it is unlikely that the other branches

would now be more prone to default as they offer very different commodities.

However, if each branch is equipped with very similar merchandise, then it is more likely

that low demand for the merchandise in one branch will coincide with low demand in the other

branches, and thus the assumption of independence may not be appropriate. In fact, considering

the extreme case where each branch has identical products and consumer profiles are the same

across the country, either all branches will default or none will. Consequently, your risk is now

assessed on the outcome of one event rather than ten, and the risk of losing your money is now 3%, which is many orders of magnitude (roughly 5 × 10^13 times) higher than the risk calculated under the assumption of independence.
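To make the arithmetic concrete, the short Python sketch below computes the risk of total default under the two extreme assumptions. Only the ten-branch structure and the 3% default probability are taken from the example above; everything else is illustrative.

```python
# Risk of losing the loan in the hypothetical ten-branch example.
independent_risk = 0.03 ** 10   # all ten branches must default independently
correlated_risk = 0.03          # perfectly correlated branches: one event decides all ten

print(f"Risk assuming independence:        {independent_risk:.1e}")  # ~5.9e-16
print(f"Risk assuming perfect correlation: {correlated_risk:.1e}")   # 3.0e-02
print(f"Ratio:                             {correlated_risk / independent_risk:.1e}")  # ~5.1e+13
```

Under these stylized assumptions, the entire difference in the result comes from the single modelling choice about independence.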

Evidently, the underlying assumptions of our analysis can have a profound effect on the

results. If the assumptions upon which a model is founded are inappropriate, predictions based

on this model will naturally be wrong.

The Danger of Overfitting

Another root cause of failure in attempts to construct accurate predictions is model

overfitting. The concept of overfitting has its origins in Occam’s Razor (also called the principle

of parsimony), which states that we should use models which “contain all that is necessary for

the modeling but nothing more” (Hawkins, 2003, p.1). In other words, if a variable can be

described using only two predictors, then that is all that should be used: including more than two

predictors in regression analysis would infringe upon the principle of parsimony. Overfitting is

the term given to the models which violate this principle. More generally, it describes the act of

misconstruing noise as signal6, and results in forecasts with inferior predictive capacity.

Overfitting generally results from the use of too many parameters relative to the quantity of

observations, thereby increasing the random error present in the model and obscuring the

underlying relationships between the relevant predictors. In addition, the potential for overfitting

also depends on the model’s compatibility with the shape of the data, and on the relative

magnitude of model error to expected noise in the data. Here, we define model error as the

divergence between the outcomes in the model and reality due to approximations and

assumptions.

6 This is in contrast with underfitting, which describes the scenario when one does not capture as much of the signal as is possible. Put differently, underfitting results from the fact that some relevant predictors are missing

from the model. We focus here on overfitting as it is more common in practice.

In order to see how the concept of overfitting arises in practice, consider the following.

Suppose we have a dataset with 100 observations, and we know beforehand exactly what the

data will look like. Clearly, there is some randomness (noise) inherent in the dataset, although

there appears to be enough signal to identify the relationship as parabolic. The relationship is as

shown below:

[Figure: True distribution of the data. Source: Silver (2012), The Signal and the Noise.]

However, in reality, the number of observations available to us is usually restricted. Suppose we

now only have access to 25 of the hundred observations. Without knowing the true fit of the data

beforehand, the true relationship appears to be less certain. Cases such as these are prone to

overfitting, as analysts design complex functional relationships that strive to include outlying

data points – mistaking the randomness for signal (Silver, 2012). Below, the overfit model is

represented by the solid line; and the true relationship by the dotted line:

[Figure: Overfit model (solid line) versus the true relationship (dotted line). Source: Silver (2012), The Signal and the Noise.]
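The sketch below reproduces the spirit of this illustration rather than Silver’s actual figures: it draws 100 noisy observations from a parabolic relationship, keeps only 25 of them, and fits both a quadratic and a deliberately over-complex polynomial. The noise level, polynomial degrees, and random seed are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_all = np.linspace(-3, 3, 100)
y_all = x_all**2 + rng.normal(scale=2.0, size=x_all.size)    # parabola plus noise

idx = rng.choice(x_all.size, size=25, replace=False)         # the 25 points the analyst "sees"
x_obs, y_obs = x_all[idx], y_all[idx]

quadratic = np.polynomial.Polynomial.fit(x_obs, y_obs, deg=2)   # matches the true shape
overfit = np.polynomial.Polynomial.fit(x_obs, y_obs, deg=15)    # chases the noise

def r_squared(model, x, y):
    residuals = y - model(x)
    return 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

# The over-complex model "explains" more variance in the 25 points it was fitted to...
print("In-sample R^2:     ", r_squared(quadratic, x_obs, y_obs), r_squared(overfit, x_obs, y_obs))

# ...but typically does worse on the 75 observations it never saw.
held_out = np.ones(x_all.size, dtype=bool)
held_out[idx] = False
print("Out-of-sample R^2: ", r_squared(quadratic, x_all[held_out], y_all[held_out]),
      r_squared(overfit, x_all[held_out], y_all[held_out]))
```

The in-sample versus out-of-sample contrast printed here is the same symptom discussed below: the overfit model appears to explain more of the variance while actually describing the real relationship less well.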

Errors such as these can broaden the gap between the analyst’s subjective knowledge and the

true state of the world, leading to false conclusions and decreased predictive capacity.

Overfitting is highly probable in situations where the analyst has a limited understanding of the

underlying fundamental relationships between variables, and when the data is noisy and too

restricted.

Now that we have examined potential scenarios in which overfitting may occur, we will

examine why it is undesirable in Big Data predictive analytics. First, including predictors which

perform no useful function creates the need to “measure and record these predictors so that

you can substitute their values in the model” (Hawkins, 2003, p.2) in all future regressions

undertaken with the model. In addition to wasting valuable resources by documenting ineffectual

parameters, this also increases the likelihood of random errors which can lead to less precise

predictions. A related issue comes from the fact that including irrelevant predictors and

estimating their coefficients increases the amount of random variation (fluctuations due to mere

chance) in the resulting predictions. Despite these issues, however, perhaps the most pressing

concern regarding overfitting results from its tendency to make the model appear to be more

valid and reliable than it really is. One frequently used method of testing the appropriateness of a

model is to measure how much variability in the data is explained by the model. In many cases,

overfit models explain a higher percentage of the variance than the correctly fit model. However,

it is critical that we recognize that the overfit model achieves this higher percentage “in essence

by cheating – by fitting noise rather than signal. It actually does a much worse job of explaining

the real world.” (Silver, 2012, p.167).

The crux of the problem of overfitting in predictive analytics is that, because the overfit

model looks like a better imitation of reality and thus provides the illusion of greater predictive

capacity, it is likely to receive more attention from publications etc. than models with a more correct fit but apparently lower predictive value. If the overfit models are the ones which are accepted, decision-making suffers as a result of the misleading results.

With Big Data, the problem of overfitting may be amplified, as the nature of Big Data tools

and applications allows us to investigate increasingly complex questions. There are various

techniques for avoiding the problem, some of which are designed to explicitly penalize models

which violate the principle of parsimony. Other techniques test the model’s performance by

splitting the data and using half to build the model and half to validate it (an idea closely related to early stopping). The choice of avoidance mechanism is at the discretion of the analyst and is

influenced by the nature of the issue the test addresses.
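As a rough illustration of the split-half idea (not a prescription of any particular tool), the sketch below fits polynomial models of increasing degree on one half of a simulated dataset and keeps the degree that minimizes error on the held-out half. The data-generating process and the candidate degrees are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = x**2 + rng.normal(scale=2.0, size=x.size)   # noisy parabolic data, as before

perm = rng.permutation(x.size)                  # random split into two halves
train, valid = perm[: x.size // 2], perm[x.size // 2 :]

def validation_error(degree):
    """Mean squared error on the held-out half for a model built on the other half."""
    model = np.polynomial.Polynomial.fit(x[train], y[train], deg=degree)
    return np.mean((y[valid] - model(x[valid])) ** 2)

errors = {degree: validation_error(degree) for degree in range(1, 10)}
best_degree = min(errors, key=errors.get)
print("Validation error by degree:", {d: round(e, 2) for d, e in errors.items()})
print("Selected degree:", best_degree)          # a low degree is usually preferred here
```

Penalized approaches work in the same spirit, but trade goodness of fit against a complexity term instead of a held-out sample.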

Personal Bias, Confidence, and Incentives

As we have discussed, many predictions will fail due to the underlying construct of the

model: assumptions may be inappropriate, key pieces of context may be omitted, and models

may be overfitted. However, even if these problems are avoided, there is still a risk that the

prediction may fail due to the attitudes and behaviours of humans themselves. In fact, failure to

recognize our attitudes and behaviours as obstacles to better prediction can potentially increase

the odds of such failure. As Silver (2012) notes, “data driven predictions can succeed – and they

can fail. It is when we deny our role in the process that the odds of failure rise” (p. 9).

Again, the root of the problem is that all predictions involve exercising some degree of

human judgment, where each individual bases his/her judgments on their own subjective

knowledge, psychological characteristics, and even monetary incentives. It has been shown that,

rather than accepting results from statistical analysis at face value, applying individual judgmental adjustments produced “forecasts that were about 15% more accurate” (Silver, 2012, p. 198). For

example, a more cautious individual - or one with a lot at stake if their prediction is wrong – may

choose to believe the average (aggregate) prediction rather than the prediction of any one

individual forecaster. In fact, specialists in many different fields of study have observed the

tendency for group forecasts to outperform individual forecasts, and so choosing the aggregate

forecast may be a reasonable judgment in some cases. However, in other cases choosing the

aggregate prediction may hinder potential improvements to forecasts, as improvements to any

individual prediction will subsequently improve the group prediction as well.

Moreover, applying individual judgments to analyses introduces the potential for bias, as it

has been shown that people may construct their forecasts to cohere with their personal beliefs

and incentives. For instance, researchers have found that forecasts which are submitted anonymously outperform, in the long run, predictions which name their forecaster. The reason

for this trend lies in the fact that incentives change when people have to take responsibility for

their predictions: “if you work for a poorly known firm, it may be quite rational for you to make

some wild forecasts that will draw big attention when they happen to be right, even if they aren’t

going to be right very often” (Silver, 2012, p. 199). Effectively, individuals with lower

professional profiles have less to lose by declaring bolder or riskier predictions. However,

concerns with status and reputation distract from the primary goal of making the most precise

and accurate prediction possible.

Big Data: IT Challenge

The Big Data Stack

In an attempt to infer more accurate predictions, many experts are analyzing larger

volumes of data and are aiming for increasingly sophisticated modelling techniques. As the trend

towards the “Big Data” approach to knowledge and discovery grows, there is a new architectural

construct which requires development and through which data must travel. Often referred to as

the “Big Data stack”, this construct is made of several moving components which work together

to “comprise a holistic solution that’s fine-tuned for specialized, high-performance processing

and storage” (Davenport & Dyché, 2013, p. 29).

Legacy Systems and System Development

Developing a computer-based system requires a great deal of time and effort, and therefore

such systems tend to be designed for a long lifespan. For example, “much of the world’s air

traffic control still relies on software and operational processes that were originally developed in

the 1960s and 1970s” (Somerville, 2010, para. 1). These types of systems are called legacy

systems, and they combine dated hardware, software, and procedures in their operation.

Therefore, it is difficult and often impossible to alter methods of task execution as these methods

rely on the legacy software: “Changes to one part of the system inevitably involve changes to

other components” (Somerville, 2010, para. 2).

However, discarding these systems is often too expensive after only several years of

implementation, and so instead they are frequently modified to facilitate changes to business

environments. For example, additional compatibility layers may be regularly added as new tools

and software are often incompatible with the system. Clearly, the development of computer-based systems must be considered in juxtaposition with the evolution of their surrounding environment. Somerville (2010) notes that “changes to the environment lead to system change

that may then trigger further environmental changes” (p.235), in some cases resulting in a shift

of focus from innovation to maintaining current status.

Building the Big Data Stack

The advent of Big Data constitutes a major environmental change in terms of firms’

objectives, and it has necessitated considerable modifications and redesign of computer and

process systems. One such solution, the Big Data Stack, is well equipped to facilitate businesses’

continuous system innovations, as its configuration uses packaged software solutions that are

specifically fine-tuned to fit the variety of data formats. The composition and assembly of the

Stack is shown below:

[Figure: The Big Data Stack. Source: Davenport & Dyché (2013), International Institute for Analytics.]

Storage

The storage layer is the foundation of the edifice. Before data is collected, there must be

space for it to be recorded and held until it has been processed, distilled, and analyzed.

Previously available technologies offered limited space capacity, and storage devices with large

capacity were new commodities and therefore not cost-effective. As a result, the amount of data

that could be used in analysis was restricted right from the outset. However, disk technologies

are becoming increasingly efficient which is producing a subsequent cost decrease in the storage

of large and varied data sets, and increased storage capacity represents new possibilities to

collect larger amounts of data.

Platform Infrastructure

Data can move from the storage layer to the platform infrastructure, which is composed of

various functions which collaborate to achieve the high-performance processing that is

demanded in companies which utilize Big Data. Consisting of “capabilities to integrate, manage,

and apply sophisticated computational processing to the data” (Davenport & Dyché, 2013, p. 9),

the platform infrastructure is generally built on a Hadoop foundation. Hadoop foundations are

cost-effective, flexible, and fault tolerant software frameworks. Fundamentally, Hadoop enables

the processing of high volume data sets across collections of servers, and it can be created to

scale from an individual machine to a multitude of servers.

Offering high performance processing at a low price to performance ratio, Hadoop

foundations are both flexible and resilient as the software is able to detect and manage faults at

an early stage of the process.

Data

As previously discussed, Big Data is vast and structurally complex, and the data layer

combines elements such as Hadoop software structures with different types of databases for the

purpose of combining data retrieval mechanisms with pattern identification and data analysis.

This combination of databases is used to design Big Data strategies, and therefore the data layer

manages data quality, reconciliation, and security when formulating such schemes.

Application Code, Functions and Services

Big Data’s use differs with the underlying objectives of analysis, and each objective

necessitates its own unique data code which often takes considerable time to implement and

process. To address these issues, Hadoop employs a processing engine called MapReduce.

Using this engine, analysts can redistribute data across disks and at the same time perform

intricate computations and searches on the data. From these operations, new data structures and

datasets can then be formed using the results from computation. For example, Hadoop could apply MapReduce to sort through social media transactions, looking for words like “love”, “bought”, etc., and thereby establish a new dataset listing key customers and/or products (Davenport & Dyché, 2013, p. 11).
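The toy Python sketch below imitates, on a single machine, the map and reduce steps in the keyword example; it is not Hadoop’s own API, and the keywords and messages are invented for illustration.

```python
from collections import defaultdict

KEYWORDS = {"love", "bought"}   # words of interest from the example above

def map_phase(message):
    """Emit a (keyword, 1) pair for every keyword found in one message."""
    return [(word, 1) for word in message.lower().split() if word in KEYWORDS]

def reduce_phase(pairs):
    """Sum the counts emitted for each keyword."""
    totals = defaultdict(int)
    for keyword, count in pairs:
        totals[keyword] += count
    return dict(totals)

messages = [
    "Love this phone, bought it yesterday",
    "bought two more for the family",
    "returning mine tomorrow",
]

intermediate = [pair for message in messages for pair in map_phase(message)]
print(reduce_phase(intermediate))   # {'love': 1, 'bought': 2}
```

In a real Hadoop job, the map and reduce functions run in parallel across many servers and the intermediate pairs are shuffled between them; this single-machine version only shows the shape of the computation.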

Business View

Depending on the application of Big Data, additional processing may be necessary.

Between data and results, an intermediate stage may be required, often in the form of a statistical

model. This model can then be analysed to achieve results consistent with the original objective.

Therefore, the business view guarantees that Big Data is “more consumable by the tools and the

knowledge workers that already exist in an organization” (Davenport & Dyché, 2013, p. 11).

Presentation and Consumption

One particular distinguishing characteristic of Big Data is that it has adopted “data

visualisation” techniques. Traditional intelligence technologies and spreadsheets can be

cumbersome and difficult to navigate in a timely manner. However, data visualization tools

permit information to be viewed in the most efficient manner possible.

For example, information can be presented graphically to depict trends in the data, which

may lead to a faster gain in insight or give rise to further questions – thereby prompting further

testing and analysis. Many data visualization software are now so advanced that they are more

cost and time effective than traditional presentation systems. It is important to note though that

data visualizations become more complicated to read when we are dealing with multivariate

predictor models, as the visualization in these cases encompasses more than two dimensions.

Methods to address this challenge are in development, and there now exist some

visualization tools that select the most suitable and easy-to-read display given the form of the

data and the number of variables.

Big Data: Benefits and Value

Attracting attention from firms in all industries, Big Data offers many benefits to those

companies with the ability to harness its full potential. Firms using “small” and internally

assembled data derive all of the data’s worth from its primary use (the purpose for which the data

was initially collected). With Big Data, “data’s value shifts from its primary use towards its

potential future uses” (Mayer-Schonberger & Cukier, 2013, p.99) thus leading to considerable

increases in efficiency. Employing Big Data analytics allows firms to increase their innovative

capacity, and realize substantial cost and decision time reductions. In addition, Big Data

techniques can be applied to support internal business decisions by identifying complex

relationships within data. Despite these promising benefits, it is also important to recognize that

much of Big Data’s value is “largely predicated on the public’s continued willingness to give

data about themselves freely” (Brough, n.d., para. 11). Therefore, if such data were to be no

longer publicly available due to regulation etc., the value of Big Data would be significantly

diminished.

Unlocking Big Data’s Latent Value: Recycling Data

As the advances in Big Data take hold, the perceived intrinsic monetary value of data is

changing. In addition to supporting internal business decisions, data is increasingly considered to

be a good to be traded in its own right. Decreasing storage costs combined with the increased technical

capacity to collect data means that many companies are finding it easier to justify preserving the

data rather than discarding it when they have completed its primary processing and utilization.

Effectively, increased computational abilities in Big Data analytics have helped to facilitate data re-use. Data is now viewed as an intangible asset, and unlike material goods its value does not

diminish after a one-time use. Indeed, data can be processed multiple times – either in the same

way for the purpose of validation, or in a number of different ways to meet different goals and

objectives. After its initial use, the intrinsic value of data “still exists, but lies dormant, storing its

potential... until it is applied to a secondary use” (Mayer-Schonberger & Cukier, 2013, p. 104).

Implicit in this is the fact that even if the first several exploitations of the data generate little

value, there is still potential value in data which may eventually be realized.

Ultimately, the value of data is subject to the analyst’s abilities. Highly creative analysts

may think to employ the data in more diverse ways, and as such the sum of the value they extract

from the data’s iterative uses may be far greater than that extracted by another analyst with the

same dataset. For example, sometimes the latent value of data is only revealed when two

particular datasets are combined, as it is often hard to discern their worth by examining some

datasets on their own. In the age of Big Data, “the sum is more valuable than its parts, and when

we recombine the sums of multiple datasets together, that sum too is worth more than its

individual ingredients” (Mayer-Schonberger & Cukier, 2013, p. 108).

Product and Service Innovation

Competitive Advantage

Big Data analytics have enabled innovation in a broad range of products and services,

while reshaping business models and decision making processes alike. For example, advances in

Big Data storage and processing capabilities have facilitated the creation of online language

translation services. Advanced algorithms allow users to alert service providers whenever their

intentions are misunderstood by the systems. In effect, users are now integrated into the

innovation process, educating and refining systems as much as the system creators do. This not

only allows for vast cost and time reductions, but also offers a powerful competitive advantage to

companies who integrate these types of analytic processes.

For instance, a new provider of an online translation service may have trouble competing

with an established enterprise not only due to the lack of brand recognition, but also because

their competitors already have access to an immense quantity of data. The fact that so much of

the performance of established services, such as Google Translate, is the result of the consumer

data they have been incorporating over many years may constitute a significant barrier to entry

by others into the same markets. In other words, what is considered an advantage of Big Data

innovation for one firm (competitive advantage) is conversely a disadvantage for other firms

(barriers to entry).

Improved Models and Subsequent Cost Reductions

Many innovations are made possible due to the increased capacity of Big Data analytics

to identify complex relationships in data. For example, the accumulation of a greater volume of

observations has made it much easier to correctly discern non-linear relationships, allowing predictive and decision-making models to be refined and affording the analyst greater accuracy. For

instance, models used in fraud detection are becoming increasingly sophisticated, often allowing

anomalies to be detected in near real time and resulting in significant cost reductions. Some

estimate that Big Data will “drive innovations with an estimated value of GBP 24 billion... and

there are also big gains for government, with perhaps GBP 2 billion to be saved through fraud

detection” (Brough, n.d., para. 7).

In addition, innovations and subsequent cost reductions have been achieved through the

development of personalized customer preference technologies. The nature of Big Data allows

for greater insight into human behavioural patterns, as fewer data points are omitted from

analysis in an attempt to glean all possible value from each observation. Consequently, greater

attention to detail “makes it possible to offer products and services based on relevance to the

individual customer and in a specific context” (Brough, n.d., para. 3). Innovative services such

as Amazon’s personalized book recommendations allow firms to efficiently direct their

advertising campaigns to target those consumers perceived as more likely to be interested in their

products, and thereby reduce advertising and marketing costs. Therefore, much of the value of

Big Data comes not only from offering new products, but often from offering the same product

(Amazon is still offering books) in a more efficient way. In other words, value from Big Data

can also be enhanced when we do the same thing as before, but cheaper or more effectively using

advanced analytic models.

Improved Models and Subsequent Time Reductions

As well as offering significant cost reductions, Big Data processing techniques have

helped facilitate vast reductions in the time it takes to complete tasks. For instance, by employing

Big Data analytics and more sophisticated models, Macy’s was able to reduce the time taken to

optimize the pricing of its entire range of products from over 27 hours to approximately 1 hour, a reduction of roughly 96% (Davenport and Dyché, 2013). Not only does this substantial time

reduction afford Macy’s greater internal efficiency, but it also “makes it possible for Macy’s to

re-price items much more frequently to adapt to changing conditions in the retail marketplace”

(Davenport and Dyché, 2013, p. 5), thereby affording the company a greater competitive edge.

Firms can gain larger shares of their respective markets by being able to make faster decisions

and adapt to changing economic conditions faster than their rivals, making decision-time and

time-to-market reductions into a significant competitive advantage of Big Data over “small” data.

The power of Big Data analytics also affords companies greater opportunities to drive

customer loyalty through time reductions in interactions between firms and consumers. For

example, for firms utilizing small data, once a customer leaves their store/facility, they are

unable to sell/market to that person until that individual chooses to return to the store. However,

firms using Big Data technologies are able to exercise much more control over the marketing

process as they can interact with consumers whenever they wish, regardless of whether or not an

individual is specifically looking to buy from their company at any given moment. Many firms

now possess advanced technology that allows them to send e-mails, targeted offers, etc. to

customers and interact with them in real time, potentially impacting customer loyalty.

Big Data: Costs and Challenges

Notwithstanding its obvious benefits, Big Data potentially poses challenges with regard to privacy, and (operationally) in the determination of which data to include in the

models’ development process. As the Wall Street Journal notes: “in our rush to embrace the

possibilities of Big Data, we may be overlooking the challenges that Big Data poses – including

the way companies interpret the information, manage the politics of data and find the necessary

talent to make sense of the flood of new information” (Jordan, 2013, para. 2).

For every apparent benefit in using Big Data, there exists a potential challenge. For example, data re-use will not necessarily add further value if data loses utility over time. In

addition, increased enterprise efficiency due to reduced costs and decision time has to be

balanced against large investments that are required to develop Big Data infrastructure. Thus,

companies with large investments in Big Data technologies stand to lose their investment and

incur opportunity costs if Big Data does not help them realize their objectives more effectively.

Employing Big Data analytics requires careful cost/benefit analysis with decisions of when and

how to utilize Big Data being made according to the results.

Conceptual Issues: How to Measure the Value of Data

The market-place is still struggling to effectively quantify the value of data, and since

many companies today essentially consist of nothing but data (e.g. social media websites) it is

increasingly difficult to appraise the net value of firms. Consider Facebook: on May 18th 2012,

Facebook officially became a public company. Boasting an impressive status as the world’s

largest social network, on May 17th 2012 Facebook had been valued at $38 per share,

effectively setting it up to have the third largest technology initial public offering (IPO) in

history7. If all shares were to be floated, including monetizing those stock options held by

Facebook executives and employees, the company’s total worth was estimated at near $107

billion (Pepitone, 2012).

As is often the case with IPOs, stock prices soared by close to 13% within hours of the

company going public to reach a high of approximately $43. However, within that same day

stocks began to decline, and Facebook stocks closed at the end of the day at just $38.23. Worse

still, Madura (2015) notes that “three months after the IPO, Facebook’s stock price was about

$20 per share, or about 48% below the IPO open price. In other words, its market valuation

declined by about $50 billion in three months” (p. 259).
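A back-of-the-envelope check of these figures, assuming for simplicity that market capitalization moves in proportion to the share price, is given below; the prices and the $107 billion valuation are the approximate values quoted above.

```python
ipo_price = 38.0        # USD per share at the IPO
later_price = 20.0      # approximate price three months later
ipo_valuation = 107e9   # approximate total valuation at the IPO, in USD

decline = (ipo_price - later_price) / ipo_price
print(f"Share-price decline:       {decline:.0%}")   # ~47%, i.e. "about 48% below"
print(f"Implied lost market value: ${ipo_valuation * decline / 1e9:.0f} billion")
# roughly $51 billion, consistent with the $50 billion quoted above
```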

What was the explanation for such a drastic plunge? To explain it, we must first look at the

company’s valuation using standard accounting practices. In its financial statements for the year

2011, Facebook’s assets were estimated at $6.3 billion, where assets’ values accounted for

hardware and office equipment, etc. Financial statements also include valuations of intangible

assets such as goodwill, patents, and trademarks etc., and the relative magnitude of these

intangible assets as compared with physical assets is increasing. Indeed, “there is widespread

agreement that the current method of determining corporate worth, by looking at a company’s

‘book value’ (that is, mostly the worth of its cash and physical assets), no longer adequately

reflects the true value” (Mayer-Schonberger & Cukier, 2013, p. 118). Herein lies the reason for

the divergence between Facebook’s estimated market worth and its worth under accounting

criteria.

7 Visa has had the largest tech IPO to date, followed by auto maker General Motors (GM).

As mentioned previously, intangible assets are generally accepted as including goodwill

and strategy, but increasingly, for many data-intensive companies, raw data itself is also

considered an intangible asset. As data analytics have become increasingly more prominent in

business decision making, the potential value of a company’s data is increasingly taken into

account when estimating corporate net worth. Companies like Facebook contain data on millions

of users – Facebook now reports over 1.11 billion users – and as such each user represents a

monetized sum in the form of data.

In essence, the above example serves to illustrate that as of yet there is no clear way to

measure the value of data. As discussed previously, data’s value is contingent on its potential

worth from re-use and recombination, and there is no direct way to observe or even anticipate

what this worth may be. Therefore, while data’s value may now be increased exponentially as

firms and governments alike begin to realize its potential for re-use, exactly how to measure this

value is unclear.

Recycling Data: Does Data’s Value Diminish?

Previously, it was noted that advanced storage capacities combined with the decreasing

costs of storing data have provided strong incentives for companies to keep and reuse data for

purposes not originally foreseen, rather than discard it after its initial use. It does seem, however,

that the value which can be wrought from data re-use has its limits.

It is inevitable that most data loses some degree of value over time. In these cases

“continuing to rely on old data doesn’t just fail to add value; it actually destroys the value of

fresher data” (Mayer-Schonberger & Cukier, 2013, p.110). As the environment around us is

continually changing, newer data tends to outweigh older data in its predictive capacity. This

then raises the question: how much of the older data should be included in order to guarantee the

effectiveness of the experiment?

Consider again Amazon’s personalized book recommendations site. This service is only

representative of increased marketing efficiency if its recommendations adequately reflect the

individual consumer’s interests. A book a customer bought twenty years ago may no longer be

an accurate indicator of their interests, thereby suggesting that Amazon should perhaps exclude

older data from analysis. If this data is included, a customer may see recommendations related to that stale purchase, presume all recommendations are just as irrelevant, and subsequently fail to pay any attention to

Amazon’s recommendations service. If this is the case for just one customer, it may not pose

such a huge problem or constitute a large waste of resources. However, if many customers

perceive Amazon’s service as being of little worth, then Amazon are effectively wasting precious

money and resources marketing to customers who are not paying any attention. This example

serves to illustrate the clear motivation for companies to use information only so long as it

remains productive. The problem lies in knowing which data is no longer useful, and in

determining the point beyond which it begins to diminish the value of more recent data.

In fact, many companies (including Amazon) have now introduced advanced modelling

techniques to resolve these challenges. For example, Amazon can now keep track of what books

people look at, even if they do not purchase them. If a customer views books that were recommended on the basis of previous purchases, Amazon’s models interpret this as a sign that those previous purchases are still representative of the consumer’s current preferences. In this way, previous

purchases can now be ranked in order of their perceived relevance to customers, to further

advance the recommendations service. For instance, the system may interpret from your

purchase history that you value both cooking books and science fiction, but because you buy

cooking books only half as often as you buy science fiction they may ensure that the large

majority of their recommendations pertain to science fiction (the category which they believe to

be more representative of your interests). Knowing which data is relevant and for how long still

represents a significant obstacle for many companies. However, successful steps to resolve these

challenges can result in positive feedback in the form of improved services and sales.
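A hypothetical sketch of this kind of weighting is given below; it is not Amazon’s actual algorithm. Each past purchase contributes to the score of its category, with older purchases discounted exponentially so that stale data stops dominating the recommendations (the two-year half-life is an arbitrary assumption).

```python
from datetime import date

HALF_LIFE_YEARS = 2.0   # assumed rate at which old purchases lose relevance

def category_scores(purchases, today=None):
    """Recency-weighted score per category from a list of (category, purchase_date) pairs."""
    today = today or date.today()
    scores = {}
    for category, purchased_on in purchases:
        age_years = (today - purchased_on).days / 365.25
        weight = 0.5 ** (age_years / HALF_LIFE_YEARS)   # exponential decay with age
        scores[category] = scores.get(category, 0.0) + weight
    return scores

purchases = [
    ("science fiction", date(2013, 11, 2)),
    ("science fiction", date(2013, 5, 14)),
    ("cooking", date(2012, 7, 1)),
    ("cooking", date(1994, 3, 20)),   # a twenty-year-old purchase counts for very little
]

print(category_scores(purchases, today=date(2014, 4, 23)))
# Science fiction ends up with roughly three times the weight of cooking,
# so most recommendations would come from that category.
```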

Big Data and Implications for Privacy

Another issue central to the discussion of Big Data is its implications for peoples’

privacy. The rise of social media in recent years has resulted in a rapid increase in the amount of

unstructured online data, and many data driven companies are using this consumer data for

purposes that individuals are often unaware of. When consumers post or search online, their

online activities are being closely monitored and stored, often without their knowledge or

consent. Even when they do consent to have companies such as Amazon or Google keep records

of their consumer history, they often still have no awareness of many potential secondary

uses of this data. At the heart of Big Data privacy concerns are questions regarding data

ownership and use, and the future of Big Data is contingent upon the answers.

As explained previously, data’s value is now more dependent on its cumulative potential

uses than its initial use. Since it is unlikely that a single firm will be able to unlock all of the

latent value from a given dataset, in order to maximize Big Data’s value, many firms license the

use of accumulated data to third parties in exchange for royalties. In doing so, all parties have

incentive to maximize the value that can be extracted by means of re-using and recombining

data.

Threats to privacy result from companies which aggregate data, particularly personal information, on a massive scale, and from “data brokers” who realize that there is money to be

made in selling such information. For these firms, data is the raw material, and because they

compete on having more data to sell than their competitors, they have incentive to over collect

data. Firms which pay for this information include insurance companies and other corporations

which collect and create “profiles” of individuals in order to establish indicators such as credit

ratings, insurance tables, etc. Due to the large inherent biases in Big Data, it has, for instance,

been shown that these credit reports are often inaccurate, leading some experts to express

concerns that “people’s virtual selves could get them written off as undesirable, whether the

[consumer profile] is correct or not” (White, 2012). Such outcomes have been dubbed by some

as “discrimination by algorithm”. In other words, in Big Data solutions, “data may be used to

make determinations about individuals as if correlation were a reasonable proxy for causation”

(Big Data Privacy, 2013).

Cautiously Looking to the Future

As the Big Data movement continues to evolve, questions are emerging regarding its

limitations. Increasingly, Big Data technologies are facilitating the aggregation of ever larger

datasets, but it has yet to be determined whether “N” will ever equal “all”, thereby resolving the biases that accumulate within them. Furthermore, while it has been shown that employing Big Data

analytics can lead to improved efficiency and better predictions, it has not been “shown that the

benefit of increasing data size is unbounded” (Junqué de Fortuny, Martens, & Provost, 2013, p.

10). Questions still remain concerning whether, given the required scale of investment in data

infrastructure, the return on investment would be positive, and if it is, can it continue to increase

at the rate exceeding the costs of necessary upgrades in infrastructure?

Can N=All?

There is an increasing focus on collecting as much data as possible, and many specialists

are beginning to question whether it may eventually be possible to obtain a theoretically

complete, global dataset. In other words, can N=all?

Aiming for a comprehensive dataset necessitates advanced processing techniques and

storage capacity. In addition, forecasters must have the ability to create and analyze sophisticated

models to obtain meaningful results. Previously, each of these issues presented obstacles to the

progression of Big Data, but as new methods and procedures are developed, “increasingly, we

will aim to go for it all” (Mayer-Schonberger & Cukier, 2013, p. 31).

While striving for a dataset which approaches N=all may appear to be more feasible, it is

questionable whether one can ever obtain a dataset which is equivalent to N=all. For instance,

although it is hypothetically possible to “record and analyse every message on Twitter and use it

to draw conclusions about the public mood... Twitter users are not representative of the

population as a whole” (Harford, 2014). In this case, N=all is simply an illusion. We have N=all

in the sense that we have the entire set of data from Twitter, however the conclusions we are

drawing from this complete dataset pertain to a much broader population. Conclusions regarding

public mood relate to the global population, many of whom do not use Twitter. As discussed

earlier, Big Data is messy and involves many sources of systematic bias, and so while datasets

may sometimes appear to be comprehensive, we must always question exactly what (or who) is

missing from our datasets.

Can Big Data Defy the Law of Diminishing Marginal Returns?

Another important consideration is whether the on-going aggregation of larger datasets can

result in diminishing marginal returns to scale. The law of diminishing marginal returns holds

that “as the usage of one input increases, the quantities of other inputs being held fixed, a point

will be reached beyond which the marginal product of the variable input will decrease” (Besanko

& Braeutigam, 2011, p. 207). We have noted that in the Big Data movement, data has become a

factor of production in its own right. Consequently, it may be the case that after a certain number

of data points, the inclusion of additional data results in lower per-unit economic returns.

In fact, it has been shown that, past a certain level, some firms do experience diminishing

returns to scale when the volume of data is increased. For example, it has been observed that in

predictive modelling for Yahoo Movies, predictive performance increased with sample size, but at a progressively slower rate. There are several

reasons for this trend. First, “there is simply a maximum possible predictive performance due to

the inherent randomness in the data and the fact that accuracy can never be better than perfect”

(Junqué de Fortuny et al., 2013, p. 5). Second, predictive modelling exhibits a tendency to detect

larger and more significant correlations first, and as sample size increases the model begins to

detect more minor relationships that could not be seen with smaller samples (as smaller samples

lose granularity). Minor relationships rarely add value, and if not removed from modelling they

can result in overfitting.
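The simulated learning curve below illustrates the shape of this relationship. The accuracy ceiling and the saturating functional form are assumptions chosen for illustration; they are not the Yahoo Movies results themselves.

```python
# Each tenfold increase in sample size buys a smaller gain than the one before.
sample_sizes = [1_000, 10_000, 100_000, 1_000_000]
ceiling = 0.90   # assumed best achievable accuracy given the noise in the data

def simulated_accuracy(n):
    """A simple saturating curve: rises quickly at first, then flattens towards the ceiling."""
    return ceiling - (ceiling - 0.5) / (n ** 0.3)

previous = None
for n in sample_sizes:
    accuracy = simulated_accuracy(n)
    gain = "" if previous is None else f"  (gain: {accuracy - previous:+.3f})"
    print(f"n = {n:>9,d}: accuracy ~ {accuracy:.3f}{gain}")
    previous = accuracy
```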

It is important to note that techniques are not yet sophisticated enough to determine

whether decreasing returns are experienced by all firms in all industries. In addition, it remains to

be seen whether there exists a ceiling on returns. Further research and more advanced procedures

are needed to address these issues.

Final Remarks

Characterized by rapidly increasing volume, velocity, and variety of data, Big Data is

continuing to develop at an accelerated pace. Technological advances have facilitated an increase

in storage capacity and more sophisticated processing methods, allowing for the collection of

ever larger datasets in a multitude of formats.

The deployment of Big Data makes it possible for enterprises to look at data in novel ways,

thereby making it possible to identify new data patterns, allowing for different client and product

segmentation. New opportunities, however, come at the expense of estimates’ precision and

forecasting accuracy. The so-called “messiness” of Big Data, stemming from the introduction of

non-random errors from various data sets, potentially limits its application to areas where the

cost of models’ imprecision is low. Where the cost of imprecision is high, as it is, for instance, in

the field of medicine, the existent methodology of careful sampling and hypothesis testing would

appear to be more appropriate.

In sum, the Big Data movement represents new possibilities for innovation and increased

efficiency. However, it also presents a host of conceptual challenges, as well as practical

challenges with regards to implementation and analysis. As Big Data continues to evolve, it is

important that its costs and benefits are examined on an on-going basis, in order to determine the

appropriate circumstances and conditions for its deployment.

Reference List

Allain, R (n.d.). Random Error and Systematic Error [web document]. Retrieved from:

https://www2.southeastern.edu/Academics/Faculty/rallain/plab193/labinfo/Error_Analysis/

05_Random_vs_Systematic.html. Accessed: February 14, 2014.

Arthur, L. (2013, August). What is Big Data? Forbes Magazine. Retrieved from:

http://www.forbes.com/sites/lisaarthur/2013/08/15/what-is-big-data. Accessed: February

10, 2014.

Bennett, Scott. (1996). Public Affairs Research Methods. Lewiston, NY: Edwin Mellen Press.

Besanko, D., & Braeutigam, R.R. (2011). Microeconomics (4th ed.). Hoboken, NJ: John

Wiley & Sons.

Big Data: What it is and why it Matters. (n.d.). In Statistical Analysis System. Retrieved from:

http://www.sas.com/en_us/insights/big-data/what-is-big-data.html. Accessed: February 8,

2014.

Brough, G. (n.d). Big Data, Big Value... Huge Opportunity. In Statistical Analysis System.

Retrieved from: http://www.sas.com/en_us/insights/articles/big-data/big-data-big-value-

huge-opportunity.html. Accessed: February 8, 2014.

Davenport, T. H., & Dyché, J. (2013). Big Data in Big Companies: International Institute for

Analytics. In Statistical Analysis System. Retrieved from:

http://www.sas.com/resources/asset/Big-Data-in-Big-Companies.pdf. Accessed: February

12, 2014.

Dumbill, Edd. (2012). What is Big Data: An Introduction to the Big Data Landscape. In O’Reilly

Strata Conference. Retrieved from: http://strata.oreilly.com/2012/01/what-is-big-data.html. Accessed: February 9, 2014.

Eckerson, W. W. (2007). Predictive Analytics: Extending the Value of your Data Warehousing

Investment. In Statistical Analysis System. Retrieved from:

http://www.sas.com/events/cm/174390/assets/102892_0107.pdf. Accessed: February 15,

2014.

Exploring the Future Role of Technology in Protecting Privacy. (2013, June 19). In MIT Big

Data Initiative at CSAIL. Retrieved from:

http://bigdata.csail.mit.edu/sites/bigdata/files/u9/MITBigDataPrivacy_WKSHP_2013_finalvWEB.pdf. Accessed: March 24, 2014.

Harford, T. (2014, March 28). Big Data: Are We Making a Big Mistake? Financial Times

Magazine. Retrieved from: http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-

00144feabdc0.html#axzz2xUlFCa1x. Accessed: March 28, 2014.

Hawkins, D.M. (2003, October). The Problem of Overfitting. Journal of Chemical Information

and Computing Sciences, 44(1), 1-12. Retrieved from:

http://www.cbs.dtu.dk/courses/27618.chemo/overfitting.pdf. Accessed: March 4, 2014.

The Invention of Writing. (n.d.). Teacher Resource Center Ancient Mesopotamia. Retrieved

from: http://oi.uchicago.edu/OI/MUS/ED/TRC/MESO/writing.html. Accessed: February

20, 2014.

Jordan, J. (2013). The Risks of Big Data for Companies. Wall Street Journal. Retrieved from:

http://online.wsj.com/news/articles/SB10001424052702304526204579102941708296708.

Accessed: February 25, 2014.

Junqué de Fortuny, E., Martens, D., & Provost, F. (2013, December). Predictive Modelling with Big Data: Is

Bigger Really Better? Big Data. 1(4), 215-226. doi:10.1089/big.2013.0037.

Madura, J. (2015) [In press, currently available for online purchase]. Financial Markets and

Institutions (11th ed.) Mason, OH: Cengage Learning.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A.H. (2011,

June). Big Data: The Next Frontier for Innovation, Competition, and Productivity. In

McKinsey Global Institute. Retrieved from:

http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation. Accessed: March 4, 2014.

Mayer-Schonberger, V., & Cukier, K. (2013). Big Data: A Revolution that will transform how we

Live, Work, and Think. Boston, NY: Houghton Mifflin Harcourt.

Pepitone, J. (2012, May 17). Facebook’s IPO Price: $38 per Share. CNN Money. Retrieved from:

http://money.cnn.com/2012/05/17/technology/facebook-ipo-final-price. Accessed: March

27, 2014.

Riley, J.A. & Delic, K.A. (2010). Enterprise Knowledge Clouds: Applications and Solutions. In

B. Furht, & A. Escalante (Eds.), Handbook of Cloud Computing, 437-453. New York:

Springer.

Rosenblueth, A. & Wiener, N. (1945, October). The Role of Models in Science. Philosophy of

Science, 12(4), 316-321. Retrieved from:

http://www.csee.wvu.edu/~xinl/papers/role_model.PDF. Accessed: March 13, 2014.

Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail — but Some Don't.

New York, NY: The Penguin Group.

Somerville, I. (2010). Legacy Systems. Web Sections. Retrieved from:

http://www.softwareengineering-9.com. Accessed: March 6, 2014.

Somerville, I. (2010). Software Engineering (9th ed.). Addison Wesley.

Vig, J.R. (1992). Accuracy, Stability, and Precision. Introduction to Quartz Frequency

Standards. Retrieved from:

http://www.oscilent.com/esupport/TechSupport/ReviewPapers/IntroQuartz/vigaccur.html.

Accessed: February 27, 2014.

Waller, M. A., & Fawcett, S. E. (2013). Data Science, Predictive Analytics, and Big Data: A

Revolution that Will Transform Supply Chain Design and Management. Journal of

Business Logistics. 34(2), 77-84. doi: 10.1111/jbl.12010.

Ward, M. (2014, March 18). Crime Fighting with Big Data Weapons. BBC News. Retrieved

from: http://www.bbc.com/news/business-26520013. Accessed: March 25, 2014.

White, M.C. (2012, July 31). Big Data Knows What You’re Doing Right Now. Time Magazine.

Retrieved from: http://business.time.com/2012/07/31/big-data-knows-what-youre-doing-

right-now/. Accessed: March 25, 2014.

Yates, F (1981). Sampling Methods for Censuses and Surveys (4th ed.). London, England:

Charles Griffin & Company.