

Page 1: Big Data and Data Science Methods for Management Research

Academy of Management Journal 2016, Vol. 59, No. 5, 1493–1507. http://dx.doi.org/10.5465/amj.2016.4005

FROM THE EDITORS

BIG DATA AND DATA SCIENCE METHODS FOR MANAGEMENT RESEARCH

The recent advent of remote sensing, mobile technologies, novel transaction systems, and high-performance computing offers opportunities to understand trends, behaviors, and actions in a manner that has not been previously possible. Researchers can thus leverage “big data” that are generated from a plurality of sources, including mobile transactions, wearable technologies, social media, ambient networks, and business transactions. An earlier Academy of Management Journal (AMJ) editorial explored the potential implications for data science in management research and highlighted questions for management scholarship, as well as the attendant challenges of data sharing and privacy (George, Haas, & Pentland, 2014). This nascent field is evolving rapidly, and at a speed that leaves scholars and practitioners alike attempting to make sense of the emergent opportunities that big data hold. With the promise of big data come questions about the analytical value, and thus the relevance, of these data for theory development, including concerns over their context-specific relevance, reliability, and validity.

To address this challenge, data science is emerging as an interdisciplinary field that combines statistics, data mining, machine learning, and analytics to understand and explain how we can generate analytical insights and prediction models from structured and unstructured big data. Data science emphasizes the systematic study of the organization, properties, and analysis of data and their role in inference, including our confidence in the inference (Dhar, 2013). Although the two terms are often used interchangeably, “big data” refers to large and varied data that can be collected and managed, whereas “data science” develops models that capture, visualize, and analyze the underlying patterns in the data. In this editorial, we address both the collection and handling of big data and the analytical tools provided by data science for management scholars.

At the current time, practitioners suggest that data science applications tackle the three core elements of big data: volume, velocity, and variety (McAfee & Brynjolfsson, 2012; Zikopoulos & Eaton, 2011). “Volume” represents the sheer size of the dataset due to the aggregation of a large number of variables and an even larger set of observations for each variable. “Velocity” reflects the speed at which these data are collected and analyzed, whether in real time or near real time from sensors, sales transactions, social media posts, and sentiment data for breaking news and social trends. “Variety” in big data comes from the plurality of structured and unstructured data sources such as text, videos, networks, and graphics, among others. The combinations of volume, velocity, and variety reveal the complex task of generating knowledge from big data, which often run into millions of observations, and of deriving theoretical contributions from such data. In this editorial, we provide a primer, or “starter kit,” for potential data science applications in management research. We do so with a caveat: emerging fields quickly outdate and improve upon methodologies, often supplanting them with new applications. Nevertheless, this primer can guide management scholars who wish to use data science techniques to reach better answers to existing questions or to explore completely new research questions.

BIG DATA, DATA SCIENCE, AND MANAGEMENT THEORY

Big data and data science have potential as new tools for developing management theory, but, given the differences from existing data collection and analytical techniques to which scholars are socialized in doctoral training, adopting these new practices will take more effort and skill. The current model of management research is post hoc analysis, wherein scholars analyze data collected after the temporal occurrence of the event: a manuscript is drafted months or years after the original data are collected. Therefore, velocity, or the real-time applications important for management practice, is not a critical concern for management scholars in the current research paradigm. However, data volume and data variety hold potential for scholarly research. Particularly, these two elements of data science can be transposed as data scope and data granularity for management research.

We are grateful for the insightful comments of Kevin Boudreau, Avigdor Gal, John Hollenbeck, Mark Kennedy, and Michel Wedel on earlier versions. Gerry George gratefully acknowledges the financial and research support of the Lee Foundation.

Copyright of the Academy of Management, all rights reserved. Contents may not be copied, emailed, posted to a listserv, or otherwise transmitted without the copyright holder’s express written permission. Users may print, download, or email articles for individual use only.

Data Scope

Building on the notion of volume, “data scope” refers to the comprehensiveness of data by which a phenomenon can be examined. Scope could imply a wide range of variables, holistic populations rather than sampling, or numerous observations on each participant. By increasing the number of observations, a higher data scope can shift the analysis from samples to populations. Thus, instead of focusing on sample selection and biases, researchers could potentially collect data on the complete population. Within organizations, many employers collect data on their employees, and more data are being digitized and made accessible. This includes email communication, office entry and exit, radio-frequency identification tagging, wearable sociometric sensors, web browsers, and phone calls, which enable researchers to tap into large databases on employee behavior on a continuous basis. Researchers have begun to examine the utility and psychometric properties of such data collection methods, which is critical if they are to be incorporated into and tied to existing theories and literatures. For example, Chaffin and colleagues (2015) examined the feasibility of using wearable sociometric sensors, which use a Bluetooth sensor to measure physical proximity, an infrared detector to assess face-to-face positioning between actors, and a microphone to capture verbal activity, to detect structure within a social network. As another example, researchers have begun to analyze large samples of language (e.g., individuals’ posts on social media) as a non-obtrusive way to assess personality (Park et al., 2015). With changes in workplace design, communication patterns, and performance feedback mechanisms, we have called for research on how businesses are harnessing technologies and data to shape employee experience and talent management systems (Colbert, Yee, & George, 2016; Gruber, de Leon, George, & Thompson, 2015).

Data Granularity

Following the notion of variety, we refer to “data granularity” as the most theoretically proximal measurement of a phenomenon or unit of analysis.

Granularity implies direct measurement of constituent characteristics of a construct rather than distal inferences from data. For example, in a study of employee stress, granular data would include emotions captured through facial recognition patterns, or biometrics such as elevated heart rates during every minute on the job or task, rather than surveys or respondent interviews. In experience-sampling studies on well-being, for example, researchers have begun to incorporate portable blood pressure monitors. For instance, in a three-week experience-sampling study, Bono, Glomb, Shen, Kim, and Koch (2013) had employees wear ambulatory blood pressure monitors that recorded measurements every 30 minutes for two hours in the morning, afternoon, and evening. Similarly, Ilies, Dimotakis, and De Pater (2010) used blood pressure monitors in a field setting to record employees’ blood pressure at the end of each workday over a two-week period. Haas, Criscuolo, and George (2015) studied message posts and derived meaning in words to predict whether individuals were likely to contribute to problem solving and knowledge sharing across organizational units. Researchers in other areas could also increase granularity in other ways. In network analysis, for instance, researchers can monitor communication patterns across employees instead of retrospectively asking employees with whom they interact or from whom they seek advice. Equivalent data were earlier collected using surveys and indirect observation, but, with big data, the unit of analysis shifts from individual employees to messages and physical interactions. Though such efforts are already seen in smaller samples of emails or messages posted on a network (e.g., Haas et al., 2015), organization-wide efforts are likely to provide clearer and more holistic representations of networks, communications, friendships, advice giving and taking, and information flows (van Knippenberg, Dahlander, Haas, & George, 2015).

BETTER ANSWERS AND NEW QUESTIONS

Together, data scope and data granularity allow management scholars to develop new questions and new theories, and to potentially generate better answers to established questions. In Figure 1, we portray a stylistic model of how data scope and data granularity could productively inform management research.

Better Answers to Existing Questions

Data science techniques enable researchers to get more immediate and accurate results for testing existing theories. In doing so, we hope to get more accurate estimations of effect sizes and their contingencies. Over the past decade, management theories have begun emphasizing effect sizes. This emphasis on precision is typically observed in strategy research rather than in behavioral studies. With data science techniques, the precision of effect sizes and their confidence intervals will likely be higher, and can reveal nuances in moderating effects that have hitherto not been possible to identify or estimate effectively.
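As a concrete illustration of how larger samples tighten effect-size estimates, the sketch below computes Cohen's d with an approximate 95% confidence interval on simulated data; the function and the simulated groups are illustrative stand-ins, not drawn from any study discussed here:

```python
import math
import random

def cohens_d(a, b):
    """Standardized mean difference (Cohen's d) with a normal-approximation 95% CI."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))  # pooled SD
    d = (ma - mb) / sp
    se = math.sqrt((na + nb) / (na * nb) + d ** 2 / (2 * (na + nb)))  # approx. SE of d
    return d, (d - 1.96 * se, d + 1.96 * se)

random.seed(1)
treated = [random.gauss(0.5, 1.0) for _ in range(5000)]  # true effect of 0.5 SD
control = [random.gauss(0.0, 1.0) for _ in range(5000)]
d, (lo, hi) = cohens_d(treated, control)
print(round(d, 2), round(hi - lo, 2))  # with n = 5,000 per group, the CI is tight
```

With a few hundred observations per group instead of 5,000, the same code produces a confidence interval several times wider, which is the precision argument made above.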

Better answers could also come from establishing clearer causal mechanisms. For instance, network studies rely on surveys of informants to assess friendship and advice ties, but, in these studies, the temporal dimension is missing, and therefore it is difficult to determine whether network structure drives behavior or vice versa. Instead, collecting email communications or other forms of exchange on a continuous basis would enable researchers to measure networks and behavior dynamically, and thus assess cause and effect more systematically.

Although rare event modeling is uncommon in management research, data science techniques could potentially shed more light on, for example, organizational responses to disasters, modeling and estimating the probability of failure, at-risk behavior, and systemic resilience (van der Vegt, Essens, Wahlstrom, & George, 2015). Research on rare events can use car accident data, for instance, to analyze the role of driver experience in the seconds leading up to an accident and how previous behaviors could be modeled

to predict reaction times and responses. Insurance companies now routinely use such data to price insurance coverage, but this type of data could also be useful for modeling individual-level risk propensity, aggressiveness, or even avoidance behaviors. At an aggregate level, data science approaches such as collecting driver behavior using sensors to gauge actions like speeding and sudden stopping allow more than observing accidents, and therefore generate a better understanding of their occurrence. Such data allow cities to plan traffic flows, map road rage or accident hot spots, and avert congestion, and allow researchers to connect such data to timeliness at work and to the negative or positive effects of commuting sentiment on workplace behaviors.

Additionally, data science techniques such as monitoring call center calls can enable researchers to identify specific triggers of certain behaviors, as opposed to simply measuring those behaviors. This can help better understand phenomena such as employee attrition. Studying misbehavior is problematic due to the sensitivity, privacy, and availability of data. Yet, banks are now introducing tighter behavioral monitoring and compliance systems that are tracked in real time to predict and deter misbehaviors. Scholars already examine lawsuits, fraud, and collusion, but, by using data science techniques, they can search electronic communication or press data using keywords that characterize misbehavior in order to identify the likelihood of misbehavior before its occurrence. As these techniques become prevalent, it will be important to tie the new measures, and the constructs they purportedly assess, to existing theories and knowledge bases; otherwise, we risk the emergence of separate literatures using “big” and “little” data that have the capacity to inform each other.
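A keyword search of electronic communications of the kind described above can be sketched in a few lines; the lexicon below is hypothetical, and a real study would validate its terms against documented cases of misconduct rather than use an ad hoc list:

```python
import re

# Hypothetical keyword lexicon; terms here are illustrative only.
MISBEHAVIOR_TERMS = ["off the books", "shred", "don't tell compliance", "backdate"]

pattern = re.compile("|".join(re.escape(t) for t in MISBEHAVIOR_TERMS), re.IGNORECASE)

def flag_messages(messages):
    """Return (message_id, matched_terms) for messages containing lexicon terms."""
    flagged = []
    for msg_id, text in messages:
        hits = sorted(set(m.group(0).lower() for m in pattern.finditer(text)))
        if hits:
            flagged.append((msg_id, hits))
    return flagged

msgs = [
    (1, "Please backdate the contract and keep it off the books."),
    (2, "Quarterly numbers attached for review."),
]
print(flag_messages(msgs))  # only message 1 is flagged
```

Keyword flagging is only a first pass; as the text notes, such measures still need to be tied to validated constructs before they can support theory.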

New Questions

With higher scope and granularity of data, it becomes possible to explore new questions that have not been examined in the past. This could arise because data science allows us to introduce new constructs, but it could also arise because data science allows us to operationalize existing constructs in a novel way. Web scraping and sentiment data from social media posts are now being seen in the management literature, but they have yet to push scholars to ask new questions. Granular data with high scope could open questions in new areas of mobility and communications, physical space, and collaboration patterns, where we could delve deeper into causal mechanisms underlying collaboration and team dynamics, decision making and the physical environment, and workplace design and virtual collaborations. Tracking phone usage and physical proximity cues could provide insight into whether individuals spend too much time on communications technology, and into their attention allocation to social situations at work or at home. Studies suggest that time spent on email increases anger and conflict at work and at home (Butts, Becker, & Boswell, 2015). Such work could then be extended to physical and social contingencies, the nature of work, work performance outcomes, and their quality-of-life implications.

FIGURE 1
Scope and Granularity Implications for Management Theory
[Figure: a plot with Data Scope (low to high) on one axis and Data Granularity (low to high) on the other; higher scope and granularity open the way to “Better Answers” and “New Questions.”]

2016 1495 George, Osinga, Lavie, and Scott

Data on customer purchase decisions and social feedback mechanisms can be complemented with digital payments and transaction data to delve deeper into innovation and product adoption, as well as the behavioral dynamics of specific customer segments. The United Nations Global Pulse is harnessing data science for humanitarian action. Digital money and transactions through mobile platforms provide a window into social and financial inclusion, such as access to credit, energy and water purchase through phone credits, and transfer of money for goods and services, as well as create spending profiles, identify indebtedness or wealth accumulation, and promote entrepreneurship (Dodgson, Gann, Wladwsky-Berger, Sultan, & George, 2015). Data science applications allow the delivery and coordination of public services such as the treatment of disease outbreaks, coordination across grassroots agencies for emergency management, and provision of fundamental services such as energy and transport. Data on carbon emissions and mobility can be superimposed for tackling issues of climate change and optimizing transport services or traffic management systems. Such technological advances that promote social well-being can also raise new questions for scholars in identifying ways of organizing, and pose fundamentally new questions on organization design, social inclusion, and the delivery of services to disenfranchised communities.

New questions could emerge from existing theories. For example, once researchers can observe and analyze email communication or online search data, they can ask questions concerning the processes by which executives make decisions, as opposed to studying the individual and top management team characteristics that affect managerial decisions. There is room for using unstructured data such as video and graphic data, and face recognition for emotions. Together, these data could expand conversations beyond roles, experience, and homogeneity to political coalitions, public or corporate sentiment, decision dynamics, message framing, issue selling, negotiations, persuasion, and decision outcomes.

Text mining can be used when seeking to answer questions such as where ideas or innovations come from, as opposed to testing whether certain conditions generate groundbreaking innovations. This requires data mining of patent citations that can track the sources of knowledge embedded in a given patent and its relationships with the entire population of patents. In addition, analytics allow inference of meaning, rather than mere word co-occurrence, which could be helpful in understanding the cumulativeness, evolution, and emergence of ideas and knowledge.
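As a minimal illustration of turning text into numbers, the sketch below computes tf-idf weights in pure Python over toy stand-ins for patent abstracts. It is only a word-frequency baseline; inference of meaning, as discussed above, would call for topic models such as latent Dirichlet allocation (Blei, Ng, & Jordan, 2003):

```python
import math
from collections import Counter

def tfidf(docs):
    """Term frequency-inverse document frequency for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: in how many docs a term appears
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c / len(doc) * idf[t] for t, c in Counter(doc).items()} for doc in docs]

abstracts = [  # hypothetical stand-ins for patent abstracts
    "battery cell electrode lithium".split(),
    "battery charger circuit".split(),
    "antenna circuit signal".split(),
]
weights = tfidf(abstracts)
# "lithium" appears in only one abstract, so it outweighs the common "battery"
print(weights[0]["lithium"] > weights[0]["battery"])  # True
```

Distinctive terms receiving higher weight is exactly what lets such representations separate knowledge domains across a patent population.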

A new repertoire of capabilities is required for scholars to explore these questions and to handle the challenges posed by data scope and granularity. Data are now more easily available from corporations and “open data” warehouses such as the London DataStore. These data initiatives encourage citizens to access platforms and develop solutions using big data on public services, mobility, and geophysical mapping, among other data sources. Hence, as new data sources and analytics become available to researchers, the field of management can evolve by raising questions that have not received attention as a result of data access or analysis constraints.

BIG DATA AND DATA SCIENCE STARTER KIT FOR SCHOLARS

Despite offering exciting new opportunities to management scholars, data science can be intimidating and pose practical challenges. Below, we detail key challenges and provide solutions. We focus on five areas: (1) data collection, (2) data storage, (3) data processing, (4) data analysis, and (5) reporting and visualization, noting that, due to automation, the boundaries between these areas become increasingly blurry. Our aim is to provide a data science “quick start guide.” For detailed information, we refer to relevant references. In Table 1, we present a summary of the data science challenges and solutions.

Data Collection

Big data allow collection of data with high scope and granularity. Due to advances in technology, data collection methods are more often limited by the imagination of the researcher than by technological constraints. In fact, one of the key challenges is to “think outside the box” on how to establish access to detailed data for a large number of observations. Big data collection methods that help to overcome this challenge include sensors, web scraping, and web traffic and communications monitoring.

Using sensors, one can continuously gather large amounts of detailed data. For example, by asking employees to wear activity-tracking wristbands, data can be gathered on their movements, heart rate, and calories burned, and, as mentioned, wearable sensors can be used to collect data on physical proximity (Chaffin et al., 2015). Importantly, this data-gathering approach is relatively non-obtrusive and provides information about employees in natural settings. Moreover, this approach allows for the collection of

TABLE 1
Big Data Challenges and Solutions

Data access and collection
Challenges: easy access to data offered in standardized formats, with no practical limit to the size of these data, offering unlimited scalability; efficiently obtaining detailed data for a large number of agents; protocols on security, privacy, and data rights.
Solutions: sensors; web scraping; web traffic and communications monitoring.
Key references: Chaffin et al. (2015); Sismeiro and Bucklin (2004).

Data storage
Challenges: tools for data storage, matching, and integration of different big datasets; data reliability.
Solutions: warehousing; SQL, NoSQL, Apache Hadoop; saving essential information only and updating in real time.
Key references: Varian (2014); Prajapati (2013).

Data processing
Challenges: using non-numeric data for quantitative analyses.
Solutions: text-mining tools to transform text into numbers; emotion recognition.
Key references: Manning, Raghavan, and Schutze (2009); Teixeira, Wedel, and Pieters (2012).

Data analysis
Challenges: large number of variables; causality; finding latent topics and attaching meaning; data too large to process.
Solutions: ridge, lasso, principal components regression, partial least squares, regression trees; topic modeling, latent Dirichlet allocation, entropy-based measures, and deep learning; cross-validation and holdout samples; field experiments; parallelization, bags of little bootstraps, sequential analysis.
Key references: Hastie, Tibshirani, and Friedman (2009); George and McCulloch (1993); Archak, Ghose, and Ipeirotis (2011); Tirunillai and Tellis (2012); Blei, Ng, and Jordan (2003); LeCun, Bengio, and Hinton (2015); Lambrecht and Tucker (2013); Wang, Chen, Schifano, Wu, and Yan (2015); Wedel and Kannan (2016).

Reporting and visualization
Challenges: facilitating interpretation and representation with external partners and knowledge users; difficulty in understanding complex patterns.
Solutions: describing data sources; describing methods and specifications; Bayesian analysis; visualization and graphic interpretations.
Key references: Loughran and McDonald (2011); Simonsohn, Simmons, and Nelson (2015).



data over a long time period. In contrast, alternative methods for obtaining such data, such as diaries, may influence employees’ behavior, and it may be difficult to incentivize employees to keep a diary over a long period (cf. testing and mortality effects in consumer panels; see Aaker, Kumar, Leone, & Day, 2013: 110). Sensors can also be used to monitor the office environment, including office temperature, humidity, light, and noise, or the movements and gas consumption of vehicles, as illustrated by the monitoring of United Parcel Service of America (UPS) delivery trucks (UPS, 2016). Similarly, FedEx allows customers to follow not only the location of a package, but also its temperature, humidity, and light exposure (Business Roundtable, 2016). Of course, when using sensors to gather data on human subjects, one should stay within the bounds of privacy laws and research ethics codes. Although this may sound obvious, the ethical implications of collecting such data, coupled with issues such as data ownership, are likely to be complex, and thus call for revisiting formal guidelines for research procedures.

Web scraping allows for the automated extraction of large amounts of data from websites. Web scraping programs are widely available nowadays and often come free of charge, as in the case of plugins for popular web browsers such as Google Chrome. Alternatively, one may use packages that are available for programming languages; for example, the Beautiful Soup package for the Python programming language.1 Some websites, such as Twitter, offer so-called “application program interfaces” (APIs) to ease and streamline access to their content.2 Web scraping can be used to extract numeric data, such as product prices, but it may also be used to extract textual, audio, and video data, or data on the structure of social networks. Examples of textual data that can be scraped from the web include social media content generated by firms and individuals, news articles, and product reviews. Importantly, before scraping data from a website, one should carefully check the terms of use of the particular website and the API, if available. Not all websites allow the scraping of data. Lawsuits have been filed against companies that engage in web scraping activities (Vaughan, 2013), albeit mostly in cases where the scraped data were used commercially. In case of doubt about the legality of scraping a website’s data for academic use, it is best to contact the website directly to obtain explicit consent.
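A minimal scraping sketch using only Python’s standard library is shown below; the HTML snippet and the “price” class are hypothetical, and in practice one would fetch live pages (e.g., with urllib or through Beautiful Soup) subject to the site’s terms of use:

```python
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Collect the text of elements whose class attribute is 'price'."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

# In a real study, the HTML would come from the target website itself.
html = ('<ul><li class="price">$19.99</li>'
        '<li class="name">Widget</li>'
        '<li class="price">$4.50</li></ul>')
scraper = PriceScraper()
scraper.feed(html)
print(scraper.prices)  # ['$19.99', '$4.50']
```

Dedicated packages such as Beautiful Soup wrap this parsing logic in a more convenient API, but the underlying idea — walk the markup, keep the fields of interest — is the same.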

Moreover, most firms nowadays monitor traffic on their website, that is, the data generated by visitors to the firm’s webpages. Firms track aggregate metrics such as the total number of visitors and the most visited pages in a given time period (e.g., Wiesel, Pauwels, & Arts, 2011). In addition, researchers can access data on individual visitors, including the page from which they reached the focal firm’s website, the pages that they visited, and the order in which these were accessed. Additional data include the last page they visited before leaving the website, the time they spent on each webpage, the day and time at which they visited the website, and whether they had visited the website before. Data at the individual visitor level allow for a detailed analysis of how visitors, such as investors, suppliers, and consumers, navigate through a firm’s website, and thus for a better understanding of their information needs, their level of interaction with the focal firm, and their path to their transactions with the firm (e.g., Sismeiro & Bucklin, 2004; Xu, Duan, & Whinston, 2014). In addition to monitoring incoming traffic to their website, firms may also track outgoing traffic, that is, the webpages that are visited by their employees. Outgoing web traffic provides information about the external web-based resources used by employees and the amount of time spent browsing the Internet for leisure. Such information may be of use in studies on, for example, product innovation and employee well-being. These data can also shed light on attention allocation, employee engagement, or voluntary turnover.
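Reconstructing individual visitors’ page paths and dwell times from a clickstream log can be sketched as follows; the log records and field layout are hypothetical:

```python
from collections import defaultdict

# Hypothetical clickstream log: (visitor_id, timestamp_in_seconds, page)
log = [
    ("v1", 0, "/home"), ("v1", 40, "/products"), ("v1", 95, "/checkout"),
    ("v2", 10, "/home"), ("v2", 300, "/careers"),
]

def visitor_paths(log):
    """Order each visitor's page views and compute dwell time between clicks."""
    by_visitor = defaultdict(list)
    for visitor, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
        by_visitor[visitor].append((ts, page))
    paths = {}
    for visitor, views in by_visitor.items():
        path = [page for _, page in views]
        dwell = [t2 - t1 for (t1, _), (t2, _) in zip(views, views[1:])]
        paths[visitor] = (path, dwell)
    return paths

print(visitor_paths(log)["v1"])  # (['/home', '/products', '/checkout'], [40, 55])
```

From such per-visitor paths, one can then aggregate to the navigation and engagement measures described above.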

Data on communications between employees can be gathered as well. For example, phone calls and email traffic are easily monitored. Such data sources may help to uncover intra-firm networks and measure communication flows between organizational divisions, branches, and units at various hierarchy levels. Monitoring external and internal web, phone, and email traffic is a non-obtrusive data-gathering approach, in that users are not aware that the traffic they generate is being monitored. With regard to the monitoring of web traffic and other communications, it is important to strictly adhere to privacy laws and research ethics codes that protect employees, consumers, and other agents.
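Recovering a communication network from email traffic can be sketched as below; the sender–recipient pairs are hypothetical, and note that only metadata, not message content, is needed to map the network:

```python
from collections import Counter

# Hypothetical email metadata as (sender, recipient) pairs.
emails = [
    ("ana", "ben"), ("ana", "ben"), ("ben", "ana"),
    ("ana", "chen"), ("chen", "ben"),
]

def communication_network(emails):
    """Directed edge weights: number of messages sent between each ordered pair."""
    return Counter(emails)

def out_degree(network, person):
    """Total messages sent by one employee, a simple centrality proxy."""
    return sum(w for (sender, _), w in network.items() if sender == person)

net = communication_network(emails)
print(net[("ana", "ben")], out_degree(net, "ana"))  # 2 3
```

For real analyses, the weighted edge list produced here can be loaded into a dedicated network library to compute the structural measures network scholars rely on.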

The above methods allow for large-scale, auto-mated, and continuous data collection. Importantly,these features significantly ease the execution offield experiments. In a randomized field experiment,a randomly selected group of agents—for example,employees, suppliers, customers, etc.—is subjected

1 The Beautiful Soup package can be obtained fromhttps://www.crummy.com/software/BeautifulSoup/.

2 The Twitter API can be found on https://dev.twitter.com/rest/public.

1498 OctoberAcademy of Management Journal

to a different policy than the control group. If the behavior of the agents in both groups is continuously monitored, the effect of the policy can be easily tested (e.g., Lambrecht & Tucker, 2013). Other experimental setups are possible as well. For example, if random selection of agents is difficult, a quasi-experimental approach can be adopted (Aaker et al., 2013: 288) in which all agents are subjected to a policy change and where temporal variation in the data can be used to assess the effect of this change. In doing so, it is important to rule out alternative explanations for the observed temporal variation in the data. Again, privacy laws and research ethics codes should be strictly obeyed. For example, to protect the privacy of customers, Lewis and Reiley (2014) used a third party to match online browsing and offline purchase data.

Data Storage

Big data typically require large storage capacity. Often, the required storage capacity exceeds that found in regular desktop and laptop computers, because units of interest are observed at a very granular level and in high detail, by pulling data from various sources. Fortunately, several solutions are available. First, one can tailor the storage approach to the size of the data. Second, one can continuously update and store variables of interest while discarding information that is irrelevant to the study.

How to store data depends on their size. For relatively small datasets, no dedicated storage solutions are required. Excel, SAS, SPSS, and Stata software suites can hold data for many subjects and variables. For example, Excel can hold 1,048,576 rows and 16,384 columns (Microsoft, 2016). Larger datasets require another storage approach. As a first option, a relational database using structured query language (SQL) can be considered. Relational databases store data in multiple tables that can be easily linked with each other. Open source relational databases are available, for example, MySQL and PostgreSQL. Basic knowledge of SQL is required to use relational databases. For yet larger databases, a NoSQL approach may be required, especially when fast processing of the data is necessary (Madden, 2012). NoSQL stands for “not only SQL” (e.g., Varian, 2014) to reflect that NoSQL database solutions do support SQL-like syntax. Examples include the open source Apache Cassandra system, initially developed by Facebook, and Google’s Cloud Bigtable, which offers a pay-as-you-use cloud solution. To effectively store data that are too large to hold on a single computer,

many firms employ Apache Hadoop, which allows the allocation of data across multiple computers. To build such infrastructure, self-study (see, e.g., Prajapati, 2013) or collaboration with information systems experts is required.
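
As a small illustration of the relational approach, the following sketch uses SQLite, which ships with Python. The schema (visitors linked to their page views) and the data are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Two linked tables: visitors and their page views
cur.execute("CREATE TABLE visitors (id INTEGER PRIMARY KEY, country TEXT)")
cur.execute("CREATE TABLE pageviews (visitor_id INTEGER, page TEXT, "
            "FOREIGN KEY(visitor_id) REFERENCES visitors(id))")
cur.executemany("INSERT INTO visitors VALUES (?, ?)",
                [(1, "US"), (2, "SG")])
cur.executemany("INSERT INTO pageviews VALUES (?, ?)",
                [(1, "/home"), (1, "/products"), (2, "/home")])

# Link the tables with a join: page views per visitor country
cur.execute("""SELECT v.country, COUNT(*)
               FROM pageviews p JOIN visitors v ON p.visitor_id = v.id
               GROUP BY v.country ORDER BY v.country""")
rows = cur.fetchall()
print(rows)  # [('SG', 1), ('US', 2)]
```

A production setup would use a server-based system such as MySQL or PostgreSQL, but the SQL itself carries over largely unchanged.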

When the research question and data requirements have been clearly defined, it may not be necessary to store all available data. Often, it is possible to continuously update variables of interest as new data come in and then to save only the updated variables and not the new information itself. For example, a sensor may record every single heartbeat of a subject. Typically, we would not be interested in the full series of heartbeat timestamps for each subject but instead would like to know, for instance, a subject’s average hourly heart rate. We thus only need to store one hour of heartbeat timestamps for each subject, after which we can create, for each subject, one hourly observation. We can now discard the underlying timestamp data and start recording the next hour of heartbeat timestamps. Particularly when the number of sensors and number of subjects grow large, this approach provides an enormous reduction in storage requirements. Similarly, when interested in consumers’ daily behavior on a focal firm’s website, storage of consumers’ full web traffic data is not required.
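
The heartbeat example can be sketched as a streaming computation that never holds more than one hour of raw timestamps per subject. The simulated beat stream and the beats-per-hour summary are illustrative choices:

```python
def hourly_rates(timestamps):
    """Collapse a stream of beat timestamps (in seconds) into
    beats-per-hour counts, never holding more than one hour of data."""
    rates = []
    buffer = []
    hour_start = 0
    for t in timestamps:
        while t >= hour_start + 3600:  # hour boundary passed:
            rates.append(len(buffer))  # store the hourly summary ...
            buffer = []                # ... and discard the raw data
            hour_start += 3600
        buffer.append(t)
    rates.append(len(buffer))          # flush the last (partial) hour
    return rates

# Simulated beats: one every 0.8 seconds for two hours
beats = [i * 0.8 for i in range(9000)]
print(hourly_rates(beats))  # [4500, 4500]
```

The buffer holds at most one hour of timestamps regardless of how long the stream runs, which is the point of the update-and-discard strategy.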

In field experiments, situations may occur in which group-level instead of subject-level information suffices. For example, one may be interested in the effect of a policy change on whether employees are more likely to adopt a new online self-evaluation tool, where we assume that adoption is a binary variable, that is, either yes or no. Adoption of the online tool can be measured by monitoring employees’ web traffic. It is adequate to know the number of adopters in the control and experimental groups and the total number of subjects in each group. Based on these numbers, and the assumption that the dependent variable is binary, we know the exact empirical distribution of adoption. We can thus easily perform significance tests. It must be noted that group-level information does not suffice when adoption is measured on an interval or ratio scale: to perform significance tests, subject-level data would be required.
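
For instance, a two-proportion z-test can be computed directly from the four group-level numbers; the adoption counts below are hypothetical:

```python
import math

def two_proportion_z(adopt_a, n_a, adopt_b, n_b):
    """z-statistic for the difference between two adoption proportions,
    using the pooled proportion for the standard error."""
    p_a, p_b = adopt_a / n_a, adopt_b / n_b
    pooled = (adopt_a + adopt_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: 30/100 adopters in the control group,
# 45/100 in the group exposed to the policy change
z = two_proportion_z(30, 100, 45, 100)
print(round(z, 2))  # 2.19, beyond the two-sided 5% critical value of 1.96
```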

Clearly, one can only resort to the second approach, that is, updating variables of interest when new data become available while discarding the new data itself, if the variables of interest are known. Moreover, an important caveat of this approach is that it will not be possible to retrieve the

2016 1499 George, Osinga, Lavie, and Scott

more granular raw data that were discarded to reduce storage requirements.

Data Processing

Variety, one of big data’s three Vs, implies that scholars are likely to encounter new and different types of data that may be non-numeric. An important new source of data is textual data. For example, studies of social media, email conversations, annual report sections, and product reviews require methods for handling textual data (e.g., Archak et al., 2011; Loughran & McDonald, 2011). Textual data can be used both for theory testing (e.g., we could test the hypothesis that positive firm news makes employees expect a higher annual bonus) and theory development (e.g., we could explore which words in the management section of annual reports are associated with a positive or negative investor response, and use these results to develop new theory). Before being able to include non-numeric data in quantitative analyses, these data first need to be processed.

In the case of theory testing, we have a clear idea about the information that needs to be extracted from texts. For example, to determine whether news is positive, we could ask independent raters to manually evaluate all news items and to provide a numeric rating. When the number of news items grows substantially, a few independent raters would not be able to complete the task in an acceptable amount of time. One can then scale up the “workforce” by relying on Amazon Mechanical Turk, through which participants complete tasks for a fixed fee (Archak et al., 2011). Another solution is to automate the rating task. To do so, we typically first remove punctuation marks (Manning et al., 2009) from the news items and convert all letters to lowercase. Words can then be identified as groups of characters separated by spaces. Removal of punctuation marks avoids commas, semicolons, and so on being viewed as part of a word. Conversion of characters to lowercase ensures that identical words are treated as such. Manning and coworkers (2009: 30) also noted that information is lost by converting words to lowercase, such as the distinction between Bush, the former U.S. presidents or British rock band, and bush, a plant or area of land. After these initial steps, a program can be written to determine the number of positive words in each news item, where the collection of positive words needs to be predefined. It is also possible to develop measures based on the number of relevant

words that appear within a certain distance from a focal word, such as a firm’s name or the name of its CEO.
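
The automated rating steps described above (remove punctuation, lowercase, split on spaces, count predefined positive words) can be sketched as follows; the tiny positive-word lexicon is purely illustrative:

```python
import string

POSITIVE = {"growth", "record", "strong", "improved"}  # illustrative lexicon

def positive_word_count(text):
    """Strip punctuation, lowercase, split on spaces, count lexicon hits."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    words = cleaned.split()
    return sum(1 for w in words if w in POSITIVE)

item = "Strong quarter: record revenues, improved margins."
print(positive_word_count(item))  # 3
```

In practice one would use a validated sentiment lexicon and, as noted below, correct for document length and negation.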

Several toolboxes are available to help process textual data, for example, NLTK for Python and tm for R. Corrections for the total number of words in a news item can be made to reflect that a longer text contains more positive words than a shorter text, even when the texts are equally positive (for a discussion on document length normalization, see Manning et al., 2009: 129). Also, we can refine the program by making it look at groups of words instead of single words so that it picks up on fragments such as “the financial outlook for firm A is not very positive,” which does not reflect positive news despite the use of the word “positive” (Das & Chen, 2007).

In theory development, we would be interested in exploring the data. For ease of exposition, we focus on the setting where we want to assess which words in Management Discussion and Analysis (MD&A) sections are associated with a variable of interest. To this end, for each MD&A section, we would first need to indicate the number of times a word occurs. This task is virtually impossible to perform manually. Hence, we would typically automate this process. Again, we would remove punctuation marks and convert all text to lowercase. Typically, we would also remove non-textual characters such as numbers, as well as small, frequently occurring words such as “and” and “the,” as these are non-informative (Manning et al., 2009). In addition, words may be stemmed and infrequent words may be removed to reduce the influence of outliers (Tirunillai & Tellis, 2012). As a basic approach, we can determine the set of unique words across all MD&A sections and, for each MD&A, count the number of times each unique word occurs. The resulting data would appear as follows: each row represents a single MD&A, each column represents a single unique word, and the cells indicate the number of times a word occurs in the particular MD&A. Again, a correction for the length of the MD&A section can be applied. The data would contain many 0s, that is, whenever a particular word is not used in the MD&A. To reduce the storage size of the data, a sparse matrix can be used. To explore which of the many words are associated with the variable of interest, we can make use of the techniques for variable selection that we discuss below.
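
A minimal sketch of this pipeline, storing only the nonzero cells of the document-term matrix (the stop list and the two MD&A fragments are invented for the example):

```python
import string
from collections import Counter

STOPWORDS = {"and", "the", "of", "a", "in"}  # illustrative stop list

def term_counts(doc):
    """Lowercase, strip punctuation and digits, drop stopwords, count words."""
    drop = string.punctuation + string.digits
    words = doc.translate(str.maketrans("", "", drop)).lower().split()
    return Counter(w for w in words if w not in STOPWORDS)

# Two hypothetical MD&A fragments
docs = ["Revenues increased in 2015 and margins improved.",
        "The decline in revenues reflects competitive pressure."]

# Sparse document-term representation: one Counter per document,
# holding only the nonzero cells of the document-term matrix
dtm = [term_counts(d) for d in docs]
vocabulary = sorted(set().union(*dtm))

print(dtm[0]["revenues"], dtm[1]["revenues"])  # 1 1
print(dtm[0]["decline"])                       # 0 (absent cells are not stored)
```

Each `Counter` plays the role of one sparse row; a zero cell costs no memory because it is simply never stored.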

Other sources of non-numeric data include audio, images, and video. Many new and interesting techniques exist for extracting numeric information from such data. For example, image and video

data can be used to determine a person’s emotions, which may be expressed using numeric scales (Teixeira et al., 2012).

Data Analysis

After deciding on the data collection method, and having successfully stored and processed the data, a new challenge arises: how to analyze the data? We may encounter one or both of the following scenarios: (a) a (very) large number of potential explanatory variables are available and the aim is to develop theory based on a data-driven selection of the variables; (b) the data are too large to be processed by conventional personal computers. Below, we present solutions to overcome these challenges.

Variable selection. Big data may contain a (very) large number of variables. It is not uncommon to observe hundreds or thousands of variables. In theory testing, the variables of interest, and possible control variables, are known. In exploratory studies, where the aim is to develop new theory based on empirical results, we need statistical methods to assist in variable selection. Inclusion of all variables in a model, such as a regression model, is typically impeded by high multicollinearity (Hastie et al., 2009). One could try to estimate models using all possible combinations of explanatory variables and then compare model fit using, for example, a likelihood-based criterion. It is easy to see that the number of combinations of the explanatory variables explodes with the number of variables, rendering this approach unfeasible in practice. Methods that can be used in the situation of a large number of (potential) explanatory variables include ridge and lasso regression, principal components regression, partial least squares, Bayesian variable selection, and regression trees (e.g., Varian, 2014).

Ridge and lasso regression are types of penalized regression. Standard regression models are estimated by minimizing the sum of the squared differences between the observed ($y_i$) and predicted ($\hat{y}_i$) values for observation $i$; that is, we use the coefficients that minimize $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ or, substituting $x_i b$ for $\hat{y}_i$, $\sum_{i=1}^{n}(y_i - x_i b)^2$, where $x_i$ is the vector of independent variables for observation $i$, $b$ is the regression coefficient vector, and $n$ indicates the number of observations. In penalized regression, we do not minimize the sum of the squared differences between the observed and predicted values (that is, the residuals) alone but, instead, minimize the sum of the squared residuals plus a penalty term. The difference between ridge and lasso regression lies in the penalty that is added to the squared residuals. The penalty term in ridge regression is $\lambda \sum_{j=1}^{k} b_j^2$, where $k$ is the number of slope coefficients, and, in lasso regression, we use $\lambda \sum_{j=1}^{k} |b_j|$ as a penalty term, where $|b_j|$ is the absolute value of $b_j$ (Hastie et al., 2009). The regression coefficients given by ridge regression are thus obtained by minimizing $\sum_{i=1}^{n}(y_i - x_i b)^2 + \lambda \sum_{j=1}^{k} b_j^2$. The expression for lasso regression is obtained by simply replacing the penalty term in this expression. Ridge and lasso regression may also be combined into what is referred to as “elastic net regression” (Varian, 2014). It must be noted that the intercept, $b_0$, is not included in the penalty term. To remove the intercept from the model altogether, without affecting the slope coefficients, one can center the dependent and independent variables around their means. Moreover, researchers typically standardize all variables before estimation, because the coefficient size, and thus also the penalty term, depends on the scaling of the variables (Hastie et al., 2009). It is easy to see that by setting $\lambda$ to 0, one obtains the least squares estimator. By increasing $\lambda$, the penalty term grows in importance, resulting in larger shrinkage of the coefficients until all coefficients are shrunk to 0. We return to the choice of $\lambda$ below when we discuss cross-validation.
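
For a centered model without an intercept, the ridge objective has the well-known closed-form solution $b = (X'X + \lambda I)^{-1} X' y$, which makes the shrinkage easy to demonstrate. The following is a sketch on simulated data, not a full implementation:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge coefficients: minimize ||y - Xb||^2 + lam * ||b||^2."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))  # standardized, mean-zero predictors
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.standard_normal(200)

b_ols = ridge(X, y, lam=0.0)     # lam = 0 recovers least squares
b_ridge = ridge(X, y, lam=50.0)  # a positive lam shrinks the coefficients

print(np.sum(b_ols**2) > np.sum(b_ridge**2))  # True: ridge shrinks
```

Lasso has no closed form and is instead solved by iterative methods such as coordinate descent.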

Principal components regression offers another approach to handling a large number of independent variables. In this approach, we first extract $l$ principal components from the independent variables, $X$, and then regress the dependent variable $y$ on the $l$ principal components. The principal components, $Z$, are linear combinations of the independent variables; that is, the $m$th principal component is given by $z_m = X v_m$, where $v_m$ contains the loadings for principal component $m$. Typically, the independent variables are first standardized (Hastie et al., 2009). The regression model that is estimated after extraction of the principal components is $y = \theta_0 + \sum_{m=1}^{l} \theta_m z_m$; that is, the principal components serve as independent variables. Instead of shrinking the coefficients for all variables, as in penalized regression, principal components regression handles the large number of variables by combining correlated variables in one principal component and by discarding variables that do not load onto the first $l$ principal components. Below, we return to the choice of $l$, the number of principal components to extract.
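
A sketch of principal components regression on simulated data, extracting the components via the singular value decomposition; the data-generating process (three predictors driven by one latent factor) is invented for the example:

```python
import numpy as np

def pcr_fit(X, y, l):
    """Principal components regression: extract l components z_m = X v_m,
    then regress y on the components (plus an intercept)."""
    Xs = (X - X.mean(0)) / X.std(0)       # standardize the predictors
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    Z = Xs @ Vt[:l].T                     # scores on the first l components
    Z1 = np.column_stack([np.ones(len(y)), Z])
    theta = np.linalg.lstsq(Z1, y, rcond=None)[0]
    return Z1 @ theta                     # fitted values

rng = np.random.default_rng(1)
latent = rng.standard_normal(100)
# Three predictors that all reflect a single latent factor
X = np.column_stack([latent + 0.1 * rng.standard_normal(100) for _ in range(3)])
y = 3 * latent + 0.1 * rng.standard_normal(100)

fit = pcr_fit(X, y, l=1)                  # one component suffices here
r2 = 1 - np.sum((y - fit)**2) / np.sum((y - y.mean())**2)
print(r2 > 0.95)  # True: the single component captures the latent driver
```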

Partial least squares regression provides yet another approach to dealing with a large number of variables. As in principal components regression, the dependent variable is regressed on newly constructed

“components” formed from the independent variables. An important difference is that principal components regression only takes the independent variables into account in constructing these components, whereas partial least squares uses both the independent and dependent variables. More specifically, in principal components regression, components are formed based on the correlations between the independent variables, whereas, in partial least squares, the level of association between independent and dependent variables serves as input. Several algorithms for implementing partial least squares exist, such as NIPALS and SIMPLS. We refer interested readers to Frank and Friedman (1993) for a comparison of partial least squares and principal components regression. As in principal components regression, a decision needs to be made on the number of components to be used in the final solution. Below, we return to this issue.

Variables may also be selected using a Bayesian approach. To this end, George and McCulloch (1993) introduced Bayesian regression with a latent variable, $\gamma_j$, that indicates whether variable $j$ is selected. This latent variable takes the value 1 with probability $p_j$ and the value 0 with probability $1 - p_j$. The coefficient for variable $j$, $b_j$, is distributed, conditional on $\gamma_j$, as $(1 - \gamma_j)N(0, \tau_j^2) + \gamma_j N(0, c_j^2 \tau_j^2)$, where $N$ denotes the normal distribution. By setting $\tau_j$ to a very small, strictly positive number and $c_j$ to a large number (larger than 1), we obtain the following interpretation: if $\gamma_j$ is 0, the variable is multiplied by a near-zero coefficient $b_j$ (basically not selecting variable $x_j$); if $\gamma_j$ is 1, $b_j$ will likely be non-zero, thus selecting variable $x_j$. For more detail on this method and how to decide on the values for $\tau_j$ and $c_j$, we refer to George and McCulloch (1993).

An interesting method that may assist in variable selection is the regression tree. Regression trees indicate to what extent the average value of the dependent variable differs for those observations for which the values of an independent variable lie above or below a certain value; for example, the average employee productivity for those employees who are 40 years or older versus those who are younger. Within these two age groups, the same or another variable is used to create subgroups. In every round, the independent variable and cutoff value are chosen that provide the best fit. The resulting tree, that is, the overview of selected variables and their cutoff values, indicates which variables are most important in explaining the dependent variable. Also, this method may reveal nonlinear effects of independent variables (Varian, 2014). For example, when age is

used twice to explain employee productivity, the results may reveal that the most productive employees are younger than 30, or 40 years or older, with those between ages 30 and 40 being the least productive. It is important to note that the results from a regression tree, within which independent variables are dichotomized, do not necessarily generalize to a regression method with continuous independent variables. Several interesting techniques for improving the performance of regression trees exist, such as bagging, boosting, and random forests (Hastie et al., 2009: chap. 15; Varian, 2014). Several other methods for variable selection are available, including stepwise and stagewise regression.
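
The core step of a regression tree, choosing the cutoff that best splits the dependent variable, can be written from scratch; the age and productivity figures below are made up for illustration:

```python
def best_split(x, y):
    """Find the cutoff on x that best separates y into two groups,
    judged by the within-group sum of squared errors."""
    def sse(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    best = None
    for cut in sorted(set(x))[1:]:                  # candidate cutoffs
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        fit = sse(left) + sse(right)
        if best is None or fit < best[1]:
            best = (cut, fit)
    return best[0]

# Made-up data: employee age vs. productivity, lowest in the 30s
age = [25, 28, 33, 36, 38, 44, 52, 60]
prod = [80, 82, 55, 50, 52, 78, 81, 79]
print(best_split(age, prod))  # 44: the first split separates the older group
```

A full tree repeats this search within each resulting subgroup; here a second split below age 44 would isolate the low-productivity 30s.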

When exploring textual data to determine which of the many words in texts are associated with the variable of interest, one can make use of the techniques described above, treating each word as a single variable. When the data are too sparse and too high dimensional, more advanced approaches are required. Topic models, such as latent Dirichlet allocation, may be used to extract latent topics from the MD&A sections (Blei, Ng, & Jordan, 2003), thus reducing the dimensionality of the data. The latent topics can be labeled using entropy-based measures (Tirunillai & Tellis, 2012) and then used in subsequent analyses. Newly introduced approaches such as deep learning (LeCun et al., 2015) can also help overcome the shortcomings of latent Dirichlet allocation (Liu, 2015: 169–171).

Tuning variable selection models. We now return to the question of how to choose the $\lambda$ parameter in ridge and lasso regression, and the number of components in principal components regression and partial least squares. To that aim, researchers can rely on cross-validation (e.g., Hastie et al., 2009, Figures 3.8 and 3.10). In its basic form, cross-validation requires random allocation of the observations to two mutually exclusive subsets. The first subset, referred to as the “training set,” is used to estimate the model. The second subset, the “validation set,” is used to tune the model, that is, to determine the $\lambda$ parameter or the number of components. More specifically, one should estimate many models on the training set, each assuming a different $\lambda$ parameter or number of components. The estimated models can then be used to predict the observations in the validation set. The final step involves choosing the model with the $\lambda$ parameter or number of components that provides the best fit in the validation set. The model is not tuned on the training set, as this would lead to overfitting (Varian, 2014). To assess the predictive validity, that is, the performance of the model on

new data, observations are allocated to three subsets: the aforementioned training and validation sets and a third set referred to as the “test set.” The model is then estimated on the training set, tuned on the validation set, and the predictive validity is assessed by the forecasting performance on the test set. It is important to note that a model that shows a strong relationship between independent variable $x_j$ and the dependent variable $y$, and that, in addition, shows strong predictive validity, does not demonstrate a causal impact of $x_j$ on $y$ (Varian, 2014). The association between $x_j$ and $y$ may be due to reverse causality or due to a third variable that drives both $x_j$ and $y$, although the latter explanation requires that the influence of the third variable holds in the training and test sets. To test causality, one would ideally rely on a field experiment in which $x_j$ is manipulated for a random subset of the observed agents (Lambrecht & Tucker, 2013).

An advantage of working with big data is that typical datasets contain a sufficient number of observations to construct three subsets. In the case of a relatively small number of observations, the training set may become too small to reliably estimate the model, and the validation and test sets may not be representative of the data, thus giving an incorrect assessment of the model fit (Hastie et al., 2009). K-fold cross-validation may be preferred in these situations. K-fold cross-validation partitions the data into K subsets. For each value of the $\lambda$ parameter or number of components, one should first estimate the model on the combined data from all subsets but the first. The estimated model is then used to predict the observations in the first subset. The same procedure is then repeated for the combined data from all subsets but the second, and so on. By computing the average fit across subsets, one can choose the optimal $\lambda$ parameter or number of components. Common values for K are 5 and 10 (Hastie et al., 2009). In the case of very small datasets, K may be set to the number of observations, so that each observation is held out once, to obtain leave-one-out cross-validation.
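
Choosing $\lambda$ by K-fold cross-validation can be sketched as follows; the data are simulated and the $\lambda$ grid is arbitrary:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge coefficients (centered model, no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_mse(X, y, lam, K=5):
    """Average validation MSE of ridge with penalty lam over K folds."""
    folds = np.array_split(np.arange(len(y)), K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        b = ridge(X[train], y[train], lam)
        errors.append(np.mean((y[val] - X[val] @ b) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))           # many candidate predictors
y = X[:, 0] * 2 + rng.standard_normal(100)   # only the first one matters

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: kfold_mse(X, y, lam))
print(best_lam in grid)  # True: the lam with the lowest average fold error
```

The same loop tunes the number of components in principal components regression or partial least squares; only the inner estimator changes.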

Data too large to analyze. Big data may be too large to analyze, even when storage on a single machine is possible. Typical personal computers and their processors may not be able to complete the commands in a reasonable amount of time, and they may not hold sufficient internal memory to handle the analysis of large datasets. Below, we discuss three solutions: (1) parallelization, (2) bags of little bootstrap, and (3) sequential updating. For additional approaches, such as the divide-and-conquer approach and the application of the Markov chain

Monte Carlo method to large datasets, we refer to Wang and colleagues (2015) and Wedel and Kannan (2016).

Parallelization helps to speed up computations by using multiple processing units of the computer. Nowadays, most computers are equipped with multicore processors, yet many routines employ only a single processing core. By allocating the task over multiple cores, significant time gains can be obtained, although it must be noted that some time gains are lost due to the costs associated with distributing the tasks. Many statistical programs and programming languages enable parallelization, such as Stata’s MP version, the parallel package for R, and Matlab’s Parallel Computing Toolbox. For example, researchers may consider numerical maximization of the likelihood, which typically relies on finite differencing; that is, the gradient of the likelihood function is approximated by evaluating the effects of very small changes in the parameters (implementations include the function optim in R and fmincon and fminunc in Matlab). The large number of likelihood calculations required to obtain the gradient approximation can be sped up dramatically by allocating the likelihood evaluations over the available processing units. It must be noted that, in this example, parallelization is possible because the tasks are independent. Dependent tasks are challenging, and often impossible, to parallelize. For instance, different iterations of a likelihood maximization algorithm are performed after each other; that is, one decides on the parameter value that is used in iteration 1 once iteration 0 is completed; it is impossible to simultaneously perform both iterations on different processing units.
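
The allocation of independent likelihood evaluations over workers can be illustrated with Python’s executor interface. A thread pool is used here so the sketch runs anywhere; a process pool spreads the same map() calls across cores. The data and the normal log-likelihood are illustrative:

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Illustrative observations; each contributes an independent
# log-likelihood term under a normal model
data = [0.5, -1.2, 0.3, 2.1, -0.7, 1.4, 0.0, -0.4] * 1000

def chunk_loglik(chunk, mu=0.0, sigma=1.0):
    """Sum of normal log-likelihood contributions for one chunk of data."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu) ** 2 / (2 * sigma**2) for x in chunk)

chunks = [data[i::4] for i in range(4)]   # split the work in four
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = sum(pool.map(chunk_loglik, chunks))

serial = chunk_loglik(data)
print(abs(parallel - serial) < 1e-6)  # True: same likelihood, split work
```

The split is valid precisely because the per-observation terms are independent, as the text notes; iterations of an optimizer cannot be split this way.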

The bags of little bootstrap procedure introduced by Kleiner, Talwalkar, Sarkar, and Jordan (2014) is a combination of bootstrap methods that enhances the tractability of analyses in terms of memory requirements. First, subsamples of size b are created from the original data of size n by sampling without replacement. Then, within each subsample, bootstrap samples of size n (that is, the total number of observations) are drawn with replacement. The model of interest can then be estimated on each of the bootstrap samples. Estimation can be done in a computationally rather efficient way because the bootstrap samples contain (many) duplicate observations that can be efficiently handled by using a weight vector. Moreover, the bootstrap samples can be analyzed in parallel. By combining the results from the bootstrap samples, first within subsamples, and then across subsamples, full sample estimates are obtained. Kleiner and coworkers (2014) provided

advice on the number of subsamples and the subsample size. Different types of data may require a different bootstrap approach; for example, time-series (Efron & Tibshirani, 1994: 99–101) and network data (Ebbes, Huang, & Rangaswamy, 2015).
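
A sketch of the procedure for the simplest possible estimator, the mean; the subsample sizes are illustrative and this is not the authors’ implementation:

```python
import random
from collections import Counter

def blb_mean(data, s=5, b=20, r=50, seed=7):
    """Bags of little bootstrap sketch: s subsamples of size b (without
    replacement); within each, r bootstrap samples of size n = len(data)
    (with replacement), handled via a weight vector; estimates are
    averaged within and then across subsamples."""
    rng = random.Random(seed)
    n = len(data)
    subsample_estimates = []
    for _ in range(s):
        sub = rng.sample(data, b)            # subsample, no replacement
        boot_estimates = []
        for _ in range(r):
            # n draws from the subsample, stored as a weight per index
            weights = Counter(rng.choices(range(b), k=n))
            est = sum(sub[i] * w for i, w in weights.items()) / n
            boot_estimates.append(est)
        subsample_estimates.append(sum(boot_estimates) / r)
    return sum(subsample_estimates) / s

data = list(range(100))   # toy "full" dataset
print(blb_mean(data))     # near the full-sample mean of 49.5
```

Only b distinct observations and a length-b weight vector are ever in memory per bootstrap sample, even though each sample nominally has size n.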

Finally, sequential methods offer a fast and memory-efficient approach that is particularly suited to real-time data analysis (Chung, Rust, & Wedel, 2009). These methods sequentially update parameters of interest as new data arrive and require only storage of the current parameter values, not the historical data. Examples of these methods include the Kalman filter (see Osinga, in press, for an introduction to state space models and the Kalman filter) and related Bayesian approaches. In the current time period, the Kalman filter makes a forecast about the next time period. As soon as we reach this next time period and new data arrive, the forecast can be compared to the observed values. Based on the forecasting error, the parameters are updated and used for making a new forecast. This new forecast is compared to new data as soon as they arrive, giving another parameter update and forecast. Thus, historical data need not be kept in memory, which enhances tractability.
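
A one-dimensional local-level Kalman filter illustrates this forecast-error-update cycle: only the current estimate and its variance are kept in memory. The noise settings and observations are illustrative:

```python
def kalman_level(observations, obs_var=1.0, state_var=0.01):
    """Sequential estimate of a slowly moving level: each new observation
    updates the current estimate via its forecast error; no history kept."""
    est, p = 0.0, 1e6                # diffuse initial estimate and variance
    for y in observations:
        p += state_var               # forecast step: uncertainty grows
        k = p / (p + obs_var)        # Kalman gain
        est += k * (y - est)         # update with the forecast error
        p *= (1 - k)                 # updated uncertainty
    return est

# Noisy readings around a true level of 10 (illustrative data)
obs = [9.5, 10.4, 9.8, 10.1, 10.3, 9.7, 10.0, 10.2, 9.9, 10.1]
level = kalman_level(obs)
print(level)  # close to 10
```

Each arriving observation is absorbed and then discarded, so memory use is constant no matter how long the stream runs.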

Reporting and Visualization

When it comes to reporting, one of the key challenges of big data and data science is to be complete. The variety of big data makes it important to clearly describe the different data sources that are used. Also, steps taken in preprocessing and merging of the data should be carefully discussed. For example, when working with web traffic data, the researcher should indicate how long a visitor needs to stay on a webpage for the visit to be counted, whether a re-visit to a webpage within a certain time frame is counted as an additional visit or not, and so on. Similarly, when a program is used to rate the degree of positive sentiment in a news item, the researcher needs to be clear as to which words are searched for. Loughran and McDonald (2011) demonstrated the importance of context. For example, the words cost and liability may express negative sentiment in some settings, whereas they are more neutrally used in financial texts. Also, it should be clear to the reader whether weights are applied, and which other decisions have been made in preparing the data for analysis (e.g., Haas et al., 2015).

With regard to data analysis, statistical significance becomes less meaningful when working with big data. Even variables that have a small effect on the dependent variable will be significant if the sample

size is large enough. Moreover, spurious correlations are likely when considering a large number of variables. One should therefore, in addition to statistical significance, focus on the effect size of a variable and on its out-of-sample performance. Also, it is important to note that traditional statistical concepts apply to situations wherein a sample of the population is analyzed, whereas big data may capture the entire population. Bayesian statistical inference may provide a solution (Wedel & Kannan, 2016), as it assumes the data to be fixed and the parameters to be random, unlike the frequentist approach, which assumes that the data can be resampled. That being said, it may also be the case that, despite having big data, researchers may ultimately aggregate observations so that sample size decreases dramatically and statistical significance remains an important issue. This might occur, for example, because one’s theory is not about explaining millisecond-to-millisecond changes in phenomena such as heart rate, but, rather, about explaining differences between individuals in their overall health and well-being.

When applying methods for variable selection, it is important to describe the method used and, particularly, how the model was tuned. Since different approaches may give different results, it is highly advisable to try multiple approaches to show robustness of the findings. In theory testing, it is clear which variables to use. However, these variables may often be operationalized in different ways. Also, the number and way of including control variables may be open for discussion. Simonsohn and colleagues (2015) advised the estimation of all theoretically justified model specifications to demonstrate robustness of the results across specifications.

Finally, researchers may consider visualizing patterns in the data to give the audience a better feel for the strength of the effects, to allow easy comparison of effects, and to show that the effects are not an artifact of the complicated model that was used to obtain them. With the rise of data science, many new visualization tools have become available (e.g., Bime, Qlik Sense, and Tableau) that allow for easy application of multiple selections and visualization for large datasets.

CONCLUSION

In this editorial, we discussed the applications of data science in management research. We see opportunities for scholars to develop better answers to existing questions and to extend to new questions by embracing the data scope and granularity that big data provide. Our starter kit explains the key

challenges of data collection, storage, processing and analysis, and reporting and visualization, which represent departures from existing methods and paradigms. We repeat our caveat that the field is evolving rapidly with business practices and computing technologies. Even so, our editorial provides the novice with the basic elements required to experiment with data science techniques. A few decades ago, the emergence of commercial databases, such as Compustat and SDC, and the availability of advanced analytics packages, such as Stata and UCINET, revolutionized management research by enabling scholars to shift from case studies and simple two-by-two frameworks to complex models that leverage rich archival data. The advent of data science can be the next phase in this evolution, which offers opportunities not only for refining established perspectives and enhancing the accuracy of known empirical results, but also for embarking into novel research domains, raising new types of research questions, adopting more refined units of analysis, and shedding new light on the mechanisms that drive observed effects. Whereas data science applications are becoming pervasive in marketing and organizational behavior research, strategy scholars have yet to harness these powerful tools and techniques. Data science applications in management will take significant effort to craft, refine, and perfect. With a new generation of researchers evincing interest in these emergent areas, and the doctoral research training being provided, management scholarship is getting ready for its next generational leap forward.

Gerard George
Singapore Management University

Ernst C. Osinga
Singapore Management University

Dovev Lavie
Technion – Israel Institute of Technology

Brent A. Scott
Michigan State University

REFERENCES

Aaker, D. A., Kumar, V., Leone, R. P., & Day, G. S. 2013. Marketing research: International student version (11th ed.). New York, NY: John Wiley & Sons.

Archak, N., Ghose, A., & Ipeirotis, P. G. 2011. Deriving the pricing power of product features by mining consumer reviews. Management Science, 57: 1485–1509.

Blei, D. M., Ng, A. Y., & Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993–1022.

Bono, J. E., Glomb, T. M., Shen, W., Kim, E., & Koch, A. J. 2013. Building positive resources: Effects of positive events and positive reflection on work stress and health. Academy of Management Journal, 56: 1601–1627.

Business Roundtable. 2016, April. FedEx. In Business Roundtable (Ed.), Inventing the future: How technology is reshaping the energy and environmental landscape: 26–27. Washington, D.C.: Business Roundtable. Retrieved June 7, 2016, from http://businessroundtable.org/inventing-the-future/fedex.

Butts, M., Becker, W. J., & Boswell, W. R. 2015. Hot buttons and time sinks: The effects of electronic communication during nonwork time on emotions and work–nonwork conflict. Academy of Management Journal, 58: 763–788.

Chaffin, D., Heidl, R., Hollenbeck, J. R., Howe, M., Yu, A., Voorhees, C., & Calantone, R. 2015. The promise and perils of wearable sensors in organizational research. Organizational Research Methods. Published online ahead of print. doi: 10.1177/1094428115617004.

Chung, T. S., Rust, R. T., & Wedel, M. 2009. My mobile music: An adaptive personalization system for digital audio players. Marketing Science, 28: 52–68.

Colbert, A., Yee, N., & George, G. 2016. The digital workforce and the workplace of the future. Academy of Management Journal, 59: 731–739.

Das, S. R., & Chen, M. Y. 2007. Yahoo! for Amazon: Sentiment extraction from small talk on the web. Management Science, 53: 1375–1388.

Dhar, V. 2013. Data science and prediction. Communications of the ACM, 56: 64–73.

Dodgson, M., Gann, D. M., Wladawsky-Berger, I., Sultan, N., & George, G. 2015. Managing digital money. Academy of Management Journal, 58: 325–333.

Ebbes, P., Huang, Z., & Rangaswamy, A. 2015. Sampling designs for recovering local and global characteristics of social networks. International Journal of Research in Marketing. Published online ahead of print. doi: 10.1016/j.ijresmar.2015.09.009.

Efron, B., & Tibshirani, R. J. 1994. An introduction to the bootstrap. New York, NY: Chapman & Hall/CRC.

Frank, L. E., & Friedman, J. H. 1993. A statistical view of some chemometrics regression tools. Technometrics, 35: 109–135.

George, E. I., & McCulloch, R. E. 1993. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88: 881–889.


George, G., Haas, M. R., & Pentland, A. 2014. Big data and management. Academy of Management Journal, 57: 321–325.

Gruber, M., de Leon, N., George, G., & Thompson, P. 2015. Managing by design. Academy of Management Journal, 58: 1–7.

Haas, M. R., Criscuolo, P., & George, G. 2015. Which problems to solve? Online knowledge sharing and attention allocation in organizations. Academy of Management Journal, 58: 680–711.

Hastie, T., Tibshirani, R., & Friedman, J. 2009. The elements of statistical learning: Data mining, inference and prediction (2nd ed.). Berlin, Germany: Springer.

Ilies, R., Dimotakis, N., & De Pater, I. E. 2010. Psychological and physiological reactions to high workloads: Implications for well-being. Personnel Psychology, 63: 407–436.

Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. I. 2014. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B, Statistical Methodology, 76: 795–816.

Lambrecht, A., & Tucker, C. 2013. When does retargeting work? Information specificity in online advertising. Journal of Marketing Research, 50: 561–576.

LeCun, Y., Bengio, Y., & Hinton, G. 2015. Deep learning. Nature, 521: 436–444.

Lewis, R. A., & Reiley, D. H. 2014. Online ads and offline sales: Measuring the effect of retail advertising via a controlled experiment on Yahoo! Quantitative Marketing and Economics, 12: 235–266.

Liu, B. 2015. Sentiment analysis: Mining opinions, sentiments, and emotions. New York, NY: Cambridge University Press.

Loughran, T., & McDonald, B. 2011. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66: 35–65.

Madden, S. 2012. From databases to big data. IEEE Internet Computing, 16: 4–6.

Manning, C. D., Raghavan, P., & Schütze, H. 2009. Introduction to information retrieval. Cambridge, England: Cambridge University Press.

McAfee, A., & Brynjolfsson, E. 2012. Big data: The management revolution. Harvard Business Review, 90: 61–67.

Microsoft. 2016. Excel specifications and limits [Software support]. Retrieved June 24, 2016, from https://support.office.com/en-us/article/Excel-specifications-and-limits-ca36e2dc-1f09-4620-b726-67c00b05040f.

Osinga, E. C. In press. State space models. In P. S. H. Leeflang, J. E. Wieringa, T. H. A. Bijmolt, & K. E. Pauwels (Eds.), Advanced methods for modeling markets: Chapter 5. New York, NY: Springer.

Park, G., Eichstaedt, J. C., Kern, M. L., Seligman, M. E. P., Schwartz, H. A., Ungar, L. H., Kosinski, M., & Stillwell, D. J. 2015. Automatic personality assessment through social media language. Journal of Personality and Social Psychology, 108: 934–952.

Prajapati, V. 2013. Big data analytics with R and Hadoop. Birmingham, England: Packt Publishing.

Simonsohn, U., Simmons, J. P., & Nelson, L. D. 2015. Specification curve: Descriptive and inferential statistics on all reasonable specifications. Available online at SSRN. doi: 10.2139/ssrn.2694998.

Sismeiro, C., & Bucklin, R. E. 2004. Modeling purchase behavior at an e-commerce web site: A task-completion approach. Journal of Marketing Research, 41: 306–323.

Teixeira, T., Wedel, M., & Pieters, R. 2012. Emotion-induced engagement in internet video advertisements. Journal of Marketing Research, 49: 144–159.

Tirunillai, S., & Tellis, G. J. 2012. Does chatter really matter? Dynamics of user-generated content and stock performance. Marketing Science, 31: 198–215.

UPS. 2016. Leadership matters: Sustainability–telematics. Retrieved June 7, 2016, from https://www.ups.com/content/us/en/bussol/browse/leadership-telematics.html.

van der Vegt, G. S., Essens, P., Wahlstrom, M., & George, G. 2015. From the editors: Managing risk and resilience. Academy of Management Journal, 58: 971–980.

van Knippenberg, D., Dahlander, L., Haas, M., & George, G. 2015. Information, attention and decision-making. Academy of Management Journal, 58: 649–657.

Varian, H. R. 2014. Big data: New tricks for econometrics. The Journal of Economic Perspectives, 28: 3–27.

Vaughan, B. 2013, July 29. AP, news aggregator Meltwater end copyright dispute. Reuters (U.S. ed.). Retrieved June 7, 2016, from http://www.reuters.com/article/manniap-meltwater-lawsuit-idUSL1N0FZ17920130729.

Wang, C., Chen, M. H., Schifano, E., Wu, J., & Yan, J. 2015. Statistical methods and computing for big data. arXiv.org [Website]. Ithaca, NY: Cornell University. E-print available at http://arxiv.org/pdf/1502.07989v2.pdf. Accessed May, 2016.

Wedel, M., & Kannan, P. K. 2016. Marketing analytics for data-rich environments. Journal of Marketing. Published online ahead of print. doi: 10.1509/jm.15.0413.

Wiesel, T., Pauwels, K., & Arts, J. 2011. Practice prize paper: Marketing's profit impact: Quantifying online and off-line funnel progression. Marketing Science, 30: 604–611.

Xu, L., Duan, J. A., & Whinston, A. 2014. Path to purchase: A mutually exciting point process model for online advertising and conversion. Management Science, 60: 1392–1412.

Zikopoulos, P., & Eaton, C. 2011. Understanding big data: Analytics for enterprise class Hadoop and streaming data. New York, NY: McGraw-Hill.

Gerry George is dean and Lee Kong Chian Chair Professor of Innovation and Entrepreneurship at the Lee Kong Chian School of Business at Singapore Management University. He also serves as the editor of AMJ.

Ernst Osinga is an assistant professor of marketing at the Lee Kong Chian School of Business at Singapore Management University. His research interests include online advertising and retailing, pharmaceutical marketing, and the marketing–finance interface. He received his PhD in marketing from the University of Groningen in the Netherlands.

Dovev Lavie is professor and vice dean at the Technion – Israel Institute of Technology. He earned his PhD at the Wharton School of the University of Pennsylvania. His research focuses on alliance portfolios, balancing exploration and exploitation, and applications of the resource-based view. He serves as associate editor of AMJ, handling manuscripts in strategy and organization theory.

Brent Scott is professor of management at the Broad College of Business at Michigan State University. His research focuses on emotions, organizational justice, and employee well-being. He serves as an associate editor of AMJ, handling manuscripts in organizational behavior.
