on information quality: can your data do the job? (scecr 2015 keynote)

31
1 On Information Quality: Can Your Data Do the Job? SCECR 2015 @ Addis Ababa Galit Shmueli @ Institute of Service Science National Tsing Hua University, Taiwan

Upload: galit-shmueli

Post on 13-Aug-2015

105 views

Category:

Data & Analytics


1 download

TRANSCRIPT

Page 1: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

1

On Information Quality:

Can Your Data Do the Job?

SCECR 2015 @ Addis Ababa

Galit Shmueli @ Institute of Service Science

National Tsing Hua University, Taiwan

Page 2: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

2

eBay dataset with millions of auctions• All camera auctions in 2008• All camera auctions in 2015• Auctions for all items in 1/2012• Sample from all items in 2012

How much will you pay?

Or maybe Craigslist data?

Page 3: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

3

“statisticians working in a research environment… may well have to explain that the data are inadequate to answer a particular question.”

Statistics: A Very Short Introduction (Hand 2008)

Pre-data, post-data, post-analysis

What is the potential of a dataset to generate knowledge?

Page 4: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

4

Analysis goal

g XAvailable data

fData analysis

method

Utility measure

U

Page 5: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

5

Analysis goal

g

Domain GoalWhat, why, when, where, how

→ Analysis GoalExplain, predict, describeEnumerative, analyticExploratory, confirmatory

Quality of Goal Specification• “error of the third kind” - giving the right answer to the wrong

question – Kimball• “Far better an approximate answer to the right question, which

is often vague, than an exact answer to the wrong question, which can always be made precise” - Tukey

Page 6: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

6

XAvailable data

Data Source• Primary, secondary• Observational, experiment• Single, multiple sources• Collection instrument, protocol

Data Type• Continuous, categorical, mix• Structured, un-, semi-structured• Cross-sectional, time series, panel,

network, geographical

Data Quality U(X|g)• “Zeroth Problem - How do the data relate to the problem, and

what other data might be relevant?” - Mallows• MIS/Database - usefulness of queried data to person querying it. • Quality of Statistical Data (IMF, OECD) - usefulness of summary

statistics for a particular goal (7 dimensions)

Data Size and Dimension• # observations• # variables

Page 7: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

7

fData analysis

method

Analysis Quality• “poor models and poor analysis techniques, or even analyzing the

data in a totally incorrect way.” - Godfrey• Analyst expertise• Software availability• The focus of statistics/econometrics/DM education

Statistical models and methods • Parametric, semi-, non-parametric• Classic, Bayesian

Econometric models

Data mining algorithms

Graphical methodsOperations research methods

Page 8: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

8

Utility measure

U

Quality of Utility Measure• Adequate metric from analysis standpoint (R2, holdout data)• Adequate metric from domain standpoint

Domain goal → Analysis goal

• Predictive accuracy, lift • Goodness-of-fit • Statistical power, statistical significance• Strength-of-fit• Expected costs, gains• Bias reduction, bias-variance tradeoff

Analysis utility → Domain utility

Page 9: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

9

InfoQ(f,X,g) = U( f(X|g) )

Depends on quality of g, X, f, U and relationship between them

The potential of a particular dataset to achieve a particular goal using a given empirical analysis method

Page 10: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

10

Statistical Approaches for Increasing InfoQ

Study Design (Pre-Data)• DOE• Clinical trials• Survey sampling• Computer experiments

Post-Data-Collection• Data cleaning and

preprocessing• Re-weighting, bias

adjustment• Meta analysis

Randomization, Stratification, Blinding, Placebo, Blocking, Replication, Sampling frame,Link data collection protocol with appropriate design

Recovering “real data” vs. “cleaning for the goal”Handling missing values, outlier detection, re-weighting, combining results

Page 11: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

11

Assessing InfoQ“Quality of Statistical Data” (Eurostat, OECD, NCSES,…)• Relevance• Accuracy• Timeliness and punctuality• Accessibility• Interpretability• Coherence• Credibility

InfoQ dimensions1. Data resolution2. Data structure3. Data integration4. Temporal relevance5. Chronology of data and goal6. Generalizability7. Construct operationalization8. Communication

3 V’s of Big Data • Volume• Variety• Velocity

Marketing Research• Recency• Accuracy• Availability• Relevance

Page 12: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

12

#1 Data ResolutionMeasurement scale and aggregation level

Dutch Flower AuctionsVan HeckKetterGuptaKoppiusLuKambil

How many flowers?How many auctions?Bidder level?

Page 13: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

13

#2 Data Structure

Data Types• Time series, cross-sectional, panel• Geographic, spatial, network• Text, audio, video, semantic• Structured, semi-, non-structured• Discrete, continuous

Data Characteristics Corrupted and missing values due to study design or data collection mechanism

Social TV: Real-Time Media Response to TV AdvertisingHill, Nalavade & Benton, KDD 2012

Prediction in Economic NetworksDhar, Geva, Oestreicher-Singer & Sundararajan, ISR 2012

Page 14: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

14

Consumer Surplus in Online AuctionsBapna, Jank & Shmueli, ISR 2008

#3 Data Integration

Utility of Linkage Dangers: Privacy Increase or decrease InfoQ?

Music Blogging, Online Sampling, and the Long Tail, Dewan & Ramprasad, ISR 2013

Song radio play data from Nielsen SoundScan + music blog sampling data from The Hype Machine

Page 15: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

15

#4 Temporal Relevance

Analysis Timeliness(solving the right

problem too late)

Data Collection

Data Analysis

Study Deployment

t1 t2 t3 t4 t5 t6

Collection Timeliness(relevance to g)

g: Prospective vs. retrospective; longitudinal vs. snapshotNature of X, complexity of f

forecast

Page 16: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

16

#5 Chronology of Data & Goalg1: Explain priceg2: Forecast price

Retrospective/prospectiveEx-post availabilityEndogeneity

Page 17: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

17

#6 Generalizability

Statistical generalizability

Scientific generalizability

Definition of gChoice of X, f, U

The Hidden Cost of Accommodating Crowdfunder Privacy Preferences: A Randomized Field ExperimentBurtch, Ghose, Wattal

“We found that our treatment had a large, highly negative effect on hiding behavior…(β = -0.279, p < 0.001)”

“It is possible that our results would not extend to a [offline] purchase context, where issues of social capital, reputation, etc. might be less pronounced”

Page 18: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

18

#7a Construct Operationalizationχ construct X = θ(χ) operationalization (measurable)

• Causal explanation vs. prediction, description

• Theory vs. data• Data: Questionnaire,

physical measurement

Online Seller

Reputation

Total #ratings

{#pos, #neg} ratings

# recent negative ratings

Text content of feedback

The Digitization of Word-of-Mouth: Promise and Challenges of Online Reputation SystemsDellarocas, Management Science

Page 19: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

19

#7b Action OperationalizationDerive concrete actions from the information provided by a study

The Management Insights editor will write a Management Insights paragraph for every paper that is accepted for publication.

One Way Mirrors in Online Dating: A Randomized Experiment Bapna, Ramprasad, Shmueli & Umyarov

Online Reputation Systems: How to Design One That Does What You NeedDellarocas

Social TV: Real-Time Media Response to TV AdvertisingHill, Nalavade & Benton

Page 20: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

20

#8 CommunicationVisual, written, verbal presentations & reports

Knowledge must reach the right person at the right time

• Mentoring• Manuscript reviewing• Data made available to

others• EDA and shared

visualization dashboards• Seminars + conferences!

Page 21: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

21

“In the last three years, there has been a concerted effort by those in Washington to reduce government spending and reign in the national debt.

One reason for the budget cuts?

Research by two Harvard economists, Ken Rogoff and Carmen Reinhart. The pair found that when a country owes more than 90 percent of their GDP, it slides into recession.”

… Fixing this Excel error transforms high-debt countries from recession to growth

www.marketplace.org/topics/economy/excel-mistake-heard-round-world

Page 22: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

22

a brief conversation about marriage equality with a canvasser who revealed that he or she was gay had a big, lasting effect on the voters’ views, as measured by separate online surveys administered before and after the conversation. 

May 2015: Independent researchers fail to replicate; noted statistical irregularities in LaCour’s data (baseline outcome data statistically indistinguishable from a national survey ; over-time changes indistinguishable from perfect normally distributed noise)

High or Low InfoQ?

Page 23: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

23

Assessing InfoQ in Practice

Rating-based assessment1-5 scale on each dimension:

InfoQ Score = [d1(Y1) d2(Y2) … d8(Y8)]1/8

Experience from two research methods courses

Shmueli and Kenett (2013), “An Information Quality (InfoQ) Framework for Ex-Ante and Ex-Post Evaluation of Empirical Studies”, Proceedings of IDAM 2013, Springer-Verlag.

Page 24: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

Assessing Research Proposals: Ex-Ante (Prospective) InfoQ• 2009 Research methods workshop for PhD students• Help students develop and present research proposal• 50 students from OB, OR, marketing, economics

Page 25: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

• Each student evaluates his/her proposal on the 8 InfoQ dimensions. Will proposal likely lead to the intended goal?

• Students’ grades derived from an InfoQ score of their proposal submission (presentation + written)

InfoQ Integration (Prospective)

Page 26: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

Goal: Make students’ research journey more efficient and more effective

Page 27: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

Assessing Research Proposals:Ex-Post (Retrospective) InfoQ

Professional masters degree program that emphasizes statistical practice, methods, data analysis and practical workplace skills. The MSP is for students who are interested in professional careers in business, industry, government, or scientific research.

Masters Program in Statistical Practice

Page 28: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

Assessing Research Proposals:Ex-Post (Retrospective) InfoQ

In 2012: 16 studentsStudents instructed to: (60-90 min)1. Read short description of InfoQ and its dimensions2. Read reports on five studies 3. Describe goal, data, analysis, utility for each study4. Evaluate the 8 dimensions for each study

Predicting days with unhealthy air quality in Washington DC

Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators

Quality-of-care factors in U.S. nursing homes

Predicting ZILLOW.com’s Zestimate accuracy

goo.gl/erNPF

Predicting First Day Returns for Japanese IPOs

Page 29: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

Feedback:

InfoQ approach helped “sort out all of the information”

Several reported that they will adopt this evaluation approach for future studies

Page 30: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

30

InfoQ: Strengths and ChallengesInfoQ approach streamlines questioning of data value• “Why should we invest in data?” – management• Compare value of potential datasets, analyses• Prioritize/rank projects• Strengthen functional – analytical relationship• Evaluation study: InfoQ useful for developing an analysis plan

(prospective) and evaluating an empirical study (retrospective)

To Do:• Improve InfoQ assessment (heterogeneity across different raters)• Alternative InfoQ assessment approaches (pilot study, EDA, other)• Further dimensions (data privacy, human subject compliance & risk)• Effect of technological advances on InfoQ

Page 31: On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)

31In progress…

With discussion