
Big Data, Small Testing ?

Jayant Haritsa

Database Systems Lab

Indian Institute of Science

Feb 2017 Parcomptech CDAC 1


Old Concept, New Hype

• VLDB: Premier international database conference, started in 1975

– Very Large Data Bases

• Large ≈ Big, ∴ Very Large ≫ Big

• Only a few enterprises really have big data; the others just say so for bragging rights!

– World Data Center for Climate ~ 1 petabyte


NYT Op-ed Article [April 2014]

• Eight (No, Nine!) Problems With Big Data

• Gary Marcus, Ernest Davis (NYU faculty)

“big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions”

Who’s Bigger? Where Historical Figures Really Rank (Book by MIT/Google: Hitler ranks higher than Aristotle!)

We need to ensure that Big Data does not wind up becoming Huge Nonsense …


CallingBullshit.org [2017]

Univ. of Washington, Seattle

Profs. Carl Bergstrom and Jevin West

1 credit course: Calling Bullshit in the Age of Big Data

“We will focus on bullshit that comes clad in the trappings of scholarly discourse. Traditionally, such highbrow nonsense has come couched in big words and fancy rhetoric, but more and more we see it presented instead in the guise of big data and fancy algorithms — and these quantitative, statistical, and computational forms of bullshit are those that we will be addressing in the present course.”


Research Landscape

• Current Focus: Architecting the “plumbing” infrastructure for Big Data environments

– programming models, stream processing and summarization, sketching and approximation algorithms, storage architectures, cloud hosting, analytics, security …

• These techniques are unlikely to work in practice!

• The elephant in the room is the lack of testing methodologies for such deployments


Quotes†

50% of our cost is on testing (QA) (Bill Gates @ Opening of Gates Building)

Testing alone takes up six months of the 18 month product release cycle (SAP Executive)

Estimated damage of 60 billion dollars per year in USA caused by software bugs (US Department of Commerce, 2004)


† From Donald Kossmann’s Stanford talk


Big Data Disasters


1. UK Immigration [2013]

A Home Office text message campaign accusing people of being illegal immigrants has received numerous complaints after several people were contacted in error. Officials have sent messages to almost 40,000 people they suspect of not having a right to be in the UK, instructing them to contact border officials to discuss their immigration status. The government commissioned Capita, the outsourcing company, to trace people believed to have outstayed their visas.


UK Immigration (contd)

Within a few months, Capita was accused of mishandling cases and getting just as mixed up as the bureaucrats it was supposed to be replacing!

In November, Capita admitted to a backlog of 150,000 notifications about foreign students that it had not been able to process, leaving it unable to determine whether they should still be in the country.

In IT terms, it's been at the center of a billion-dollar botched "e-borders" system, which has been missing deadlines and delivery dates since the middle of the last decade and which may not even be legal under European Union legislation!


2. Obama HealthCare.gov [2013]

Severe problems were caused by unexpectedly high volume when the site drew 250,000 simultaneous users instead of the 50,000-60,000 expected. More than 8 million people visited the site from October 1 to 4. White House officials subsequently conceded that it was not just an issue of volume, but also involved software and systems design issues. Moreover, stress tests done by the contractors one day before the launch date revealed that the site became too slow with only 1,100 simultaneous users!

HealthCare.gov problems persisted even weeks after the launch. For example, a networking error at the related data services hub killed the website's functionality. This occurred the very day after Health & Human Services head Kathleen Sebelius had highlighted the design of that data hub as a government success.


3. Flipkart → Flopkart [2014]

Deccan Herald: Big Apology Day follows Flipkart's Big Billion Day

– After its Big Billion Day on Monday, which fetched Flipkart.com $100 million by way of sales and the ire of hordes of angry customers who complained of technical glitches and false promises on discounts, the Bangalore-based online giant was quick to apologise for its drawbacks on Tuesday.

– “Though we saw unprecedented interest in our products and traffic like never before, we also realised that we were not adequately prepared for the sheer scale of the event. We didn't source enough products and deals in advance to cater to your requirements. To add to this, the load on our server led to intermittent outages, further impacting your shopping experience on our site,” the Bansals said.

– Noting that it took enormous effort from everyone at Flipkart, many months of preparation and pushing its “capabilities and systems to the limit” for the big day, the Bansals said that they were looking at deals and offers painstakingly put together for months.


Flipkart → Flopkart [2014]

Price Changes

– Even as Flipkart prepares various deals and promotional pricing in the lead-up to the sale, the pricing of several products gets changed to non-discounted rates for a few hours.

Out of stock

– The website ran out of stock for many products within a few minutes (and in some cases, seconds) of the sale going live. Most special deals were sold out as soon as they went live.

Cancellations

– A large number of people bought specific products simultaneously. This led to some instances of orders getting overbooked for a product sold out just a few seconds ago.

Website Issues

– Nearly 5,000 servers were deployed in preparation for a 20-fold growth in traffic, but the volume of traffic at different times of the day was much higher than this.


Testing Times


Software Mindset†

Everybody loves writing code

Everybody hates testing it

– more emphasis on developing new models than on evaluating current setups

– solution: automate the testing; computers are cheap and do not complain


† From Donald Kossmann’s Stanford talk


Automated Testing

Test Automation is a DB Problem

– several optimizations in different flavors

– all about logical data independence

– far cry from being solved!

Research on Testing Methodology (as opposed to doing the testing) is fun!

– interesting intellectual problems with immediate practical impact

– theory, algorithms, data structures, experiments, prototypes, …

– “Testing and Tuning of Database Systems”, Special Issue of IEEE Data Engineering Bulletin, 31(1), 2008


Research Problems [1]

• Given a large workload of test queries, how to summarize them into a small set that provides equivalent coverage? [SIGMOD 2002]

– applications: automatic index selection in databases, workload-based sampling

– NP-hard (by reduction from Minimum k-median); constant-factor approximation for metric spaces

• Retaining select data characteristics in outsourced data processing [AMCIS 2012]

– retain data quality problems (missing data, incorrect format, domain violations) in encrypted customer data!
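The workload summarization problem can be pictured as a k-median instance: choose k representative queries so that the total distance from every query to its nearest representative is small. The sketch below is a plain greedy heuristic, not the constant-factor approximation referred to above. The feature vectors (number of joins, number of predicates) and the L1 distance are assumptions made purely for illustration.

```python
from typing import Dict, List, Sequence

def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 (Manhattan) distance; a metric, chosen here only as an assumption."""
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(workload: Dict[str, Sequence[float]], medians: List[str]) -> float:
    """Sum over all queries of the distance to the nearest representative."""
    return sum(min(l1(v, workload[m]) for m in medians) for v in workload.values())

def greedy_k_median(workload: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Greedily add the query that most reduces the total cost until k are chosen."""
    medians: List[str] = []
    candidates = sorted(workload)  # sorted for deterministic tie-breaking
    for _ in range(min(k, len(candidates))):
        best = min(
            (q for q in candidates if q not in medians),
            key=lambda q: total_cost(workload, medians + [q]),
        )
        medians.append(best)
    return medians

# Hypothetical workload: (number of joins, number of predicates) per test query.
WORKLOAD = {
    "q1": (0, 1), "q2": (0, 2), "q3": (1, 1),
    "q4": (5, 8), "q5": (5, 9), "q6": (6, 8),
}
SUMMARY = greedy_k_median(WORKLOAD, k=2)
```

With this toy workload the heuristic keeps one query from each of the two visible clusters, so six test queries shrink to two.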


Research Problems [2]

• Given a database D, a parametrized query Q, and cardinality constraints C on subexpressions, how to assign the parameter values to Q to satisfy the constraints? [TKDE 2002]

– problem is NP-hard (by reduction from Subset-sum); hill-climbing heuristic

• Reverse query processing [ICDE 2007]

– Given query Q on database schema S, and desired result R, generate a database instance D s.t. Q(D) = R; ideally, minimal D!

– Undecidable whether such a D even exists!

– Model checking, symbolic processing, reverse relational algebra
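As a rough illustration of the hill-climbing idea for the parametrized-query problem, the sketch below handles the simplest possible case: one parameter, one predicate, one cardinality constraint. The toy query (x <= :p), the in-memory "database" and the step-halving schedule are all assumptions; the general problem with several parameters and constraints on subexpressions is the NP-hard one.

```python
from typing import List

def cardinality(rows: List[int], p: int) -> int:
    """Rows returned by the toy parametrized predicate  x <= :p."""
    return sum(1 for x in rows if x <= p)

def hill_climb(rows: List[int], target: int, start: int = 0,
               step: int = 64, max_iters: int = 10_000) -> int:
    """Move :p towards the value whose output cardinality is closest to target."""
    def error(p: int) -> int:
        return abs(cardinality(rows, p) - target)

    p = start
    for _ in range(max_iters):
        if error(p) == 0 or step == 0:
            break
        best = min((p - step, p + step), key=error)  # best neighbour
        if error(best) < error(p):
            p = best          # uphill move: constraint error shrinks
        else:
            step //= 2        # no better neighbour: refine the step size
    return p

ROWS = list(range(1000))          # hypothetical column with values 0..999
P = hill_climb(ROWS, target=250)  # find :p so that exactly 250 rows qualify
```

Being a local search, this can stall on plateaus or local optima when the data is skewed, which is the usual price of a heuristic for a hard problem.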


Trichotomy†

[Figure: triangle with vertices Database, Queries and Answers, built up over four slides]

– Query Processing: given the Database and the Queries, produce the Answers

– Reverse Query Processing: given the Queries and the Answers, produce the Database

– Programming By Example: given the Database and the Answers, produce the Queries

† From Mike Franklin, UCB


DB Testing Practices


Basic Question

How do you know the output delivered for the user objective is correct?

Checking is hard because of the magnitude of data involved and the complexity of the queries


Types of Errors

English-to-SQL translation errors

“Public demands change”

Public is demanding change in society

Public demands are changing over time

Public is demanding loose change (coins)

– Big problem (only about 40% are correct!)

– Further, more than 80% are written correctly only after two to four attempts!


Types of Errors (contd)

Syntactic errors

– easy to check with automatic parser generators

Semantic errors

– Schema/type errors (easy to check from catalogs)

– Arithmetic errors (easy to check at runtime)

– Optimizer rewriting errors, e.g. infamous Count Bug [1986] (hard to find)

– Operator implementation errors (hard to find)

– Index maintenance errors (hard to find)

– Transaction management errors, e.g. ARIES checkpoint error (hard to find)
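To see why an optimizer rewriting error such as the Count Bug is hard to find, the sketch below runs a correlated COUNT subquery and a naive join-plus-group-by "unnesting" of it on SQLite. The table names and rows are illustrative assumptions. The point is only that the rewrite silently drops outer rows whose subquery count is zero, and that both queries still look perfectly reasonable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parts  (pnum INTEGER, qoh INTEGER);
CREATE TABLE supply (pnum INTEGER, shipdate TEXT);
INSERT INTO parts  VALUES (3, 6), (10, 1), (8, 0);
INSERT INTO supply VALUES (3, '1979-07-03'), (3, '1978-10-01'),
                          (10, '1978-06-08'), (10, '1981-08-10'),
                          (8, '1983-05-07');
""")

# Original nested form: parts whose quantity on hand equals the number
# of shipments made before 1980.
NESTED = """
SELECT pnum FROM parts
WHERE qoh = (SELECT COUNT(shipdate) FROM supply
             WHERE supply.pnum = parts.pnum AND shipdate < '1980-01-01')
"""

# Naive unnesting: aggregate first, then join. Part 8 has no qualifying
# shipments, so it never appears in the aggregate and is lost by the join.
UNNESTED = """
SELECT p.pnum FROM parts p
JOIN (SELECT pnum, COUNT(shipdate) AS ct FROM supply
      WHERE shipdate < '1980-01-01' GROUP BY pnum) t
  ON p.pnum = t.pnum
WHERE p.qoh = t.ct
"""

nested_result = sorted(row[0] for row in conn.execute(NESTED))
unnested_result = sorted(row[0] for row in conn.execute(UNNESTED))
conn.close()
```

The two results differ only for the part with zero matching shipments, so a test database without such a row would never expose the error.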


Library Approach

SQL test libraries designed by the engine developers or application specialists

Run regression tests on this workload

– Very limited coverage


Stochastic Approach

Stochastic (random) generation of SQL queries [Microsoft]

– highly expensive on real databases

– unlikely to catch “boundary-value” errors

– will not easily catch self-join and other such specific errors

– correlated random variables
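A minimal sketch of the stochastic approach, assuming a toy two-column table, a tiny grammar of conjunctive predicates, and SQLite as the engine under test. Every generated query is checked against an independent reference evaluation in plain Python. None of this reflects the actual generator cited above; all names are hypothetical.

```python
import operator
import random
import sqlite3

COLUMNS = ("a", "b")
OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq, ">": operator.gt}

def random_predicates(rng):
    """Draw one to three random (column, operator, constant) predicates."""
    return [
        (rng.choice(COLUMNS), rng.choice(sorted(OPS)), rng.randint(0, 9))
        for _ in range(rng.randint(1, 3))
    ]

def to_sql(preds):
    """Render the predicates as a conjunctive COUNT(*) query on table t."""
    where = " AND ".join(f"{col} {op} {const}" for col, op, const in preds)
    return f"SELECT COUNT(*) FROM t WHERE {where}"

def oracle(rows, preds):
    """Reference answer computed in plain Python, independent of the engine."""
    def keep(row):
        values = dict(zip(COLUMNS, row))
        return all(OPS[op](values[col], const) for col, op, const in preds)
    return sum(1 for row in rows if keep(row))

def run_trials(n, seed):
    """Run n random queries; return the SQL text of every mismatch."""
    rng = random.Random(seed)
    rows = [(rng.randint(0, 9), rng.randint(0, 9)) for _ in range(50)]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    mismatches = []
    for _ in range(n):
        preds = random_predicates(rng)
        sql = to_sql(preds)
        if conn.execute(sql).fetchone()[0] != oracle(rows, preds):
            mismatches.append(sql)
    conn.close()
    return mismatches
```

Even in this toy, the constants are drawn uniformly, so a query that sits exactly on a boundary value, or one that needs a self-join, only appears if the grammar is deliberately extended to produce it.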


Moving on to the Big Data World


Test Environment

• Underlying infrastructure is a hybrid of ETL/IR/KM/DB components

– e.g. IBM InfoSphere (DataStage, QualityStage, MDM, DB2, BigInsights, Metadata repository, …)

• Need to test

• “functionality” (programs/data)

• “compilation” (query/model planning)

• “execution” (query/model processing)


Sample Scenario

• Wish to test “yottabyte” (10^24 byte) scale Big Data environment for InfoSphere

• Metrics: Functionality, Correctness, Performance, Scalability

• Impractical (time) or infeasible (space) to explicitly create and process test data
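A back-of-the-envelope calculation makes the "impractical (time)" point concrete. The sustained write rate of 1 GB/s below is an assumed figure, not one from the talk.

```python
YOTTABYTE = 10**24                    # bytes
SECONDS_PER_YEAR = 365 * 24 * 3600    # 31,536,000

def years_to_write(total_bytes: int, bytes_per_second: float) -> float:
    """Years needed to materialize the data at a given sustained write rate."""
    return total_bytes / bytes_per_second / SECONDS_PER_YEAR

# Assumed sustained write rate of 1 GB/s.
YEARS = years_to_write(YOTTABYTE, 10**9)
```

At that rate, merely generating the test data would take tens of millions of years, before a single query is run on it.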


Pie-in-the-sky

A complete testing environment for Big Data management systems, wherein the entire data and meta-data is virtual or transient, supporting efficient evaluation of arbitrary deployment scenarios.


Metadata Testing


Our Approach

• Build metadata construction tools that “fool” the underlying information systems into thinking that the data is actually present even though it has never been created or stored

• Developed tool called CODD (Constructing Dataless Databases)

• Edgar Codd, IBM, father of RDBMS / Turing awardee

• In archaic English, “cod” means “empty shell”


CODD Metadata Processor

• Easy-to-use graphical tool for the automated creation, verification, retention, scaling and porting of database meta-data configurations

• Entirely written in Java (~50K LOC) and operational on industrial-strength database engines (DB2, Oracle, SQL Server, SQL-MX, PostgreSQL)

• Released as free software after receiving copyright from the Indian government

• In use at several industrial and academic research labs


Metadata Construction


• Users can directly input statistics on:

• Relational Tables (row cardinality, row length, disk blocks)

• Attribute Columns (column width, number of distinct values, value distribution histograms)

• Attribute Indexes (number of leaf blocks, clustering factor)

• System Parameters (cores, memory size, CPU utilization)


Graphical Histogram


Metadata Validation


Need to ensure that the input information is

– Legal (valid type and range)

– Consistent (compatible with other metadata values)

Validation Approach

– Construct a directed acyclic constraint graph CG(V,E)

– V is the set of individual metadata entities while E is the set of statistical value dependencies

– Super Nodes: used to represent collapsed chain of nodes for compactness

– Run topological sort on CG to obtain CG_linear

– CODD uses this linear ordering to guide the user
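The validation approach can be sketched with Python's standard graphlib module. The metadata entity names and the dependencies between them are assumptions chosen to resemble the statistics listed earlier. They are not CODD's actual constraint graph, and CODD itself is written in Java.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical constraint graph: each key maps to the entities that must be
# known before it can be checked.
DEPENDS_ON = {
    "row_cardinality": set(),
    "row_length": set(),
    "disk_blocks": {"row_cardinality", "row_length"},
    "distinct_values": {"row_cardinality"},
    "histogram": {"distinct_values"},
}

# Linear order in which a tool could prompt the user for values.
INPUT_ORDER = list(TopologicalSorter(DEPENDS_ON).static_order())

def validate(stats):
    """Return a list of legality and consistency violations (empty if none)."""
    errors = []
    for name in ("row_cardinality", "row_length", "distinct_values"):
        value = stats.get(name)
        if not isinstance(value, int) or value < 0:
            errors.append(f"{name}: must be a non-negative integer")  # legality
    if not errors and stats["distinct_values"] > stats["row_cardinality"]:
        errors.append("distinct_values exceeds row_cardinality")  # consistency
    return errors
```

Prompting in topological order means that by the time a value is entered, everything it has to be consistent with is already known, so each check can run immediately.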


Constraint Graph [DB2]

[Figure: constraint graph for DB2 metadata. Annotations: legality constraints; statistical dependency edges, with direction chosen as per the abstraction hierarchy; super nodes; dashed edges representing missing constraints; node processing order]


Unique features of CODD

• Supports creation of arbitrary “what-if” scenarios

• Carries out automatic validation of user input

• Supports both space-based scaling and time-based scaling

• Provides graphical histogram operations

• Supports inter-engine metadata transfer


CODD in action

• Successfully simulated yottabyte environment on a vanilla laptop

• Demonstrated deep bug in a popular commercial database system that only surfaces at Big Data scale

– Query optimizer saturates at 10^20 bytes

– Hidden constant in the code inserted by a programmer to symbolize “infinity”
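A toy model of how a hard-coded "infinity" constant could produce this kind of saturation. The internals of the commercial system are not described in the talk, so the constant's value as a byte count, the estimator and the candidate names are all hypothetical.

```python
INFINITY_SENTINEL = 10**20  # hypothetical "infinity" constant, in bytes

def estimated_bytes(row_count: int, row_bytes: int) -> int:
    """Toy size estimate that silently clamps at the sentinel."""
    return min(row_count * row_bytes, INFINITY_SENTINEL)

def pick_smaller_input(candidates):
    """Return the candidate with the smallest estimate (first one wins ties)."""
    return min(candidates, key=lambda name: estimated_bytes(*candidates[name]))

CANDIDATES = {
    "scan_huge": (10**22, 100),   # true size 10^24 bytes
    "scan_large": (10**19, 100),  # true size 10^21 bytes
}
CHOICE = pick_smaller_input(CANDIDATES)
```

Both candidates are estimated at exactly the sentinel value, so the comparison degenerates into a tie and the thousand-fold larger input can win. Below the sentinel the estimator behaves normally, which is why such a defect stays invisible on ordinary test databases.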


Take Away

Research on Automating Big Data Testing is great technical fun with immediate practical relevance ...

Stop Protesting, be Pro-Testing !


More Details

For publications, software and documentation, visit:

http://dsl.cds.iisc.ac.in/projects/CODD/index.html


Questions ?



Example Application Scenario


TPC-H 1 GB (baseline) and TPC-H 100 TB (scaled); both are metadata shells created using CODD

Commercial DBMS; Laptop with 64 GB disk

Query 9 of TPC-H

select n_name, o_year, sum(amount)
from (select n_name, extract(year from o_orderdate) as o_year,
             l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
      from part, supplier, lineitem, partsupp, orders, nation
      where s_suppkey = l_suppkey and ps_suppkey = l_suppkey and
            ps_partkey = l_partkey and p_partkey = l_partkey and
            o_orderkey = l_orderkey and s_nationkey = n_nationkey and
            p_name like '%green%' and s_acctbal :varies and ps_supplycost :varies
     ) as all_nations
group by n_name, o_year
order by n_name, o_year desc


Scalability: 1GB to 100TB Database


[Figure: two panels, Baseline Dataset (1 GB) and Scaled Dataset (100 TB)]

Number of plans increased from 32 to 77!

Significant change in geometries of plan optimality regions
