Demystifying Systems for Interactive and Real-time Analytics The BigFrame Team Duke University, Hong Kong Polytechnic University, and HP Labs

Post on 24-Feb-2016

Page 1: Demystifying Systems for Interactive and Real-time Analytics

Demystifying Systems for Interactive and Real-time Analytics

The BigFrame Team
Duke University, Hong Kong Polytechnic University, and HP Labs

Page 2: Demystifying Systems for Interactive and Real-time Analytics

Analytics System Landscape

MPP DB
Columnar
MapReduce
Mixed
Dataflow
Streaming
Text Analytics
Array DB
Graph
Multi-tenant

Page 3: Demystifying Systems for Interactive and Real-time Analytics

Analytics System Landscape

Gamma, Aster, Netezza, DB2 PE, Teradata, SQL Server Parallel Data Warehouse, Greenplum

Page 4: Demystifying Systems for Interactive and Real-time Analytics

Analytics System Landscape

HP Vertica, ParAccel, Redshift, Vectorwise

Page 5: Demystifying Systems for Interactive and Real-time Analytics

Analytics System Landscape

Hadoop, Tenzing, Hive, Mahout, HadoopDB, Pig

Page 6: Demystifying Systems for Interactive and Real-time Analytics

Analytics System Landscape

Dremel, Drill, Stinger, Impala, Spark, Dryad, SCOPE

Page 7: Demystifying Systems for Interactive and Real-time Analytics

Analytics System Landscape

Cassandra, HBase, Bigtable, Druid, HANA, Spanner, Megastore, Splunk

Page 8: Demystifying Systems for Interactive and Real-time Analytics

Analytics System Landscape

Storm, GraphLab, Streambase, Cassovary, GraphX, Solr, ElasticSearch, SciDB, Cloudera Search, MadLINQ, Pregel, HAMA

Page 9: Demystifying Systems for Interactive and Real-time Analytics

Analytics System Landscape

Mesos, YARN, Serengeti, Cloud platforms

Page 10: Demystifying Systems for Interactive and Real-time Analytics

What does this mean for Big Data Practitioners?

Page 11: Demystifying Systems for Interactive and Real-time Analytics

Gives them a lot of power!

From: http://animeonly.org/Digital-Wallpapers/Digital-renders/Spiderman-95061p.html

Page 12: Demystifying Systems for Interactive and Real-time Analytics

Even the mighty may need a little help

Page 13: Demystifying Systems for Interactive and Real-time Analytics

Challenges for Practitioners

App Developers, Data Scientists: Which system to use for the app that I am developing?

• Features (e.g., graph data)
• Performance (e.g., claims like “System A is 50x faster than B”)
• Resource efficiency
• Growth and scalability
• Multi-tenancy

Page 14: Demystifying Systems for Interactive and Real-time Analytics

Challenges for Practitioners

App Developers, Data Scientists: Which system to use for the app that I am developing?

Different parts of my app have different requirements

Compose “best of breed” systems OR use a “one size fits all” system?

System Admins: Managing many systems is hard!

Page 15: Demystifying Systems for Interactive and Real-time Analytics

Challenges for Practitioners

App Developers, Data Scientists: Which system to use for the app that I am developing?

Different parts of my app have different requirements

System Admins: Managing many systems is hard!

CIO: Total Cost of Ownership (TCO)?

Page 16: Demystifying Systems for Interactive and Real-time Analytics

Numbers make decisions easier

Page 17: Demystifying Systems for Interactive and Real-time Analytics

Need benchmarks

Page 18: Demystifying Systems for Interactive and Real-time Analytics

One Approach

Develop a benchmark per system category

Categorize systems

Page 19: Demystifying Systems for Interactive and Real-time Analytics

Useful, But …

System categories: MPP DB, Columnar, MapReduce, Mixed, Dataflow, Streaming, Text Analytics, Array DB, Graph, Multi-tenant

Existing benchmarks:

• Star Schema Benchmark
• TPC-H / TPC-DS
• Counting triangles
• Terasort
• GridMix
• SWIM
• HiBench
• DFSIO
• MapReduce vs. Parallel DB / Hive Benchmark (in HiBench) / Berkeley Big Data Benchmark
• Yahoo Cloud Serving Benchmark (YCSB)
• YCSB variants
• CH-benCHmark
• MulTe
• Graph 500
• PageRank
• RDF benchmarks
• Information Extraction Benchmark
• Linear Road
• SS-DB

Page 20: Demystifying Systems for Interactive and Real-time Analytics

Problem #1: May Miss the Big Picture

Page 21: Demystifying Systems for Interactive and Real-time Analytics

Problem #1: May Miss the Big Picture

Cannot capture the complexities and end-to-end behavior of big data applications and deployments:

(i) Bottlenecks
(ii) Data conversion, transfer, & loading overheads
(iii) Storage costs & other parts of the data life-cycle
(iv) Resource management challenges
(v) Total Cost of Ownership (TCO)

Page 22: Demystifying Systems for Interactive and Real-time Analytics

Problem #2

Give a man a fish and you will feed him for a day. (Benchmark)

Give him fishing gear and you will feed him for life. (Benchmark Generator)

-- Anonymous

Page 23: Demystifying Systems for Interactive and Real-time Analytics

BigFrame: A Benchmark Generator for Big Data Analytics

Page 24: Demystifying Systems for Interactive and Real-time Analytics

How a user uses BigFrame

[Figure: the user works through the BigFrame Interface to produce a bigif (benchmark input format); the Benchmark Generator turns the bigif into a bspec (benchmark specification); the Benchmark Driver for the System Under Test (HBase, Hive, MapReduce) runs the benchmark and returns results.]

Page 25: Demystifying Systems for Interactive and Real-time Analytics

bspec: Benchmark Specification

1. Data for initial load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics

Figure axis: Time

System Under Test: HBase, Hive, MapReduce
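The four parts of a bspec listed on this slide could be held in a small record like the one below. This is only a sketch for illustration: the class name `BenchmarkSpec`, its field names, and the validation rules are assumptions, not BigFrame's actual format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BenchmarkSpec:
    """Hypothetical in-memory form of a bspec; all names are illustrative."""

    # 1. Datasets to load before the run starts.
    initial_load: list[str]
    # 2. Refresh events as (seconds from start, dataset refreshed).
    refresh_pattern: list[tuple[float, str]] = field(default_factory=list)
    # 3. One ordered list of queries per stream.
    query_streams: list[list[str]] = field(default_factory=list)
    # 4. Names of the evaluation metrics to report.
    metrics: list[str] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the specification is incomplete or inconsistent."""
        if not self.initial_load:
            raise ValueError("a bspec needs data for the initial load")
        times = [t for t, _ in self.refresh_pattern]
        if times != sorted(times):
            raise ValueError("refresh events must be ordered by time")
        if not self.query_streams or any(not s for s in self.query_streams):
            raise ValueError("every query stream needs at least one query")
        if not self.metrics:
            raise ValueError("at least one evaluation metric is required")


spec = BenchmarkSpec(
    initial_load=["Item", "Customer"],
    refresh_pattern=[(0.0, "Web_sales"), (60.0, "Tweets")],
    query_streams=[["q1", "q2"], ["q3"]],
    metrics=["latency_seconds"],
)
spec.validate()  # passes silently for a well-formed specification
```

Keeping the refresh pattern as time-ordered events mirrors the time axis in the slide's figure; a real specification would likely carry far more detail per part.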

Page 26: Demystifying Systems for Interactive and Real-time Analytics

What does the user (want to) specify?

BigFrame Interface: bigif (benchmark input format)

Page 27: Demystifying Systems for Interactive and Real-time Analytics

The 3Vs

Volume, Variety, Velocity

Page 28: Demystifying Systems for Interactive and Real-time Analytics

bigif: BigFrame’s Input Format

Data Variety: relational, text, array, graph
Data Volume: small, medium, large
Data Velocity: at rest, slow, fast
Query Variety: micro, macro
Query Volume: query concurrency & classes
Query Velocity: exploratory, continuous
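One way to picture a bigif is as a typed record with one field per dimension on this slide. The sketch below is an assumption made for illustration: the enum and class names are hypothetical, and only the dimension values come from the slide.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class DataVariety(Enum):
    RELATIONAL = "relational"
    TEXT = "text"
    ARRAY = "array"
    GRAPH = "graph"


class DataVolume(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


class DataVelocity(Enum):
    AT_REST = "at rest"
    SLOW = "slow"
    FAST = "fast"


class QueryVariety(Enum):
    MICRO = "micro"
    MACRO = "macro"


class QueryVelocity(Enum):
    EXPLORATORY = "exploratory"
    CONTINUOUS = "continuous"


@dataclass(frozen=True)
class BigIf:
    """Hypothetical bigif record: one field per {Data, Query} x 3V dimension."""

    data_variety: frozenset[DataVariety]
    data_volume: DataVolume
    data_velocity: DataVelocity
    query_variety: QueryVariety
    query_velocity: QueryVelocity
    # Query Volume is expressed as a concurrency level plus class labels.
    query_concurrency: int = 1
    query_classes: tuple[str, ...] = ()


example = BigIf(
    data_variety=frozenset({DataVariety.RELATIONAL, DataVariety.TEXT}),
    data_volume=DataVolume.LARGE,
    data_velocity=DataVelocity.FAST,
    query_variety=QueryVariety.MACRO,
    query_velocity=QueryVelocity.CONTINUOUS,
    query_concurrency=4,
    query_classes=("Interactive", "Batch"),
)
```

Data Variety is a set because a single application can mix several kinds of data, while the other dimensions take a single value each.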

Page 29: Demystifying Systems for Interactive and Real-time Analytics

Benchmark Generation

bigif (benchmark input format) → Benchmark Generator → bspec (benchmark specification)

bigif describes points in a discrete space of {Data, Query} × {Variety, Volume, Velocity}

bspec:
1. Initial data to load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics

Benchmark generation can be addressed as a search problem within a rich application domain
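To make the "discrete space" and "search problem" framing concrete, the sketch below first enumerates the six dimension names as a cross product, then treats generation as a filter over a catalog of candidate queries. The catalog, the query names, and `select_queries` are invented for illustration and say nothing about how BigFrame actually searches.

```python
from itertools import product

# The six bigif dimensions are the cross product named on the slide.
DIMENSIONS = [
    f"{subject} {aspect}"
    for subject, aspect in product(("Data", "Query"), ("Variety", "Volume", "Velocity"))
]

# Hypothetical catalog: each candidate query is tagged with the
# kinds of data it needs from the application domain.
CATALOG = {
    "total_sales_per_item": {"relational"},
    "tweet_sentiment_per_item": {"relational", "text"},
    "influencer_rank": {"graph"},
}


def select_queries(requested_varieties):
    """Return the catalog queries whose data needs are covered by the request."""
    return sorted(
        name for name, needs in CATALOG.items() if needs <= set(requested_varieties)
    )


print(DIMENSIONS)
print(select_queries({"relational"}))
print(select_queries({"relational", "text"}))
```

A request for relational data alone selects only the sales query; adding text also admits the sentiment query, since its needs are then fully covered.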

Page 30: Demystifying Systems for Interactive and Real-time Analytics

Application Domain Modeled Currently

E-commerce sales, promotions, recommendations

Social media sentiment & influence

Benchmark generation can be addressed as a search problem within a rich application domain

Page 31: Demystifying Systems for Interactive and Real-time Analytics

Application Domain Modeled Currently

Item

Customer

Web_sales

Promotion

Tweets

Relationships

Page 32: Demystifying Systems for Interactive and Real-time Analytics

Application Domain Modeled Currently

Item

Web_sales

Promotion

Page 33: Demystifying Systems for Interactive and Real-time Analytics

Application Domain Modeled Currently

Page 34: Demystifying Systems for Interactive and Real-time Analytics

Benchmark Generation

bigif (benchmark input format) → Benchmark Generator → bspec (benchmark specification)

bigif describes points in a discrete space of {Data, Query} × {Variety, Volume, Velocity}

bspec:
1. Initial data to load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics

BigFrame can generate Data, Queries, and Arrival Patterns with the user-specified {Variety, Volume, Velocity} requirements from the application domain

Page 35: Demystifying Systems for Interactive and Real-time Analytics

Use Cases of BigFrame

Page 36: Demystifying Systems for Interactive and Real-time Analytics

Use Case I: Exploratory BI

• Large volumes of relational data
• Mostly aggregation and few joins
• Can Spark’s performance match that of an MPP DB?

Data Variety = {Relational}
Query Variety = Micro

BigFrame will generate a benchmark specification containing relational data and (SQL-ish) queries

Page 37: Demystifying Systems for Interactive and Real-time Analytics

Use Case II: Complex BI

• Large volumes of relational data
• Even larger volumes of text data
• Combined analytics

Data Variety = {Relational, Text}
Query Variety = Macro (application-focused instead of micro-benchmarking)

BigFrame will generate a benchmark specification that includes sentiment analysis tasks over tweets

Page 38: Demystifying Systems for Interactive and Real-time Analytics

Use Case III: Dashboards

• Large volume and velocity of relational and text data
• Continuously-updated dashboards

Data Velocity = Fast
Query Velocity = Continuous (as opposed to Exploratory)

BigFrame will generate a benchmark specification that includes data refresh as well as continuous queries whose results change upon data refresh
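A continuous query whose result changes on every data refresh can be pictured as a small stateful aggregate. The sketch below is a minimal illustration under that assumption; the class name and the (item, amount) batch shape are hypothetical and not taken from BigFrame.

```python
from collections import defaultdict


class ContinuousSalesTotal:
    """Running sales total per item, re-evaluated on every data refresh."""

    def __init__(self):
        self.totals = defaultdict(float)

    def on_refresh(self, batch):
        """Fold a refresh batch of (item, amount) pairs into the result."""
        for item, amount in batch:
            self.totals[item] += amount
        # Return a plain snapshot so callers cannot mutate internal state.
        return dict(self.totals)


dashboard = ContinuousSalesTotal()
first = dashboard.on_refresh([("book", 2.0), ("lamp", 1.0)])
second = dashboard.on_refresh([("book", 3.0)])
```

After the second refresh the total for `book` is 5.0 while `lamp` stays at 1.0. That is the behavior a dashboard benchmark has to exercise: the result is updated incrementally as data arrives instead of being recomputed on demand.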

Page 39: Demystifying Systems for Interactive and Real-time Analytics

Use Case IV: Does One Size Fit All?

• Growing set of applications have to process relational, text, & graph data
• Compose “best of breed” systems or use a “one size fits all” system?

Data Variety = {Relational, Text, Graph}
Query Variety = Macro

BigFrame will generate a benchmark specification that includes composite workflows with relational, text, and graph analytics

Page 40: Demystifying Systems for Interactive and Real-time Analytics

Use Case V: Multi-tenancy and SLAs

• Big data deployments are increasingly multi-tenant and need to meet SLAs

Specified through the Query Volume dimension

BigFrame can generate a benchmark specification containing a specified number of concurrent query streams with class labels for queries (e.g., Batch, Interactive, or Streaming)
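As a rough picture of "a specified number of concurrent query streams with class labels", the sketch below builds labeled streams in plain Python. The function name, the round-robin labeling, and the query identifiers are assumptions for illustration only; the class labels are the ones named on the slide.

```python
QUERY_CLASSES = ("Batch", "Interactive", "Streaming")


def make_query_streams(num_streams, queries_per_stream):
    """Build num_streams streams, assigning class labels round-robin."""
    streams = []
    for stream_id in range(num_streams):
        label = QUERY_CLASSES[stream_id % len(QUERY_CLASSES)]
        stream = [
            {"stream": stream_id, "class": label, "query": f"q{stream_id}_{i}"}
            for i in range(queries_per_stream)
        ]
        streams.append(stream)
    return streams


streams = make_query_streams(num_streams=4, queries_per_stream=3)
labels = [stream[0]["class"] for stream in streams]
```

With four streams the labels come out as Batch, Interactive, Streaming, Batch. A benchmark driver could use the label to pick which SLA to check for each stream.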

Page 41: Demystifying Systems for Interactive and Real-time Analytics

Working with the Community

• First release of BigFrame planned for August 2013
• With feedback from benchmark developers (BigBench)
• Open-source with extensibility APIs
• Benchmark Drivers for more systems
• Utilities (accessed through the Benchmark Driver to drill down into system behavior during benchmarking)
• Instantiate the BigFrame pipeline for more app domains

Page 42: Demystifying Systems for Interactive and Real-time Analytics

Take Away

• “Benchmarks shape a field (for better or worse) …”
  -- David Patterson, Univ. of California, Berkeley
• Benchmarks meet different needs for different people
  • End customers, application developers, system designers, system administrators, researchers, CIOs
• BigFrame helps users generate benchmarks that best meet their needs