
Research Works to Cope with Big Data Volume and Variety

Jiaheng Lu, University of Helsinki, Finland

Big Data: 4Vs

Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html


Hadoop and Spark platform optimization


Multi-model databases: quantum framework and category theory

Outline

▶ Big data platform optimization (10 mins)

  ▶ Motivation

  ▶ Two main principles and approaches

  ▶ Experimental results

▶ Multi-model databases (10 mins)

  ▶ Overview

  ▶ Quantum framework

  ▶ Category theory

Optimizing parameters in Hadoop and Spark platforms

Analysis in the Big Data Era

▶ Key to success = Timely and Cost-effective analysis

Massive data → Data analysis → Insights → Decision making → Savings and revenue


▶ Popular big data platforms: Hadoop and Spark

▶ Burden on users

  ▶ Responsible for provisioning and configuration

  ▶ Usually lack expertise to tune the system



As a data scientist, how can I improve the efficiency of my job?


▶ Popular big data platforms: Hadoop and Spark

▶ Burden on users: provisioning and tuning

▶ Effect of system tuning on jobs

Tuned vs. Default

Running time: often 10x

System resource utilization: often 10x

Others: well-tuned jobs may avoid failures such as OOM, out of disk, and job timeouts

Good performance after tuning

Automatic job optimization toolkit

▶ NOT our goal: change the system's code to improve efficiency

▶ Our goal: configure the parameters to achieve good performance

Our system is easy to use with existing Hadoop and Spark deployments

Problem definition

▶ Given a MapReduce or Spark job, its input data, and the cluster it runs on, find the parameter settings that minimize the job's execution time.

Challenges: too many parameters!


There are more than 190 parameters in Hadoop!

Two key ideas in job optimizer


1. Reduce search space!

190 parameters → 4 key factors → 13 parameters we tune


Four key factors

▶ We identify four key factors (parameters) to model a MapReduce job execution

  ▶ The number of Map task waves m (number of Map tasks)

  ▶ The number of Reduce task waves r (number of Reduce tasks)

  ▶ The Map output compression option c (true or false)

  ▶ The copy speed in the Shuffle phase v (number of parallel copiers)
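As a rough illustration of how the four factors could surface as job settings, the sketch below maps them to Hadoop 2.x configuration properties. The property names are existing Hadoop properties, but the mapping itself (in particular, steering the number of Map tasks through the input split size) and the helper name `key_factors_to_conf` are assumptions made for illustration; the slides do not say which properties are used.

```python
import math

def key_factors_to_conf(input_bytes, num_map_tasks, num_reduce_tasks,
                        compress_map_output, parallel_copies):
    """Translate the four key factors into Hadoop job properties (illustrative)."""
    # Assumption: the number of Map tasks is steered via the input split size.
    split_size = math.ceil(input_bytes / num_map_tasks)
    return {
        "mapreduce.input.fileinputformat.split.minsize": str(split_size),
        "mapreduce.input.fileinputformat.split.maxsize": str(split_size),
        "mapreduce.job.reduces": str(num_reduce_tasks),
        "mapreduce.map.output.compress": str(compress_map_output).lower(),
        "mapreduce.reduce.shuffle.parallelcopies": str(parallel_copies),
    }

# Hypothetical job: 10 GiB of input, 80 Map tasks, 16 Reduce tasks.
conf = key_factors_to_conf(10 * 1024**3, 80, 16, True, 10)
```

Each entry could then be passed to a job as a `-D key=value` option, provided the job's driver parses generic Hadoop options.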


Cost model

▶ Producer: the time to produce the Map outputs in m waves

▶ Transporter: the non-overlapped time to transport Map outputs to the Reduce side

▶ Consumer: the time to produce Reduce outputs in r waves
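The three terms can be read as a sum of phase times. The sketch below is a toy version of such a model, not the MRTuner cost model itself: the per-wave times, the function name `estimate_job_time`, and the rule that copying may overlap with every Map wave except the first are all assumptions chosen to make the idea concrete.

```python
def estimate_job_time(map_waves, reduce_waves, map_wave_secs,
                      reduce_wave_secs, shuffle_secs):
    """Toy producer/transporter/consumer estimate of job time in seconds."""
    # Producer: time to produce the Map outputs in m waves.
    producer = map_waves * map_wave_secs
    # Assumption: copying can start once the first Map wave has finished,
    # so only the part of the shuffle that outlasts the remaining waves counts.
    overlap_window = producer - map_wave_secs
    transporter = max(0.0, shuffle_secs - overlap_window)
    # Consumer: time to produce the Reduce outputs in r waves.
    consumer = reduce_waves * reduce_wave_secs
    return producer + transporter + consumer

# Hypothetical numbers: 4 Map waves of 30 s, 2 Reduce waves of 45 s, 120 s shuffle.
total = estimate_job_time(4, 2, 30.0, 45.0, 120.0)
```

With these made-up inputs, 90 of the 120 shuffle seconds hide behind the later Map waves, so only 30 seconds of transport add to the total.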

Two key ideas in job optimizer


2. Keep everything busy!

▶ CPU: map, reduce and compression

▶ I/O: sort and merge

▶ Network: shuffle


Keep map and shuffle parallel

MRTuner approach:

1. Given a new job for tuning

2. Retrieve the profile data

3. Search for the four key parameters

4. Compute the 13 parameters

5. Configure and run

Good performance after tuning
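The search step in this workflow can be pictured as picking the combination of the four key parameters with the lowest estimated cost. The sketch below uses a plain exhaustive search, which is a stand-in and not necessarily the search strategy MRTuner uses; `search_key_parameters` and `toy_cost` are hypothetical names, and the cost function is invented purely to exercise the search.

```python
from itertools import product

def search_key_parameters(cost_fn, map_waves_options, reduce_waves_options,
                          copier_options):
    """Return the (m, r, c, v) combination with the lowest estimated cost."""
    candidates = product(map_waves_options, reduce_waves_options,
                         (False, True), copier_options)
    return min(candidates, key=lambda params: cost_fn(*params))

def toy_cost(m, r, c, v):
    """Hypothetical cost: per-wave overheads plus a shuffle term."""
    map_cost = 30.0 * m + 120.0 / m
    reduce_cost = 20.0 * r + 90.0 / r
    # Compression shrinks the data to copy but adds a fixed CPU cost.
    shuffle_cost = (80.0 if c else 200.0) / v + (5.0 if c else 0.0)
    return map_cost + reduce_cost + shuffle_cost

best = search_key_parameters(toy_cost, range(1, 6), range(1, 6), (5, 10, 20))
```

In a real optimizer the cost function would be driven by the profile data, and the remaining parameters would be derived from the chosen combination.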

Architecture:

▶ Components: Job Optimizer, Profile query engine

▶ Profiles: Job Profile, Data Profile, Resource Profile

▶ Data sources: Hadoop MR log, HDFS, OS/Ganglia

▶ Phases: Offline, Online

Profile data

▶ Job profile

  ▶ Selectivity of Map input/output

  ▶ Selectivity of Reduce input/output

  ▶ Ratio of Map output compression

▶ Data profile

  ▶ Data size

  ▶ Distribution of input key/value

▶ System profile

  ▶ Number of machines

  ▶ Network throughput

  ▶ Compression/decompression throughput

  ▶ Overhead to create a map or reduce task

Experimental evaluation

▶ Performance comparison

  ▶ Hadoop-X (commercial Hadoop)

  ▶ Starfish: parameters advised by Starfish

  ▶ MRTuner: parameters advised by MRTuner

▶ Workloads

  ▶ Terasort

  ▶ N-gram

  ▶ PageRank

Effectiveness of MRTuner Job Optimizer

Running time of jobs

For the N-gram job, MRTuner achieves a more than 20x speedup over Hadoop-X

Commercial Hadoop-X


Comparison between Hadoop-X and MRTuner (N-gram)

Hadoop-X MRTuner

Cluster-wide Resource Usage from Ganglia

CPU and network utilization are higher with MRTuner.

Impact of Parameters on Selected Jobs

MRTuner tuned time: t1

Result after changing parameters to the Hadoop-X setting: t2

The impact: (t2 - t1) / t1, then normalize all the impacts
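The impact metric can be worked through in a few lines. The sketch below assumes that "normalize" means scaling the impacts so they sum to 1, which the slide does not spell out; the function name `normalized_impacts` and the running times are invented for illustration and are not measurements from the talk.

```python
def normalized_impacts(tuned_time, reverted_times):
    """Compute (t2 - t1) / t1 per parameter and scale the results to sum to 1.

    reverted_times maps a parameter name to t2, the running time after
    reverting that parameter to the Hadoop-X setting.
    """
    raw = {name: (t2 - tuned_time) / tuned_time
           for name, t2 in reverted_times.items()}
    total = sum(raw.values())
    if total == 0:
        return raw  # no parameter made a difference
    # Assumption: normalization divides each impact by the sum of all impacts.
    return {name: impact / total for name, impact in raw.items()}

# Hypothetical running times in seconds, with t1 = 100.
impacts = normalized_impacts(100.0, {
    "compression": 300.0,
    "map tasks": 200.0,
    "reduce tasks": 150.0,
})
```

Here the raw impacts are 2.0, 1.0 and 0.5, so compression accounts for the largest normalized share.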

Compression is important

Map task # is important

Reduce task # is important

Ongoing research topics

▶ Efficient job optimization on YARN and Spark

▶ Tune for container size and executor size


Multi-model databases: Quantum framework and category theory


Multi-model databases

Motivation: a single application often has to handle multi-model data

An E-commerce example with multi-model data

Sales history, Recommendation, Customer, Shopping cart, Product catalog

NoSQL database types

Photo downloaded from: http://www.vikramtakkar.com/2015/12/nosql-types-of-nosql-database-part-2.html

Multiple NoSQL databases

Data: Sales, Social media, Customer, Catalog, Shopping cart

Stores: MongoDB, Redis, Neo4j

Multi-model DB

Tabular, RDF, XML, Spatial, Text, JSON

• One unified database for multi-model data

Challenge: a new theory foundation

Call for a unified model and theory for multi-model data!

The theory of relations (150 years old) is not adequate to mathematically describe modern (NoSQL) DBMS.

Two possible theory foundations

Quantum framework: approximate query processing for open fields in multi-model databases

Category theory: exact query processing and schema mapping for closed fields in multi-model databases


Quantum framework

A database is based on two components:

▶ Logic (SQL expressions)

▶ Algebra (relational algebra)

The quantum framework adds quantum probability and quantum algebra.

Why not classical probability?

Apply three rules on multi-model data:

▶ Quantum superposition

▶ Quantum entanglement

▶ Quantum inference

Unifying multi-model data in Hilbert space

XML, Relation, RDF, Graph

Use quantum probability to answer the query approximately
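The slides do not give the formulation, so the following is only a minimal sketch of one common reading: items and queries are represented as unit vectors in a real-valued Hilbert space, and the probability that an item matches a query is the squared inner product (the Born rule). The vectors and the function names `normalize` and `match_probability` are assumptions made for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so it can act as a state vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def match_probability(query_vec, item_vec):
    """Squared inner product of the two unit vectors (Born rule)."""
    q, d = normalize(query_vec), normalize(item_vec)
    amplitude = sum(a * b for a, b in zip(q, d))
    return amplitude ** 2

# Hypothetical two-dimensional embeddings of a query and a data item.
p = match_probability([1.0, 0.0], [1.0, 1.0])
```

An approximate answer could then rank items from different models by this probability, with orthogonal vectors scoring 0 and parallel vectors scoring 1.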


What is category theory?

It was invented in the early 1940s

Category theory has been proposed as a new foundation for mathematics (to replace set theory)

In a category, arrows (morphisms) compose associatively
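Associative composition can be checked on a small example. In the sketch below, arrows are modeled as plain Python functions between data representations; the three arrows `f`, `g` and `h` and the helper `compose` are hypothetical and only serve to show that both ways of bracketing the composition agree.

```python
def compose(g, f):
    """Return the arrow 'g after f'."""
    return lambda x: g(f(x))

# Three hypothetical arrows between data representations.
f = lambda row: {"name": row[0], "age": row[1]}   # tuple -> document
g = lambda doc: sorted(doc.items())               # document -> key/value pairs
h = lambda pairs: len(pairs)                      # key/value pairs -> count

left = compose(h, compose(g, f))    # h after (g after f)
right = compose(compose(h, g), f)   # (h after g) after f

# Both bracketings give the same result on the same input.
assert left(("Ann", 30)) == right(("Ann", 30))
```

This checks associativity on one input only; in a category it holds for all composable arrows by definition.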

Unified data model

• One unified data model with objects and morphisms

XML, Relation, RDF, Graph

Unified language model

Tabular

SPARQL, XPath, XQuery, SQL

• One unified language model with functors

Transformation

• Natural transformations between multiple languages for multi-model data

Ongoing research topics

▶ Approximate query processing based on quantum framework

▶ Multi-model data integration based on category theory

Conclusion

1. Parameter tuning is important for big data platforms like Hadoop and Spark.

2. Two new theoretical foundations are emerging for multi-model databases: the quantum framework and category theory.


Reference

▶ Jiaheng Lu, Irena Holubová: Multi-model Data Management: What's New and What's Next? EDBT 2017: 602-605

▶ Jiaheng Lu: Towards Benchmarking Multi-Model Databases. CIDR 2017

▶ Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, Chen Wang: MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs. PVLDB 7(13): 1319-1330 (2014)
