variety with big data volume and research works to copeanalysis in the big data era popular big data...
TRANSCRIPT
Research Works to Cope with Big Data Volume and
Variety
Jiaheng LuUniversity of Helsinki, Finland
Big Data: 4Vs
Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html
Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html
Hadoop and Spark platform optimization
Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html
Multi-model databases: quantum framework and category theory
Outline
▶ Big data platform optimization (10 mins)▶ Motivation
▶ Two main principles and approaches
▶ Experimental results
▶ Multi-model databases (10 mins)▶ Overview
▶ Quantum framework
▶ Category theory
5
Optimizing parameters in Hadoop and Spark platforms
Analysis in the Big Data Era
▶ Key to success = Timely and Cost-effective analysis
7
Data Analysis Decision making
Massive data Insights Saving and
Revenue
Analysis in the Big Data Era
▶ Popular big data platform:Hadoop and Spark
▶ Burden on users▶ Responsible for provisioning and configuration
▶ Usually lack expertise to tune the system
8
Analysis in the Big Data Era
▶ Popular big data platform:Hadoop and Spark
▶ Burden on users▶ Responsible for provisioning and configuration
▶ Usually lack expertise to tune the system
9
As a data scientist, I do not know how to improve the efficiency of my job?
Analysis in the Big Data Era
▶ Popular big data platform: Hadoop and Spark▶ Burden on users: provision and tuning▶ Effect of system-tuning for jobs
10
Tuned vs. Default
Running time Often 10x
System resource utilization
Often 10x
Others Well tuned jobs may avoid failures like OOM, out of disk, job time out, etc.
Good performance after tuning
Automatic job optimization toolkit
▶ NOT our goal: Change the codes of the system to improve the efficiency
▶ Our goal: Configure the parameters to achieve good performance
11
Our system is easy to be used in the existing Hadoop and Spark system
▶ Given a MapReduce or Spark job with input data and running cluster, we find the setting of parameters that optimize the execution time of the job. (i.e. minimize the job execution time)
Problem definition
12
Challenges: too many parameters!
13
There are more than 190 parameters in Hadoop!
Two key ideas in job optimizer
14
1.Reduce search space!
190 413
13 parameters we tune
15
16
Four key factors
▶ We identify four key factors (parameters) to model a MapReduce job execution ▶ The number of Map task waves m. (number of Map tasks)
▶ The number of Reduce task waves r. (number of Reduce tasks)
▶ The Map output compression option c. (true or false)
▶ The copy speed in the Shuffle phase v (number of parallel copiers)
17
Cost model ▶ Producer: the time to produce the Map outputs in m
waves
▶ Transporter: the non-overlapped time to transport Map outputs to the Reduce side
▶ Consumer: the time to produce Reduce outputs in r waves
Two key ideas in job optimizer
18
2. Keep everything busy!▶ CPU: map, reduce and compression▶ I/O: sort and merge▶ Network: shuffle
19
Keep map and shuffle parallel
MRTunner approach:
20
Given a new job for tuning
Retrieve the profile data
Search for the four key parameters
Compute the 13 parameters
Configure and run Good
performance after tuning
Architecture:
21
Job Optimizer
Profile query engine
Job Profile
DataProfile
ResourceProfile
Hadoop MR Log HDFS OS/Ganglia
Offline
Online
Profile data▶ Job profile
▶ Selectivity of Map input/output
▶ Selectivity of Reduce input/output
▶ Ratio of Map output compression
▶ Data profile
▶ Data Size
▶ Distribution of input key/value
▶ System profile
▶ Number of machines
▶ Network throughput
▶ Compression/Decompression throughput
▶ Overhead to create a map or reduce task22
23
Experimental evaluation
▶ Performance Comparison ▶ Hadoop-X (Commercial Hadoop): ▶ Starfish: Parameters advised by Starfish▶ MRTuner: Parameters advised by MRTuner
▶ Workloads▶ Terasort ▶ N-gram ▶ Pagerank
24
Effectiveness of MRTuner Job Optimizer
Running time of jobs
For N-gram job, MRTuner obtains more than 20x speedup than Hadoop-X
Commercial Hadoop-X
25
Comparison between Hadoop-X and MRTuner (N-gram)
Hadoop-X MRTuner
Cluster-wide Resource Usage from Ganglia
CPU and Network utilizations are higher.
26
Impact of Parameters on Selected Jobs
MRTuner tuned time: t1Results after changing parameters to Hadoop-X setting: t2The impact: (t2-t1)/t1, then normalize all the impacts
Compression is important
Map task # is important
Reduce task # is important
Ongoing research topics
▶ Efficient job optimization on YARN and Spark
▶ Tune for container size and executor size
27
Multi-model databases: Quantum framework and category theory
Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html
Multi-model databases
Motivation: one application to include multi-model data
An E-commerce example with multi-model data
Sale history
RecommendCustomer
ShoppingCart
Product Catalog
NoSQL database types
Photo downloaded from: http://www.vikramtakkar.com/2015/12/nosql-types-of-nosql-database-part-2.html
Multiple NoSQL databases
Sales Social media Customer
CatalogShopping-cart
MongoDB
MongoDBRedis
MongoDBNeo4j
Multi-model DB
Tabular
RDFXML
Spatial
TextMulti-model DB JSON
• One unified database for multi-model data
Challenge: a new theory foundation
Call for a unified model and theory for multi-model data!
The theory of relations (150 years old) is not adequate to mathematically describe modern (NoSQL) DBMS.
Two possible theory foundations
Quantum framework; Approximate query processing for open field in multi-model databases
Category theory: Exact query processing and schema mapping for close field in multi-model databases
35
Quantum framework
Database is based on the components:Logic (SQL expressions)Algebra (relational algebra)’
The Quantum framework adds quantum probability and quantum algebra.
36
Why not classical probability
Apply three rules on multi-model dataQuantum superpositionQuantum entanglementQuantum inference
37
Unifying multi-model data in Hilbert space
38
XMLRelation
RDF Graph
Use quantum probability to answer the query approximately
Two possible theory foundations
Quantum framework; Approximate query processing for open field in multi-model databases
Category theory: Exact query processing and schema mapping for close field in multi-model databases
39
What is category theory?It was invented in the early 1940’s
Category theory has been proposed as a new foundation for mathematics (to replace set theory)
A category has the ability to compose the arrows associatively
Unified data model
• One unified data model with objects and morphisms
XMLRelation
RDF Graph
Unified language model
Tabular
SPAQLXPath Xquery
SQL
• One unified language model with functors
Transformation
• Natrual transformation between multiple language for multi-model data
Ongoing research topics
▶ Approximate query processing based on quantum framework
▶▶ Multi-model data integration based on
category theory
44
Conclusion
1. Parameter tuning is important for Big data platform like Hadoop and Spark.
2. Emerging two new theoretical foundations on multi-model databases: quantum framework and category theory
45
Reference
▶ Jiaheng Lu, Irena Holubová: Multi-model Data Management: What's New and What's Next? EDBT 2017: 602-605
▶ Jiaheng Lu: Towards Benchmarking Multi-Model Databases. CIDR 2017
▶ Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, Chen Wang: MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs. PVLDB 7(13): 1319-1330 (2014)
46
47