![Page 1: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/1.jpg)
Shark: SQL and Rich Analytics at Scale
Presented by
Kirti Dighe, Drushti Gawade
![Page 2: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/2.jpg)
What is Shark?
- A new data analysis system
- Built on top of Spark and its RDD abstraction
- Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc.)
- Speedups of up to 100x over Hive
- Supports low-latency, interactive queries through in-memory computation
- Supports both SQL and complex analytics such as machine learning
![Page 3: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/3.jpg)
Shark Architecture
- Used to query an existing Hive warehouse without modification
- Returns results much faster
Figure: Diagram of the architecture
![Page 4: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/4.jpg)
Spark
Features Shark inherits from Spark:
- Supports general computation
- Provides an in-memory storage abstraction: the RDD
- Engine is optimized for low latency
Shark's extensions to the engine:
- Support for partial DAG execution
- Optimization of join algorithms
![Page 5: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/5.jpg)
RDD
Spark's main abstraction: the RDD
- A collection stored in an external storage system, or a derived dataset
- Contains arbitrary data types
Benefits of RDDs
- Data is returned at the speed of DRAM
- Use of lineage
- Speedy recovery
- Immutable: a foundation for relational processing
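
The lineage and recovery points above can be illustrated with a small sketch. This is not the Spark API: `ToyRDD`, its `cache` dictionary, and the two-partition dataset are hypothetical names chosen for illustration, and everything runs in a single process. The idea shown is that a derived dataset remembers its parent and the function applied to it, so a lost partition can be rebuilt on its own instead of being restored from a replica.

```python
class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source   # base data: a list of partitions (each a list)
        self.parent = parent   # lineage: the dataset this one is derived from
        self.fn = fn           # lineage: the per-record function that was applied
        self.cache = {}        # partition index -> records held in memory

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def map(self, fn):
        # Datasets are immutable: a transformation returns a new dataset.
        return ToyRDD(parent=self, fn=fn)

    def compute(self, i):
        """Return partition i, rebuilding it from lineage if it is not cached."""
        if self.parent is None:
            return list(self.source[i])
        if i not in self.cache:
            self.cache[i] = [self.fn(x) for x in self.parent.compute(i)]
        return self.cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD(source=[[1, 2], [3, 4]])
squares = base.map(lambda x: x * x)

before = squares.collect()   # materializes and caches both partitions
del squares.cache[1]         # simulate losing the worker that held partition 1
after = squares.collect()    # only partition 1 is recomputed from its parent
```

Because each partition is recomputed independently, lost partitions could in principle be rebuilt on different workers at the same time, which is what makes parallel recovery possible.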
![Page 6: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/6.jpg)
Fault tolerance guarantees
- Shark can tolerate the loss of any set of worker nodes.
- Recovery is parallelized across the cluster.
- The deterministic nature of RDDs also enables straggler mitigation.
- Recovery works even in queries that combine SQL and machine learning UDFs.
![Page 7: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/7.jpg)
Executing SQL over RDDs
The process of executing SQL queries includes:
1. Query parsing
2. Logical plan generation
3. Physical plan generation
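
The three stages can be sketched as a toy pipeline. This is only an illustration, not Shark's or Hive's compiler: the functions `parse`, `logical_plan`, and `physical_plan`, the single supported query shape, and the in-memory `tables` dictionary are all assumptions made for the example. The physical plan here is a function over partitions of rows, standing in for a chain of transformations on distributed datasets.

```python
import re


def parse(sql):
    """Stage 1: turn one very restricted query shape into a small syntax tree."""
    pattern = r"SELECT (\w+) FROM (\w+) WHERE (\w+) > (\d+);?"
    m = re.fullmatch(pattern, sql.strip(), re.IGNORECASE)
    if m is None:
        raise ValueError("unsupported query")
    col, table, filter_col, bound = m.groups()
    return {"select": col, "from": table, "where": (filter_col, int(bound))}


def logical_plan(tree):
    """Stage 2: list logical operators from the data source upward."""
    return [("scan", tree["from"]), ("filter", tree["where"]), ("project", tree["select"])]


def physical_plan(plan, tables):
    """Stage 3: bind each operator to a concrete function over partitions."""
    def run():
        parts = []
        for op, arg in plan:
            if op == "scan":
                parts = tables[arg]
            elif op == "filter":
                col, bound = arg
                parts = [[row for row in p if row[col] > bound] for p in parts]
            elif op == "project":
                parts = [[row[arg] for row in p] for p in parts]
        return [value for p in parts for value in p]
    return run


# Hypothetical table split into two partitions.
tables = {
    "rankings": [
        [{"pageURL": "a", "pageRank": 5}],
        [{"pageURL": "b", "pageRank": 50}],
    ]
}
query = "SELECT pageURL FROM rankings WHERE pageRank > 10"
result = physical_plan(logical_plan(parse(query)), tables)()
```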
![Page 8: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/8.jpg)
Engine extension
Partial DAG execution (PDE)
- Static query optimization
- Dynamic query optimization
- Modification of statistics
Examples of statistics:
- Partition size and record count
- List of "heavy hitters"
- Approximate histogram
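
As a rough illustration of the statistics listed above, the sketch below computes them for one partition of integer keys. The function name `partition_stats`, its parameters, and the fixed-width bucketing are assumptions for the example, not Shark's implementation; values are assumed to lie between `lo` and `hi`, and the byte size is only an approximation based on Python object sizes.

```python
import sys
from collections import Counter


def partition_stats(records, top_k=2, buckets=4, lo=0, hi=100):
    """Collect per-partition statistics a dynamic optimizer could inspect."""
    counts = Counter(records)
    width = (hi - lo) / buckets
    histogram = [0] * buckets
    for r in records:
        # Clamp so that a value equal to `hi` falls into the last bucket.
        idx = min(int((r - lo) / width), buckets - 1)
        histogram[idx] += 1
    return {
        "record_count": len(records),
        "size_bytes": sum(sys.getsizeof(r) for r in records),  # approximate
        "heavy_hitters": [key for key, _ in counts.most_common(top_k)],
        "histogram": histogram,
    }


stats = partition_stats([5, 5, 5, 30, 99, 5, 30])
```

Statistics like these are small compared with the data itself, so they can be gathered while a stage runs and used to adjust the plan for later stages.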
![Page 9: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/9.jpg)
Join Optimization
- Skew handling and degree of parallelism
- Task scheduling overhead
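
A minimal sketch of two run-time decisions that such statistics make possible. Both functions are hypothetical: `choose_join` picks a map (broadcast) join when one input is observed to be small, using a made-up 32 MB threshold, and `coalesce` uses a greedy bin-packing heuristic to spread fine-grained partitions over a chosen number of reducers so that skewed partitions do not pile up on one task.

```python
import heapq


def choose_join(left_bytes, right_bytes, broadcast_limit=32 * 1024 * 1024):
    """Pick a join strategy from observed input sizes (threshold is an assumption)."""
    if min(left_bytes, right_bytes) <= broadcast_limit:
        return "map_join"      # ship the small side to every task
    return "shuffle_join"      # repartition both sides by the join key


def coalesce(partition_sizes, num_reducers):
    """Greedy bin packing: place the largest partitions first on the lightest reducer."""
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}
    by_size = sorted(enumerate(partition_sizes), key=lambda item: -item[1])
    for pid, size in by_size:
        load, reducer = heapq.heappop(heap)
        assignment[reducer].append(pid)
        heapq.heappush(heap, (load + size, reducer))
    return assignment


sizes = [90, 10, 40, 40, 20]
plan = coalesce(sizes, num_reducers=2)
loads = {r: sum(sizes[p] for p in pids) for r, pids in plan.items()}
```

Using fewer, balanced reducers also reduces the number of tasks to launch, which relates to the task scheduling overhead mentioned above.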
![Page 10: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/10.jpg)
Columnar Memory Store
- Simply caching records as JVM objects is insufficient
- Shark employs column-oriented storage; a partition of columns is one MapReduce "record"
- Benefits: compact representation, CPU-efficient compression, cache locality
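
The difference between caching row objects and storing columns can be sketched in a few lines. This is an analogy in Python rather than the JVM: the `rows` data, the column names, and the helper functions are assumptions for illustration. Numeric columns are packed into typed arrays, and a low-cardinality string column is run-length encoded as one example of a cheap compression scheme.

```python
from array import array


def run_length_encode(values):
    """Compress consecutive repeats into [value, count] pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded


def to_columns(rows):
    """Turn a list of row tuples into one compact structure per column."""
    return {
        "id": array("q", (r[0] for r in rows)),        # 64-bit ints, contiguous
        "revenue": array("d", (r[1] for r in rows)),   # 64-bit floats, contiguous
        "country": run_length_encode([r[2] for r in rows]),
    }


# Row-oriented form: one object per record, one object per field.
rows = [(1, 3.5, "US"), (2, 1.0, "US"), (3, 7.25, "IN")]

columns = to_columns(rows)
total_revenue = sum(columns["revenue"])   # scans a single contiguous column
```

An aggregate over one column touches only that column's array, which is the cache-locality benefit named above.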
![Page 11: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/11.jpg)
Machine learning support
Shark supports machine learning as a first-class citizen.
The programming model is designed to express machine learning algorithms:
1. Language Integration
Shark allows queries to perform logistic regression over a user database.
Ex: A data analysis pipeline that performs logistic regression over a database.
![Page 12: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/12.jpg)
2. Execution Engine Integration
A common abstraction allows machine learning computation and SQL queries to share workers and cached data.
Enables end-to-end fault tolerance.
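
The pipeline described under Language Integration, a SQL query whose result feeds logistic regression, could look roughly like the sketch below. It is not Shark code: Python's built-in `sqlite3` stands in for the warehouse, and the `users` table, its columns, the learning rate, and the step count are all invented for the example. The point is that the query result stays in memory and is reused by every training iteration.

```python
import math
import sqlite3

# Hypothetical user table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (age REAL, visits REAL, clicked INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(0.1, 0.2, 0), (0.2, 0.1, 0), (0.8, 0.9, 1), (0.9, 0.7, 1)],
)

# Select the data of interest with SQL.
rows = conn.execute("SELECT age, visits, clicked FROM users").fetchall()

# Extract feature vectors (with a bias term) and labels.
features = [(1.0, age, visits) for age, visits, _ in rows]
labels = [clicked for _, _, clicked in rows]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, steps=2000, lr=0.5):
    """Batch gradient descent for logistic regression over in-memory data."""
    w = [0.0] * len(features[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(features) for wi, g in zip(w, grad)]
    return w


def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x))) > 0.5


weights = train(features, labels)
predictions = [predict(weights, x) for x in features]
```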
![Page 13: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/13.jpg)
Implementation
How to improve query processing speed:
- Minimize tail latency
- Minimize the CPU cost of processing each row
Engineering details:
- Memory-based shuffle
- Temporary object creation
- Bytecode compilation of expression evaluation
![Page 14: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/14.jpg)
Experiments
Evaluation of Shark using four datasets:
- Pavlo et al. Benchmark: 2.1 TB of data reproducing Pavlo et al.'s comparison of MapReduce vs. analytical DBMSs [25].
- TPC-H Dataset: 100 GB and 1 TB datasets generated by the DBGEN program [29].
- Real Hive Warehouse: 1.7 TB of sampled Hive warehouse data from an early industrial user of Shark.
- Machine Learning Dataset: 100 GB synthetic dataset to measure the performance of machine learning algorithms.
Shark performs up to 100x faster than Hive.
![Page 15: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/15.jpg)
Methodology and cluster setup
- Amazon EC2 with 100 m2.4xlarge nodes
- 8 virtual cores
- 68 GB of memory
- 1.6 TB of local storage
Pavlo et al. Benchmarks
- 1 GB/node rankings table
- 20 GB/node uservisits table
Selection Query (cluster index)

SELECT pageURL, pageRank
FROM rankings WHERE pageRank > X;
![Page 16: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/16.jpg)
Aggregation Queries

SELECT sourceIP, SUM(adRevenue)
FROM uservisits GROUP BY sourceIP;

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 7);
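
To see what the two aggregation queries compute, they can be run against a tiny stand-in table. The sketch uses Python's built-in `sqlite3` rather than Shark or Hive, the four rows are invented, and an `ORDER BY` clause is added only to make the output order deterministic. The first query groups by full source IP; the second groups by a seven-character prefix, which yields far fewer groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uservisits (sourceIP TEXT, adRevenue REAL)")
conn.executemany(
    "INSERT INTO uservisits VALUES (?, ?)",
    [
        ("10.0.0.1", 1.0),
        ("10.0.0.1", 2.5),
        ("10.0.0.2", 4.0),
        ("192.168.1.9", 0.5),
    ],
)

# Aggregation over the full source IP: one group per distinct address.
by_ip = conn.execute(
    "SELECT sourceIP, SUM(adRevenue) FROM uservisits "
    "GROUP BY sourceIP ORDER BY sourceIP"
).fetchall()

# Aggregation over a 7-character prefix: addresses collapse into fewer groups.
by_prefix = conn.execute(
    "SELECT SUBSTR(sourceIP, 1, 7) AS prefix, SUM(adRevenue) FROM uservisits "
    "GROUP BY SUBSTR(sourceIP, 1, 7) ORDER BY prefix"
).fetchall()
```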
![Page 17: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/17.jpg)
Join Query

SELECT INTO Temp sourceIP, AVG(pageRank), SUM(adRevenue) as totalRevenue
FROM rankings AS R, uservisits AS UV
WHERE R.pageURL = UV.destURL
AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
GROUP BY UV.sourceIP;

Figure: Join query runtime from Pavlo benchmark
Figure: Join strategies chosen by optimizers
![Page 18: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/18.jpg)
Data Loading
Shark can query data in HDFS directly, which means its data ingress rate is at least as fast as Hadoop's.
Micro-Benchmarks
Aggregation performance

SELECT [GROUP_BY_COLUMN], COUNT(*)
FROM lineitem GROUP BY [GROUP_BY_COLUMN]
![Page 19: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/19.jpg)
Join selection at runtime
Fault tolerance
Measuring Shark's performance in the presence of node failures: simulate failures and measure query performance before, during, and after failure recovery.
![Page 20: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/20.jpg)
Real Hive warehouse
1. Query 1 computes summary statistics in 12 dimensions for users of a specific customer on a specific day.
2. Query 2 counts the number of sessions and distinct customer/client combinations grouped by countries, with filter predicates on eight columns.
3. Query 3 counts the number of sessions and distinct users for all but 2 countries.
4. Query 4 computes summary statistics in 7 dimensions, grouping by a column and showing the top groups sorted in descending order.
![Page 21: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/21.jpg)
Machine learning algorithms
Compare the performance of Shark against running the same workflow in Hive and Hadoop.
The workflow consisted of three steps:
1) Selecting the data of interest from the warehouse using SQL
2) Extracting features
3) Applying iterative algorithms
- Logistic Regression
- K-Means Clustering
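
For the iterative step, a small k-means sketch shows what "per-iteration runtime" measures: each iteration rescans the same in-memory points, so its cost depends on how quickly that data can be read again. The function name `kmeans`, the 2-D points, and the starting centers are assumptions for the example; it is a plain Lloyd's-algorithm loop, not the implementation used in the experiments.

```python
import time


def kmeans(points, centers, iterations=5):
    """Lloyd's algorithm on 2-D points, timing each pass over the data."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        clusters = [[] for _ in centers]
        for p in points:
            # Assign the point to the nearest center (squared Euclidean distance).
            nearest = min(
                range(len(centers)),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        timings.append(time.perf_counter() - start)
    return centers, timings


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
final_centers, per_iteration_seconds = kmeans(points, centers=[(0, 0), (10, 10)])
```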
![Page 22: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/22.jpg)
Figure: Logistic regression, per-iteration runtime (seconds)
Figure: K-means clustering, per-iteration runtime
![Page 23: Presentaed By Kirti Dighe Drushti Gawade. What is Shark? A new data analysis system Built on the top of the RDD and spark Compatible with Apache Hive](https://reader036.vdocuments.mx/reader036/viewer/2022062511/5517fc5655034693228b4a60/html5/thumbnails/23.jpg)
Conclusion
Shark is a warehouse combining relational queries and complex analytics.
It generalizes MapReduce using both:
1. Traditional database techniques
2. Novel partial DAG execution
Shark is faster than Hive and Hadoop.