QMapper for Smart Grid: Migrating SQL-based Application to Hive
Yue Wang, Yingzhong Xu, Yue Liu, Jian Chen and Songlin Hu
SIGMOD’15, May 31–June 4, 2015

Content

•Introduction
•System Overview
•Query Rewriting
•Cost Model
•Implementation
•Experiments

Introduction

• High-level query languages built on MapReduce, such as Hive and Pig, are widely used

• Traditional enterprises are hitting performance bottlenecks in their current RDBMS-based infrastructure

• Hive does not yet fully support SQL syntax

• Even when Hive accepts an SQL query written for an RDBMS unchanged, the query may perform poorly on Hadoop

Contributions

•Translate SQL to optimized HiveQL

•A cost model is proposed to reflect the execution time of MapReduce jobs

•An algorithm is designed to reorganize the join structure so as to construct the near-optimal query

SECICS System

•The database holds 20 TB of data in total, and about 30 GB of new data is added every day

•SECICS holds three kinds of data:
▫Meter data: collected by smart meters
▫Archive data: records the detailed archived information of meter data
▫Statistic data: the results of offline batch analysis

Background

• Low data write throughput
▫An RDBMS with complex indexes cannot provide enough write throughput

• Unsatisfactory statistical analysis capability
▫The average processing time reaches 3 to 4 hours

• Weak scalability
▫Scaling out an RDBMS usually requires redesigning the sharding strategy as well as a lot of application logic

• Uncontrollable resource competition

The migration of Stored Procedures

Overview

Four Components

•SQL Interpreter:
▫resolves the SQL query provided by a user and parses it into an Abstract Syntax Tree

•Query Rewriter:
▫a Rule-Based Rewriter (RBR) checks whether a query matches a series of static rules; when it does, new equivalent queries are generated
▫a Cost-Based Optimizer (CBO) then further optimizes the join structure of each query

Four Components

•Statistics Collector
▫collects statistics on the related tables and their columns

•Plan Evaluator
▫receives the queries generated by the RBR that have equivalent join cost

QUERY REWRITING

•Rule-based Rewriter
▫detects SQL clauses that Hive does not support well and transforms them into HiveQL
▫initial rules are invoked first to check whether the query can be rewritten
▫the RBR then traverses the subqueries of each query and applies the rules to them recursively
▫all rewritten queries are generated and sent to the CBO
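
The recursive traversal described above can be sketched in a few lines of Python. This is a toy illustration only, not QMapper's implementation: the `Query` class, the two rule functions and the statement kinds are hypothetical names introduced here, and the sketch returns a single rewritten query rather than the full set of candidates that the RBR sends to the CBO.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Query:
    """Hypothetical stand-in for a parsed query: a statement kind plus nested subqueries."""
    kind: str
    subqueries: Tuple["Query", ...] = ()


Rule = Callable[[Query], Optional[Query]]


def update_rule(q: Query) -> Optional[Query]:
    # Assumed rule: an UPDATE becomes an INSERT OVERWRITE ... SELECT.
    return replace(q, kind="INSERT_OVERWRITE_SELECT") if q.kind == "UPDATE" else None


def exists_rule(q: Query) -> Optional[Query]:
    # Assumed rule: an EXISTS subquery becomes a LEFT OUTER JOIN.
    return replace(q, kind="LEFT_OUTER_JOIN") if q.kind == "EXISTS" else None


RULES: List[Rule] = [update_rule, exists_rule]


def rewrite(q: Query, rules: List[Rule] = RULES) -> Query:
    """Apply the first matching rule to this node, then recurse into its subqueries."""
    for rule in rules:
        rewritten = rule(q)
        if rewritten is not None:
            q = rewritten
            break
    return replace(q, subqueries=tuple(rewrite(s, rules) for s in q.subqueries))


example = Query("UPDATE", (Query("SELECT", (Query("EXISTS"),)),))
result = rewrite(example)
```

Keeping each rule as a small function that returns `None` when it does not match makes it easy to add rules without touching the traversal.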

Example

•lvRate(uid,deviceid,isMissing,date,type)
•dataProfile(dataid,uid,isActive)
•dataRecord(dataid,date,consumption)
•powerCut(uid,date)
•gprsUsage(deviceid,dataid,date,gprs)
•deviceInfo(deviceid,region,type)

Basic UPDATE Rule

• This rule translates an UPDATE into a SELECT statement by moving the simpleCondition into the selectList

• UPDATE lvRate a SET a.isMissing=true LEFT OUTER JOIN dataProfile b ON a.uid=b.uid LEFT OUTER JOIN dataRecord c ON b.dataid=c.dataid AND a.date=c.date WHERE c.dataid IS NULL

• INSERT OVERWRITE TABLE lvRate SELECT a.uid,a.deviceid,IF(c.dataid IS NULL,true,false) as isMissing,a.date,a.type FROM lvRate a LEFT OUTER JOIN dataProfile b ON a.uid=b.uid LEFT OUTER JOIN dataRecord c ON b.dataid=c.dataid AND a.date=c.date
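
The effect of this rule can be checked on a tiny in-memory database. The sketch below uses Python's `sqlite3` module rather than Hive, so several things are assumptions made for illustration: SQLite has no `IF()` or `INSERT OVERWRITE`, so `CASE WHEN` stands in for `IF` and the SELECT result is compared with the table contents after an ordinary UPDATE; the UPDATE is written with a correlated subquery; the sample rows are invented, with one dataProfile row per uid and isMissing initially false everywhere (the rewritten query assigns false to every row that does not meet the condition).

```python
import sqlite3


def update_rule_demo():
    """Return (rows from the rewritten SELECT, rows of lvRate after the UPDATE)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE lvRate(uid, deviceid, isMissing, date, type);
        CREATE TABLE dataProfile(dataid, uid, isActive);
        CREATE TABLE dataRecord(dataid, date, consumption);
        -- Invented sample data: uid 2 has no dataRecord row for that date.
        INSERT INTO lvRate VALUES (1, 'd1', 0, '2014-01-01', 'A'),
                                  (2, 'd2', 0, '2014-01-01', 'A');
        INSERT INTO dataProfile VALUES (10, 1, 1), (20, 2, 1);
        INSERT INTO dataRecord VALUES (10, '2014-01-01', 5.0);
    """)
    # Rewritten form: the condition moves into the select list.
    rewritten = db.execute("""
        SELECT a.uid, a.deviceid,
               CASE WHEN c.dataid IS NULL THEN 1 ELSE 0 END AS isMissing,
               a.date, a.type
        FROM lvRate a
        LEFT OUTER JOIN dataProfile b ON a.uid = b.uid
        LEFT OUTER JOIN dataRecord c ON b.dataid = c.dataid AND a.date = c.date
        ORDER BY a.uid
    """).fetchall()
    # Original intent expressed as a plain UPDATE that SQLite accepts.
    db.execute("""
        UPDATE lvRate SET isMissing = 1
        WHERE NOT EXISTS (
            SELECT 1 FROM dataProfile b
            JOIN dataRecord c ON b.dataid = c.dataid
            WHERE b.uid = lvRate.uid AND c.date = lvRate.date)
    """)
    updated = db.execute(
        "SELECT uid, deviceid, isMissing, date, type FROM lvRate ORDER BY uid"
    ).fetchall()
    db.close()
    return rewritten, updated


rewritten_rows, updated_rows = update_rule_demo()
```

On this sample data both lists contain the same rows, with isMissing set only for uid 2.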

(NOT) EXISTS Rule

• Transforms the subquery into a LEFT OUTER JOIN and replaces the (NOT) EXISTS condition with join column IS (NOT) NULL
▫For a DELETE, the rewritten query keeps the rows that the DELETE would leave in place, so an EXISTS condition becomes join column IS NULL

• DELETE FROM lvRate a WHERE EXISTS (SELECT 1 FROM powerCut b WHERE a.uid=b.uid AND a.date=b.date)

• INSERT OVERWRITE TABLE lvRate SELECT a.uid,a.deviceid,a.isMissing,a.date,a.type FROM lvRate a LEFT OUTER JOIN (SELECT uid,date FROM powerCut) b ON a.uid=b.uid AND a.date=b.date WHERE b.uid IS NULL
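
The equivalence behind this rule, that a NOT EXISTS filter returns the same rows as a LEFT OUTER JOIN followed by an IS NULL test on a join column, can be checked with Python's `sqlite3` module. This is a minimal sketch under stated assumptions: it runs on SQLite rather than Hive, the tables are reduced to the two join columns, and the sample rows are invented.

```python
import sqlite3


def anti_join_demo():
    """Return (rows selected with NOT EXISTS, rows selected with LEFT OUTER JOIN ... IS NULL)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE lvRate(uid, date);
        CREATE TABLE powerCut(uid, date);
        -- Invented sample data: only uid 2 had a power cut on its date.
        INSERT INTO lvRate VALUES (1, '2014-01-01'), (2, '2014-01-01'), (3, '2014-01-02');
        INSERT INTO powerCut VALUES (2, '2014-01-01');
    """)
    with_not_exists = db.execute("""
        SELECT a.uid, a.date FROM lvRate a
        WHERE NOT EXISTS (SELECT 1 FROM powerCut b
                          WHERE a.uid = b.uid AND a.date = b.date)
        ORDER BY a.uid
    """).fetchall()
    with_left_join = db.execute("""
        SELECT a.uid, a.date FROM lvRate a
        LEFT OUTER JOIN (SELECT uid, date FROM powerCut) b
               ON a.uid = b.uid AND a.date = b.date
        WHERE b.uid IS NULL
        ORDER BY a.uid
    """).fetchall()
    db.close()
    return with_not_exists, with_left_join


not_exists_rows, left_join_rows = anti_join_demo()
```

Both queries return the lvRate rows for uid 1 and uid 3, the rows with no matching powerCut entry.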

Cost-based Optimizer

• SELECT sum(gprs), type FROM gprsUsage A JOIN deviceInfo B ON A.deviceid = B.deviceid JOIN dataRecord C ON A.dataid = C.dataid AND A.date = C.date JOIN dataProfile D ON C.dataid = D.dataid LEFT OUTER JOIN powerCut E ON D.uid = E.uid AND A.date = E.date WHERE E.uid IS NULL AND A.date='2014-01-01' GROUP BY B.type

• SELECT sum(gprs), type FROM( SELECT T1.gprs, T1.date, T1.type, T2.uid FROM (SELECT A.gprs, A.dataid, A.date, B.type FROM gprsUsage A JOIN deviceInfo B ON A.deviceid = B.deviceid WHERE A.date='2014-01-01' )T1 JOIN (SELECT C.dataid, C.date, D.uid FROM dataRecord C JOIN dataProfile D ON C.dataid = D.dataid)T2 ON T1.dataid = T2.dataid

Cost-based Optimizer

•Unlike traditional databases, MapReduce-based query processing writes each join's intermediate result back to HDFS, and the next join reads it from HDFS again, which causes large I/O costs

•The main difference in intermediate results is that the left-deep plan generates A ⋈ B ⋈ C while the bushy plan generates C ⋈ D

•Plan B (the bushy plan) may perform worse because its jobs compete for computing resources
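
The I/O argument can be made concrete with a back-of-the-envelope sketch. Everything numeric here is hypothetical: the intermediate-result sizes are invented for illustration, and the only assumption taken from the slide is that each intermediate result is written to HDFS once and read back once by the next join. Both plans produce the join of A and B, so the comparison turns on the join of A, B and C (left-deep) versus the join of C and D (bushy).

```python
def hdfs_io(intermediate_sizes_mb):
    """Total HDFS traffic in MB: each intermediate result is written once and read once."""
    return sum(2 * size for size in intermediate_sizes_mb)


# Hypothetical intermediate-result sizes in MB (not measurements from the paper).
sizes_mb = {"AB": 100, "ABC": 80, "CD": 500}

left_deep_io = hdfs_io([sizes_mb["AB"], sizes_mb["ABC"]])  # A join B, then (A join B) join C
bushy_io = hdfs_io([sizes_mb["AB"], sizes_mb["CD"]])       # A join B and C join D, joined later

cheaper_plan = "left-deep" if left_deep_io < bushy_io else "bushy"
```

With other sizes the comparison can go the other way, which is why the choice is left to a cost model rather than fixed by a rule.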

COST MODEL

•Cost of MapReduce
▫The Map phase can be divided into three subphases: Map, Spill and Merge
▫The Reduce phase also includes three parts: Shuffle, Merge and Reduce

•Map
▫For each Mapper:

Mapper Cost Model

•Spill

•Merge

•Unlike ordinary MapReduce jobs, the internal logic of mappers in Hive may vary depending on the specific table being processed.

Reduce

•In the reduce phase, shuffle is responsible for fetching the mappers' outputs to their corresponding reducers

•Merge

•Reduce

•Total Cost

Cost of Operators in Map and Reduce

•To calculate these costs, a few sample queries based on TPC-H are designed as probes that collect the execution time of operators

•Given a chain of n operators, the cost is evaluated as:

Cost of Workflow

•A HiveQL query is finally compiled into a MapReduce workflow (a directed acyclic graph) in which each node is a single MapReduce job and each edge represents dataflow
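
One simple way to turn per-job costs into a workflow estimate is the longest (critical) path through the DAG. The sketch below is an illustration under a strong assumption, namely that independent jobs run fully in parallel without competing for resources; it is not presented as the workflow cost formula used by QMapper. The job names, costs and dependencies are hypothetical.

```python
from functools import lru_cache


def workflow_cost(job_cost, deps):
    """Critical-path cost of a DAG of jobs.

    job_cost maps a job name to its estimated cost; deps maps a job name to the
    jobs whose output it reads. Assumes independent jobs run fully in parallel.
    """
    @lru_cache(maxsize=None)
    def finish_time(job):
        # A job can start only after every upstream job has finished.
        upstream = max((finish_time(d) for d in deps.get(job, ())), default=0.0)
        return upstream + job_cost[job]

    return max(finish_time(job) for job in job_cost)


# Hypothetical bushy workflow: two independent joins feed a third join, then an aggregation.
costs = {"join_AB": 40.0, "join_CD": 70.0, "join_T1_T2": 30.0, "group_by": 10.0}
edges = {"join_T1_T2": ("join_AB", "join_CD"), "group_by": ("join_T1_T2",)}

estimated_cost = workflow_cost(costs, edges)
```

When parallel jobs do compete for cluster resources, the real running time lies somewhere between this critical-path value and the plain sum of all job costs.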

Experiments

•Evaluate the correctness and efficiency of QMapper

•Compare QMapper with manual translation on two measures: the efficiency of translating SQL into HiveQL and the efficiency of executing the resulting HiveQL

•TPC-H demonstrates the execution efficiency of the HiveQL generated by QMapper

•The Smart Grid application shows the correctness and translation efficiency of QMapper

Join Performance

Scalability

Accuracy of Cost Model

Thanks