tdwi sql on hadoop - sigs.de · pdf filespark, cascading •query engines like pig, hive,...
TRANSCRIPT
Prof. Dr. Jens Albrecht SQL on Hadoop 3
Why SQL for Big Data?
� Mature technology
� Broad knowledge available
� Powerful query language
� High interactive performance
� Many third party tools for data
analysis and visualization
� Flexible data structures
� Semi-structured data� Changing schemas
� Self-Service
� Data integration on-the-fly
� Scalability
� Analysis, integration, volumen
� (Relatively) Low Cost
� Commodity Hardware, Open Source
Prof. Dr. Jens Albrecht SQL on Hadoop 4
Agility
Why SQL for Big Data?
+
Extended
DWHData Lake
Prof. Dr. Jens Albrecht SQL on Hadoop 5
Hadoop-SQL Integration
HDFS
MR / Tez
HiveQL
Hive (Native Hadoop) Pure Hadoop SQL Engines
HDFS
DistributedSQL Engine
HDFS
DistributedSQL Engine
NoSQLHive
Format-agnostic SQL Engines
RDBMS
Relational
RDBMS with Hadoop Access
Hadoop
Prof. Dr. Jens Albrecht SQL on Hadoop 6
Hadoop-SQL Integration
HDFS
MR / Tez
HiveQL
Hive (Native Hadoop) Pure Hadoop SQL Engines
HDFS
DistributedSQL Engine
HDFS
DistributedSQL Engine
NoSQLHive
Format-agnostic SQL Engines
RDBMS
Relational
RDBMS with Hadoop Access
Hadoop
Prof. Dr. Jens Albrecht SQL on Hadoop 7
MapReduce
Hive
Hive
� General
▸Developed initially by Facebook
▸SQL-processing for HDFS and HBase
▸Table definitions in Hive Meta Store
▸Generation of MapReduce Code
▸Schema-on-Read via SerDe
� Advantages
▸Mature part of every Hadoop distribution
▸Simple setup
▸Java-API for UDFs
▸Usage of many data formats via SerDe
� Disadvantages
▸Batch-oriented, slow
Hadoop
HDFS/HBase
MetaStore
HiveQLExecution
SerDe
Prof. Dr. Jens Albrecht SQL on Hadoop 8
Schema-on-Write vs. Schema-on-Read
� Relational Database: Schema-on-Write
� Big Data Processing: Schema-on-Read
Multi-structuredSource Data
ETL
Relational DBMS
SQL
Multi-structuredSource Data
Load as-is
Hadoop
SQL
Schema mapped to
originalfiles
Prof. Dr. Jens Albrecht SQL on Hadoop 9
Schema-on-Read: Hive & CSV
CREATE TABLE gps_data(
userid INT,
deviceid INT,
longitude STRING,
latitude STRING,
utctime STRING)
ROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'STORED AS TEXTFILE;
-- load data = copyLOAD DATA LOCAL INPATH 'new_data/gps.dat'OVERWRITE INTO TABLE gps_data;
SELECT COUNT(*) FROM gps_data;
4711 542815 49.454N 11.077E 10/01/2014@10:00:00UTC
Sample Data
Prof. Dr. Jens Albrecht SQL on Hadoop 10
Schema-on-Read: Hive & Regexp
CREATE TABLE weblog (host STRING,identity STRING,user STRING,time STRING,request STRING,status STRING,size STRING,referer STRING,agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\])
([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)STORED AS TEXTFILE;
Prof. Dr. Jens Albrecht SQL on Hadoop 11
Schema-Read: Hive & JSON
Source: http://thornydev.blogspot.de/2013/07/querying-json-records-via-hive.html
JSON Data Relational Mapping in Hive
Prof. Dr. Jens Albrecht SQL on Hadoop 12
MapReduce
Hive
Hive on Tez (Stinger)
� Stinger-Initiative
▸"Make Hive 100x faster"
▸Finished with Hive 0.13 (April 2014)
▸ Replace MapReduce with Tez▸Native columnar data format (ORC)
� Stinger.Next
▸Phase 1: Hive 0.14 (November 2014)
▸ACID transactions
▸Phase 2: (Q2 2015)
▸ Subsecond Queries mit LLAP▸Machine Learning Integration
▸Phase 3: (Q4 2015)
▸ SQL:2011 Analytic Functions▸Materialized Views
Hadoop
HDFS/HBase
Tez / Yarn
MetaStore
HiveQLExecution
SerDe
Prof. Dr. Jens Albrecht SQL on Hadoop 13
Hadoop-SQL Integration
HDFS
MR / Tez
HiveQL
Hive (Native Hadoop) Pure Hadoop SQL Engines
HDFS
DistributedSQL Engine
HDFS
DistributedSQL Engine
NoSQLHive
Format-agnostic SQL Engines
RDBMS
Relational
RDBMS with Hadoop Access
Hadoop
Prof. Dr. Jens Albrecht SQL on Hadoop 14
Distributed SQL Engine
Pure Hadoop SQL Engines
� Approach
▸Distributed, parallel SQL engine
▸Often usage of Hive Metadata
▸Support of optimized data formats
▸Hadoop as mandatory basis
� Advantages and Disadvantages
▸Significantly faster as Hive
▸Low latency through dedicated engine
▸Operator pipelining and result caching
� Diffentiation of solutions
▸Supported SQL functionality
▸Point querying
▸Cost-based optimizer / performance
▸Transaction support
HDFS/HBase
LocalAgent
LocalAgent
LocalAgent
LocalAgent
SQL Query Coordination
DataFiles
DataFiles
DataFiles
DataFiles
Prof. Dr. Jens Albrecht SQL on Hadoop 15
Distributed SQL Engine
Pure Hadoop SQL Engines
HDFS/HBase
LocalAgent
LocalAgent
LocalAgent
LocalAgent
SQL Query Coordination
DataFiles
DataFiles
DataFiles
DataFiles
Big Insights
Prof. Dr. Jens Albrecht SQL on Hadoop 16
Pure Hadoop SQL Engines
Example: IBM Big SQL
Source: IBM
Prof. Dr. Jens Albrecht SQL on Hadoop 17
Apache Spark & Spark SQL
� General
▸SQL engine based on Spark
▸Data access via data frames (former SchemaRDD)
▸In-Memory columnar format
▸HDFS / HBase as file format
� Advantages
▸Spark as general-purpose parallel computing framework
▸Support of Hive extensions like UDFs and SerDes and Hive metadata
� Disadvantages
▸Not yet fully mature
▸Not yet as fast as competitors
Spark SQL
HDFS/HBase
Apache Spark
Schema RDD
SQLExecution
SerDe
Prof. Dr. Jens Albrecht SQL on Hadoop 18
Apache Spark
� Distributed In-Memory Computing Framework
▸Data caching
▸General framework for all kinds of SQL and non-SQL analytics
▸Support for out-of-the box libraries as well as Java, Python and Scala in the same engine and for the same data
▸New datasources API allows to write plugins for non-Hadoop sources
Spark Execution Engine
Hadoop
ZooKeeperYARN
(optional)HDFS
Spark SQLSpark
Streaming
Graph Computation
(GraphX)
Machine Learning(MLlib)
Prof. Dr. Jens Albrecht SQL on Hadoop 19
Apache Spark
lines = spark.textFile("hdfs://...") errors = lines.filter(_.startsWith(ERROR)) errors.persist()
// Return the time fields of errors mentioning// assuming time is field number 3 in a tsv filehdfs_errors = errors.filter(_.contains(HDFS))time_fields = hdfs_errors.map(_.split(’\t’)(3)).collect()
Source: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Action
Resilient Distributed Data Sets(RDDs)
Transformations
Prof. Dr. Jens Albrecht SQL on Hadoop 20
Spark Transformations
� Transformations: Create a new RDD from an existing one
▸Lazy evaluation – results are not materialized
▸Much more than map reduce
� Actions: Return a value or dataset to the calling program
map filter samplegroupByKey sortByKey reduceByKeyunion pipe repartitionjoin leftOuterJoin rightOuterJoin…
reduce collect countfirst take(n) saveAsTextFile…
Prof. Dr. Jens Albrecht SQL on Hadoop 22
Adding Schema to RDDs
Source: http://de.slideshare.net/jeykottalam/spark-sqlamp-camp2014
Prof. Dr. Jens Albrecht SQL on Hadoop 23
Hadoop-SQL Integration
HDFS
MR / Tez
HiveQL
Hive (Native Hadoop) Pure Hadoop SQL Engines
HDFS
DistributedSQL Engine
HDFS/Hive
DistributedSQL Engine
NoSQLSQL
Format-agnostic SQL Engines
RDBMS
Relational
RDBMS with Hadoop Access
Hadoop
Prof. Dr. Jens Albrecht SQL on Hadoop 24
Distributed SQL Engine
SQL-Engine with Pluggable Storage
� Ansatz
▸Verteilte, parallele SQL-Engine
▸Oftmals Nutzung von Hive Metadaten
▸Unterstützung optimierter Dateiformate
▸Hadoop obligatorisch als Basis
� Allgemeine Vorteile
▸Deutlich schneller als Hive
▸Geringe Latenzzeiten durch Vermeidung von Map-Reduce
▸Operatoren-Pipelining und Caching
▸Skalierbar
� Allgemeine Nachteile
▸In der Regel keine Transaktionsunterstützung
LocalAgent
LocalAgent
LocalAgent
LocalAgent
SQL Query Coordination
HDFS Hive JSON Parquet
Cas-sandra
MySQL Oracle …
Connector Plugins
Prof. Dr. Jens Albrecht SQL on Hadoop 25
Google Dremel
Dremel
Prof. Dr. Jens Albrecht SQL on Hadoop 26
Apache Drill
� Self-Service Data Integration
▸No metadata repository required
▸Dynamic schema discovery: Metadata automatically extracted for data sources
▸ RDBMS and Hive (comprehensive), HBase (partial) or files (on-the-fly)▸ Utilizes self-describing data formats (Parquet, JSON, AVRO)
▸SQL-DDL can be used to create metadata explicitly
� ANSI-SQL plus Flexible Data Model
▸Fully ANSI compliant SQL
▸DrQL with SQL extensions for nested data structures (like JSON)
Quelle: http://www.mapr.com
Prof. Dr. Jens Albrecht SQL on Hadoop 27
Apache Drill: SQL for JSON
select name, flatten(fillings) as f
from dfs.users.`/donuts.json`
where f.cal < 300;
Source: https://www.mapr.com/products/apache-drill
Prof. Dr. Jens Albrecht SQL on Hadoop 28
Apache Drill: SQL for Heterogeneous Data
� Formats
▸JSON
▸CSV
▸ORC
▸Parquet
▸HBase tables
▸MongoDB
select USERS.name, USERS.emails.work
from
dfs.logs.`/data/logs` LOGS,
dfs.users.`/profiles.json` USERS,
where
LOGS.uid = USERS.uid and
errorLevel > 5
order by count(*);
Prof. Dr. Jens Albrecht SQL on Hadoop 29
Hadoop-SQL Integration
HDFS
MR / Tez
HiveQL
Hive (Native Hadoop) Pure Hadoop SQL Engines
HDFS
DistributedSQL Engine
HDFS
DistributedSQL Engine
NoSQLHive
Format-agnostic SQL Engines
RDBMS
Relational
RDBMS with Hadoop Access
Hadoop
Prof. Dr. Jens Albrecht SQL on Hadoop 30
External Tables in HDFS
HDFS
RDBMS
Logical Mapping
CREATE TABLE SCOTT.SALES_HDFS_EXT_TAB ( PROD_ID NUMBER(6),
CUST_ID NUMBER,
TIME_ID DATE,
CHANNEL_ID CHAR(1),
PROMO_ID NUMBER(6),
QUANTITY_SOLD NUMBER(3),
AMOUNT_SOLD NUMBER(10,2)
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY SALES_EXT_DIR
ACCESS PARAMETERS
( RECORDS DELIMITED BY NEWLINEFIELDS TERMINATED BY ',' (
PROD_ID DECIMAL EXTERNAL,
…
)
PREPROCESSORHDFS_BIN_PATH:hdfs_stream
)
LOCATION ( 'file_sales_1',
'file_sales_2',
'files_sales_3')
);
Prof. Dr. Jens Albrecht SQL on Hadoop 31
RDBMS with Hadoop Integration
� Approach
▸Towards genuine integration of Hadoop into RDBMS
▸Utilize Hadoop's computational power
▸Cost-based choice
� Advantages
▸Easiest way to use Hadoop as data source
▸Combined access to traditional and new data sources
� Disadvantages
▸Cost
▸Limited data sources
▸Vendor lock-inHDFS/HBase
MapReduce
Parallel Database
Relational TablesExternalTables
SQL Query Coordination
MRLoader
Prof. Dr. Jens Albrecht SQL on Hadoop 32
RDBMS with Hadoop Integration
� Products
▸Microsoft Polybase (part of MS Analytics Platform)
▸Oracle Big Data SQL (part of Oracle Big Data Appliance in combination with Exadata)
� Use Cases
▸Extension of traditional BI System
▸Data-lake scenario with RDBMS as primary system and Hadoop for mass data
▸Mix of analytic and transactional load
HDFS/HBase
MapReduce
Parallel Database
Relational TablesExternalTables
SQL Query Coordination
MRLoader
Prof. Dr. Jens Albrecht SQL on Hadoop 33
Oracle Big Data SQL
� Integrates
Hive Metadata
� Allows hybrid queries
▸Include Hadoop and NoSQL in relational queries
▸Use Exadata Smartscan-Technology
Source: http://www.oracle.com
Prof. Dr. Jens Albrecht SQL on Hadoop 34
Hadoop-SQL Integration
HDFS
MR / Tez
HiveQL
Hive (Native Hadoop) Pure Hadoop SQL Engines
HDFS
DistributedSQL Engine
HDFS
DistributedSQL Engine
NoSQLHive
Format-agnostic SQL Engines
RDBMS
Relational
RDBMS with Hadoop Access
Hadoop
Stinger Big Insights
Prof. Dr. Jens Albrecht SQL on Hadoop 35
>
Hadoop File Formats
Prof. Dr. Jens Albrecht SQL on Hadoop 36
File formats for Big Data
Text Formats
� High storage usage �bad scan performance
� Low compression � bad scan performance
Dedicated Formats (e.g. DB internal)
� Not open, no interoperability
Requirements for Big Data
� Interoperability
� Low storage / good compression
� High performance
� Flexible schema
A schema for a file format??Query performance for a file format??
Big Data often have a nested structure andmultiple schema variants and –versions
Prof. Dr. Jens Albrecht SQL on Hadoop 37
Motivation for file formats
Quelle: http://de.slideshare.net/cloudera/hadoop-summit-36479635
Prof. Dr. Jens Albrecht SQL on Hadoop 38
Considerations for File Formats
Query Tools
•none
•Frameworks like MapReduce, Spark, Cascading
•Query Engines like Pig, Hive, Impala
Schema Versioning
•Schema present?
•If so, can it change?
Splittability
•Partitioning
•Splitting of files possible for distributed processing?
•Example: CSV: yes, XML: partial, MP4: no
Block Compression
•Can blocks be independently compressed and distributed?
•Block compression is a prerequisite for partition compresssion!
File Size
•Size in bytes and number of files?
•Hadoop likes big, splittable files!
•Lots of small files cost performance
Load Profile
•Write Performance
•Filter operations
•Reading of single columns
•Full scans
Source: http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore
Prof. Dr. Jens Albrecht SQL on Hadoop 39
Example Row-Format: Avro
� Schema specification
▸internally storedin binary format
▸self-describing files
� Reader- vs. Writer Schema
▸Allows different"Views" on files
Writer Schema Writer Schema
Reader Schema
Avro Parser
Avro Data
Resolution Rules
Read
Write
record Person {string userName;union { null, long } favouriteNumber;array<string> interests;
}
Prof. Dr. Jens Albrecht SQL on Hadoop 40
Example Column Format:
� Columnar formats in general
▸Trade faster reads for slower writes
▸Very good compression
� Parquet Files
▸Hybrid Partitioning – sets of records in blocks, columnar within blocks
▸Zone maps per block as kind of index (min/max values per column)
Image: http://de.slideshare.net/cloudera/hadoop-summit-36479635
Prof. Dr. Jens Albrecht SQL on Hadoop 41
Databases as Lego Construction Kit!?
SQL Prozessor
Verteilte Ausführung
Speicherverwaltung
Dat
a D
icti
on
ary
SQL
Traditional monolithic RDBMS
SQL Prozessor
MapReduce
Spark
CSV Seq Avro
JSON ORC Parquet
SQL
Hadoop DB building blocks
• Generic Execution Engine• Metadata sharing in Hive Repository or
self-describing file formats• Operator push-down to intelligente file interfaces
Prof. Dr. Jens Albrecht SQL on Hadoop 42
>
Summary
Prof. Dr. Jens Albrecht SQL on Hadoop 43
Considerations for SQL on Hadoop Solutions
� SQL Functionality
▸Coverage of SQL standard
▸User-defined functions
▸Transactional Safety
� Performance and Stability
▸Multi-user workloads
▸Efficiency of joins and aggregations (I/O problems? Size limits?)
� Supported Data and Storage Formats
▸ Logical Format: relational, JSON, none, …
▸Physical Formats: CSV, Parquet, ORC, Avro, …
� Intelligent Storage Plugins / Data Federation
▸Access to various data sources beyond Hadoop
▸Pushdown predicates, access selected columns only
� Support for your Hadoop Distribution
Prof. Dr. Jens Albrecht SQL on Hadoop 44
HadoopHadoop
SQLSQL
Hadoop vs. SQL
� Hadoop SQL Technologies
▸Supplement traditional RDBMS
▸Extend traditional RDBMS
▸Develop new RDBMS
SQL universe
gets wider.
Hadoop and SQL
move closely
together.
Database systems become
open and modular.
Prof. Dr. Jens Albrecht SQL on Hadoop 45
References
▸ L. Chang et al.: HAWQ: A Massively Parallel Processing SQL Engine in Hadoop. In: Proceedings of the 2014 ACM SIGMOD international conference on Management of Data, Pages 1223-1234
▸ A. Floratour, U. Minhas, F. Özcan: SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures. Proceedings of the VLDB Endowment, Vol. 7, No. 12, 2014
▸ M. Hausenblas, J. Nadeau: Apache Drill: Interactive Ad-Hoc Analysis at Scale. Big Data Magazine, June 2013
▸ M. Kornacker, e.a.: Impala: A Modern, Open-Source SQL Engine for Hadoop. 7th Biennial Conference on Innovative Data Systems Research (CIDR’15)
▸ D. J. DeWitt, e.a.: Split Query Processing in Polybase. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 1255-1266
▸ S. Melnik, e.a.: Dremel: Interactive Analysis of Web-scale Datasets. PVLDB, 3(1-2):330–339, 2010
▸ M. Zaharia, e.a.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,NSDI 2012
▸ J. Albrecht, S. Alexander: Hadoop und SQL rücken enger zusammen.Computerwoche, Nov. 2013, http://www.computerwoche.de/a/hadoop-und-sql-ruecken-enger-zusammen,2549475
▸ C. Deptula: Hadoop File Formats: Ist not Just CSV Anymore. Blog Eintrag, 2014, http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore
▸ J. Le Dem: Efficient Data Storage Analytics with Apache Parquet 2.0, Hadoop Summit 2014, http://de.slideshare.net/cloudera/hadoop-summit-36479635
▸ M. Rathbone: 8 SQL-on-Hadoop frameworks worth checking out.Blog Eintrag, 2014, http://blog.matthewrathbone.com/2014/06/08/sql-engines-for-hadoop.html
▸ P. Srivati: Resilient Distributed Datasets (RDD) for the impatient.Blog Eintrag, 2014, http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html
▸ S. Yegulalp: 10 ways to query Hadoop with SQL.Infoworld, 2014, http://www.infoworld.com/article/2683729/hadoop/10-ways-to-query-hadoop-with-sql.html