tdwi sql on hadoop - sigs.de · pdf filespark, cascading •query engines like pig, hive,...

Processing Big Data

With SQL on Hadoop

Jens Albrecht

[email protected]

Prof. Dr. Jens Albrecht SQL on Hadoop 3

Why SQL for Big Data?

� Mature technology

� Broad knowledge available

� Powerful query language

� High interactive performance

� Many third party tools for data

analysis and visualization

� Flexible data structures

� Semi-structured data� Changing schemas

� Self-Service

� Data integration on-the-fly

� Scalability

� Analysis, integration, volumen

� (Relatively) Low Cost

� Commodity Hardware, Open Source


Agility

Why SQL for Big Data?

+

Extended

DWHData Lake


Hadoop-SQL Integration

HDFS

MR / Tez

HiveQL

Hive (Native Hadoop) Pure Hadoop SQL Engines

HDFS

DistributedSQL Engine

HDFS


NoSQLHive

Format-agnostic SQL Engines

RDBMS

Relational

RDBMS with Hadoop Access

Hadoop



HDFS

MR / Tez

HiveQL


HDFS


HDFS


NoSQLHive


RDBMS

Relational


Hadoop


MapReduce

Hive

Hive

� General

▸Developed initially by Facebook

▸SQL-processing for HDFS and HBase

▸Table definitions in Hive Meta Store

▸Generation of MapReduce Code

▸Schema-on-Read via SerDe

� Advantages

▸Mature part of every Hadoop distribution

▸Simple setup

▸Java-API for UDFs

▸Usage of many data formats via SerDe

� Disadvantages

▸Batch-oriented, slow

Hadoop

HDFS/HBase

MetaStore

HiveQLExecution

SerDe


Schema-on-Write vs. Schema-on-Read

� Relational Database: Schema-on-Write

� Big Data Processing: Schema-on-Read

Multi-structuredSource Data

ETL

Relational DBMS

SQL

Multi-structuredSource Data

Load as-is

Hadoop

SQL

Schema mapped to

originalfiles


Schema-on-Read: Hive & CSV

CREATE TABLE gps_data(

userid INT,

deviceid INT,

longitude STRING,

latitude STRING,

utctime STRING)

ROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'STORED AS TEXTFILE;

-- load data = copyLOAD DATA LOCAL INPATH 'new_data/gps.dat'OVERWRITE INTO TABLE gps_data;

SELECT COUNT(*) FROM gps_data;

4711 542815 49.454N 11.077E 10/01/2014@10:00:00UTC

Sample Data


Schema-on-Read: Hive & Regexp

CREATE TABLE weblog (host STRING,identity STRING,user STRING,time STRING,request STRING,status STRING,size STRING,referer STRING,agent STRING)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\])

([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"

)STORED AS TEXTFILE;


Schema-Read: Hive & JSON

Source: http://thornydev.blogspot.de/2013/07/querying-json-records-via-hive.html

JSON Data Relational Mapping in Hive


MapReduce

Hive

Hive on Tez (Stinger)

� Stinger-Initiative

▸"Make Hive 100x faster"

▸Finished with Hive 0.13 (April 2014)

▸ Replace MapReduce with Tez▸Native columnar data format (ORC)

� Stinger.Next

▸Phase 1: Hive 0.14 (November 2014)

▸ACID transactions

▸Phase 2: (Q2 2015)

▸ Subsecond Queries mit LLAP▸Machine Learning Integration

▸Phase 3: (Q4 2015)

▸ SQL:2011 Analytic Functions▸Materialized Views

Hadoop

HDFS/HBase

Tez / Yarn

MetaStore

HiveQLExecution

SerDe



HDFS

MR / Tez

HiveQL


HDFS


HDFS


NoSQLHive


RDBMS

Relational


Hadoop


Distributed SQL Engine

Pure Hadoop SQL Engines

� Approach

▸Distributed, parallel SQL engine

▸Often usage of Hive Metadata

▸Support of optimized data formats

▸Hadoop as mandatory basis

� Advantages and Disadvantages

▸Significantly faster as Hive

▸Low latency through dedicated engine

▸Operator pipelining and result caching

� Diffentiation of solutions

▸Supported SQL functionality

▸Point querying

▸Cost-based optimizer / performance

▸Transaction support

HDFS/HBase

LocalAgent

LocalAgent

LocalAgent

LocalAgent

SQL Query Coordination

DataFiles

DataFiles

DataFiles

DataFiles




HDFS/HBase

LocalAgent

LocalAgent

LocalAgent

LocalAgent


DataFiles

DataFiles

DataFiles

DataFiles

Big Insights



Example: IBM Big SQL

Source: IBM


Apache Spark & Spark SQL

� General

▸SQL engine based on Spark

▸Data access via data frames (former SchemaRDD)

▸In-Memory columnar format

▸HDFS / HBase as file format

� Advantages

▸Spark as general-purpose parallel computing framework

▸Support of Hive extensions like UDFs and SerDes and Hive metadata

� Disadvantages

▸Not yet fully mature

▸Not yet as fast as competitors

Spark SQL

HDFS/HBase

Apache Spark

Schema RDD

SQLExecution

SerDe


Apache Spark

� Distributed In-Memory Computing Framework

▸Data caching

▸General framework for all kinds of SQL and non-SQL analytics

▸Support for out-of-the box libraries as well as Java, Python and Scala in the same engine and for the same data

▸New datasources API allows to write plugins for non-Hadoop sources

Spark Execution Engine

Hadoop

ZooKeeperYARN

(optional)HDFS

Spark SQLSpark

Streaming

Graph Computation

(GraphX)

Machine Learning(MLlib)


Apache Spark

lines = spark.textFile("hdfs://...") errors = lines.filter(_.startsWith(ERROR)) errors.persist()

// Return the time fields of errors mentioning// assuming time is field number 3 in a tsv filehdfs_errors = errors.filter(_.contains(HDFS))time_fields = hdfs_errors.map(_.split(’\t’)(3)).collect()

Source: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Action

Resilient Distributed Data Sets(RDDs)

Transformations


Spark Transformations

� Transformations: Create a new RDD from an existing one

▸Lazy evaluation – results are not materialized

▸Much more than map reduce

� Actions: Return a value or dataset to the calling program

map filter samplegroupByKey sortByKey reduceByKeyunion pipe repartitionjoin leftOuterJoin rightOuterJoin…

reduce collect countfirst take(n) saveAsTextFile…


Adding Schema to RDDs

Source: http://de.slideshare.net/jeykottalam/spark-sqlamp-camp2014



HDFS

MR / Tez

HiveQL


HDFS


HDFS/Hive


NoSQLSQL


RDBMS

Relational


Hadoop



SQL-Engine with Pluggable Storage

� Ansatz

▸Verteilte, parallele SQL-Engine

▸Oftmals Nutzung von Hive Metadaten

▸Unterstützung optimierter Dateiformate

▸Hadoop obligatorisch als Basis

� Allgemeine Vorteile

▸Deutlich schneller als Hive

▸Geringe Latenzzeiten durch Vermeidung von Map-Reduce

▸Operatoren-Pipelining und Caching

▸Skalierbar

� Allgemeine Nachteile

▸In der Regel keine Transaktionsunterstützung

LocalAgent

LocalAgent

LocalAgent

LocalAgent


HDFS Hive JSON Parquet

Cas-sandra

MySQL Oracle …

Connector Plugins


Google Dremel

Dremel


Apache Drill

� Self-Service Data Integration

▸No metadata repository required

▸Dynamic schema discovery: Metadata automatically extracted for data sources

▸ RDBMS and Hive (comprehensive), HBase (partial) or files (on-the-fly)▸ Utilizes self-describing data formats (Parquet, JSON, AVRO)

▸SQL-DDL can be used to create metadata explicitly

� ANSI-SQL plus Flexible Data Model

▸Fully ANSI compliant SQL

▸DrQL with SQL extensions for nested data structures (like JSON)

Quelle: http://www.mapr.com


Apache Drill: SQL for JSON

select name, flatten(fillings) as f

from dfs.users.`/donuts.json`

where f.cal < 300;

Source: https://www.mapr.com/products/apache-drill


Apache Drill: SQL for Heterogeneous Data

� Formats

▸JSON

▸CSV

▸ORC

▸Parquet

▸HBase tables

▸MongoDB

select USERS.name, USERS.emails.work

from

dfs.logs.`/data/logs` LOGS,

dfs.users.`/profiles.json` USERS,

where

LOGS.uid = USERS.uid and

errorLevel > 5

order by count(*);



HDFS

MR / Tez

HiveQL


HDFS


HDFS


NoSQLHive


RDBMS

Relational


Hadoop


External Tables in HDFS

HDFS

RDBMS

Logical Mapping

CREATE TABLE SCOTT.SALES_HDFS_EXT_TAB ( PROD_ID NUMBER(6),

CUST_ID NUMBER,

TIME_ID DATE,

CHANNEL_ID CHAR(1),

PROMO_ID NUMBER(6),

QUANTITY_SOLD NUMBER(3),

AMOUNT_SOLD NUMBER(10,2)

)

ORGANIZATION EXTERNAL

( TYPE ORACLE_LOADER

DEFAULT DIRECTORY SALES_EXT_DIR

ACCESS PARAMETERS

( RECORDS DELIMITED BY NEWLINEFIELDS TERMINATED BY ',' (

PROD_ID DECIMAL EXTERNAL,

…

)

PREPROCESSORHDFS_BIN_PATH:hdfs_stream

)

LOCATION ( 'file_sales_1',

'file_sales_2',

'files_sales_3')

);


RDBMS with Hadoop Integration

� Approach

▸Towards genuine integration of Hadoop into RDBMS

▸Utilize Hadoop's computational power

▸Cost-based choice

� Advantages

▸Easiest way to use Hadoop as data source

▸Combined access to traditional and new data sources

� Disadvantages

▸Cost

▸Limited data sources

▸Vendor lock-inHDFS/HBase

MapReduce

Parallel Database

Relational TablesExternalTables


MRLoader


RDBMS with Hadoop Integration

� Products

▸Microsoft Polybase (part of MS Analytics Platform)

▸Oracle Big Data SQL (part of Oracle Big Data Appliance in combination with Exadata)

� Use Cases

▸Extension of traditional BI System

▸Data-lake scenario with RDBMS as primary system and Hadoop for mass data

▸Mix of analytic and transactional load

HDFS/HBase

MapReduce

Parallel Database

Relational TablesExternalTables


MRLoader


Oracle Big Data SQL

� Integrates

Hive Metadata

� Allows hybrid queries

▸Include Hadoop and NoSQL in relational queries

▸Use Exadata Smartscan-Technology

Source: http://www.oracle.com



HDFS

MR / Tez

HiveQL


HDFS


HDFS


NoSQLHive


RDBMS

Relational


Hadoop

Stinger Big Insights


>

Hadoop File Formats


File formats for Big Data

Text Formats

� High storage usage �bad scan performance

� Low compression � bad scan performance

Dedicated Formats (e.g. DB internal)

� Not open, no interoperability

Requirements for Big Data

� Interoperability

� Low storage / good compression

� High performance

� Flexible schema

A schema for a file format??Query performance for a file format??

Big Data often have a nested structure andmultiple schema variants and –versions


Motivation for file formats

Quelle: http://de.slideshare.net/cloudera/hadoop-summit-36479635


Considerations for File Formats

Query Tools

•none

•Frameworks like MapReduce, Spark, Cascading

•Query Engines like Pig, Hive, Impala

Schema Versioning

•Schema present?

•If so, can it change?

Splittability

•Partitioning

•Splitting of files possible for distributed processing?

•Example: CSV: yes, XML: partial, MP4: no

Block Compression

•Can blocks be independently compressed and distributed?

•Block compression is a prerequisite for partition compresssion!

File Size

•Size in bytes and number of files?

•Hadoop likes big, splittable files!

•Lots of small files cost performance

Load Profile

•Write Performance

•Filter operations

•Reading of single columns

•Full scans

Source: http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore


Example Row-Format: Avro

� Schema specification

▸internally storedin binary format

▸self-describing files

� Reader- vs. Writer Schema

▸Allows different"Views" on files

Writer Schema Writer Schema

Reader Schema

Avro Parser

Avro Data

Resolution Rules

Read

Write

record Person {string userName;union { null, long } favouriteNumber;array<string> interests;

}


Example Column Format:

� Columnar formats in general

▸Trade faster reads for slower writes

▸Very good compression

� Parquet Files

▸Hybrid Partitioning – sets of records in blocks, columnar within blocks

▸Zone maps per block as kind of index (min/max values per column)

Image: http://de.slideshare.net/cloudera/hadoop-summit-36479635


Databases as Lego Construction Kit!?

SQL Prozessor

Verteilte Ausführung

Speicherverwaltung

Dat

a D

icti

on

ary

SQL

Traditional monolithic RDBMS

SQL Prozessor

MapReduce

Spark

CSV Seq Avro

JSON ORC Parquet

SQL

Hadoop DB building blocks

• Generic Execution Engine• Metadata sharing in Hive Repository or

self-describing file formats• Operator push-down to intelligente file interfaces


>

Summary


Considerations for SQL on Hadoop Solutions

� SQL Functionality

▸Coverage of SQL standard

▸User-defined functions

▸Transactional Safety

� Performance and Stability

▸Multi-user workloads

▸Efficiency of joins and aggregations (I/O problems? Size limits?)

� Supported Data and Storage Formats

▸ Logical Format: relational, JSON, none, …

▸Physical Formats: CSV, Parquet, ORC, Avro, …

� Intelligent Storage Plugins / Data Federation

▸Access to various data sources beyond Hadoop

▸Pushdown predicates, access selected columns only

� Support for your Hadoop Distribution


HadoopHadoop

SQLSQL

Hadoop vs. SQL

� Hadoop SQL Technologies

▸Supplement traditional RDBMS

▸Extend traditional RDBMS

▸Develop new RDBMS

SQL universe

gets wider.

Hadoop and SQL

move closely

together.

Database systems become

open and modular.


References

▸ L. Chang et al.: HAWQ: A Massively Parallel Processing SQL Engine in Hadoop. In: Proceedings of the 2014 ACM SIGMOD international conference on Management of Data, Pages 1223-1234

▸ A. Floratour, U. Minhas, F. Özcan: SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures. Proceedings of the VLDB Endowment, Vol. 7, No. 12, 2014

▸ M. Hausenblas, J. Nadeau: Apache Drill: Interactive Ad-Hoc Analysis at Scale. Big Data Magazine, June 2013

▸ M. Kornacker, e.a.: Impala: A Modern, Open-Source SQL Engine for Hadoop. 7th Biennial Conference on Innovative Data Systems Research (CIDR’15)

▸ D. J. DeWitt, e.a.: Split Query Processing in Polybase. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 1255-1266

▸ S. Melnik, e.a.: Dremel: Interactive Analysis of Web-scale Datasets. PVLDB, 3(1-2):330–339, 2010

▸ M. Zaharia, e.a.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,NSDI 2012

▸ J. Albrecht, S. Alexander: Hadoop und SQL rücken enger zusammen.Computerwoche, Nov. 2013, http://www.computerwoche.de/a/hadoop-und-sql-ruecken-enger-zusammen,2549475

▸ C. Deptula: Hadoop File Formats: Ist not Just CSV Anymore. Blog Eintrag, 2014, http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore

▸ J. Le Dem: Efficient Data Storage Analytics with Apache Parquet 2.0, Hadoop Summit 2014, http://de.slideshare.net/cloudera/hadoop-summit-36479635

▸ M. Rathbone: 8 SQL-on-Hadoop frameworks worth checking out.Blog Eintrag, 2014, http://blog.matthewrathbone.com/2014/06/08/sql-engines-for-hadoop.html

▸ P. Srivati: Resilient Distributed Datasets (RDD) for the impatient.Blog Eintrag, 2014, http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html

▸ S. Yegulalp: 10 ways to query Hadoop with SQL.Infoworld, 2014, http://www.infoworld.com/article/2683729/hadoop/10-ways-to-query-hadoop-with-sql.html

tdwi sql on hadoop - sigs.de · pdf filespark, cascading •query engines like pig, hive,...

Documents