tdwi sql on hadoop - sigs.de · pdf filespark, cascading •query engines like pig, hive,...

43
Processing Big Data With SQL on Hadoop Jens Albrecht [email protected]

Upload: habao

Post on 06-Mar-2018

226 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Processing Big Data

With SQL on Hadoop

Jens Albrecht

[email protected]

Page 2: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 3

Why SQL for Big Data?

� Mature technology

� Broad knowledge available

� Powerful query language

� High interactive performance

� Many third party tools for data

analysis and visualization

� Flexible data structures

� Semi-structured data� Changing schemas

� Self-Service

� Data integration on-the-fly

� Scalability

� Analysis, integration, volumen

� (Relatively) Low Cost

� Commodity Hardware, Open Source

Page 3: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 4

Agility

Why SQL for Big Data?

+

Extended

DWHData Lake

Page 4: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 5

Hadoop-SQL Integration

HDFS

MR / Tez

HiveQL

Hive (Native Hadoop) Pure Hadoop SQL Engines

HDFS

DistributedSQL Engine

HDFS

DistributedSQL Engine

NoSQLHive

Format-agnostic SQL Engines

RDBMS

Relational

RDBMS with Hadoop Access

Hadoop

Page 5: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 6

Hadoop-SQL Integration

HDFS

MR / Tez

HiveQL

Hive (Native Hadoop) Pure Hadoop SQL Engines

HDFS

DistributedSQL Engine

HDFS

DistributedSQL Engine

NoSQLHive

Format-agnostic SQL Engines

RDBMS

Relational

RDBMS with Hadoop Access

Hadoop

Page 6: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 7

MapReduce

Hive

Hive

� General

▸Developed initially by Facebook

▸SQL-processing for HDFS and HBase

▸Table definitions in Hive Meta Store

▸Generation of MapReduce Code

▸Schema-on-Read via SerDe

� Advantages

▸Mature part of every Hadoop distribution

▸Simple setup

▸Java-API for UDFs

▸Usage of many data formats via SerDe

� Disadvantages

▸Batch-oriented, slow

Hadoop

HDFS/HBase

MetaStore

HiveQLExecution

SerDe

Page 7: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 8

Schema-on-Write vs. Schema-on-Read

� Relational Database: Schema-on-Write

� Big Data Processing: Schema-on-Read

Multi-structuredSource Data

ETL

Relational DBMS

SQL

Multi-structuredSource Data

Load as-is

Hadoop

SQL

Schema mapped to

originalfiles

Page 8: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 9

Schema-on-Read: Hive & CSV

CREATE TABLE gps_data(

userid INT,

deviceid INT,

longitude STRING,

latitude STRING,

utctime STRING)

ROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'STORED AS TEXTFILE;

-- load data = copyLOAD DATA LOCAL INPATH 'new_data/gps.dat'OVERWRITE INTO TABLE gps_data;

SELECT COUNT(*) FROM gps_data;

4711 542815 49.454N 11.077E 10/01/2014@10:00:00UTC

Sample Data

Page 9: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 10

Schema-on-Read: Hive & Regexp

CREATE TABLE weblog (host STRING,identity STRING,user STRING,time STRING,request STRING,status STRING,size STRING,referer STRING,agent STRING)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\])

([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"

)STORED AS TEXTFILE;

Page 10: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 11

Schema-Read: Hive & JSON

Source: http://thornydev.blogspot.de/2013/07/querying-json-records-via-hive.html

JSON Data Relational Mapping in Hive

Page 11: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 12

MapReduce

Hive

Hive on Tez (Stinger)

� Stinger-Initiative

▸"Make Hive 100x faster"

▸Finished with Hive 0.13 (April 2014)

▸ Replace MapReduce with Tez▸Native columnar data format (ORC)

� Stinger.Next

▸Phase 1: Hive 0.14 (November 2014)

▸ACID transactions

▸Phase 2: (Q2 2015)

▸ Subsecond Queries mit LLAP▸Machine Learning Integration

▸Phase 3: (Q4 2015)

▸ SQL:2011 Analytic Functions▸Materialized Views

Hadoop

HDFS/HBase

Tez / Yarn

MetaStore

HiveQLExecution

SerDe

Page 12: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 13

Hadoop-SQL Integration

HDFS

MR / Tez

HiveQL

Hive (Native Hadoop) Pure Hadoop SQL Engines

HDFS

DistributedSQL Engine

HDFS

DistributedSQL Engine

NoSQLHive

Format-agnostic SQL Engines

RDBMS

Relational

RDBMS with Hadoop Access

Hadoop

Page 13: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 14

Distributed SQL Engine

Pure Hadoop SQL Engines

� Approach

▸Distributed, parallel SQL engine

▸Often usage of Hive Metadata

▸Support of optimized data formats

▸Hadoop as mandatory basis

� Advantages and Disadvantages

▸Significantly faster as Hive

▸Low latency through dedicated engine

▸Operator pipelining and result caching

� Diffentiation of solutions

▸Supported SQL functionality

▸Point querying

▸Cost-based optimizer / performance

▸Transaction support

HDFS/HBase

LocalAgent

LocalAgent

LocalAgent

LocalAgent

SQL Query Coordination

DataFiles

DataFiles

DataFiles

DataFiles

Page 14: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 15

Distributed SQL Engine

Pure Hadoop SQL Engines

HDFS/HBase

LocalAgent

LocalAgent

LocalAgent

LocalAgent

SQL Query Coordination

DataFiles

DataFiles

DataFiles

DataFiles

Big Insights

Page 15: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 16

Pure Hadoop SQL Engines

Example: IBM Big SQL

Source: IBM

Page 16: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 17

Apache Spark & Spark SQL

� General

▸SQL engine based on Spark

▸Data access via data frames (former SchemaRDD)

▸In-Memory columnar format

▸HDFS / HBase as file format

� Advantages

▸Spark as general-purpose parallel computing framework

▸Support of Hive extensions like UDFs and SerDes and Hive metadata

� Disadvantages

▸Not yet fully mature

▸Not yet as fast as competitors

Spark SQL

HDFS/HBase

Apache Spark

Schema RDD

SQLExecution

SerDe

Page 17: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 18

Apache Spark

� Distributed In-Memory Computing Framework

▸Data caching

▸General framework for all kinds of SQL and non-SQL analytics

▸Support for out-of-the box libraries as well as Java, Python and Scala in the same engine and for the same data

▸New datasources API allows to write plugins for non-Hadoop sources

Spark Execution Engine

Hadoop

ZooKeeperYARN

(optional)HDFS

Spark SQLSpark

Streaming

Graph Computation

(GraphX)

Machine Learning(MLlib)

Page 18: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 19

Apache Spark

lines = spark.textFile("hdfs://...") errors = lines.filter(_.startsWith(ERROR)) errors.persist()

// Return the time fields of errors mentioning// assuming time is field number 3 in a tsv filehdfs_errors = errors.filter(_.contains(HDFS))time_fields = hdfs_errors.map(_.split(’\t’)(3)).collect()

Source: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Action

Resilient Distributed Data Sets(RDDs)

Transformations

Page 19: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 20

Spark Transformations

� Transformations: Create a new RDD from an existing one

▸Lazy evaluation – results are not materialized

▸Much more than map reduce

� Actions: Return a value or dataset to the calling program

map filter samplegroupByKey sortByKey reduceByKeyunion pipe repartitionjoin leftOuterJoin rightOuterJoin…

reduce collect countfirst take(n) saveAsTextFile…

Page 20: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 22

Adding Schema to RDDs

Source: http://de.slideshare.net/jeykottalam/spark-sqlamp-camp2014

Page 21: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 23

Hadoop-SQL Integration

HDFS

MR / Tez

HiveQL

Hive (Native Hadoop) Pure Hadoop SQL Engines

HDFS

DistributedSQL Engine

HDFS/Hive

DistributedSQL Engine

NoSQLSQL

Format-agnostic SQL Engines

RDBMS

Relational

RDBMS with Hadoop Access

Hadoop

Page 22: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 24

Distributed SQL Engine

SQL-Engine with Pluggable Storage

� Ansatz

▸Verteilte, parallele SQL-Engine

▸Oftmals Nutzung von Hive Metadaten

▸Unterstützung optimierter Dateiformate

▸Hadoop obligatorisch als Basis

� Allgemeine Vorteile

▸Deutlich schneller als Hive

▸Geringe Latenzzeiten durch Vermeidung von Map-Reduce

▸Operatoren-Pipelining und Caching

▸Skalierbar

� Allgemeine Nachteile

▸In der Regel keine Transaktionsunterstützung

LocalAgent

LocalAgent

LocalAgent

LocalAgent

SQL Query Coordination

HDFS Hive JSON Parquet

Cas-sandra

MySQL Oracle …

Connector Plugins

Page 23: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 25

Google Dremel

Dremel

Page 24: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 26

Apache Drill

� Self-Service Data Integration

▸No metadata repository required

▸Dynamic schema discovery: Metadata automatically extracted for data sources

▸ RDBMS and Hive (comprehensive), HBase (partial) or files (on-the-fly)▸ Utilizes self-describing data formats (Parquet, JSON, AVRO)

▸SQL-DDL can be used to create metadata explicitly

� ANSI-SQL plus Flexible Data Model

▸Fully ANSI compliant SQL

▸DrQL with SQL extensions for nested data structures (like JSON)

Quelle: http://www.mapr.com

Page 25: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 27

Apache Drill: SQL for JSON

select name, flatten(fillings) as f

from dfs.users.`/donuts.json`

where f.cal < 300;

Source: https://www.mapr.com/products/apache-drill

Page 26: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 28

Apache Drill: SQL for Heterogeneous Data

� Formats

▸JSON

▸CSV

▸ORC

▸Parquet

▸HBase tables

▸MongoDB

select USERS.name, USERS.emails.work

from

dfs.logs.`/data/logs` LOGS,

dfs.users.`/profiles.json` USERS,

where

LOGS.uid = USERS.uid and

errorLevel > 5

order by count(*);

Page 27: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 29

Hadoop-SQL Integration

HDFS

MR / Tez

HiveQL

Hive (Native Hadoop) Pure Hadoop SQL Engines

HDFS

DistributedSQL Engine

HDFS

DistributedSQL Engine

NoSQLHive

Format-agnostic SQL Engines

RDBMS

Relational

RDBMS with Hadoop Access

Hadoop

Page 28: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 30

External Tables in HDFS

HDFS

RDBMS

Logical Mapping

CREATE TABLE SCOTT.SALES_HDFS_EXT_TAB ( PROD_ID NUMBER(6),

CUST_ID NUMBER,

TIME_ID DATE,

CHANNEL_ID CHAR(1),

PROMO_ID NUMBER(6),

QUANTITY_SOLD NUMBER(3),

AMOUNT_SOLD NUMBER(10,2)

)

ORGANIZATION EXTERNAL

( TYPE ORACLE_LOADER

DEFAULT DIRECTORY SALES_EXT_DIR

ACCESS PARAMETERS

( RECORDS DELIMITED BY NEWLINEFIELDS TERMINATED BY ',' (

PROD_ID DECIMAL EXTERNAL,

)

PREPROCESSORHDFS_BIN_PATH:hdfs_stream

)

LOCATION ( 'file_sales_1',

'file_sales_2',

'files_sales_3')

);

Page 29: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 31

RDBMS with Hadoop Integration

� Approach

▸Towards genuine integration of Hadoop into RDBMS

▸Utilize Hadoop's computational power

▸Cost-based choice

� Advantages

▸Easiest way to use Hadoop as data source

▸Combined access to traditional and new data sources

� Disadvantages

▸Cost

▸Limited data sources

▸Vendor lock-inHDFS/HBase

MapReduce

Parallel Database

Relational TablesExternalTables

SQL Query Coordination

MRLoader

Page 30: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 32

RDBMS with Hadoop Integration

� Products

▸Microsoft Polybase (part of MS Analytics Platform)

▸Oracle Big Data SQL (part of Oracle Big Data Appliance in combination with Exadata)

� Use Cases

▸Extension of traditional BI System

▸Data-lake scenario with RDBMS as primary system and Hadoop for mass data

▸Mix of analytic and transactional load

HDFS/HBase

MapReduce

Parallel Database

Relational TablesExternalTables

SQL Query Coordination

MRLoader

Page 31: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 33

Oracle Big Data SQL

� Integrates

Hive Metadata

� Allows hybrid queries

▸Include Hadoop and NoSQL in relational queries

▸Use Exadata Smartscan-Technology

Source: http://www.oracle.com

Page 32: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 34

Hadoop-SQL Integration

HDFS

MR / Tez

HiveQL

Hive (Native Hadoop) Pure Hadoop SQL Engines

HDFS

DistributedSQL Engine

HDFS

DistributedSQL Engine

NoSQLHive

Format-agnostic SQL Engines

RDBMS

Relational

RDBMS with Hadoop Access

Hadoop

Stinger Big Insights

Page 33: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 35

>

Hadoop File Formats

Page 34: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 36

File formats for Big Data

Text Formats

� High storage usage �bad scan performance

� Low compression � bad scan performance

Dedicated Formats (e.g. DB internal)

� Not open, no interoperability

Requirements for Big Data

� Interoperability

� Low storage / good compression

� High performance

� Flexible schema

A schema for a file format??Query performance for a file format??

Big Data often have a nested structure andmultiple schema variants and –versions

Page 35: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 37

Motivation for file formats

Quelle: http://de.slideshare.net/cloudera/hadoop-summit-36479635

Page 36: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 38

Considerations for File Formats

Query Tools

•none

•Frameworks like MapReduce, Spark, Cascading

•Query Engines like Pig, Hive, Impala

Schema Versioning

•Schema present?

•If so, can it change?

Splittability

•Partitioning

•Splitting of files possible for distributed processing?

•Example: CSV: yes, XML: partial, MP4: no

Block Compression

•Can blocks be independently compressed and distributed?

•Block compression is a prerequisite for partition compresssion!

File Size

•Size in bytes and number of files?

•Hadoop likes big, splittable files!

•Lots of small files cost performance

Load Profile

•Write Performance

•Filter operations

•Reading of single columns

•Full scans

Source: http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore

Page 37: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 39

Example Row-Format: Avro

� Schema specification

▸internally storedin binary format

▸self-describing files

� Reader- vs. Writer Schema

▸Allows different"Views" on files

Writer Schema Writer Schema

Reader Schema

Avro Parser

Avro Data

Resolution Rules

Read

Write

record Person {string userName;union { null, long } favouriteNumber;array<string> interests;

}

Page 38: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 40

Example Column Format:

� Columnar formats in general

▸Trade faster reads for slower writes

▸Very good compression

� Parquet Files

▸Hybrid Partitioning – sets of records in blocks, columnar within blocks

▸Zone maps per block as kind of index (min/max values per column)

Image: http://de.slideshare.net/cloudera/hadoop-summit-36479635

Page 39: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 41

Databases as Lego Construction Kit!?

SQL Prozessor

Verteilte Ausführung

Speicherverwaltung

Dat

a D

icti

on

ary

SQL

Traditional monolithic RDBMS

SQL Prozessor

MapReduce

Spark

CSV Seq Avro

JSON ORC Parquet

SQL

Hadoop DB building blocks

• Generic Execution Engine• Metadata sharing in Hive Repository or

self-describing file formats• Operator push-down to intelligente file interfaces

Page 40: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 42

>

Summary

Page 41: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 43

Considerations for SQL on Hadoop Solutions

� SQL Functionality

▸Coverage of SQL standard

▸User-defined functions

▸Transactional Safety

� Performance and Stability

▸Multi-user workloads

▸Efficiency of joins and aggregations (I/O problems? Size limits?)

� Supported Data and Storage Formats

▸ Logical Format: relational, JSON, none, …

▸Physical Formats: CSV, Parquet, ORC, Avro, …

� Intelligent Storage Plugins / Data Federation

▸Access to various data sources beyond Hadoop

▸Pushdown predicates, access selected columns only

� Support for your Hadoop Distribution

Page 42: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 44

HadoopHadoop

SQLSQL

Hadoop vs. SQL

� Hadoop SQL Technologies

▸Supplement traditional RDBMS

▸Extend traditional RDBMS

▸Develop new RDBMS

SQL universe

gets wider.

Hadoop and SQL

move closely

together.

Database systems become

open and modular.

Page 43: TDWI SQL on Hadoop - sigs.de · PDF fileSpark, Cascading •Query Engines like Pig, Hive, Impala ... Prof. Dr. Jens Albrecht SQL on Hadoop 39 Example Row-Format: Avro Schema specification

Prof. Dr. Jens Albrecht SQL on Hadoop 45

References

▸ L. Chang et al.: HAWQ: A Massively Parallel Processing SQL Engine in Hadoop. In: Proceedings of the 2014 ACM SIGMOD international conference on Management of Data, Pages 1223-1234

▸ A. Floratour, U. Minhas, F. Özcan: SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures. Proceedings of the VLDB Endowment, Vol. 7, No. 12, 2014

▸ M. Hausenblas, J. Nadeau: Apache Drill: Interactive Ad-Hoc Analysis at Scale. Big Data Magazine, June 2013

▸ M. Kornacker, e.a.: Impala: A Modern, Open-Source SQL Engine for Hadoop. 7th Biennial Conference on Innovative Data Systems Research (CIDR’15)

▸ D. J. DeWitt, e.a.: Split Query Processing in Polybase. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 1255-1266

▸ S. Melnik, e.a.: Dremel: Interactive Analysis of Web-scale Datasets. PVLDB, 3(1-2):330–339, 2010

▸ M. Zaharia, e.a.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,NSDI 2012

▸ J. Albrecht, S. Alexander: Hadoop und SQL rücken enger zusammen.Computerwoche, Nov. 2013, http://www.computerwoche.de/a/hadoop-und-sql-ruecken-enger-zusammen,2549475

▸ C. Deptula: Hadoop File Formats: Ist not Just CSV Anymore. Blog Eintrag, 2014, http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore

▸ J. Le Dem: Efficient Data Storage Analytics with Apache Parquet 2.0, Hadoop Summit 2014, http://de.slideshare.net/cloudera/hadoop-summit-36479635

▸ M. Rathbone: 8 SQL-on-Hadoop frameworks worth checking out.Blog Eintrag, 2014, http://blog.matthewrathbone.com/2014/06/08/sql-engines-for-hadoop.html

▸ P. Srivati: Resilient Distributed Datasets (RDD) for the impatient.Blog Eintrag, 2014, http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html

▸ S. Yegulalp: 10 ways to query Hadoop with SQL.Infoworld, 2014, http://www.infoworld.com/article/2683729/hadoop/10-ways-to-query-hadoop-with-sql.html