Apache Tez: Accelerating Hadoop Query Processing

Jeff Markham, Technical Director, APAC, Hortonworks

© Hortonworks Inc. 2013

Tez – Introduction

• Distributed execution framework targeted towards data-processing applications.

• Based on expressing a computation as a dataflow graph.

• Built on top of YARN – the resource management framework for Hadoop.

• Open source Apache incubator project and Apache licensed.



YARN: Taking Hadoop Beyond Batch

HADOOP 1.0 (MapReduce as Base)
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
• On top of MapReduce: Pig (data flow), Hive (SQL), Others (Cascading)

HADOOP 2.0 (Apache Tez as Base)
• HDFS2 (redundant, reliable storage)
• YARN (cluster resource management)
• Tez (execution engine)
• On top of Tez: Pig (data flow), Hive (SQL), Others (Cascading)
• Alongside Tez on YARN: MapReduce (batch), Storm (real-time stream processing), HBase and Accumulo (online data processing)



Apache Tez (“Speed”)
• Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
– Lower latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
• YARN ApplicationMaster to run DAG of Tez Tasks
• Task with pluggable Input, Processor and Output: Tez Task = <Input, Processor, Output>



Tez: Building blocks for scalable data processing

• Classical ‘Map’: HDFS Input → Map Processor → Sorted Output
• Classical ‘Reduce’: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output



Hive-on-MR vs. Hive-on-Tez

SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id)
GROUP BY a.x
UNION
SELECT x, AVERAGE(y) AS avg
FROM c
GROUP BY x
ORDER BY avg;

[Diagram: Hive – MR vs. Hive – Tez. On MapReduce, the query plan runs as a chain of separate jobs (M = map task, R = reduce task), with HDFS between the jobs. On Tez, the plan runs as one graph of map and reduce stages. Stage labels in the diagram: SELECT a.state; SELECT c.price (SELECT a.state, c.itemId on the Tez side); SELECT b.id; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*); AVERAGE(c.price).]

Tez avoids unneeded writes to HDFS.
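The difference in HDFS writes can be illustrated with a toy count. This is only a sketch and not Hive or Tez code: the function names are hypothetical, and it assumes a plan made of one map phase followed by several consecutive reduce phases.

```python
# Toy model (hypothetical names, not Hive or Tez code): count how often a
# plan with several consecutive reduce phases materializes data on HDFS.

def mapreduce_jobs(num_reduce_phases):
    """On MapReduce, each reduce phase needs its own (map, reduce) job."""
    return [("map", "reduce") for _ in range(num_reduce_phases)]


def hdfs_writes_on_mapreduce(num_reduce_phases):
    # Every job writes its reduce output to HDFS, including intermediate
    # results that only the next job in the chain will read.
    return len(mapreduce_jobs(num_reduce_phases))


def tez_dag(num_reduce_phases):
    """On a DAG engine, the plan is one map stage and chained reduce stages."""
    return ["map"] + ["reduce"] * num_reduce_phases


def hdfs_writes_on_tez(num_reduce_phases):
    # Only the final stage writes its result to HDFS in this model.
    return 1 if num_reduce_phases > 0 else 0


phases = 3
saved = hdfs_writes_on_mapreduce(phases) - hdfs_writes_on_tez(phases)
print(f"{phases} reduce phases: {saved} intermediate HDFS writes avoided")
```

In this model the number of avoided writes grows with the depth of the plan, which is why multi-stage queries are the ones that benefit.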



Tez Sessions

… because Map/Reduce query startup is expensive
• Tez Sessions
– Hot containers ready for immediate use
– Removes task and job launch overhead (~5s – 30s)
• Hive
– Session launch/shutdown in background (seamless, user not aware)
– Submits query plan directly to Tez Session
Native Hadoop service, not ad-hoc



Tez Delivers Interactive Query - Out of the Box!

Feature | Description | Benefit
Tez Session | Overcomes Map-Reduce job-launch latency by pre-launching Tez AppMaster. | Latency
Tez Container Pre-Launch | Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics. | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency
Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput


Tez – Design Themes

• Empowering End Users
• Execution Performance


Tez – Empowering End Users

• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment



• Expressive dataflow definition APIs
– Enable definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime.
– Targeted towards data processing applications like Hive/Pig but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance.
[Diagram: a dataflow graph with a Preprocessor Stage, a Partition Stage and an Aggregate Stage, expanded into parallel tasks: TaskA-1, TaskA-2, TaskB-1, TaskB-2, TaskC-1, TaskC-2, TaskD-1, TaskD-2, TaskE-1, TaskE-2.]
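The idea of connecting vertices into a logical graph and expanding it into tasks at runtime can be sketched in a few lines. This is a toy model, not the Tez API: the `Dag` class, its method names, and the parallelism values are assumptions made for illustration.

```python
from collections import defaultdict, deque


class Dag:
    """Toy logical dataflow graph (hypothetical class, not the Tez API)."""

    def __init__(self):
        self.parallelism = {}                # vertex name -> number of tasks
        self.downstream = defaultdict(list)  # vertex name -> consumer vertices

    def add_vertex(self, name, parallelism):
        self.parallelism[name] = parallelism
        return self

    def add_edge(self, src, dst):
        if src not in self.parallelism or dst not in self.parallelism:
            raise KeyError("add both vertices before connecting them")
        self.downstream[src].append(dst)
        return self

    def expand(self):
        """Expand each logical vertex into its named physical tasks."""
        return {v: [f"{v}-{i + 1}" for i in range(n)]
                for v, n in self.parallelism.items()}

    def execution_order(self):
        """Return the vertices in topological order (Kahn's algorithm)."""
        indegree = {v: 0 for v in self.parallelism}
        for src in self.downstream:
            for dst in self.downstream[src]:
                indegree[dst] += 1
        ready = deque(v for v in self.parallelism if indegree[v] == 0)
        order = []
        while ready:
            vertex = ready.popleft()
            order.append(vertex)
            for dst in self.downstream[vertex]:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
        if len(order) != len(self.parallelism):
            raise ValueError("graph has a cycle")
        return order


# Three stages named after the diagram; two tasks each is an assumption.
dag = (Dag()
       .add_vertex("Preprocessor", 2)
       .add_vertex("Partition", 2)
       .add_vertex("Aggregate", 2)
       .add_edge("Preprocessor", "Partition")
       .add_edge("Partition", "Aggregate"))

print(dag.execution_order())
print(dag.expand()["Partition"])
```

The caller only describes vertices and edges; how many tasks each vertex becomes is decided when the graph is expanded.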


• Expressive dataflow definition APIs
[Diagram: Distributed Sort. Three stages of tasks (Task-1 and Task-2 in each) plus a Sampler; Samples flow into the Sampler and Ranges flow out of it.]
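The sample-then-range-partition pattern behind a distributed sort can be sketched on a single machine. This is a generic illustration of the technique, not Tez code: all function names are hypothetical, and each "partition" stands in for the work of one sorting task.

```python
import bisect
import random


def sample_keys(data, sample_size, seed=0):
    """Sampler: draw a sorted random sample of the keys."""
    rng = random.Random(seed)
    return sorted(rng.sample(data, min(sample_size, len(data))))


def split_points(samples, num_partitions):
    """Ranges: pick num_partitions - 1 boundaries that cut the sample evenly."""
    if not samples:
        return []
    step = len(samples) / num_partitions
    return [samples[int(step * i)] for i in range(1, num_partitions)]


def range_partition(data, boundaries):
    """Route every key to the partition whose range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in data:
        partitions[bisect.bisect_right(boundaries, key)].append(key)
    return partitions


def distributed_sort(data, num_partitions, sample_size=100):
    samples = sample_keys(data, sample_size)
    boundaries = split_points(samples, num_partitions)
    partitions = range_partition(data, boundaries)
    # Each partition is sorted independently (one task per partition in a
    # real system). Because the ranges do not overlap, concatenating the
    # sorted partitions yields a globally sorted result.
    return [key for part in partitions for key in sorted(part)]


keys = list(range(1000))
random.Random(42).shuffle(keys)
result = distributed_sort(keys, num_partitions=4)
print(result[:5], result[-5:])
```

Sampling only affects how evenly the work is balanced; the output is sorted regardless of how good the boundaries are.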



• Flexible Input-Processor-Output runtime model
– Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
– End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful tasks.
Examples:
Mapper = HDFSInput → MapProcessor → FileSortedOutput
Reducer = ShuffleInput → ReduceProcessor → HDFSOutput
PairwiseJoin = Input1 + Input2 → JoinProcessor → FileSortedOutput
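Composing a task from an input, a processor and an output can be sketched with plain functions. This is a toy word count, not the Tez runtime: `make_task` and the other names are hypothetical stand-ins for the HDFSInput, MapProcessor and similar components named above.

```python
from itertools import groupby


def lines_input(lines):
    """Input: returns a callable that yields records (stand-in for HDFSInput)."""
    return lambda: iter(lines)


def map_processor(records):
    """Processor: emit (word, 1) for every word in every line."""
    for line in records:
        for word in line.split():
            yield (word, 1)


def sorted_output(pairs):
    """Output: materialize the pairs sorted by key."""
    return sorted(pairs)


def reduce_processor(sorted_pairs):
    """Processor: sum the values of each run of equal keys."""
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield (key, sum(value for _, value in group))


def make_task(task_input, processor, output):
    """Compose a task from any input, processor and output."""
    def run():
        return output(processor(task_input()))
    return run


mapper = make_task(lines_input(["a b a", "b c"]), map_processor, sorted_output)
shuffled = mapper()

# The reducer reuses make_task with a different input, processor and output.
reducer = make_task(lambda: iter(shuffled), reduce_processor, list)
result = reducer()
print(result)
```

The mapper and the reducer are built by the same composition function; only the three plugged-in pieces differ.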



• Data type agnostic
– Tez is only concerned with the movement of data: files and streams of bytes.
– Does not impose any data format on the user application. An MR application can use Key-Value pairs on top of Tez. Hive and Pig can use tuple-oriented formats that are natural and native to them.
[Diagram: a Tez Task moves Bytes in and out as a File or Stream; User Code interprets them as Key-Value pairs or Tuples.]
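The separation between moving bytes and interpreting them can be sketched as follows. This is an illustration only: `move_bytes` is a hypothetical stand-in for the framework, and the two codecs are assumed formats, not the serialization that MapReduce, Hive or Pig actually use.

```python
import json
import struct


def move_bytes(payload):
    """Stand-in for the framework: it transports opaque bytes and nothing else."""
    if not isinstance(payload, (bytes, bytearray)):
        raise TypeError("the transport only accepts bytes")
    return bytes(payload)


# A key-value codec, as an MR-style application might define it.
def encode_kv(key, value):
    raw_key = key.encode("utf-8")
    return struct.pack(">I", len(raw_key)) + raw_key + struct.pack(">q", value)


def decode_kv(payload):
    (key_length,) = struct.unpack_from(">I", payload, 0)
    key = payload[4:4 + key_length].decode("utf-8")
    (value,) = struct.unpack_from(">q", payload, 4 + key_length)
    return key, value


# A tuple codec, as a Hive- or Pig-style application might define it.
def encode_tuple(row):
    return json.dumps(list(row)).encode("utf-8")


def decode_tuple(payload):
    return tuple(json.loads(payload.decode("utf-8")))


kv = decode_kv(move_bytes(encode_kv("state", 42)))
row = decode_tuple(move_bytes(encode_tuple(("CA", 3, 1.5))))
print(kv, row)
```

Both applications share the same transport; the format lives entirely in user code on either side of it.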



• Simplifying deployment
– Tez is a completely client-side application.
– No deployments to do. Simply upload to any accessible FileSystem and change local Tez configuration to point to that.
– Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
– Leverages YARN local resources.
[Diagram: TezClients on two Client Machines use Tez Lib 1 and Tez Lib 2 stored on HDFS; TezTasks run under the Node Managers.]



• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying usage
With great power APIs come great responsibilities :)
Tez is a framework on which end user applications can be built


Tez – Execution Performance

• Performance gains over Map Reduce
• Optimal resource management
• Plan reconfiguration at runtime
• Dynamic physical data flow decisions



• Performance gains over Map Reduce
– Eliminate replicated write barrier between successive computations.
– Eliminate job launch overhead of workflow jobs.
– Eliminate extra stage of map reads in every workflow job.
– Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
[Diagram: Pig/Hive - MR vs. Pig/Hive - Tez]



• Plan reconfiguration at runtime
– Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
– Advanced changes in dataflow graph structure.
– Progressive graph construction in concert with user optimizer.
[Diagram: inputs are HDFS Blocks and YARN Resources. Planned: Stage 1 with 50 maps and 100 partitions, Stage 2 with 100 reducers. At runtime, with only 10 GB of data, Stage 2 is changed from 100 to 10 reducers.]
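The reducer-count adjustment in the diagram can be sketched as a small calculation. This is not Tez's actual policy: the function names and the target of 1 GB of input per reducer are assumptions chosen so that the numbers match the slide.

```python
import math

GB = 1024 ** 3


def choose_reducer_count(observed_bytes, planned_reducers, bytes_per_reducer):
    """Shrink the planned parallelism to fit the data actually produced."""
    needed = max(1, math.ceil(observed_bytes / bytes_per_reducer))
    return min(planned_reducers, needed)


def assign_partitions(num_partitions, num_reducers):
    """Give each reducer a contiguous, near-equal range of partitions."""
    base, extra = divmod(num_partitions, num_reducers)
    ranges, start = [], 0
    for reducer in range(num_reducers):
        size = base + (1 if reducer < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# Stage 1 wrote 100 partitions, but the maps produced only 10 GB in total.
reducers = choose_reducer_count(10 * GB, planned_reducers=100,
                                bytes_per_reducer=1 * GB)  # 1 GB is assumed
ranges = assign_partitions(100, reducers)
print(reducers, ranges[0], ranges[-1])
```

The map output keeps its 100 partitions; only the grouping of partitions onto reducers changes, so the decision can be made after the maps have run.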



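• Optimal resource management
– Reuse YARN containers to launch new tasks.
– Reuse YARN containers to enable shared objects across tasks.
[Diagram: the Tez Application Master, in its own YARN Container, sends Start Task to a TezTask Host running in another YARN Container. The host runs TezTask1, reports Task Done, and then receives Start Task for TezTask2. Shared Objects in the container are available to both tasks.]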
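Container reuse with objects shared across tasks can be sketched with a small pool. This is a toy model, not YARN or Tez code: the `Container` and `ContainerPool` classes and the lookup table are hypothetical.

```python
class Container:
    """Toy container: its shared_objects outlive any single task."""

    def __init__(self):
        self.shared_objects = {}
        self.tasks_run = 0

    def run(self, task):
        self.tasks_run += 1
        return task(self.shared_objects)


class ContainerPool:
    """Hands out idle containers before launching new ones."""

    def __init__(self):
        self.idle = []
        self.launched = 0

    def acquire(self):
        if self.idle:
            return self.idle.pop()
        self.launched += 1
        return Container()

    def release(self, container):
        self.idle.append(container)


def lookup_task(key):
    """A task that needs a lookup table and caches it in the container."""
    def task(shared):
        if "table" not in shared:
            # Expensive in a real system; done once per container here.
            shared["loads"] = shared.get("loads", 0) + 1
            shared["table"] = {"CA": "California", "NY": "New York"}
        return shared["table"][key]
    return task


pool = ContainerPool()
results = []
for key in ["CA", "NY", "CA"]:
    container = pool.acquire()
    results.append(container.run(lookup_task(key)))
    pool.release(container)

print(results, "containers launched:", pool.launched)
```

Three tasks run in sequence, but only one container is launched and the table is built only once.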



• Dynamic physical data flow decisions
– Decide the type of physical byte movement and storage on the fly.
– Store intermediate data on distributed store, local store or in-memory.
– Transfer bytes via blocking files or streaming and the spectrum in between.
[Diagram: decided At Runtime. Producer → Local File → Consumer; Producer (small size) → In-Memory → Consumer.]
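Choosing the physical channel from the size of the producer's output can be sketched like this. It is an illustration only: the class names and the 1 KB threshold are assumptions, and a real system would weigh more than size.

```python
import os
import tempfile

IN_MEMORY_LIMIT = 1024  # bytes; hypothetical threshold for this sketch


class InMemoryChannel:
    kind = "in-memory"

    def __init__(self, data):
        self._data = data

    def read(self):
        return self._data

    def close(self):
        self._data = b""


class LocalFileChannel:
    kind = "local-file"

    def __init__(self, data):
        fd, self.path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)

    def read(self):
        with open(self.path, "rb") as handle:
            return handle.read()

    def close(self):
        os.remove(self.path)


def choose_channel(data, limit=IN_MEMORY_LIMIT):
    """Pick the channel only once the producer's output size is known."""
    if len(data) <= limit:
        return InMemoryChannel(data)
    return LocalFileChannel(data)


small = choose_channel(b"x" * 10)
large = choose_channel(b"y" * 5000)
print(small.kind, large.kind)

# The consumer reads the same way from either channel.
small_payload = small.read()
large_payload = large.read()
small.close()
large.close()
```

The consumer code does not change with the channel, which is what allows the decision to be deferred to runtime.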


Tez – Sessions

• Key for interactive queries
• Analogous to database sessions and represents a connection between the user and the cluster
• Run multiple DAGs / queries in the same session
• Maintains a pool of reusable containers for low latency execution of tasks within and across queries
• Takes care of data locality and releasing resources when idle
• Session cache in the Application Master and in the container pool reduces re-computation and re-initialization
[Diagram: the Client sends Start Session and Submit DAG to the Application Master (Task Scheduler), which manages a Container Pool; each container is a Pre-Warmed JVM with a Shared Object Registry.]
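How a session moves container launches out of the query path can be sketched with a log of when each launch happens. This is a toy model, not the Tez client API: the `Session` class, its methods and the container names are hypothetical.

```python
class Session:
    """Toy session: pre-warms containers once, then serves several DAGs."""

    def __init__(self, prewarm=0):
        self.prewarm = prewarm
        self.idle_containers = []
        self.launch_log = []  # (phase in which the launch happened, name)
        self.started = False

    def _launch(self, phase):
        name = f"container-{len(self.launch_log) + 1}"
        self.launch_log.append((phase, name))
        return name

    def start(self):
        # Launch cost is paid here, before any query arrives.
        self.idle_containers = [self._launch("session-start")
                                for _ in range(self.prewarm)]
        self.started = True
        return self

    def submit_dag(self, dag_name, num_tasks):
        if not self.started:
            raise RuntimeError("start the session before submitting a DAG")
        used = []
        for _ in range(num_tasks):
            if self.idle_containers:
                used.append(self.idle_containers.pop())
            else:
                used.append(self._launch(dag_name))
        # Finished tasks return their containers to the pool for later DAGs.
        self.idle_containers.extend(used)
        return used

    def stop(self):
        self.idle_containers.clear()
        self.started = False


session = Session(prewarm=2).start()
first = session.submit_dag("q1", num_tasks=2)
second = session.submit_dag("q2", num_tasks=3)
print([phase for phase, _ in session.launch_log])
```

The first query launches nothing, and the second launches only the one container the pool did not already hold.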


Tez – Benchmark Performance

Significant (but not all) speed-ups due to Tez:
• DAG support and runtime graph re-configuration enable utilizing the parallelism of the cluster
• Tez Session and container re-use enable efficient and low latency execution


Tez – Performance Analysis

TPC-DS – Query 27 with Hive on Tez
• Tez Session populates container pool
• Dimension table calculation and HDFS split generation in parallel
• Dimension tables broadcast to Hive MapJoin tasks
• Final Reducer pre-launched and fetches completed inputs


Tez – Current status

• Apache Incubator Project
– Rapid development. Over 600 jiras opened. Over 400 resolved.
– Growing community of contributors and users.
• Focus on stability
– Testing and quality are highest priority.
– Code ready and deployed on multi-node environments.
• Support for a vast topology of DAGs
– Already functionally equivalent to Map Reduce. Existing Map Reduce jobs can be executed on Tez with few or no changes.
– Hive re-targeted to use Tez for execution of queries (HIVE-4660).
– Work started on Pig to use Tez for execution of scripts (PIG-3446).


Tez – Roadmap

• Richer DAG support
– Support for co-scheduling and streaming
– Better fault tolerance with checkpoints
• Performance optimizations
– More efficiencies in transfer of data
– Improve session performance
• Usability
– Stability and testability
– Recovery and history
– Tools for performance analysis and debugging


Tez – Key Takeaways

• Distributed execution framework that works on computations represented as dataflow graphs

• Naturally maps to execution plans produced by query optimizers

• Customizable execution architecture designed to enable dynamic performance optimizations at runtime

• Works out of the box with the platform figuring out the hard stuff

• Spans the spectrum from interactive latency to batch
• Open source Apache project – your use-cases and code are welcome

• It works and is already being used by Hive and Pig



Thank You!
