presentation by tachyonnexus & baidu at strata singapore 2015

44
Interactive Data Analytics with Spark on Tachyon in Baidu Bin Fan (Tachyon Nexus) [email protected] Xiang Wen (Baidu) wenxiang @ baidu.com ec 02 2015 @ Strata + Hadoop World, Singapor 1

Upload: tachyon-nexus-inc

Post on 11-Jan-2017

1.382 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Interactive Data Analytics with Spark on Tachyon

in Baidu

Bin Fan (Tachyon Nexus)[email protected]

Xiang Wen (Baidu)[email protected]

Dec 02 2015 @ Strata + Hadoop World, Singapore

1

Page 2: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Who Are We?• Bin Fan

• Tachyon Project Contributor

• Software Engineer at Tachyon Nexus

• Xiang Wen

• From Baidu Big Data Department

• Senior Software Engineer at Baidu

2

Page 3: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

• Team consists of Tachyon creators, top contributors

• Series A ($7.5 million) from Andreessen Horowitz

• Committed to Tachyon Open Source Project

• www.tachyonnexus.com

3

Page 4: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Agenda

• Tachyon Basics & New Features• Motivation• Building an interactive data service

– Spark + Tachyon• Future Works

4

Page 5: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

History of Tachyon

• Started at UC Berkeley AMPLab– From summer 2012

• Open sourced– April 2013 (two and half years ago)– Apache License 2.0– Latest Release: Version 0.8.2 (November 2015)

5

Page 6: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

One of the Fastest GrowingBig Data Open Source Project

> 170 Contributors (v0.8) 3x increment over the last year

http://tachyon-project.org/community/

6

Page 7: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

> 50 Organizations

One of the Fastest GrowingBig Data Open Source Project

7

Page 8: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

An Open SourceMemory-Centric

Distributed Storage System

What is

8

Page 9: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Tachyon Stack

9

Page 10: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Memory-speed data sharing

across jobs and frameworksSpark Job

Spark mem

Hadoop MR Job

YARN

HDFS / Amazon S3

HDFSdisk

block 1

block 3

block 2

block 4Tachyonin-memoryData

Why Use Tachyon?Data survive in memory after computation

crashescrash

Off-heap storage, no

GC

10

Page 11: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Enable Faster Innovation in Storage Layer

11

Page 12: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

What if data size exceeds memory capacity?

12

Page 13: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Tiered Storage: Tachyon Manages More Than DRAM

MEMSSD

HDD

Faster

Higher Capacity

13

Page 14: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Configurable Storage TiersMEM only

MEM + HDD

SSD only

14

Page 15: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Pluggable Data Management Policy

Evict stale data to lower tier

Promote hot data to upper

tier

15

Page 16: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Pin Data in Memory

16

Page 17: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Transparent Naming

17

Page 18: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Unified Namespace Across Under Storage Systems

18

Page 19: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

More Features

• Remote Write Support• Easy deployment with Mesos and YARN• Initial Security Support• One Command Cluster Deployment• Metrics Reporting for Clients, Workers, and

Master

19

Page 20: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Rich Choice of Under Storage Supports

20

Page 21: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

How Easy to Use Tachyon in

scala> val file = sc.textFile(“hdfs://foo”)

scala> val file = sc.textFile(“tachyon://foo”)

21

Page 22: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Use Case: a SAAS Company

• Framework: Impala

• Under Storage: S3

• Storage Media: MEM + SSD

• 15x Performance Improvement

22

Page 23: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Use Case: a Biotechnology Company

• Framework: Spark & MapReduce

• Under Storage: GlusterFS

• Storage Media: MEM and SSD

23

Page 24: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

When Tachyon Meets Baidu

~ 100 nodes in deployment, > 1 PB storage space30X Acceleration of our Big Data Analytics Workload

24

Page 25: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Agenda

• Tachyon Basics & New Features• Motivation• Building an interactive data service

– Spark + Tachyon• Future Works

25

Page 26: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Background

Logs

Data Warehouse

Online Data Services26

Hours or Days

Hours

Page 27: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Frustrated data explorers

• Example:– John is a PM and he needs to keep track of the top user

actions for a new feature– Based on the top actions of the day, he will perform

additional analysis– But John is very frustrated that each query takes tens of

minutes to finish

27

Page 28: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

A dedicated service for data exploring

• Manages PBs of data• Most queries within one minute

28

Page 29: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

User Scenario

Web Site Client

Structured Logs

Data Warehouse

Data Marts

Service Gate

Query Engine

select some_action, count(*) from event_tablewhere event_day=‘20151123’group by user_action

Have first try

Try another way

select some_action, event_hour, count(*)from event_tablewhere event_day=‘20151123’group by user_action, event_hour

Not as expected!See what happens in original log

select *from event_logwhere event_day=‘20151123’ and event_hour=’01’limit 10

29

Page 30: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Agenda

• Tachyon Basics & New Features• Motivation• Building an interactive data service

– Spark + Tachyon• Future Works

30

Page 31: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Choose Spark as compute solution

Data Warehouse

BFS

Service GateService Gate

Hive

Map Reduce

4X Improvement but not good enough!

Compute Center

Data Center

31

Page 32: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Baidu File System (BFS)

Data Center

Choose Tachyon as storage solution

Spark Task

Spark mem

Spark Task

Spark mem

HDFSdisk

block 1

block 3

block 2

block 4

Tachyonin-memory

block 1

block 3 block 4

Compute Center• Read from remote data

center: ~ 100 ~ 150 seconds• Read from Tachyon cluster

local node: 10 ~ 15 sec• Read from Tachyon machine

local node: ~ 5 sec

Tachyon Brings 30X Speed-up !

32

Page 33: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Overall Performance

MR (sec) Spark (sec) Spark + Tachyon (sec)

0

200

400

600

800

1000

1200 Setup:1. Use MR to query 6 TB of data2. Use Spark to query 6 TB of

data3. Use Spark + Tachyon to query

6 TB of data

Results:1. Spark + Tachyon achieves 50-

fold speedup compared to MR

33

Page 34: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Architecture

SparkContext

OperationManager

ViewManager

Run Query

Build Cache

Data Warehouse

Cache

Scheduler

Ask & Profile

34

Page 35: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Catalyst helps to be ‘transparent’

lookupRelation

CacheableRelation

Tachyon HDFS

Union HiveTableScanwithUncachedPartitions

HiveTableScanwithCachedPartitions

35

Page 36: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Cache Policy

• Prefetch– Fetch the views daily in advance when system is idle– The views fetched are based on the pattern of the past query

history profiling, e.g. 3 months query logs

• On Demand caches– Fetch the views at runtime when system is serving regular queries.– Using machine learning to generate policy file monthly for

views/tables– When a query is accessing some views, and parts of views match

our pre-generated policy, those views will be cached at that time.

36

Page 37: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Hot Query: Cache Hit

37

Spark

TachyonHDFS

Operation Manager

Query UI

View Manager Cache

Meta

Page 38: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Cold Query: Cache Miss

38

Spark

TachyonHDFS

Operation Manager

Query UI

View Manager Cache

Meta

Page 39: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Daily Stats with Cache

• Daily– Table

• Queries: 100 – 300• Hit Rate: ~40%

– Partition• Queries: 80K – 120K• Hit Rate: ~40 – 50%

• Performance with Cache– avg 2 - 3 time faster than without Cache

39

Page 40: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Agenda

• Tachyon Basics & New Features• Motivation• Building an interactive data service

– Spark + Tachyon• Future Works

40

Page 41: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Improve Caching System

• As Extended Meta Service– Improve legacy schema/input-format– Load block meta into cache layer– Index / Materialized View

• Cost Based Caching/Optimizing– Better performance, hit rate & execution– Lower storage needs for cache layer

41

Page 42: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

More User Scenario

• If John is data scientist– Need a way to construct dataset conveniently – Usually have many tries with same dataset

• An interactive system should help a lot– Spark is an ideal solution

42

Page 43: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Hardware assisted big data infrastructure

• Hardware– GPU– FPGA

• Applications– Accelerate common SQL and ML operators– TableScan && InputFormat && Serde

• Lower down the cost– 10 dollars for big data– 1 more dollar for interactive big data

43

Page 44: Presentation by TachyonNexus & Baidu at Strata Singapore 2015

Q&A

44