an open source memory-centric distributed storage...
Post on 17-Feb-2018
231 Views
Preview:
TRANSCRIPT
Haoyuan Li, Tachyon Nexus & UC Berkeley
November 19, 2015 @ AMPCamp 6
An Open Source Memory-Centric Distributed Storage System
Outline
• Open Source
• Introduction to Tachyon (Before 2015)
• Deployments and New Features
• Getting Involved
2
Background • Started at UC Berkeley AMPLab – From summer 2012
• Open sourced – April 2013 (two and half years ago) – Apache License 2.0 – Latest Release: Version 0.8.2 (November 2015)
• Deployed at > 100 companies
3
4
Tachyon: one of the Fastest GrowingBig Data Open Source Project
Contributors Growth
5
v0.4!Feb ‘14
v0.3!Oct ‘13
v0.2 Apr ‘13
v0.1 Dec ‘12
v0.6!Mar ‘15
v0.5!Jul ‘14
v0.7!Jul ‘15
1 3 15
30
46
70
111
Contributors Growth
6
> 170 Contributors (V0.8) (3x increment over the last AMPCamp)
> 50 Organizations
Thanks to Contributors and Users!
7
h"p://tachyon-project.org/community/
One Tachyon ProductionDeployment Example
• Baidu (Dominant Search Engine in China, ~ 50 Billion USD Market Cap)
• Framework: SparkSQL • Under Storage: Baidu’s File System • Storage Media: MEM + HDD • 100+ nodes deployment • 1PB+ managed space • 30x Performance Improvement
8
Outline
• Open Source
• Introduction to Tachyon (Before 2015)
• New Features
• Getting Involved
9
Tachyon is an Open Source
Memory-centricDistributed
Storage System 10
11
Why Tachyon?
Performance Trend: Memory is Fast
• RAM throughput increasing exponentially
• Disk throughput increasing slowly
12
Memory-locality key to interactive response times
Price Trend: Memory is Cheaper
source:jcmit.com13
Realized by many…
14
15
Is the Problem Solved?
16
Missing a Solution for the Storage Layer
A Use Case Example with -
• Fast, in-memory data processing framework – Keep one in-memory copy inside JVM – Track lineage of operations used to derive data – Upon failure, use lineage to recompute data
map
filter map
join reduce
Lineage Tracking
17
Issue 1
18
Data Sharing is the bottleneck in analytics pipeline:Slow writes to disk
Spark Job1
Spark mem block manager
block 1
block 3
Spark Job2
Spark mem block manager
block 3
block 1
HDFS / Amazon S3 block 1
block 3
block 2
block 4
storage engine & execution engine same process (slow writes)
Issue 1
19
Spark Job
Spark mem block manager
block 1
block 3
Hadoop MR Job
YARN
HDFS / Amazon S3 block 1
block 3
block 2
block 4
Data Sharing is the bottleneck in analytics pipeline:Slow writes to disk
storage engine & execution engine same process (slow writes)
Issue 1 resolved with Tachyon
20
Memory-speed data sharingamong jobs in different
frameworks execution engine & storage engine same process (fast writes)
Spark Job
Spark mem
Hadoop MR Job
YARN
HDFS / Amazon S3 block 1
block 3
block 2
block 4
HDFSdisk
block1
block3
block2
block4Tachyon!in-memory
block 1
block 3 block 4
Issue 2
21
Spark Task
Spark memory block manager
block 1
block 3
HDFS / Amazon S3 block 1
block 3
block 2
block 4
execution engine & storage engine same process
Cache loss when process crashes
Issue 2
22
crash
Spark memory block manager
block 1
block 3
HDFS / Amazon S3 block 1
block 3
block 2
block 4
execution engine & storage engine same process
Cache loss when process crashes
HDFS / Amazon S3
Issue 2
23
block 1
block 3
block 2
block 4
execution engine & storage engine same process
crash
Cache loss when process crashes
HDFS / Amazon S3 block 1
block 3
block 2
block 4 Tachyon!in-memory
block 1
block 3 block 4
Issue 2 resolved with Tachyon
24
Spark Task
Spark memory block manager
execution engine & storage engine same process
Keep in-memory data safe,even when a job crashes.
Issue 2 resolved with Tachyon
25
HDFSdisk
block1
block3
block2
block4
execution engine & storage engine same process
Tachyon!in-memory
block 1
block 3 block 4
crash
HDFS / Amazon S3 block 1
block 3
block 2
block 4
Keep in-memory data safe,even when a job crashes.
HDFS / Amazon S3
Issue 3
26
In-memory Data Duplication & Java Garbage Collection
Spark Job1
Spark mem block manager
block 1
block 3
Spark Job2
Spark mem block manager
block 3
block 1
block 1
block 3
block 2
block 4
execution engine & storage engine same process (duplication & GC)
Issue 3 resolved with Tachyon
27
No in-memory data duplication,much less GC
Spark Job1
Spark mem
Spark Job2
Spark mem
HDFS / Amazon S3 block 1
block 3
block 2
block 4
execution engine & storage engine same process (no duplication & GC)
HDFSdisk
block1
block3
block2
block4Tachyon!in-memory
block 1
block 3 block 4
Previously Mentioned
• A memory-centric storage architecture
• Push lineage down to storage layer
28
Tachyon Memory-Centric Architecture
29
Tachyon Memory-Centric Architecture
30
Lineage in Tachyon
31
Outline
• Open Source
• Introduction to Tachyon (Before 2015)
• Deployments and New Features
• Getting Involved
32
1) Eco-system: Enable new workload in any storage;
Work with the framework of your choice;
33
2) Tachyon running in production environments,
both in the Cloud and on Premise.
34
Use Case: Baidu
• Framework: SparkSQL • Under Storage: Baidu’s File System • Storage Media: MEM + HDD • 100+ nodes deployment • 1PB+ managed space • 30x Performance Improvement
35
Use Case: an Oil Company
• Framework: Spark
• Under Storage: GlusterFS
• Storage Media: MEM only
• Analyzing data in traditional storage
36
Use Case: a SAAS Company
• Framework: Impala
• Under Storage: S3
• Storage Media: MEM + SSD
• 15x Performance Improvement
37
Use Case: a Biotechnology Company
• Framework: Spark & MapReduce
• Under Storage: GlusterFS
• Storage Media: MEM and SSD
38
Use Case: a SAAS Company
• Framework: Spark
• Under Storage: S3
• Storage Media: SSD only
• Elastic Tachyon deployment
39
Use Case: a Retail Company
• Framework: Spark & MapReduce
• Under Storage: HDFS
• Storage Media: MEM
40
Run Everywhere
41
Enable Faster Innovation in Storage Layer
42
What if data size exceeds memory capacity?
43
3) Tiered Storage:Tachyon Manages More Than DRAM
MEM SSD
HDD
Faster
Higher Capacity
44
Configurable Storage Tiers
MEM only
MEM + HHD
SSD only
45
4) Pluggable Data Management Policy
Evict stale data to lower tier
Promote hot data to upper tier
46
Pin Data in Memory
5) Transparent Naming
47
6) Unified Namespace
48
More Features
• 7) Remote Write Support • 8) Easy deployment with Mesos and Yarn • 9) Initial Security Support • 10) One Command Cluster Deployment • 11) Metrics Reporting for Clients, Workers,
and Master
49
12) More Under Storage Supports
50
Reported Tachyon Usage
51
• Team consists of Tachyon creators, top contributors
• Series A ($7.5 million) from Andreessen Horowitz
• Committed to Tachyon Open Source
• http://www.tachyonnexus.com
52
Outline
• Open Source
• Introduction to Tachyon (Before 2015)
• Deployments and New Features
• Getting Involved
53
Memory-Centric Distributed Storage
Welcome to try, contact, and collaborate!
54
JIRA New Contributor Tasks
• Try Tachyon: http://tachyon-project.org
• Develop Tachyon: https://github.com/amplab/tachyon
• Meet Friends: http://www.meetup.com/Tachyon
• Get News: http://goo.gl/mwB2sX
• Tachyon Nexus: http://www.tachyonnexus.com • Contact us: haoyuan@tachyonnexus.com
55
top related