effective spark with alluxio at strata+hadoop world san jose 2017
TRANSCRIPT
EFFECTIVE SPARK WITH ALLUXIO
Strata + Hadoop World San Jose 2017
Calvin Jia, Alluxio Inc.
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
BIG DATA ECOSYSTEM ISSUES
BIG DATA ECOSYSTEM WITH ALLUXIO
Unifying Data at Memory Speed
[Diagram: application-side interfaces: FUSE Compatible File System, Hadoop Compatible File System, Native Key-Value Interface, Native File System; storage-side interfaces: GlusterFS Interface, Amazon S3 Interface, Swift Interface, HDFS Interface]
FASTEST GROWING BIG DATA PROJECTS
• Fastest growing open-source project in the big data ecosystem
• 400+ contributors from 100+ organizations
• Running in large production clusters
• Community members are welcome!
[Chart: Popular Open Source Projects’ Growth; x-axis: Months; y-axis: Number of Contributors]
ACCELERATE I/O TO/FROM REMOTE STORAGE
The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.
- Baidu
RESULTS
• Data queries are now 30x faster with Alluxio
• The Alluxio cluster runs stably and provides over 50 TB of RAM space
• With Alluxio, batch queries that usually took over 15 minutes became interactive queries taking less than 30 seconds
Baidu’s PMs and analysts run interactive queries to gain insights into their products and business
• 200+ node deployment
• 2+ petabytes of storage
• Mix of memory + HDD
[Diagram: ALLUXIO, Baidu File System]
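The figures on this slide can be cross-checked with simple arithmetic. The sketch below is only an illustration: the `speedup` helper is a hypothetical name, and it assumes the "30x faster" bullet refers to the batch-query case, which the slide does not state explicitly.

```python
def speedup(before_seconds, after_seconds):
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_seconds / after_seconds


# Quoted range: 100-150 seconds with Spark SQL alone, 10-15 seconds with Alluxio.
quote_low = speedup(100, 10)
quote_high = speedup(150, 15)

# Batch case: over 15 minutes down to under 30 seconds, so this is a lower bound.
batch = speedup(15 * 60, 30)

print(quote_low, quote_high, batch)
```

Under these assumptions the quoted query range works out to roughly 10x, and the batch-to-interactive case to at least 30x.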
GUARANTEE PERFORMANCE TO MEET SLAS
RESULTS
• Alluxio gives the system smooth performance in an environment with highly variable load
• Without Alluxio, critical jobs could take more than 4x the expected time
• With Alluxio’s guaranteed performance, business logic could be executed reliably, leading to an improved user experience
A leading online retailer uses Alluxio to compute conversion attribution
• Real-time machine learning
• 10+ TB of storage
• Mix of memory + HDD
[Diagram: ALLUXIO]
SHARE DATA ACROSS JOBS @ MEMORY SPEED
Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity.
- Barclays
RESULTS
• Barclays workflow iteration time decreased from hours to seconds
• Alluxio enabled workflows that were impossible before
• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds
Barclays uses query and machine learning to train models for risk management
• 6-node deployment
• 1 TB of storage
• Memory only
[Diagram: ALLUXIO, Relational Database]
TRANSPARENTLY MANAGE DATA ACROSS STORAGE SYSTEMS
We’ve been running Alluxio in production for over 9 months. Alluxio’s unified namespace enables different applications and frameworks to easily interact with data from different storage systems.
- Qunar
RESULTS
• Spark Streaming, Spark batch, and Flink jobs share data efficiently through Alluxio
• Improved the performance of their system with 15x – 300x speedups
• Tiered storage feature manages storage resources including memory, SSD and disk
Qunar uses real-time machine learning for their website ads.
• 200+ node deployment
• 6 billion logs (4.5 TB) daily
• Mix of memory + HDD
[Diagram: ALLUXIO]
CONSOLIDATING MEMORY
Storage Engine & Execution Engine: Same Process
• Two copies of data in memory – double the memory used
• Inter-process sharing slowed down by network / disk I/O
[Diagram: two Spark processes, each with Spark Compute and Spark Storage holding block 1 and block 3; HDFS / Amazon S3 holding blocks 1 to 4]
CONSOLIDATING MEMORY
Storage Engine & Execution Engine: Different Process
• Half the memory used
• Inter-process sharing happens at memory speed
[Diagram: two Spark processes, each with Spark Compute and Spark Storage; Alluxio holding block 1, block 3 and block 4; HDFS / Amazon S3 (HDFS disk) holding blocks 1 to 4]
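The memory claim on the two CONSOLIDATING MEMORY slides can be expressed as a small calculation. This is a minimal sketch with illustrative numbers: the function name and the 10 GB dataset are assumptions, not figures from the talk.

```python
def cached_memory_gb(num_jobs, dataset_gb, shared_in_alluxio):
    """Memory needed to keep one dataset cached for `num_jobs` Spark jobs."""
    if shared_in_alluxio:
        # Storage lives in a separate process: one copy serves every job.
        return dataset_gb
    # Storage lives inside each Spark process: every job caches its own copy.
    return num_jobs * dataset_gb


# Two jobs working on the same 10 GB dataset (illustrative values only).
in_process = cached_memory_gb(2, 10, shared_in_alluxio=False)
shared = cached_memory_gb(2, 10, shared_in_alluxio=True)

print(in_process, shared)
```

With two jobs, the shared layout needs half the memory of the in-process layout, which matches the "half the memory used" bullet.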
DATA RESILIENCE DURING CRASH
Storage Engine & Execution Engine: Same Process
[Diagram: Spark process with Spark Compute and Spark Storage holding block 1 and block 3; HDFS / Amazon S3 holding blocks 1 to 4]
[Diagram after CRASH: only HDFS / Amazon S3 with blocks 1 to 4 remains]
• Process crash requires network and/or disk I/O to re-read data
DATA RESILIENCE DURING CRASH
Storage Engine & Execution Engine: Different Process
[Diagram: Spark process with Spark Compute and Spark Storage; Alluxio holding block 1, block 3 and block 4; HDFS / Amazon S3 (HDFS disk) holding blocks 1 to 4]
[Diagram after CRASH: Alluxio and HDFS / Amazon S3 remain, with their blocks intact]
• Process crash: data is re-read at memory speed
ACCESSING ALLUXIO DATA FROM SPARK
Writing Data: Write to an Alluxio file
Reading Data: Read from an Alluxio file
CODE EXAMPLE FOR SPARK RDDS
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
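The RDD calls on this slide can be fleshed out into a slightly fuller PySpark sketch. Several things here are assumptions rather than part of the talk: the master address `alluxio-master:19998` is a placeholder (19998 is the default Alluxio master port), the helper names are hypothetical, and the Alluxio client jar is assumed to be on the Spark driver and executor classpath already. The slide's `saveAsObjectFile` / `objectFile` pair comes from the Scala/Java API, so the sketch uses PySpark's `saveAsPickleFile` / `pickleFile` instead.

```python
# Hypothetical Alluxio master address; replace with your own host and port.
ALLUXIO_MASTER = "alluxio-master:19998"


def alluxio_uri(path, master=ALLUXIO_MASTER):
    """Build an alluxio:// URI for an absolute path in the Alluxio namespace."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute, e.g. /data/lines")
    return "alluxio://" + master + path


def round_trip_text(sc, rdd, path):
    """Write an RDD of strings to Alluxio, then read it back as a new RDD.

    `sc` is an existing SparkContext. The target path must not exist yet,
    because saveAsTextFile does not overwrite existing output.
    """
    uri = alluxio_uri(path)
    rdd.saveAsTextFile(uri)   # writes one part file per partition
    return sc.textFile(uri)   # readable by any later job or context


def round_trip_pickle(sc, rdd, path):
    """Same round trip for arbitrary picklable Python objects."""
    uri = alluxio_uri(path)
    rdd.saveAsPickleFile(uri)
    return sc.pickleFile(uri)
```

A call such as `round_trip_text(sc, sc.parallelize(["a", "b"]), "/demo/lines")` would write to `alluxio://alluxio-master:19998/demo/lines` under these assumptions.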
CODE EXAMPLE FOR SPARK DATAFRAMES
Writing to Alluxio
df.write.parquet(alluxioPath)
Reading from Alluxio
df = spark.read.parquet(alluxioPath)
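A DataFrame round trip through Alluxio might look like the PySpark sketch below. It assumes Spark 2.x, where `read` is reached through a `SparkSession` (conventionally named `spark`) rather than a `SparkContext`. The URI, the helper names, and the choice of overwrite mode are illustrative assumptions, and the Alluxio client jar is again assumed to be on the Spark classpath.

```python
# Hypothetical location; host, port and path are placeholders.
ALLUXIO_PATH = "alluxio://alluxio-master:19998/tables/events"


def save_to_alluxio(df, uri=ALLUXIO_PATH):
    """Write a DataFrame to Alluxio as Parquet, replacing any earlier copy."""
    df.write.mode("overwrite").parquet(uri)
    return uri


def load_from_alluxio(spark, uri=ALLUXIO_PATH):
    """Read the Parquet data back; `spark` is a SparkSession."""
    return spark.read.parquet(uri)
```

Because the data lives in Alluxio rather than inside one Spark application, a different job or a restarted session could call `load_from_alluxio(spark)` on the same URI.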
ENVIRONMENT
Spark 2.0.0 + Alluxio 1.2.0
Single worker: Amazon r3.2xlarge
Comparisons:
• Alluxio
• Spark Storage Level: MEMORY_ONLY
• Spark Storage Level: MEMORY_ONLY_SER
• Spark Storage Level: DISK_ONLY
READING CACHED RDD
[Chart: Time [seconds] (0 to 250) vs. RDD Size [GB] (0 to 50); series: Alluxio (textFile), Alluxio (objectFile), DISK_ONLY, MEMORY_ONLY_SER, MEMORY_ONLY]
NEW CONTEXT: READ 50 GB RDD (SSD)
[Bar chart: Time [seconds] (0 to 250); bars: Alluxio (textFile), Alluxio (objectFile), No Alluxio]
NEW CONTEXT: READ 50 GB RDD (S3)
[Bar chart: Time [seconds] (0 to 800); bars: Alluxio (textFile), Alluxio (objectFile), No Alluxio]
7x speedup
16x speedup
READING CACHED DATAFRAME (PARQUET)
[Chart: Time [seconds] (0 to 250) vs. DataFrame Size [GB] (0 to 50); series: Alluxio (textFile), MEMORY_ONLY_SER, MEMORY_ONLY]
NEW CONTEXT: READ 50 GB DATAFRAME (SSD)
[Bar chart: Time [seconds] (0 to 250); bars: Alluxio, No Alluxio]
NEW CONTEXT: READ 50 GB DATAFRAME (S3)
[Bar chart: Time [seconds] (0 to 1750); bars: Alluxio, No Alluxio]
10x average speedup, 17x peak speedup
CONCLUSION
• Easy to use with Spark
• Predictable and improved performance
• Easy to connect to various storage systems
Contact: [email protected]
Twitter: @jiacalvin
Websites: www.alluxio.com and www.alluxio.org
Thank you!
BENEFITS
• Unification: New workflows across any data in any storage system
• Performance: Orders of magnitude improvement in run time
• Flexibility: Choice in compute and storage – grow each independently, buy only what is needed