effective spark with alluxio at strata+hadoop world san jose 2017

EFFECTIVE SPARK WITH ALLUXIO

Strata + Hadoop World San Jose 2017

Calvin Jia, Alluxio Inc.

OUTLINE

• Alluxio Overview

• Alluxio + Spark Use Cases

• Using Alluxio with Spark

• Performance Evaluation


BIG DATA ECOSYSTEM YESTERDAY

BIG DATA ECOSYSTEM TODAY

BIG DATA ECOSYSTEM ISSUES

BIG DATA ECOSYSTEM WITH ALLUXIO

Unifying Data at Memory Speed

Interfaces for applications:

• Native File System

• Hadoop Compatible File System

• Native Key-Value Interface

• FUSE Compatible File System

Interfaces to storage systems:

• HDFS Interface

• Amazon S3 Interface

• Swift Interface

• GlusterFS Interface

• Fastest growing open-source project in the big data ecosystem

• 400+ contributors from 100+ organizations

• Running in large production clusters

• Community members are welcome!

FASTEST GROWING BIG DATA PROJECTS

[Chart: Popular Open Source Projects’ Growth. x-axis: Months; y-axis: Number of Contributors]

ALLUXIO + SPARK USE CASES

ACCELERATE I/O TO/FROM REMOTE STORAGE

The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.

- Baidu

RESULTS

• Data queries are now 30x faster with Alluxio

• Alluxio cluster runs stably, providing over 50TB of RAM space

• With Alluxio, batch queries that usually lasted over 15 minutes became interactive queries taking less than 30 seconds
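As a quick sanity check on the figures above, a speedup is the ratio of the old run time to the new one. The sketch below applies that ratio to the numbers quoted on this slide. The function name `speedup` is illustrative, pairing the low and high ends of the quoted ranges is an assumption, and the slides do not state which measurement the 30x figure was derived from.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    if after_seconds <= 0:
        raise ValueError("after_seconds must be positive")
    return before_seconds / after_seconds


# Batch query of 15 minutes reduced to an interactive query of 30 seconds.
batch_to_interactive = speedup(15 * 60, 30)

# Range quoted by Baidu: 100-150 s with Spark SQL alone, 10-15 s with Alluxio.
# Assumption: the low ends and the high ends of the two ranges correspond.
quoted_low = speedup(100, 10)
quoted_high = speedup(150, 15)

print(batch_to_interactive, quoted_low, quoted_high)
```

Because the slide says "over 15 minutes" and "less than 30 seconds", the first ratio is a lower bound rather than an exact value.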

Baidu’s PMs and analysts run interactive queries to gain insights into their products and business

• 200+ nodes deployment

• 2+ petabytes of storage

• Mix of memory + HDD

[Diagram labels: Alluxio, Baidu File System]

GUARANTEE PERFORMANCE TO MEET SLAS

RESULTS

• Alluxio gives the system smooth performance in an environment with highly variable load

• Without Alluxio, critical jobs could take more than 4x the expected time

• With Alluxio’s guaranteed performance, business logic can be executed reliably, leading to an improved user experience

• Real-time machine learning

• 10+ TB of storage

• Mix of Memory + HDD

A leading online retailer uses Alluxio to compute conversion attribution

SHARE DATA ACROSS JOBS @ MEMORY SPEED

Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity.

- Barclays

RESULTS

• Barclays’ workflow iteration time decreased from hours to seconds

• Alluxio enabled workflows that were impossible before

• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds

Barclays uses query and machine learning to train models for risk management

• 6 node deployment

• 1TB of storage

• Memory only

[Diagram labels: Alluxio, Relational Database]

TRANSPARENTLY MANAGE DATA ACROSS STORAGE SYSTEMS

We’ve been running Alluxio in production for over 9 months. Alluxio’s unified namespace enables different applications and frameworks to easily interact with data from different storage systems.

- Qunar

RESULTS

• Spark Streaming, Spark batch, and Flink jobs share data efficiently

• System performance improved, with 15x to 300x speedups

• Tiered storage feature manages storage resources including memory, SSD and disk

• 200+ nodes deployment

• 6 billion logs (4.5 TB) daily

• Mix of Memory + HDD

Qunar uses real-time machine learning for their website ads

USING ALLUXIO WITH SPARK

CONSOLIDATING MEMORY

Storage Engine & Execution Engine: Same Process

• Two copies of data in memory – double the memory used

• Inter-process Sharing Slowed Down by Network / Disk I/O

[Diagram: two Spark processes (Spark Compute + Spark Storage), each holding its own copy of block 1 and block 3 in Spark Storage; HDFS / Amazon S3 holds blocks 1 to 4]

Storage Engine & Execution Engine: Different Process

• Half the memory used

• Inter-process Sharing Happens at Memory Speed

[Diagram: two Spark processes (Spark Compute + Spark Storage) reading shared blocks held in Alluxio; HDFS / disk holds blocks 1 to 4]

DATA RESILIENCE DURING CRASH

Storage Engine & Execution Engine: Same Process

• Process Crash Requires Network and/or Disk I/O to Re-read Data

[Diagram sequence: a Spark process (Spark Compute + Spark Storage) caches block 1 and block 3; when the process crashes, the cached blocks are lost with it]

Storage Engine & Execution Engine: Different Process

• Process Crash – Data is Re-read at Memory Speed

[Diagram sequence: blocks are held in Alluxio, backed by HDFS / disk; when the Spark process crashes, the blocks remain in Alluxio]

ACCESSING ALLUXIO DATA FROM SPARK

Writing Data: Write to an Alluxio file

Reading Data: Read from an Alluxio file
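Spark reaches Alluxio through the Hadoop Compatible File System interface, so an Alluxio file is addressed by a URI with the `alluxio://` scheme. The sketch below builds such a URI. The helper name `alluxio_path` and the host `localhost` are placeholders chosen for illustration; 19998 is Alluxio's default master RPC port, and a real deployment may use a different host or port.

```python
import posixpath

DEFAULT_MASTER_PORT = 19998  # Alluxio's default master RPC port


def alluxio_path(path: str, host: str = "localhost", port: int = DEFAULT_MASTER_PORT) -> str:
    """Build an alluxio:// URI for a file or directory in the Alluxio namespace."""
    # Normalize so that "data//events/" and "/data/events" give the same URI.
    normalized = posixpath.normpath("/" + path.lstrip("/"))
    return f"alluxio://{host}:{port}{normalized}"


print(alluxio_path("/data/events.txt"))
```

The resulting string can be passed anywhere Spark expects a file path, provided the Alluxio client library is available on Spark's classpath.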

CODE EXAMPLE FOR SPARK RDDS

Writing RDD to Alluxio

rdd.saveAsTextFile(alluxioPath)

rdd.saveAsObjectFile(alluxioPath)

Reading RDD from Alluxio

rdd = sc.textFile(alluxioPath)

rdd = sc.objectFile(alluxioPath)
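The write and read calls above can be combined into a round trip. The following is a minimal PySpark sketch, not the speaker's benchmark code. The text-file calls are the same in PySpark, while `saveAsObjectFile` and `objectFile` belong to the Scala/Java API; the sketch uses their closest PySpark counterparts, `saveAsPickleFile` and `pickleFile`. The function names are hypothetical, and `sc` is assumed to be an existing SparkContext.

```python
def round_trip_text(sc, rdd, alluxio_path):
    """Write an RDD of strings to Alluxio as text, then read it back as a new RDD."""
    rdd.saveAsTextFile(alluxio_path)   # one part-file per partition under alluxio_path
    return sc.textFile(alluxio_path)   # lazily reads the lines back


def round_trip_pickle(sc, rdd, alluxio_path):
    """Same round trip using PySpark's pickle format, which preserves Python objects."""
    rdd.saveAsPickleFile(alluxio_path)
    return sc.pickleFile(alluxio_path)
```

With a live SparkContext, `round_trip_text(sc, sc.parallelize(["a", "b"]), "alluxio://localhost:19998/tmp/demo")` would write two lines and return an RDD that reads them back, assuming the Alluxio client is on Spark's classpath and the target directory does not already exist.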

CODE EXAMPLE FOR SPARK DATAFRAMES

Writing to Alluxio

df.write.parquet(alluxioPath)

Reading from Alluxio (spark is a SparkSession)

df = spark.read.parquet(alluxioPath)
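For DataFrames, the same pattern can be wrapped in two small helpers so that one job saves a table to Alluxio and a later job, possibly in a new Spark context, loads it again. This is a sketch under assumptions: the helper names are hypothetical, `spark` is assumed to be a SparkSession (Spark 2.x), and the `overwrite` save mode is an illustrative choice that the slides do not specify.

```python
def save_to_alluxio(df, alluxio_path, mode="overwrite"):
    """Write a DataFrame to Alluxio as Parquet files."""
    # mode("overwrite") replaces existing data at the path; Spark's default
    # save mode would raise an error if the path already exists.
    df.write.mode(mode).parquet(alluxio_path)


def load_from_alluxio(spark, alluxio_path):
    """Read Parquet files from Alluxio into a DataFrame."""
    return spark.read.parquet(alluxio_path)
```

Because the data lives in Alluxio rather than in Spark's own storage, the DataFrame returned by `load_from_alluxio` does not depend on the process that wrote it still being alive.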

PERFORMANCE EVALUATION

ENVIRONMENT

Spark 2.0.0 + Alluxio 1.2.0

Single worker: Amazon r3.2xlarge

Comparisons:

• Alluxio

• Spark Storage Level: MEMORY_ONLY

• Spark Storage Level: MEMORY_ONLY_SER

• Spark Storage Level: DISK_ONLY

READING CACHED RDD

[Chart: Time [seconds] (0 to 250) vs. RDD Size [GB] (0 to 50). Series: Alluxio (textFile), Alluxio (objectFile), DISK_ONLY, MEMORY_ONLY_SER, MEMORY_ONLY]

NEW CONTEXT: READ 50 GB RDD (SSD)

[Bar chart: Time [seconds] (0 to 250) for Alluxio (textFile), Alluxio (objectFile), and No Alluxio]

NEW CONTEXT: READ 50 GB RDD (S3)

[Bar chart: Time [seconds] (0 to 800) for Alluxio (textFile), Alluxio (objectFile), and No Alluxio. Annotations: 7x speedup, 16x speedup]

READING CACHED DATAFRAME (PARQUET)

[Chart: Time [seconds] (0 to 250) vs. DataFrame Size [GB] (0 to 50). Series: Alluxio (textFile), MEMORY_ONLY_SER, MEMORY_ONLY]

NEW CONTEXT: READ 50 GB DATAFRAME (SSD)

[Bar chart: Time [seconds] (0 to 250) for Alluxio and No Alluxio]

NEW CONTEXT: READ 50 GB DATAFRAME (S3)

[Bar chart: Time [seconds] (0 to 1750) for Alluxio and No Alluxio]

10x average speedup, 17x peak speedup

CONCLUSION

• Easy to use with Spark

• Predictable and improved performance

• Easily connects to various storage systems


Contact: [email protected]

Twitter: @jiacalvin

Websites: www.alluxio.com and www.alluxio.org

Thank you!


BENEFITS

• Unification: new workflows across any data in any storage system

• Performance: orders of magnitude improvement in run time

• Flexibility: choice in compute and storage; grow each independently, buy only what is needed
