How Apache Arrow and Parquet boost cross-language interoperability


Page 1: How Apache Arrow and Parquet boost cross-language interoperability

Uwe L. Korn · PyData Paris · 14th June 2016

How Apache Arrow and Parquet boost cross-language interop

Page 2: How Apache Arrow and Parquet boost cross-language interoperability

About me

• Data Scientist at Blue Yonder (@BlueYonderTech)
• We optimize Replenishment and Pricing for the Retail industry with Predictive Analytics
• Contributor to Apache {Arrow, Parquet}
• Work in Python, Cython, C++11 and SQL

Page 3: How Apache Arrow and Parquet boost cross-language interoperability

Agenda

• The Problem
• Arrow
• Parquet
• Outlook

Page 4: How Apache Arrow and Parquet boost cross-language interoperability

Why is columnar better?

Image source: https://arrow.apache.org/img/simd.png ( https://arrow.apache.org/ )
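To make the slide concrete: in a columnar layout each field lives in one contiguous array, so a reduction over a column is a single vectorized, SIMD-friendly pass over adjacent memory. A minimal NumPy sketch (the record layout and sizes are illustrative, not from the talk):

```python
import numpy as np

N = 1_000_000

# Row-wise: a list of records; summing one field walks every record
# and chases a pointer per row.
rows = [{"id": i, "price": float(i)} for i in range(N)]
total_rowwise = sum(r["price"] for r in rows)

# Columnar: the same field as one contiguous array; NumPy reduces it
# in a single vectorized pass the CPU can run with SIMD instructions.
price = np.arange(N, dtype=np.float64)
total_columnar = price.sum()

assert total_rowwise == total_columnar
```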

Page 5: How Apache Arrow and Parquet boost cross-language interoperability

Different Systems - Varying Python Support

• Various levels of Python support:
  • built in Python
  • Python API
  • no Python at all
• Each tool/algorithm works on columnar data
• Separate conversion routines for each pair of systems:
  • causes overhead
  • there is no one-size-fits-all solution

Image source: https://arrow.apache.org/img/copy2.png ( https://arrow.apache.org/ )

Page 6: How Apache Arrow and Parquet boost cross-language interoperability

Apache Arrow

• Specification for in-memory columnar data layout

• No overhead for cross-system / cross-language communication

• Designed for efficiency (exploit SIMD, cache locality, ..)

• Supports nested data structures

Image source: https://arrow.apache.org/img/shared2.png ( https://arrow.apache.org/ )
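The layout is easy to inspect with today's pyarrow (which post-dates this talk): a nullable int64 array is just a validity bitmap plus one contiguous, fixed-width values buffer. A minimal sketch:

```python
import pyarrow as pa

arr = pa.array([1, None, 3], type=pa.int64())

# Arrow's fixed-width layout: buffer 0 is the validity bitmap,
# buffer 1 the contiguous int64 values.
validity, values = arr.buffers()

print(arr.null_count)  # 1
print(values.size)     # 24 bytes: 3 slots * 8 bytes, nulls included
```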

Page 7: How Apache Arrow and Parquet boost cross-language interoperability

Apache Arrow - The Impact

• An example: retrieve a dataset from an MPP database and analyze it in Pandas
  • Run a query in the DB
  • Pass the result in columnar form to the DB driver
  • The ODBC layer transforms it into row-wise form
  • Pandas makes it columnar again
• Ugly real-life workaround: export as CSV, bypass ODBC
• In future: use Arrow as the interface between the DB and Pandas (see the sketch below)
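A sketch of that future interface with today's pyarrow API; the hand-built table stands in for whatever an Arrow-speaking DB driver would return:

```python
import pyarrow as pa

# Stand-in for the columnar result an Arrow-aware DB driver would hand over.
table = pa.table({"shop": ["a", "b", "a"], "sales": [10, 20, 30]})

# Columnar to columnar: no row-wise ODBC intermediate, often zero-copy.
df = table.to_pandas()

# And back, e.g. to ship results onward to another Arrow-speaking system.
table2 = pa.Table.from_pandas(df)
```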

Page 8: How Apache Arrow and Parquet boost cross-language interoperability

Apache Arrow

• Top-level Apache project from the beginning
• Not only a specification: also includes C++ / Java / Python / .. code
  • Arrow structures / classes
  • RPC (upcoming) & IPC (alpha) support
  • Conversion code for Parquet, Pandas, ..
• Combined effort from developers of over 13 major OSS projects
  • Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, ..

• Spec: https://github.com/apache/arrow/blob/master/format/Layout.md

Page 9: How Apache Arrow and Parquet boost cross-language interoperability

Arrow in Action: Feather

• Language-agnostic file format for binary data frame storage

• Read performance close to raw disk I/O

• by Wes McKinney (Python) and Hadley Wickham (R)

• Julia Support in progress

[Diagram: a Feather file consists of Arrow arrays plus Feather metadata (flatbuffers)]
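The round trip as it looks with the pandas APIs that later emerged from this work (to_feather/read_feather, backed by pyarrow); a minimal sketch, file name illustrative:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Feather stores the frame as Arrow arrays plus flatbuffers metadata.
df.to_feather("frame.feather")

# The same file is readable from R (and, per the slide, Julia eventually).
same_df = pd.read_feather("frame.feather")
```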

Page 10: How Apache Arrow and Parquet boost cross-language interoperability

Apache Parquet

Page 11: How Apache Arrow and Parquet boost cross-language interoperability

Apache Parquet

• Binary file format for nested columnar data
• Inspired by Google's Dremel paper
• Space- and query-efficient:
  • multiple encodings
  • predicate pushdown
  • column-wise compression
• Many tools use Parquet as their default input format
• Very popular in the JVM/Hadoop-based world

Page 12: How Apache Arrow and Parquet boost cross-language interoperability

The Basics

• One file, includes metadata
• Several row groups
  • all with the same number of column chunks
  • n pages per column chunk
• Benefits:
  • pre-partitioned for fast distributed access
  • statistics in the metadata for predicate pushdown

Blogpost by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet

[Diagram: File ⊃ Row Group ⊃ Column Chunk ⊃ Page]
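The hierarchy is visible in the file metadata. A sketch with pyarrow.parquet, whose native Python support landed after this talk; file name and data are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(pa.table({"x": [1, 2, 3, 4]}), "data.parquet")

meta = pq.ParquetFile("data.parquet").metadata
print(meta.num_row_groups)       # row groups in the file
rg = meta.row_group(0)
print(rg.num_columns)            # column chunks in this row group
stats = rg.column(0).statistics
print(stats.min, stats.max)      # metadata that enables predicate pushdown
```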

Page 13: How Apache Arrow and Parquet boost cross-language interoperability

Using Parquet in Python

• You can already use it from Python today:
  • sqlContext.read.parquet("..").toPandas()
  • needs to pass through Spark, very slow
• Native Python support is on its way (see the sketch below):
  • Parquet I/O to Arrow
  • Arrow provides NumPy conversion
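The native path the slide anticipates has since shipped in pyarrow; a minimal sketch of Parquet I/O through Arrow, with no Spark detour:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": [1, 2, 3]})

# Pandas -> Arrow -> Parquet on disk ...
pq.write_table(pa.Table.from_pandas(df), "data.parquet")

# ... and back: Parquet I/O to Arrow, Arrow handles the NumPy conversion.
df2 = pq.read_table("data.parquet").to_pandas()
```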

Page 14: How Apache Arrow and Parquet boost cross-language interoperability

State of Arrow & Parquet

Arrow: in-memory spec for columnar data

• Java (beta)
• C++ (in progress)
• Python (in progress)
• Planned: Julia, R

Parquet: columnar on-disk storage

• Java (mature)
• C++ (in progress)
• Python (in progress)
• Planned: Julia, R

Page 15: How Apache Arrow and Parquet boost cross-language interoperability

Upcoming

• Parquet <-> Arrow <-> Pandas
• IPC on its way (see the sketch below):
  • alpha implementation using memory-mapped files
  • JVM <-> native with shared reference counting
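How the memory-mapped IPC path looks with the pyarrow API that eventually shipped; a hedged sketch, file name illustrative:

```python
import pyarrow as pa

table = pa.table({"x": [1, 2, 3]})

# Write the table in Arrow's IPC file format.
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Another process can memory-map the same file and read the arrays
# with (near) zero copies.
with pa.memory_map("data.arrow", "r") as source:
    loaded = pa.ipc.open_file(source).read_all()
```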

Page 16: How Apache Arrow and Parquet boost cross-language interoperability

Get Involved!

• dev@arrow.apache.org & dev@parquet.apache.org
• https://apachearrowslackin.herokuapp.com/
• https://arrow.apache.org/
• https://parquet.apache.org/
• @ApacheArrow & @ApacheParquet

Page 17: How Apache Arrow and Parquet boost cross-language interoperability

Questions?!