How Apache Arrow and Parquet Boost Cross-Language Interoperability
TRANSCRIPT
Uwe L. Korn, PyData Paris, 14 June 2016
About me
• Data Scientist at Blue Yonder (@BlueYonderTech)
• We optimize replenishment and pricing for the retail industry with predictive analytics
• Contributor to Apache {Arrow, Parquet}
• Work in Python, Cython, C++11 and SQL
Agenda
• The Problem
• Arrow
• Parquet
• Outlook
Why is columnar better?
Image source: https://arrow.apache.org/img/simd.png ( https://arrow.apache.org/ )
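A rough illustration of the point (not from the talk): with NumPy, summing one field of a row-wise record array strides over interleaved memory, while the same aggregation over a contiguous column is a single packed, SIMD-friendly pass.

    import numpy as np

    # Row-wise layout: fields interleave in memory, so summing one
    # field strides over the records and defeats SIMD/cache locality.
    rows = np.zeros(1000000, dtype=[("price", "f8"), ("qty", "i8")])
    total_row_wise = rows["price"].sum()   # strided access

    # Columnar layout: the same values packed into one contiguous
    # buffer, aggregated in a single vectorized pass.
    prices = np.ascontiguousarray(rows["price"])
    total_columnar = prices.sum()          # contiguous, SIMD-friendly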
Different Systems - Varying Python Support
• Various levels of Python support:
  • built in Python
  • Python API
  • no Python at all
• Each tool/algorithm works on columnar data
• Separate conversion routines for each pair of systems:
  • causes overhead
  • there is no one-size-fits-all solution
Image source: https://arrow.apache.org/img/copy2.png ( https://arrow.apache.org/ )
Apache Arrow
• Specification for in-memory columnar data layout
• No overhead for cross-system / cross-language communication
• Designed for efficiency (exploit SIMD, cache locality, ..)
• Supports nested data structures
Image source: https://arrow.apache.org/img/shared2.png ( https://arrow.apache.org/ )
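A minimal sketch of these properties, using the pyarrow API as it exists today (the API post-dates this 2016 talk): flat arrays carry a validity bitmap for nulls, and nested lists are stored as offsets plus a contiguous child array.

    import pyarrow as pa

    # Flat column: a values buffer plus a validity bitmap for the null.
    ints = pa.array([1, 2, None, 4], type=pa.int64())
    print(ints.null_count)  # 1

    # Nested column: a list array is offsets plus a contiguous child
    # array of values, still fully columnar.
    nested = pa.array([[1, 2], [], [3, 4, 5]], type=pa.list_(pa.int64()))
    print(nested.type)      # list<item: int64>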
Apache Arrow - The Impact
• An example: retrieve a dataset from an MPP database and analyze it in Pandas
  • Run a query in the DB
  • The DB passes the result in columnar form to the DB driver
  • The ODBC layer transforms it into row-wise form
  • Pandas makes it columnar again
• Ugly real-life workaround: export as CSV, bypassing ODBC
• In future: use Arrow as the interface between the DB and Pandas (a sketch follows below)
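A minimal sketch of that future interface, assuming today's pyarrow API; the fetch_arrow_table driver call is hypothetical, shown only to mark where a columnar result set would enter.

    import pyarrow as pa

    # Hypothetical driver call; fetch_arrow_table is made up here to
    # mark where a columnar result set would come from:
    # table = conn.fetch_arrow_table("SELECT price, qty FROM sales")
    table = pa.Table.from_arrays(
        [pa.array([9.99, 4.50]), pa.array([3, 7])],
        names=["price", "qty"],
    )

    # The data is already columnar, so the Pandas hand-off needs no
    # row-to-column pivot.
    df = table.to_pandas()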
Apache Arrow
• Top-level Apache project from the beginning
• Not only a specification: also includes C++ / Java / Python / .. code
  • Arrow structures / classes
  • RPC (upcoming) & IPC (alpha) support
  • conversion code for Parquet, Pandas, ..
• Combined effort from developers of over 13 major OSS projects
  • Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, ..
• Spec: https://github.com/apache/arrow/blob/master/format/Layout.md
Arrow in Action: Feather
• Language-agnostic file format for binary data frame storage
• Read performance close to raw disk I/O
• Created by Wes McKinney (Python) and Hadley Wickham (R)
• Julia support in progress
[Diagram: a Feather file consists of Arrow arrays plus Feather metadata (flatbuffers)]
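In code, using the feather-format package of the time (the file name sales.feather is just an example):

    import pandas as pd
    import feather  # the feather-format package

    df = pd.DataFrame({"price": [9.99, 4.50], "qty": [3, 7]})

    # One data frame on disk: Arrow arrays plus flatbuffers metadata.
    feather.write_dataframe(df, "sales.feather")
    df_back = feather.read_dataframe("sales.feather")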
Apache Parquet
• Binary file format for nested columnar data
• Inspired by Google's Dremel paper
• Space- and query-efficient (see the writing sketch below):
  • multiple encodings
  • predicate pushdown
  • column-wise compression
• Many tools use Parquet as their default input format
  • very popular in the JVM/Hadoop-based world
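A minimal writing sketch with today's pyarrow.parquet API (which post-dates this talk); compression is chosen per file, and the writer stores the statistics that later enable predicate pushdown.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_arrays(
        [pa.array(["a", "a", "b"]), pa.array([1.0, 2.0, 3.0])],
        names=["key", "value"],
    )

    # Column-wise compression is a per-file option; the writer also
    # records per-column statistics used for predicate pushdown.
    pq.write_table(table, "data.parquet", compression="snappy")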
The Basics
• 1 file, including metadata
• Several row groups
  • all with the same number of column chunks
  • n pages per column chunk
• Benefits:
  • pre-partitioned for fast distributed access
  • statistics in the metadata for predicate pushdown (inspected in the sketch after the diagram below)
Blogpost by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
[Diagram: Parquet layout, File → Row Group → Column Chunk → Page]
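Those row groups, column chunks and statistics can be walked programmatically; a sketch with today's pyarrow.parquet metadata API (not available at the time of the talk):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")
    meta = pf.metadata
    print(meta.num_row_groups)

    # Drill down: file -> row group -> column chunk -> statistics.
    rg = meta.row_group(0)
    col = rg.column(0)
    print(col.statistics.min, col.statistics.max)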
Using Parquet in Python
• You can already use it today from Python:
  • sqlContext.read.parquet("..").toPandas()
  • needs to pass through Spark, very slow
• Native Python support is on its way (a sketch of that path follows below):
  • Parquet I/O to Arrow
  • Arrow provides NumPy conversion
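That native path is what pyarrow.parquet eventually provided; a minimal sketch assuming the modern API:

    import pyarrow.parquet as pq

    # Parquet -> Arrow -> Pandas, no Spark detour.
    table = pq.read_table("data.parquet")
    df = table.to_pandas()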
State of Arrow & Parquet
Arrow (in-memory spec for columnar data)
• Java (beta)
• C++ (in progress)
• Python (in progress)
• Planned: Julia, R

Parquet (columnar on-disk storage)
• Java (mature)
• C++ (in progress)
• Python (in progress)
• Planned: Julia, R
Upcoming
• Parquet <-> Arrow <-> Pandas
• IPC on its way (sketched below):
  • alpha implementation using memory-mapped files
  • JVM <-> native with shared reference counting
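A sketch of the memory-mapped IPC idea, using the pa.ipc API from modern pyarrow (the alpha mentioned here predates it); one process writes batches to a file, another maps the same buffers without copying.

    import pyarrow as pa

    batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])

    # Write a record batch in the Arrow IPC file format.
    with pa.OSFile("batches.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, batch.schema) as writer:
            writer.write_batch(batch)

    # Memory-map the file: a second process can read the same
    # buffers without copying them.
    with pa.memory_map("batches.arrow", "r") as source:
        reader = pa.ipc.open_file(source)
        first_batch = reader.get_batch(0)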
Get Involved!
• [email protected] & [email protected] • https://apachearrowslackin.herokuapp.com/ • https://arrow.apache.org/ • https://parquet.apache.org/ • @ApacheArrow & @ApacheParquet
Questions?!