Enabling Python to be a Better Big Data Citizen
TRANSCRIPT
![Page 1: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/1.jpg)
1 © Cloudera, Inc. All rights reserved.
Enabling Python to be a Better Big Data Citizen
Wes McKinney @wesmckinn
NYC Python Meetup 2016-02-17
![Page 2: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/2.jpg)
Me
• R&D at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
![Page 3: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/3.jpg)
Some simplistic generalizations
Industry Analytics
• Heterogeneous data
• Flat tables and JSON
• Spark / MapReduce
• SQL
• DFS-friendly / streaming data formats
• More physical machines
Scientific Computing
• Homogeneous data
• Multidimensional arrays
• HPC tools
• Linear algebra
• Scientific data formats (e.g. HDF5)
• Fewer physical machines
Python: heavy investment, generally
Python: light investment, generally
![Page 4: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/4.jpg)
A sample big data architecture
[Diagram. Labels: Kafka (four instances), Application data, HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, User, SQL]
![Page 5: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/5.jpg)
pandas
• Hugely popular Python table / “data frame” library
• Labeled table, array, and time series data structures
• Popular for data preparation, ETL, and in-memory analytics
• Built using Python’s scientific computing stack
• User API / domain specific language
• Bespoke in-memory analytics / relational algebra engine
• IO interfaces (CSV, SQL, etc.)
• Expanded data type system (beyond NumPy)
• Supports flat data only (or semi-structured data that can be flattened)
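To make "semi-structured data that can be flattened" concrete, here is a minimal pure-Python sketch. The `flatten` helper and the `orders` sample are hypothetical and only illustrate the idea; this is not how pandas does it internally. Each nested list element becomes one flat row, with the parent's scalar fields repeated.

```python
def flatten(records, list_field):
    """Emit one flat row per element of record[list_field] (illustrative helper)."""
    rows = []
    for record in records:
        # Scalar fields of the parent are repeated on every output row.
        parent = {k: v for k, v in record.items() if k != list_field}
        for child in record[list_field]:
            row = dict(parent)
            # Prefix child keys so they cannot clash with parent keys.
            row.update({f"{list_field}.{k}": v for k, v in child.items()})
            rows.append(row)
    return rows


# Hypothetical sample data, not taken from the talk.
orders = [
    {"customer": "alice", "items": [{"sku": "x1", "qty": 2}, {"sku": "y7", "qty": 1}]},
    {"customer": "bob", "items": [{"sku": "z3", "qty": 5}]},
]

flat_rows = flatten(orders, "items")
for row in flat_rows:
    print(row)
```

The resulting rows all share the same keys, which is the flat, table-like shape a data frame expects.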
![Page 6: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/6.jpg)
2016 Python Data Trends
• Improved Python interoperability with the Apache Hadoop ecosystem
  • I’m working with {Arrow, Kudu, Impala, Parquet, Spark}
• Support for big data file formats like Apache Parquet
• Native in-memory Python support for nested / JSON-like data
![Page 7: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/7.jpg)
Ibis in a nutshell
• For Python programmers doing analytics in industry
• Project Blog: http://blog.ibis-project.org
• Cross-team project @ Cloudera
• Apache-licensed, open source http://github.com/cloudera/ibis
• Crafting a compelling Python-on-Hadoop user experience
• Remove SQL coding from user workflows
• Develop high performance extensions in Python
![Page 8: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/8.jpg)
![Page 9: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/9.jpg)
Enabling interoperability with big data systems
• Distributed / MPP query engines: implemented in a host language
  • Typically C/C++ or Java/Scala
• User-defined functions (UDFs) through various means
  • Implement in host language
  • Implement in user language through some external language protocol (often RPC-based)
• External UDFs are usually very slow (cf: PL/Python, PySpark, etc.)
![Page 10: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/10.jpg)
Executing data science languages in the compute layer
UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase
Python, R, Julia, …?
![Page 11: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/11.jpg)
Python interoperability challenges
• Problem 1: Serialization / deserialization overhead
[Diagram: the big data system sends input partitions 0 … n - 1 to user-supplied Python code; each partition is passed as input to a Python function, and the outputs return to the big data system as output partitions 0 … n - 1.]
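The round trip in this slide can be sketched in a few lines. This is a simulation under stated assumptions, not how any particular engine works: `pickle` stands in for whatever wire format the host system uses, and `run_udf_over_partitions` is a hypothetical helper name. The point is that every partition is serialized and deserialized twice, once on the way in and once on the way out.

```python
import pickle


def run_udf_over_partitions(partitions, func):
    """Simulate shipping each partition to a Python worker and back."""
    out_partitions = []
    bytes_moved = 0
    for part in partitions:
        wire_in = pickle.dumps(part)        # host system -> Python (serialize)
        rows = pickle.loads(wire_in)        # Python side (deserialize)
        result = [func(x) for x in rows]    # user-supplied Python code
        wire_out = pickle.dumps(result)     # Python -> host system (serialize)
        out_partitions.append(pickle.loads(wire_out))  # host side (deserialize)
        bytes_moved += len(wire_in) + len(wire_out)
    return out_partitions, bytes_moved


# Hypothetical toy partitions.
partitions = [[1, 2, 3], [4, 5], [6]]
out, moved = run_udf_over_partitions(partitions, lambda x: x * 10)
print(out)
print(moved, "bytes crossed the boundary")
```

Only the `func(x)` calls do useful work; everything else in the loop is data movement.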
![Page 12: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/12.jpg)
Data movement can be extremely costly
[Diagram: input partition 0 is passed as input to a Python function.]
Questions
• How to represent “data in-flight” (RPC)?
• Cost of conversion between in-memory data structures and RPC representation
• How to communicate schemas / metadata?
![Page 13: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/13.jpg)
Data movement can be extremely costly
[Diagram: input partition 0 is passed as input to a Python function.]
Slow data movement / conversion can largely undermine the performance benefits of Python’s high performance in-memory data tools
![Page 14: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/14.jpg)
Python interoperability challenges
• Problem 2: Scalar vs vectorized computations

SCALAR

    result = np.empty(n)
    for i in range(n):
        result[i] = f(a[i], b[i])

VECTORIZED (often 100-1000x faster)

    result = f(a, b)
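A runnable version of the two fragments above, as a sketch. The function `f` and the array size are arbitrary choices made here for illustration; the slide does not say what `f` is. Both paths compute the same values, and the difference lies only in where the loop runs.

```python
import numpy as np


def f(x, y):
    # Works on Python/NumPy scalars and on whole arrays alike.
    return x * y + 1.0


n = 100_000
rng = np.random.default_rng(0)
a = rng.random(n)
b = rng.random(n)

# SCALAR: one Python-level function call per element.
scalar_result = np.empty(n)
for i in range(n):
    scalar_result[i] = f(a[i], b[i])

# VECTORIZED: a single call; the loop runs inside NumPy's compiled code.
vector_result = f(a, b)

same = bool(np.allclose(scalar_result, vector_result))
print(same)
```

Wrapping each variant in `timeit` shows the gap on a given machine; the actual ratio depends on `f`, on `n`, and on the hardware.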
![Page 15: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/15.jpg)
Apache Arrow: What is it?
• http://arrow.apache.org
• Not a piece of software, exactly!
• A standardized in-memory representation for columnar data
• Enables
  • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
  • Cheap data interchange amongst systems, little or no serialization
  • Flexible support for complex JSON-like data
• Targets: Impala, Kudu, Parquet, Spark
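The "little or no serialization" idea can be illustrated by analogy with Python's own buffer protocol. This sketch uses only the standard library and is not Arrow's API: when two consumers agree on a memory layout, one can read the other's column in place, whereas a byte-string round trip produces an independent copy.

```python
from array import array

# A column of float64 values held in one contiguous buffer.
column = array("d", [1.5, 2.5, 3.5, 4.5])

# Consumer A: a zero-copy view over the same memory (buffer protocol).
view = memoryview(column)

# Consumer B: an independent copy made through a byte-string round trip.
copied = array("d")
copied.frombytes(column.tobytes())

# The producer updates the shared buffer in place.
column[0] = 99.0

shares_memory = view[0] == 99.0   # the view sees the change
copy_is_stale = copied[0] == 1.5  # the copy does not
print(shares_memory, copy_is_stale)
```

The view costs nothing to create regardless of column length, while the copy costs time and memory proportional to the data.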
![Page 16: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/16.jpg)
Columnar data

    persons = [
      {
        name: 'wes',
        addresses: [
          {number: 2, street: 'a'},
          {number: 3, street: 'bb'},
        ]
      },
      {
        name: 'mark',
        addresses: [
          {number: 4, street: 'ccc'},
          {number: 5, street: 'dddd'},
          {number: 6, street: 'f'},
        ]
      },
    ]
![Page 17: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/17.jpg)
Columnar data
• person.addresses: offset 0, 2, 5
• person.addresses.street: offset 0, 1, 3, 6, 10; data "abbcccddddf"
• person.addresses.number: 2, 3, 4, 5, 6
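The buffers on this slide can be derived from the `persons` records on the previous slide with a short pure-Python sketch. The variable names are chosen here for illustration and this is not Arrow library code. One assumption to flag: the sketch appends a closing offset after the last value, so the street offsets end in 11, while the slide shows them only up to 10.

```python
persons = [
    {"name": "wes", "addresses": [
        {"number": 2, "street": "a"},
        {"number": 3, "street": "bb"}]},
    {"name": "mark", "addresses": [
        {"number": 4, "street": "ccc"},
        {"number": 5, "street": "dddd"},
        {"number": 6, "street": "f"}]},
]

list_offsets = [0]    # person.addresses: where each address list starts/ends
numbers = []          # person.addresses.number values
street_offsets = [0]  # person.addresses.street: where each string starts/ends
street_data = ""      # all street characters, concatenated

for person in persons:
    for address in person["addresses"]:
        numbers.append(address["number"])
        street_data += address["street"]
        street_offsets.append(len(street_data))
    list_offsets.append(len(numbers))


def streets_of(i):
    """Rebuild person i's street names from the flat buffers."""
    lo, hi = list_offsets[i], list_offsets[i + 1]
    return [street_data[street_offsets[j]:street_offsets[j + 1]]
            for j in range(lo, hi)]


print(list_offsets, street_offsets, street_data, numbers)
print(streets_of(1))
```

Reading a nested value back needs only slicing by offsets, with no per-record parsing, which is what makes the layout attractive for analytics.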
![Page 18: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/18.jpg)
Apache Arrow in practice
![Page 19: Enabling Python to be a Better Big Data Citizen](https://reader033.vdocuments.mx/reader033/viewer/2022050613/58a2cb0a1a28ab217a8b6909/html5/thumbnails/19.jpg)
Thank you Wes McKinney @wesmckinn Views are my own