PySpark Best Practices by Juliet Hougland
TRANSCRIPT
![Page 1: PySpark Best Practices by Juliet Hougland](https://reader031.vdocuments.mx/reader031/viewer/2022022415/586f7a371a28ab10258b7261/html5/thumbnails/1.jpg)
‹#›© Cloudera, Inc. All rights reserved.
Juliet Hougland
Spark Summit Europe 2015
@j_houg
PySpark Best Practices
RDDs
sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
[Diagram: an HDFS file split into 4 partitions]
Thanks: Kostas Sakellis
RDDs
sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
[Diagram: the 4 HDFS partitions read into a 4-partition RDD]
Thanks: Kostas Sakellis
RDDs
sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
[Diagram: HDFS partitions -> textFile RDD -> map RDD, 4 partitions at each stage]
Thanks: Kostas Sakellis
RDDs
sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
[Diagram: HDFS partitions -> textFile RDD -> map RDD -> filter RDD, 4 partitions at each stage]
Thanks: Kostas Sakellis
RDDs
sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
[Diagram: HDFS partitions -> textFile RDD -> map RDD -> filter RDD -> Count, 4 partitions at each stage]
Thanks: Kostas Sakellis
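The pipeline above can be emulated in plain Python to make each stage concrete. The slides never show `to_series` or `has_outlier`, so this sketch substitutes trivial stand-ins, and plain lists stand in for HDFS partitions:

```python
# Plain-Python sketch of the 4-partition pipeline above.
# to_series/has_outlier are hypothetical stand-ins, not the talk's code.

def to_series(line):
    # pretend each line is a comma-separated series of numbers
    return [float(x) for x in line.split(",")]

def has_outlier(series):
    # stand-in outlier rule: any value above 100
    return any(x > 100 for x in series)

# four "partitions" of raw text, standing in for HDFS blocks
partitions = [
    ["1,2,3", "4,5,6"],
    ["7,8,9"],
    ["10,200,30"],           # contains an outlier
    ["11,12,13", "500,1,2"], # contains an outlier
]

# each transformation yields a new "RDD" (a list of partitions);
# the count() action folds the per-partition results together
mapped = [[to_series(line) for line in part] for part in partitions]
filtered = [[s for s in part if has_outlier(s)] for part in mapped]
count = sum(len(part) for part in filtered)
print(count)  # 2 records contain an outlier
```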
Spark Execution Model
PySpark Execution Model
PySpark Driver Program
sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
Function closures need to be executed on worker nodes by a Python process.
How do we ship around Python functions?
sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
Pickle!
sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
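Ordinary data round-trips through the standard-library pickler, but plain pickle refuses anonymous functions and closures, which is why PySpark ships its own cloudpickle-based serializer for the functions you pass to map and filter. A minimal sketch of the difference:

```python
import pickle

# ordinary data round-trips fine
blob = pickle.dumps({"partitions": 4, "path": "hdfs://..."})
assert pickle.loads(blob)["partitions"] == 4

# but the standard pickler cannot serialize a lambda/closure by value;
# PySpark works around this with a modified pickler (cloudpickle)
try:
    pickle.dumps(lambda x: x * 2)
    lambda_pickled = True
except Exception:
    lambda_pickled = False
print(lambda_pickled)  # False: plain pickle cannot ship this closure
```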
Best Practices for Writing PySpark
REPLs and Notebooks
https://flic.kr/p/5hnPZp
Share your code
https://flic.kr/p/sw2cnL
Standard Python Project
my_pyspark_proj/
  awesome/
    __init__.py
  bin/
  docs/
  setup.py
  tests/
    __init__.py
    awesome_tests.py
What is the shape of a PySpark job?
https://flic.kr/p/4vWP6U
• Parse CLI args & configure Spark App
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data
PySpark Structure?
https://flic.kr/p/ZW54
Shout out to my colleagues in the UK
PySpark Structure?
my_pyspark_proj/
  awesome/
    __init__.py
    DataIO.py
    Featurize.py
    Model.py
  bin/
  docs/
  setup.py
  tests/
    __init__.py
    awesome_tests.py
    resources/
      data_source_sample.csv
• Parse CLI args & configure Spark App
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data
Simple Main Method
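The slide's code is not in the transcript, so below is a minimal sketch of the shape a simple main method usually takes for the steps listed above. The argument names and app name are assumptions, not the talk's exact code; CLI parsing is kept separate from Spark setup so it can be tested without a cluster.

```python
import argparse

def parse_args(argv=None):
    # kept separate from main() so it is unit-testable without Spark
    parser = argparse.ArgumentParser(description="Awesome PySpark job")
    parser.add_argument("--master", default="local[*]")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # imported here so the module can be imported without pyspark installed
    from pyspark import SparkConf, SparkContext
    conf = SparkConf().setAppName("awesome").setMaster(args.master)
    sc = SparkContext(conf=conf)
    try:
        raw = sc.textFile(args.input)
        # featurize -> fancy maths -> results would go here
        raw.saveAsTextFile(args.output)
    finally:
        sc.stop()
```

Calling `main(["--input", "hdfs://…", "--output", "hdfs://…"])` from a thin script in `bin/` keeps the entry point trivial.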
• Write a function for anything inside a transformation
• Make it static
• Separate feature generation or data standardization from your modeling
Write Testable Code
Featurize.py

class Featurize(object):

    @staticmethod
    def label(single_record):
        ...
        return label_as_a_double

    @staticmethod
    def descriptive_name_of_feature1():
        ...
        return a_double

    @staticmethod
    def create_labeled_point(data_usage_rdd, sms_usage_rdd):
        ...
        return LabeledPoint(label, [feature1])
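As a concrete illustration of why static methods test easily, here is a hypothetical label function that can be exercised with no SparkContext at all. The record format is invented for the example:

```python
class Featurize(object):
    @staticmethod
    def label(single_record):
        # hypothetical record format: "user_id,bytes_used,churned"
        _, _, churned = single_record.split(",")
        return 1.0 if churned == "true" else 0.0

    @staticmethod
    def bytes_used(single_record):
        return float(single_record.split(",")[1])

# pure functions need no cluster to test
assert Featurize.label("u1,1024,true") == 1.0
assert Featurize.label("u2,10,false") == 0.0
assert Featurize.bytes_used("u1,1024,true") == 1024.0
```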
• Functions and the contexts they need to execute (closures) must be serializable
• Keep functions simple. I suggest static methods.
• Some things are impossiblish
  • DB connections => Use mapPartitions instead
Write Serializable Code
https://flic.kr/p/za5cy
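The DB-connection point is usually handled by opening one connection per partition instead of per record. The sketch below shows the shape of a function you would pass to rdd.mapPartitions; connect_to_db is a hypothetical stand-in, faked here so the pattern runs without a real database:

```python
def connect_to_db():
    # hypothetical stand-in for a real, unpicklable DB connection
    class FakeConn(object):
        def __init__(self):
            self.written = []
        def write(self, record):
            self.written.append(record)
        def close(self):
            pass
    return FakeConn()

def write_partition(records):
    # one connection per partition: created on the worker, never pickled
    conn = connect_to_db()
    try:
        n = 0
        for record in records:
            conn.write(record)
            n += 1
        yield n  # emit a per-partition count
    finally:
        conn.close()

# with Spark this would be rdd.mapPartitions(write_partition);
# here plain lists stand in for partitions
counts = [next(write_partition(part)) for part in [[1, 2, 3], [], [4]]]
print(counts)  # [3, 0, 1]
```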
• Provides a SparkContext and configures the Spark master
• Quiets Py4J
• https://github.com/holdenk/spark-testing-base
Testing with SparkTestingBase
• Unit test as much as possible
• Integration test the whole flow
• Use a sample of real data
• Test for:
  • Deviations of data from the expected format
  • RDDs with empty partitions
  • Correctness of results
Testing Suggestions
https://flic.kr/p/tucHHL
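The empty-partition case is easy to cover once logic lives in a plain function: a mapPartitions-style function receives an iterator, and a hypothetical summarize() must not assume it is non-empty:

```python
def summarize(records):
    # mapPartitions-style function: receives an iterator, yields one summary
    total = 0.0
    n = 0
    for value in records:
        total += value
        n += 1
    # an empty partition must not crash the job or divide by zero
    yield {"n": n, "mean": (total / n) if n else None}

assert next(summarize(iter([2.0, 4.0]))) == {"n": 2, "mean": 3.0}
assert next(summarize(iter([]))) == {"n": 0, "mean": None}
```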
Best Practices for Running PySpark
Writing distributed code is the easy part…
Running it is hard.
Get Serious About Logs
• Get the YARN app id from the Web UI or console
• yarn logs <app-id>
• Quiet down Py4J
• Log records that have trouble getting processed
• Earlier exceptions are more relevant than later ones
• Look at both the Python and Java stack traces
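On the Python side, Py4J's chatter goes through the standard logging module under the "py4j" logger name, so the driver program can quiet it directly:

```python
import logging

# raise the py4j logger's threshold so only errors come through
logging.getLogger("py4j").setLevel(logging.ERROR)

# INFO-level gateway traffic is now suppressed, errors still surface
assert not logging.getLogger("py4j").isEnabledFor(logging.INFO)
assert logging.getLogger("py4j").isEnabledFor(logging.ERROR)
```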
Know your environment
• You may want to use Python packages on your cluster
• Actively manage dependencies on your cluster
• Spark versions <1.4.0 require the same version of Python on the driver and workers
Complex Dependencies
Many Python Environments
The path to the Python binary to use on the cluster can be set with PYSPARK_PYTHON.
It can be set in spark-env.sh:

if [ -n "${PYSPARK_PYTHON}" ]; then
  export PYSPARK_PYTHON=<path>
fi
http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/