Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
TRANSCRIPT
Exceptions are the Norm: Dealing with Bad Actors in ETL
Herman van Hövell (@Westerflyer) | Spark Meetup | Amsterdam | Feb 8th 2017
About Me
• Software Engineer at Databricks (Spark Core/SQL)
• Committer for Apache Spark
• Worked as a data analyst in Logistics, Finance and Marketing
Overview
1. What’s an ETL Pipeline?
- How is it different from a regular query execution pipeline?

2. Using SparkSQL for ETL
- Dealing with Dirty Data (Bad Records or Files)

3. New Features in Spark 2.2 and 2.3
- Focus on building ETL-friendly pipelines
What is a Data Pipeline?
1. Sequence of transformations on data
2. Source data is typically semi-structured/unstructured (JSON, CSV, etc.)
3. Output data is structured and ready for use by analysts and data scientists
4. Source and destination are often on different storage systems
Example of a Data Pipeline
[Diagram: sources (Kafka, cloud storage, logs) flow into a database and warehouse, which in turn feed aggregate reporting, applications, an ML model, and ad-hoc queries]
ETL is the First Step in a Data Pipeline
1. ETL stands for EXTRACT, TRANSFORM and LOAD
2. Goal is to “clean” or “curate” the data
- Retrieve data from source (EXTRACT)
- Transform data into a consumable format (TRANSFORM)
- Transmit data to downstream consumers (LOAD)
An Example

spark.read.csv("/source/path")      // EXTRACT
  .groupBy(...).agg(...)            // TRANSFORM
  .write.mode("append")
  .parquet("/output/path")          // LOAD
Why is ETL Hard?
1. Source Data can be Messy
- Incomplete information
- Missing data stored as empty strings, “none”, “missing”, “xxx”, etc.

2. Source Data can be Inconsistent
- Data conversion and type validation are in many cases error-prone
  - e.g., expecting a number but finding “123 000”
  - different date formats: “31/12/2017” vs. “12/31/2007”
- Incorrect information
  - e.g., expecting 5 fields in a CSV record, but unable to find 5 fields
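The conversions above can be sketched in plain Python. This is an illustration only; the helper names and the placeholder-token list are made up for the example and are not part of Spark or the talk:

```python
import datetime

# Placeholder tokens commonly used for missing data (illustrative list).
MISSING_TOKENS = {"", "none", "missing", "xxx"}

def parse_int(raw):
    """Parse an integer, tolerating placeholder tokens and grouping spaces."""
    if raw.strip().lower() in MISSING_TOKENS:
        return None
    try:
        # "123 000" uses a space as a thousands separator; strip it first.
        return int(raw.replace(" ", ""))
    except ValueError:
        return None

def parse_date(raw):
    """Try the two competing date formats seen in the source data.

    Note that truly ambiguous inputs like "01/02/2017" silently resolve to
    the first format that matches; real pipelines need a policy for this.
    """
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            return datetime.datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None
```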
3. Source Data can be Constantly Arriving
- At-least-once or exactly-once semantics
- Fault tolerance
- Scalability

4. Source Data can be Complex
- e.g., nested JSON data to extract and flatten
- Dealing with inconsistency is even worse
This is why ETL is important: consumers of this data don’t want to deal with this messiness and complexity.
On the flip side
1. A few bad records can fail a job
• These are not the same as transient errors
• No recourse for recovery

2. Support for ETL features
• File formats and conversions have gaps
• e.g., multi-line support, date conversions
3. Performance
Using SparkSQL for ETL
Dealing with Bad Data: Skip Corrupt Files
spark.sql.files.ignoreCorruptFiles = true
[SPARK-17850] If true, Spark jobs will continue to run even when they encounter missing or corrupt files. The contents that have been read will still be returned.
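A rough Python analogue of this flag’s behaviour, for intuition only (Spark’s actual handling in SPARK-17850 happens per task inside the file scan, not like this):

```python
import json

def read_json_files(paths, ignore_corrupt_files=False):
    """Read JSON-lines files, optionally skipping missing/corrupt files.

    With the flag on, a missing or unparseable file is skipped, and the rows
    read before the failure are kept, mirroring the behaviour described for
    spark.sql.files.ignoreCorruptFiles. Illustration only.
    """
    rows = []
    for path in paths:
        try:
            with open(path) as f:
                for line in f:
                    rows.append(json.loads(line))
        except (OSError, ValueError):
            if not ignore_corrupt_files:
                raise
            # Flag is on: keep what was read so far and move on.
    return rows
```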
Dealing with Bad Data: Skip Corrupt Records
[SPARK-12833][SPARK-13764] Text file formats (JSON and CSV) support 3 different parse modes while reading data:

1. PERMISSIVE
2. DROPMALFORMED
3. FAILFAST
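The three modes can be imitated in plain Python to build intuition for the examples that follow. This is a sketch of the described behaviour, not Spark’s implementation; `_corrupt_record` is Spark’s default name for the column that captures bad rows in PERMISSIVE mode:

```python
import json

def read_json_lines(lines, mode="PERMISSIVE", corrupt_col="_corrupt_record"):
    """Parse line-delimited JSON under one of the three parse modes."""
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except ValueError:
            if mode == "FAILFAST":
                raise RuntimeError("Malformed line in FAILFAST mode: " + line)
            if mode == "DROPMALFORMED":
                continue  # silently drop the bad record
            # PERMISSIVE: keep the raw text in a dedicated column
            rows.append({corrupt_col: line})
    return rows
```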
JSON: Dealing with Corrupt Records
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}

spark.read
  .option("mode", "PERMISSIVE")
  .json(corruptRecords)
  .show()
The name of the corrupt-record column can be configured via spark.sql.columnNameOfCorruptRecord.
JSON: Dealing with Corrupt Records
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}

spark.read
  .option("mode", "DROPMALFORMED")
  .json(corruptRecords)
  .show()
JSON: Dealing with Corrupt Records
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}

spark.read
  .option("mode", "FAILFAST")
  .json(corruptRecords)
  .show()
org.apache.spark.sql.catalyst.json.SparkSQLJsonProcessingException: Malformed line in FAILFAST mode: {"a":{, b:3}
CSV: Dealing with Corrupt Records

year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt

spark.read.format("csv")
  .option("mode", "PERMISSIVE")
  .load(corruptRecords)
  .show()
CSV: Dealing with Corrupt Records

year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt

spark.read.format("csv")
  .option("mode", "DROPMALFORMED")
  .load(corruptRecords)
  .show()
CSV: Dealing with Corrupt Records

year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt

spark.read.format("csv")
  .option("mode", "FAILFAST")
  .load(corruptRecords)
  .show()
java.lang.RuntimeException: Malformed line in FAILFAST mode: 2015,Chevy,Volt
Apache Spark 2.2 and 2.3
Massive focus on functionality, usability and performance
New Features in Spark 2.2 and 2.3
1. Functionality:
- Better JSON and CSV Support

2. Usability:
- Better Error Messages

3. Performance:
- Python UDF Processing
Functionality: Better JSON Support
[SPARK-18352] Multi-line JSON Support
- Spark currently reads JSON one line at a time
- Multi-line JSON documents currently require custom ETL

spark.read.option("wholeFile", true).json(path)
Availability: Spark 2.2
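The difference between the default line-at-a-time reader and a whole-file reader can be sketched in plain Python. This illustrates the parsing model only, not Spark’s implementation:

```python
import json

def read_line_delimited(text):
    """Default model: exactly one JSON document per input line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def read_whole_file(text):
    """wholeFile model: parse the entire file as a single JSON document."""
    doc = json.loads(text)
    # A top-level array yields one row per element.
    return doc if isinstance(doc, list) else [doc]
```

A pretty-printed document breaks the line-at-a-time reader (every physical line must be valid JSON on its own) but is handled by the whole-file reader.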
Functionality: Better CSV Support
[SPARK-16099] Improved/Performant CSV Datasource
- Multi-line CSV support
- Additional options for CSV parsing
- Whole-text reader for DataFrames
Availability: Spark 2.2
Functionality: Better CSV Support
More fine-grained (record-level) tolerance to errors
- Provide users with controls on how to handle these errors
- Ignore and report errors post hoc
- Ignore bad rows up to a certain number or percentage
Availability: Spark 2.2
Functionality: Working with Nested Data
[SPARK-19480] Higher-order functions in SQL
- Enable users to manipulate nested data in Spark
- Operations include map, filter, reduce on arrays/maps
tbl_x
|-- key: long (nullable = false)
|-- values: array (nullable = false)
| |-- element: long (containsNull = false)
Functionality: Working with Nested Data
[SPARK-19480] Higher order functions in SQL
Availability: Spark 2.3+
tbl_x
|-- key: long (nullable = false)
|-- values: array (nullable = false)
| |-- element: long (containsNull = false)
SELECT key,
       TRANSFORM(values, v -> v + key)
FROM tbl_x
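In plain Python, the same TRANSFORM(values, v -> v + key) can be mimicked with a list comprehension. The row data here is made up to match the tbl_x schema shown above; this is an analogy, not Spark code:

```python
def transform(values, fn):
    """Rough analogue of the SQL higher-order function TRANSFORM."""
    return [fn(v) for v in values]

# Rows matching the tbl_x schema: a long key and an array of longs.
rows = [{"key": 1, "values": [1, 2, 3]},
        {"key": 10, "values": [20, 30]}]

# Add each row's key to every element of its values array,
# like SELECT key, TRANSFORM(values, v -> v + key) FROM tbl_x.
result = [{"key": r["key"],
           "values": transform(r["values"], lambda v, k=r["key"]: v + k)}
          for r in rows]
```

The lambda captures the row’s key, just as the SQL lambda `v -> v + key` closes over the enclosing column.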
Usability: Better Error Messages
scala.MatchError: start (of class java.lang.String)
Usability: Better Error Messages
1. Spark must explain why data is bad
2. This is especially true for data conversion
3. Which row in your source data could not be converted?
4. Which column could not be converted?
Availability: Spark 2.2 and 2.3
Performance: Python Performance
1. Python is the most popular language for ETL
2. Python UDFs are used to express data conversions/transformations
3. UDFs are processed in a separate Python process
4. Any improvements to Python UDF processing will improve ETL
- e.g., improve Python serialization using column batches
- Applies to R and Scala as well
Availability: Spark 2.3+
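The intuition behind batched UDF processing can be sketched as amortizing per-call overhead. This is a toy model (a counter standing in for the JVM-to-Python round-trip), not Spark’s actual worker protocol:

```python
class Channel:
    """Toy stand-in for the JVM <-> Python socket; counts round-trips."""
    def __init__(self):
        self.round_trips = 0

    def call(self, fn, payload):
        self.round_trips += 1  # each call models one serialization round-trip
        return fn(payload)

def run_udf_per_row(channel, rows, udf):
    """One cross-process round-trip per row: high fixed overhead."""
    return [channel.call(udf, r) for r in rows]

def run_udf_batched(channel, rows, udf, batch_size=100):
    """One round-trip per batch: same results, far fewer crossings."""
    out = []
    for i in range(0, len(rows), batch_size):
        out.extend(channel.call(lambda batch: [udf(r) for r in batch],
                                rows[i:i + batch_size]))
    return out
```

With 1,000 rows and a batch size of 100, the batched version makes 10 crossings instead of 1,000, which is the kind of saving column-batch serialization targets.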
Recap
1. Using SparkSQL for ETL
- Dealing with Bad Records or Files

2. New Features in Spark 2.2 and 2.3
- Focus on functionality, usability and performance
Questions?