TRANSCRIPT
Anatomy of Data Source API
A deep dive into the Spark Data source API
https://github.com/phatak-dev/anatomy_of_spark_datasource_api
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consultant in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Data Source API
● Schema discovery
● Build scan
● Data type inference
● Save
● Column pruning
● Filter push-down
Data Source API
● Universal API for loading/saving structured data
● Built-in support for Hive, Avro, JSON, JDBC and Parquet
● Third-party integration through spark-packages
● Support for smart sources
● Third parties already supporting:
○ CSV
○ MongoDB
○ Cassandra (in the works), etc.
Data Source API
Building a CSV data source
● Ability to load and save CSV data
● Automatic schema discovery
● Support for user schema override
● Automatic data type inference
● Support for column pruning
● Filter push-down
Schema discovery
Tag v0.1
CsvSchemaDiscovery Example
DefaultSource
● Spark looks for a class named DefaultSource in the given package of the data source
● DefaultSource should extend the RelationProvider trait
● RelationProvider is responsible for taking user parameters and turning them into a BaseRelation
● The SchemaRelationProvider trait additionally allows users to specify their own schema
● Ex : DefaultSource.scala
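The convention above can be sketched in plain Scala. These traits are minimal stand-ins for Spark's `org.apache.spark.sql.sources` traits, not the real classes (the real `createRelation` also receives a SQLContext, and `describe` is a made-up method for illustration):

```scala
// Minimal stand-ins for Spark's sources traits (illustration only).
trait BaseRelation { def describe: String }

trait RelationProvider {
  // Takes user-supplied options and turns them into a BaseRelation.
  def createRelation(parameters: Map[String, String]): BaseRelation
}

// By convention the class must be named DefaultSource, so that a format
// string like "com.example.csv" resolves to "com.example.csv.DefaultSource".
class DefaultSource extends RelationProvider {
  override def createRelation(parameters: Map[String, String]): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
    new BaseRelation { def describe: String = s"CSV relation over $path" }
  }
}
```

The naming convention is what lets users pass only a package name to the reader: Spark appends `.DefaultSource` and loads the class by reflection.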
BaseRelation
● Represents a collection of tuples with a known schema
● Methods that need to be overridden:
○ def sqlContext: SQLContext
Returns the SQLContext used for building DataFrames
○ def schema: StructType
Returns the schema of the relation as a StructType (analogous to a Hive SerDe)
● Ex : CsvRelation.scala
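For a CSV source, the schema can come straight from the header line. A plain-Scala sketch, where `StructField`/`StructType` are simplified stand-ins for Spark SQL's schema types, and every column defaults to string (type inference comes later in the talk):

```scala
// Simplified stand-ins for Spark SQL's schema types (illustration only).
case class StructField(name: String, dataType: String)
case class StructType(fields: Seq[StructField])

// CsvRelation-style schema discovery: read the header line and
// treat every column as a string for now.
def schemaFromHeader(header: String, separator: String = ","): StructType =
  StructType(header.split(separator).toSeq.map(col => StructField(col.trim, "string")))
```

For example, `schemaFromHeader("name,age,city")` yields three string-typed fields.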
Reading Data
Tag v0.2
TableScan
● TableScan is a trait to be implemented for reading data
● It is a BaseRelation that can produce all of its tuples as an RDD of Row objects
● Method to override:
○ def buildScan(): RDD[Row]
● In the CSV example, we use sc.textFile to create an RDD and then Row.fromSeq to convert each line into a Row
● Ex : CsvTableScanExample.scala
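The core of that buildScan can be sketched without Spark: an ordinary `Seq[String]` stands in for the `RDD[String]` from sc.textFile, and `Seq[Any]` stands in for a `Row` built with Row.fromSeq:

```scala
// Stand-in for buildScan(): split each text line on the separator,
// producing one Row-like Seq[Any] per line.
def buildScan(lines: Seq[String], separator: String = ","): Seq[Seq[Any]] =
  lines.map(line => line.split(separator).map(_.trim).toSeq)
```

In the real source the same map runs distributed over the RDD's partitions rather than over a local collection.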
Data Type inference
Tag v0.3
Inferring data types
● Until now, every value was treated as a string
● Sample the data and infer a schema for each row
● Take the inferred schema of the first row
● Update the table scan to cast values to the right data type
● Ex : CsvSchemaDiscovery.scala
● Ex : SalesSumExample.scala
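The sampling idea above can be sketched as: try to parse each value as an integer, then as a double, and fall back to string; the schema of the first sampled row wins. The type names here are illustrative labels, not Spark's DataType objects:

```scala
import scala.util.Try

// Infer a primitive type label for a single CSV value.
def inferType(value: String): String =
  if (Try(value.toInt).isSuccess) "integer"
  else if (Try(value.toDouble).isSuccess) "double"
  else "string"

// Infer a schema from the first sampled row, as v0.3 of the talk does.
def inferSchema(firstRow: Seq[String]): Seq[String] = firstRow.map(inferType)
```

Basing the schema on a single row is a deliberate simplification for the talk; a production source would reconcile types across the whole sample.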
Save As CSV
Tag v0.4
CreatableRelationProvider
● DefaultSource should implement CreatableRelationProvider in order to support the save call
● Override the createRelation method to implement the save mechanism
● Convert the RDD[Row] to strings and use saveAsTextFile to save
● Ex : CsvSaveExample.scala
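The conversion step is just the reverse of the scan. A plain-Scala sketch, with `Seq[Any]` again standing in for `Row` and an ordinary Seq for the RDD:

```scala
// Turn each Row-like Seq[Any] into one CSV line; the real implementation
// would then call rdd.saveAsTextFile(path) on the resulting RDD[String].
def rowsToCsv(rows: Seq[Seq[Any]], separator: String = ","): Seq[String] =
  rows.map(_.mkString(separator))
```

Note this naive mkString does not quote or escape separators inside values, which a real CSV writer must handle.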
Column Pruning
Tag v0.5
PrunedScan
● CsvRelation should implement the PrunedScan trait to optimize column access
● PrunedScan tells the data source which columns the query wants to access
● When we build the RDD[Row], we include only the needed columns
● No performance benefit for CSV data; it is just for demonstration. It has great performance benefits in sources like JDBC
● Ex : SalesSumExample.scala
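The pruning mechanics can be sketched in plain Scala: given the relation's column names and the required columns Spark asks for, keep only those positions in each Row-like Seq:

```scala
// Mirrors PrunedScan's buildScan(requiredColumns): keep only the
// requested columns, in the requested order.
def prune(columns: Seq[String],
          requiredColumns: Seq[String],
          rows: Seq[Seq[Any]]): Seq[Seq[Any]] = {
  val indices = requiredColumns.map(c => columns.indexOf(c))
  rows.map(row => indices.map(i => row(i)))
}
```

For CSV every column is parsed anyway, so this only shrinks the rows; a JDBC source can instead generate `SELECT age FROM ...` and avoid transferring the other columns at all.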
Filter push-down
Tag v0.6
PrunedFilteredScan
● CsvRelation should implement the PrunedFilteredScan trait to optimize filtering
● PrunedFilteredScan pushes filters down to the data source
● When we build the RDD[Row], we include only rows that satisfy the filters
● It is an optimization: Spark will evaluate the filters again
● Ex : CsvFilterExample.scala
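The push-down can be sketched with a tiny filter ADT standing in for Spark's `org.apache.spark.sql.sources.Filter` hierarchy (only two of the real filter shapes are modelled here):

```scala
// Minimal stand-ins for Spark's pushed-down filters (illustration only).
sealed trait Filter
case class EqualTo(column: String, value: Any) extends Filter
case class GreaterThan(column: String, value: Int) extends Filter

def satisfies(columns: Seq[String], row: Seq[Any], filter: Filter): Boolean =
  filter match {
    case EqualTo(col, v) => row(columns.indexOf(col)) == v
    case GreaterThan(col, v) => row(columns.indexOf(col)) match {
      case n: Int => n > v
      case _      => false
    }
  }

// Mirrors PrunedFilteredScan: emit only rows that satisfy every filter.
def filteredScan(columns: Seq[String],
                 rows: Seq[Seq[Any]],
                 filters: Seq[Filter]): Seq[Seq[Any]] =
  rows.filter(row => filters.forall(f => satisfies(columns, row, f)))
```

Because Spark re-evaluates the filters afterwards, a source may safely under-filter (return extra rows) but must never drop rows that do satisfy the filters.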