Anatomy of Data Source API: A deep dive into the Spark Data Source API

https://github.com/phatak-dev/anatomy_of_spark_datasource_api

TRANSCRIPT

Page 1:

Anatomy of Data Source API

A deep dive into the Spark Data source API

https://github.com/phatak-dev/anatomy_of_spark_datasource_api

Page 2:

● Madhukara Phatak

● Big data consultant and trainer at datamantra.io

● Consults in Hadoop, Spark and Scala

● www.madhukaraphatak.com

Page 3:

Agenda

● Data Source API
● Schema discovery
● Build Scan
● Data type inference
● Save
● Column pruning
● Filter push

Page 4:

Data source API

● Universal API for loading/saving structured data (see the sketch below)
● Built-in support for Hive, Avro, JSON, JDBC, Parquet
● Third-party integration through spark-packages
● Support for smart sources (sources that can accept column pruning and filter push-down)
● Third parties already supporting:
  ○ CSV
  ○ MongoDB
  ○ Cassandra (in the works), etc.
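For illustration, a minimal Spark 1.4-style load/save round trip through this unified interface; the paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object LoadSaveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LoadSave").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // load JSON through the built-in json data source
    val df = sqlContext.read.format("json").load("/tmp/input.json")

    // save through the same unified interface, this time as Parquet
    df.write.format("parquet").save("/tmp/output")
  }
}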

Page 5:

Data source API

Page 6:

Building CSV data source

● Ability to load and save CSV data
● Automatic schema discovery
● Support for user schema override (see the usage sketch below)
● Automatic data type inference
● Ability to prune columns
● Filter push
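As a preview, usage of the finished source could look like the following spark-shell sketch; the package name com.madhukaraphatak.spark.csv and the file path are placeholders, not the repo's actual names:

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// in spark-shell with the data source jar on the classpath; sqlContext is provided
// user schema override: skips automatic discovery and inference
val userSchema = StructType(Seq(
  StructField("itemId", StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true)))

val sales = sqlContext.read
  .format("com.madhukaraphatak.spark.csv") // placeholder package containing DefaultSource
  .schema(userSchema)                      // user schema override
  .load("/tmp/sales.csv")                  // placeholder path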

Page 7:

Schema discovery

Tag v0.1

Page 8:

CsvSchemaDiscovery Example

Page 9:

Default Source

● Spark looks for a class named DefaultSource in the given package of the data source
● DefaultSource should extend the RelationProvider trait
● RelationProvider is responsible for taking the user's parameters and turning them into a BaseRelation
● The SchemaRelationProvider trait allows a user-defined schema to be specified
● Ex : DefaultSource.scala (sketched together with BaseRelation after the next slide)

Page 10:

Base Relation

● Represents a collection of tuples with a known schema
● Methods that need to be overridden:
  ○ def sqlContext: SQLContext
    Returns the SQLContext used for building DataFrames
  ○ def schema: StructType
    Returns the schema of the relation as a StructType (analogous to a Hive SerDe)
● Ex : CsvRelation.scala (see the sketch below)
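A minimal combined sketch of the two slides above, assuming header-based schema discovery and a placeholder package name; this is illustrative, not the repo's exact DefaultSource.scala/CsvRelation.scala:

package com.madhukaraphatak.spark.csv // placeholder: format() must name this package

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Spark instantiates the class literally named DefaultSource in this package
class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // called when the user supplies no schema
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    createRelation(sqlContext, parameters, null)

  // called when the user supplies a schema via read.schema(...)
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
    new CsvRelation(path, Option(schema))(sqlContext)
  }
}

// the base relation: a collection of tuples with a known schema
class CsvRelation(path: String, userSchema: Option[StructType])
    (@transient val sqlContext: SQLContext) extends BaseRelation {

  // discover the schema from the header line unless the user supplied one;
  // every column is a string for now (type inference arrives in v0.3)
  override lazy val schema: StructType = userSchema.getOrElse {
    val header = sqlContext.sparkContext.textFile(path).first()
    StructType(header.split(",").map(name => StructField(name, StringType, nullable = true)))
  }
}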

Page 11:

Reading Data

Tag v0.2

Page 12:

TableScan

● TableScan is the trait to implement for reading data
● It's a BaseRelation that can produce all of its tuples as an RDD of Row objects
● Method to override:
  ○ def buildScan(): RDD[Row]
● In the CSV example, we use sc.textFile to create the RDD and then Row.fromSeq to convert each line into a Row (sketched below)
● Ex : CsvTableScanExample.scala
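A sketch of the relation with TableScan mixed in, again assuming a header line and naive comma splitting; not the exact CsvTableScanExample.scala:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class CsvRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  private lazy val headerLine: String =
    sqlContext.sparkContext.textFile(path).first()

  override lazy val schema: StructType =
    StructType(headerLine.split(",").map(name =>
      StructField(name, StringType, nullable = true)))

  // produce all tuples of the relation as an RDD of Row objects
  override def buildScan(): RDD[Row] = {
    val header = headerLine // local copy so the closure does not capture the relation
    sqlContext.sparkContext
      .textFile(path)                            // create the raw RDD
      .filter(_ != header)                       // drop the header line
      .map(line => Row.fromSeq(line.split(","))) // convert each line to a Row
  }
}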

Page 13:

Data Type inference

Tag v0.3

Page 14:

Inferring data types

● Until now, every value was treated as a string
● Sample the data and infer a schema for each row
● Take the inferred schema of the first row
● Update the table scan to cast values to the right data types (see the sketch below)
● Ex : CsvSchemaDiscovery.scala
● Ex : SalesSumExample.scala
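One way the inference described above can be sketched; these helpers are illustrative, not the repo's code. inferType falls back from Int to Double to String, inferSchema samples the first data row, and castValue is what the updated table scan would apply per field:

import org.apache.spark.sql.types._
import scala.util.Try

object TypeInference {

  // try Int, then Double, else fall back to String
  def inferType(value: String): DataType =
    if (Try(value.toInt).isSuccess) IntegerType
    else if (Try(value.toDouble).isSuccess) DoubleType
    else StringType

  // schema from the header and the first sampled data row
  def inferSchema(header: Array[String], firstRow: Array[String]): StructType =
    StructType(header.zip(firstRow).map { case (name, value) =>
      StructField(name, inferType(value), nullable = true)
    })

  // the table scan casts each raw string to the inferred type before building the Row
  def castValue(value: String, dataType: DataType): Any = dataType match {
    case IntegerType => value.toInt
    case DoubleType  => value.toDouble
    case _           => value
  }
}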

Page 15:

Save As Csv

Tag v0.4

Page 16:

CreatableRelationProvider

● DefaultSource should implement the CreatableRelationProvider trait in order to support the save call
● Override its createRelation method to implement the save mechanism (sketched below)
● Convert the RDD[Row] to strings and use saveAsTextFile to save
● Ex : CsvSaveExample.scala
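A sketch of the save path using the actual CreatableRelationProvider signature; returning an anonymous relation and ignoring the SaveMode are simplifications, flagged in the comments:

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType

class DefaultSource extends CreatableRelationProvider {

  // invoked by df.write.format(...).save(path)
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,                 // a full implementation would honor this
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))

    // render each Row as a comma-separated line and save as text
    data.rdd.map(_.toSeq.mkString(",")).saveAsTextFile(path)

    // hand back a relation describing what was just written; reusing the
    // read-side CsvRelation would be the natural choice in the full source
    val ctx = sqlContext
    new BaseRelation {
      override def sqlContext: SQLContext = ctx
      override def schema: StructType = data.schema
    }
  }
}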

Page 17:

Column Pruning

Tag v0.5

Page 18:

PrunedScan

● CsvRelation should implement the PrunedScan trait to optimize column access
● PrunedScan tells the data source which columns the query wants to access
● When we build the RDD[Row], we include only the needed columns (sketched below)
● No performance benefit for CSV data, just for demo purposes; but it brings great performance benefits in sources like JDBC
● Ex : SalesSumExample.scala
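A sketch of the pruned scan, under the same header-based assumptions as before:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class CsvRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with PrunedScan {

  private lazy val header: Array[String] =
    sqlContext.sparkContext.textFile(path).first().split(",")

  override lazy val schema: StructType =
    StructType(header.map(name => StructField(name, StringType, nullable = true)))

  // Spark passes in exactly the columns the query needs
  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    val localHeader = header // local copies keep the closures serializable
    val indices = requiredColumns.map(col => localHeader.indexOf(col))
    val headerLine = localHeader.mkString(",")
    sqlContext.sparkContext
      .textFile(path)
      .filter(_ != headerLine)
      .map { line =>
        val values = line.split(",")
        Row.fromSeq(indices.map(i => values(i))) // emit only the requested columns
      }
  }
}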

Page 19:

Filter push

Tag v0.6

Page 20:

PrunedFilteredScan

● CsvRelation should implement the PrunedFilteredScan trait to optimize filtering
● PrunedFilteredScan pushes filters down to the data source
● When we build the RDD[Row], we include only rows which satisfy the filters (sketched below)
● It's an optimization only; Spark will evaluate the filters again
● Ex : CsvFilterExample.scala
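A sketch using the real PrunedFilteredScan signature; only EqualTo is handled here, which is safe because Spark evaluates the filters again after the scan:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class CsvRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  private lazy val header: Array[String] =
    sqlContext.sparkContext.textFile(path).first().split(",")

  override lazy val schema: StructType =
    StructType(header.map(name => StructField(name, StringType, nullable = true)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    val localHeader = header // local copies keep the closures serializable
    val indices = requiredColumns.map(col => localHeader.indexOf(col))
    val headerLine = localHeader.mkString(",")

    // evaluate only the filters we understand; unsupported ones are left to Spark,
    // which re-applies every filter after the scan (this is purely an optimization)
    def matches(values: Array[String]): Boolean = filters.forall {
      case EqualTo(attribute, value) => values(localHeader.indexOf(attribute)) == value.toString
      case _                         => true
    }

    sqlContext.sparkContext
      .textFile(path)
      .filter(_ != headerLine)
      .map(_.split(","))
      .filter(matches)
      .map(values => Row.fromSeq(indices.map(i => values(i))))
  }
}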