
Page 1: DataEngConf SF16 - Spark SQL Workshop

Agenda
● Brief Review of Spark (15 min)
● Intro to Spark SQL (30 min)
● Code session 1: Lab (45 min)
● Break (15 min)
● Intermediate Topics in Spark SQL (30 min)
● Code session 2: Quiz (30 min)
● Wrap up (15 min)

Page 2: DataEngConf SF16 - Spark SQL Workshop

Spark Review
By Aaron Merlob

Page 3: DataEngConf SF16 - Spark SQL Workshop

Apache Spark
● Open-source cluster computing framework
● “Successor” to Hadoop MapReduce
● Supports Scala, Java, and Python!

https://en.wikipedia.org/wiki/Apache_Spark

Page 4: DataEngConf SF16 - Spark SQL Workshop

Spark Core + Libraries

https://spark.apache.org

Page 5: DataEngConf SF16 - Spark SQL Workshop

Resilient Distributed Dataset
● Distributed Collection
● Fault-tolerant
● Parallel operation - Partitioned
● Many data sources

Implementation...

RDD - Main Abstraction

Page 6: DataEngConf SF16 - Spark SQL Workshop

Immutable

Immutable

Lazily Evaluated

Cachable

Type Inferred

Page 7: DataEngConf SF16 - Spark SQL Workshop

Lazily Evaluated
How Good Is Aaron’s Presentation?

Immutable

Lazily Evaluated

Cachable

Type Inferred

Page 8: DataEngConf SF16 - Spark SQL Workshop

Cachable

Immutable

Lazily Evaluated

Cachable

Type Inferred

Page 9: DataEngConf SF16 - Spark SQL Workshop

Type Inferred (Scala)

Immutable

Lazily Evaluated

Cachable

Type Inferred

Page 10: DataEngConf SF16 - Spark SQL Workshop

RDD Operations

Actions

Transformations
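
A minimal sketch of the distinction, assuming the Spark shell's built-in `sc` (the values here are illustrative): transformations are lazy and only describe a computation, while actions trigger it.

val nums = sc.parallelize(1 to 10)      // creates an RDD; nothing is computed yet
val doubled = nums.map(_ * 2)           // transformation: lazy, just records the step
val evens = doubled.filter(_ % 4 == 0)  // transformation: still lazy
evens.count()                           // action: the job actually runs here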

Page 11: DataEngConf SF16 - Spark SQL Workshop

Cache & Persist
Transformed RDDs are recomputed on each action.
Store RDDs in memory using cache (or persist).
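
A minimal sketch of cache/persist, assuming `sc` from the shell; the input path is hypothetical:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("events.log")             // hypothetical input
val errors = logs.filter(_.contains("ERROR"))    // transformation: lazy
errors.cache()                                   // equivalent to persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK)  // alternative: pick an explicit storage level
errors.count()                                   // first action: reads the file, fills the cache
errors.count()                                   // second action: served from the cached partitions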

Page 12: DataEngConf SF16 - Spark SQL Workshop

SparkContext
● Your way to get data into/out of RDDs
● Given as ‘sc’ when you launch the Spark shell.

For example: sc.parallelize()
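
A minimal sketch of getting data in and out through `sc` (the file path is hypothetical):

val fromMemory = sc.parallelize(Seq(1, 2, 3, 4))  // distribute a local collection
val fromFile   = sc.textFile("data.txt")          // read a text file (local FS, HDFS, S3, ...)
fromMemory.collect()                              // action: bring the data back to the driver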


Page 13: DataEngConf SF16 - Spark SQL Workshop

Transformation vs. Action?

val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2)
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()

Page 14: DataEngConf SF16 - Spark SQL Workshop

Transformation vs. Action?

val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2)
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()

Page 15: DataEngConf SF16 - Spark SQL Workshop

Transformation vs. Action?

val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2)
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()

Page 16: DataEngConf SF16 - Spark SQL Workshop

Transformation vs. Action?

val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2)
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()

Page 17: DataEngConf SF16 - Spark SQL Workshop

Transformation vs. Action?

val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2).cache()
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()

Page 18: DataEngConf SF16 - Spark SQL Workshop

Spark SQL
By Aaron Merlob

Page 19: DataEngConf SF16 - Spark SQL Workshop

Spark SQL
RDDs with Schemas!

Page 20: DataEngConf SF16 - Spark SQL Workshop

Spark SQL
RDDs with Schemas!

Schemas = Table Names + Column Names + Column Types = Metadata

Page 21: DataEngConf SF16 - Spark SQL Workshop

Schemas
● Schema Pros
  ○ Enable column names instead of column positions
  ○ Queries using SQL (or DataFrame) syntax
  ○ Make your data more structured
● Schema Cons
  ○ ??
  ○ ??
  ○ ??

Page 22: DataEngConf SF16 - Spark SQL Workshop

Schemas
● Schema Pros
  ○ Enable column names instead of column positions
  ○ Queries using SQL (or DataFrame) syntax
  ○ Make your data more structured
● Schema Cons
  ○ Make your data more structured
  ○ Reduce future flexibility (app is more fragile)
  ○ Y2K

Page 23: DataEngConf SF16 - Spark SQL Workshop

HiveContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

Page 24: DataEngConf SF16 - Spark SQL Workshop

HiveContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

FYI - a less preferred alternative: org.apache.spark.sql.SQLContext
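
Once a context exists, SQL can be issued against any registered table. A minimal sketch, assuming a table has already been registered (as done later with registerTempTable):

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val top = sqlContext.sql("SELECT * FROM stocks LIMIT 10")  // "stocks" is hypothetical here
top.show()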

Page 25: DataEngConf SF16 - Spark SQL Workshop

DataFrames
Primary abstraction in Spark SQL

● Evolved from SchemaRDD
● Exposes functionality via SQL or DataFrame API
● SQL for developer productivity (ETL, BI, etc.)
● DF for data scientist productivity (R / Pandas)
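
A minimal sketch of the two entry points answering the same question, assuming a DataFrame `df` registered as the temp table "stocks" (the column names are hypothetical):

val viaSql = sqlContext.sql("SELECT Date, Close FROM stocks WHERE Close > 100")
val viaDF  = df.select("Date", "Close").filter(df("Close") > 100)
viaSql.show()
viaDF.show()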

Page 26: DataEngConf SF16 - Spark SQL Workshop

Live Coding - Spark-Shell
Maven Packages for CSV and Avro:
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1

spark-shell --packages $SPARK_PKGS

Page 27: DataEngConf SF16 - Spark SQL Workshop

Live Coding - Loading CSV

val path = "AAPL.csv"
val df = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load(path)
df.registerTempTable("stocks")
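
Once the temp table is registered, the same DataFrame can be queried with plain SQL; a minimal sketch:

sqlContext.sql("SELECT * FROM stocks LIMIT 5").show()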

Page 28: DataEngConf SF16 - Spark SQL Workshop

Caching
If I run a query twice, how many times will the data be read from disk?

Page 29: DataEngConf SF16 - Spark SQL Workshop

Caching
If I run a query twice, how many times will the data be read from disk?

1. RDDs are lazy.
2. Therefore the data will be read twice.
3. Unless you cache the RDD, all transformations in the RDD will execute on each action.

Page 30: DataEngConf SF16 - Spark SQL Workshop

Caching Tables
sqlContext.cacheTable("stocks")

Particularly useful when using Spark SQL to explore data, and if your data is on S3.

sqlContext.uncacheTable("stocks")

Page 31: DataEngConf SF16 - Spark SQL Workshop

Caching in SQL

SQL Command                   Speed
`CACHE TABLE sales;`          Eagerly
`CACHE LAZY TABLE sales;`     Lazily
`UNCACHE TABLE sales;`        Eagerly
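
These commands can also be run from the shell via the SQL context; a minimal sketch against the "stocks" temp table registered earlier:

sqlContext.sql("CACHE TABLE stocks")          // eager: materializes the cache immediately
// sqlContext.sql("CACHE LAZY TABLE stocks")  // alternative: cache on first use
sqlContext.sql("UNCACHE TABLE stocks")        // eager: frees the cached data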

Page 32: DataEngConf SF16 - Spark SQL Workshop

Caching Comparison
Caching Spark SQL DataFrames vs. caching plain non-DataFrame RDDs:
● RDDs are cached at the level of individual records
● DataFrames know more about the data
● DataFrames are cached using an in-memory columnar format

Page 33: DataEngConf SF16 - Spark SQL Workshop

Caching Comparison
What is the difference between these?
(a) sqlContext.cacheTable("df_table")
(b) df.cache
(c) sqlContext.sql("CACHE TABLE df_table")

Page 34: DataEngConf SF16 - Spark SQL Workshop

Lab 1
Spark SQL Workshop

Page 35: DataEngConf SF16 - Spark SQL Workshop

Spark SQL, the Sequel
By Aaron Merlob

Page 36: DataEngConf SF16 - Spark SQL Workshop

Live Coding - Filetype ETL
● Read in a CSV
● Export as JSON or Parquet
● Read JSON
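
A minimal sketch of the round trip, using the spark-csv package loaded earlier (the output paths are hypothetical):

val csv = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load("AAPL.csv")
csv.write.parquet("aapl.parquet")             // export as Parquet
csv.write.json("aapl.json")                   // or export as JSON
val back = sqlContext.read.json("aapl.json")  // read the JSON back in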

Page 37: DataEngConf SF16 - Spark SQL Workshop

Live Coding - Common
● Show
● Sample
● Take
● First
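
A minimal sketch of these calls on a DataFrame `df`:

df.show(10)                              // print the first rows as a formatted table
df.sample(withReplacement = false, 0.1)  // a new DataFrame with roughly 10% of the rows
df.take(5)                               // Array of the first 5 Rows, on the driver
df.first()                               // the first Row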

Page 38: DataEngConf SF16 - Spark SQL Workshop

Read Formats

Format    Read
Parquet   sqlContext.read.parquet(path)
ORC       sqlContext.read.orc(path)
JSON      sqlContext.read.json(path)
CSV       sqlContext.read.format("com.databricks.spark.csv").load(path)

Page 39: DataEngConf SF16 - Spark SQL Workshop

Write Formats

Format    Write
Parquet   df.write.parquet(path)
ORC       df.write.orc(path)
JSON      df.write.json(path)
CSV       df.write.format("com.databricks.spark.csv").save(path)

Page 40: DataEngConf SF16 - Spark SQL Workshop

Schema Inference

Infer schema of JSON files:
● By default it scans the entire file.
● It finds the broadest type that will fit a field.
● This is an RDD operation so it happens fast.

Infer schema of CSV files:
● The CSV parser uses the same logic as the JSON parser.
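
When the inference scan is not wanted, a schema can be supplied explicitly instead. A minimal sketch, with hypothetical field names and file path:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)))
val people = sqlContext.read.schema(schema).json("people.json")  // no inference pass needed
people.printSchema()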

Page 41: DataEngConf SF16 - Spark SQL Workshop

User Defined Functions
How do you apply a “UDF”?
● Import types (StringType, IntegerType, etc.)
● Create UDF (in Scala)
● Apply the function (in SQL)
Notes:
● UDFs can take single or multiple arguments
● Optional registerFunction arg2: ‘return type’

Page 42: DataEngConf SF16 - Spark SQL Workshop

Live Coding - UDF
● Import types (StringType, IntegerType, etc.)
● Create UDF (in Scala)
● Apply the function (in SQL)
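
A minimal sketch of the three steps, against the "stocks" temp table (the column name "Close" and the UDF itself are hypothetical):

import org.apache.spark.sql.types._   // StringType, IntegerType, etc. live here

// Create and register the UDF in Scala
sqlContext.udf.register("bucket", (price: Double) => if (price > 100) "high" else "low")

// Apply the function in SQL
sqlContext.sql("SELECT Close, bucket(Close) FROM stocks").show()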

Page 43: DataEngConf SF16 - Spark SQL Workshop

Live Coding - Autocomplete
Find all types available for SQL schemas + UDF

Types and their meanings:
StringType  = String
IntegerType = Int
DoubleType  = Double

Page 44: DataEngConf SF16 - Spark SQL Workshop

Spark UI on port 4040