
Page 1: DataEngConf SF16 - Spark SQL Workshop

Agenda
● Brief Review of Spark (15 min)
● Intro to Spark SQL (30 min)
● Code session 1: Lab (45 min)
● Break (15 min)
● Intermediate Topics in Spark SQL (30 min)
● Code session 2: Quiz (30 min)
● Wrap up (15 min)

Page 2: DataEngConf SF16 - Spark SQL Workshop

Spark Review
By Aaron Merlob

Page 3: DataEngConf SF16 - Spark SQL Workshop

Apache Spark
● Open-source cluster computing framework
● “Successor” to Hadoop MapReduce
● Supports Scala, Java, and Python!

https://en.wikipedia.org/wiki/Apache_Spark

Page 4: DataEngConf SF16 - Spark SQL Workshop

Spark Core + Libraries

https://spark.apache.org

Page 5: DataEngConf SF16 - Spark SQL Workshop

Resilient Distributed Dataset
● Distributed Collection
● Fault-tolerant
● Parallel operation - Partitioned
● Many data sources

Implementation...

RDD - Main Abstraction

Page 6: DataEngConf SF16 - Spark SQL Workshop

Immutable

Immutable

Lazily Evaluated

Cachable

Type Inferred

Page 7: DataEngConf SF16 - Spark SQL Workshop

Lazily Evaluated
How Good Is Aaron’s Presentation?

Immutable

Lazily Evaluated

Cachable

Type Inferred

Page 8: DataEngConf SF16 - Spark SQL Workshop

Cachable

Immutable

Lazily Evaluated

Cachable

Type Inferred

Page 9: DataEngConf SF16 - Spark SQL Workshop

Type Inferred (Scala)

Immutable

Lazily Evaluated

Cachable

Type Inferred

Page 10: DataEngConf SF16 - Spark SQL Workshop

RDD Operations

Actions

Transformations
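
A minimal sketch of the distinction, assuming the Spark shell's built-in `sc` (the values here are illustrative): transformations are lazy and only describe a computation, while actions trigger it.

val nums = sc.parallelize(1 to 10)      // creates an RDD; nothing is computed yet
val doubled = nums.map(_ * 2)           // transformation: lazy, just records the step
val evens = doubled.filter(_ % 4 == 0)  // transformation: still lazy
evens.count()                           // action: the job actually runs here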

Page 11: DataEngConf SF16 - Spark SQL Workshop

Cache & Persist
Transformed RDDs are recomputed on each action.
Store RDDs in memory using cache (or persist).
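
A minimal sketch of cache/persist, assuming `sc` from the shell; the input path is hypothetical:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("events.log")             // hypothetical input
val errors = logs.filter(_.contains("ERROR"))    // transformation: lazy
errors.cache()                                   // equivalent to persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK)  // alternative: pick an explicit storage level
errors.count()                                   // first action: reads the file, fills the cache
errors.count()                                   // second action: served from the cached partitions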

Page 12: DataEngConf SF16 - Spark SQL Workshop

SparkContext
● Your way to get data into/out of RDDs
● Given as ‘sc’ when you launch the Spark shell.

For example: sc.parallelize()
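
A minimal sketch of getting data in and out through `sc` (the file path is hypothetical):

val fromMemory = sc.parallelize(Seq(1, 2, 3, 4))  // distribute a local collection
val fromFile   = sc.textFile("data.txt")          // read a text file (local FS, HDFS, S3, ...)
fromMemory.collect()                              // action: bring the data back to the driver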


Page 13: DataEngConf SF16 - Spark SQL Workshop

Transformation vs. Action?

val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2)
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()

Page 14: DataEngConf SF16 - Spark SQL Workshop

Transformation vs. Action?

val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2)
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()

Page 15: DataEngConf SF16 - Spark SQL Workshop

Transformation vs. Action?

val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2)
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()

Page 16: DataEngConf SF16 - Spark SQL Workshop

Transformation vs. Action?

val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2)
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()

Page 17: DataEngConf SF16 - Spark SQL Workshop

Transformation vs. Action?

val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2).cache()
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()

Page 18: DataEngConf SF16 - Spark SQL Workshop

Spark SQL
By Aaron Merlob

Page 19: DataEngConf SF16 - Spark SQL Workshop

Spark SQL
RDDs with Schemas!

Page 20: DataEngConf SF16 - Spark SQL Workshop

Spark SQL
RDDs with Schemas!

Schemas = Table Names + Column Names + Column Types = Metadata

Page 21: DataEngConf SF16 - Spark SQL Workshop

Schemas
● Schema Pros
  ○ Enable column names instead of column positions
  ○ Queries using SQL (or DataFrame) syntax
  ○ Make your data more structured
● Schema Cons
  ○ ??
  ○ ??
  ○ ??

Page 22: DataEngConf SF16 - Spark SQL Workshop

Schemas
● Schema Pros
  ○ Enable column names instead of column positions
  ○ Queries using SQL (or DataFrame) syntax
  ○ Make your data more structured
● Schema Cons
  ○ Make your data more structured
  ○ Reduce future flexibility (app is more fragile)
  ○ Y2K

Page 23: DataEngConf SF16 - Spark SQL Workshop

HiveContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

Page 24: DataEngConf SF16 - Spark SQL Workshop

HiveContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

FYI - a less preferred alternative: org.apache.spark.sql.SQLContext
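
Once a context exists, SQL can be issued against any registered table. A minimal sketch, assuming a table has already been registered (as done later with registerTempTable):

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val top = sqlContext.sql("SELECT * FROM stocks LIMIT 10")  // "stocks" is hypothetical here
top.show()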

Page 25: DataEngConf SF16 - Spark SQL Workshop

DataFrames
Primary abstraction in Spark SQL

● Evolved from SchemaRDD
● Exposes functionality via SQL or DataFrame API
● SQL for developer productivity (ETL, BI, etc.)
● DF for data scientist productivity (R / Pandas)
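
A minimal sketch of the two entry points answering the same question, assuming a DataFrame `df` registered as the temp table "stocks" (the column names are hypothetical):

val viaSql = sqlContext.sql("SELECT Date, Close FROM stocks WHERE Close > 100")
val viaDF  = df.select("Date", "Close").filter(df("Close") > 100)
viaSql.show()
viaDF.show()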

Page 26: DataEngConf SF16 - Spark SQL Workshop

Live Coding - Spark-Shell
Maven Packages for CSV and Avro:
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1

spark-shell --packages $SPARK_PKGS

Page 27: DataEngConf SF16 - Spark SQL Workshop

Live Coding - Loading CSV

val path = "AAPL.csv"
val df = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load(path)
df.registerTempTable("stocks")
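
Once the temp table is registered, the same DataFrame can be queried with plain SQL; a minimal sketch:

sqlContext.sql("SELECT * FROM stocks LIMIT 5").show()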

Page 28: DataEngConf SF16 - Spark SQL Workshop

Caching
If I run a query twice, how many times will the data be read from disk?

Page 29: DataEngConf SF16 - Spark SQL Workshop

Caching
If I run a query twice, how many times will the data be read from disk?

1. RDDs are lazy.
2. Therefore the data will be read twice.
3. Unless you cache the RDD, all transformations in the RDD will execute on each action.

Page 30: DataEngConf SF16 - Spark SQL Workshop

Caching Tables
sqlContext.cacheTable("stocks")

Particularly useful when using Spark SQL to explore data, and if your data is on S3.

sqlContext.uncacheTable("stocks")

Page 31: DataEngConf SF16 - Spark SQL Workshop

Caching in SQL

SQL Command                   Speed
`CACHE TABLE sales;`          Eagerly
`CACHE LAZY TABLE sales;`     Lazily
`UNCACHE TABLE sales;`        Eagerly
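
These commands can also be run from the shell via the SQL context; a minimal sketch against the "stocks" temp table registered earlier:

sqlContext.sql("CACHE TABLE stocks")          // eager: materializes the cache immediately
// sqlContext.sql("CACHE LAZY TABLE stocks")  // alternative: cache on first use
sqlContext.sql("UNCACHE TABLE stocks")        // eager: frees the cached data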

Page 32: DataEngConf SF16 - Spark SQL Workshop

Caching Comparison
Caching Spark SQL DataFrames vs. caching plain non-DataFrame RDDs:
● RDDs are cached at the level of individual records
● DataFrames know more about the data
● DataFrames are cached using an in-memory columnar format

Page 33: DataEngConf SF16 - Spark SQL Workshop

Caching Comparison
What is the difference between these?
(a) sqlContext.cacheTable("df_table")
(b) df.cache
(c) sqlContext.sql("CACHE TABLE df_table")

Page 34: DataEngConf SF16 - Spark SQL Workshop

Lab 1
Spark SQL Workshop

Page 35: DataEngConf SF16 - Spark SQL Workshop

Spark SQL, the Sequel
By Aaron Merlob

Page 36: DataEngConf SF16 - Spark SQL Workshop

Live Coding - Filetype ETL
● Read in a CSV
● Export as JSON or Parquet
● Read JSON
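
A minimal sketch of the round trip, using the spark-csv package loaded earlier (the output paths are hypothetical):

val csv = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load("AAPL.csv")
csv.write.parquet("aapl.parquet")             // export as Parquet
csv.write.json("aapl.json")                   // or export as JSON
val back = sqlContext.read.json("aapl.json")  // read the JSON back in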

Page 37: DataEngConf SF16 - Spark SQL Workshop

Live Coding - Common
● Show
● Sample
● Take
● First
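
A minimal sketch of these calls on a DataFrame `df`:

df.show(10)                              // print the first rows as a formatted table
df.sample(withReplacement = false, 0.1)  // a new DataFrame with roughly 10% of the rows
df.take(5)                               // Array of the first 5 Rows, on the driver
df.first()                               // the first Row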

Page 38: DataEngConf SF16 - Spark SQL Workshop

Read Formats

Format    Read
Parquet   sqlContext.read.parquet(path)
ORC       sqlContext.read.orc(path)
JSON      sqlContext.read.json(path)
CSV       sqlContext.read.format("com.databricks.spark.csv").load(path)

Page 39: DataEngConf SF16 - Spark SQL Workshop

Write Formats

Format    Write
Parquet   df.write.parquet(path)
ORC       df.write.orc(path)
JSON      df.write.json(path)
CSV       df.write.format("com.databricks.spark.csv").save(path)

Page 40: DataEngConf SF16 - Spark SQL Workshop

Schema Inference

Infer schema of JSON files:
● By default it scans the entire file.
● It finds the broadest type that will fit a field.
● This is an RDD operation so it happens fast.

Infer schema of CSV files:
● The CSV parser uses the same logic as the JSON parser.
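
When the inference scan is not wanted, a schema can be supplied explicitly instead. A minimal sketch, with hypothetical field names and file path:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)))
val people = sqlContext.read.schema(schema).json("people.json")  // no inference pass needed
people.printSchema()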

Page 41: DataEngConf SF16 - Spark SQL Workshop

User Defined Functions
How do you apply a “UDF”?
● Import types (StringType, IntegerType, etc.)
● Create UDF (in Scala)
● Apply the function (in SQL)
Notes:
● UDFs can take single or multiple arguments
● Optional registerFunction arg2: ‘return type’

Page 42: DataEngConf SF16 - Spark SQL Workshop

Live Coding - UDF
● Import types (StringType, IntegerType, etc.)
● Create UDF (in Scala)
● Apply the function (in SQL)
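
A minimal sketch of the three steps, against the "stocks" temp table (the column name "Close" and the UDF itself are hypothetical):

import org.apache.spark.sql.types._   // StringType, IntegerType, etc. live here

// Create and register the UDF in Scala
sqlContext.udf.register("bucket", (price: Double) => if (price > 100) "high" else "low")

// Apply the function in SQL
sqlContext.sql("SELECT Close, bucket(Close) FROM stocks").show()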

Page 43: DataEngConf SF16 - Spark SQL Workshop

Live Coding - Autocomplete
Find all types available for SQL schemas + UDF

Types and their meanings:
StringType  = String
IntegerType = Int
DoubleType  = Double

Page 44: DataEngConf SF16 - Spark SQL Workshop

Spark UI on port 4040