Intro to Spark with Zeppelin
TRANSCRIPT
Robert Hryniewicz, Data Evangelist, @RobHryniewicz
Intro to Spark & Zeppelin
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Spark Background
What is Spark?
Apache Open Source Project - originally developed at AMPLab (University of California, Berkeley)
Data Processing Engine - focused on in-memory distributed computing use cases
API - Scala, Python, Java, and R
Spark Ecosystem
Spark Core
Spark SQL, Spark Streaming, MLlib, GraphX
Why Spark?
Elegant Developer APIs
– Single environment for data munging and Machine Learning (ML)
In-memory computation model
– Fast!
– Effective for iterative computations and ML
Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)
History of Hadoop & Spark
Apache Spark Basics
Spark Context
What is it?
Main entry point for Spark functionality
Represents a connection to a Spark cluster
Represented as sc in your code
RDD - Resilient Distributed Dataset
Primary abstraction in Spark
– An immutable collection of objects (or records, or elements) that can be operated on in parallel
Distributed
– Collection of elements partitioned across nodes in a cluster
– Each RDD is composed of one or more partitions
– User can control the number of partitions
– More partitions => more parallelism
Resilient
– Recover from node failures
– An RDD keeps its lineage information -> it can be recreated from parent RDDs
Created by starting with a file in Hadoop Distributed File System (HDFS) or an existing collection in the driver program
May be persisted in memory for efficient reuse across parallel operations (caching)
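The partitioning idea above can be sketched in plain Python (a toy stand-in, not the Spark API; the function names here are made up for illustration): split a collection into partitions, then apply a function to each partition as an independent unit of work.

```python
def partition(data, num_partitions):
    """Split a list into roughly equal contiguous partitions,
    the way an RDD's elements are split across partitions."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partitions(parts, fn):
    # Each partition could run on a different node in parallel;
    # here a simple loop shows that partitions are independent.
    return [list(map(fn, part)) for part in parts]

parts = partition(list(range(10)), 3)
print(parts)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
result = map_partitions(parts, lambda x: x * x)
```

More partitions means more such independent units, hence more opportunities for parallelism.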
RDD – Resilient Distributed Dataset
[Diagram: RDD 1 (four partitions) and RDD 2 (three partitions) spread across cluster nodes]
Spark SQL
Spark SQL Overview
Spark module for structured data processing (e.g. DB tables, JSON files)
Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API
Same execution engine for all three
Spark SQL interfaces provide more information about both the structure and the computation being performed than the basic Spark RDD API
DataFrames
Conceptually equivalent to a table in a relational DB or a data frame in R/Python
API available in Scala, Java, Python, and R
Richer optimizations (significantly faster than RDDs)
Distributed collection of data organized into named columns
Underneath is an RDD
[Diagram: CSV, Avro, Hive, and text sources feeding Spark SQL into a DataFrame of rows and named columns (Col1 ... ColN), with an RDD underneath]
Created from Various Sources
DataFrames from Hive:
– Reading and writing Hive tables, including ORC
DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBase, Avro
DataFrames from existing RDDs:
– with the toDF() function
Data is described as a DataFrame with rows, columns, and a schema
SQL Context and Hive Context
SQLContext
Entry point into all functionality in Spark SQL
All you need is a SparkContext:
val sqlContext = new SQLContext(sc)
HiveContext
Superset of functionality provided by the basic SQLContext
– Read data from Hive tables
– Access to Hive functions (UDFs)
Use when your data resides in Hive:
val hc = new HiveContext(sc)
Spark SQL Examples
DataFrame Example
Reading Data From a Table

val df = sqlContext.table("flightsTbl")
df.select("Origin", "Dest", "DepDelay").show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+
DataFrame Example
Using the DataFrame API to Filter Data (show delays of more than 15 min)

df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
SQL Example
Using SQL to Query and Filter Data (again, show delays of more than 15 min)

// Register Temporary Table
df.registerTempTable("flights")

// Use SQL to Query Dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
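The query pattern itself is just SQL, so outside a Spark cluster it can be sketched with Python's built-in sqlite3 module. The table and the sample rows below are toy stand-ins mirroring the slide's data, not the actual flights dataset.

```python
import sqlite3

# Toy rows mirroring the slide's sample (Origin, Dest, DepDelay)
rows = [("IAD", "TPA", 8), ("IAD", "TPA", 19), ("IND", "BWI", 8),
        ("IND", "BWI", -4), ("IND", "BWI", 34), ("IND", "JAX", 25),
        ("IND", "LAS", 67), ("IND", "MCO", 94)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (Origin TEXT, Dest TEXT, DepDelay INT)")
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", rows)

# Same shape of query as the Spark SQL example on the slide
delayed = conn.execute(
    "SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5"
).fetchall()
print(delayed)
```

In Spark the same statement runs distributed across the cluster; registering the DataFrame as a temp table is what makes it visible to the SQL engine.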
RDD vs. DataFrame
RDDs vs. DataFrames
RDD
– Lower-level API (more control)
– Lots of existing code & users
– Compile-time type-safety
DataFrame
– Higher-level API (faster development)
– Faster sorting, hashing, and serialization
– More opportunities for automatic optimization
– Lower memory pressure
DataFrames are Intuitive
RDD Example vs. Equivalent DataFrame Example

+----+--------+---+
|dept|name    |age|
+----+--------+---+
|Bio |H Smith | 48|
|CS  |A Turing| 54|
|Bio |B Jones | 43|
|Phys|E Witten| 61|
+----+--------+---+

Find average age by department?
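The contrast can be sketched in plain Python (toy data from the slide; this mimics the RDD map/reduceByKey pattern, not the actual Spark API): the RDD-style answer needs explicit sum-and-count bookkeeping, whereas the DataFrame version is roughly a one-liner like df.groupBy("dept").avg("age").

```python
from collections import defaultdict

# Toy data from the slide: (dept, name, age)
people = [("Bio", "H Smith", 48), ("CS", "A Turing", 54),
          ("Bio", "B Jones", 43), ("Phys", "E Witten", 61)]

# RDD-style: map each record to (dept, (age, 1)), reduce by key,
# then divide sum by count -- all bookkeeping done by hand.
sums = defaultdict(lambda: (0, 0))
for dept, _, age in people:
    total, count = sums[dept]
    sums[dept] = (total + age, count + 1)

avg_by_dept = {dept: total / count for dept, (total, count) in sums.items()}
print(avg_by_dept)  # {'Bio': 45.5, 'CS': 54.0, 'Phys': 61.0}
```

The DataFrame API hides exactly this boilerplate, which is why it reads closer to the question being asked.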
Spark SQL Optimizations
Spark SQL uses an underlying optimization engine (Catalyst)
– Catalyst can perform intelligent optimization since it understands the schema
Spark SQL does not materialize all the columns (as with an RDD), only what's needed
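The column-pruning idea can be shown with a toy sketch (plain Python over a made-up CSV snippet; this illustrates the spirit of the optimization, not Catalyst itself): only the columns the query actually references are kept, and the rest are dropped before any further processing.

```python
import csv
import io

# Toy CSV source; suppose the "query" only needs Origin and DepDelay.
raw = "Origin,Dest,DepDelay\nIAD,TPA,8\nIND,BWI,34\n"
wanted = ["Origin", "DepDelay"]

reader = csv.DictReader(io.StringIO(raw))
# Project early: keep only the requested columns per row, so the
# unused Dest column is never materialized downstream.
pruned = [{k: row[k] for k in wanted} for row in reader]
print(pruned)
```

Because Catalyst sees the schema and the full query plan, it can push this kind of projection (and filters) down toward the data source automatically.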
Apache Zeppelin & HDP Sandbox
Apache Zeppelin
Web-based notebook for interactive analytics
Use Cases
– Data exploration and discovery
– Visualization
– Interactive snippet-at-a-time experience
– "Modern Data Science Studio"
Features
– Deeply integrated with Spark and Hadoop
– Supports multiple language backends
– Pluggable "Interpreters"
What’s not included with Spark?
[Diagram: Spark ships the Core Engine plus libraries (Spark SQL, Spark Streaming, MLlib) and Scala/Java/Python APIs, but not resource management, storage, or applications]
HDP Sandbox
What's included in the Sandbox?
Zeppelin
Latest Hortonworks Data Platform (HDP)
– Spark
– YARN Resource Management
– HDFS Distributed Storage Layer
– And many more components...
[Diagram: Spark Core Engine with Scala/Java/Python/R APIs and Spark SQL, Spark Streaming, MLlib, and GraphX, running on YARN over an N-node HDFS cluster]
Access patterns enabled by YARN
YARN: Data Operating System
[Diagram: batch, interactive, and real-time applications running on YARN over an N-node HDFS (Hadoop Distributed File System) cluster]
Batch - needs to happen, but with no timeframe limitations
Interactive - needs to happen at human time
Real-Time - needs to happen at machine execution time
Why Spark on YARN?
Utilize existing HDP cluster infrastructure
Resource management
– share Spark workloads with other workloads like Pig, Hive, etc.
Scheduling and queues
[Diagram: the Spark Driver on the client talks to a Spark Application Master in a YARN container, which manages Spark Executors, each in its own YARN container and each running multiple tasks]
Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation
[Diagram: a logical file split into 4 blocks, with 3 copies of each block distributed across cluster nodes]
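The split-and-replicate scheme can be sketched as a toy in Python (the function and node names are made up for illustration; real HDFS placement is rack-aware, not purely random): divide a file into fixed-size blocks and assign each block to three distinct nodes.

```python
import random

def place_blocks(file_size, block_size, nodes, replicas=3):
    """Toy HDFS-style placement: split a file into blocks and
    assign each block to `replicas` distinct nodes at random."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    return {block: random.sample(nodes, replicas)
            for block in range(n_blocks)}

nodes = ["node1", "node2", "node3", "node4", "node5"]
# A 500 MB file with 128 MB blocks -> 4 blocks, each stored 3 times
placement = place_blocks(file_size=500, block_size=128, nodes=nodes)
print(placement)
```

With three replicas per block, the loss of any single node leaves every block still readable, and computation can be scheduled on whichever node already holds a local copy.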
There's more to HDP
[Diagram: Hortonworks Data Platform 2.4.x - YARN (Data Operating System) over HDFS, with data access engines (MapReduce batch, Pig script, Solr search, Hive SQL, HBase/Accumulo/Phoenix NoSQL, Storm stream, in-memory and ISV engines on Tez and Slider); data lifecycle & governance (Falcon, Atlas); data workflow (Sqoop, Flume, Kafka, NFS, WebHDFS); security (Ranger, Knox, Atlas, HDFS encryption); operations (Ambari, Cloudbreak, ZooKeeper, plus Oozie scheduling); deployment choice of Linux, Windows, on-premise, or cloud]
Hortonworks Community Connection
community.hortonworks.com
Sample of HCC Data Science, Analytics, and Spark-related Questions
Lab Preview
Link to Tutorials with Lab Instructions
http://tinyurl.com/hwx-intro-to-spark
Thank you!
community.hortonworks.com