Intro to Spark with Zeppelin

Robert Hryniewicz, Data Evangelist (@RobHryniewicz): Intro to Spark & Zeppelin

Upload: hortonworks

Posted on 16-Jan-2017


TRANSCRIPT

Page 1: Intro to Spark with Zeppelin

Robert Hryniewicz, Data Evangelist, @RobHryniewicz

Intro to Spark & Zeppelin

Page 2: Intro to Spark with Zeppelin

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Apache Spark Background

Page 3: Intro to Spark with Zeppelin


What is Spark?

Apache Open Source Project - originally developed at AMPLab (University of California, Berkeley)

Data Processing Engine - focused on in-memory distributed computing use cases

API - Scala, Python, Java and R

Page 4: Intro to Spark with Zeppelin


Spark Ecosystem

Spark Core

Spark SQL | Spark Streaming | MLlib | GraphX

Page 5: Intro to Spark with Zeppelin


Why Spark?

Elegant Developer APIs
– Single environment for data munging and Machine Learning (ML)

In-memory computation model
– Fast! Effective for iterative computations and ML

Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)

Page 6: Intro to Spark with Zeppelin


History of Hadoop & Spark

Page 7: Intro to Spark with Zeppelin


Apache Spark Basics

Page 8: Intro to Spark with Zeppelin


Spark Context

What is it?
– Main entry point for Spark functionality
– Represents a connection to a Spark cluster
– Represented as sc in your code

Page 9: Intro to Spark with Zeppelin


RDD - Resilient Distributed Dataset

Primary abstraction in Spark
– An immutable collection of objects (or records, or elements) that can be operated on in parallel

Distributed
– Collection of elements partitioned across nodes in a cluster
– Each RDD is composed of one or more partitions
– User can control the number of partitions
– More partitions => more parallelism

Resilient
– Recovers from node failures
– An RDD keeps its lineage information, so it can be recreated from parent RDDs

Created by starting with a file in Hadoop Distributed File System (HDFS) or an existing collection in the driver program

May be persisted in memory for efficient reuse across parallel operations (caching)
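The partition-then-combine idea behind RDDs can be sketched without a cluster. The following is an illustrative plain-Scala sketch, not Spark API; all names in it (data, numPartitions, partialSums) are invented for illustration.

```scala
// Illustrative sketch only (plain Scala, no Spark needed): how a dataset
// split into partitions can be processed in parallel and then combined.
val data = (1 to 12).toList

val numPartitions = 3
// Split the collection into equally sized partitions, the way Spark
// distributes an RDD's elements across nodes.
val partitions: List[List[Int]] = data.grouped(data.size / numPartitions).toList

// Each partition can be processed independently (here: a per-partition sum),
// then the partial results are combined -- the essence of a parallel operation.
val partialSums = partitions.map(_.sum)
val total = partialSums.sum
```

In real Spark, sc.parallelize(1 to 12, 3) would create an RDD with three partitions, and an action such as reduce(_ + _) would combine per-partition results the same way.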

Page 10: Intro to Spark with Zeppelin


RDD – Resilient Distributed Dataset

[Diagram: two RDDs spread across cluster nodes; RDD 1 has four partitions and RDD 2 has three]

Page 11: Intro to Spark with Zeppelin


Spark SQL

Page 12: Intro to Spark with Zeppelin


Spark SQL Overview

Spark module for structured data processing (e.g. DB tables, JSON files)

Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API

Same execution engine for all three

Spark SQL interfaces provide more information about both the structure and the computation being performed than the basic Spark RDD API

Page 13: Intro to Spark with Zeppelin


DataFrames

– Conceptually equivalent to a table in a relational DB or a data frame in R/Python
– API available in Scala, Java, Python, and R
– Richer optimizations (significantly faster than RDDs)
– Distributed collection of data organized into named columns
– Underneath is an RDD

Page 14: Intro to Spark with Zeppelin


DataFrames: Created from Various Sources

[Diagram: sources (CSV, Avro, Text, HIVE) feeding Spark SQL, which produces a DataFrame of rows and named columns (Col1 ... ColN) with an RDD underneath]

DataFrames from HIVE:
– Reading and writing HIVE tables, including ORC

DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBASE, Avro

DataFrames from existing RDDs:
– with the toDF() function

Data is described as a DataFrame with rows, columns and a schema

Page 15: Intro to Spark with Zeppelin


SQL Context and Hive Context

SQLContext
– Entry point into all functionality in Spark SQL
– All you need is a SparkContext:

val sqlContext = new SQLContext(sc)

HiveContext
– Superset of functionality provided by the basic SQLContext
– Read data from Hive tables
– Access to Hive functions (UDFs)
– Use when your data resides in Hive:

val hc = new HiveContext(sc)

Page 16: Intro to Spark with Zeppelin


Spark SQL Examples

Page 17: Intro to Spark with Zeppelin


DataFrame Example

val df = sqlContext.table("flightsTbl")

df.select("Origin", "Dest", "DepDelay").show(5)

Reading Data From Table

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+

Page 18: Intro to Spark with Zeppelin


DataFrame Example

df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)

Using DataFrame API to Filter Data (show delays more than 15 min)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

Page 19: Intro to Spark with Zeppelin


SQL Example

// Register Temporary Table

df.registerTempTable("flights")

// Use SQL to Query Dataset

sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show

Using SQL to Query and Filter Data (again, show delays more than 15 min)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

Page 20: Intro to Spark with Zeppelin


RDD vs. DataFrame

Page 21: Intro to Spark with Zeppelin


RDDs vs. DataFrames

RDD
– Lower-level API (more control)
– Lots of existing code & users
– Compile-time type-safety

DataFrame
– Higher-level API (faster development)
– Faster sorting, hashing, and serialization
– More opportunities for automatic optimization
– Lower memory pressure

Page 22: Intro to Spark with Zeppelin


Data Frames are Intuitive

RDD Example

Equivalent Data Frame Example

dept | name     | age
-----|----------|----
Bio  | H Smith  | 48
CS   | A Turing | 54
Bio  | B Jones  | 43
Phys | E Witten | 61

Find average age by department?
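The slide's RDD and DataFrame code did not survive the transcript, so here is a minimal plain-Scala sketch of the same aggregation. It is not the slide's actual Spark code; the names (Person, people, avgAgeByDept) are invented for illustration. In Spark, the DataFrame version would read roughly like df.groupBy("dept").avg("age"), which is close to this shape.

```scala
// Plain-Scala sketch of "average age by department" over the slide's table.
case class Person(dept: String, name: String, age: Int)

val people = List(
  Person("Bio",  "H Smith",  48),
  Person("CS",   "A Turing", 54),
  Person("Bio",  "B Jones",  43),
  Person("Phys", "E Witten", 61)
)

// Group rows by department, then average the ages within each group.
val avgAgeByDept: Map[String, Double] =
  people.groupBy(_.dept).map { case (dept, ps) =>
    dept -> ps.map(_.age).sum.toDouble / ps.size
  }
```

The equivalent RDD version needs an explicit (sum, count) pair per key; the DataFrame version lets the engine do that bookkeeping, which is why the slide calls DataFrames intuitive.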

Page 23: Intro to Spark with Zeppelin


Spark SQL Optimizations

Spark SQL uses an underlying optimization engine (Catalyst)
– Catalyst can perform intelligent optimization since it understands the schema

Spark SQL does not materialize all the columns (as with an RDD), only what's needed

Page 24: Intro to Spark with Zeppelin


Apache Zeppelin & HDP Sandbox

Page 25: Intro to Spark with Zeppelin


Apache Zeppelin

Web-based notebook for interactive analytics

Use Cases
– Data exploration and discovery
– Visualization
– Interactive snippet-at-a-time experience
– "Modern Data Science Studio"

Features
– Deeply integrated with Spark and Hadoop
– Supports multiple language backends
– Pluggable "Interpreters"

Page 26: Intro to Spark with Zeppelin


What’s not included with Spark?

Spark provides:
– Spark Core Engine
– Scala, Java, and Python libraries
– Spark SQL*, Spark Streaming*, MLlib (machine learning)

Not included:
– Resource Management
– Storage
– Applications

Page 27: Intro to Spark with Zeppelin


HDP Sandbox

What’s included in the Sandbox?
– Zeppelin
– Latest Hortonworks Data Platform (HDP)
– Spark (Core Engine; Spark SQL, Spark Streaming, MLlib, GraphX; Scala, Java, Python, R APIs)
– YARN Resource Management
– HDFS Distributed Storage Layer
– And many more components...

[Diagram: Spark stack running on YARN over an N-node HDFS cluster]

Page 28: Intro to Spark with Zeppelin


Access patterns enabled by YARN

[Diagram: YARN (Data Operating System) over HDFS (Hadoop Distributed File System), running Batch, Interactive, and Real-Time applications across N cluster nodes]

Batch - needs to happen, but with no timeframe limitations
Interactive - needs to happen at human time
Real-Time - needs to happen at machine execution time

Page 29: Intro to Spark with Zeppelin


Why Spark on YARN?

– Utilize existing HDP cluster infrastructure
– Resource management: share Spark workloads with other workloads like PIG, HIVE, etc.
– Scheduling and queues

[Diagram: the client launches the Spark Driver; a Spark Application Master runs in a YARN container and manages Spark Executors, each in its own YARN container running tasks]

Page 30: Intro to Spark with Zeppelin


Why HDFS?

Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation

[Diagram: a logical file split into blocks 1-4, with three copies of each block spread across cluster nodes]
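The block-and-replica arithmetic behind the diagram is simple to sketch. The numbers below are assumptions, not facts from the slide: a 128 MB block size (the HDFS default in Hadoop 2.x) and replication factor 3, both configurable in a real cluster, applied to a hypothetical 500 MB file.

```scala
// Sketch of HDFS block arithmetic. Assumptions: 128 MB block size and
// replication factor 3 (HDFS defaults, both configurable per cluster).
val blockSize: Long = 128L * 1024 * 1024   // 128 MB in bytes
val replication    = 3

// A hypothetical 500 MB logical file...
val fileSize: Long = 500L * 1024 * 1024

// ...is divided into ceil(fileSize / blockSize) blocks...
val numBlocks = ((fileSize + blockSize - 1) / blockSize).toInt

// ...and each block is stored replication-many times across the cluster,
// so losing any single node leaves at least two copies of every block.
val totalBlockCopies = numBlocks * replication
```

Data locality follows from the same layout: a computation over block 2 can be scheduled on any of the nodes already holding one of its copies.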

Page 31: Intro to Spark with Zeppelin


There’s more to HDP

Hortonworks Data Platform 2.4.x

– Core: YARN (Data Operating System) over HDFS (Hadoop Distributed File System)
– Data access: Batch (MapReduce), Script (Pig), Search (Solr), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-memory, Others, ISV Engines; running on Tez and Slider
– Data Lifecycle & Governance: Falcon, Atlas
– Security (administration, authentication, authorization, auditing, data protection): Ranger, Knox, Atlas, HDFS Encryption
– Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
– Operations (provisioning, managing, & monitoring): Ambari, Cloudbreak, Zookeeper; Scheduling: Oozie
– Deployment choice: Linux, Windows, on-premise, cloud

Page 32: Intro to Spark with Zeppelin


Hortonworks Community Connection

Page 33: Intro to Spark with Zeppelin


community.hortonworks.com

Page 34: Intro to Spark with Zeppelin


community.hortonworks.com

Page 35: Intro to Spark with Zeppelin


A sample of HCC Data Science, Analytics, and Spark-related questions

Page 36: Intro to Spark with Zeppelin


Lab Preview

Page 37: Intro to Spark with Zeppelin


Link to Tutorials with Lab Instructions

http://tinyurl.com/hwx-intro-to-spark

Page 38: Intro to Spark with Zeppelin

Thank you!

community.hortonworks.com