introduction to spark - durham lug 20150916

Introduction to Apache Spark

Upload: ian-pointer

Post on 19-Jan-2017


TRANSCRIPT

Page 1: Introduction To Spark - Durham LUG 20150916

Introduction to Apache Spark

Page 2: Introduction To Spark - Durham LUG 20150916

www.mammothdata.com | @mammothdataco

The Leader in Big Data Consulting

● BI/Data Strategy
○ Development of a business intelligence / data architecture strategy.

● Installation
○ Installation of Hadoop or other relevant technologies.

● Data Consolidation
○ Load data from diverse sources into a single scalable repository.

● Streaming
○ Write ingestion and/or analytics that operate on data as it arrives, plus dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.

● Visualization Tools
○ Set up visualization tools (e.g. Tableau, Pentaho), create initial reports, and train the employees who will analyze the data.

Mammoth Data, based in downtown Durham (right above Toast)

Page 3: Introduction To Spark - Durham LUG 20150916

● Lead Consultant on all things DevOps and Spark

● @carsondial

Me!

Page 4: Introduction To Spark - Durham LUG 20150916

● Apache Spark™ is a fast and general engine for large-scale data processing

● Not all that helpful, is it?

What Is Apache Spark?!

Page 5: Introduction To Spark - Durham LUG 20150916

● Framework for massive parallel computing (cluster)

● Harnessing power of cheap memory

● Directed Acyclic Graph (DAG) computing engine

● It goes very fast!

● Apache Project (spark.apache.org)

What Is Apache Spark?! No, But Really…

Page 6: Introduction To Spark - Durham LUG 20150916

● Performance

● Developer productivity

Why Spark?

Page 7: Introduction To Spark - Durham LUG 20150916

● Graysort benchmark (100TB)

● Hadoop - 72 minutes / 2100 nodes / datacentre

● Spark - 23 minutes / 206 nodes / AWS

● HDFS versus Memory

Performance!

Page 8: Introduction To Spark - Durham LUG 20150916

● First-class support for Scala, Java, Python, and R!

● Data Science friendly

Developers!

Page 9: Introduction To Spark - Durham LUG 20150916

Word Count: Hadoop

Page 10: Introduction To Spark - Durham LUG 20150916

from pyspark import SparkContext

logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")

textFile = sc.textFile(logFile)

wordCounts = (textFile.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

wordCounts.saveAsTextFile("hdfs:///output")

Word Count: Spark
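To get a feel for what the pipeline above computes without a cluster, here is the same word count in plain Python (the sample lines are invented for illustration; no Spark required):

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to do is to be"]

# flatMap: split every line into words and flatten into one sequence
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey: pair each word with 1, then sum the 1s per word
word_counts = Counter(words)

print(word_counts["to"], word_counts["be"])  # prints: 4 3
```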

Page 11: Introduction To Spark - Durham LUG 20150916

● Spark Streaming

● GraphX (graph algorithms)

● MLlib (machine learning)

● Dataframes (data access)

Spark: Batteries Included

Page 12: Introduction To Spark - Durham LUG 20150916

● Analytics (batch / streaming)

● Machine Learning

● ETL (Extract - Transform - Load)

● …and many more!

Applications

Page 13: Introduction To Spark - Durham LUG 20150916

● RDD = Resilient Distributed Dataset

● Immutable, Fault-tolerant

● Operated on in parallel

● Can be created manually or from external sources

RDDs – The Building Block

Page 14: Introduction To Spark - Durham LUG 20150916

● Transformations

● Actions

● Transformations are lazy

● Actions evaluate any pending transformations in the pipeline and then perform the action itself

RDDs – The Building Block
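Spark's lazy evaluation has a rough local analogue in Python generators: building the pipeline does no work, and only consuming it (the "action") triggers evaluation. A small sketch, no Spark involved:

```python
evaluated = []

def double(x):
    evaluated.append(x)          # record that work actually happened
    return x * 2

# "Transformation": builds a lazy pipeline; nothing is computed yet
pipeline = (double(x) for x in range(5))
assert evaluated == []           # still lazy

# "Action": consuming the pipeline forces every pending step to run
result = sum(pipeline)
assert evaluated == [0, 1, 2, 3, 4]
assert result == 20              # 0 + 2 + 4 + 6 + 8
```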

Page 15: Introduction To Spark - Durham LUG 20150916

● map()

● filter()

● pipe()

● sample()

● …and more!

RDDs – Example Transformations

Page 16: Introduction To Spark - Durham LUG 20150916

● reduce()

● count()

● take()

● saveAsTextFile()

● …and yes, more

RDDs – Example Actions
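To make the transformation/action split concrete, here are local Python analogues of a few of the operations above, with plain lists standing in for RDDs (the lambdas are illustrative):

```python
from functools import reduce

data = [1, 2, 3, 4, 5]

# Transformations: element-wise, produce a new dataset
mapped = [x * 10 for x in data]           # like rdd.map(lambda x: x * 10)
odds = [x for x in data if x % 2 == 1]    # like rdd.filter(lambda x: x % 2 == 1)

# Actions: collapse the dataset into a concrete result
total = reduce(lambda a, b: a + b, data)  # like rdd.reduce(...)  -> 15
count = len(data)                         # like rdd.count()      -> 5
first_two = data[:2]                      # like rdd.take(2)      -> [1, 2]
```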

Page 17: Introduction To Spark - Durham LUG 20150916

from pyspark import SparkContext

logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")

textFile = sc.textFile(logFile)

wordCounts = (textFile.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

wordCounts.saveAsTextFile("hdfs:///output")

Word Count: Spark

Page 18: Introduction To Spark - Durham LUG 20150916

● cache() / persist()

● When an action is performed for the first time, keep the computed result in memory

● Different levels of persistence available

RDDs – cache()
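The effect of cache() can be approximated locally with memoization: the first evaluation does the real work, later ones reuse the stored result. A hypothetical sketch using Python's lru_cache:

```python
from functools import lru_cache

calls = []

@lru_cache(maxsize=None)
def expensive_square(x):
    calls.append(x)       # track how often the real computation runs
    return x * x

expensive_square(7)       # first "action": computes and stores the result
expensive_square(7)       # second time: served from memory, no recompute
assert calls == [7]       # the work ran exactly once
```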

Page 19: Introduction To Spark - Durham LUG 20150916

● Micro-batches (DStreams of RDDs)

● Access to other parts of Spark (MLlib, GraphX, Dataframes)

● Fault-tolerant

● Connectors to Kafka, Flume, Kinesis, ZeroMQ

● (we’ll come back to this)

Streaming
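Micro-batching can be sketched in a few lines: chop the incoming stream into fixed-size batches and run the same batch logic (word count here) over each one, keeping running totals. Purely illustrative, with no DStream API involved:

```python
from collections import Counter

stream = ["a b a", "b b", "a"]   # stand-in for lines arriving over time
batch_size = 2
totals = Counter()

for i in range(0, len(stream), batch_size):
    batch = stream[i:i + batch_size]        # one "micro-batch" of input
    batch_counts = Counter(w for line in batch for w in line.split())
    totals += batch_counts                  # stateful aggregation across batches

assert totals == Counter({"a": 3, "b": 3})
```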

Page 20: Introduction To Spark - Durham LUG 20150916

● Spark SQL

● Support for JSON, Cassandra, SQL databases, etc.

● Easier syntax than RDDs

● Dataframes ‘borrowed’ from Python/R

● Catalyst query planner

Dataframes

Page 21: Introduction To Spark - Durham LUG 20150916

val sc = new SparkContext()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("people.json")

df.show()

df.filter(df("age") >= 35).show()

df.groupBy("age").count().show()

Dataframes: Example
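What those filter and groupBy calls compute can be mirrored over a plain list of records in Python (the sample people below are invented, standing in for the contents of people.json):

```python
from collections import Counter

people = [
    {"name": "Michael", "age": 29},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 35},
]

# df.filter(df("age") >= 35): keep rows matching the predicate
over_35 = [p for p in people if p["age"] >= 35]

# df.groupBy("age").count(): number of rows per distinct age
count_by_age = Counter(p["age"] for p in people)

assert [p["name"] for p in over_35] == ["Justin"]
assert count_by_age[30] == 1
```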

Page 22: Introduction To Spark - Durham LUG 20150916

● Optimizing query planning for Spark

● Takes Dataframe operations and ‘compiles’ them down to RDD operations

● Often faster than writing RDD code manually

● Use Dataframes whenever possible (v1.4+)

Dataframes: Catalyst

Page 23: Introduction To Spark - Durham LUG 20150916

Dataframes: Catalyst

Page 24: Introduction To Spark - Durham LUG 20150916

● Standalone

● YARN (Hadoop ecosystem)

● Mesos (Hipster ecosystem)

Deploying Spark

Page 25: Introduction To Spark - Durham LUG 20150916

● Spark-Shell

● Zeppelin

Demos

Page 26: Introduction To Spark - Durham LUG 20150916

● Spark Streaming is not ‘pure’ streaming

● Very low-latency requirements? Use Storm instead

● Still immature in some ways

● Come to my All Things Open talk to learn more!

Spark for Everything?

Page 27: Introduction To Spark - Durham LUG 20150916

● http://www.meetup.com/Triangle-Apache-Spark-Meetup/

● Next meeting likely to be in late October

Triangle Apache Spark Meetup Group

Page 28: Introduction To Spark - Durham LUG 20150916

● spark.apache.org

● databricks.com

● zeppelin.incubator.apache.org

● mammothdata.com/white-papers/spark-a-modern-tool-for-big-data-applications

Links

Page 29: Introduction To Spark - Durham LUG 20150916

● Questions for you! (for a $15 Digital Ocean voucher)

1. What is an RDD?
2. What's the difference between a transformation and an action?
3. When wouldn't you use Spark Streaming?

Questions?