
Page 1:

Using Apache Spark
PRESENTER: SNEHA CHALLA
VENUE: GOOGLE, PITTSBURGH
DATE: AUGUST 25TH, 2015

Page 2:

What is Apache Spark?
Spark is an open-source computation engine built on top of the popular Hadoop Distributed File System (HDFS): a fast and general cluster computing engine for large-scale data processing.

Efficient:
• In-memory computing
• DAG execution engine
• Up to 10× faster on disk, 100× faster in memory

Usable:
• 2-5× less code
• Rich APIs in Java, Scala, Python
• Interactive shell

Page 3:

Why Spark? Runs Everywhere
• Standalone mode (private cluster)
• Apache Mesos – cluster manager
• Hadoop YARN
• Amazon EC2 (prepared deployment)

Page 4:

Spark Community
Spark was initially developed by the AMPLab at UC Berkeley and is used and developed in a wide variety of companies.

One of the most active open-source projects in big data:
• More than 150 contributors in the past year
• 25+ companies contributing, and growing

Spark was designed both to make traditional MapReduce programming easier and to support new types of applications, with one of the earliest focus areas being machine learning. Spark can be used to build fast end-to-end machine learning workflows.

Page 5:

SPARK COMMUNITY

Page 6:

Elephant in the room – MapReduce

Page 7:

Why Apache Spark?

Page 8:

Spark vs. MapReduce
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

Page 9:

Why Spark?

Page 10:

Spark Installation

http://spark.apache.org/downloads.html

Page 11:

Spark Installation
• Extract the compressed folder spark-1.3.0-bin-hadoop2.4
• From a terminal, go to spark-1.3.0-bin-hadoop2.4/bin
• Run pyspark
• Run rdd = sc.parallelize([0, 1, 2]); rdd.map(lambda x: x*x).collect()
• Get the result [0, 1, 4]
• It's that easy!

Windows users might need to download and run an additional winutils.exe for applications to run smoothly.
Download winutils here: http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path and add it to $HADOOP_HOME/bin
• Download a bigger zip (1.9GB) from http://bit.ly/1FpZAXH

Page 12:

Interactive Shell & Spark Context

Interactive Shell:
The fastest way to learn Spark
Available in Python and Scala
Runs as an application on an existing Spark cluster, or can run locally

Spark Context:
Main entry point to Spark functionality
Available in the shell as the variable sc
In standalone programs, you make your own by creating a SparkContext object (see later for details)
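As a minimal sketch of that last point (the master URL and app name below are illustrative placeholders, not from the slides), a standalone PySpark program creates its own SparkContext like this:

# Minimal sketch: creating a SparkContext in a standalone program
# (the interactive shell creates `sc` for you automatically)
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("MyApp")   # placeholder master/app name
sc = SparkContext(conf=conf)

rdd = sc.parallelize([0, 1, 2])
print(rdd.map(lambda x: x * x).collect())   # [0, 1, 4]

sc.stop()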

Page 13:

Key Concepts – RDD Distributed Model

RDD – Resilient Distributed Dataset
Programs are written in terms of transformations on these distributed datasets.
• RDD = Resilient Distributed Dataset
• Transformations convert one RDD into another
• Transformations perform no actual calculation
• Actions force calculation of a result
• Lazy evaluation

Page 14:

RDDs – Motivation
RDDs are motivated by two types of applications that current data-flow systems handle inefficiently:

1) Iterative algorithms – common in graph applications and machine learning
2) Interactive data mining tools

To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.

However, RDDs are expressive enough to capture a wide class of computations, including MapReduce and specialized computations.

Page 15:

Spark Architecture
Basic abstraction: RDDs
• Immutability
• Laziness
• Transformations

Page 16:

Programming Model
Two types of operations on an RDD:
• transformations
• actions
Transformations are lazily evaluated – they are not executed when you issue the command. RDDs are (re)computed when an action is executed.
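For instance, a quick sketch of lazy evaluation (assuming the PySpark shell's sc):

# Sketch of lazy evaluation
rdd = sc.parallelize(range(10))
squared = rdd.map(lambda x: x * x)              # transformation: nothing executes yet
evens = squared.filter(lambda x: x % 2 == 0)    # still nothing executes
print(evens.collect())                          # action: the whole pipeline runs now -> [0, 4, 16, 36, 64]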

Page 17:

Data distribution / partitioning
An RDD is a read-only collection of objects that is partitioned across a set of machines.

Page 18:

RDD Computational Model

• Operators on RDDs form a directed acyclic graph (DAG).
• If any partition on a dead worker is lost, it can be recomputed by retracing the operator DAG.
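As a quick illustration (assuming the PySpark shell's sc), an RDD can print its lineage, which is the chain Spark retraces to recompute lost partitions:

# Sketch: inspecting an RDD's lineage (the operator DAG used for recovery)
rdd = sc.parallelize(range(100), 4)
result = rdd.map(lambda x: x * 2).filter(lambda x: x > 50)
print(result.toDebugString())   # shows the chain of parent RDDs this RDD was built from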

Page 19:

FIRST STEP IN DATA ANALYSIS: Create an RDD
Read data from a text file on the local machine, S3 or HDFS into an RDD.
Give life to an RDD using the Spark Context.

# Turn a Python collection into an RDD
>>> sc.parallelize([7, 8, 9])

# Load a text file from the local FS, HDFS, or S3
>>> sc.textFile("textfile.txt")
>>> sc.textFile("directory/*.txt")
>>> sc.textFile("hdfs://namenode:9000/path/file")

Page 20:

Transformations – RDD: map
Pass each element of an RDD through a function.

>>> rdd = sc.parallelize(range(1, 8))
>>> result_rdd = rdd.map(lambda x: x % 3)
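Collecting the result (collect() is an action, covered shortly) would show the mapped values:

>>> result_rdd.collect()
[1, 2, 0, 1, 2, 0, 1]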

Page 21:

Transformations – RDD: reduce
(Note: reduce is really an action – it returns a value rather than a new RDD.)

>>> from operator import add
>>> rdd = sc.parallelize(range(1, 8))
>>> result = rdd.reduce(add)   # 1 + 2 + ... + 7 = 28

Page 22:

RDD Transformations: reduceByKey

>>> from operator import add
>>> rdd = sc.parallelize([('Alice', 23), ('Bob', 17), ('Alice', 27)])
>>> result_rdd = rdd.reduceByKey(add)
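Collecting this result would sum the values per key (output order may vary):

>>> result_rdd.collect()
[('Alice', 50), ('Bob', 17)]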

Page 23:

Some more RDD Transformations

rdd.flatMap(f): Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())   # collect() is the action applied on the transformation
[1, 1, 1, 2, 2, 3]

rdd.filter(f): Return a new RDD containing only the elements that satisfy a predicate.
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]

Page 24:

Some more RDD Transformations

sortBy(self, keyfunc, ascending=True, numPartitions=None): Sorts this RDD by the given keyfunc.
>>> tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>>> rdd = sc.parallelize(tmp).sortBy(lambda x: x[0])

rdd.cache(): Cache the RDD in memory for repeated use.

countByKey(self): Count the number of elements for each key.
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> rdd.countByKey().items()

join(self, other, numPartitions=None): Return an RDD containing all pairs of elements with matching keys in self and other.
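A short sketch of join (assuming the PySpark shell's sc; the small RDDs here are illustrative):

>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2), ("a", 3)])
>>> sorted(x.join(y).collect())
[('a', (1, 2)), ('a', (1, 3))]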

Page 25:

Setting the level of parallelism
All the pair-RDD operations take an optional second parameter for the number of tasks.

> rdd.reduceByKey(lambda x, y: x + y, 5)
> rdd.groupByKey(5)
> rdd.join(pageViews, 5)
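To confirm the resulting partitioning, a quick sketch using the standard getNumPartitions() method (the example data here is illustrative):

>>> rdd = sc.parallelize(range(100))
>>> rdd.getNumPartitions()   # defaults to the local/cluster parallelism
>>> rdd.map(lambda x: (x % 3, 1)).reduceByKey(lambda a, b: a + b, 5).getNumPartitions()
5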

Page 26:

Some RDD Actions
RDD transformations are lazily evaluated. Actions kick off the computation on the transformations, e.g. collect(), glom(), etc.

rdd.collect(): Return the RDD contents as a list.
rdd = sc.parallelize([1, 2, 3], 3)
rdd2 = rdd.map(lambda x: x * x)
rdd2.collect(): [1, 4, 9]

rdd.glom(): Return an RDD created by coalescing all elements within each partition into a list.
rdd = sc.parallelize([0, 1, 2], 3)
rdd2 = rdd.map(lambda x: x * x)
rdd2.glom().collect(): [[0], [1], [4]]

saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system.
take(n): Return an array with the first n elements of the dataset.
first(): Return the first element of the dataset. Similar to take(1).
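For completeness, a small sketch of take() and first() (assuming the shell's sc; the sample data is illustrative):

>>> rdd = sc.parallelize([10, 20, 30, 40])
>>> rdd.take(2)
[10, 20]
>>> rdd.first()
10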

Page 27:

In MapReduce you get only two operators – map and reduce – whereas Spark offers 80+ operations!

Workflows are automatically parallelized on Spark. In Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so that the system has a complete picture of the execution graph.

Page 28:

Word Count

from pyspark import SparkContext

logFile = "hdfs://localhost:9000/user/bigdatavm/input"

sc = SparkContext("spark://bigdata-vm:7077", "WordCount")

textFile = sc.textFile(logFile)

wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)

wordCounts.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")

Page 29:

Fault tolerant – Persistent
RDDs track lineage information that can be used to efficiently recompute lost data.

msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

Spark will persist or cache RDD slices in memory on each node during operations. You can mark an RDD to be persisted with the cache() method, or with persist() together with a storage level.
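A minimal sketch of caching vs. persisting with an explicit storage level (the file path here is a placeholder, and `sc` is assumed from the shell):

from pyspark import StorageLevel

errors = sc.textFile("logs.txt").filter(lambda s: s.startswith("ERROR"))   # "logs.txt" is a placeholder path
errors.cache()                                    # persist with the default in-memory storage level
# errors.persist(StorageLevel.MEMORY_AND_DISK)    # or pick an explicit storage level instead
errors.count()                                    # first action computes and caches the RDD
errors.count()                                    # later actions reuse the cached data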

Page 30:

Spark Libraries

Page 31:

MLlib
Spark subproject providing machine learning primitives. Initial contribution from the AMPLab @ UC Berkeley.

Shipped with Spark since version 0.8
35 contributors

Highlights include:
• Basic statistics – summary statistics, correlation and stratified sampling, hypothesis testing, random data generation
• Linear models (linear regression, logistic regression, SVMs)
• Naive Bayes and decision tree classifiers, ensembles of trees
• Collaborative filtering with ALS
• K-Means clustering and Gaussian mixtures
• Stochastic gradient descent
• SVD (singular value decomposition) and PCA

MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as underlying optimization primitives.

Page 32:

Running a Spark Application
Command: spark-submit <python_file_path>

Let's see the implementation of:
1) K-Means
2) Logistic Regression

Page 33:

K-Means

# Import the required pyspark functions
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt
from pyspark import SparkContext

sc = SparkContext()

data = sc.textFile("C:\Users\snehachalla\Downloads\spark-1.4.1-bin-hadoop2.4\spark-1.4.1-bin-hadoop2.4\bin\kmeans_data.txt")

parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])).cache()

Page 34:

K-Means (cont.)

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=1, initializationMode="k-means||")

# Evaluate the clustering by computing the sum of squared errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

cost = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))

Page 35:

Logistic Regression

from pyspark.mllib.classification import LogisticRegressionWithSGD
from numpy import array

# Load and parse the data (assumes `sc` is already created, as in the K-Means example)
data = sc.textFile("mllib/data/sample_svm_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model
model = LogisticRegressionWithSGD.train(parsedData)

labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)),
                                               model.predict(point.take(range(1, point.size)))))

# Evaluate the model on training data
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

Page 36:

Spark UI
Run Spark in local mode (pyspark).
The Spark UI is at http://localhost:4040.
There you can see RDD sizes and identify slow-running tasks.

Page 37:

MOVING TO A CLUSTER – EC2

Page 38:

Setting up an EMR Cluster
If your data is too large to compute on your local machine, then you're in the right place. An easy way to get Spark running is with EC2.

Create an account on aws.amazon.com

Get a keypair from the AWS console (this is the security credential for your instance):
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#KeyPairs:sort=keyName

Create an EMR instance and configure the nodes:
https://console.aws.amazon.com/console/home?region=us-east-1#

Launch the EMR instance:
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1

Page 39:

EMR Cluster
https://console.aws.amazon.com/s3/home?region=us-east-1
Data can be uploaded to a bucket on Amazon S3.

Page 40:

For more info on Spark
Website: http://spark.apache.org
Tutorials: http://ampcamp.berkeley.edu
Spark Summit: http://spark-summit.org
Github: https://github.com/apache/spark
Mailing lists: [email protected], [email protected]
Python API documentation: http://spark.apache.org/docs/latest/api/python/

Page 41:

Questions?

THANK YOU

Page 42:

References
http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf
http://www.andrew.cmu.edu/user/amaurya/docs/spark_talk/presentation.pdf
http://www.slideshare.net/BenjaminBengfort/fast-data-analytics-with-spark-and-python