
Page 1:

Using Apache Spark
PRESENTER: SNEHA CHALLA
VENUE: GOOGLE, PITTSBURGH
DATE: AUGUST 25TH, 2015

Page 2:

What is Apache Spark?
Spark is an open-source computation engine built on top of the popular Hadoop Distributed File System (HDFS): a fast and general cluster computing engine for large-scale data processing.

Efficient:
• In-memory computing
• DAG execution engine
• Up to 10× faster on disk, 100× faster in memory

Usable:
• 2-5× less code
• Rich APIs in Java, Scala, Python
• Interactive shell

Page 3:

Why Spark? Runs Everywhere
• Standalone mode (private cluster)
• Apache Mesos – cluster manager
• Hadoop YARN
• Amazon EC2 (prepared deployment)

Page 4:

Spark Community
Spark was initially developed by the AMPLab at UC Berkeley and is used and developed in a wide variety of companies.

One of the most active open-source projects in big data:
• More than 150 contributors in the past year
• 25+ companies contributing, and growing

Spark was designed both to make traditional MapReduce programming easier and to support new types of applications, with one of the earliest focus areas being machine learning. Spark can be used to build fast end-to-end machine learning workflows.

Page 5:

SPARK COMMUNITY

Page 6:

Elephant in the room – MapReduce

Page 7:

Why Apache Spark?

Page 8:

Spark vs. MapReduce
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

Page 9:

Why Spark?

Page 10:

Spark Installation

http://spark.apache.org/downloads.html

Page 11:

Spark Installation
• Extract the compressed folder spark-1.3.0-bin-hadoop2.4
• From a terminal, go to spark-1.3.0-bin-hadoop2.4/bin
• Run pyspark
• Run rdd = sc.parallelize([0, 1, 2]); rdd.map(lambda x: x*x).collect()
• Get the result [0, 1, 4]
• It's that easy!

Windows users might need to download and run an additional winutils.exe for applications to run smoothly.
Download winutils here: http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path and add it to $HADOOP_HOME/bin
• Download a bigger zip (1.9GB) from http://bit.ly/1FpZAXH

Page 12:

Interactive Shell & Spark Context

Interactive Shell:
The fastest way to learn Spark
Available in Python and Scala
Runs as an application on an existing Spark cluster, or can run locally

Spark Context:
Main entry point to Spark functionality
Available in the shell as the variable sc
In standalone programs, you make your own by creating a SparkContext object (see later for details)
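As a minimal sketch of that last point (the master URL and app name below are illustrative placeholders, not from the slides), a standalone PySpark program creates its own SparkContext like this:

# Minimal sketch: creating a SparkContext in a standalone program
# (the interactive shell creates `sc` for you automatically)
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("MyApp")   # placeholder master/app name
sc = SparkContext(conf=conf)

rdd = sc.parallelize([0, 1, 2])
print(rdd.map(lambda x: x * x).collect())   # [0, 1, 4]

sc.stop()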

Page 13:

Key Concepts – RDD Distributed Model

RDD – Resilient Distributed Dataset
Programs are written in terms of transformations on these distributed datasets.
• RDD = Resilient Distributed Dataset
• Transformations convert one RDD into another
• Transformations perform no actual calculation
• Actions force calculation of a result
• Lazy evaluation

Page 14:

RDDs – Motivation
RDDs are motivated by two types of applications that current data-flow systems handle inefficiently:

1) Iterative algorithms – common in graph applications and machine learning
2) Interactive data mining tools

To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.

However, RDDs are expressive enough to capture a wide class of computations, including MapReduce and specialized computations.

Page 15:

Spark Architecture
Basic abstraction: RDDs
• Immutability
• Laziness
• Transformations

Page 16:

Programming Model
Two types of operations on an RDD:
• transformations
• actions
Transformations are lazily evaluated – they are not executed when you issue the command. RDDs are (re)computed when an action is executed.
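For instance, a quick sketch of lazy evaluation (assuming the PySpark shell's sc):

# Sketch of lazy evaluation
rdd = sc.parallelize(range(10))
squared = rdd.map(lambda x: x * x)              # transformation: nothing executes yet
evens = squared.filter(lambda x: x % 2 == 0)    # still nothing executes
print(evens.collect())                          # action: the whole pipeline runs now -> [0, 4, 16, 36, 64]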

Page 17:

Data distribution / partitioning
An RDD is a read-only collection of objects that is partitioned across a set of machines.

Page 18:

RDD Computational Model

• Operators on RDDs form a directed acyclic graph (DAG).
• If any partition on a dead worker is lost, it can be recomputed by retracing the operator DAG.
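As a quick illustration (assuming the PySpark shell's sc), an RDD can print its lineage, which is the chain Spark retraces to recompute lost partitions:

# Sketch: inspecting an RDD's lineage (the operator DAG used for recovery)
rdd = sc.parallelize(range(100), 4)
result = rdd.map(lambda x: x * 2).filter(lambda x: x > 50)
print(result.toDebugString())   # shows the chain of parent RDDs this RDD was built from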

Page 19:

FIRST STEP IN DATA ANALYSIS: Create an RDD
Read data from a text file on the local machine, S3 or HDFS into an RDD.
Give life to an RDD using the Spark Context.

# Turn a Python collection into an RDD
>>> sc.parallelize([7, 8, 9])

# Load a text file from the local FS, HDFS, or S3
>>> sc.textFile("textfile.txt")
>>> sc.textFile("directory/*.txt")
>>> sc.textFile("hdfs://namenode:9000/path/file")

Page 20:

Transformations – RDD: map
Pass each element of an RDD through a function.

>>> rdd = sc.parallelize(range(1, 8))
>>> result_rdd = rdd.map(lambda x: x % 3)
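Collecting the result (collect() is an action, covered shortly) would show the mapped values:

>>> result_rdd.collect()
[1, 2, 0, 1, 2, 0, 1]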

Page 21:

Transformations – RDD: reduce
(Note: reduce is really an action – it returns a value rather than a new RDD.)

>>> from operator import add
>>> rdd = sc.parallelize(range(1, 8))
>>> result = rdd.reduce(add)   # 1 + 2 + ... + 7 = 28

Page 22:

RDD Transformations: reduceByKey

>>> from operator import add
>>> rdd = sc.parallelize([('Alice', 23), ('Bob', 17), ('Alice', 27)])
>>> result_rdd = rdd.reduceByKey(add)
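Collecting this result would sum the values per key (output order may vary):

>>> result_rdd.collect()
[('Alice', 50), ('Bob', 17)]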

Page 23:

Some more RDD Transformations

rdd.flatMap(f): Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())   # collect() is the action applied on the transformation
[1, 1, 1, 2, 2, 3]

rdd.filter(f): Return a new RDD containing only the elements that satisfy a predicate.
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]

Page 24:

Some more RDD Transformations

sortBy(self, keyfunc, ascending=True, numPartitions=None): Sorts this RDD by the given keyfunc.
>>> tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>>> rdd = sc.parallelize(tmp).sortBy(lambda x: x[0])

rdd.cache(): Cache the RDD in memory for repeated use.

countByKey(self): Count the number of elements for each key.
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> rdd.countByKey().items()

join(self, other, numPartitions=None): Return an RDD containing all pairs of elements with matching keys in self and other.
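A short sketch of join (assuming the PySpark shell's sc; the small RDDs here are illustrative):

>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2), ("a", 3)])
>>> sorted(x.join(y).collect())
[('a', (1, 2)), ('a', (1, 3))]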

Page 25:

Setting the level of parallelism
All the pair-RDD operations take an optional second parameter for the number of tasks.

> rdd.reduceByKey(lambda x, y: x + y, 5)
> rdd.groupByKey(5)
> rdd.join(pageViews, 5)
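To confirm the resulting partitioning, a quick sketch using the standard getNumPartitions() method (the example data here is illustrative):

>>> rdd = sc.parallelize(range(100))
>>> rdd.getNumPartitions()   # defaults to the local/cluster parallelism
>>> rdd.map(lambda x: (x % 3, 1)).reduceByKey(lambda a, b: a + b, 5).getNumPartitions()
5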

Page 26:

Some RDD Actions
RDD transformations are lazily evaluated. Actions kick off the computation on the transformations, e.g. collect(), glom(), etc.

rdd.collect(): Return the RDD contents as a list.
rdd = sc.parallelize([1, 2, 3], 3)
rdd2 = rdd.map(lambda x: x * x)
rdd2.collect(): [1, 4, 9]

rdd.glom(): Return an RDD created by coalescing all elements within each partition into a list.
rdd = sc.parallelize([0, 1, 2], 3)
rdd2 = rdd.map(lambda x: x * x)
rdd2.glom().collect(): [[0], [1], [4]]

saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system.
take(n): Return an array with the first n elements of the dataset.
first(): Return the first element of the dataset. Similar to take(1).
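For completeness, a small sketch of take() and first() (assuming the shell's sc; the sample data is illustrative):

>>> rdd = sc.parallelize([10, 20, 30, 40])
>>> rdd.take(2)
[10, 20]
>>> rdd.first()
10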

Page 27:

In MapReduce you get only two operators – map and reduce – whereas Spark offers 80+ operations!

Workflows are automatically parallelized on Spark. In Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so that the system has a complete picture of the execution graph.

Page 28:

Word Count

from pyspark import SparkContext

logFile = "hdfs://localhost:9000/user/bigdatavm/input"

sc = SparkContext("spark://bigdata-vm:7077", "WordCount")

textFile = sc.textFile(logFile)

wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)

wordCounts.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")

Page 29:

Fault tolerant – Persistent
RDDs track lineage information that can be used to efficiently recompute lost data.

msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

Spark will persist or cache RDD slices in memory on each node during operations. You can mark an RDD to be persisted with the cache() method, or with persist() together with a storage level.
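A minimal sketch of caching vs. persisting with an explicit storage level (the file path here is a placeholder, and `sc` is assumed from the shell):

from pyspark import StorageLevel

errors = sc.textFile("logs.txt").filter(lambda s: s.startswith("ERROR"))   # "logs.txt" is a placeholder path
errors.cache()                                    # persist with the default in-memory storage level
# errors.persist(StorageLevel.MEMORY_AND_DISK)    # or pick an explicit storage level instead
errors.count()                                    # first action computes and caches the RDD
errors.count()                                    # later actions reuse the cached data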

Page 30:

Spark Libraries

Page 31:

MLlib
Spark subproject providing machine learning primitives. Initial contribution from the AMPLab @ UC Berkeley.

Shipped with Spark since version 0.8
35 contributors

Highlights include:
• Basic statistics – summary statistics, correlation and stratified sampling, hypothesis testing, random data generation
• Linear models (linear regression, logistic regression, SVMs)
• Naive Bayes and decision tree classifiers, ensembles of trees
• Collaborative filtering with ALS
• K-Means clustering and Gaussian mixtures
• Stochastic gradient descent
• SVD (singular value decomposition) and PCA

MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as underlying optimization primitives.

Page 32:

Running a Spark Application
Command: spark-submit <python_file_path>

Let's see the implementation of:
1) K-Means
2) Logistic Regression

Page 33:

K-Means

# Import the required pyspark functions
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt
from pyspark import SparkContext

sc = SparkContext()

data = sc.textFile("C:\Users\snehachalla\Downloads\spark-1.4.1-bin-hadoop2.4\spark-1.4.1-bin-hadoop2.4\bin\kmeans_data.txt")

parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])).cache()

Page 34:

K-Means (cont.)

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=1, initializationMode="k-means||")

# Evaluate the clustering by computing the sum of squared errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

cost = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))

Page 35:

Logistic Regression

from pyspark.mllib.classification import LogisticRegressionWithSGD
from numpy import array

# Load and parse the data (assumes `sc` is already created, as in the K-Means example)
data = sc.textFile("mllib/data/sample_svm_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model
model = LogisticRegressionWithSGD.train(parsedData)

labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)),
                                               model.predict(point.take(range(1, point.size)))))

# Evaluate the model on training data
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

Page 36:

Spark UI
Run Spark in local mode (pyspark).
The Spark UI is at http://localhost:4040.
There you can see RDD sizes and identify slow-running tasks.

Page 37:

MOVING TO A CLUSTER – EC2

Page 38:

Setting up an EMR Cluster
If your data is too large to compute on your local machine, then you're in the right place. An easy way to get Spark running is with EC2.

Create an account on aws.amazon.com

Get a keypair from the AWS console (this is the security credential for your instance):
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#KeyPairs:sort=keyName

Create an EMR instance and configure the nodes:
https://console.aws.amazon.com/console/home?region=us-east-1#

Launch the EMR instance:
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1

Page 39:

EMR Cluster
https://console.aws.amazon.com/s3/home?region=us-east-1
Data can be uploaded to a bucket on Amazon S3.

Page 40:

For more info on Spark
Website: http://spark.apache.org
Tutorials: http://ampcamp.berkeley.edu
Spark Summit: http://spark-summit.org
Github: https://github.com/apache/spark
Mailing lists: [email protected], [email protected]
Python API documentation: http://spark.apache.org/docs/latest/api/python/

Page 41:

Questions?

THANK YOU

Page 42:

References
http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf
http://www.andrew.cmu.edu/user/amaurya/docs/spark_talk/presentation.pdf
http://www.slideshare.net/BenjaminBengfort/fast-data-analytics-with-spark-and-python