spark rev1
TRANSCRIPT
7/23/2019 Spark Rev1
http://slidepdf.com/reader/full/spark-rev1 1/7
Presentation on Apache Spark
By -
B V S Mridula (1039060)
Milind Baluni (100319)
Ganesh ()
Dilip Payra ()
Sameer Nayak ()
Introduction to Apache Spark

● It is a framework for performing general data analytics on a distributed computing cluster like Hadoop.
● It provides in-memory computation for increased speed and data processing over MapReduce.
● It runs on top of an existing Hadoop cluster and accesses the Hadoop data store (HDFS).
● It can also process structured data in Hive and streaming data from HDFS, Flume, Kafka and Twitter.
High-Productivity Language Support

● Native support for multiple languages with identical APIs
● Use of closures, iterations, and other common language constructs to minimize code
● Unified API for batch and streaming
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
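All three snippets compute the same thing: the number of lines containing "ERROR". As a plain-Python sketch of what filter(...).count() evaluates to (no Spark cluster assumed; the sample log lines are invented for illustration):

```python
# Plain-Python analogue of lines.filter(lambda s: "ERROR" in s).count().
# The sample log lines below are made up for illustration.
lines = [
    "INFO  starting job",
    "ERROR disk full",
    "WARN  slow response",
    "ERROR connection lost",
]

# filter(predicate) keeps matching elements; count() tallies them.
error_count = sum(1 for s in lines if "ERROR" in s)
print(error_count)  # 2
```

In real Spark the predicate is shipped to the cluster and applied to each partition in parallel, but the result is the same as this sequential version.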
Why Apache Spark is used

● Apache Spark is best for performing iterative jobs like machine learning.
● Spark maintains an in-memory copy of MapReduce job data using RDDs (Resilient Distributed Datasets).
● With this, once the job's data is loaded into the in-memory copy, the need to load it again and again is reduced, giving a tremendous increase in speed.
● The in-memory copy holds the frequently used job data within itself.
● Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
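The caching benefit described above can be mimicked in plain Python (no Spark involved; this is only a conceptual sketch of what RDD caching buys you): an expensive computation is materialized once in memory and reused, instead of being redone for every action.

```python
compute_calls = 0

def expensive_transform(data):
    """Stand-in for a MapReduce-style job over the input."""
    global compute_calls
    compute_calls += 1
    return [x * x for x in data]

data = list(range(5))

# Without caching: the job is re-run for every action over its result.
expensive_transform(data)
expensive_transform(data)
assert compute_calls == 2

# With an in-memory "cache" (conceptually what RDD caching provides):
cached = expensive_transform(data)   # computed once, kept in memory
total = sum(cached)                  # reuse, no recomputation
maximum = max(cached)                # reuse again, no recomputation
print(total, maximum)  # 30 16
```

In Spark itself, marking an RDD as cached tells the scheduler to keep the computed partitions in executor memory so later actions skip the recomputation.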
Spark libraries

Spark SQL:
It is Spark's module for working with structured data.

Spark Streaming:
It makes it easy to build scalable, fault-tolerant streaming applications.

MLlib:
It is Apache Spark's scalable machine learning library.

GraphX:
It is Apache Spark's API for graphs and graph-parallel computation.
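As a rough illustration of the kind of query Spark SQL answers over structured data, here is the same filter expressed over plain Python dicts (no Spark; the rows and the people/age schema are invented for the example):

```python
# Rows such as a Spark SQL table might hold; values are invented.
rows = [
    {"name": "alice", "age": 34},
    {"name": "bob",   "age": 19},
    {"name": "carol", "age": 45},
]

# Plain-Python equivalent of: SELECT name FROM people WHERE age > 21
names = [r["name"] for r in rows if r["age"] > 21]
print(names)  # ['alice', 'carol']
```

Spark SQL runs this kind of query distributed across the cluster, over data sources such as Hive tables, Parquet files, or JSON.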
Is Apache Spark going to replace Hadoop?

● Hadoop essentially consists of a MapReduce phase and a file system (HDFS), whereas Spark is a framework that executes jobs, so practically Spark can only replace the MapReduce phase in the Hadoop ecosystem.
● Spark was mainly designed to run on top of Hadoop so as to minimize job execution time.
● Spark is an alternative to the traditional MapReduce model that used to work only in batch mode. Spark supports both batch as well as real-time processing.
● Spark mainly utilizes the primary memory of the system to provide efficient output. Thus, it requires a high-end machine to execute jobs. Hadoop, on the other hand, can easily run on commodity hardware.
● Spark's way of handling fault tolerance is very fast compared to Hadoop's: it minimizes network I/O and still guarantees fault tolerance.
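The fast, low-network-I/O fault tolerance mentioned above comes from RDD lineage: rather than replicating data over the network, Spark records the chain of transformations that produced each partition and replays it from the source on failure. A toy sketch of the idea in plain Python (no Spark; the transformations are invented):

```python
# A lost partition is rebuilt by replaying recorded transformations
# (the lineage) over the source data, instead of restoring a replica.
source = list(range(10))
lineage = [
    lambda xs: [x + 1 for x in xs],        # a map step
    lambda xs: [x for x in xs if x % 2],   # a filter step (keep odd)
]

def rebuild(source_data, transforms):
    """Replay the lineage over the source to reproduce the partition."""
    data = source_data
    for t in transforms:
        data = t(data)
    return data

partition = rebuild(source, lineage)   # original computation
partition = None                       # simulate losing the partition
partition = rebuild(source, lineage)   # recovered by replaying lineage
print(partition)  # [1, 3, 5, 7, 9]
```

Because only the (small) lineage metadata needs to be tracked, recovery avoids the network traffic that block replication would require.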