spark rev1
TRANSCRIPT
7/23/2019 Spark Rev1
http://slidepdf.com/reader/full/spark-rev1 1/7
Presentation on Apache Spark
By -
B V S Mridula (1039060)
Milind Baluni (100319)
Ganesh ()
Dilip Payra ()
Sameer Nayak ()
Introduction to Apache Spark

● It is a framework for performing general data analytics on a distributed computing cluster like Hadoop.
● It provides in-memory computation for increased speed and data processing over MapReduce.
● It runs on top of an existing Hadoop cluster and accesses the Hadoop data store (HDFS).
● It can also process structured data in Hive and streaming data from HDFS, Flume, Kafka and Twitter.
High-Productivity Language Support

● Native support for multiple languages with identical APIs
● Use of closures, iterations, and other common language constructs to minimize code
● Unified API for batch and streaming
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
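All three snippets compute the same thing: the number of lines containing "ERROR". As a plain-Python sketch of what filter(...).count() evaluates to (no Spark cluster assumed; the sample log lines are invented for illustration):

```python
# Plain-Python analogue of lines.filter(lambda s: "ERROR" in s).count().
# The sample log lines below are made up for illustration.
lines = [
    "INFO  starting job",
    "ERROR disk full",
    "WARN  slow response",
    "ERROR connection lost",
]

# filter(predicate) keeps matching elements; count() tallies them.
error_count = sum(1 for s in lines if "ERROR" in s)
print(error_count)  # 2
```

In real Spark the predicate is shipped to the cluster and applied to each partition in parallel, but the result is the same as this sequential version.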
Why Apache Spark is used

● Apache Spark is best for performing iterative jobs like machine learning.
● Spark maintains an in-memory copy of MapReduce job data using RDDs (Resilient Distributed Datasets).
● With this, once the job's data is loaded into the in-memory copy, the need to load it again and again is reduced, giving a tremendous increase in speed.
● The in-memory copy holds the frequently used job data within itself.
● Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
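The caching benefit described above can be mimicked in plain Python (no Spark involved; this is only a conceptual sketch of what RDD caching buys you): an expensive computation is materialized once in memory and reused, instead of being redone for every action.

```python
compute_calls = 0

def expensive_transform(data):
    """Stand-in for a MapReduce-style job over the input."""
    global compute_calls
    compute_calls += 1
    return [x * x for x in data]

data = list(range(5))

# Without caching: the job is re-run for every action over its result.
expensive_transform(data)
expensive_transform(data)
assert compute_calls == 2

# With an in-memory "cache" (conceptually what RDD caching provides):
cached = expensive_transform(data)   # computed once, kept in memory
total = sum(cached)                  # reuse, no recomputation
maximum = max(cached)                # reuse again, no recomputation
print(total, maximum)  # 30 16
```

In Spark itself, marking an RDD as cached tells the scheduler to keep the computed partitions in executor memory so later actions skip the recomputation.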
Spark libraries

Spark SQL:
It is Spark's module for working with structured data.

Spark Streaming:
It makes it easy to build scalable, fault-tolerant streaming applications.

MLlib:
It is Apache Spark's scalable machine learning library.

GraphX:
It is Apache Spark's API for graphs and graph-parallel computation.
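As a rough illustration of the kind of query Spark SQL answers over structured data, here is the same filter expressed over plain Python dicts (no Spark; the rows and the people/age schema are invented for the example):

```python
# Rows such as a Spark SQL table might hold; values are invented.
rows = [
    {"name": "alice", "age": 34},
    {"name": "bob",   "age": 19},
    {"name": "carol", "age": 45},
]

# Plain-Python equivalent of: SELECT name FROM people WHERE age > 21
names = [r["name"] for r in rows if r["age"] > 21]
print(names)  # ['alice', 'carol']
```

Spark SQL runs this kind of query distributed across the cluster, over data sources such as Hive tables, Parquet files, or JSON.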
Is Apache Spark going to replace Hadoop?

● Hadoop essentially consists of a MapReduce phase and a file system (HDFS), whereas Spark is a framework that executes jobs, so practically Spark can only replace the MapReduce phase in the Hadoop ecosystem.
● Spark was mainly designed to run on top of Hadoop so as to minimize job execution time.
● Spark is an alternative to the traditional MapReduce model that used to work only in batch mode. Spark supports both batch as well as real-time processing.
● Spark mainly utilizes the primary memory of the system to provide efficient output. Thus, it requires a high-end machine to execute jobs. Hadoop, on the other hand, can easily run on commodity hardware.
● Spark's way of handling fault tolerance is very fast compared to Hadoop's: it minimizes network I/O and still guarantees fault tolerance.
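The fast, low-network-I/O fault tolerance mentioned above comes from RDD lineage: rather than replicating data over the network, Spark records the chain of transformations that produced each partition and replays it from the source on failure. A toy sketch of the idea in plain Python (no Spark; the transformations are invented):

```python
# A lost partition is rebuilt by replaying recorded transformations
# (the lineage) over the source data, instead of restoring a replica.
source = list(range(10))
lineage = [
    lambda xs: [x + 1 for x in xs],        # a map step
    lambda xs: [x for x in xs if x % 2],   # a filter step (keep odd)
]

def rebuild(source_data, transforms):
    """Replay the lineage over the source to reproduce the partition."""
    data = source_data
    for t in transforms:
        data = t(data)
    return data

partition = rebuild(source, lineage)   # original computation
partition = None                       # simulate losing the partition
partition = rebuild(source, lineage)   # recovered by replaying lineage
print(partition)  # [1, 3, 5, 7, 9]
```

Because only the (small) lineage metadata needs to be tracked, recovery avoids the network traffic that block replication would require.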