scrap your mapreduce - apache spark
Some properties of “Big Data”
• Big data is inherently immutable: it is not supposed to be updated once generated.
• Write operations are mostly coarse-grained.
• Commodity hardware makes more sense for storing and processing such enormous data, hence the data is distributed across a cluster of many such machines.
• The distributed nature makes the programming complicated.
Brush up on Hadoop concepts
Distributed Storage => HDFS
Cluster Manager => YARN
Fault tolerance => achieved via replication
Job scheduling => Scheduler in YARN
Mapper
Reducer
Combiner
HDFS architecture diagram: http://hadoop.apache.org/docs/r1.2.1/images/hdfsarchitecture.gif
MapReduce Programming Model
https://twitter.com/francesc/status/507942534388011008
http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop
http://www.slideshare.net/JimArgeropoulos/hadoop-101-32661121
MapReduce pain points
• Considerable latency
• Only Map and Reduce phases
• Non-trivial to test
• Results in complex workflows
• Not suitable for iterative processing
Immutability and the MapReduce model
• HDFS storage is immutable, or append-only.
• The MapReduce model fails to exploit the immutable nature of the data.
• Intermediate results are persisted, resulting in a huge amount of IO and causing a serious performance hit.
Wouldn’t it be very nice if we could have
• Low latency
• A programmer-friendly programming model
• A unified ecosystem
• Fault tolerance and other typical distributed-system properties
• Easily testable code
• And of course, open source :)
What is Apache Spark?
• A cluster computing engine
• Abstracts the storage and cluster management
• Unified interfaces to data
• APIs in Scala, Python, Java, R*
Where does it fit in the existing Big Data ecosystem?
http://www.kdnuggets.com/2014/06/yarn-all-rage-hadoop-summit.html
Why should you care about Apache Spark?
• Abstracts the underlying storage
• Abstracts cluster management
• Easy programming model
• Very easy to test the code
• Highly performant
• Petabyte sort record
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
• Offers in-memory caching of data
• Specialized applications:
• GraphX for graph processing
• Spark Streaming
• MLlib for machine learning
• Spark SQL
• Data exploration via spark-shell
Programming model
for
Apache Spark
Word Count example
// `spark` here is a SparkContext
val file = spark.textFile("input path")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
counts.saveAsTextFile("destination path")
Comparing the example with MapReduce
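For contrast, the Hadoop MapReduce version of the same word count needs a Mapper class, a Reducer class, and job wiring. A sketch of just the mapper and reducer against the classic `org.apache.hadoop.mapreduce` API (written in Scala to match the rest of this deck; it needs Hadoop on the classpath and is only an illustration of the verbosity):

```scala
import java.lang.{Iterable => JIterable}
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.collection.JavaConverters._

// Emits (word, 1) for every word in the input line.
class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: Object, value: Text,
                   ctx: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
    value.toString.split(" ").foreach { w =>
      word.set(w)
      ctx.write(word, one)
    }
}

// Sums the counts emitted for each word.
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: JIterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    ctx.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

// Plus a main() to configure and submit the Job — boilerplate the
// five-line Spark version does not need.
```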
Spark Shell Demo
• SparkContext
• RDD
• RDD operations
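A minimal spark-shell session covering these three items might look like the following (`sc` is the SparkContext the shell provides; commands only):

```scala
// Create an RDD from a local collection:
val nums = sc.parallelize(1 to 10)

// Transformations are recorded, not executed:
val evens   = nums.filter(_ % 2 == 0)
val squares = evens.map(x => x * x)

// Actions trigger execution:
squares.count()    // 5
squares.collect()  // Array(4, 16, 36, 64, 100)
```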
RDD
• RDD stands for Resilient Distributed Dataset.
• The basic abstraction in Spark
• The equivalent of distributed collections.
• The interface makes the distributed nature of the underlying data transparent.
• RDDs are immutable.
• Can be created via:
• parallelising a collection,
• transforming an existing RDD by applying a transformation function,
• reading from a persistent data store like HDFS.
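The three creation routes can be sketched in spark-shell style (the HDFS path is a placeholder):

```scala
// 1. Parallelising a local collection:
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))

// 2. Transforming an existing RDD (yields a new, immutable RDD):
val doubled = fromCollection.map(_ * 2)

// 3. Reading from a persistent store such as HDFS:
val fromHdfs = sc.textFile("hdfs:///path/to/input")
```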
RDDs are lazily evaluated
RDDs have two types of operations:
• Transformations
Build up a DAG of transformations to be applied to the RDD
Do not evaluate anything
• Actions
Evaluate the DAG of transformations
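The transformation/action split can be mimicked with plain Scala views, which are also lazy: `map` on a view only records the function, and nothing runs until a terminal operation (the analogue of an action) forces it. A rough, Spark-free sketch:

```scala
object LazyDemo {
  // Returns (evaluations before the "action", evaluations after, the result).
  def run(): (Int, Int, Int) = {
    var evaluated = 0
    // "Transformation": map on a view is recorded, not executed.
    val doubled = (1 to 5).view.map { x => evaluated += 1; x * 2 }
    val before = evaluated
    // "Action": sum forces the whole pipeline to run.
    val total = doubled.sum
    (before, evaluated, total)
  }

  def main(args: Array[String]): Unit =
    println(run()) // prints (0,5,30): nothing evaluated until sum
}
```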
RDD operations
Transformations
map(f : T ⇒ U) : RDD[T] ⇒ RDD[U]
filter(f : T ⇒ Bool) : RDD[T] ⇒ RDD[T]
flatMap(f : T ⇒ Seq[U]) : RDD[T] ⇒ RDD[U]
sample(fraction : Float) : RDD[T] ⇒ RDD[T] (Deterministic sampling)
union() : (RDD[T],RDD[T]) ⇒ RDD[T]
join() : (RDD[(K, V)],RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
groupByKey() : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
reduceByKey(f : (V,V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]
partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
Actions
count() : RDD[T] ⇒ Long
collect() : RDD[T] ⇒ Seq[T]
reduce(f : (T,T) ⇒ T) : RDD[T] ⇒ T
lookup(k : K) : RDD[(K, V)] ⇒ Seq[V] (On hash/range partitioned RDDs)
save(path : String) : Outputs RDD to a storage system, e.g., HDFS
Job Execution
Spark Execution in Context of YARN
http://kb.cnblogs.com/page/198414/
Fault tolerance via lineage
MappedRDD
FilteredRDD
FlatMappedRDD
MappedRDD
HadoopRDD
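Each RDD remembers the parent it was derived from, so a lost partition can be recomputed by replaying this chain instead of replicating the data. The lineage of the earlier word count can be inspected with `RDD.toDebugString` (spark-shell style; the exact output format varies by Spark version):

```scala
val counts = sc.textFile("input path")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the chain of parent RDDs, ending at the HadoopRDD
// that reads the input.
println(counts.toDebugString)
```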
Testing
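Because RDD transformations take ordinary functions, the core logic can be extracted and exercised against plain Scala collections, with no cluster needed. A sketch (the object and method names are mine):

```scala
object LocalWordCount {
  // Same flatMap/map/reduce shape as the RDD word count, but on a
  // local Seq, so it runs in a unit test without Spark.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    println(countWords(Seq("a b a", "b c"))) // a -> 2, b -> 2, c -> 1
}
```

The same function can then be passed to the real RDD pipeline in production code.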
Why is Spark more performant than MapReduce?
Reduced IO
• No disk IO between phases since phases themselves are pipelined
• No network IO involved unless a shuffle is required
No Mandatory Shuffle
• Programs are not bounded by map and reduce phases
• No mandatory shuffle-and-sort step required
In-memory caching of data
• Optional in-memory caching
• The DAG engine can apply certain optimisations because, when an action is called, it knows all the transformations that have to be applied
Questions?
Thank You!