Introduction to Spark
DESCRIPTION

Introduction to Spark for the Boulder / Denver Spark meetup

TRANSCRIPT
Intro to Spark
Dave Smelker
What is Spark?
• In-memory map/reduce engine
• Developed in 2009 by the Berkeley AMPLab
• Became an Apache project in 2013
• Written in Scala
• Scala, Java, and Python APIs
Most Active Big Data Project within Apache
Data from Spark-Summit 2014
The Spark stack (diagram):
• Processing: Spark core, Spark Streaming, Spark SQL, MLBase, GraphX
• Storage and deployment: Standalone, HDFS, Tachyon, Cassandra, Cloud Services, RDBMS
Spark vs. Hadoop

Hadoop Map/Reduce limitations:
• High latency
• No in-memory caching
• Map/Reduce code is complicated to write

Spark:
• In-memory processing
• Simple, concise API
• Can run standalone, even on Windows
• Up to 100x faster in memory and 10x faster on disk
Hadoop Word Count Example
(See code)
Spark Word Count Example

val file = spark.textFile("file.name")
file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
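To make the pipeline concrete, here is a plain-Python sketch (not PySpark) of what flatMap, map, and reduceByKey compute together; the lines list is invented sample input standing in for the contents of file.name:

```python
from collections import Counter

# Invented sample input, standing in for the lines of "file.name".
lines = ["to be or not to be", "to see or not to see"]

# flatMap: split each line into words and flatten into one sequence.
words = [word for line in lines for word in line.split(" ")]

# map + reduceByKey: pair each word with 1, then sum the 1s per word.
# Counter does the same per-key summation as reduceByKey(_ + _).
counts = Counter(words)

print(counts["to"])  # 4
```

The output is the same word-to-count mapping the Spark job would produce, just computed on one machine instead of across a cluster.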
RDD – Resilient Distributed Dataset
• Operations: transformations and actions
• Persistence
  • Allows an RDD to persist between operations
  • Can spill to disk if too large for memory
• Parallelized collections
  • Typically you want 2-4 slices per CPU in your cluster
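The "slices" idea can be shown without a cluster: parallelizing a local collection splits it into contiguous partitions that workers process independently. This plain-Python sketch mimics that split; slice_collection is a hypothetical helper for illustration, not a Spark API:

```python
def slice_collection(data, num_slices):
    """Split data into num_slices contiguous partitions with
    balanced sizes, the way a parallelized collection is sliced."""
    n = len(data)
    return [data[n * i // num_slices : n * (i + 1) // num_slices]
            for i in range(num_slices)]

# With 2-4 slices per CPU, an 8-core cluster might use 16-32 slices;
# here a small example: 10 elements into 4 slices.
parts = slice_collection(list(range(10)), 4)
print(parts)  # [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```

More slices than CPUs keeps every core busy even when partitions take uneven amounts of time, which is why 2-4 slices per CPU is the usual starting point.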
Operations

Transformations:
• map, filter, sample, join
• reduceByKey, groupByKey, distinct

Actions:
• reduce, collect, count
• first, take, saveAs
• countByKey
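The split matters because transformations are lazy (they only build up a plan) while actions trigger the actual computation. Python's lazy map/filter over a generator give a rough stand-in for that behavior, though this is an analogy, not Spark itself:

```python
executed = []

def source():
    """Yield numbers, recording when each element is actually computed."""
    for i in range(5):
        executed.append(i)
        yield i

# "Transformations" (lazy): map and filter build a pipeline, compute nothing.
doubled = map(lambda x: x * 2, source())
evens = filter(lambda x: x % 4 == 0, doubled)

assert executed == []          # nothing has run yet

# "Action" (eager): collecting the results forces the whole pipeline.
result = list(evens)

print(result)    # [0, 4, 8]
print(executed)  # [0, 1, 2, 3, 4]
```

In Spark the same laziness lets the engine see the whole chain of transformations before an action runs, so it can pipeline them and avoid materializing intermediate results.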
Operations continued
Persistence
• Store an RDD for later operations
• Each node persists a partition
• Partitions are fault-tolerant
• persist() or cache()
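Why persist? Without it, every action recomputes the RDD's lineage from scratch; a cached copy makes the work happen once. This toy in plain Python (not the Spark API) counts recomputations to show the difference:

```python
compute_calls = 0

def expensive_transform(data):
    """Stand-in for an RDD transformation chain; counts how often it runs."""
    global compute_calls
    compute_calls += 1
    return [x * x for x in data]

data = list(range(1000))

# Without persistence: each "action" recomputes the transformation.
total = sum(expensive_transform(data))
count = len(expensive_transform(data))
print(compute_calls)  # 2

# With persistence: materialize once (like rdd.cache()), reuse afterwards.
compute_calls = 0
cached = expensive_transform(data)
total = sum(cached)
count = len(cached)
print(compute_calls)  # 1
```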
Persistence storage levels
• MEMORY_ONLY - Store the RDD as deserialized Java objects in the JVM
• MEMORY_AND_DISK - Store the RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, store the partitions that don't fit on disk
• MEMORY_ONLY_SER - Store the RDD as serialized Java objects (one byte array per partition)
• MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk
• DISK_ONLY - Store the RDD partitions only on disk
• MEMORY_ONLY_2, MEMORY_AND_DISK_2 - Same as the levels above, but replicate each partition on two cluster nodes
• OFF_HEAP - Store the RDD in serialized format in Tachyon
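The memory-then-disk behavior can be sketched as a store with a capped in-memory tier that spills overflow partitions to a second tier. SpillStore below is a toy model in plain Python, assuming a dict stands in for on-disk partition files; it is not how Spark's block manager is implemented:

```python
class SpillStore:
    """Toy model of MEMORY_AND_DISK: keep up to max_in_memory
    partitions in memory, spill the rest to a 'disk' tier."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = {}
        self.disk = {}  # stands in for on-disk partition files

    def put(self, partition_id, data):
        if len(self.memory) < self.max_in_memory:
            self.memory[partition_id] = data
        else:
            self.disk[partition_id] = data

    def get(self, partition_id):
        # Memory first, then fall back to the spilled copy.
        if partition_id in self.memory:
            return self.memory[partition_id]
        return self.disk[partition_id]

store = SpillStore(max_in_memory=2)
for pid in range(4):
    store.put(pid, [pid] * 3)

print(sorted(store.memory))  # [0, 1]
print(sorted(store.disk))    # [2, 3]
print(store.get(3))          # [3, 3, 3]
```

Every partition stays reachable either way; the level only decides whether a read is served from RAM or pays the cost of going to disk.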
Spark Advantages
• Same code can be used for streaming and batch processing
• In-memory processing
• Fault-tolerant RDD persistence
• Built-in machine-learning library
• Spark SQL (coming soon)
• Graph processing (GraphX, Bagel/Pregel)
Spark Drawbacks
• No append for output
• Lack of a job scheduler
• Spark on YARN not quite ready for prime time
• Still a young project
Questions?