in 2020sp, this lecture and lecture 17 are both …...in hadoop, each developer tends to invent his...
TRANSCRIPT
![Page 1: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/1.jpg)
CS5412 / Lecture 20Apache Spark and RDDs
Kishore Pusukuri, Spring 2020
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP 1
In 2020SP, this lecture and lecture
17 are both optional extra material
![Page 2: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/2.jpg)
Recap
2
MapReduce• For easily writing applications to process vast amounts of data in-
parallel on large clusters in a reliable, fault-tolerant manner• Takes care of scheduling tasks, monitoring them and re-executes
the failed tasks
HDFS & MapReduce: Running on the same set of nodes compute nodes and storage nodes same (keeping data close to the computation) very high throughput
YARN & MapReduce: A single master resource manager, one slave node manager per node, and AppMaster per application
![Page 3: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/3.jpg)
Today’s Topics
3
•Motivation•Spark Basics•Spark Programming
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 4: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/4.jpg)
History of Hadoop and Spark
4HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 5: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/5.jpg)
Apache Spark
5
Yet Another Resource Negotiator (YARN)
Spark Stream
Spark SQL
Other Applications
Data Ingestion Systems
e.g., Apache Kafka,
Flume, etcHa doop NoSQL Da ta ba se (HBa se)
Ha doop Distributed File System (HDFS)
S3, Cassandra etc., other storage systems
Mesos etc.Spark Core(Standalone Scheduler)
Data Storage
Resource manager
Hadoop Spark
Processing
** Spark can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN)
Spark ML
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 6: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/6.jpg)
Apache Hadoop Lacks Unified Vision
6
• Sparse Modules• Diversity of APIs• Higher Operational Costs
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 7: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/7.jpg)
Spark Ecosystem: A Unified Pipeline
7
Note: Spark is not designed for IoT real-time. The streaming layer is used for continuous input streams like financial data from stock markets, where events occur steadily and must be processed as they occur. But there is no sense of direct I/O from sensors/actuators. For IoT use cases, Spark would not be suitable.
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 8: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/8.jpg)
Key ideas
In Hadoop, each developer tends to invent his or her own style of work
With Spark , se r ious e f fo r t to s tandard ize a round the idea tha t peop le a re writing pa ra lle l code tha t often runs for ma ny “cycles” or “ite ra tions” in which a lot of reuse of informa tion occurs.
Spark centers on Resilient Distributed Dataset, RDDs, that capture the informa tion be ing reused.
8HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 9: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/9.jpg)
How this works
You express your applicat ion as a graph of RDDs.
The graph is only evaluated as needed, and they only compute the RDDs a ctua lly needed for the output you ha ve requested.
Then Spark can be told to cache the reusea ble informa tion e ither in memory, in SSD stora ge or even on disk, ba sed on when it will be needed again, how big it is, and how costly it would be to recreate.
You write the RDD logic and control all of this via hints9HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 10: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/10.jpg)
Motivation (1)
10
MapReduce: The original scalable, general, processing engine of the Hadoop ecosystem
• Disk-ba sed da ta processing fra mework (HDFS files)• Persists inte rmedia te results to disk• Da ta is re loa ded from disk with every query → Costly I/O • Best for ETL like workloa ds (ba tch processing)• Costly I/O → Not a ppropria te for ite ra tive or strea m
processing workloa ds
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 11: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/11.jpg)
Motivation (2)
11
Spark: General purpose computational framework that substantially improves performance of MapReduce, but retains the basic model
• Memory ba sed da ta processing fra mework → a voids costly I/O by keeping inte rmedia te results in memory
• Levera ges distributed memory • Remembers opera tions a pplied to da ta se t• Da ta loca lity ba sed computa tion → High Performa nce• Best for both ite ra tive (or strea m processing) a nd ba tch
workloa dsHTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 12: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/12.jpg)
Motivation - Summa ry
12
Softwa re engineering point of view Ha doop code ba se is huge Contributions/Extensions to Ha doop a re cumbersome J a va -only hinders wide a doption, but J a va support is funda menta l
System/Fra mework point of view Unified pipe line Simplified da ta flow Fa ste r processing speed
Da ta a bstra ction point of view New funda menta l a bstra ction RDD Ea sy to extend with new opera tors More descriptive computing mode l
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 13: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/13.jpg)
Today’s Topics
13
•Motivation•Spark Basics•Spark Programming
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 14: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/14.jpg)
Spark Basics(1)
14
Spa rk: Flexible , in-memory da ta processing fra mework written in Sca la
Goa ls:• Simplic ity (Ea sie r to use ):
Rich APIs for Sca la , J a va , a nd Python
• Genera lity: APIs for diffe rent types of workloa ds Ba tch, Strea ming, Ma chine Lea rning, Gra ph
• Low La tency (Performa nce) : In-memory processing a nd ca ching
• Fa ult-tole ra nce : Fa ults shouldn’t be specia l ca se
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 15: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/15.jpg)
Spark Basics(2)
15
There a re two wa ys to ma nipula te da ta in Spa rk• Spa rk She ll:
Inte ra ctive – for lea rning or da ta explora tion Python or Sca la
• Spa rk Applica tions For la rge sca le da ta processing Python, Sca la , or J a va
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 16: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/16.jpg)
Spark Core: Code Base (2012)
16HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 17: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/17.jpg)
Spark Shell
17
The Spa rk She ll provides inte ra ctive da ta explora tion (REPL)
REPL: Repeat/Evaluate/Print Loop
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 18: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/18.jpg)
Spark Fundamentals
18
•Spark Context
•Resilient Distributed Data
•Transformations
•Actions
Example of an application:
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 19: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/19.jpg)
Spark Context (1)
19
•Every Spark application requires a spark context: the main entry point to the Spark API
•Spark Shell provides a preconfigured Spark Context called “sc”
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 20: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/20.jpg)
Spark Context (2)
20
•Standalone applications Driver code Spark Context•Spark Context holds configuration information and represents connection to a Spark cluster
Standalone Application (Drives Computation)
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 21: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/21.jpg)
Spark Context (3)
21
Spa rk context works a s a c lient a nd represents connection to a Spa rk c luste r
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 22: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/22.jpg)
Spark Fundamentals
22
•Spark Context
•Resilient Distributed Data
•Transformations
•Actions
Example of an application:
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 23: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/23.jpg)
Resilient Distributed Dataset (RDD)
23
The RDD(Resilient Distributed Dataset) is the fundamental unit of data in Spark: An Immutable collection of objects (or records, or elements) that can be operated on “in parallel” (spread across a cluster)Resilient -- if data in memory is lost, it can be recreated
• Recover from node fa ilures• An RDD keeps its linea ge informa tion it ca n be recrea ted from
pa rent RDDsDistributed -- processed a cross the c luste r
• Ea ch RDD is composed of one or more pa rtitions (more pa rtitions –more pa ra lle lism)
Dataset -- initia l da ta ca n come from a file or be crea ted
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 24: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/24.jpg)
RDDs
24
Key Idea: Write applications in terms of transformations on distributed datasets. One RDD per transformation.
• Orga nize the RDDs into a DAG showing how da ta flows.• RDD ca n be sa ved a nd reused or recomputed. Spa rk ca n
sa ve it to disk if the da ta se t does not fit in memory• Built through pa ra lle l tra nsforma tions (ma p, filte r, group-by,
join, e tc). Automa tica lly rebuilt on fa ilure• Controlla ble persistence (e .g. ca ching in RAM)
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 25: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/25.jpg)
RDDs are designed to be “immutable”
25
• Crea te once , then reuse without cha nges. Spa rk knows linea ge ca n be recrea ted a t a ny time Fa ult-tole ra nce
• Avoids da ta inconsistency problems (no simulta neous upda tes) Correctness
• Ea sily live in memory a s on disk Ca ching Sa fe to sha re a cross processes/ta sks Improves performa nce
• Tra deoff: (Fault-tolerance & Correctness) vs (Disk Memory & CPU)
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 26: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/26.jpg)
Creating a RDD
26
Three wa ys to crea te a RDD• From a file or se t of file s• From da ta in memory• From a nother RDD
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 27: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/27.jpg)
Example: A File-ba sed RDD
27HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 28: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/28.jpg)
Spark Fundamentals
28
•Spark Context
•Resilient Distributed Data
•Transformations
•Actions
Example of an application:
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 29: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/29.jpg)
RDD Operations
29
Two types of operationsTransformations: Define a new RDD based on current RDD(s)Actions: return values
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 30: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/30.jpg)
RDD Transformations
30
•Set of operations on a RDD that define how they should be transformed
•As in relational algebra, the application of a transformation to an RDD yields a new RDD (because RDD are immutable)
•Transformations are lazily evaluated, which allow for optimizations to take place before execution
•Examples: map(), filter(), groupByKey(), sortByKey(), etc.
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 31: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/31.jpg)
Example: map and filter Transformations
31HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 32: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/32.jpg)
RDD Actions
32
•Apply transformation chains on RDDs, eventually performing some additional operations (e.g., counting)
•Some actions only store data to an external data source (e.g. HDFS), others fetch data from the RDD (and its transformation chain) upon which the action is applied, and convey it to the driver
•Some common actionscount() – return the number of elementstake(n) – return an array of the first n elementscollect()– return an array of all elementssaveAsTextFile(file) – save to text file(s)
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 33: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/33.jpg)
Graph of RDDs
A collection of RDDs ca n be understood a s a gra ph
Nodes in the gra ph a re the RDDs, which mea ns the code but a lso the a ctua l da ta object tha t could would crea te a t runtime when executed on specific pa ra mete rs + da ta . Reminder: Ha doop is a “rea d only” model, so we ca n “ma te ria lize” a n RDD a ny time we like .
Edges represent how da ta objects a re a ccessed: RDD B might consume the object crea ted by RDD A. This gives us a directed edge A → B
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP 33
![Page 34: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/34.jpg)
Lazy Execution of RDDs (1)
34
Data in RDDs is not processed until an action is performed
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 35: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/35.jpg)
Lazy Execution of RDDs (2)
35
Data in RDDs is not processed until an action is performed
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 36: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/36.jpg)
Lazy Execution of RDDs (3)
36
Data in RDDs is not processed until an action is performed
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 37: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/37.jpg)
Lazy Execution of RDDs (4)
37
Data in RDDs is not processed until an action is performed
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 38: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/38.jpg)
Lazy Execution of RDDs (5)
38
Data in RDDs is not processed until an action is performed
Output Action “triggers” computation, pull model
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 39: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/39.jpg)
Opportunities This Enables
On-demand optimization : Spark can behave like a compiler by first building a potentially complex RDD graph, but then trimming away unneeded computations that for today’s purpose, won’t be used.
Caching for later reuse. Graph transformations : A significant amount of effort is going into this area. It
is a lot like compiler-managed program transformation and aims at simplifying and speeding up the computation that will occur.
Dynamic decisions about what to schedule and when . Concept: minimum adequate set of input objects: RDD can run if all its inputs are ready
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP 39
![Page 40: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/40.jpg)
Example: Mine error logs
40
Loa d e rror messa ges from a log into memory, then inte ra ctive ly sea rch for va rious pa tte rns:
lines = spark.textFile(“hdfs://...”) HadoopRDD
errors = lines.filter(lambda s: s.startswith(“ERROR”)) FilteredRDD
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
messages.filter(lambda s: “foo” in s).count()
Result: full-text sea rch of Wikipedia in 0.5 sec (vs 20 sec for on-disk da ta )
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 41: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/41.jpg)
Key Idea: Elastic parallelism
RDDs operations are designed to offer embarrassing parallelism.
Spark wi l l sp read the t ask over the nodes where da ta res ides , o f fe r s a h igh ly concurrent execution tha t minimizes de la ys. Term: “pa rtitioned computa tion” .
If some component crashes or even is just slow, Spark simply kil ls that task and la unches a substitute .
41HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 42: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/42.jpg)
RDD and Partitions (Parallelism example)
42HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 43: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/43.jpg)
RDD Graph: Data Set vs Partition Views
43
Much like in Hadoop MapReduce, each RDD is associated to (input) partitions
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 44: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/44.jpg)
RDDs: Data Locality
44
•Data Locality Principle Keep high-value RDDs precomputed, in cache or SDD Run tasks that need the specific RDD with those same inputs
on the node where the cached copy resides. This can maximize in-memory computational performance.
Requires cooperation between your hints to Spark when you build the RDD, Spark runtime and optimization planner, and the underlying YARN resource manager.
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 45: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/45.jpg)
RDDs -- Summa ry
45
RDD a re pa rtitioned, loca lity a wa re , distributed collections RDD a re immuta ble
RDD a re da ta structures tha t: Either point to a direct da ta source (e .g. HDFS) Apply some tra nsforma tions to its pa rent RDD(s) to
genera te new da ta e lementsComputa tions on RDDs Represented by la zily eva lua ted linea ge DAGs composed
by cha ined RDDs
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 46: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/46.jpg)
Lifetime of a Job in Spark
46HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 47: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/47.jpg)
Anatomy of a Spark Application
47
Cluster Manager (YARN/Mesos)
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 48: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/48.jpg)
Typical RDD pattern of useInstead of doing a lot of work in each RDD, developers split ta sks into lots of sma ll RDDs
These are then organized into a DAG.
Developer anticipates which will be costly to recompute a nd hints to Spa rk tha t it should ca che those .
48HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 49: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/49.jpg)
Why is this a good strategy?
Spark tries to run tasks that will need the same intermediary data on the same nodes.
If MapReduce jobs were arbi trary programs, this wouldn’t help because reuse would be very ra re .
But in fac t the MapReduce mode l i s ve ry repe t i t ious and i t e ra t ive , and of ten a pplies the sa me tra nsforma tions a ga in a nd a ga in to the sa me input file s.
Those pa rticula r RDDs become grea t ca ndida tes for ca ching.
Ma pReduce progra mmer ma y not know how ma ny ite ra tions will occur, but Spa rk itse lf is sma rt enough to evic t RDDs if they don’t a ctua lly ge t reused.
49HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 50: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/50.jpg)
Iterative Algorithms: Spark vs MapReduce
50HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 51: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/51.jpg)
Today’s Topics
51
•Motivation•Spark Basics•Spark Programming
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 52: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/52.jpg)
Spark Programming (1)
52
Crea ting RDDs# Turn a Python collection into an RDDsc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3sc.textFile(“file.txt”)sc.textFile(“directory/*.txt”)sc.textFile(“hdfs://namenode:9000/path/file”)
# Use existing Hadoop InputFormat (Java/Scala only)sc.hadoopFile(keyClass, valClass, inputFmt, conf)
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 53: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/53.jpg)
Spark Programming (2)
53
Ba sic Tra nsforma tions
nums = sc.parallelize([1, 2, 3])
# Pass each element through a functionsquares = nums.map(lambda x: x*x) // {1, 4, 9}
# Keep elements passing a predicateeven = squares.filter(lambda x: x % 2 == 0) // {4}
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 54: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/54.jpg)
Spark Programming (3)
54
Ba sic Actionsnums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collectionnums.collect() # => [1, 2, 3]
# Return first K elementsnums.take(2) # => [1, 2]
# Count number of elementsnums.count() # => 3
# Merge elements with an associative functionnums.reduce(lambda x, y: x + y) # => 6
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 55: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/55.jpg)
Spark Programming (4)
55
Working with Key-Va lue Pa irsSpark’s “distributed reduce” transformations operate on RDDs of key-value pairs
Python: pair = (a, b)
pair[0] # => a
pair[1] # => b
Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b
Java: Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => bHTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 56: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/56.jpg)
Spark Programming (5)
56
Some Key-Va lue Opera tions
pets = sc.parallelize([(“cat”, 1), (“dog”, 1), (“cat”, 2)])
pets.reduceByKey(lambda x, y: x + y) # => {(cat, 3), (dog, 1)}
pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}
pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 57: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/57.jpg)
Example: Word Count
57
lines = sc.textFile(“hamlet.txt”)counts = lines.flatMap(lambda line: line.split(“ “))
.map(lambda word: (word, 1))
.reduceByKey(lambda x, y: x + y)
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 58: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/58.jpg)
Example: Spark Streaming
58
Represents strea ms a s a se ries of RDDs over time (typica lly sub second inte rva ls, but it is configura ble )
val spammers = sc.sequenceFile(“hdfs://spammers.seq”)sc.twitterStream(...)
.filter(t => t.text.contains(“Santa Clara University”))
.transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
.print()
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 59: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/59.jpg)
Spark: Combining Libraries (Unified Pipeline)
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP 59
# Load data using Spark SQL
points = spark.sql(“select latitude, longitude from tweets”)
# Train a machine learning model
model = KMeans.train(points, 10)
# Apply it to a stream
sc.twitterStream(...)
.map(lambda t: (model.predict(t.location), 1))
.reduceByWindow(“5s”, lambda a, b: a + b)
![Page 60: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/60.jpg)
Spark: Setting the Level of Parallelism
60
All the pa ir RDD opera tions ta ke a n optiona l second pa ra mete r for number of ta sks
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
HTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP
![Page 61: In 2020SP, this lecture and lecture 17 are both …...In Hadoop, each developer tends to invent his or her own style of work With Spark, serious effort to standardize around the idea](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec5c2db4b59e275ef4fa8fc/html5/thumbnails/61.jpg)
Summary
Spark is a powerful “manager” for big data computing.It centers on a job scheduler for Hadoop (Ma pReduce) tha t is sma rt a bout where to run ea ch ta sk: co-loca te ta sk with da ta .The da ta ob jec t s a re “RDDs”: a k ind of rec ipe fo r genera t ing a f i l e f rom a n underlying da ta collection. RDD ca ching a llows Spa rk to run mostly from memory-ma pped da ta , for speed.
61
• Online tutorials: spark.apache.org/docs/latestHTTP:/ /WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP