Big DataFall 2019
School of Computer ScienceUniversity of Waterloo
Databases CS348
(material in part by Tamer Özsu)
(University of Waterloo) Big Data 1 / 25
Big Data Characteristics
Volume
Almost every dataset we use is high volume as discussed
Variety
Text data
Streaming data (mobile included)
Graph and RDF data
Velocity
Streaming data
Veracity
Data quality & cleaning
Managing uncertain data
Search under uncertainty
Global Information Storage Capacity
Platforms for Big Data
Architecture Changes
Dataflow-like rather than persistent storage and transactions
Scalability via partitioning
Fault tolerance and lineage
MapReduce Basics
For data analysis of very large data sets
  Highly dynamic, irregular, schemaless, etc.
  SQL is too heavyweight
"Embarrassingly parallel" problems
  New, simple parallel programming model
Data structured as (key, value) pairs
  E.g., (doc-id, content), (word, count), etc.
Functional programming style with two user-supplied functions:
  Map(k1, v1) → list(k2, v2)
  Reduce(k2, list(v2)) → list(v3)
Implemented on a distributed file system (e.g., Google File System) on very large clusters
Map Function
User-defined function
  Processes input key/value pairs
  Produces a set of intermediate key/value pairs
Map function I/O
  Input: read a chunk from the distributed file system (DFS)
  Output: write to an intermediate file on local disk
MapReduce library
  Executes the map function
  Groups together all intermediate values with the same key (i.e., generates a set of lists)
  Passes these lists to the reduce functions
Effect of the map function
  Processes and partitions the input data
  Builds a distributed map (transparent to the user)
  Similar to the "group by" operation in SQL
Reduce Function
User-defined function
  Accepts one intermediate key and a set of values for that key (i.e., a list)
  Merges these values together to form a (possibly) smaller set
  Typically, zero or one output value is generated per invocation
Reduce function I/O
  Input: read from intermediate files using remote reads on the local files of the corresponding mapper nodes
  Output: each reducer writes its output as a file back to the DFS
Effect of the reduce function
  Similar to the aggregation operation in SQL
Map Reduce Example
function map(String document):
    // document: document contents
    for each word w in document:
        emit(w, 1)

function reduce(String word, Iterator pCounts):
    // word: a word
    // pCounts: a list of aggregated partial counts
    sum = 0
    for each pc in pCounts:
        sum += pc
    emit(word, sum)
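The pseudocode above can be exercised end-to-end with a small in-memory sketch in Python; the run_mapreduce driver and its defaultdict-based shuffle are illustrative stand-ins for what the MapReduce library does, not part of any real framework:

```python
from collections import defaultdict

def map_fn(doc_id, content):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word
    for word in content.split():
        yield (word, 1)

def reduce_fn(word, partial_counts):
    # Reduce(k2, list(v2)) -> list(v3): sum the partial counts
    yield (word, sum(partial_counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key (the library's job)
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce each (key, list-of-values) group
    out = []
    for k2, values in sorted(groups.items()):
        out.extend(reduce_fn(k2, values))
    return out

docs = [("d1", "big data big ideas"), ("d2", "big clusters")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# -> [('big', 3), ('clusters', 1), ('data', 1), ('ideas', 1)]
```

Note that the grouping step is exactly the "group by" effect of the map phase described on the previous slide.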
Map Reduce Framework
The MapReduce system:
1. Prepare the Map() input: designate Map processors, assign the input key K1 to them, and provide them with all the input data for K1.
2. Run the user-provided Map() code: Map() is run exactly once for each K1 key, generating output organized by key K2.
3. "Shuffle" the Map output to the Reduce processors: designate Reduce processors, assign the K2 key to them, and provide all the Map-generated data associated with that key.
4. Run the user-provided Reduce() code: Reduce() is run exactly once for each K2 key produced by the Map step.
5. Produce the final output: collect all the Reduce output and sort it by K2 to produce the final outcome.
MapReduce Processing
[Diagram: the input dataset is split across several Map tasks; each emits (k1, v) and (k2, v) pairs, which are grouped by key so that each Reduce task receives (k, (v, v, ...)) lists and writes the output dataset.]
MapReduce Architecture
[Diagram: a Master node runs the Scheduler. Each Map worker runs a Map process built from Input, Map, Combine, and Partition modules; each Reduce worker runs a Reduce process built from Group, Reduce, and Output modules.]
Hadoop
Most popular MapReduce implementation – developed by Yahoo!
Hadoop protocol stack (bottom to top):
  Data
  Hadoop Distributed File System (HDFS)
  Yarn
  MapReduce
  Hive & HiveQL, HBase
  3rd-party analysis tools: R (stats), Mahout (machine learning), ...
Hadoop (cont’ed)
Two components
  Processing engine
  HDFS: Hadoop Distributed File System – others possible
  Can be deployed on the same machine or on different machines
Processes
  Job tracker: hosted on the master node; implements the scheduler
  Task tracker: hosted on the worker nodes; accepts tasks from the job tracker and executes them
HDFS
  Name node: stores how data are partitioned, monitors the status of data nodes, and holds the data dictionary
  Data node: stores and manages the data chunks assigned to it
Aggregation
[Diagram: two mappers each read (Key, Value) rows of R – (1, R1), (2, R2) and (3, R3), (4, R4) – and extract the aggregation attribute (Aid), emitting (Aid, Value) pairs. The pairs are partitioned by Aid (round robin) across the reducers. Each reducer groups by Aid and applies the aggregation function f to the tuples with the same Aid, producing results such as (1, f(R1, R3)) and (2, f(R2, R4)).]
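The aggregation pattern in the diagram can be sketched in a few lines of Python; aggregate, key_of, and f are hypothetical names standing in for the map-side extraction, the shuffle, and the reduce-side aggregation:

```python
from collections import defaultdict

def aggregate(records, key_of, f):
    # Map phase: tag each record with its aggregation attribute (Aid)
    pairs = [(key_of(r), r) for r in records]
    # Shuffle: partition and group the tagged records by Aid
    groups = defaultdict(list)
    for aid, r in pairs:
        groups[aid].append(r)
    # Reduce phase: apply the aggregation function f per group
    return {aid: f(rs) for aid, rs in groups.items()}

rows = [(1, 10), (2, 20), (1, 5), (2, 7)]   # (Aid, value) tuples R1..R4
print(aggregate(rows, key_of=lambda r: r[0],
                f=lambda rs: sum(v for _, v in rs)))
# -> {1: 15, 2: 27}
```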
θ-Join
Baseline implementation of R(A,B) ⋈ S(B,C)
1. Partition R and S and assign each partition to mappers
2. Each mapper converts 〈a,b〉 tuples of R to key/value pairs of the form (b, 〈a,R〉) (and, symmetrically, 〈b,c〉 tuples of S to (b, 〈c,S〉))
3. Each reducer pulls the pairs with the same key
4. Each reducer joins the tuples of R with the tuples of S
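A minimal Python sketch of this baseline (repartition) join, with the shuffle simulated by a dictionary; repartition_join is an illustrative name, not a framework API:

```python
from collections import defaultdict

def repartition_join(R, S):
    # Map phase: tag each tuple with its join key b and its source relation
    tagged = [(b, ('R', a)) for (a, b) in R] + [(b, ('S', c)) for (b, c) in S]
    # Shuffle: all pairs with the same key b reach the same reducer
    groups = defaultdict(list)
    for b, tv in tagged:
        groups[b].append(tv)
    # Reduce phase: join the R tuples with the S tuples sharing the key
    out = []
    for b, tvs in groups.items():
        r_side = [a for (src, a) in tvs if src == 'R']
        s_side = [c for (src, c) in tvs if src == 'S']
        out.extend((a, b, c) for a in r_side for c in s_side)
    return sorted(out)

R = [(1, 'x'), (2, 'y')]      # R(A, B)
S = [('x', 10), ('x', 20)]    # S(B, C)
print(repartition_join(R, S))
# -> [(1, 'x', 10), (1, 'x', 20)]
```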
Spark System
MapReduce does not perform well in iterative computations
  Workflow model is acyclic
  Have to write to HDFS after each iteration and read from HDFS at the beginning of the next iteration
Spark objectives
  Better support for iterative programs
  Provide a complete ecosystem
  Similar abstraction (to MapReduce) for programming
  Maintain MapReduce fault tolerance and scalability
Fundamental concepts
  RDD: Resilient Distributed Datasets
  Caching of the working set
  Maintaining lineage for fault tolerance
Example – Log Mining
Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns.

[Diagram: a Driver distributes tasks to three Workers (holding Blocks 1–3 of the file, each with a cache) and collects the results.]

lines = spark.textFile("hdfs://...")            // create RDD
errors = lines.filter(_.startsWith("ERROR"))    // transform RDD
messages = errors.map(_.split('\t')(2))         // another transform
cachedMsgs = messages.cache()                   // cache results
cachedMsgs.filter(_.contains("foo")).count      // action
cachedMsgs.filter(_.contains("bar")).count      // another action; accesses the cache
RDD and Processing
[Diagram: lines = spark.textFile("hdfs://...") reads partitions of the log (Error, Warn, and Info records) from HDFS into the lines RDD; errors = lines.filter(_.startsWith("ERROR")) keeps only the Error records; messages = errors.map(_.split('\t')(2)) extracts the message fields (msg1, msg1, msg3, msg4, msg1). None of these RDDs is generated yet – evaluation is lazy.]
RDD and Processing (cont'd)

[Diagram: the driver submits the action messages.filter(_.contains("foo")).count. Before the action the command is not yet executed; once it runs, the lines, errors, and messages RDDs are materialized.]
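Lazy evaluation can be illustrated with a toy RDD class in Python (ToyRDD is purely illustrative – real Spark RDDs are distributed and far richer): transformations only record lineage, and nothing is computed until an action such as count is called.

```python
class ToyRDD:
    """A minimal sketch of lazy evaluation: transformations only record
    lineage; nothing is computed until an action (count) is called."""
    def __init__(self, source, ops=()):
        self.source = source          # base data (stands in for an HDFS file)
        self.ops = ops                # recorded lineage of transformations

    def filter(self, pred):
        return ToyRDD(self.source, self.ops + (('filter', pred),))

    def map(self, fn):
        return ToyRDD(self.source, self.ops + (('map', fn),))

    def _materialize(self):
        data = self.source
        for kind, fn in self.ops:     # replay the recorded lineage
            data = [fn(x) for x in data] if kind == 'map' \
                   else [x for x in data if fn(x)]
        return data

    def count(self):                  # action: triggers evaluation
        return len(self._materialize())

log = ["ERROR\tdisk\tfoo", "WARN\tnet\tbar", "ERROR\tcpu\tbar"]
lines = ToyRDD(log)
errors = lines.filter(lambda l: l.startswith("ERROR"))
messages = errors.map(lambda l: l.split('\t')[2])
print(messages.filter(lambda m: "bar" in m).count())
# -> 1
```

Building errors and messages does no work at all; only the final count walks the data.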
Example Transform Functions
map(func): Return a new RDD formed by passing each element of the source through the function func
filter(func): Return a new RDD formed by selecting those elements of the source on which func returns true
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (func should return a Seq rather than a single item)
mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator
sample(repl, fraction, seed): Sample a fraction fraction of the data, with or without replacement (set repl accordingly), using the given random number generator seed
union(otherDataset) / intersection(otherDataset): Return a new RDD that contains the union/intersection of the elements in the source RDD and the argument
groupByKey(): Operates on an RDD of (K, V) pairs; returns an RDD of (K, Iterable<V>) pairs
reduceByKey(func, ...): Operates on an RDD of (K, V) pairs; returns an RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function func
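For intuition, the semantics of flatMap, groupByKey, and reduceByKey can be mimicked on plain Python lists (a single-machine sketch, not the Spark API):

```python
from collections import defaultdict
from functools import reduce

pairs = [('a', 1), ('b', 2), ('a', 3)]

# flatMap: each input item may map to 0..n output items
words = [w for line in ["big data", "big"] for w in line.split()]

# groupByKey: (K, V) pairs -> (K, list-of-V) pairs
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

# reduceByKey: aggregate the values of each key with a reduce function
summed = {k: reduce(lambda x, y: x + y, vs) for k, vs in grouped.items()}

print(words)            # -> ['big', 'data', 'big']
print(dict(grouped))    # -> {'a': [1, 3], 'b': [2]}
print(summed)           # -> {'a': 4, 'b': 2}
```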
Example Actions
collect(): Return all the elements of the dataset as an array at the driver program
count(): Return the number of elements in the dataset
take(n): Return an array with the first n elements of the dataset
first(): Return the first element of the dataset (similar to take(1))
takeSample(repl, num, [seed]): Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed
takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural order or a custom comparator
saveAsTextFile(path): Write the elements of the RDD as a text file (or set of text files) in a given directory in the local filesystem, HDFS, or any other Hadoop-supported file system
countByKey(): Only available on RDDs of type (K, V); returns a hashmap of (K, Int) pairs with the count of each key
foreach(func): Run the function func on each element of the dataset
RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions
Lineage is constructed as an object and stored persistently for recovery
Example:

messages = spark.textFile("hdfs://...")
                .filter(_.startsWith("ERROR"))
                .map(_.split('\t')(2))

[Diagram: HDFS File –filter (func = _.contains(...))→ Filtered RDD –map (func = _.split(...))→ Mapped RDD]
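Recovery by lineage can be sketched in Python: if a derived partition is lost, replaying the recorded chain of transformations against the base data rebuilds it (recompute and the lineage encoding here are illustrative):

```python
def recompute(partition, lineage):
    """Rebuild a lost RDD partition by replaying its lineage (the
    recorded chain of transformations) against the base data."""
    data = partition
    for kind, fn in lineage:
        data = [fn(x) for x in data] if kind == 'map' \
               else [x for x in data if fn(x)]
    return data

base = ["ERROR\ta\tm1", "INFO\tb\tm2", "ERROR\tc\tm3"]   # base HDFS chunk
lineage = [('filter', lambda l: l.startswith("ERROR")),
           ('map',    lambda l: l.split('\t')[2])]

# If the derived partition is lost, replaying the lineage restores it:
print(recompute(base, lineage))
# -> ['m1', 'm3']
```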
Iteration (transitive closure)

var tcr = sc.parallelize(tuples, 5)
var deltar = tcr.localCheckpoint()
var deltal = tcr.map(x => (x._2, x._1)).localCheckpoint()
do {
  deltar = deltal.join(tcr).map(x => (x._2._1, x._2._2))
                 .subtract(tcr).coalesce(5, true)
                 .distinct().localCheckpoint();
  deltal = deltar.map(x => (x._2, x._1)).localCheckpoint();
  tcr = tcr.union(deltar)
} while (deltal.count() > 0)
tcr.foreach { t => println(t) }
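The same transitive-closure iteration, stripped of the Spark machinery, can be written as a semi-naive fixpoint loop in Python (transitive_closure is an illustrative stand-in for the Scala program above):

```python
def transitive_closure(edges):
    # Semi-naive iteration mirroring the Spark loop: join the newly
    # discovered pairs (delta) with the current closure, subtract what
    # is already known, and repeat until no new pairs appear.
    tc = set(edges)
    delta = set(edges)
    while delta:
        new = {(a, c) for (a, b) in delta
                      for (b2, c) in tc if b2 == b} - tc
        tc |= new        # tcr = tcr.union(deltar)
        delta = new      # next round's frontier
    return tc

print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
# -> [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```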
Lineage and StagesLineage example:
[Spark UI screenshot, "Details for Job 9": Status SUCCEEDED, 6 completed stages, 45 skipped stages. The DAG visualization shows the lineage growing with each iteration – repeated parallelize / join / map / subtract / distinct / union / checkpoint stages (Stages 234–284), most of them skipped because their results were already checkpointed. The completed stages (279–284: union, map, subtract, subtract, distinct, count) each processed a few MB in a few seconds.]
⇒ replaces durability: system can replay failed stages
Stage example:

[Spark UI screenshot, "Details for Stage 282 (Attempt 0)": 2560 tasks, all process-local on the driver; total time across all tasks 14 s; input 4.3 MB / 125250 records; shuffle write 3.8 MB / 125250 records. The stage's DAG chains parallelize with nine rounds of distinct/union over cached MapPartitionsRDDs, ending in a subtract. Summary metrics per task: median duration 5 ms, max 17 ms; scheduler delay, deserialization, GC, and result serialization times are at most a few ms each.]
121 20576 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 13 ms 0 ms 0 ms 3 ms 0 ms 0 ms
52 20507 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 13 ms 0 ms 0 ms 0 ms 0 ms
51 20506 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 13 ms 1 ms 1 ms 0 ms 0 ms
38 20493 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 13 ms 0 ms 1 ms 3 ms 0 ms 0 ms
17 17912 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:02 13 ms 0 ms 1 ms 3 ms 0 ms 0 ms
10 17905 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:02 13 ms 1 ms 1 ms 3 ms 0 ms 0 ms
2420 22875 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 12 ms 0 ms 0 ms 0 ms 0 ms
2146 22601 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 12 ms 0 ms 0 ms 3 ms 0 ms 0 ms
1961 22416 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 12 ms 0 ms 0 ms 0 ms 0 ms
1644 22099 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 12 ms 0 ms 1 ms 3 ms 0 ms 0 ms
1635 22090 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 12 ms 0 ms 0 ms 3 ms 0 ms 0 ms
1178 21633 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 12 ms 0 ms 0 ms 3 ms 0 ms 0 ms
815 21270 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 12 ms 0 ms 1 ms 2 ms 0 ms 0 ms
805 21260 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 12 ms 0 ms 1 ms 2 ms 0 ms 0 ms
504 20959 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 0 ms 1 ms 3 ms 0 ms 0 ms
496 20951 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 0 ms 0 ms 3 ms 0 ms 0 ms
495 20950 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 2 ms 0 ms 3 ms 0 ms 0 ms
421 20876 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 0 ms 0 ms 0 ms 0 ms
412 20867 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 0 ms 1 ms 0 ms 0 ms
267 20722 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 0 ms 1 ms 2 ms 0 ms 0 ms
236 20691 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 0 ms 0 ms 0 ms 0 ms
211 20666 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 0 ms 0 ms 0 ms 0 ms
154 20609 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 0 ms 0 ms 0 ms 0 ms
128 20583 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 0 ms 0 ms 3 ms 0 ms 0 ms
127 20582 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 0 ms 1 ms 3 ms 0 ms 0 ms
122 20577 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 1 ms 0 ms 3 ms 0 ms 0 ms
59 20514 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 2 ms 0 ms 0 ms 0 ms
44 20499 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 0 ms 0 ms 0 ms 0 ms
43 20498 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 1 ms 1 ms 0 ms 0 ms
39 20494 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:08 12 ms 0 ms 0 ms 3 ms 0 ms 0 ms
15 17910 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:02 12 ms 0 ms 1 ms 3 ms 0 ms 0 ms
12 17907 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:02 12 ms 0 ms 1 ms 3 ms 0 ms 0 ms
2532 22987 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 11 ms 0 ms 1 ms 0 ms 0 ms
2485 22940 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 11 ms 0 ms 1 ms 0 ms 0 ms
2484 22939 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 11 ms 1 ms 1 ms 0 ms 0 ms
2428 22883 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 11 ms 1 ms 0 ms 0 ms 0 ms
2157 22612 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 11 ms 0 ms 0 ms 3 ms 0 ms 0 ms
2156 22611 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 11 ms 0 ms 1 ms 3 ms 0 ms 0 ms
2147 22602 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 11 ms 0 ms 1 ms 3 ms 0 ms 0 ms
1989 22444 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 11 ms 0 ms 0 ms 0 ms 0 ms
1980 22435 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 11 ms 0 ms 1 ms 0 ms 0 ms
1970 22425 0 SUCCESS PROCESS_LOCAL driver 2019/11/01 17:11:09 11 ms 0 ms 0 ms 0 ms 0 ms
Jobs (/jobs/) Stages (/stages/) Storage (/storage/) Environment (/environment/) Executors (/executors/)
driver / localhost
370 380 390 400 410 420
17:11:09
430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 610 620 630 640 650 660 670 680 690 700 710 720 730 740 750 760 770 780 790 800 810 820 830 840 850
(University of Waterloo) Big Data 23 / 25
Lineage and Stages
Lineage example:
[Spark Web UI screenshot (Spark 2.4.4 shell application): Details for Job 9 — Status: SUCCEEDED, Completed Stages: 6, Skipped Stages: 45. The DAG visualization shows stages 234–284 built from parallelize, map, join, union, distinct, subtract, and checkpoint operations. Only stages 279–284 actually ran (e.g., 2560/2560 tasks each); the 45 skipped stages are those whose cached or checkpointed results could be reused, so their tasks (0/1280, 0/640, . . . ) never executed.]
⇒ replaces durability: system can replay failed stages
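The lineage idea behind this claim can be sketched in a few lines. This is a toy model, not Spark's actual RDD implementation: each dataset records the transformation and parent that produced it, so a lost result is recomputed by replaying the chain rather than restored from a durable log, and a checkpoint materializes a dataset and cuts the chain (which is why upstream stages show up as "skipped").

```python
# Toy sketch of lineage-based fault tolerance -- NOT Spark's real API.
class LineageRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self.parent, self.fn = parent, fn
        self._cache = list(data) if data is not None else None

    def map(self, f):
        # transformations are recorded lazily; nothing runs yet
        return LineageRDD(parent=self, fn=lambda rows: [f(x) for x in rows])

    def filter(self, p):
        return LineageRDD(parent=self, fn=lambda rows: [x for x in rows if p(x)])

    def compute(self):
        # replay the lineage chain if this dataset is not materialized
        if self._cache is None:
            self._cache = self.fn(self.parent.compute())
        return self._cache

    def checkpoint(self):
        # materialize and cut the lineage: nothing upstream of a checkpoint
        # ever needs replaying again (cf. the "skipped" stages above)
        self._cache = self.compute()
        self.parent = self.fn = None
        return self

squares = LineageRDD(data=range(10)).map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)
print(evens.compute())   # [0, 4, 16, 36, 64]

evens._cache = None      # simulate losing the result on a failed node
print(evens.compute())   # recomputed from lineage: [0, 4, 16, 36, 64]
```

Recovery here is just recomputation from the recorded plan, which is what lets the system skip write-ahead logging of intermediate data.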
Stage example:
[Spark Web UI screenshot (Spark 2.4.4 shell application): Details for Stage 282 (Attempt 0), subtract at <console>:30 — 2560 tasks, all SUCCESS and PROCESS_LOCAL on a single driver executor; total time across all tasks 14 s; input 4.3 MB / 125250 records, shuffle write 3.8 MB / 125250 records; per-task durations 0–17 ms (median 5 ms). The stage DAG chains parallelize, nine distinct/union rounds, and a final subtract over cached MapPartitionsRDDs.]
(University of Waterloo) Big Data 23 / 25
New SPARK
DataFrame/Dataset
⇒ tabular abstraction on top of RDDs
⇒ rudimentary column types
⇒ better properties when written to disk
SQL (on SPARK)
⇒ on-the-fly query optimizer (Catalyst)
(University of Waterloo) Big Data 24 / 25
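The claimed on-disk benefit of a typed, tabular layout can be illustrated even without Spark. (Spark itself writes columnar formats such as Parquet; the stdlib `array` encoding below is only a stand-in for the idea that declared column types let values be stored raw, instead of each row repeating its own structure.)

```python
# Illustration (not Spark): the same records stored (a) as schemaless JSON
# rows and (b) as typed columns, showing why a tabular abstraction with
# rudimentary column types has better properties when written to disk.
import json
from array import array

rows = [{"id": i, "score": i / 3} for i in range(10_000)]

# (a) row-at-a-time, schemaless: every record repeats keys and structure
row_bytes = json.dumps(rows).encode()

# (b) columnar with declared types: just raw values, one array per column
col_bytes = (array("q", (r["id"] for r in rows)).tobytes() +      # int64 ids
             array("d", (r["score"] for r in rows)).tobytes())    # float64 scores

print(len(row_bytes), len(col_bytes))  # typed columns are several times smaller
```

The columnar form is exactly 16 bytes per record here (two fixed-width columns), while the row form pays for field names and text encoding on every record; columnar layouts also compress and scan better.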
Take Home
Lots of open issues:
1 Tools/Frameworks that simplify writing of parallel programs
2 Various ways of implementing scheduling/synchronization/failure recovery
3 Taking advantage of parallelism and partitioning (many levels)
4 . . .
(University of Waterloo) Big Data 25 / 25