10/31/2018 1
Spark RDD
Where are we?
Distributed storage in HDFS
MapReduce query execution in Hadoop
High-level data manipulation using Pig
A Step Back to MapReduce
Designed in the early 2000s
Machines were unreliable
Focused on fault-tolerance
Addressed data-intensive applications
Limited memory
MapReduce in Practice
[Diagram: a column of map tasks (M) feeds a shuffle phase, which routes data to a column of reduce tasks (R)]
Can we improve on that?
Pig
Slightly improves disk I/O by consolidating map-only jobs
[Diagram, built over four slides: a pipeline of map phases (M1, M2) and reduce phases (R, R1); Pig merges consecutive map-only jobs into a single map phase, saving a round of disk I/O between jobs]
Pig (at a higher level)
[Diagram: a Pig script as a dataflow of relational operators: FILTER, FOREACH, JOIN, GROUP BY]
RDD
Resilient Distributed Datasets
The core abstraction of Spark, a distributed query processing engine
The Spark counterpart to MapReduce
Designed for in-memory processing
In-memory Processing
The machine specs changed
More reliable
Bigger memory
And the workload changed
Analytical queries
Iterative operations (like ML)
The main idea: rather than storing intermediate results to disk, keep them in memory
How about fault tolerance?
RDD Example
[Diagram: the same dataflow of operators (FILTER, FOREACH, JOIN, GROUP BY), but each intermediate result is kept in memory (Mem) rather than written to disk]
RDD Abstraction
An RDD is a pointer to a distributed dataset
It stores information about how to compute the data rather than where the data is
Transformation: converts an RDD into another RDD
Action: returns the result of an operation over an RDD
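The transformation/action split behaves much like lazy Java Streams. The sketch below is a plain-Java analogy, not the Spark API (the class and method names are invented for illustration): intermediate operations only describe the computation, and no data is processed until a terminal operation asks for a result.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Analogy only: Java Streams show the same transformation/action split
// (intermediate ops are lazy; a terminal op triggers execution).
public class LazyDemo {
    static final AtomicInteger touched = new AtomicInteger();

    // Like a "transformation": builds a pipeline, processes no data yet
    static Stream<String> okLines(List<String> lines) {
        return lines.stream()
                .peek(s -> touched.incrementAndGet()) // counts elements actually processed
                .filter(s -> s.startsWith("200"));
    }

    public static void main(String[] args) {
        Stream<String> pipeline = okLines(List.of("200 ok", "404 missing", "200 ok"));
        System.out.println("touched before the action: " + touched.get()); // 0
        long count = pipeline.count(); // like an "action": forces the pipeline to run
        System.out.println("count=" + count + ", touched=" + touched.get()); // count=2, touched=3
    }
}
```

Until `count()` runs, no element has passed through the pipeline, which is exactly how Spark defers transformations until an action is invoked.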
Spark RDD Features
Lazy execution: collect transformations and execute them on actions (similar to Pig)
Lineage tracking: keep track of the lineage of each RDD for fault tolerance
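Lineage-based fault tolerance can be sketched in a few lines of plain Java. This is a toy model with invented names, not Spark's implementation: each RDD records its parent and the function that derives its data, so a lost partition can be recomputed from the source rather than replicated.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy sketch of lineage (not Spark's real classes): an RDD remembers its
// parent and the transformation, so its data can always be recomputed.
class ToyRDD<T> {
    final ToyRDD<?> parent;                   // null for a source RDD
    final Function<List<?>, List<T>> compute; // how to derive this RDD's data

    ToyRDD(ToyRDD<?> parent, Function<List<?>, List<T>> compute) {
        this.parent = parent;
        this.compute = compute;
    }

    static <T> ToyRDD<T> source(List<T> data) {
        return new ToyRDD<>(null, ignored -> data);
    }

    @SuppressWarnings("unchecked")
    <R> ToyRDD<R> map(Function<T, R> f) {
        return new ToyRDD<>(this, in ->
                ((List<T>) in).stream().map(f).collect(Collectors.toList()));
    }

    // "Action": walk the lineage chain back to the source and recompute.
    List<T> collect() {
        List<?> input = (parent == null) ? List.of() : parent.collect();
        return compute.apply(input);
    }

    public static void main(String[] args) {
        ToyRDD<Integer> nums = ToyRDD.source(List.of(1, 2, 3));
        ToyRDD<Integer> doubled = nums.map(x -> x * 2);
        // If doubled's data were lost, collect() would rebuild it from nums
        // using the recorded function instead of reading a checkpoint.
        System.out.println(doubled.collect()); // [2, 4, 6]
    }
}
```

The key point is that no intermediate data needs to be persisted for recovery; the recipe in the lineage chain is enough.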
RDD
[Diagram: an operation takes an RDD as input and produces a new RDD]
Filter Operation
[Diagram: each partition of the output RDD is computed from exactly one partition of the input RDD]
Similarly, the projection operation (FOREACH in Pig)
Narrow dependency
GroupBy (Shuffle) Operation
[Diagram: each partition of the output RDD depends on records from all partitions of the input RDD]
A similar operation: Join
Wide dependency
Types of Dependencies
Narrow dependencies
Wide dependencies
Credit: https://github.com/rohgar/scala-spark-4/wiki/Wide-vs-Narrow-Dependencies
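The distinction can be sketched with partitions simulated as plain Java lists (invented helper names, not Spark code): a narrow dependency computes each output partition from a single input partition, while a wide dependency must redistribute records across all partitions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Narrow vs. wide dependencies over simulated partitions (plain Java,
// not Spark): a "dataset" here is just a list of partitions.
public class Deps {
    // Narrow: each output partition comes from exactly one input
    // partition, so no data moves between partitions.
    static <T, R> List<List<R>> mapNarrow(List<List<T>> partitions, Function<T, R> f) {
        return partitions.stream()
                .map(p -> p.stream().map(f).collect(Collectors.toList()))
                .collect(Collectors.toList());
    }

    // Wide: every output partition may need records from every input
    // partition; keys are rehashed across nPartitions (a shuffle).
    static <T> List<Map<Integer, List<T>>> groupWide(
            List<List<T>> partitions, Function<T, Integer> key, int nPartitions) {
        List<Map<Integer, List<T>>> out = new ArrayList<>();
        for (int i = 0; i < nPartitions; i++) out.add(new HashMap<>());
        for (List<T> p : partitions)
            for (T t : p) {
                int k = key.apply(t);
                out.get(Math.floorMod(k, nPartitions))
                   .computeIfAbsent(k, x -> new ArrayList<>()).add(t);
            }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = List.of(List.of(1, 2), List.of(3, 4));
        System.out.println(mapNarrow(parts, x -> x + 1)); // [[2, 3], [4, 5]]
        System.out.println(groupWide(parts, x -> x % 2, 2));
    }
}
```

Narrow dependencies can be pipelined and recovered cheaply; wide dependencies require a shuffle, which is why GroupBy and Join are the expensive operations.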
Examples of Transformations
map
flatMap
reduceByKey
filter
sample
join
union
partitionBy
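Several of these transformations have direct analogues on Java Streams. The following is a local, non-distributed sketch of their semantics only (not Spark code; method names are invented):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Java Stream analogues of some RDD transformations; these run locally,
// not distributed across a cluster as in Spark.
public class TransformDemo {
    // flatMap: one input record expands to zero or more output records
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(l -> Arrays.stream(l.split(" ")))
                .collect(Collectors.toList());
    }

    // map + reduceByKey, approximated here by groupingBy + counting
    static Map<String, Long> counts(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a b", "b c");
        System.out.println(words(lines)); // [a, b, b, c]
        System.out.println(counts(words(lines)));
    }
}
```

In Spark the same pipeline would be `flatMap` followed by `mapToPair` and `reduceByKey`, with the grouping step performed as a distributed shuffle.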
Examples of Actions
count
collect
save(path)
persist (note: persist only marks the RDD for caching; it does not itself trigger execution)
reduce
How RDDs Can Be Helpful
Consolidate operations
Combine transformations
Iterative operations
Keep the output of an iteration in memory until the next iteration
Data sharing
Reuse the same data without having to read it multiple times
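The data-sharing benefit can be sketched with a plain-Java stand-in (invented names, not Spark): load once, then let every iteration reuse the in-memory copy, the way a persisted RDD avoids re-reading its input.

```java
import java.util.List;

// Sketch of data sharing for iterative workloads (plain Java, not Spark):
// the expensive load runs once; each iteration reuses the in-memory copy.
public class CacheDemo {
    static int loads = 0;

    static List<Integer> expensiveLoad() { // stands in for reading HDFS
        loads++;
        return List.of(1, 2, 3, 4);
    }

    public static void main(String[] args) {
        List<Integer> cached = expensiveLoad(); // like rdd.persist() + first action
        long sum = 0;
        for (int iter = 0; iter < 10; iter++)   // iterative workload (e.g., ML)
            sum += cached.stream().mapToInt(Integer::intValue).sum();
        System.out.println("sum=" + sum + ", loads=" + loads); // sum=100, loads=1
    }
}
```

Without caching, a MapReduce-style system would re-read the input from disk on every iteration; with it, only the first pass pays the I/O cost.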
Examples
// Initialize the Spark context
JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");

// Hello World! example: count the number of lines in the file
JavaRDD<String> textFileRDD = spark.textFile("nasa_19950801.tsv");
long count = textFileRDD.count();
System.out.println("Number of lines is " + count);
Examples
// Count the number of OK lines
JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) throws Exception {
        String code = s.split("\t")[5];
        return code.equals("200");
    }
});
long count = okLines.count();
System.out.println("Number of OK lines is " + count);
Examples
// Count the number of OK lines
// Shorter implementation using lambdas (Java 8 and above)
JavaRDD<String> okLines = textFileRDD.filter(s -> s.split("\t")[5].equals("200"));
long count = okLines.count();
System.out.println("Number of OK lines is "+count);
Examples
// Parameterize it by taking the input file and response code as command-line arguments
String inputFileName = args[0];
String desiredResponseCode = args[1];
...
JavaRDD<String> textFileRDD = spark.textFile(inputFileName);
JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) {
        String code = s.split("\t")[5];
        return code.equals(desiredResponseCode);
    }
});
Examples
// Count lines by response code
JavaPairRDD<Integer, String> linesByCode = textFileRDD.mapToPair(new PairFunction<String, Integer, String>() {
    @Override
    public Tuple2<Integer, String> call(String s) {
        String code = s.split("\t")[5];
        return new Tuple2<Integer, String>(Integer.valueOf(code), s);
    }
});
Map<Integer, Long> countByCode = linesByCode.countByKey();
System.out.println(countByCode);
Further Reading
Spark home page: http://spark.apache.org/
Quick start: http://spark.apache.org/docs/latest/quick-start.html
RDD documentation: http://spark.apache.org/docs/latest/rdd-programming-guide.html
RDD paper: Matei Zaharia et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI '12