cs014 introduction to data structures and...
TRANSCRIPT
![Page 1: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/1.jpg)
10/31/2018 1
![Page 2: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/2.jpg)
Spark RDD
10/31/2018 2
![Page 3: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/3.jpg)
Where are we?
Distributed storage in HDFS
MapReduce query execution in Hadoop
High-level data manipulation using Pig
10/31/2018 3
![Page 4: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/4.jpg)
A Step Back to MapReduce
Designed in the early 2000’s
Machines were unreliable
Focused on fault-tolerance
Addressed data-intensive applications
Limited memory
10/31/2018 4
![Page 5: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/5.jpg)
Shuffle Reduce Map
MapReduce in Practice
10/31/2018 5
M
M
M
…
R
R
R
…
Map
M
M
M
…
Reduce
R
R
R
…
Can we improve on that?
![Page 6: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/6.jpg)
Pig
Slightly improves disk I/O by consolidating
map-only jobs
10/31/2018 7
Shuffle Reduce Map
M2
M2
M2
…
R
R
R
…
Map
M1
M1
M1
…
![Page 7: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/7.jpg)
Map
Pig
Slightly improves disk I/O by consolidating
map-only jobs
10/31/2018 8
Shuffle Reduce
M2
M2
M2
…
R
R
R
…
M1
M1
M1
…
![Page 8: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/8.jpg)
Pig
Slightly improves disk I/O by consolidating
map-only jobs
10/31/2018 9
Shuffle Reduce Map
M1
M1
M1
…
R1
R1
R1
…
Map
M2
M2
M2
…
![Page 9: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/9.jpg)
Pig
Slightly improves disk I/O by consolidating
map-only jobs
10/31/2018 10
Shuffle Reduce Map
M1
M1
M1
…
R1
R1
R1
…
M2
M2
M2
…
![Page 10: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/10.jpg)
Pig (at a higher level)
10/31/2018 11
FILTER FOREACH
FOREACH
JOIN
FILTER
GROUP
BY
![Page 11: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/11.jpg)
RDD
Resilient Distributed Datasets
A distributed query processing engine
The Spark counterpart to MapReduce
Designed for in-memory processing
10/31/2018 12
![Page 12: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/12.jpg)
In-memory Processing
The machine specs change
More reliable
Bigger memory
And the workload changed
Analytical queries
Iterative operations (like ML)
The main idea: Rather than storing
intermediate results to disk, keep them in
memory
How about fault tolerance?
10/31/2018 13
![Page 13: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/13.jpg)
RDD Example
10/31/2018 14
FILTER FOREACH
FOREACH
JOIN
FILTER
GROUP
BY
Mem
Mem
Mem
Mem
![Page 14: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/14.jpg)
RDD Abstraction
RDD is a pointer to a distributed dataset
Stores information about how to compute the
data rather than where the data is
Transformation: Converts an RDD to another
RDD
Action: Returns an answer of an operation
over an RDD
10/31/2018 15
![Page 15: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/15.jpg)
Spark RDD Features
Lazy execution: Collect transformations and
execute on actions (Similar to Pig)
Lineage tracking: Keep track of the lineage of
each RDD for fault-tolerance
10/31/2018 16
![Page 16: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/16.jpg)
RDD
10/31/2018 17
RDD RDD
Operation
![Page 17: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/17.jpg)
Filter Operation
10/31/2018 18
RDD RDD
Filter
Similarly, projection operation
(ForEach in Pig)
Narrow dependency
![Page 18: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/18.jpg)
GroupBy (Shuffle) Operation
10/31/2018 19
RDD RDD
Group By
Similar operation Join
Wide dependency
![Page 19: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/19.jpg)
Types of Dependencies
Narrow dependencies
Wide dependencies
10/31/2018 20
Credit: https://github.com/rohgar/scala-spark-4/wiki/Wide-vs-Narrow-Dependencies
![Page 20: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/20.jpg)
Examples of Transformations
map
flatMap
reduceByKey
filter
sample
join
union
partitionBy
10/31/2018 21
![Page 21: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/21.jpg)
Examples of Actions
count
collect
save(path)
persist
reduce
10/31/2018 22
![Page 22: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/22.jpg)
How RDD can be helpful
Consolidate operations
Combine transformations
Iterative operations
Keep the output of an iteration in memory till the
next iteration
Data sharing
Reuse the same data without having to read it
multiple times
10/31/2018 23
![Page 23: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/23.jpg)
Examples
# Initialize the Spark context
JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");
10/31/2018 24
![Page 24: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/24.jpg)
Examples
# Initialize the Spark context
JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");
# Hello World! Example. Count the number of lines in the file
JavaRDD<String> textFileRDD = spark.textFile("nasa_19950801.tsv");
long count = textFileRDD.count();
System.out.println("Number of lines is "+count);
10/31/2018 25
![Page 25: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/25.jpg)
Examples
# Count the number of OK lines
JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String s) throws Exception {
String code = s.split("\t")[5];
return code.equals("200");
}
});
long count = okLines.count();
System.out.println("Number of OK lines is "+count);
10/31/2018 26
![Page 26: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/26.jpg)
Examples
# Count the number of OK lines
# Shorten the implementation using lambdas (Java 8 and above)
JavaRDD<String> okLines = textFileRDD.filter(s -> s.split("\t")[5].equals("200"));
long count = okLines.count();
System.out.println("Number of OK lines is "+count);
10/31/2018 27
![Page 27: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/27.jpg)
Examples
# Make it parametrized by taking the response code as a command line argument
String inputFileName = args[0];
String desiredResponseCode = args[1];
...
JavaRDD<String> textFileRDD = spark.textFile(inputFileName);
JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String s) {
String code = s.split("\t")[5];
return code.equals(desiredResponseCode);
}
}); 10/31/2018 28
![Page 28: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/28.jpg)
Examples
# Count by response code
JavaPairRDD<Integer, String> linesByCode = textFileRDD.mapToPair(new PairFunction<String, Integer, String>() {
@Override
public Tuple2<Integer, String> call(String s) {
String code = s.split("\t")[5];
return new Tuple2<Integer, String>(Integer.valueOf(code), s);
}
});
Map<Integer, Long> countByCode = linesByCode.countByKey();
System.out.println(countByCode);
10/31/2018 29
![Page 29: CS014 Introduction to Data Structures and Algorithmseldawy/18FCS226/slides/CS226-10-31-RDD.pdfResilient Distributed Datasets A distributed query processing engine The Spark counterpart](https://reader036.vdocuments.mx/reader036/viewer/2022070801/5f029c577e708231d4051fbc/html5/thumbnails/29.jpg)
Further Reading
Spark home page: http://spark.apache.org/
Quick start:
http://spark.apache.org/docs/latest/quick-
start.html
RDD documentation:
http://spark.apache.org/docs/latest/rdd-
programming-guide.html
RDD Paper: Matei Zaharia et al. "Resilient Distributed
Datasets: A Fault-tolerant Abstraction for In-memory
Cluster Computing." NSDI’12
10/31/2018 30