cs014 introduction to data structures and...

10/31/2018 1

Spark RDD

10/31/2018 2

Where are we?

Distributed storage in HDFS

MapReduce query execution in Hadoop

High-level data manipulation using Pig

10/31/2018 3

A Step Back to MapReduce

Designed in the early 2000’s

Machines were unreliable

Focused on fault-tolerance

Addressed data-intensive applications

Limited memory

10/31/2018 4

Shuffle Reduce Map

MapReduce in Practice

10/31/2018 5

M

M

M

…

R

R

R

…

Map

M

M

M

…

Reduce

R

R

R

…

Can we improve on that?

Pig

Slightly improves disk I/O by consolidating

map-only jobs

10/31/2018 7

Shuffle Reduce Map

M2

M2

M2

…

R

R

R

…

Map

M1

M1

M1

…

Map

Pig


map-only jobs

10/31/2018 8

Shuffle Reduce

M2

M2

M2

…

R

R

R

…

M1

M1

M1

…

Pig


map-only jobs

10/31/2018 9

Shuffle Reduce Map

M1

M1

M1

…

R1

R1

R1

…

Map

M2

M2

M2

…

Pig


map-only jobs

10/31/2018 10

Shuffle Reduce Map

M1

M1

M1

…

R1

R1

R1

…

M2

M2

M2

…

Pig (at a higher level)

10/31/2018 11

FILTER FOREACH

FOREACH

JOIN

FILTER

GROUP

BY

RDD

Resilient Distributed Datasets

A distributed query processing engine

The Spark counterpart to MapReduce

Designed for in-memory processing

10/31/2018 12

In-memory Processing

The machine specs change

More reliable

Bigger memory

And the workload changed

Analytical queries

Iterative operations (like ML)

The main idea: Rather than storing

intermediate results to disk, keep them in

memory

How about fault tolerance?

10/31/2018 13

RDD Example

10/31/2018 14

FILTER FOREACH

FOREACH

JOIN

FILTER

GROUP

BY

Mem

Mem

Mem

Mem

RDD Abstraction

RDD is a pointer to a distributed dataset

Stores information about how to compute the

data rather than where the data is

Transformation: Converts an RDD to another

RDD

Action: Returns an answer of an operation

over an RDD

10/31/2018 15

Spark RDD Features

Lazy execution: Collect transformations and

execute on actions (Similar to Pig)

Lineage tracking: Keep track of the lineage of

each RDD for fault-tolerance

10/31/2018 16

RDD

10/31/2018 17

RDD RDD

Operation

Filter Operation

10/31/2018 18

RDD RDD

Filter

Similarly, projection operation

(ForEach in Pig)

Narrow dependency

GroupBy (Shuffle) Operation

10/31/2018 19

RDD RDD

Group By

Similar operation Join

Wide dependency

Types of Dependencies

Narrow dependencies

Wide dependencies

10/31/2018 20

Credit: https://github.com/rohgar/scala-spark-4/wiki/Wide-vs-Narrow-Dependencies

Examples of Transformations

map

flatMap

reduceByKey

filter

sample

join

union

partitionBy

10/31/2018 21

Examples of Actions

count

collect

save(path)

persist

reduce

10/31/2018 22

How RDD can be helpful

Consolidate operations

Combine transformations

Iterative operations

Keep the output of an iteration in memory till the

next iteration

Data sharing

Reuse the same data without having to read it

multiple times

10/31/2018 23

Examples

# Initialize the Spark context

JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");

10/31/2018 24

Examples

# Initialize the Spark context

JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");

# Hello World! Example. Count the number of lines in the file

JavaRDD<String> textFileRDD = spark.textFile("nasa_19950801.tsv");

long count = textFileRDD.count();

System.out.println("Number of lines is "+count);

10/31/2018 25

Examples

# Count the number of OK lines

JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {

@Override

public Boolean call(String s) throws Exception {

String code = s.split("\t")[5];

return code.equals("200");

}

});

long count = okLines.count();

System.out.println("Number of OK lines is "+count);

10/31/2018 26

Examples

# Count the number of OK lines

# Shorten the implementation using lambdas (Java 8 and above)

JavaRDD<String> okLines = textFileRDD.filter(s -> s.split("\t")[5].equals("200"));

long count = okLines.count();

System.out.println("Number of OK lines is "+count);

10/31/2018 27

Examples

# Make it parametrized by taking the response code as a command line argument

String inputFileName = args[0];

String desiredResponseCode = args[1];

...

JavaRDD<String> textFileRDD = spark.textFile(inputFileName);

JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {

@Override

public Boolean call(String s) {


return code.equals(desiredResponseCode);

}

}); 10/31/2018 28

Examples

# Count by response code

JavaPairRDD<Integer, String> linesByCode = textFileRDD.mapToPair(new PairFunction<String, Integer, String>() {

@Override

public Tuple2<Integer, String> call(String s) {


return new Tuple2<Integer, String>(Integer.valueOf(code), s);

}

});

Map<Integer, Long> countByCode = linesByCode.countByKey();

System.out.println(countByCode);

10/31/2018 29

Further Reading

Spark home page: http://spark.apache.org/

Quick start:

http://spark.apache.org/docs/latest/quick-

start.html

RDD documentation:

http://spark.apache.org/docs/latest/rdd-

programming-guide.html

RDD Paper: Matei Zaharia et al. "Resilient Distributed

Datasets: A Fault-tolerant Abstraction for In-memory

Cluster Computing." NSDI’12

10/31/2018 30

http://spark.apache.org/

http://spark.apache.org/

http://spark.apache.org/docs/latest/quick-start.html




http://spark.apache.org/docs/latest/rdd-programming-guide.html






cs014 introduction to data structures and...

Documents