Distributed Systems
Distributed computation with Spark
Abraham Bernstein, Ph.D.
Course material based on slides by Reynold Xin and Tudor Lapusan, with additions by Johannes Schneider
Distributed Systems, Univ. of Zurich, Fall 2009 31/10/16
How good is Map/Reduce?
• Abstraction: simple?
• Automatic distribution of (data and) tasks
• Platform agnostic
• Performance
Map/Reduce is not so simple…
• Not easy to program directly in Map/Reduce
• Most real applications require multiple steps
• Iterative algorithms (e.g. PageRank): tens of steps
• Analytics queries (e.g. count & top-k): 2–5 steps
⇒ Each step needs its own map and reduce class
⇒ Boilerplate, spaghetti-like code
Higher level frameworks
• Simpler to use than Map/Reduce
• Examples: HiveQL, Pig, Spark
• Built on top of Hadoop (use at least some parts of Hadoop)
• (often can) generate Map/Reduce jobs
Spark
• Simpler to program: nicer syntax, no explicit map/reduce classes
• Faster execution
• How? Two key points:
• Generalized directed acyclic graphs (DAGs) for computation
• Faster data sharing: don't write intermediate results to disk
• How to achieve fault tolerance if data is held in RAM? ⇒ RDDs
Spark Ecosystem
• Under development (Spark released 2014)
• This course: Spark Core Engine only
Resilient Distributed Dataset (RDD)
• Collection of (data) elements
• Held on disk or in RAM
• Can be distributed across different nodes
• Programmer can "persist"/"cache" RDDs
• Kept in memory for faster access
• System may remove (delete) them from RAM if it needs the space
• RDDs are immutable
• Transformations create new RDDs from old ones
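The immutability idea can be seen without Spark: a "transformation" never modifies the input collection, it derives a new one. A minimal plain-Java sketch (class and variable names are my own, not Spark's):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ImmutableDemo {
    public static void main(String[] args) {
        // An immutable "dataset", analogous to an RDD.
        List<String> lines = List.of("Error SQL", "Task done");

        // A transformation builds a NEW collection; the old one is unchanged.
        List<String> upper = lines.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());

        System.out.println(lines);  // [Error SQL, Task done] (original untouched)
        System.out.println(upper);  // [ERROR SQL, TASK DONE] (new, derived collection)
    }
}
```

Because the parent data is never mutated, a derived RDD can always be rebuilt from it, which is exactly what makes recomputation-based fault tolerance possible.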
Operations on RDD
• Transformations
• f(RDD) => RDD
• Lazy evaluation: not computed immediately
• Actions
• Trigger computation
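Java Streams show the same lazy/eager split, so the idea can be demonstrated without a Spark cluster (a sketch by analogy, not Spark API): intermediate operations like filter only describe work, and nothing runs until a terminal operation, which plays the role of a Spark action.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();

        // Building the pipeline runs nothing yet (like an RDD transformation).
        Stream<String> pipeline = List.of("Error a", "ok", "Error b").stream()
                .filter(s -> { calls.incrementAndGet(); return s.contains("Error"); });
        System.out.println("predicate calls after filter: " + calls.get());  // 0

        // The terminal operation (like a Spark action) triggers the work.
        long n = pipeline.count();
        System.out.println("predicate calls after count: " + calls.get());   // 3
        System.out.println("matching lines: " + n);                          // 2
    }
}
```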
Transformations and Actions
[Table of transformations and actions; a transformation maps an RDD of element type T to an RDD of element type U]
Reminder: Java Syntax
• Assign a function to a variable
• Pass functions as parameters
• Functional interfaces

interface FlatMapFunction<T, R> {   // T: argument type, R: return type
    Iterable<R> call(T t);
}

FlatMapFunction<String, String> myFunc =
    new FlatMapFunction<String, String>() {
        public Iterable<String> call(String s) {
            return Arrays.asList(s.split(" "));
        }
    };

myFunc.call("This is first.");  // => Iterable => "This", "is", "first."

public void flatMapSet(FlatMapFunction<String, String> mapper) {...}
flatMapSet(myFunc);
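The slide's interface can be reproduced as a runnable, self-contained program; since Java 8, a lambda can also replace the anonymous class (a sketch with a hand-rolled interface, not Spark's own FlatMapFunction):

```java
import java.util.Arrays;

public class FlatMapDemo {
    // Hand-rolled functional interface mirroring the slide's FlatMapFunction<T, R>.
    interface FlatMapFunction<T, R> {
        Iterable<R> call(T t);
    }

    public static void main(String[] args) {
        // A lambda is equivalent to the anonymous-class version on the slide.
        FlatMapFunction<String, String> myFunc = s -> Arrays.asList(s.split(" "));

        System.out.println(myFunc.call("This is first."));  // [This, is, first.]
    }
}
```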
Example: count lines containing the word "Error" in a file
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
public class SimpleApp {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("YOUR_SPARK_HOME/log.txt");
JavaRDD<String> linesWithError =
    lines.filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("Error"); }
    });
long nLines = linesWithError.count();
System.out.println("Lines with Errors: " + nLines);
  }
}
Example: log.txt
10:00 | Error SQL Syntax | Task 1 done
11:02 | Worker added | Error php 12
11:04 | Task 3 done
Wordcount in Spark
• mapFunction operates per data item/"line"
• flatMap unifies the results into one list

Example:
  Input lines:        This is first. / This second.
  After flatMap:      This, is, first., This, second.
  After map to pairs: (This,1), (is,1), (first.,1), (This,1), (second.,1)
  After reduceByKey:  (This,2), (is,1), (first.,1), (second.,1)
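The same stages can be run locally with plain Java streams (a sketch by analogy, not Spark's wordcount API): flatMap splits lines into words, and grouping-with-counting plays the role of the (word,1) pairs plus reduceByKey.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCount {
    public static void main(String[] args) {
        Stream<String> lines = Stream.of("This is first.", "This second.");

        Map<String, Long> counts = lines
                // flatMap: each line becomes words; all words end up in one stream
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // groupingBy + counting stands in for (word,1) pairs + reduceByKey
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts.get("This"));  // 2
        System.out.println(counts.get("is"));    // 1
    }
}
```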
RDDs creation
• Create an initial RDD from some data
• E.g. from HDFS: "hdfs://myFile.txt"

lines = sc.textFile("hdfs://myFile.txt")
RDDs during computation
lines = sc.textFile(...)
linesWithError = lines.filter(new Function<String, Boolean>() {…});
linesWithError.count();
Example: count error messages containing "SQL", "php", …

JavaRDD<String> linesWithError =
    lines.filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("Error"); }
    });

JavaRDD<List<String>> messages =
    linesWithError.map(new Function<String, List<String>>() {
        // split takes a regex, so the pipe must be escaped
        public List<String> call(String s) { return Arrays.asList(s.split("\\|")); }
    });
messages.cache();

JavaRDD<List<String>> msgsSQL = messages.filter(…s.contains("SQL")…);
long nSQLMsgs = msgsSQL.count();
JavaRDD<List<String>> msgsPHP = messages.filter(…s.contains("php")…);
long nPHPMsgs = msgsPHP.count();
Example: Count error messages with “SQL”, “php”,…
lines = sc.textFile("hdfs://...")

RDD 1 (lines):
  10:00 | Error SQL Syntax | Task 1 done
  11:02 | Worker added | Error php 12
  11:04 | Task 3 done
RDD 2 (linesWithError):
  10:00 | Error SQL Syntax | Task 1 done
  11:02 | Worker added | Error php 12
RDD 3 (messages, each line split into fields):
  [10:00, Error SQL Syntax, Task 1 done]
  [11:02, Worker added, Error php 12]
RDD 4 (msgsSQL): Error SQL Syntax
RDD 5 (msgsPHP): Error php 12
Example: Directed Acyclic Graph
• Dependencies among RDDs form a directed acyclic graph:
  RDD 1 (lines) → RDD 2 (linesWithError) → RDD 3 (messages) → RDD 4 (msgsSQL), RDD 5 (msgsPHP)
Directed Acyclic Graph: Map/Reduce vs. Spark
• Dependencies: between map/reduce results in Hadoop vs. between RDDs in Spark
RDD Recreation
• Automatically recompute (parts of) an RDD if lost
• Due to deletion/removal of the RDD by the system (to free RAM)
• Due to a fault, e.g. crash of a machine
• Track the transformations and the (parts of) RDDs used in them
• Start from the last RDD stored on disk (checkpoint)
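The recomputation idea can be sketched in a few lines of plain Java (a toy illustration with made-up names, not Spark's implementation): each derived dataset remembers its parent and its transformation, so a lost result is rebuilt from that lineage instead of being replicated.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LineageDemo {
    // Toy "RDD": remembers its parent data and its transformation, so a lost
    // result can be recomputed instead of replicated (hypothetical names).
    static class ToyRDD<T, R> {
        final List<T> parent;            // checkpointed input (e.g. on disk)
        final Function<T, R> transform;  // the recorded transformation
        List<R> cached;                  // may be dropped under memory pressure

        ToyRDD(List<T> parent, Function<T, R> transform) {
            this.parent = parent;
            this.transform = transform;
        }

        List<R> get() {
            if (cached == null) {        // lost or evicted: recompute from lineage
                cached = parent.stream().map(transform).collect(Collectors.toList());
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        ToyRDD<String, Integer> lengths =
                new ToyRDD<>(List.of("ab", "cde"), String::length);
        System.out.println(lengths.get());  // [2, 3], computed once
        lengths.cached = null;              // simulate eviction or node failure
        System.out.println(lengths.get());  // [2, 3], transparently recomputed
    }
}
```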