Leveraging Parallel Data Processing Frameworks with Verified Lifting

Maaz Ahmad, Alvin Cheung

Post on 19-Nov-2021

Page 9: Leveraging Parallel Data Processing Frameworks with

Motivation

[Diagram: a Data Collection Tool feeds Data into a Data Analytics Application (Sequential Java).]

"I need something faster."

2

Page 14: Leveraging Parallel Data Processing Frameworks with

Parallel Processing Frameworks

Which one is right for me?

How do I program in this?

I will have to rewrite my application!

A rewrite might introduce bugs.

3

Page 15: Leveraging Parallel Data Processing Frameworks with

How can we make life easier?

Java To Hadoop Compiler

4

Page 16: Leveraging Parallel Data Processing Frameworks with

How can we make life easier?

Java To Spark Compiler

5

Page 19: Leveraging Parallel Data Processing Frameworks with

Syntax Directed Rules

• Hard to come up with rules

• Brittle to code pattern changes

Sequential code:

for(int i = 0; i < data; i++){ }

Verified Lifting produces a summary:

fm(val) →

fr(val1, val2) →

output = reduce(map(data, fm), fr);

Syntax Directed Rules then translate the summary to:

mapper(key, data){ }

reducer(key, values){ }

How do we do this?

- Program analysis

- Synthesis

- Theorem prover

6

Page 20: Leveraging Parallel Data Processing Frameworks with

Introducing CASPER

• Re-targets sequential Java code fragments to Hadoop/Spark frameworks.

• Input: Unannotated sequential Java application source code.

• Output: Translated application source code that runs on top of Hadoop/Spark to leverage its parallel execution.

7

Page 23: Leveraging Parallel Data Processing Frameworks with

MapReduce Overview

[Diagram: Input Data is divided into Data Splits, and each split is processed by a Mapper. The Mappers generate key-value pairs, (Key1, value), (Key2, value), (Key3, value), and each pair is routed by its key to Reducer key1, Reducer key2, or Reducer key3.]

8
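The data flow in this overview can be sketched in plain Java. This is only an in-memory illustration, not Hadoop or Spark code: the class `MiniMapReduce`, the word-count mapper and reducer, and the sample splits are all assumptions made for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the MapReduce data flow (not a framework API).
class MiniMapReduce {

    // Mapper: turns one data split into (key, value) pairs, here (word, 1).
    static List<Map.Entry<String, Integer>> mapper(List<String> split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reducer: combines all values that share a key, here by summing them.
    static int reducer(String key, List<Integer> values) {
        int res = 0;
        for (int value : values) {
            res += value;
        }
        return res;
    }

    static Map<String, Integer> run(List<List<String>> splits) {
        // Shuffle: group the generated pairs by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<String> split : splits) {
            for (Map.Entry<String, Integer> pair : mapper(split)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // One reducer call per distinct key.
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            output.put(e.getKey(), reducer(e.getKey(), e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("b", "c"),
            Arrays.asList("a", "b"));
        System.out.println(run(splits)); // prints {a=2, b=3, c=1}
    }
}
```

In a real framework the mapper calls run on different machines and the shuffle moves data over the network; the sketch only shows which value reaches which reducer.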

Page 27: Leveraging Parallel Data Processing Frameworks with

Verified Lifting

• Infer code semantics (summary) in a high-level specification

• A summary describes the effect of code on the output variables

Java Code Fragment:

data_sqr = 0;
for(int i = 0; i < data.size(); i++){
    data_sqr += data[i] * data[i];
}

Summary (Post-condition):

$$data\_sqr \equiv \sum_{i=0}^{data.size()-1} data[i]^2$$

• Specifications must be trivial to translate.

• Program specification exhibits good parallelism.

9
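To make the relationship between the fragment and its summary concrete, here is a minimal sketch that runs both on the same input. The names `SumOfSquares`, `sequential`, and `summary` are hypothetical, the fragment is written against a `List<Integer>` with `get(i)` rather than the slide's `data[i]`, and a Java parallel stream stands in for a real Hadoop or Spark job.

```java
import java.util.Arrays;
import java.util.List;

class SumOfSquares {

    // The sequential fragment from the slide, adapted to List<Integer>.
    static int sequential(List<Integer> data) {
        int dataSqr = 0;
        for (int i = 0; i < data.size(); i++) {
            dataSqr += data.get(i) * data.get(i);
        }
        return dataSqr;
    }

    // The summary read as a program: map each element to its square,
    // then reduce with +. Addition is associative with identity 0,
    // so the reduction may safely run in parallel.
    static int summary(List<Integer> data) {
        return data.parallelStream().map(x -> x * x).reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4);
        System.out.println(sequential(data)); // prints 30
        System.out.println(summary(data));    // prints 30
    }
}
```

Agreement on a handful of inputs is only evidence, not proof, which is why the approach pairs the summary with a verification step.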

Page 31: Leveraging Parallel Data Processing Frameworks with

Code Summaries in Casper

$$\forall v \in outputVariables.\quad v \equiv f_{reduce}(v_0,\ reduce(map(data, f_{map}), f_{reduce}))$$

Where $f_{map}$ and $f_{reduce}$ are synthesized for each code fragment.

10
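One way to read the summary form is as a small evaluator. The sketch below is an assumption-laden illustration, not CASPER's implementation: it handles a single output variable (so keys are omitted), treats $v_0$ as the value of $v$ before the loop, and returns $v_0$ unchanged when nothing is emitted. The names `SummaryForm`, `evaluate`, `emitIfEven`, and `sumOfEvens` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

class SummaryForm {

    // map(data, fMap): apply fMap to every element; each call may emit 0..n values.
    static <T, V> List<V> map(List<T> data, Function<T, List<V>> fMap) {
        List<V> emitted = new ArrayList<>();
        for (T element : data) {
            emitted.addAll(fMap.apply(element));
        }
        return emitted;
    }

    // reduce(values, fReduce): fold a non-empty list of emitted values.
    static <V> V reduce(List<V> values, BinaryOperator<V> fReduce) {
        V acc = values.get(0);
        for (int i = 1; i < values.size(); i++) {
            acc = fReduce.apply(acc, values.get(i));
        }
        return acc;
    }

    // v = fReduce(v0, reduce(map(data, fMap), fReduce))
    static <T, V> V evaluate(V v0, List<T> data,
                             Function<T, List<V>> fMap, BinaryOperator<V> fReduce) {
        List<V> emitted = map(data, fMap);
        if (emitted.isEmpty()) {
            return v0; // assumption: no emits leaves the output variable unchanged
        }
        return fReduce.apply(v0, reduce(emitted, fReduce));
    }

    // Example fMap: emit the element only when it is even.
    static List<Integer> emitIfEven(Integer x) {
        if (x % 2 == 0) {
            return Collections.singletonList(x);
        }
        return Collections.emptyList();
    }

    // Summary of: for (x : data) if (x % 2 == 0) v += x;  with v starting at v0.
    static int sumOfEvens(int v0, List<Integer> data) {
        Function<Integer, List<Integer>> fMap = SummaryForm::emitIfEven;
        BinaryOperator<Integer> fReduce = Integer::sum;
        return evaluate(v0, data, fMap, fReduce);
    }

    public static void main(String[] args) {
        System.out.println(sumOfEvens(10, Arrays.asList(1, 2, 3, 4))); // prints 16
    }
}
```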

Page 32: Leveraging Parallel Data Processing Frameworks with

Restricting Search Space

• Use Syntax-Guided Synthesis (SyGuS) to generate $f_{map}$ and $f_{reduce}$.

• Use a grammar to specify a set of candidate summaries.

• Grammar is dynamically generated for each code fragment.

11

Page 33: Leveraging Parallel Data Processing Frameworks with

Grammar Generation: fmap

• The body of $f_{map}$ is just a sequence of emits.

• Begin with number of emits equal to number of output variables.

• Incrementally add emit statements up to a user-defined bound.

Map → Map Map | Emit

Emit → emit(Key, Value); | if (Condition) emit(Key, Value);

Key → IntExp | StringExp | BoolExp | …

Value → IntExp | StringExp | BoolExp | …

12
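As an illustration of what a body derived from this grammar might look like, the sketch below writes one unconditional and one conditional emit by hand. The two output variables (`"sum"` and `"positives"`), the class `EmitSequence`, and the choice of keys, values, and condition are all assumptions made for the example; nothing here was produced by a synthesizer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class EmitSequence {

    // A map body with two emits, one per hypothetical output variable.
    static List<Map.Entry<String, Integer>> fMap(int x) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        // Emit → emit(Key, Value);
        out.add(new SimpleEntry<>("sum", x));
        // Emit → if (Condition) emit(Key, Value);
        if (x > 0) {
            out.add(new SimpleEntry<>("positives", 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fMap(3));  // prints [sum=3, positives=1]
        System.out.println(fMap(-1)); // prints [sum=-1]
    }
}
```

Starting with one emit per output variable keeps the initial candidates small; extra emits are only worth exploring when no candidate of the current size works.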

Page 34: Leveraging Parallel Data Processing Frameworks with

Grammar Generation: fmap

• The key and value for each emit are generated using expression grammars.

Java Code Fragment:

data_sqr = 0;
for(int i = 0; i < data.size(); i++){
    data_sqr += data[i] * data[i];
}

Integer Expression Grammar:

IntExp → IntExp + IntExp | IntExp * IntExp | data[IntExp] | IntVal

IntVal → data_sqr | i | literal

13
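To get a feel for how quickly the candidate space grows, the integer expression grammar can be enumerated up to a fixed nesting depth. The sketch below is a naive enumerator, not how a SyGuS solver explores the space; the class name `IntExpGrammar` and the restriction of `literal` to 0 and 1 are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class IntExpGrammar {

    // Enumerates every IntExp (as a string) with nesting depth <= depth.
    static List<String> enumerate(int depth) {
        List<String> exps = new ArrayList<>();
        // IntVal → data_sqr | i | literal   (literals limited to 0 and 1 here)
        exps.add("data_sqr");
        exps.add("i");
        exps.add("0");
        exps.add("1");
        if (depth == 0) {
            return exps;
        }
        List<String> smaller = enumerate(depth - 1);
        for (String a : smaller) {
            exps.add("data[" + a + "]");           // IntExp → data[IntExp]
            for (String b : smaller) {
                exps.add("(" + a + " + " + b + ")"); // IntExp → IntExp + IntExp
                exps.add("(" + a + " * " + b + ")"); // IntExp → IntExp * IntExp
            }
        }
        return exps;
    }

    public static void main(String[] args) {
        System.out.println(enumerate(0).size()); // prints 4
        System.out.println(enumerate(1).size()); // prints 40
        System.out.println(enumerate(2).size()); // prints 3244
        System.out.println(enumerate(2).contains("(data[i] * data[i])")); // prints true
    }
}
```

The expression needed for the running example, `data[i] * data[i]`, first appears at depth 2, by which point the space already holds thousands of candidates. That growth is the motivation for tailoring the grammar to each code fragment.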

Page 35: Leveraging Parallel Data Processing Frameworks with

Grammar Generation: freduce

• The body of $f_{reduce}$ implements a fold operation.

Java Code Fragment:

data_sqr = 0;
for(int i = 0; i < data.size(); i++){
    data_sqr += data[i] * data[i];
}

Fold Expression Grammar:

Reduce → int res = literal; for (value : values) res = FoldExp; emit(key, res);

FoldExp → FoldExp + FoldExp | FoldExp * FoldExp | IntVal

IntVal → res | val | key | literal

14
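The `Reduce` production is a fixed template with two holes, the initial `literal` and the `FoldExp`. A minimal sketch of that template, with the holes passed in as parameters, might look like the following. The class `FoldTemplate` and its method are hypothetical, the key is omitted, and returning `res` stands in for `emit(key, res)`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntBinaryOperator;

class FoldTemplate {

    // Reduce → int res = literal; for (value : values) res = FoldExp; emit(key, res);
    // `literal` and `foldExp` are the holes a synthesizer would fill in.
    static int reduce(int literal, IntBinaryOperator foldExp, List<Integer> values) {
        int res = literal;
        for (int value : values) {
            res = foldExp.applyAsInt(res, value);
        }
        return res; // stands in for emit(key, res)
    }

    public static void main(String[] args) {
        List<Integer> squares = Arrays.asList(1, 4, 9);
        // Candidate 1: literal = 0, FoldExp = res + val
        System.out.println(reduce(0, (res, val) -> res + val, squares)); // prints 14
        // Candidate 2: literal = 1, FoldExp = res * val
        System.out.println(reduce(1, (res, val) -> res * val, squares)); // prints 36
    }
}
```

For the sum-of-squares fragment only the first candidate matches the loop's result on these values.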

Page 36: Leveraging Parallel Data Processing Frameworks with

Verifying Equivalence

• CASPER uses Hoare-style verification conditions.

• Verification conditions are the weakest pre-conditions for the post-condition (code summary) to hold.

• Proving post-conditions for code fragments containing loops requires loop invariants.

15

Page 40: Leveraging Parallel Data Processing Frameworks with

Verifying Equivalence Pt. 2

data_sqr = 0;
for(int i = 0; i < data.size(); i++){
    data_sqr += data[i] * data[i];
}

16
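For this fragment, the Hoare-style obligations can be written out explicitly. The invariant below is one plausible choice and is an assumption of this sketch, as are the names $Inv$ and $Post$; the slides do not show the invariant CASPER actually generates.

```latex
\begin{align*}
Post(data, data\_sqr) &\equiv data\_sqr = \sum_{j=0}^{|data|-1} data[j]^2 \\
Inv(data, i, data\_sqr) &\equiv 0 \le i \le |data| \;\wedge\; data\_sqr = \sum_{j=0}^{i-1} data[j]^2 \\[4pt]
\text{Initiation:}\quad & Inv(data, 0, 0) \\
\text{Continuation:}\quad & Inv(data, i, data\_sqr) \wedge i < |data| \implies Inv(data, i+1, data\_sqr + data[i]^2) \\
\text{Termination:}\quad & Inv(data, i, data\_sqr) \wedge \neg(i < |data|) \implies Post(data, data\_sqr)
\end{align*}
```

Initiation holds because the empty sum is 0. Termination follows because $i \le |data|$ together with $\neg(i < |data|)$ forces $i = |data|$, at which point the invariant's sum coincides with the post-condition.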

Page 41: Leveraging Parallel Data Processing Frameworks with

Formal Verification

• We have modelled the MapReduce library in Dafny.

• The generated summary is compiled down to Dafny code.

• Code annotations are automatically generated. These include:

• Verification conditions

• Proof lemmas

17

Page 42: Leveraging Parallel Data Processing Frameworks with

Lemma Example

lemma InductiveStep(data: seq<int>, i: int, data_sqr: int)
    requires invariant(data, i, data_sqr) && i < |data|
    ensures invariant(data, i + 1, data_sqr + (data[i] * data[i]))
{
    assert map(data, i + 1) == fmap(data, i) + map(data, i);
    assert freduce(fmap(data, i), 0) == data[i] * data[i];
}

18

Page 47: Leveraging Parallel Data Processing Frameworks with

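CASPER Architecture Diagram

[Diagram: Original Source Code → Program Analyzer → Grammar → Candidate Solution Generator → Candidate Summary → Bounded Model Checker → Candidate Summary → Theorem Prover → Verified Summary → Code Generator → Hadoop / Spark Code. The Candidate Solution Generator is seeded with random Input Examples and receives a Counter-example from the Bounded Model Checker when a candidate is rejected. "Failed" edges lead back from later stages to earlier ones. Tool labels on the diagram: SKETCH, Dafny, Polyglot.]

19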
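The inner loop of this diagram, a candidate generator exchanging candidates and counter-examples with a bounded checker, can be sketched in a few lines. Everything in the sketch is a simplifying assumption: the class `CegisSketch`, the fixed three-candidate list, the single fixed starting example in place of random inputs, and the tiny input domain. The theorem-prover stage is not modelled at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntBinaryOperator;
import java.util.function.IntUnaryOperator;

class CegisSketch {

    // Reference semantics: the original sequential loop.
    static int original(int[] data) {
        int dataSqr = 0;
        for (int i = 0; i < data.length; i++) {
            dataSqr += data[i] * data[i];
        }
        return dataSqr;
    }

    // A candidate summary: fold fReduce over fMap(x) for each x, starting at 0.
    static final class Candidate {
        final String name;
        final IntUnaryOperator fMap;
        final IntBinaryOperator fReduce;

        Candidate(String name, IntUnaryOperator fMap, IntBinaryOperator fReduce) {
            this.name = name;
            this.fMap = fMap;
            this.fReduce = fReduce;
        }

        int eval(int[] data) {
            int res = 0;
            for (int x : data) {
                res = fReduce.applyAsInt(res, fMap.applyAsInt(x));
            }
            return res;
        }
    }

    // Bounded check: every array of length <= 2 over {-2..2}.
    // Returns a counter-example, or null if the candidate agrees everywhere.
    static int[] boundedCheck(Candidate c) {
        int[] domain = {-2, -1, 0, 1, 2};
        List<int[]> inputs = new ArrayList<>();
        inputs.add(new int[] {});
        for (int a : domain) {
            inputs.add(new int[] {a});
            for (int b : domain) {
                inputs.add(new int[] {a, b});
            }
        }
        for (int[] in : inputs) {
            if (c.eval(in) != original(in)) {
                return in;
            }
        }
        return null;
    }

    static String synthesize() {
        List<Candidate> candidates = Arrays.asList(
            new Candidate("sum of x", x -> x, Integer::sum),
            new Candidate("sum of |x|", x -> Math.abs(x), Integer::sum),
            new Candidate("sum of x*x", x -> x * x, Integer::sum));
        List<int[]> examples = new ArrayList<>();
        examples.add(new int[] {1}); // stands in for a random input example
        for (Candidate c : candidates) {
            boolean consistent = true;
            for (int[] ex : examples) {
                if (c.eval(ex) != original(ex)) {
                    consistent = false; // rejected by a known example
                    break;
                }
            }
            if (!consistent) {
                continue;
            }
            int[] counterExample = boundedCheck(c);
            if (counterExample == null) {
                return c.name; // would next be handed to the theorem prover
            }
            examples.add(counterExample); // feed the counter-example back
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(synthesize()); // prints sum of x*x
    }
}
```

In this run the first candidate agrees on the input `{1}` but the bounded checker finds `{-2}`; that counter-example then rejects the second candidate without another bounded check, and the third candidate passes.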

Page 48: Leveraging Parallel Data Processing Frameworks with

Evaluation

• Compilation performance

• Run-time performance

• Five benchmarks:

- Summation

- Word Count

- String Search (Grep)

- Linear Regression

- 3D Histogram

20

Page 49: Leveraging Parallel Data Processing Frameworks with

Compilation Performance

Benchmark         | Program Analysis | Synthesis and BMC | # of Grammar Iterations | Formal Verification
Summation         | < 1s             | 13s               | 1                       | 2.8s
Word Count        | < 1s             | 44s               | 1                       | 3.4s
String Match      | < 1s             | 1406s             | 2                       | 3.3s
3D Histogram      | < 1s             | 2355s             | 2                       | 4.2s
Linear Regression | < 1s             | 1801s             | 2                       | 4.8s

21

Page 50: Leveraging Parallel Data Processing Frameworks with

Runtime Performance

Benchmark: String Matching (Grep)

• Configuration:

- 10-node cluster

- 8 vCPUs, 15 GB memory

- HDFS for data storage

- Hadoop 2.7.2 and Spark 1.6.1

• Average speedup:

- 6.1x on Spark

- 3.3x on Hadoop

22

Page 51: Leveraging Parallel Data Processing Frameworks with

Demo!

23

Page 52: Leveraging Parallel Data Processing Frameworks with

Summary

[Diagram: CASPER converts the Data Analytics Application to Spark; the Data Collection Tool feeds Data into the Data Analytics Application (Spark).]

Web-page: http://tinyurl.com/casper-homepage

Mailing-list: http://tinyurl.com/casper-subscribe

24