
Page 1: Fast, Interactive, Language-Integrated Cluster Computingjfc/DataMining/SP12/lecs/lec11.pdf · Fast, Interactive, Language-Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury,

Spark Fast, Interactive, Language-Integrated Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

UC BERKELEY

Overview

Spark is a parallel framework that provides:

» Efficient primitives for in-memory data sharing
» Simple programming interface in Scala
» High generality (superset of many existing models)

This talk will cover:

» What it does
» How people are using it (including surprises)
» Future research directions

Motivation

MapReduce democratized “big data” analysis by offering a simple programming model for large, unreliable clusters

But as soon as it got popular, users wanted more:

» More complex, multi-stage applications
» More interactive queries
» More low-latency online processing

Motivation

Complex jobs, interactive queries and online processing all need one thing that MapReduce lacks:

Efficient primitives for data sharing

[Figure: three workloads that share data: an iterative job (Stage 1, Stage 2, Stage 3), interactive mining (Query 1, Query 2, Query 3), and stream processing (Job 1, Job 2)]

Motivation

MapReduce and related models are based on data flow from stable storage to stable storage

[Figure: Input → Map tasks → Reduce tasks → Output]

Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures

Problem: the only abstraction for data sharing is stable storage (slow!)

Example: Iterative Apps

[Figure: Input feeding iteration 1, iteration 2, iteration 3, ... to produce result 1, result 2, result 3; and a chain Input, iter. 1, iter. 2, ... with an HDFS read and an HDFS write between steps]

Goal: In-Memory Data Sharing

[Figure: the same patterns (Input, iteration 1, iteration 2, iteration 3, ...; Input, iter. 1, iter. 2, ...), with data kept in distributed memory after one-time processing of the input]

10-20x faster than network & disk


Challenge

How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Challenge

Existing distributed storage abstractions have interfaces based on fine-grained updates

» Reads and writes to cells in a table
» E.g. databases, key-value stores, distributed memory

Requires replicating data or logs across nodes for fault tolerance, which is expensive!

» 10-20x slower than memory write…

Solution: Resilient Distributed Datasets (RDDs)

Provide an interface based on coarse-grained operations (map, group-by, join, …)

Efficient fault recovery using lineage

» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails

RDD Recovery

[Figure: the distributed-memory diagram (Input, one-time processing, iteration 1, iteration 2, iteration 3, ...; Input, iter. 1, iter. 2, ...), titled RDD Recovery]

Generality of RDDs

RDDs can express surprisingly many parallel algorithms

» These naturally apply the same operation to many items

Capture many current programming models

» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: Pregel, iterative MapReduce, bulk incremental, …
» New apps that these models don’t capture

Outline

Programming interface

Examples

User applications

Implementation

Demo

Current work


Spark Programming Interface

Language-integrated API in Scala

Provides: »Resilient distributed datasets (RDDs)

• Partitioned collections with controllable caching

»Operations on RDDs

• Transformations (define RDDs), actions (compute results)

»Restricted shared variables (broadcast, accumulators)

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

    val lines = spark.textFile("hdfs://...")            // base RDD
    val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
    val messages = errors.map(_.split('\t')(2))
    val cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count          // action
    cachedMsgs.filter(_.contains("bar")).count
    . . .

[Figure: the Driver sends tasks to three Workers and receives results; each Worker reads one block of the file (Block 1, Block 2, Block 3) and holds one cache (Cache 1, Cache 2, Cache 3)]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
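The log-mining pipeline can be tried without a cluster. The sketch below is only a local stand-in: it replaces the RDD with a plain Scala `Vector`, uses made-up log lines, and assumes tab-separated fields with the message in the third field, as the `split('\t')(2)` call suggests. The names `LogMiningSketch`, `sampleLog` and `countMatching` are hypothetical and not part of Spark.

```scala
object LogMiningSketch {
  // Made-up log lines (assumption): level, timestamp, message, separated by tabs.
  val sampleLog: Vector[String] = Vector(
    "ERROR\t10:01\tfoo service timed out",
    "INFO\t10:02\tfoo service restarted",
    "ERROR\t10:03\tbar queue is full",
    "ERROR\t10:04\tfoo and bar both failed"
  )

  // Same steps as the slide, run eagerly on a local collection instead of an RDD.
  val errors: Vector[String] = sampleLog.filter(_.startsWith("ERROR"))
  val messages: Vector[String] = errors.map(_.split('\t')(2))

  // Local stand-in for filter(...).count: how many error messages contain the pattern.
  def countMatching(pattern: String): Int = messages.count(_.contains(pattern))
}
```

Here `messages` plays the role of `cachedMsgs`: it is built once, and each call to `countMatching` reuses it rather than re-reading and re-filtering the input.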

Fault Recovery

RDDs track lineage information that can be used to efficiently reconstruct lost partitions

Ex:

    val messages = textFile(...).filter(_.startsWith("ERROR"))
                                .map(_.split('\t')(2))

[Figure: lineage chain HDFS File → filter (func = _.startsWith(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
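To make the lineage idea concrete, here is a minimal sketch in plain Scala. It is not Spark's implementation: `Dataset`, `Source`, `Filtered`, `Mapped`, `cache` and `read` are hypothetical names, and the two-block input is made up. Each derived dataset records only its parent and the function applied, so a partition dropped from the cache can be rebuilt by replaying those functions on the matching input partition.

```scala
object LineageSketch {
  // A dataset is either a re-readable source, or a parent plus the
  // coarse-grained operation that derives it from that parent.
  sealed trait Dataset {
    // Recompute one partition from scratch by walking the lineage.
    def compute(partition: Int): Vector[String]
  }

  final case class Source(partitions: Vector[Vector[String]]) extends Dataset {
    def compute(partition: Int): Vector[String] = partitions(partition)
  }

  final case class Filtered(parent: Dataset, keep: String => Boolean) extends Dataset {
    def compute(partition: Int): Vector[String] = parent.compute(partition).filter(keep)
  }

  final case class Mapped(parent: Dataset, f: String => String) extends Dataset {
    def compute(partition: Int): Vector[String] = parent.compute(partition).map(f)
  }

  // Made-up input (assumption) standing in for a file with two blocks.
  val file: Source = Source(Vector(
    Vector("ERROR\ta\tdisk full", "INFO\tb\tok"),
    Vector("ERROR\tc\tnet down")
  ))

  val messages: Dataset =
    Mapped(Filtered(file, _.startsWith("ERROR")), _.split('\t')(2))

  // In-memory cache of computed partitions; an entry may be lost on failure.
  val cache = scala.collection.mutable.Map.empty[Int, Vector[String]]

  // Serve from the cache when possible, otherwise recompute from lineage.
  def read(partition: Int): Vector[String] =
    cache.getOrElseUpdate(partition, messages.compute(partition))
}
```

Calling `LineageSketch.cache.remove(1)` simulates losing a partition; the next `read(1)` recomputes only that partition, and nothing extra is paid when no entry is lost.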

Fault Recovery Results

[Figure: iteration time (s) for iterations 1-10, with series "No Failure" and "Failure in the 6th Iteration". Labelled iteration times (s), iterations 1 to 10: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59]

Example: Logistic Regression

Goal: find best line separating two sets of points

[Figure: two sets of points, with a random initial line and the target line]

    val data = spark.textFile(...).map(readPoint).cache()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final w: " + w)
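The same loop can be run on a local collection to see what each iteration computes. This is a sketch under stated assumptions, not the talk's code: vectors are plain `Array[Double]`, labels `y` are +1.0 or -1.0, the initial weights are drawn uniformly from [-1, 1) with a fixed seed, and the four data points are made up. `Point`, `dot`, `pointGradient`, `train` and `predict` are hypothetical helper names.

```scala
object LogisticRegressionSketch {
  // y is the class label, assumed to be +1.0 or -1.0.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (u, v) => u + v }

  // Per-point term from the slide: (1 / (1 + exp(-y * (w dot x))) - 1) * y * x
  def pointGradient(w: Array[Double], p: Point): Array[Double] = {
    val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
    p.x.map(_ * scale)
  }

  def train(data: Seq[Point], iterations: Int, seed: Long): Array[Double] = {
    val rng = new scala.util.Random(seed)
    val dims = data.head.x.length
    var w = Array.fill(dims)(rng.nextDouble() * 2 - 1) // random initial line
    for (_ <- 1 to iterations) {
      // map over all points, then reduce by vector addition, as on the slide
      val gradient = data.map(p => pointGradient(w, p)).reduce(add)
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    w
  }

  // Made-up, linearly separable toy data; the second coordinate is a constant bias input.
  val toyData: Seq[Point] = Seq(
    Point(Array(2.0, 1.0), 1.0),
    Point(Array(3.0, 1.0), 1.0),
    Point(Array(-2.0, 1.0), -1.0),
    Point(Array(-3.0, 1.0), -1.0)
  )

  // Classify by the side of the line that x falls on.
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0) 1.0 else -1.0
}
```

In the Spark version only `data` lives on the cluster: `w` is a small driver-side variable shipped with each iteration's closure, which is why caching `data` matters so much for this workload.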

Logistic Regression Performance

[Figure: running time (s) vs number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]

Hadoop: 127 s / iteration

Spark: first iteration 174 s, further iterations 6 s
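A quick back-of-the-envelope check of what these per-iteration figures imply. It assumes total running time is simply the sum of the quoted per-iteration costs (an assumption; the chart's exact values are not reproduced here), and `IterationCostSketch` is a hypothetical name.

```scala
object IterationCostSketch {
  // Figures quoted on the slide, in seconds.
  val hadoopPerIteration = 127
  val sparkFirstIteration = 174
  val sparkLaterIteration = 6

  // Hadoop pays the same cost on every iteration.
  def hadoopTotal(iterations: Int): Int = hadoopPerIteration * iterations

  // Spark pays more on the first iteration, then a small cost afterwards.
  def sparkTotal(iterations: Int): Int =
    if (iterations <= 0) 0
    else sparkFirstIteration + sparkLaterIteration * (iterations - 1)
}
```

Under this model Hadoop is ahead only for a single iteration (127 s vs 174 s); at 30 iterations the totals are 3810 s and 348 s.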

Example: Collaborative Filtering

Goal: predict users’ movie ratings based on past ratings of other movies

R = 1 ? ? 4 5 ? 3 ? ? 3 5 ? ? 3 5 ? 5 ? ? ? 1 4 ? ? ? ? 2 ?

(R is the ratings matrix, with axes labelled Movies and Users)

Model and Algorithm

Model R as product of user and movie feature matrices A and B of size U×K and M×K

$R = A B^{T}$

Alternating Least Squares (ALS)

» Start with random A & B
» Optimize user vectors (A) based on movies
» Optimize movie vectors (B) based on users
» Repeat until converged

Serial ALS

    var R = readRatingsMatrix(...)
    var A = // array of U random vectors
    var B = // array of M random vectors
    for (i <- 1 to ITERATIONS) {
      A = (0 until U).map(i => updateUser(i, B, R))   // (0 until U) is a Range object
      B = (0 until M).map(i => updateMovie(i, A, R))  // (0 until M) is a Range object
    }

Naïve Spark ALS

    var R = readRatingsMatrix(...)
    var A = // array of U random vectors
    var B = // array of M random vectors
    for (i <- 1 to ITERATIONS) {
      A = spark.parallelize(0 until U, numSlices)
               .map(i => updateUser(i, B, R))
               .collect()
      B = spark.parallelize(0 until M, numSlices)
               .map(i => updateMovie(i, A, R))
               .collect()
    }

Problem: R re-sent to all nodes in each iteration

Efficient Spark ALS

    var R = spark.broadcast(readRatingsMatrix(...))
    var A = // array of U random vectors
    var B = // array of M random vectors
    for (i <- 1 to ITERATIONS) {
      A = spark.parallelize(0 until U, numSlices)
               .map(i => updateUser(i, B, R.value))
               .collect()
      B = spark.parallelize(0 until M, numSlices)
               .map(i => updateMovie(i, A, R.value))
               .collect()
    }

Solution: mark R as broadcast variable

Result: 3× performance improvement

Scaling Up Broadcast

[Figure: two panels, "Initial version (HDFS)" and "Cornet P2P broadcast", each showing iteration time (s), split into communication and computation, for 10, 30, 60 and 90 machines]

[Chowdhury et al, SIGCOMM 2011]

Other RDD Operations

Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, cogroup, flatMap, union, join, cross, mapValues, ...

Actions (output a result): collect, reduce, take, fold, count, saveAsTextFile, saveAsHadoopFile, ...
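Several of these operations have close analogues on ordinary Scala collections, which gives a feel for how they compose. The word count below is a local sketch only: it runs eagerly on a `Seq`, whereas RDD transformations merely define a new RDD until an action is called. The input lines are made up and `OperationsSketch` is a hypothetical name.

```scala
object OperationsSketch {
  // Made-up input lines (assumption).
  val lines: Seq[String] = Seq("to be or", "not to be")

  // Steps named after the transformations listed on the slide.
  val words: Seq[String] = lines.flatMap(_.split(" ").toSeq) // flatMap
  val pairs: Seq[(String, Int)] = words.map(w => (w, 1))     // map

  // Local stand-in for reduceByKey(_ + _): group by key, then reduce each group.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).reduce(_ + _)) }

  // Stand-ins for actions: each turns the dataset into a plain value.
  val distinctWords: Int = counts.size          // like count
  val total: Int = counts.values.fold(0)(_ + _) // like fold
}
```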



Spark Users

User Applications

EM alg. for traffic prediction (Mobile Millennium)

In-memory OLAP & anomaly detection (Conviva)

Interactive queries on streamed data (Quantifind)

Twitter spam classification (Monarch)

Time-series analysis


Mobile Millennium Project

Estimate city traffic using GPS observations from probe vehicles (e.g. SF taxis)


Sample Data

Tim Hunter, with the support of the Mobile Millennium team P.I. Alex Bayen (traffic.berkeley.edu)

Challenge

Data is noisy and sparse (1 sample/minute)

Must infer path taken by each vehicle in addition to travel time distribution on each link


Solution

EM algorithm to estimate paths and travel time distributions simultaneously

[Figure: data flow observations → (flatMap) → weighted path samples → (groupByKey) → link parameters → (broadcast)]

Results

3× speedup from caching, 4.5× from broadcast

[Hunter et al, SOCC 2011]


Implementation

Runs on the Mesos cluster manager, letting it share resources with Hadoop

Can read from any Hadoop input source (HDFS, S3, …)

No changes to Scala compiler

Easy to run locally and on EC2

[Figure: Spark, Hadoop and MPI running on top of Mesos, which spans four nodes]

Scheduler

Dryad-like task DAG

Pipelines functions within a stage

Cache-aware for data reuse & locality

Partitioning-aware to avoid shuffles

[Figure: example DAG over RDDs A to G, built with groupBy, map, union and join and divided into Stage 1, Stage 2 and Stage 3; a legend marks cached partitions]

Language Integration

Scala closures are Serializable Java objects

» Serialize on driver, load & run on workers

Not quite enough

» Nested closures may reference entire outer scope
» May pull in non-Serializable variables not used inside

Solution: bytecode analysis + reflection

Interactive Spark

Modified Scala interpreter to allow Spark to be used interactively from the command line

»Altered code generation to make each “line” typed have references to objects it depends on

»Added facility to ship generated classes to workers

Enables in-memory exploration of big data


1. Generality of RDDs

RDDs can express many proposed data-parallel programming models:

» MapReduce, DryadLINQ
» Bulk incremental processing
» Pregel graph processing
» Iterative MapReduce (e.g. HaLoop)
» SQL

(These models all apply the same operation to multiple items)

Allow apps to efficiently intermix these models

Models We Are Building

Pregel on Spark (Bagel)

» 200 lines of code

HaLoop on Spark

» 200 lines of code

Hive on Spark (Shark)

» 3000 lines of code
» Compatible with Apache Hive
» Machine learning ops. in Scala

2. Streaming Spark

Provide similar interface for stream processing

Leverage RDDs for efficient fault recovery

Intermix with batch & interactive jobs

    tweetStream
      .flatMap(_.toLowerCase.split(" "))
      .map(word => (word, 1))
      .reduceByWindow(5, _ + _)

[Figure: at each time step (T=1, T=2, ...), a map stage feeds reduceByWindow]
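The windowed word count can be imitated on local collections to show what a per-window result looks like. This is a sketch under assumptions, not the streaming API from the talk: the stream is modelled as a made-up sequence of batches, one per time step, and the window is measured in batches (the slide does not say what unit its `5` is in). `WindowedCountSketch`, `countBatch`, `merge` and this local `reduceByWindow` are hypothetical names.

```scala
object WindowedCountSketch {
  // Made-up stream (assumption): one batch of tweets per time step (T=1, T=2, ...).
  val batches: Seq[Seq[String]] = Seq(
    Seq("Spark is fast"),
    Seq("spark is interactive"),
    Seq("fast and simple")
  )

  // Word counts for one batch: flatMap to words, map to (word, 1), sum per word.
  def countBatch(tweets: Seq[String]): Map[String, Int] =
    tweets
      .flatMap(_.toLowerCase.split(" ").toSeq)
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, ps) => (word, ps.map(_._2).sum) }

  // Merge two count maps by adding the counts of matching words.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  // One result per window position: counts over `window` consecutive batches.
  def reduceByWindow(window: Int): Seq[Map[String, Int]] =
    batches.map(countBatch).sliding(window).map(_.reduce(merge)).toSeq
}
```

Each batch is counted once and the per-batch results are merged per window; recomputing a window from retained batch results is the kind of reuse that lineage-based recovery relies on.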

Challenges: latency, incremental operators, scalable scheduling, partial results


3. Bridging Batch Processing and User-Facing Services

One of the surprising uses of Spark has been to answer live queries from web app users

» E.g. Quantifind data mining app ingests data periodically and builds an in-memory index

Makes sense: want to use the same data structures for back-end and front-end computation

How can we support this better?

» Random access, replication for latency, inspection, …


Conclusion

Spark’s RDDs offer a simple and efficient programming model for a broad range of apps

Solid foundation for higher-level abstractions

Try our open source release:

www.spark-project.org

Related Work

DryadLINQ, FlumeJava

» Similar “distributed collection” API, but cannot reuse datasets efficiently across queries

GraphLab, Piccolo, BigTable, RAMCloud

» Fine-grained writes requiring replication or checkpoints

Iterative MapReduce (e.g. Twister, HaLoop)

» Implicit data sharing for a fixed computation pattern

Relational databases

» Lineage/provenance, logical logging, materialized views

Caching systems (e.g. Nectar)

» Store data in files, no explicit control over what is cached

Behavior with Not Enough RAM

[Figure: iteration time (s) vs % of working set in memory]

Cache disabled: 68.8 s
25%: 58.1 s
50%: 40.7 s
75%: 29.7 s
Fully cached: 11.5 s