An Enhanced MapReduce Model (on BSP)
TRANSCRIPT
Architecture of MapReduce
This is the standard MapReduce processing flow: 1. MAP, 2. Shuffle (sort omitted), 3. REDUCE
Suppose we have a 3-node cluster. Inside the cluster, there is a file which is split into 6 splits
There are 3 slots in total for parallel MAP tasks (one per node)
When the MAP task Tm1 finishes, Tm4 will be spawned at node1
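The slot-based scheduling above can be sketched as a small simulation (hypothetical Python, not Hadoop code): 6 splits compete for 3 slots, so tasks run in waves, and Tm4 starts as soon as the slot held by Tm1 is freed.

```python
import heapq

def schedule_waves(num_splits, num_slots, durations):
    """Simulate slot-based MAP scheduling: a new task starts
    as soon as a slot is freed (greedy, like Hadoop's task slots)."""
    slots = [0.0] * num_slots      # each slot is free at time 0
    heapq.heapify(slots)
    start_times = []
    for task_id in range(num_splits):
        free_at = heapq.heappop(slots)          # earliest-free slot
        start_times.append((f"Tm{task_id + 1}", free_at))
        heapq.heappush(slots, free_at + durations[task_id])
    return start_times

# 6 splits, 3 slots, every task takes 10 time units:
# Tm1..Tm3 form the first wave at t=0; Tm4..Tm6 start at t=10
waves = schedule_waves(6, 3, [10.0] * 6)
print(waves)
```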
batch-oriented
Architecture of MapReduce
Batch execution model: the entire output of each map and reduce task is
materialized to a local file before it can be consumed by the next stage
Such materialization is often argued to be inefficient,
but it is an important part of MapReduce's fault-tolerance (FT) strategy
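A toy sketch of this batch handoff (hypothetical Python; a local file stands in for the files Hadoop writes): the map output is fully written before reduce reads it, which is exactly why a failed reduce task can simply re-read the file.

```python
import json, tempfile, os
from collections import defaultdict

def run_map(records, out_path):
    """Materialize the entire map output to a local file first."""
    with open(out_path, "w") as f:
        for key, value in records:
            f.write(json.dumps([key, value]) + "\n")

def run_reduce(in_path):
    """Reduce only starts after the map file is complete; if the
    reduce task crashes, it just re-reads the same file (FT)."""
    groups = defaultdict(list)
    with open(in_path) as f:
        for line in f:
            key, value = json.loads(line)
            groups[key].append(value)
    return {k: sum(v) for k, v in groups.items()}

path = os.path.join(tempfile.mkdtemp(), "map_out.txt")
run_map([("a", 1), ("b", 2), ("a", 3)], path)
result = run_reduce(path)
print(result)  # {'a': 4, 'b': 2}
```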
Architecture of MapReduce
Want to make some changes? If we introduce barriers, or functions that keep previously spawned tasks running, some MAP tasks might be blocked.
Long-running MAP tasks actually change the whole system's behavior: scheduling, fault tolerance, and so on. This cannot be simply implemented.
MapReduce Online [NSDI'10], Pregel [SIGMOD'10], MapReduce vs. BSP [ICCS'12]
Modifications and Alternatives
MapReduce Online (HOP), Google's Pregel (BSP), Hadoop Hama [CloudCom'10] (BSP)
Long Running Jobs
HOP (Hadoop Online Prototype): long-running jobs; data are pipelined between tasks and between jobs; approximations of results are available before jobs finish; retains the fault-tolerance properties of Hadoop; programming interfaces are almost the same
HOP Details (inside a job)
MAP and REDUCE tasks exist simultaneously
Pipelines between MAP and REDUCE: results are sent from a MAP process to a REDUCE process; the output of a MAP process is buffered in memory
Scheduling of MAP and REDUCE tasks resolves the blocking problems (free slots and so on); omitted here
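The pipelining described above can be sketched as follows (hypothetical Python, not the actual HOP implementation): map output is buffered in memory and flushed to the reducer in batches, so the reducer can already produce an early view of the result.

```python
from collections import defaultdict

class PipelinedReducer:
    """Receives map output incrementally instead of waiting for a file."""
    def __init__(self):
        self.groups = defaultdict(list)

    def receive(self, batch):
        for key, value in batch:
            self.groups[key].append(value)

    def snapshot(self):
        # an early, approximate view of the result (online aggregation)
        return {k: sum(v) for k, v in self.groups.items()}

def pipelined_map(records, reducer, buffer_size=2):
    """Buffer map output in memory; flush to the reducer when full."""
    buffer = []
    for key, value in records:
        buffer.append((key, value))
        if len(buffer) >= buffer_size:
            reducer.receive(buffer)   # pipelined send, not a disk file
            buffer = []
    if buffer:
        reducer.receive(buffer)

red = PipelinedReducer()
pipelined_map([("a", 1), ("b", 2), ("a", 3)], red)
print(red.snapshot())  # {'a': 4, 'b': 2}
```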
HOP Details (between jobs)
The reduce tasks of one job can optionally pipeline their output directly to the map tasks of the next job, sidestepping the need for expensive fault-tolerant storage in HDFS
In some sense, this “overlaps” the 1st REDUCE step and the 2nd MAP step (not truly overlapped)
HOP Functionality
Online aggregation: single-job online aggregation (SQL queries, ...); multi-job online aggregation
Continuous queries: process stream data (MapReduce jobs that run
continuously, accepting new data as it becomes available and analyzing it immediately)
Monitoring …
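A minimal sketch of a continuous query (hypothetical Python, not HOP's API): the job consumes new data as it arrives and emits the current aggregate immediately, instead of waiting for the whole input.

```python
def continuous_average(stream):
    """A 'continuous query': consume new data as it arrives and
    emit the current (approximate) aggregate immediately."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count   # early answer before the 'job' finishes

# each element of the list stands for a newly arrived record
estimates = list(continuous_average([10, 20, 30]))
print(estimates)  # [10.0, 15.0, 20.0]
```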
Evaluation
Omitted; in general, for some problems HOP is much faster
Paper: MapReduce Online [NSDI'10]
BSP-Style Frameworks
Pregel and Hama: a different programming interface (PI); long-running services (tasks); prefer in-memory processing (Pregel)
Hama Examples
Different from MapReduce, the main PI is a compute function (for a vertex)
Hama Examples
Or a bsp function (for iterative computation)
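The superstep structure behind Hama's bsp function can be simulated in a few lines (hypothetical Python; Hama's real API is Java, with calls such as send and sync on a peer object). Each superstep is local computation plus message exchange, separated by a global barrier; the example computes a global maximum.

```python
def run_supersteps(num_peers, compute, max_steps):
    """Simulate BSP: each superstep = local compute + message exchange,
    separated by a global barrier (the end of each loop iteration)."""
    inboxes = [[] for _ in range(num_peers)]
    states = list(range(num_peers))   # each peer starts with its own id
    for _ in range(max_steps):
        outboxes = [[] for _ in range(num_peers)]
        for peer in range(num_peers):
            states[peer] = compute(peer, states[peer], inboxes[peer], outboxes)
        inboxes = outboxes            # barrier: all messages delivered at once
    return states

def max_compute(peer, value, messages, outboxes):
    """Each peer adopts the max of its value and incoming messages,
    then broadcasts it (a classic BSP 'global maximum' example)."""
    value = max([value] + messages)
    for other in range(len(outboxes)):
        if other != peer:
            outboxes[other].append(value)
    return value

# after 2 supersteps every peer knows the global maximum
print(run_supersteps(4, max_compute, max_steps=2))  # [3, 3, 3, 3]
```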
A Summary
HOP changes Hadoop tasks' behavior but keeps almost the same programming interfaces and programming patterns:
map and reduce functions; MAP* + REDUCE pattern
BSP provides a different style of PIs and also different programming patterns:
compute and bsp functions (sync, sendMessage, ...); super-step pattern
My Proposal
A more flexible MapReduce: combines the advantages of both MapReduce and
BSP; a small step from the work of HOP; a small step from the work of BSP
MapReduce ( +BSP )
New patterns: MAP* + REDUCEG*;
REDUCEL* + MAP* + REDUCEG*. MAP* = receiveMsg + MAP + sendMsg + sync; REDUCE* = receiveMsg + REDUCE + sendMsg + sync
MapReduce-style batch processing
BSP (Hama) style: receiveMsg/sendMsg + sync
Long-running tasks
Executor: map/reduce; an Executor holds the map and reduce functions
Indexed Executors (each has an id and a name)
Architecture of MapReduce*
Executors are long-running processes
In the MAP phase, each executor invokes its map method on each input item
While map processing is in progress, “messages” can be added to the “message box”
The messages are sent asynchronously, and a BSP-style barrier ensures that all messages are delivered and received before the output is generated (note that the output could be empty)
Architecture of MapReduce*
Similar to the MAP phase, in the REDUCE phase executors invoke the reduce function on their input lists
Still, messages can be sent and received
Architecture of MapReduce*
Programming patterns (need more analysis): not necessarily always MAP → REDUCE, but also
REDUCE → MAP. This REDUCE is a local REDUCE; in Hadoop we usually
use map to implement it, but it is actually a local REDUCE.
The ordering of MAP and REDUCE phases should actually be free: with MapReduce*, logical MAP and REDUCE
phases do not cause heavy memory-to-disk synchronization, so we can arrange MAP and REDUCE freely
Lightweight MAP/REDUCE Phases
For example, for scan Hadoop needs a two-phase MAP:
the 1st MAP tasks compute (local) sums of each split; the 2nd MAP tasks compute the final result (I omit the 1st
REDUCE)
With MapReduce*, these two phases are computed by the same Executors
There is no need to spawn new MAP tasks, and even no need to re-read the input file (but in case
we don't have enough memory, we can still simply re-open the input splits)
[Usually, writing to disk or transferring through the network is more costly than reading from the local file system]
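The scan example can be sketched like this (hypothetical Python; the real prototype exchanges the local sums as messages between Java executors): each persistent executor keeps its split and local sum in memory across phases, so the second phase needs no new tasks and no re-read of the input.

```python
class ScanExecutor:
    """A persistent executor that keeps its split and local sum in
    memory across phases: no new tasks, no re-reading the input."""
    def __init__(self, split):
        self.split = split
        self.local_sum = None

    def phase1(self):
        # 1st MAP: local sum of this executor's split
        self.local_sum = sum(self.split)
        return self.local_sum

    def phase2(self, all_local_sums, my_index):
        # 2nd MAP: prefix-sum scan, using the sums exchanged between
        # executors (stand-in for the message exchange + barrier)
        offset = sum(all_local_sums[:my_index])
        out, acc = [], offset
        for x in self.split:
            acc += x
            out.append(acc)
        return out

splits = [[1, 2], [3, 4], [5, 6]]
executors = [ScanExecutor(s) for s in splits]
local_sums = [ex.phase1() for ex in executors]           # [3, 7, 11]
scan = [ex.phase2(local_sums, i) for i, ex in enumerate(executors)]
print(scan)  # [[1, 3], [6, 10], [15, 21]]
```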
Architecture of MapReduce*
Makes model/program transformations much easier
… this needs to be proven. Currently I have implemented Scan/Accumulation and the results look good.
Lower cost than original Hadoop/MapReduce
Compatible with original Hadoop/MapReduce
programs; at which level to keep the compatibility needs to be
considered (future work)
Examples
A MAP task programming interface: map and addMsg (the current implementation is just a prototype and still uses some Hama APIs under the hood)
Context
A Summary
Combines the advantages of both HOP and BSP
Avoids the heavy “materialization” between MAP and
REDUCE
Efficient communication between MAP tasks, between
REDUCE tasks, and from MAP tasks to REDUCE tasks
Intermediate state can be inherited from the
MAP phase to the REDUCE phase (through long-running Executors)
Messages can also be materialized (for fault tolerance); they need not stay in memory (saves memory and is good
for FT)
A Summary (continued)
No harm to fault tolerance (as currently understood)
Keeps the programming interfaces of MapReduce (almost the same)
A more flexible style than Hadoop/HOP
Compatible with original Hadoop/MapReduce
programs (depends on the implementation)
Current Status
We have a simplified prototype: implemented using Hama (messages, sync) and
Hadoop (HDFS); workable (tested with some examples, with good
performance)
Further work: theoretical analysis of the programming patterns; implementation (1 month)
Performance
100 × 2^20 items (200 MB)
2-pass MR (Liu's impl.): 23 s + 24 s
1-pass MR (Tung's impl.): 3-4 min (the job failed due to the input data)
MapReduce*: 22 s
I have test results from bigger data sets; omitted here