An Enhanced MapReduce Model (on BSP)
TRANSCRIPT
Architecture of MapReduce
This is the standard MapReduce processing flow: 1. MAP, 2. Shuffle (sort omitted), 3. REDUCE
Suppose we have a 3-node cluster. Inside the cluster, there is a file which is split into 6 splits
There are 3 slots in total for parallel MAP tasks (one per node)
When the MAP task Tm1 finishes, Tm4 will be spawned at node1
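The slot-based scheduling above can be sketched as a small simulation (hypothetical Python, not Hadoop code): 6 splits compete for 3 slots, so tasks run in waves, and Tm4 starts as soon as the slot held by Tm1 is freed.

```python
import heapq

def schedule_waves(num_splits, num_slots, durations):
    """Simulate slot-based MAP scheduling: a new task starts
    as soon as a slot is freed (greedy, like Hadoop's task slots)."""
    slots = [0.0] * num_slots      # each slot is free at time 0
    heapq.heapify(slots)
    start_times = []
    for task_id in range(num_splits):
        free_at = heapq.heappop(slots)          # earliest-free slot
        start_times.append((f"Tm{task_id + 1}", free_at))
        heapq.heappush(slots, free_at + durations[task_id])
    return start_times

# 6 splits, 3 slots, every task takes 10 time units:
# Tm1..Tm3 form the first wave at t=0; Tm4..Tm6 start at t=10
waves = schedule_waves(6, 3, [10.0] * 6)
print(waves)
```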
batch-oriented
Architecture of MapReduce
Batch execution model: the entire output of each map and reduce task is
materialized to a local file before it can be consumed by the next stage
Such materialization is often argued to be inefficient,
but it is an important part of MapReduce's fault-tolerance (FT) strategy
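A toy sketch of this batch handoff (hypothetical Python; a local file stands in for the files Hadoop writes): the map output is fully written before reduce reads it, which is exactly why a failed reduce task can simply re-read the file.

```python
import json, tempfile, os
from collections import defaultdict

def run_map(records, out_path):
    """Materialize the entire map output to a local file first."""
    with open(out_path, "w") as f:
        for key, value in records:
            f.write(json.dumps([key, value]) + "\n")

def run_reduce(in_path):
    """Reduce only starts after the map file is complete; if the
    reduce task crashes, it just re-reads the same file (FT)."""
    groups = defaultdict(list)
    with open(in_path) as f:
        for line in f:
            key, value = json.loads(line)
            groups[key].append(value)
    return {k: sum(v) for k, v in groups.items()}

path = os.path.join(tempfile.mkdtemp(), "map_out.txt")
run_map([("a", 1), ("b", 2), ("a", 3)], path)
result = run_reduce(path)
print(result)  # {'a': 4, 'b': 2}
```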
Architecture of MapReduce
Want to make some changes? If we introduce barriers, or functions that keep previously spawned tasks running, some MAP tasks might be blocked.
Long-running MAP tasks actually change the whole system's behavior: scheduling, fault tolerance, and so on. This cannot be simply implemented.
MapReduce Online [NSDI'10], Pregel [SIGMOD'10], MapReduce vs. BSP [ICCS'12]
Modifications and Alternatives
MapReduce Online (HOP), Google's Pregel (BSP), Hadoop Hama [CloudCom'10] (BSP)
Long Running Jobs
HOP (Hadoop Online Prototype): long-running jobs; data are pipelined between tasks and between jobs; approximations of results are available before jobs finish; retains the fault-tolerance properties of Hadoop; programming interfaces are almost the same
HOP Details (inside a job)
MAP and REDUCE tasks exist simultaneously
Pipelines between MAP and REDUCE: results are sent from a MAP process to a REDUCE process; the output of a MAP process is buffered in memory
Scheduling of MAP and REDUCE tasks resolves the blocking problems (free slots and so on); omitted here
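The pipelining described above can be sketched as follows (hypothetical Python, not the actual HOP implementation): map output is buffered in memory and flushed to the reducer in batches, so the reducer can already produce an early view of the result.

```python
from collections import defaultdict

class PipelinedReducer:
    """Receives map output incrementally instead of waiting for a file."""
    def __init__(self):
        self.groups = defaultdict(list)

    def receive(self, batch):
        for key, value in batch:
            self.groups[key].append(value)

    def snapshot(self):
        # an early, approximate view of the result (online aggregation)
        return {k: sum(v) for k, v in self.groups.items()}

def pipelined_map(records, reducer, buffer_size=2):
    """Buffer map output in memory; flush to the reducer when full."""
    buffer = []
    for key, value in records:
        buffer.append((key, value))
        if len(buffer) >= buffer_size:
            reducer.receive(buffer)   # pipelined send, not a disk file
            buffer = []
    if buffer:
        reducer.receive(buffer)

red = PipelinedReducer()
pipelined_map([("a", 1), ("b", 2), ("a", 3)], red)
print(red.snapshot())  # {'a': 4, 'b': 2}
```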
HOP Details (between jobs)
The reduce tasks of one job can optionally pipeline their output directly to the map tasks of the next job, sidestepping the need for expensive fault-tolerant storage in HDFS
In some sense, this “overlaps” the 1st REDUCE step and the 2nd MAP step (not truly overlapped)
HOP Functionality
Online aggregation: single-job online aggregation (SQL queries, ...); multi-job online aggregation
Continuous queries: process stream data (MapReduce jobs that run
continuously, accepting new data as it becomes available and analyzing it immediately)
Monitoring …
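A minimal sketch of a continuous query (hypothetical Python, not HOP's API): the job consumes new data as it arrives and emits the current aggregate immediately, instead of waiting for the whole input.

```python
def continuous_average(stream):
    """A 'continuous query': consume new data as it arrives and
    emit the current (approximate) aggregate immediately."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count   # early answer before the 'job' finishes

# each element of the list stands for a newly arrived record
estimates = list(continuous_average([10, 20, 30]))
print(estimates)  # [10.0, 15.0, 20.0]
```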
Evaluation
Omitted; in general, for some problems HOP is much faster
Paper: MapReduce Online [NSDI'10]
BSP-Style Frameworks
Pregel and Hama: a different programming interface (PI); long-running services (tasks); prefer in-memory processing (Pregel)
Hama Examples
Different from MapReduce, the main PI is a compute function (for a vertex)
Hama Examples
Or a bsp function (for iterative computation)
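The superstep structure behind Hama's bsp function can be simulated in a few lines (hypothetical Python; Hama's real API is Java, with calls such as send and sync on a peer object). Each superstep is local computation plus message exchange, separated by a global barrier; the example computes a global maximum.

```python
def run_supersteps(num_peers, compute, max_steps):
    """Simulate BSP: each superstep = local compute + message exchange,
    separated by a global barrier (the end of each loop iteration)."""
    inboxes = [[] for _ in range(num_peers)]
    states = list(range(num_peers))   # each peer starts with its own id
    for _ in range(max_steps):
        outboxes = [[] for _ in range(num_peers)]
        for peer in range(num_peers):
            states[peer] = compute(peer, states[peer], inboxes[peer], outboxes)
        inboxes = outboxes            # barrier: all messages delivered at once
    return states

def max_compute(peer, value, messages, outboxes):
    """Each peer adopts the max of its value and incoming messages,
    then broadcasts it (a classic BSP 'global maximum' example)."""
    value = max([value] + messages)
    for other in range(len(outboxes)):
        if other != peer:
            outboxes[other].append(value)
    return value

# after 2 supersteps every peer knows the global maximum
print(run_supersteps(4, max_compute, max_steps=2))  # [3, 3, 3, 3]
```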
A Summary
HOP changes Hadoop tasks' behavior but keeps almost the same programming interfaces and programming patterns:
map and reduce functions; MAP* + REDUCE pattern
BSP provides a different style of PIs and also different programming patterns:
compute and bsp functions (sync, sendMessage, ...); super-step pattern
My Proposal
A more flexible MapReduce: combines the advantages of both MapReduce and
BSP; a small step from the work of HOP; a small step from the work of BSP
MapReduce ( +BSP )
New patterns: MAP* + REDUCEG*;
REDUCEL* + MAP* + REDUCEG*. MAP* = receiveMsg + MAP + sendMsg + sync; REDUCE* = receiveMsg + REDUCE + sendMsg + sync
MapReduce-style batch processing
BSP (Hama) style: receiveMsg/sendMsg + sync
Long-running tasks
Executor: map/reduce; an Executor holds the map and reduce functions
Indexed Executors (each has an id and a name)
Architecture of MapReduce*
Executors are long-running processes
In the MAP phase, each executor invokes its map method on each input item
While map processing is in progress, “messages” can be added to the “message box”
The messages are sent asynchronously, and a BSP-style barrier ensures that all messages are delivered and received before the output is generated (note that the output could be empty)
Architecture of MapReduce*
Similar to the MAP phase, in the REDUCE phase executors invoke the reduce function on their input lists
Still, messages can be sent and received
Architecture of MapReduce*
Programming patterns (need more analysis): not necessarily always MAP → REDUCE, but also
REDUCE → MAP. This REDUCE is a local REDUCE; in Hadoop we usually
use map to implement it, but it is actually a local REDUCE.
The ordering of MAP and REDUCE phases should actually be free: with MapReduce*, logical MAP and REDUCE
phases do not cause heavy memory-to-disk synchronization, so we can arrange MAP and REDUCE freely
Lightweight MAP/REDUCE Phases
For example, for scan Hadoop needs a two-phase MAP:
the 1st MAP tasks compute (local) sums of each split; the 2nd MAP tasks compute the final result (I omit the 1st
REDUCE)
With MapReduce*, these two phases are computed by the same Executors
There is no need to spawn new MAP tasks, and even no need to re-read the input file (but in case
we don't have enough memory, we can still simply re-open the input splits)
[Usually, writing to disk or transferring through the network is more costly than reading from the local file system]
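The scan example can be sketched like this (hypothetical Python; the real prototype exchanges the local sums as messages between Java executors): each persistent executor keeps its split and local sum in memory across phases, so the second phase needs no new tasks and no re-read of the input.

```python
class ScanExecutor:
    """A persistent executor that keeps its split and local sum in
    memory across phases: no new tasks, no re-reading the input."""
    def __init__(self, split):
        self.split = split
        self.local_sum = None

    def phase1(self):
        # 1st MAP: local sum of this executor's split
        self.local_sum = sum(self.split)
        return self.local_sum

    def phase2(self, all_local_sums, my_index):
        # 2nd MAP: prefix-sum scan, using the sums exchanged between
        # executors (stand-in for the message exchange + barrier)
        offset = sum(all_local_sums[:my_index])
        out, acc = [], offset
        for x in self.split:
            acc += x
            out.append(acc)
        return out

splits = [[1, 2], [3, 4], [5, 6]]
executors = [ScanExecutor(s) for s in splits]
local_sums = [ex.phase1() for ex in executors]           # [3, 7, 11]
scan = [ex.phase2(local_sums, i) for i, ex in enumerate(executors)]
print(scan)  # [[1, 3], [6, 10], [15, 21]]
```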
Architecture of MapReduce*
Makes model/program transformations much easier
… this needs to be proven. Currently I have implemented Scan/Accumulation and the results look good.
Lower cost than original Hadoop/MapReduce
Compatible with original Hadoop/MapReduce
programs; at which level to keep the compatibility needs to be
considered (future work)
Examples
A MAP task programming interface: map and addMsg (the current implementation is just a prototype and still uses some Hama APIs under the hood)
Context
A Summary
Combines the advantages of both HOP and BSP
Avoids the heavy “materialization” between MAP and
REDUCE
Efficient communication between MAP tasks, between
REDUCE tasks, and from MAP tasks to REDUCE tasks
Intermediate state can be inherited from the
MAP phase to the REDUCE phase (through long-running Executors)
Messages can also be materialized (for fault tolerance); they need not stay in memory (saves memory and is good
for FT)
A Summary (continued)
No harm to fault tolerance (as currently understood)
Keeps the programming interfaces of MapReduce (almost the same)
A more flexible style than Hadoop/HOP
Compatible with original Hadoop/MapReduce
programs (depends on the implementation)
Current Status
We have a simplified prototype: implemented using Hama (messages, sync) and
Hadoop (HDFS); workable (tested with some examples, with good
performance)
Further work: theoretical analysis of the programming patterns; implementation (1 month)
Performance
100 × 2^20 items (200 MB)
2-pass MR (Liu's impl.): 23 s + 24 s
1-pass MR (Tung's impl.): 3-4 min (the job failed due to the input data)
MapReduce*: 22 s
I have test results from bigger data sets; omitted here