Map-Reduce-Merge Simplified Relational Data Processing on Large Clusters

Post on 28-May-2020

Page 1: on Large Clusters Simplified Relational Data Processingweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations/MapReduce… · Introduction Hadoop: open source refactor of data processing

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Contents
1. Introduction
2. Map-Reduce
3. Map-Reduce-Merge
4. Application to relational data processing
5. Optimization
6. Enhancements
7. Case studies
8. Conclusions

Introduction

Challenge: process and manage the vast amount of data collected from the entire World Wide Web.

Current solutions: customized parallel data processing systems that use large clusters of shared-nothing commodity nodes
  Google’s GFS, BigTable, MapReduce
  Ask.com’s Neptune
  Microsoft’s Dryad
  Yahoo!’s Hadoop

Introduction

Hadoop: open source

Refactoring of data processing into two primitives: map + reduce

Developers don't need to worry about the nuisance details of coordinating parallel sub-tasks and managing distributed file storage => increased productivity
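
As a rough illustration of the two primitives, here is a minimal single-process sketch of a word count. The function names (`map_words`, `reduce_counts`, `run_map_reduce`) and the sample documents are assumptions made for this example; a real framework would run the same two user functions across many nodes and handle the grouping step itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_words(doc_id: str, text: str) -> List[Tuple[str, int]]:
    # map: emit one (word, 1) pair per word in the document
    return [(word, 1) for word in text.lower().split()]


def reduce_counts(word: str, counts: Iterable[int]) -> int:
    # reduce: fold all values that share a key into one result
    return sum(counts)


def run_map_reduce(docs: Dict[str, str]) -> Dict[str, int]:
    # Stand-in for the framework: group mapper output by key,
    # then call the reducer once per key.
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_words(doc_id, text):
            groups[key].append(value)
    return {key: reduce_counts(key, values) for key, values in groups.items()}


counts = run_map_reduce({"d1": "map reduce map", "d2": "reduce merge"})
print(counts)
```

The user writes only `map_words` and `reduce_counts`; everything in `run_map_reduce` is what the framework would otherwise hide.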

Introduction

Map-Reduce (MR) is best at handling homogeneous datasets

Ex. joins of heterogeneous datasets --> call for extra MR steps

Map-Reduce-Merge
  simplified design
  relationally complete

Map-Reduce

Features and Principles

Low-cost unreliable commodity hardware
extremely scalable RAIN cluster
fault-tolerant yet easy to administer
simplified and restricted yet powerful
highly parallel yet abstracted
high throughput
high performance by the large
functional programming primitives
...

Map-Reduce

Homogenization: for equi-joins

Transform each dataset into (join key, data-source tag + payload)
Then apply map-reduce to merge entries from different datasets

Problems: handles only equi-joins; may take lots of extra disk space and incur excessive communication
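
The homogenization trick can be sketched as follows. This is a single-process illustration under stated assumptions: the tags "L" and "R", the function names, and the employee/department sample data are all hypothetical and not taken from the slides.

```python
from collections import defaultdict
from itertools import product


def map_tagged(source_tag, records):
    # Rewrite each record as (join key, (data-source tag, payload))
    for join_key, payload in records:
        yield join_key, (source_tag, payload)


def reduce_join(join_key, tagged_values):
    # Separate the payloads by source tag, then pair them up
    left = [payload for tag, payload in tagged_values if tag == "L"]
    right = [payload for tag, payload in tagged_values if tag == "R"]
    return [(join_key, l, r) for l, r in product(left, right)]


def homogenized_equi_join(left_records, right_records):
    # Both datasets flow through one map-reduce job after tagging
    groups = defaultdict(list)
    for key, tagged in map_tagged("L", left_records):
        groups[key].append(tagged)
    for key, tagged in map_tagged("R", right_records):
        groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined


employees = [(10, "alice"), (20, "bob"), (10, "carol")]
departments = [(10, "sales"), (30, "ops")]
result = homogenized_equi_join(employees, departments)
print(result)
```

Every record carries an extra tag and both datasets are shuffled by the same key, which is where the extra disk space and communication come from. Grouping by key equality is also why this approach covers only equi-joins.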

Map-Reduce-Merge

α, β, γ represent dataset lineages
The reduce function produces a key/value list instead of just values
The merge function reads data from both lineages
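
To make the three phases concrete, here is a minimal in-memory sketch. The driver (`map_reduce`, `map_reduce_merge`), the user functions, and the sales/department data are assumptions for illustration only. Each lineage (α, β) goes through its own map and reduce, the reducers keep the key alongside the value list, and the merger combines the two reduced lineages into γ.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    # Map + reduce over one lineage; the output keeps the key: (k2, [v3])
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    return [(k2, reducer(k2, groups[k2])) for k2 in sorted(groups)]


def map_reduce_merge(alpha, beta, map_a, red_a, map_b, red_b, merger):
    reduced_a = map_reduce(alpha, map_a, red_a)
    reduced_b = map_reduce(beta, map_b, red_b)
    gamma = []
    # Naive pairing of every left entry with every right entry;
    # the merger decides what, if anything, to emit.
    for left in reduced_a:
        for right in reduced_b:
            gamma.extend(merger(left, right))
    return gamma


def map_sale(sale_id, sale):          # alpha: sale-id -> (dept, amount)
    dept, amount = sale
    return [(dept, amount)]


def sum_amounts(dept, amounts):
    return [sum(amounts)]


def map_dept(dept, name):             # beta: dept -> name
    return [(dept, name)]


def keep_values(dept, names):
    return names


def join_on_dept(left, right):
    (lk, totals), (rk, names) = left, right
    return [(lk, (names[0], totals[0]))] if lk == rk else []


alpha = [("s1", (10, 5)), ("s2", (10, 7)), ("s3", (20, 1))]
beta = [(10, "sales"), (20, "ops"), (30, "hr")]
gamma = map_reduce_merge(alpha, beta, map_sale, sum_amounts,
                         map_dept, keep_values, join_on_dept)
print(gamma)
```

Because the reducer output still carries its key, the merger can match entries from the two lineages without a tagging step.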

Map-Reduce-Merge

Example

Merge Phase

Merge Phase

Partition Selector: determines from which reducers this merger retrieves its input data, based on the merger number

Processors:
1. Each processes data from one source only
2. Users can define two processor functions

Merger: processes two key/value pairs, one from each source

Configurable Iterators:
1. A merger has two logical iterators
2. Users control their relative movement against each other
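
The interplay of processors, merger, and iterators can be sketched like this. All names here are hypothetical, and the sketch assumes both inputs are sorted with unique keys. The `advance` function plays the role of the configurable iterator logic: it returns how far each of the two logical iterators moves after every merger call.

```python
def identity(pair):
    # Processor for the left source: pass the pair through unchanged
    return pair


def upper_value(pair):
    # Processor for the right source: touches data from that source only
    key, value = pair
    return key, value.upper()


def equal_key_merger(left_pair, right_pair):
    # Merger: sees one key/value pair from each source
    (lk, lv), (rk, rv) = left_pair, right_pair
    return [(lk, (lv, rv))] if lk == rk else []


def advance_sorted(left_key, right_key):
    # Iterator policy: move whichever side has the smaller key,
    # or both sides when the keys match
    if left_key < right_key:
        return 1, 0
    if left_key > right_key:
        return 0, 1
    return 1, 1


def merge_phase(left, right, process_left, process_right, merger, advance):
    left = [process_left(p) for p in left]
    right = [process_right(p) for p in right]
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        out.extend(merger(left[i], right[j]))
        step_left, step_right = advance(left[i][0], right[j][0])
        i += step_left
        j += step_right
    return out


left = [(1, "a"), (2, "b"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
merged = merge_phase(left, right, identity, upper_value,
                     equal_key_merger, advance_sorted)
print(merged)
```

Swapping in a different `advance` policy changes the traversal pattern without touching the merger itself.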

Merger

Configurable Iterators

example 1.

Configurable Iterators

example 2.

Configurable Iterators

example 3.

Application to relational data processing

Map-Reduce-Merge is relationally complete:
projection: map
aggregation: map + reduce
generalized selection: map --> where, reduce --> having, merger --> filtering conditions involving more than one relation
joins: to be discussed...
set union / set intersection / set difference: easily handled in the merger
Cartesian product: nested loop
rename: trivial
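
A small sketch of how the single-relation operators might map onto the phases: the mapper applies the WHERE condition and the projection, and the reducer performs the aggregation and the HAVING filter. The table, column names, and thresholds are made up for this example.

```python
from collections import defaultdict

# Hypothetical relation: employee(dept, name, salary)
rows = [
    {"dept": "sales", "name": "alice", "salary": 50},
    {"dept": "sales", "name": "bob", "salary": 70},
    {"dept": "ops", "name": "carol", "salary": 40},
]


def map_where_project(row):
    # selection (WHERE salary >= 45) and projection to (dept, salary)
    if row["salary"] >= 45:
        yield row["dept"], row["salary"]


def reduce_sum_having(dept, salaries):
    # aggregation (SUM) followed by selection on the aggregate (HAVING)
    total = sum(salaries)
    if total > 100:
        yield dept, total


def run(rows):
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_where_project(row):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reduce_sum_having(key, groups[key]))
    return out


report = run(rows)
print(report)
```

This corresponds roughly to SELECT dept, SUM(salary) FROM employee WHERE salary >= 45 GROUP BY dept HAVING SUM(salary) > 100. Conditions that span two relations would go into the merger instead.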

Sort-Merge Join

Map: use a range partitioner => records are partitioned into ordered buckets, each covering a mutually exclusive key range

Reduce: sort data

Merge: reads from two sets of reducer outputs that cover the same key range
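
The three steps above can be sketched in one process as follows. The function names and the `boundaries` parameter are assumptions for this sketch, and the same boundaries are applied to both datasets so that bucket i on each side covers the same key range.

```python
import bisect


def range_partition(records, boundaries):
    # Map side: bucket i holds keys in the i-th key range
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect.bisect_right(boundaries, key)].append((key, value))
    return buckets


def reduce_sort(bucket):
    # Reduce side: sort each bucket by key
    return sorted(bucket, key=lambda kv: kv[0])


def merge_sorted(left, right):
    # Merge side: classic sort-merge over two sorted inputs
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side and pair them up
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


def sort_merge_join(left, right, boundaries):
    left_buckets = [reduce_sort(b) for b in range_partition(left, boundaries)]
    right_buckets = [reduce_sort(b) for b in range_partition(right, boundaries)]
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        out.extend(merge_sorted(lb, rb))
    return out


left = [(5, "e"), (1, "a"), (3, "c"), (3, "c2")]
right = [(3, "x"), (5, "y"), (7, "z")]
joined = sort_merge_join(left, right, boundaries=[4])
print(joined)
```

Because the buckets are ordered and disjoint, each merger only needs the one pair of reducer outputs that covers its key range.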

Hash Join

Map: use a common partitioner => records are partitioned into hashed buckets

Reduce: reads from every mapper for one designated partition; using the same hash function, records from these partitions can be grouped and aggregated with a hash table

Merge: reads from two sets of reducer outputs that share the same hashing buckets; one set is used as the build set and the other as the probe set
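
A minimal sketch of the hash join, with hypothetical function names and integer keys assumed. Both datasets go through the same partitioner, so matching keys always land in buckets with the same index. Each merger then builds a hash table from one side and probes it with the other.

```python
from collections import defaultdict


def hash_partition(records, num_buckets):
    # Common partitioner applied to both datasets
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[hash(key) % num_buckets].append((key, value))
    return buckets


def build_probe(build_side, probe_side):
    # Build an in-memory hash table from one side...
    table = defaultdict(list)
    for key, value in build_side:
        table[key].append(value)
    # ...then probe it with each record from the other side
    return [(key, built, probed)
            for key, probed in probe_side
            for built in table.get(key, [])]


def hash_join(left, right, num_buckets=4):
    out = []
    left_buckets = hash_partition(left, num_buckets)
    right_buckets = hash_partition(right, num_buckets)
    for lb, rb in zip(left_buckets, right_buckets):
        out.extend(build_probe(lb, rb))
    return out


left = [(1, "a"), (2, "b"), (2, "b2")]
right = [(2, "x"), (3, "y"), (1, "z")]
joined = sorted(hash_join(left, right))
print(joined)
```

Unlike the sort-merge variant, nothing here relies on key order, so the output is sorted only to make the result easy to compare.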

Block Nested-Loop Join

Map: same as the one for the hash join

Reduce: same as the one for the hash join

Merge: almost the same as for the hash join, except that a nested-loop join is used instead of build/probe
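
The merge step might look like the sketch below. The `block_size` and `predicate` parameters are assumptions of this example. One side is read a block at a time, and for each block the other side is scanned in full, with every pair tested against the predicate.

```python
def block_nested_loop_join(left, right, block_size=2, predicate=None):
    if predicate is None:
        # Default to an equi-join on the key
        predicate = lambda lk, rk: lk == rk
    out = []
    for start in range(0, len(left), block_size):
        block = left[start:start + block_size]   # outer block held in memory
        for rk, rv in right:                     # one full scan of the inner side per block
            for lk, lv in block:
                if predicate(lk, rk):
                    out.append((lk, lv, rk, rv))
    return out


left = [(1, "a"), (5, "b"), (9, "c")]
right = [(4, "x"), (8, "y")]
# Hypothetical non-equi condition: left key strictly smaller than right key
pairs = block_nested_loop_join(left, right, block_size=2,
                               predicate=lambda lk, rk: lk < rk)
print(pairs)
```

Since every pair is compared explicitly, the predicate does not have to be an equality test.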

Optimizations

Optimal Reduce-Merge Connections
Results of Reduce: partitioned and sorted

The selector of Merge can choose only the pertinent part of the data
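
One way a selector could exploit partitioned, sorted reducer output is sketched here. It assumes each reducer advertises the inclusive key range it holds, which is an assumption of this example. The merger then connects only to reducers whose range overlaps its own.

```python
def select_partitions(merger_range, reducer_ranges):
    # Return the indices of reducers whose [low, high] key range
    # overlaps the key range this merger is responsible for.
    low, high = merger_range
    return [index
            for index, (r_low, r_high) in enumerate(reducer_ranges)
            if r_low <= high and low <= r_high]


# Hypothetical key ranges held by three reducers
reducer_ranges = [(0, 9), (10, 19), (20, 29)]
needed = select_partitions((8, 12), reducer_ranges)
print(needed)
```

Reducers outside the returned list are never contacted, which cuts the number of reduce-merge connections.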

Optimizations

Combining Phases

ReduceMap, MergeMap: directly send output to new mappers

ReduceMerge: combine the merger with the reducer

ReduceMergeMap: combination of the above two
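
As a loose single-process analogy for ReduceMap, the sketch below chains a reducer straight into the next job's mapper using generators, so each reduced pair is handed over as soon as it is produced rather than being written out first. The function names and the parity-based mapper are purely illustrative.

```python
def reducer(groups):
    # First job's reduce: sum the values for each key
    for key, values in groups:
        yield key, sum(values)


def next_mapper(pairs):
    # Second job's map: re-key each total by its parity
    for key, total in pairs:
        yield total % 2, key


def reduce_map(groups):
    # Combined phase: reducer output feeds the next mapper directly
    return next_mapper(reducer(groups))


combined = list(reduce_map([("a", [1, 2]), ("b", [4])]))
print(combined)
```

No intermediate list of reducer output is ever materialized between the two functions.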

Enhancements

Map-Reduce-Merge Library
A library that contains commonly used merger configurations, like all kinds of joins

Enhancements

Map-Reduce-Merge Workflow
The regular Map-Reduce workflow is very strict
Adding a new phase creates many workflow combinations

Enhancements

Map-Reduce-Merge Workflow

Case Studies

Join Webgraphs
Table: | URL | inlinks | outlinks |
Each column is stored in a separate file

Goal: compute the intersection of inlinks and outlinks for each URL

Case Studies

Join Webgraphs
Reading all three columns into one Map-Reduce can overflow the buffer

Safer approach:
1) use each URL as a row-id
2) replicate the row-id to each inlink and outlink
3) produce <row-id, inoutlink>
4) then natural join <row-id, URL> with <row-id, inoutlink>
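
The four steps can be sketched on toy data as follows. The row-ids, example URLs, and helper names are hypothetical, and Python dictionaries stand in for the three column files. In a cluster setting steps 3 and 4 would each be a merge over two inputs; here they are plain in-memory operations.

```python
# Steps 1 and 2 input: each column keyed by row-id (one row per URL)
urls = {0: "a.example", 1: "b.example"}
inlinks = {0: ["x.example", "y.example"], 1: ["z.example"]}
outlinks = {0: ["y.example", "w.example"], 1: ["q.example"]}


def flatten(column):
    # Step 2: replicate the row-id onto every link in the row
    return [(row_id, link) for row_id, links in column.items() for link in links]


def intersect(in_pairs, out_pairs):
    # Step 3: <row-id, inoutlink> = links that are both inlinks and outlinks
    return sorted(set(in_pairs) & set(out_pairs))


def natural_join(url_table, inout_pairs):
    # Step 4: join <row-id, URL> with <row-id, inoutlink> on row-id
    return [(row_id, url_table[row_id], link) for row_id, link in inout_pairs]


joined = natural_join(urls, intersect(flatten(inlinks), flatten(outlinks)))
print(joined)
```

Working on small (row-id, link) pairs means no step ever has to hold an entire wide row in memory.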

Case Studies

Map-Reduce-Merge Workflow for TPC-H Q2

Case Studies

Map-Reduce-Merge Workflow for TPC-H Q2
SQL: 5-way join with aggregate and group-by clauses

Map-Reduce-Merge: four 2-way joins, then order-by and sorting

Case Studies

Conclusions

Map-Reduce-Merge supports joins of heterogeneous datasets

Thus, it can be used to implement many relational operators, particularly joins