Efficient Skyline Computation in MapReduce

Kasper Mullesgaard, Jens Laurits Pedersen, Hua Lu (Aalborg University)
Yongluan Zhou (University of Southern Denmark)

Skyline Query

• Application: multi-criteria decision
• Tuple dominance: t1 dominates t2 (t1 ⊰ t2)
– Iff t1 is not worse than t2 in all dimensions, and
– t1 is better than t2 in at least one dimension

• Skyline query:
– Given a dataset, returns all tuples that are not dominated by others
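The dominance test and the skyline definition above can be written down directly. The following is a minimal sketch, assuming that smaller values are better in every dimension; the function names and the (price, distance) sample data are illustrative and not taken from the slides.

```python
def dominates(t1, t2):
    """True if t1 dominates t2: not worse in all dimensions, better in at least one.

    Assumption: smaller values are better in every dimension.
    """
    not_worse = all(a <= b for a, b in zip(t1, t2))
    better_once = any(a < b for a, b in zip(t1, t2))
    return not_worse and better_once


def skyline(tuples):
    """Return the tuples that no other tuple dominates (quadratic, for illustration only)."""
    return [t for t in tuples if not any(dominates(o, t) for o in tuples)]


# Hypothetical (price, distance) pairs; lower is better in both dimensions.
hotels = [(50, 8), (60, 3), (80, 2), (90, 5), (55, 9)]
print(skyline(hotels))  # [(50, 8), (60, 3), (80, 2)]
```

Here (90, 5) is dominated by (60, 3) and (55, 9) by (50, 8), so only three tuples remain.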

Scaling Skyline Computation

• Customized solutions:
– Require arbitrary inter-node communication
– Need software stacks to harness a large cluster
– Unproved scalability
– Lack of fault tolerance

• General MapReduce platforms
– Availability of scalable systems, such as Hadoop
– A strict communication/synchronization model

MapReduce

Challenges of Skyline Computation using MapReduce

• To maximize parallelization
– Push more work to mappers, i.e. let mappers filter out more non-skyline points
– Ability to utilize multiple reducers

• However, global skylines cannot be determined by local information
– Without global information, mappers have very limited capabilities to filter out non-skyline points
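A small simulation may make the limitation concrete. This is a sketch in plain Python rather than Hadoop code; it assumes smaller values are better, and the split contents and function names are made up for illustration. Each simulated mapper filters only against its own split, so it forwards points that a tuple in another split dominates.

```python
def dominates(t1, t2):
    # Assumption: smaller values are better in every dimension.
    return all(a <= b for a, b in zip(t1, t2)) and any(a < b for a, b in zip(t1, t2))


def local_skyline(tuples):
    return [t for t in tuples if not any(dominates(o, t) for o in tuples)]


def mapreduce_skyline(splits):
    # Map phase: every mapper sees only its own split.
    candidates = [t for split in splits for t in local_skyline(split)]
    # Reduce phase: a single reducer removes candidates dominated across splits.
    return candidates, local_skyline(candidates)


splits = [[(1, 9), (2, 8), (3, 9)], [(1, 1), (5, 5)]]
candidates, result = mapreduce_skyline(splits)
print(candidates)  # [(1, 9), (2, 8), (1, 1)]
print(result)      # [(1, 1)]
```

The first mapper cannot know that (1, 1) exists in the other split, so it forwards two points that are not in the global skyline.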

Grid Partitioning and Bit String Representation

Partition Dominance: pi ⊰ pj iff pi.max ⊰ pj.min

Partition numbering in the example 3 × 3 grid:

2 5 8
1 4 7
0 3 6

BSR = 011110100
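The partition dominance rule and the bit string can be sketched as follows. Several points are assumptions, since the slide only shows the figure: data lies in [0, 1)^d, smaller values are better, bit i (counted from the left) belongs to partition i in the numbering shown above, and a bit is set only if the partition is non-empty and no other non-empty partition dominates it. All function names and sample points are hypothetical.

```python
from itertools import product


def cell_of(point, ppd):
    """Grid coordinates of a point in [0, 1)^d, with ppd partitions per dimension."""
    return tuple(min(int(v * ppd), ppd - 1) for v in point)


def partition_dominates(ci, cj):
    """pi dominates pj iff pi.max dominates pj.min.

    In cell units, pi.max is ci + 1 in every dimension and pj.min is cj.
    """
    corner = tuple(c + 1 for c in ci)
    return (all(a <= b for a, b in zip(corner, cj))
            and any(a < b for a, b in zip(corner, cj)))


def bit_string(points, ppd):
    """One bit per partition; partition index = x * ppd + y in two dimensions."""
    dims = len(points[0])
    occupied = {cell_of(p, ppd) for p in points}
    bits = []
    for cell in product(range(ppd), repeat=dims):  # last coordinate varies fastest
        keep = cell in occupied and not any(
            partition_dominates(other, cell) for other in occupied)
        bits.append("1" if keep else "0")
    return "".join(bits)


# Hypothetical points falling into partitions 1, 2, 3, 4, 6 and 8.
points = [(0.1, 0.5), (0.2, 0.9), (0.5, 0.1), (0.5, 0.5), (0.9, 0.2), (0.9, 0.9)]
print(bit_string(points, 3))  # 011110100
```

Under these assumptions the sample points give the same string as the slide: partition 8 is non-empty but dominated by partition 1, so its bit is cleared.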

Bit String Generation

Determining Partitions Per Dimension (PPD)

• PPD is too high → very few tuples in each partition and too many partitions

• PPD is too low → too many tuples in each partition and less effective pruning

• Idea: generate bit strings for PPD from 2 to

– then choose the one with the most desirable number of tuples per partition
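One way to picture the selection is sketched below. The slide does not survive with its upper bound or its exact criterion, so `max_ppd` and `target_tpp` are hypothetical parameters, and "most desirable" is read here as "average tuples per non-empty partition closest to a target". Data is assumed to lie in [0, 1)^d.

```python
from collections import Counter


def choose_ppd(points, max_ppd, target_tpp):
    """Try PPD = 2..max_ppd and keep the one whose average number of tuples
    per non-empty partition is closest to target_tpp (both are assumptions)."""
    best_gap, best_ppd = None, None
    for ppd in range(2, max_ppd + 1):
        counts = Counter(
            tuple(min(int(v * ppd), ppd - 1) for v in p) for p in points)
        avg = len(points) / len(counts)
        gap = abs(avg - target_tpp)
        if best_gap is None or gap < best_gap:
            best_gap, best_ppd = gap, ppd
    return best_ppd


# 16 hypothetical points on a regular 4 x 4 lattice inside the unit square.
lattice = [((i + 0.5) / 4, (j + 0.5) / 4) for i in range(4) for j in range(4)]
print(choose_ppd(lattice, max_ppd=4, target_tpp=4))  # 2
print(choose_ppd(lattice, max_ppd=4, target_tpp=1))  # 4
```

With a target of 4 tuples per partition the coarse 2 × 2 grid wins; with a target of 1 the 4 × 4 grid does.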

Single Reducer

Multi-Reducer

• The single reducer still performs significant work for detecting the global skyline
– Limits the degree of parallelization

• Idea: independent partition group
– Anti-Dominating Region (ADR):

– Independent Partition Group: A set of partitions Pi is an IPG iff holds

– One reducer is responsible for each IPG.

Generation of I.P.G.

• Idea: a partition pm is a maximum partition iff ∀p, pm ∉ p.ADR

• Procedure:
1. Find a maximum partition pm
2. Generate IPG = {pm} ∪ pm.ADR
3. Remove pm and repeat from step 1
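The procedure can be sketched as below, with two loudly stated assumptions. First, the formal ADR definition is missing from this transcript, so `adr` uses a stand-in: the surviving partitions that are no worse than p in every dimension (smaller is better), excluding p itself. Second, step 3 is read as also dropping partitions already covered by a group from the candidates, so that no group is a subset of an earlier one. Partitions are given as grid coordinates; all names are illustrative.

```python
def adr(p, partitions):
    """Assumed ADR of p: partitions no worse than p in every dimension, p excluded."""
    return {q for q in partitions
            if q != p and all(a <= b for a, b in zip(q, p))}


def independent_groups(partitions):
    partitions = set(partitions)
    remaining = set(partitions)
    groups = []
    while remaining:
        # Step 1: a maximum partition lies in no other remaining partition's ADR.
        pm = next(p for p in sorted(remaining)
                  if not any(p in adr(q, partitions) for q in remaining if q != p))
        # Step 2: the group is pm together with its ADR.
        group = {pm} | adr(pm, partitions)
        groups.append(group)
        # Step 3 (assumed reading): drop pm and everything the group already covers.
        remaining -= group
    return groups


# Grid coordinates of partitions 1, 2, 3, 4 and 6 in a 3 x 3 grid.
surviving = [(0, 1), (0, 2), (1, 0), (1, 1), (2, 0)]
for g in independent_groups(surviving):
    print(sorted(g))
# [(0, 1), (0, 2)]
# [(0, 1), (1, 0), (1, 1)]
# [(1, 0), (2, 0)]
```

Partitions (0, 1) and (1, 0) each end up in two groups, which is why duplicate outputs have to be eliminated later.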

Implementation Issues

• More independent groups than #reducers
– Need to allocate them to the reducers, two options:
1. Load balancing
2. Minimizing duplicate data transmission

• Elimination of duplicated skyline outputs
– A grid partition appears in multiple IPGs
– Designate one IPG as the responsible group
• Load balancing
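Both issues can be illustrated with simple greedy heuristics. The slide names the goals but not the algorithms, so what follows is an assumed sketch: groups are placed largest-first on the least-loaded reducer, and each partition is assigned to whichever containing group currently has the fewest output duties. Group ids, sizes and function names are hypothetical.

```python
import heapq


def allocate_groups(group_sizes, num_reducers):
    """Greedy load balancing: largest group first, onto the least-loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]
    assignment = {}
    for gid in sorted(group_sizes, key=lambda g: (-group_sizes[g], g)):
        load, reducer = heapq.heappop(heap)
        assignment[gid] = reducer
        heapq.heappush(heap, (load + group_sizes[gid], reducer))
    return assignment


def designate_responsible(groups):
    """Pick exactly one group per partition to output that partition's skyline tuples."""
    duty = {gid: 0 for gid in groups}
    responsible = {}
    for part in sorted({p for members in groups.values() for p in members}):
        owners = [gid for gid, members in groups.items() if part in members]
        chosen = min(owners, key=lambda g: (duty[g], g))
        responsible[part] = chosen
        duty[chosen] += 1
    return responsible


sizes = {"A": 5, "B": 3, "C": 2}
print(allocate_groups(sizes, 2))  # {'A': 0, 'B': 1, 'C': 1}

groups = {0: {(0, 2), (0, 1)}, 1: {(1, 1), (0, 1), (1, 0)}, 2: {(2, 0), (1, 0)}}
print(designate_responsible(groups))
```

A reducer then emits skyline tuples only for the partitions its groups are responsible for, so a partition shared by two groups is output once.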

Experimental Setup

• 13 commodity machines
• Datasets with independent and anti-correlated distribution
• Comparisons:
– MR-BNL
– MR-Angle

#Dimensions

Independent data, cardinality: 1×10^5

#Dimensions

Anti-correlated data, cardinality: 1×10^5

Cardinality (independent data)

Dimensions: 3 Dimensions: 8

Cardinality (anti-correlated data)

Dimensions: 3 Dimensions: 8

Number of Reducers

Summary

• Grid partitioning and bit strings
– Choose an appropriate number of partitions per dimension

• Exploit independent groups to enable multiple reducers
– Good for cases with a large number of skyline tuples
– Merging independent groups
– Eliminate duplicate outputs
