Efficient Skyline Computation in MapReduce
Kasper Mullesgaard, Jens Laurits Pedersen, Hua Lu
Aalborg University
Yongluan Zhou
University of Southern Denmark
Skyline Query
• Application: multi-criteria decision making
• Tuple dominance: t1 dominates t2 (t1 ⊰ t2)
– Iff t1 is not worse than t2 in all dimensions, and
– t1 is better than t2 in at least one dimension
• Skyline query:
– Given a dataset, returns all tuples that are not dominated by others
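The dominance test and the skyline definition above can be sketched in a few lines of Python. This is a minimal, quadratic-time illustration and not the method of the talk; it assumes that smaller values are better in every dimension, and the names `dominates`, `skyline` and the hotel data are made up for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(t1: Sequence[float], t2: Sequence[float]) -> bool:
    """True if t1 dominates t2 (assumption: smaller is better in every dimension)."""
    not_worse = all(a <= b for a, b in zip(t1, t2))
    strictly_better = any(a < b for a, b in zip(t1, t2))
    return not_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to beach); lower is better for both.
hotels = [(50, 8), (60, 3), (70, 5), (90, 1), (55, 9)]
print(skyline(hotels))  # [(50, 8), (60, 3), (90, 1)]
```

Here (70, 5) is dominated by (60, 3) and (55, 9) by (50, 8), so neither is part of the skyline.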
Scaling Skyline Computation
• Customized solutions:
– Require arbitrary inter-node communication
– Need software stacks to harness a large cluster
– Unproven scalability
– Lack of fault tolerance
• General MapReduce platforms
– Availability of scalable systems, such as Hadoop
– A strict communication/synchronization model
MapReduce
Challenges of Skyline Computation using MapReduce
• To maximize parallelization
– Push more work to mappers, i.e., let mappers filter out more non-skyline points
– Ability to utilize multiple reducers
• However, global skylines cannot be determined by local information
– Without global information, mappers have very limited capabilities to filter out non-skyline points
Grid Partitioning and Bit String Representation
Partition Dominance: pi ⊰ pj iff pi.max ⊰ pj.min
2 5 8
1 4 7
0 3 6
BSR = 011110100
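As a rough sketch of how such a bit string could be produced: the code below assigns each tuple to a grid cell, numbers the cells as in the 3×3 grid above (first dimension most significant), and sets a bit for every non-empty cell. Clearing the bits of cells dominated by another non-empty cell is an assumed pruning step, and the dominance test is a literal reading of pi.max ⊰ pj.min in grid units with smaller-is-better semantics. Coordinates are assumed to lie in [0, 1); all function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

Cell = Tuple[int, ...]


def dominates(t1: Sequence[float], t2: Sequence[float]) -> bool:
    """Tuple dominance, assuming smaller is better in every dimension."""
    return all(a <= b for a, b in zip(t1, t2)) and any(a < b for a, b in zip(t1, t2))


def cell_of(point: Sequence[float], ppd: int) -> Cell:
    """Grid cell of a point; assumes coordinates in [0, 1)."""
    return tuple(min(int(x * ppd), ppd - 1) for x in point)


def cell_index(cell: Cell, ppd: int) -> int:
    """Position of the cell in the bit string (first dimension most significant)."""
    idx = 0
    for c in cell:
        idx = idx * ppd + c
    return idx


def bit_string(points: Iterable[Sequence[float]], ppd: int, dims: int) -> str:
    occupied = {cell_of(p, ppd) for p in points}
    bits = ["0"] * (ppd ** dims)
    for cell in occupied:
        # In grid units, pj.min is `cell` and pi.max is `other` plus one per dimension.
        # Under this literal reading, cells whose corners coincide are not pruned.
        pruned = any(
            dominates(tuple(c + 1 for c in other), cell)
            for other in occupied
            if other != cell
        )
        if not pruned:
            bits[cell_index(cell, ppd)] = "1"
    return "".join(bits)


# One made-up point in each of partitions 1, 2, 3, 4 and 6 of the 3x3 grid.
sample = [(0.1, 0.5), (0.2, 0.9), (0.5, 0.1), (0.4, 0.6), (0.8, 0.2)]
print(bit_string(sample, ppd=3, dims=2))  # 011110100
```

With these sample points no occupied cell dominates another, so the result matches the bit string shown above. A point in partition 0 would clear the bits of partitions 5, 7 and 8.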
Bit String Generation
Determining Partitions Per Dimension (PPD)
• PPD is too high → very few tuples in each partition and too many partitions
• PPD is too low → too many tuples in each partition and less effective pruning
• Idea: generate bit strings for PPD from 2 to
– then choose the one with the most desirable number of tuples per partition
Single Reducer
Multi-Reducer
• The single reducer still performs significant work for detecting the global skyline
– Limits the degree of parallelization
• Idea: independent partition group
– Anti-Dominating Region (ADR):
– Independent Partition Group: A set of partitions Pi is an IPG iff holds
– One reducer is responsible for each IPG.
Multi-Reducer
Generation of I.P.G.
• Idea: a partition pm is a maximum partition iff ∀p, pm ∉ p.ADR
• Procedure:
1. Find a maximum partition pm
2. Generate IPG = {pm} ∪ pm.ADR
3. Remove pm and repeat step 1
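The procedure can be sketched as follows. The ADR formula did not survive in this transcript, so the code assumes that p.ADR consists of the other non-empty partitions whose grid coordinates are no larger than p's in every dimension, i.e. the partitions that may hold tuples dominating tuples in p. It also finds all maximum partitions in a single pass instead of looping, which is one reading of steps 1 to 3. The function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, ...]


def adr(p: Cell, partitions: Set[Cell]) -> Set[Cell]:
    """Assumed ADR: other partitions with coordinates <= p's in every dimension."""
    return {q for q in partitions if q != p and all(a <= b for a, b in zip(q, p))}


def independent_groups(partitions: Iterable[Cell]) -> List[Set[Cell]]:
    parts = set(partitions)
    regions = {p: adr(p, parts) for p in parts}
    # A maximum partition is one that lies in no other partition's ADR.
    maxima = [p for p in parts if not any(p in regions[q] for q in parts)]
    # One group per maximum partition: IPG = {pm} ∪ pm.ADR.
    return [{pm} | regions[pm] for pm in sorted(maxima)]


# Non-empty partitions 1, 2, 3, 4, 6 of a 3x3 grid, written as (x, y) cells.
cells = [(0, 1), (0, 2), (1, 0), (1, 1), (2, 0)]
for group in independent_groups(cells):
    print(sorted(group))
```

Under these assumptions the maximum partitions are (0, 2), (1, 1) and (2, 0). Cells (0, 1) and (1, 0) each land in two groups, which is the kind of duplication that a designated responsible group would have to resolve.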
Implementation Issues
• More independent groups than #reducers
– Need to allocate them to the reducers; two options:
1. Load balancing
2. Minimizing duplicate data transmission
• Elimination of duplicated skyline outputs
– A grid partition appears in multiple IPGs
– Designate one IPG as the responsible group
• Load balancing
Experimental Setup
• 13 commodity machines
• Datasets with independent and anti-correlated distributions
• Comparisons:
– MR-BNL
– MR-Angle
#Dimensions
Independent data, cardinality: 1×10^5
#Dimensions
Anti-correlated data, cardinality: 1×10^5
Cardinality (independent data)
Dimensions: 3; Dimensions: 8
Cardinality (anti-correlated data)
Dimensions: 3; Dimensions: 8
Number of Reducers
Summary
• Grid partitioning and bit strings
– Choose an appropriate # of partitions per dimension
• Exploit independent groups to enable multiple reducers
– Good for cases with large # skylines
– Merging independent groups
– Eliminate duplicate outputs