Degraded-First Scheduling for MapReduce in Erasure-Coded Storage Clusters Runhui Li, Patrick P. C. Lee, Yuchong Hu 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks


Page 1

Degraded-First Scheduling for MapReduce in Erasure-Coded Storage Clusters

Runhui Li, Patrick P. C. Lee, Yuchong Hu

2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

Page 2

Outline

• Introduction
• Background
• Motivating Example
• Design Of Degraded-first Scheduling
• Simulation
• Experiments
• Conclusion

Page 3

Introduction(1/5)

As a storage system scales, node failures are commonplace.

To ensure data availability at any time, traditional designs such as GFS, HDFS, and Azure replicate each data block three times to provide double-fault tolerance.

However, as the volume of global data surges to the zettabyte scale, the 200% redundancy overhead of 3-way replication becomes a scalability bottleneck.
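The 200% figure, and the contrast with erasure coding drawn on the next slides, follow from simple arithmetic; a minimal sketch (the (16, 12) code parameters are borrowed from the simulation setup later in the deck):

```python
# Redundancy overhead = extra storage divided by useful data.
def overhead(stored_units, data_units):
    return (stored_units - data_units) / data_units

# 3-way replication stores 3 units for every 1 unit of data.
print(f"3-way replication: {overhead(3, 1):.0%}")       # 200%
# An (n, k) = (16, 12) erasure code stores 16 units per 12 data units.
print(f"(16, 12) erasure code: {overhead(16, 12):.0%}")  # 33%
```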

Page 4

Introduction(2/5)

Erasure coding incurs lower storage overhead than replication under the same fault tolerance.

Extensive efforts [13, 20, 29] have studied the use of erasure coding in clustered storage systems that provide data analytics services.

• [13] D. Ford, F. Labelle, F. I. Popovici, M. Stokel, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in Globally Distributed Storage Systems. In Proc. of USENIX OSDI, Oct 2010.

• [20] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure Coding in Windows Azure Storage. In Proc. of USENIX ATC, Jun 2012.

• [29] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. Xoring Elephants: Novel Erasure Codes for Big Data. In Proc. of VLDB Endowment, pages 325–336, 2013.

Page 5

Introduction(3/5)

In particular, when data is unavailable due to node failures, reads are degraded in erasure-coded storage, as they need to download data from surviving nodes to reconstruct the missing data.

Several studies [20, 22, 29] propose to optimize degraded reads in erasure-coded clustered storage systems, by reducing the amount of downloaded data for reconstruction.


• [20] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure Coding in Windows Azure Storage. In Proc. of USENIX ATC, Jun 2012.

• [22] O. Khan, R. Burns, J. Plank, W. Pierce, and C. Huang. Rethinking Erasure Codes for Cloud File Systems: Minimizing I/O for Recovery and Degraded Reads. In Proc. of USENIX FAST, Feb 2012.

• [29] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. Xoring Elephants: Novel Erasure Codes for Big Data. In Proc. of VLDB Endowment, pages 325–336, 2013.

Page 6

Introduction(4/5)

Despite the extensive studies on erasure-coded clustered storage systems, it remains an open issue how to customize data analytics paradigms, such as MapReduce, for such systems.

In this work, we explore Hadoop’s version of MapReduce on HDFS-RAID [18], a middleware layer that extends HDFS to support erasure coding.

• [18] HDFS-RAID. http://wiki.apache.org/hadoop/HDFS-RAID.

Page 7

Introduction(5/5)

Traditional MapReduce scheduling emphasizes locality, and implements locality-first scheduling.

MapReduce is designed for replication-based storage: in the presence of node failures, it re-schedules tasks to run on other nodes that hold replicas of the data.

A key motivation of this work is to customize MapReduce scheduling for erasure-coded storage in failure mode.

Page 8

Background(1/6)

Hadoop

Hadoop runs on a distributed file system, HDFS. HDFS divides a file into fixed-size blocks, which form the basic units for read and write operations. HDFS uses replication to maintain data availability: each block is replicated into multiple copies and distributed across different nodes.

Page 9

Background(2/6)

In typical deployment environments of MapReduce, network bandwidth is scarce.

MapReduce emphasizes data locality by trying to schedule a map task to run on a (slave) node that stores a replica of the data block, or a node that is located near the data block.

This saves the time of downloading blocks from other nodes over the network.

Page 10

Background(3/6)

A map task can be classified into three types:

1. Node-local : the task processes a block stored on the same node.

2. Rack-local : the task downloads and processes a block stored on another node in the same rack.

3. Remote : the task downloads and processes a block stored on a node in a different rack.

The default task scheduling scheme in Hadoop first assigns map slots to local tasks, followed by remote tasks.
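The classification above can be sketched in a few lines (the node/rack identifiers are hypothetical, not Hadoop's actual API):

```python
# Classify a map task by where its input block lives relative to the
# slot that would run it.
def classify(slot_node, slot_rack, block_node, block_rack):
    if slot_node == block_node:
        return "node-local"
    if slot_rack == block_rack:
        return "rack-local"
    return "remote"

# Locality-first scheduling assigns slots to tasks in this priority order.
PRIORITY = {"node-local": 0, "rack-local": 1, "remote": 2}
```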

Page 11

Background(4/6)

Erasure Coding

To reduce the redundancy overhead of replication, erasure coding can be used.

In replication, a read to a lost block can be re-directed to another block replica. In erasure coding, reading a lost block requires a degraded read, which reads blocks from any k surviving nodes of the same stripe and reconstructs the lost block.
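The mechanics of a degraded read can be illustrated with the simplest possible code, a single XOR parity (i.e., (n, k) = (k+1, k)). Production systems use stronger codes such as Reed-Solomon, but the access pattern is the same: fetch k surviving blocks of the stripe and recompute the lost one.

```python
def xor_blocks(blocks):
    """XOR equal-sized blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# A stripe with k = 3 native blocks and one parity block.
data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(data)

# Degraded read: block 1 is lost; reconstruct it from the k survivors.
survivors = [data[0], data[2], parity]
assert xor_blocks(survivors) == data[1]
```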

Page 12

Background(5/6)

MapReduce follows locality-first scheduling. HDFS-RAID reconstructs a lost block via a degraded read.

Degraded tasks : tasks that first read data from other surviving nodes to reconstruct the lost block, and then process the reconstructed block.

Degraded tasks have the lowest priority in the default locality-first scheduling; they are scheduled after local and remote tasks.

Page 13

Background(6/6)

Page 14

Motivating Example(1/2)

HDFS uses 3-way replication and places the three replicas as follows. The first replica is placed on a random node. The second and third replicas are placed on two different random nodes located in a different rack from the first replica.

This placement policy can tolerate any double-node failure and any single-rack failure.
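A sketch of this placement policy (the rack and node naming is illustrative, not HDFS's internals):

```python
import random

# Place three replicas per the policy above: first on a random node,
# second and third on two distinct nodes in a different rack.
def place_replicas(racks):
    """racks: dict mapping rack id -> list of node ids."""
    rack1 = random.choice(list(racks))
    node1 = random.choice(racks[rack1])
    rack2 = random.choice([r for r in racks if r != rack1])
    node2, node3 = random.sample(racks[rack2], 2)
    return [(rack1, node1), (rack2, node2), (rack2, node3)]
```

Losing any two nodes, or any one rack, still leaves at least one replica intact, which is the double-node / single-rack tolerance stated above.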

Page 15

Motivating Example(2/2)

Page 16

Design Of Degraded-first Scheduling(1/7)

The main idea is to move some of the degraded tasks to an earlier stage of the map phase.

The degraded tasks can take advantage of the unused network resources while the local tasks are running.

This avoids network resource competition among degraded tasks at the end of the map phase.

Page 17

Design Of Degraded-first Scheduling(2/7)

Basic Design

• M : total number of all map tasks to be launched.

• Md : total number of degraded tasks to be launched.

• m : number of all map tasks that have been launched.

• md : number of degraded tasks that have been launched.
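These counters suggest a simple launch rule. The following is a plausible reading of the basic design, not the paper's verbatim algorithm: whenever a map slot frees up, launch a degraded task if the launched fraction md/m has fallen behind the overall fraction Md/M, and a local/remote task otherwise.

```python
def should_launch_degraded(m, md, M, Md):
    """Launch a degraded task next if md/m lags behind Md/M."""
    if md == Md:          # no degraded tasks remain
        return False
    if m == 0:            # nothing launched yet
        return True
    return md / m < Md / M

# Simulate launching M = 10 tasks, Md = 3 of them degraded.
order = []
m = md = 0
while m < 10:
    degraded = should_launch_degraded(m, md, 10, 3)
    order.append("D" if degraded else "L")
    m += 1
    md += degraded
print("".join(order))  # "DLLLDLLDLL": degraded tasks spread out, not bunched at the end
```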

Page 18

Design Of Degraded-first Scheduling(3/7)

Page 19

Design Of Degraded-first Scheduling(4/7)

[Figure: analytical runtime expressions under locality-first scheduling and degraded-first scheduling, for R racks.]

• T : processing time of a map task.
• S : input block size.
• W : download bandwidth of each rack.
• F : total number of native blocks to be processed by MapReduce.
• N : number of nodes.
• R : number of racks.
• L : number of map slots allocated to each node.
• An (n, k) erasure code encodes k native blocks to generate n − k parity blocks.

The analysis derives: the runtime of a MapReduce job without any node failure, the number of degraded tasks in each rack, and the expected time for downloading blocks from other racks.

Page 20

Design Of Degraded-first Scheduling(5/7)

• N = 40, R = 4, L = 4, S = 128MB
• W = 1Gbps, T = 20s, F = 1440
• (n, k) = (16, 12)
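With these numbers, a back-of-envelope lower bound on the failure-free map phase can be computed. This is our own rough wave model, ignoring network transfer, not the paper's analysis:

```python
import math

def map_phase_lower_bound(F, N, L, T):
    """Map tasks run in waves of N*L parallel slots, each wave taking T."""
    waves = math.ceil(F / (N * L))
    return waves * T

# F = 1440 tasks over 40 nodes * 4 slots = 160 slots -> 9 waves of 20s.
print(map_phase_lower_bound(1440, 40, 4, 20))  # 180 (seconds)
```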

(a) runtime reduction ranging from 15% to 32%.

(b) runtime reduction ranging from 25% to 28%.

(c) runtime reduction ranging from 18% to 43%.

Page 21

Design Of Degraded-first Scheduling(6/7)

Enhanced Design

Locality preservation : Having additional remote tasks is clearly undesirable, as they compete for network resources just as degraded tasks do. We implement locality preservation by restricting the launch of degraded tasks, so that local map tasks are not unexpectedly assigned to other nodes.

Rack awareness : In failure mode, launching multiple degraded tasks in the same rack may result in competition for network resources, since the degraded tasks download data through the same top-of-rack switch.

Page 22

Design Of Degraded-first Scheduling(7/7)


• ts : processing time for the local map tasks of each slave s.

• E[ts] : expected processing time for the local map tasks across all slaves.

• tr : duration since the last degraded task was assigned to rack r.

• E[tr] : expected duration across all racks.
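One plausible way these quantities combine into the enhanced checks (the inequality directions are our reading of the intent, not the paper's exact algorithm): a slave still catching up on local work keeps its slots for local tasks, and a rack that only recently received a degraded task waits its turn.

```python
def may_assign_degraded(ts, E_ts, tr, E_tr):
    # Locality preservation: slave s has spent at least the average time
    # on local tasks, so a degraded task won't displace pending local work.
    locality_ok = ts >= E_ts
    # Rack awareness: rack r has waited at least the average interval
    # since its last degraded task, so its top-of-rack switch is less loaded.
    rack_ok = tr >= E_tr
    return locality_ok and rack_ok
```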

Page 23

Simulation(1/5)

Compare enhanced degraded-first scheduling (EDF) with locality-first scheduling (LF).

Compare the basic and enhanced versions of degraded-first scheduling (BDF and EDF).

Our MapReduce simulator is a C++-based discrete-event simulator built on CSIM20 [8].

• [8] CSIM. http://www.mesquite.com/products/csim20.htm.

Page 24

Simulation(2/5)

Locality-First vs. Degraded-First Scheduling

40 nodes are evenly grouped into four racks. The rack download bandwidth is 1Gbps. The block size is 128MB. We use a (20, 15) erasure code. The total number of map tasks is 1440, while the number of reduce tasks is fixed at 30.

Page 25

Simulation(3/5)

(a) 17.4% for (8, 6) to 32.9% for (20, 15). (b) 34.8% to 39.6%. (c) 35.1% on average when the bandwidth is 500Mbps.

Page 26

Simulation(4/5)

(d) 33.2%, 22.3%, and 5.9% on average.

(e) EDF reduces LF by 20.0% to 33.2%.

(f) EDF reduces LF by 28.6% to 48.6%.

Page 27

Simulation(5/5)

Basic vs. Enhanced Degraded-First Scheduling

Heterogeneous cluster : the same configuration as the homogeneous one, except that half of the nodes have worse processing power, with correspondingly higher mean times set for the map and reduce tasks.

Page 28

Experiments(1/4)

Run experiments on a small-scale Hadoop cluster testbed composed of a single master node and 12 slave nodes.

The 12 slaves are grouped into three racks with four slaves each.

Three I/O-heavy MapReduce jobs: WordCount, Grep, and LineCount.

Page 29

Experiments(2/4)

We set the HDFS block size to 64MB and use a (12, 10) erasure code to provide fault tolerance.

Generate 15GB of plain text data from the Gutenberg website [17].

The data is divided into 240 blocks and written to HDFS.
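The block count checks out: 15GB of data at a 64MB block size yields exactly 240 blocks.

```python
# Sanity check of the experiment's block count.
GB, MB = 1024**3, 1024**2
blocks = (15 * GB) // (64 * MB)
print(blocks)  # 240
```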

• [17] Gutenberg. http://www.gutenberg.org.

Page 30

Experiments(3/4)

(a) 27.0%, 26.1% and 24.8%

(b) 16.6%, 28.4%, and 22.6%

Page 31

Experiments(4/4)

Compare the average runtime of normal map tasks (local and remote tasks), degraded tasks and reduce tasks.

The runtime of a task includes the data transmission time and the data processing time.

Page 32

Conclusion

We present degraded-first scheduling, a new MapReduce scheduling scheme designed to improve MapReduce performance in erasure-coded clustered storage systems running in failure mode.

Degraded-first scheduling reduces the MapReduce runtime of locality-first scheduling by 27.0% in a single-job scenario and by 28.4% in a multi-job scenario.