
Page 1: NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed File System

Hieu Hanh Le, Satoshi Hikida and Haruo Yokota
Tokyo Institute of Technology
Appeared in DASFAA 2013, the 18th International Conference on Database Systems for Advanced Applications (Wuhan, China)

Page 2: Agenda

Background
Research Motivation
Goal and Approach
Proposals
Experimental Evaluation
Conclusion

Page 3: Background

Hadoop Distributed File System (HDFS) is widely used as data storage for applications in the Cloud
It is built from commercial off-the-shelf hardware, supports the MapReduce framework, and offers good scalability
It utilizes a huge number of DataNodes to store the huge amounts of data requested by data-intensive applications, which expands the power consumption of the storage system
Power-aware file systems are therefore moving towards power-proportional designs

Page 4: [Background] Power-proportional Storage System

A system should consume energy in proportion to the amount of work performed [Barroso and Holzle, 2007]
Set the system's operation to multiple gears containing different numbers of active DataNodes
Made possible by data placement methods

[Diagram: in High Gear, blocks D1-D4 are spread across Nodes 1-4; shifting to Low Gear migrates blocks (D1, D4) so that the data remains available on the smaller set of active nodes]
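As a minimal illustration of the gear idea (the node names and wattage below are our own assumptions, not figures from the slides), a gear can be modeled as the subset of DataNodes that stay powered on, so that consumption tracks the active subset rather than the full cluster:

```python
# Hypothetical sketch: a "gear" is just the number of powered-on DataNodes.
# Power-proportionality means consumption scales with the active subset.
# Node names and per-node wattage are illustrative assumptions.

NODES = ["Node1", "Node2", "Node3", "Node4"]
WATTS_PER_NODE = 200  # assumed per-node power draw

def active_nodes(gear: int) -> list:
    """Low gear keeps a small prefix of nodes on; high gear keeps them all."""
    gears = {1: 1, 2: len(NODES)}  # gear -> number of active nodes
    return NODES[: gears[gear]]

def power(gear: int) -> int:
    """Power consumed at a given gear, proportional to active nodes."""
    return len(active_nodes(gear)) * WATTS_PER_NODE

print(active_nodes(1))      # low gear: only Node1 stays on
print(power(1), power(2))   # 200 vs. 800 -> proportional to active nodes
```

Before powering nodes down, the data placement method must migrate (or replicate) blocks so that every block stays reachable from the low-gear subset, which is exactly the migration the diagram shows.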

Page 5: Research Motivation

Gear-shifting is vital in a power-proportional system
When shifting up, the system needs to reflect the data that was updated in a lower gear to guarantee the higher performance
This means re-transferring the updated data according to the data placement
The gear-shifting process is inefficient in the current methods for HDFS [Rabbit, Sierra]:
Bottleneck in metadata access
High communication cost among nodes

Rabbit: Robust and Flexible Power-proportional Storage, ACM SoCC 2010
Sierra: Practical Power-proportionality for Data Center Storage, ACM EuroSys 2011

Page 6: Gear-shifting in current HDFS-based methods [1/10]

E.g., Rabbit, Sierra

[Diagram: a NameNode and DataNodes 1-4. In Low Gear, dataset D = {D1, D2, D3, D4} is written to the active DataNodes; on Gear Up, the reactivated DataNodes must receive the updated blocks according to the high-gear data placement]

Page 7: Gear-shifting in current HDFS-based methods [2/10]

Step 1: Access metadata to identify the updated blocks

[Diagram: all metadata requests from DataNodes 1-4 hit the single NameNode, causing congestion]

Page 8: Gear-shifting in current HDFS-based methods [3/10]

Step 1: Access metadata to identify the updated blocks (congestion at the NameNode)
Step 2: Transfer the updated blocks
2.1 Command issuance from the NameNode
2.2 Block transfer between DataNodes

[Diagram: the NameNode issues transfer commands while blocks D1 and D4 move to their high-gear DataNodes]

Page 15: Gear-shifting in current HDFS-based methods [10/10]

Step 1: Access metadata to identify the updated blocks, causing congestion at the single NameNode
Step 2: Transfer the updated blocks (2.1 command issuance, 2.2 block transfer), performed sequentially at one block per connection, which is inefficient

[Diagram: updated blocks D1 and D4 are re-transferred one at a time to their high-gear DataNodes]
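The inefficiency on these slides can be caricatured in a few lines. The model below is a toy of our own (the millisecond constants are invented, not taken from Rabbit or Sierra): every updated block costs one metadata lookup at the central NameNode plus one dedicated connection carrying exactly one block:

```python
# Toy cost model (our assumption, not the papers' model): gear-up cost when
# a central NameNode handles every updated block one connection at a time.

def naive_gear_up_cost(updated_blocks, lookup_ms=2, conn_setup_ms=5, xfer_ms=10):
    """Each block pays one metadata lookup at the NameNode (a serial
    bottleneck) plus one connection carrying exactly one block."""
    total = 0
    for _ in updated_blocks:
        total += lookup_ms               # congestion: every lookup hits one node
        total += conn_setup_ms + xfer_ms # one connection per single block
    return total

print(naive_gear_up_cost(["D1", "D4"]))  # 2 blocks -> 34 ms in this toy model
```

Both per-block terms grow linearly with the number of updated blocks, and neither the lookups nor the transfers overlap, which is the bottleneck the proposal attacks.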

Page 16: Goal and Approach

Goal
Propose a novel architecture for efficient gear-shifting in a power-proportional HDFS

Approach
Utilize distributed metadata management (MDM) to eliminate the bottleneck of centralized MDM
Couple the NameNode and DataNode (NDCouplingHDFS) to localize the range of updated blocks maintained by each metadata manager, reducing the communication cost among nodes
Enable multiple-block transfers to improve efficiency in HDFS

Page 17: [Proposals] Distributed MDM

Distribute MDM over multiple nodes to decentralize the load during gear-shifting
This requires a distributed MDM that is update-conscious:
The metadata is transferred when the system shifts gears
Search/insert/delete operations must remain cheap
Distributed-hash-table-based methods are inefficient: the hash function must be applied to each transferred file
Range-based methods are efficient: for a range of files, all the metadata can be transferred within a limited number of structure traversals
Two range-based methods are applied:
Each node statically maintains a separate subnamespace (Static Directory Partitioning, SDP)
A parallel index technique with good concurrency control (Fat-Btree) [*]

[*] A Concurrency Control Protocol for Parallel B-tree Structures without Latch-coupling for Explosively Growing Digital Content, EDBT 2008
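The hash-versus-range contrast above can be sketched concretely. This is our own illustration (the slides' actual structures are SDP and the Fat-Btree; file names and node counts here are invented): with hashing, every transferred file must be rehashed individually, while with range partitioning a contiguous range is located with a bounded number of traversals and moves as one unit:

```python
# Sketch of why range-partitioned metadata suits gear-shifting (our
# illustration; the actual structures in the slides are SDP and Fat-Btree).
import bisect
import hashlib

files = sorted(f"file{i:03d}" for i in range(100))

# Hash-based partitioning: every transferred file must be rehashed
# individually to locate its destination node.
def hash_destinations(files, n_nodes):
    return {f: int(hashlib.md5(f.encode()).hexdigest(), 16) % n_nodes
            for f in files}

# Range-based partitioning: a contiguous key range moves as one unit, so
# its metadata is found with two binary searches instead of per-file work.
def range_to_move(sorted_files, lo, hi):
    i = bisect.bisect_left(sorted_files, lo)
    j = bisect.bisect_right(sorted_files, hi)
    return sorted_files[i:j]  # one slice after two binary searches

moved = range_to_move(files, "file010", "file019")
print(len(moved))  # 10 files located without per-file hashing
```

This is why an update-conscious distributed MDM favors range-based structures: gear-shifting moves whole subnamespaces, not scattered individual keys.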

Page 18: [Proposals] NDCouplingHDFS with Distributed MDM

Each node maintains a subnamespace of the whole namespace of the system
The mapping information [Node, Range] is managed by the distributed MDM

[Diagram: four NDCouplingHDFS nodes ND1-ND4, each coupling a Distributed MDM with Data Management. Subnamespaces: ND1: [1, 10], ND2: [11, 20], ND3: [21, 30], ND4: [31, ~]. A request for key 25 is (1) sent to a node, (2) forwarded to the responsible node, which (3) serves the request and returns the results, and (4) the results are returned to the client]
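The routing on this slide can be sketched directly. The interval bounds are taken from the slide; the function and its binary search are our own minimal illustration of step 2 (forwarding a request to the responsible node):

```python
# Minimal router for the [Node, Range] mapping on this slide: a request for
# key 25 is forwarded to ND3, which owns the subnamespace [21, 30].
import bisect

# (lower bound, node) pairs; per the slide, ND4 owns [31, ~).
RANGES = [(1, "ND1"), (11, "ND2"), (21, "ND3"), (31, "ND4")]
LOWER_BOUNDS = [lo for lo, _ in RANGES]

def responsible_node(key: int) -> str:
    """Binary-search the lower bounds to find the owning subnamespace."""
    i = bisect.bisect_right(LOWER_BOUNDS, key) - 1
    return RANGES[i][1]

print(responsible_node(25))                       # -> ND3
print(responsible_node(5), responsible_node(31))  # -> ND1 ND4
```

Because any node holds this small mapping, the forward-and-serve round trip in the diagram needs no central NameNode on the request path.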

Page 19: [Proposals] Efficient Gear-shifting [1/6]

The process is distributed to multiple nodes
Command issuance between the distributed MDM and data management is performed locally on each node
Updated blocks are transferred in a batch way (multiple blocks per connection)

[Diagram: four coupled nodes holding blocks A, D, B, C and updated blocks A1, B1, C1, D1; the reactivated nodes track updates in a WOL log of entries <File, Temp Node, Intended Node>]

Page 24: [Proposals] Efficient Gear-shifting [6/6]

1. Transfer updated metadata
2. Command issuance (performed locally between the distributed MDM and data management)
3. Transfer blocks (in a batch way, multiple blocks per connection)
4. Update metadata

The process is distributed to multiple nodes (parallelism), command issuance is local (reduced network cost), and blocks move in batches (efficient block transfer)

[Diagram: the four steps run in parallel on the reactivated coupled nodes, guided by the WOL log entries <File, Temp Node, Intended Node>]
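The benefit of batching (step 3) can be sketched with a toy cost comparison. The constants are illustrative values we chose, not measurements; the batch size of 100 mirrors the "maximum number of transferred blocks" used later in Experiment 1:

```python
# Toy comparison (illustrative numbers): one block per connection, as in the
# current methods, vs. batched transfer as in NDCouplingHDFS.

CONN_SETUP_MS = 5   # assumed cost to open one connection
XFER_MS = 10        # assumed cost to move one block

def sequential_cost(n_blocks):
    """One connection per block: setup is paid for every block."""
    return n_blocks * (CONN_SETUP_MS + XFER_MS)

def batched_cost(n_blocks, batch_size=100):
    """Multiple blocks per connection: setup is paid once per batch."""
    n_conns = -(-n_blocks // batch_size)  # ceiling division
    return n_conns * CONN_SETUP_MS + n_blocks * XFER_MS

print(sequential_cost(200), batched_cost(200))  # 3000 vs. 2010 ms
```

The transfer term is identical in both cases; batching only amortizes the per-connection overhead, which is why it matters most when many small updated blocks must be reflected at once.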

Page 25: Experimental Evaluation

Experiment 1
Verify the effectiveness of the proposals in the gear-shifting process by comparing with normal HDFS
Reflecting the updated blocks is the major cost
Focus: coupling architecture, batch block transfer

Experiment 2
Evaluate the effectiveness of the distributed index technique in NDCouplingHDFS
SDP vs. Fat-Btree while changing the number of nodes

Page 26: [Experiment 1] Validity of NDCouplingHDFS in Gear-shifting

Compare the execution time of updated-data reflection in NDCouplingHDFS with normal HDFS across five configurations: combinations of architecture, distributed MDM (SDP, Fat-Btree), command issuance, and block transfer

Environment (updated data reflection):
Gears: 2
Active nodes at Low Gear: 8
Active nodes at High Gear: 16
Files: 16,000, 1 MB each
HDFS version: 0.20.2
Maximum number of transferred blocks: 100
Heartbeat interval: 1 s

Page 27: [Experiment 1] Experimental Results

Configuration       NormalHDFS   SSS          SBS          SBB        FBB
Architecture        HDFS         Coupling     Coupling     Coupling   Coupling
MDM                 Central      SDP          SDP          SDP        Fat-Btree
Command issuance    Sequential   Sequential   Batch        Batch      Batch
Block transference  Sequential   Sequential   Sequential   Batch      Batch

[Chart: execution time [s] and number of communication connections (command issuance) per configuration; annotations show 46% and 41% reductions relative to normal HDFS]

The coupling architecture and batch block transfer had the strongest effect on performance

Page 28: [Experiment 2] Scalability of metadata operations

Evaluate SDP vs. Fat-Btree while changing the number of files and the number of nodes

Environment:
Machines: 1, 2, 4, 8
CPU: TM8600 1.0 GHz
Memory: 4 GB DRAM
NIC: 1000 Mb/s
OS: Linux 3.0, 64-bit
Java: JDK 1.7.0
Fat-Btree fanout: 16
Concurrency control: LCFB [Yoshihara, 2007]
Workload: 3,000 files, 1 MB each

Page 29: [Experiment 2] Experimental Results

Fat-Btree gained better scalability as the number of nodes increased
Read throughput scaled well thanks to the lower search cost and concurrency control
The gain in write throughput was limited by the synchronization cost of updating the tree structure

[Charts: read throughput [operations/s] and write throughput [operations/s] for SDP vs. Fat-Btree at 1, 2, 4, and 8 nodes; a transaction opens/creates metadata and reads/writes files]

Page 30: Conclusion

Proposed NDCouplingHDFS for efficient gear-shifting in a power-proportional HDFS
Reduced the execution time of reflecting updated data by up to 46% compared with normal HDFS, thanks to the coupling architecture and batch block transfer
Improved I/O performance by applying a distributed index technique to NDCouplingHDFS

NDCouplingHDFS continues to support MapReduce and is expected to achieve real power-proportionality, including the power consumption of metadata management

Page 31: NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed File System

Thank you for your attention!