8/18/2019 IEEERepSpec
A Study of Hadoop Distributed File System (HDFS)
Author's Name/s per 1st Affiliation (Author)
line 1 (of Affiliation): dept. name of organization
line 2: name of organization, acronyms acceptable
line 3: City, Country
line 4: e-mail address if desired
Author's Name/s per 2nd Affiliation (Author)
line 1 (of Affiliation): dept. name of organization
line 2: name of organization, acronyms acceptable
line 3: City, Country
line 4: e-mail address if desired
Abstract —
The Hadoop Distributed File System (HDFS) is designed to run on commodity hardware and can be used as a stand-alone, general-purpose distributed file system (HDFS User Guide). It provides the ability to access bulk data with high I/O throughput. As a result, this system is suitable for applications that have large I/O data sets. However, the performance of HDFS decreases dramatically when handling the operations of interaction-intensive files, i.e., files that have relatively small size but are frequently accessed. This paper analyzes the cause of the throughput degradation issue when accessing interaction-intensive files and presents an enhanced HDFS architecture, along with an associated storage allocation algorithm, that overcomes the performance degradation problem. Experiments have shown that with the proposed architecture together with the associated storage allocation algorithm, the HDFS throughput for interaction-intensive files increases 300% on average, with only a negligible performance decrease for large data set tasks.
1. HDFS
The Hadoop Distributed File System (HDFS), a subproject of the Apache Hadoop project, is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware. HDFS provides high-throughput access to application data and is suitable for applications with large data sets. This article explores the primary features of HDFS and provides a high-level view of the HDFS architecture.

The Hadoop Distributed File System (HDFS) is designed as a massive data storage framework and serves as the storage component of the Apache Hadoop platform. This file system is based on commodity hardware and provides highly reliable storage and global access with high throughput for large data sets [15]. Because of these advantages, HDFS is also used as a stand-alone, general-purpose distributed file system and serves non-Hadoop applications [6].
However, the advantage of high throughput that HDFS provides diminishes quickly when handling interaction-intensive files, i.e., files that are of small size but are accessed frequently. The reason is that before an I/O transmission starts, there are a few necessary initialization steps that need to be completed, such as data location retrieval and storage space allocation. When transferring large data, this initialization overhead becomes relatively small and can be negligible compared with the data transmission itself. However, when transferring small data, this overhead becomes significant. In addition to the initialization overhead, files with high I/O data access frequencies can also quickly overburden the regulating component of HDFS, i.e., the single namenode that supervises and manages every access to the datanodes [1]. If the number of datanodes is large, the single namenode can quickly become a bottleneck when the frequency of I/O requests is high.

In many systems, frequent file access is unavoidable; log file updating, for example, is a common procedure in many applications. Since HDFS applies the "Write-Once, Read-Many" rule, such frequent update operations make the problem even more pronounced.
In this paper, we use HDFS as a stand-alone file system and present an integrated approach to addressing the HDFS performance degradation issue for interaction-intensive tasks. In particular, we extend the HDFS architecture by adding cache support and transforming the single namenode into an extended hierarchical namenode architecture. Based on the extended architecture, we develop a Particle Swarm Optimization (PSO)-based storage allocation algorithm to improve the HDFS throughput for interaction-intensive tasks.
The rest of the paper is organized as follows. Section 2 discusses related work, focusing on the cause of throughput degradation when handling interaction-intensive tasks and possible solutions developed for the problem. Section 3 presents an enhanced HDFS namenode structure. Section 4 describes the proposed PSO-based storage allocation algorithms to be deployed on the extended structure. Experimental results are presented and analyzed in Section 5. Finally, in Section 6, we conclude the paper with a summary of our findings and point out future work.
2. Related work

As the application of HDFS increases, the pitfalls of HDFS are also being discovered and studied. Among them is the poor performance when HDFS handles small, frequently accessed files. As pointed out in [9], HDFS is designed for processing big data transmission rather than transferring a large number of small files; hence it is not suitable for interaction-intensive tasks.
Shvachko et al. [15] notice that HDFS sacrifices the immediate response time of individual I/O requests to gain better throughput for the entire system. In other words, HDFS chooses the strategy of enlarging the data volume of a single transmission to decrease the ratio of the transmission initialization overhead in the overall operation. The transmission initialization includes permission verification, access authentication, locating existing file blocks for reading tasks, allocating storage space for incoming writing tasks, and establishing or maintaining transmission connections. Another issue with the current HDFS architecture is its single-namenode structure, which can quickly become the bottleneck in the presence of interaction-intensive tasks and hence cause other incoming I/O requests to wait for a long period of time before being processed [1].
Efforts have been made to overcome these shortcomings of HDFS. For instance, Liu [13] combines several small files into a large file and then uses an extra index to record their offsets. Hence, the number of files is reduced and the efficiency of processing access requests is increased. To enhance HDFS's ability to handle small files, Chandrasekar et al. [3] also use a strategy that merges small files into a larger file and records the locations of file blocks inside the client's cache. This strategy also allows the client application to circumvent the namenode and directly contact the datanodes, which alleviates the bottleneck issue on the namenode. In addition, this strategy uses pre-fetching in its I/O process to further increase the system throughput when dealing with small files. The drawback of the strategy is that it requires client-side code changes.
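The merge-and-index strategy summarized above can be sketched in a few lines; the class and method names below are illustrative assumptions, not the API of either cited system.

```python
# Sketch of the small-file merging strategy described above: many small
# files are packed into one large container file, and a separate index
# records each file's (offset, length) so it can be read back directly.
# All names here are illustrative; this is not the cited implementations.

class MergedFile:
    def __init__(self):
        self.data = bytearray()   # contents of the large container file
        self.index = {}           # file name -> (offset, length)

    def append(self, name, content: bytes):
        # Record where this small file starts inside the container.
        self.index[name] = (len(self.data), len(content))
        self.data.extend(content)

    def read(self, name) -> bytes:
        # One index lookup replaces a per-file namenode round trip.
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])

merged = MergedFile()
merged.append("log-a", b"alpha")
merged.append("log-b", b"bravo")
```

The index makes the number of "files" seen by the namenode one, regardless of how many small files are packed inside.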
Preventing access requests from frequently interacting with the datanodes at the bottom layer of HDFS can also significantly improve the I/O performance of interaction-intensive tasks. As illustrated in [9], one way to achieve this is to add caches to the original HDFS structure. In [17], a shared cache strategy named HDCache, which is built on top of HDFS, is presented to improve the performance of small-file access.
Liao et al. [12] adapt the conventional hierarchical structure for HDFS to increase the capability of the namenode component. Borthakur [2] discusses the architecture of HDFS in detail and presents several possible ways to apply the hierarchical concept to HDFS. The authors of [5] provide a solution called EDS, which introduces middleware and hardware components between the client application and HDFS to implement a hierarchical storage structure. This structure uses multiple layers of resource allocators to ease the quick-saturation issue of the single-namenode structure.
Designing optimized storage allocation algorithms is another effective way to improve the throughput of HDFS. In [7], a load re-balancing algorithm is developed to always keep each file in an optimal location in a dynamic HDFS environment to gain the best throughput. An algorithm for optimizing file placement in the Data Grid Environment (DGE) is presented in [8]. This algorithm consists of a data placement and migration strategy that helps the regulating components (i.e., the namenode in HDFS) of a distributed data-sharing system place files in a large number of data servers.
As can be seen from the brief summary above, to overcome the throughput degradation issue for interaction-intensive tasks, most research in this area focuses on three mechanisms: optimizing metadata or adding caches, using an extended structure for the regulating component, and designing new storage allocation algorithms. However, if these mechanisms are applied individually, the bottleneck issues are only shifted from one place to another, rather than being rooted out. The solution presented in this paper integrates all three approaches and solves the performance degradation issue for interaction-intensive tasks from the root. Our experimental results have shown successful evidence of the proposed approach.
3. Extended namenode with cache support
In this section, an extended namenode architecture for HDFS with cache support is specified. The following two subsections describe the details of the structure and its functionalities.

3.1. Extended namenode structure

The original HDFS structure, which consists of a single namenode, is redesigned with a two-layer namenode structure, as shown in Fig. 1. The first layer contains a Master Name Node (MNN), which is actually the namenode in the original HDFS structure. The second layer consists of a set of Secondary Name Nodes (SNNs), which are applied to every rack of datanodes. Reliable caches, such as SSDs, are deployed on every SNN. It is worth pointing out that the SNN does not have to be a new machine; it can be configured from an existing datanode machine, though in our implementation we use a dedicated server for each SNN. To the MNN, each SNN is its datanode; to the datanodes, the SNN in their rack is their namenode. Hence, the new structure fully preserves the original HDFS relations between the MNN and SNN, and between the SNN and datanode structures. Therefore, all the original functions and mechanisms of HDFS are intact, and the number of modifications needed for the structural extension is minimal.

The addition of the new layer to the original HDFS structure changes both the reading and the writing procedures. Specifically, when a reading request arrives, the MNN first inquires the SNNs and then navigates the client application to the destination racks (Step 1 in Fig. 1). Afterwards, the SNN in the rack checks its mapping table and continues navigating the client application to the datanodes that hold the file blocks (Step 2). At the end, the client application contacts the datanodes and accesses the file blocks (Step 3-I). If a block is cached, the SNN plays the role of a datanode and lets the client application read the block from its cache (Step 3-II). Similarly, to write data into a datanode, the MNN first allocates one or more racks for the client application to write to (Step A). Then, the SNN further allocates datanodes for the client application (Step B). Afterwards, the client application begins to write file blocks into these datanodes (Step C-I). If the file block is assigned to the caches in that rack, then the SNN plays the role of a datanode and lets the client application write the file block into its cache (Step C-II). When the writing is done, the datanodes report their changes to the SNN and the SNNs report their changes to the MNN (Steps D and E), replicas of each incoming block are created on different datanodes (Step F), and a notification is sent to the client application.
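The two-layer lookup described above can be illustrated with a small sketch; the dictionary-based mapping tables and all names are simplifying assumptions, not the actual HDFS metadata structures.

```python
# Minimal sketch of the two-layer read path: the first layer maps a file
# to a rack, and the rack's second-layer node maps each block to either
# a datanode or its own cache. Plain dicts stand in for the real
# mapping tables, which are far richer in practice.

MNN_TABLE = {"fileA": "rack1"}                       # file -> rack (Step 1)
SNN_TABLES = {
    "rack1": {"blk1": "datanode3", "blk2": "cache"}  # block -> location (Step 2)
}

def locate_block(file_name, block_id):
    rack = MNN_TABLE[file_name]            # Step 1: master picks the rack
    location = SNN_TABLES[rack][block_id]  # Step 2: rack node picks the target
    if location == "cache":
        # Step 3-II: the rack node serves the block from its own cache
        return (rack, "SNN-cache")
    # Step 3-I: the client contacts the datanode directly
    return (rack, location)
```

Note that the master never sees individual datanodes; it only resolves racks, which is what shrinks its request load.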
3.2. Functionality of the SNN

As illustrated in Fig. 1, the SNN layer is deployed between the MNN layer and the datanode layer. Each rack of datanodes has an SNN that maintains a cache dedicated to this rack. This layer has three functionalities.
The first functionality is to keep the monitoring and messaging mechanisms between the original namenode and the datanodes unchanged. To maintain the original monitoring mechanism running on the MNN, the SNN submits all the information about its rack to the MNN after receiving the MNN's heartbeat inquiry. In addition, the SNN sends its own heartbeat inquiry to the datanodes in its rack in order to monitor their status. Furthermore, in order to keep HDFS's message mechanism intact, the SNN responds to all the messages generated by the MNN and sends its own control messages to the datanodes in its rack.
The second functionality of the SNN is to store interaction-intensive file blocks in its cache. Once a file is marked by the MNN as interaction-intensive and assigned to a rack, instead of relaying the writing request to the datanode layer, the SNN of this rack intercepts the request and stores the incoming blocks in its caches. As the I/O speed of a cache is much faster than that of the hard disks used in the datanodes, the throughput for interaction-intensive files is improved. Furthermore, as the number of SNNs is much smaller than the number of datanodes, file blocks in the cache can be located much faster and hence have a shorter I/O response time. Moreover, since the cost of transmission initialization on the datanode is high, the system overhead can be further reduced by using the SNN to prevent interaction-intensive tasks from frequently accessing the datanodes. When a file is no longer interaction-intensive, the SNN writes all of its cached file blocks to datanodes.
The third functionality is to increase the MNN's capability of handling frequent requests. In the original structure, the single namenode becomes the bottleneck when the datanode pool becomes large or when requests arrive too frequently. The new layer turns the single namenode component into a hierarchical namenode structure and thus significantly increases the capability of HDFS to manage a large resource pool and handle a large number of incoming requests. As a result of the new layer, the sizes of the datanode pools managed by the MNN and by individual SNNs are reduced. For the MNN, the size of its datanode pool is the number of SNNs. For each SNN, the size of its datanode pool is the size of the original datanode pool divided by the number of SNNs.
4. Storage allocation algorithm
As presented in Section 3, changing the single namenode to a hierarchical namenode structure increases HDFS's capability of handling frequent requests. In addition, the caches introduced in this structure provide faster data access with shorter response times. Under this new structure, the throughput degradation caused by access to interaction-intensive files can be further reduced by applying an optimized storage allocation strategy. The development of this strategy is done in three steps: for a given file, we need to determine (1) the file blocks' interaction-intensity value, (2) the re-allocation triggering condition, and (3) the file blocks' storage location based on the file's interaction-intensity.
4.1. Interaction-intensity
The interaction-intensity of a file, denoted as its I value, is a measurement of how frequently the file has been requested. A request can be either a read or a write request submitted by one or multiple applications. All the blocks of a file share the file's I property. Since the I property of a file varies from time to time, its value is represented as a function of time. More precisely, for an existing file, its I value at time instance t is defined as
I(f, H, t) = |Φ|    (1)
where H is a given length of a time quantum during which the interaction-intensity value is calculated; it is a constant. The set Φ in Eq. (1) is defined as below:
Φ = {(fi, ti) | fi = f ∧ max(0, t − H) ≤ ti ≤ t}    (2)
where t is the current system time and (fi, ti) denotes a file access request with file name fi and request arrival time ti.
The intuitive meaning of I(f, H, t) is that, for the file f at time instant t, it is the total number of requests for this file submitted to HDFS during the last time quantum.
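The intensity count of Eqs. (1)–(2) can be computed directly from a request log; the log format below is an assumption for illustration.

```python
# I(f, H, t): number of requests for file f that arrived in the window
# [max(0, t - H), t], per Eq. (2). The request log is a list of
# (file_name, arrival_time) pairs, i.e., the (fi, ti) pairs of the text.

def interaction_intensity(log, f, H, t):
    return sum(1 for fi, ti in log if fi == f and max(0, t - H) <= ti <= t)

log = [("f1", 1.0), ("f1", 4.5), ("f2", 5.0), ("f1", 9.0)]
# With H = 5 and t = 9, only requests arriving in [4, 9] count for f1.
```

Only the window boundary and the file-name match matter; the request type (read or write) is deliberately ignored, as in the definition.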
4.2. Trigger condition of re-allocation
The I value of a file is dynamic, and it determines the best storage place for the file, so re-allocating the file every time its I value changes could improve the throughput of the whole system. However, re-allocation is a resource-consuming procedure; using this procedure too frequently will make the performance even worse. Hence, it is necessary to set up trigger conditions for re-allocating the existing interaction-intensive files. Intuitively, there are two triggering conditions for each interaction-intensive file:
If a file's I value is larger than its upper bound Iu, it indicates that the current storage node cannot provide enough available I/O speed and thus this file should be moved to a faster storage node.

If a file's I value is smaller than its lower bound Il, it indicates that the available I/O speed provided by the current storage node is higher than the requirement of this file. In this case, the file should be moved to another storage node with relatively lower I/O speed; otherwise it causes resource waste, since there may not be enough storage space for the files with higher interaction-intensities.
Assume that there are n storage nodes, and for each of them the available I/O speed is denoted as S. Although both the access pattern and the data distribution can impact the performance of a hard disk, for simplicity, S is assumed to be the average case. The storage nodes are ordered by their available I/O speed, i.e., Si > Sj if i > j. In addition, assume that at time instance t, file f is at storage node Nj and the size of f is s. Then, within the constant time quantum H, the number of unprocessed requests U is defined as follows:

U(f, j, H, t) = I(f, H, t) − (Sj · H)/s    (3)
Suppose that the file is then re-allocated to the next storage node Nj+1, which provides a higher available I/O speed than node Nj does. Also assume that the I value of this file remains unchanged in the next time quantum, i.e., I(t + H) = I(t). Then, if the time cost of data migration (denoted as Dm) of this file is counted in, the number of expected unprocessed requests U′ for this file in the next time quantum is as follows:

U′(f, j + 1, H, t) = I(f, H, t) − Sj+1 · (H − Dm)/s    (4)

In Eq. (4), the value of the migration time cost Dm is calculated as follows:

Dm = s/d    (5)

where d represents the network bandwidth between nodes Nj and Nj+1.
Hence, the improvement rate (denoted as Cr), based on the unprocessed requests of a file on node or cache j versus node j + 1, is calculated as below:

Cr = (U − U′)/U    (6)

By comparing the Cr value with a given non-negative threshold ε, it can be determined whether a file should be re-allocated in the next time quantum. In other words, if a file's current Cr value is larger than ε, it should be re-allocated in the next time quantum; otherwise it should stay on the current node without invoking the allocation algorithm for a new storage destination. Therefore, the first triggering condition is Cr = ε. The threshold ε is an empirical value and is sensitive to the network configuration. From Eqs. (4)–(6), at time instance t, the upper bound Iu of a file's I value can be represented as

Iu = (Sj+1 · (H − s/d) − (1 − ε) · Sj · H) / (ε · s)    (7)
Clearly, if U(f, j, H, t) < 0, the node's or cache's I/O capacity is larger than the request load. Since the space of a high-speed storage node (such as a lightly loaded cache) is limited, the file should be moved to a lower-speed device right after the next time quantum starts. In other words, the second triggering condition is U = 0. Hence, the lower bound Il of a file's I value can be calculated as below:

Il = (Sj · H)/s    (8)
At the end of each time quantum, a file's I value is updated. If the new I value is either larger than its Iu value or smaller than its Il value, all the blocks of this file are re-assigned to the incoming block queue and then allocated to a new storage destination by the allocation algorithm.
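The quantum-end check can be sketched as follows. The bound formulas assume the unprocessed-request model of Section 4.2 (requests arriving in a quantum minus requests the node can serve in that quantum); the function names and the numeric values in the test are illustrative, and the units are arbitrary.

```python
# Quantum-end re-allocation check: a file whose I value leaves [Il, Iu]
# has its blocks re-queued for the allocation algorithm. Treat the
# exact bound formulas as an assumption of the sketch.

def lower_bound(S_j, H, s):
    # Il: the largest I value the current node serves exactly (U = 0).
    return S_j * H / s

def upper_bound(S_j, S_next, H, s, d, eps):
    # Iu: the I value at which moving to the faster node j+1 improves
    # the unprocessed-request count by the threshold ratio eps (Cr = eps).
    Dm = s / d  # migration time cost: file size over network bandwidth
    return (S_next * (H - Dm) - (1 - eps) * S_j * H) / (eps * s)

def needs_reallocation(I, S_j, S_next, H, s, d, eps):
    return I > upper_bound(S_j, S_next, H, s, d, eps) or I < lower_bound(S_j, H, s)
```

For example, with s = 10, H = 10, Sj = 50, Sj+1 = 100, d = 100, and eps = 0.5, the bounds evaluate to Il = 50 and Iu = 148, so an I value of 100 triggers no move while 200 does.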
4.3. Brief introduction to the Particle Swarm Optimization algorithm
Utilizing the I value, we present a Particle Swarm Optimization (PSO)-based storage allocation algorithm to decide storage locations for incoming file blocks. The concept of Particle Swarm Optimization (PSO) was first introduced in [10]. As an evolutionary algorithm, the PSO algorithm depends on the explorations of a group of independent agents (the particles) during their search for the optimum solution within a given solution domain (the search space). Each particle makes its own decision about its next movement in the search space using both its own experience and the knowledge shared across the whole swarm. Eventually, the swarm as a whole is likely to converge to an optimum solution. The PSO procedure starts with an initial swarm of randomly distributed particles, each with a random starting position and a random initial velocity. The algorithm then iteratively updates the position of each particle over a time period to simulate the movement of the swarm toward its search target. The position of each particle is updated using its velocity vector as an increment. This velocity vector is generated from two factors [11]. The first factor is the best position a particle has ever reached through the iterations. The second factor is the best position the whole swarm has ever detected. The optimization procedure is terminated when the number of iterations has reached a given value or all the particles have converged into an area within a given radius.
To apply PSO to the storage allocation problem, we need to (1) define the particle structure and the search space; (2) define the optimization objective functions for both the MNN and SNN layers; and (3) use the PSO approach to explore the solution domain and eventually derive a near-optimal solution vector. This solution vector is the allocation plan that maximizes the overall throughput of an incoming batch of file blocks.
On both the MNN and the SNN layers, a Particle Swarm Optimization (PSO)-based storage allocation algorithm is applied to search for the optimal allocation plans. The following two subsections define the solution vector as the particle structure, the value domain as the search space, and the I/O time cost function as the objective function for applying PSO in these two layers, respectively.
4.4. Storage allocation at the MNN layer
In the MNN layer, the MNN only allocates blocks to either racks or caches. The solution vector (VEC_M) at the MNN layer is used as an allocation plan that indicates the storage place for each incoming file block. In this section, we first define the structure of the solution vector VEC_M, and then introduce the cost function used to evaluate the quality of a given solution vector. Within the solution vector VEC_M, M denotes the set of IDs of incoming file blocks, and the queue BQ contains the IDs of all the file blocks in M. As HDFS uses the first-come-first-served policy to schedule tasks, the queue BQ is ordered by the blocks' arrival times. The subset MI ⊆ M contains all the interaction-intensive blocks in M. If there are n racks in the SNN layer, then to the MNN there are 2n datanodes: the first n datanodes represent the datanode pools of the racks, and the other n datanodes represent the caches of the racks. For example, assume that the MNN decides that a block be allocated to the k-th datanode: if k ≤ n, this block is placed into the datanode pool of rack k; otherwise, if k > n, the block is actually allocated to the cache on the (k − n)-th rack. We then define the solution vector at the MNN layer as VEC_M. Its size is |M|, and VEC_M[i] represents the location where block BQ[i] is allocated. The value domain of entry i in vector VEC_M is defined as follows:

VEC_M[i] ∈ [1, 2n] if BQ[i] ∈ MI;  VEC_M[i] ∈ [1, n] otherwise    (9)
As given in Eq. (9), for blocks in MI, the value domain of their corresponding entries in VEC_M is [1, 2n], which indicates that the blocks in MI may be stored in a cache. For the rest of the blocks, as their domain is [1, n], they can only be allocated to datanodes in a rack. As an example, assume that there are 5 racks, i.e., n = 5, and an incoming block BQ[i] is in the set MI; then the value domain of entry VEC_M[i] is the range [1, 10]. If VEC_M[i] = 3, it indicates that the incoming block BQ[i] is allocated to rack 3, while if VEC_M[i] = 9, the block is allocated to the cache of rack 4.
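The entry-to-location convention described above (values 1..n address rack datanode pools, values n+1..2n address rack caches) can be sketched as:

```python
# Decode one entry of the MNN-layer solution vector: values 1..n address
# the datanode pool of a rack, values n+1..2n address the cache of a rack.

def decode_entry(k, n):
    if not 1 <= k <= 2 * n:
        raise ValueError("entry outside the value domain [1, 2n]")
    if k <= n:
        return ("rack", k)       # block goes to rack k's datanode pool
    return ("cache", k - n)      # block goes to the cache of rack k - n
```

With n = 5, entry 3 decodes to the datanode pool of rack 3 and entry 9 to the cache of rack 4, matching the example in the text.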
To compare the interaction-intensity of incoming files with those already in the cache, an indicator γ is introduced for each incoming file to represent the ratio of the file's current I value to the average I value of all cached files. The definition of the indicator γ for file f is given below:

γ(f) = I(f, H, t) / ( (1/N) · Σ_{k=1}^{N} I(fk, H, t) )    (10)

where N is the total number of file blocks cached in the SNN layer.
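Computing the indicator as a ratio against the mean cached intensity is straightforward; a minimal sketch (names are illustrative):

```python
# gamma(f): ratio of file f's current intensity value to the average
# intensity value of the N blocks currently cached in the SNN layer.

def gamma(I_f, cached_I_values):
    N = len(cached_I_values)
    avg = sum(cached_I_values) / N   # mean intensity over N cached blocks
    return I_f / avg
```

A value above 1 means the incoming file is hotter than the current cache occupants, which is exactly the case where displacing a cached block pays off.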
Given the indicator γ, assume that the incoming files are allocated to rack r and that Ar blocks are assigned to rack r; the cost function Cost for the solution vector VEC_M in the MNN layer is defined as

Cost = Σ_r ( w1 · α(r) + w2 · β(r) )    (11)

where the value of Cost is determined by two factors: the estimated cost of placing blocks into caches (α) and the estimated cost of placing blocks into datanodes (β); w1 and w2 are the weight factors of these two components.

Allocating blocks with a high (low) γ value to a storage place that has a low (high) I/O workload, such as an idle cache, reduces the Cost value, while putting blocks with a low (high) γ value into a low (high) workload storage place increases the Cost value. As a smaller cost for I/O tasks brings larger throughput to the system, Eq. (11) becomes the objective function for the PSO-based algorithm.
In Eq. (11), α(r) is defined by Eq. (12); we leave the definition of β(r) to the next subsection. In Eq. (12), the array R records the size of the remaining cache space available in each rack; arrays Ract (Wact) and Ravg (Wavg) record the number of active reading (writing) channels connected to the cache and the average reading (writing) throughput of the cache in each rack, respectively; they are obtained from HDFS. B denotes the block size defined by the system. The constants w3 and w4 are the weight factors used to measure the ratio of reading and writing frequencies of the corresponding task.
P is a coefficient used to introduce a penalty into the cost function for using caches. As the total space of the caches is scarce compared to the storage volume provided by the datanodes, the penalty P^(B·Ar/Rr) increases exponentially with the ratio of the required cache space to the remaining cache space. As a result, when the cache is nearly full, the PSO is more likely to allocate the interaction-intensive blocks to the datanodes with lighter workload rather than to the cache. Furthermore, if the size of the blocks assigned to the cache of rack r is larger than its available space, i.e., B · Ar > Rr, the value of α(r) is greatly scaled up and this allocation plan is unlikely to be chosen by the PSO.
4.5. Storage allocation at the SNN layer
For each allocation solution generated by the MNN during the PSO search procedure, the SNNs calculate their own feedback factor β. Based on β, the MNN can then evaluate the quality of this possible solution using Eq. (11). In fact, the factor β itself is a quality evaluation criterion for the allocation plans generated at the SNN layer; in other words, it is the objective function for the PSO applied in this layer.
For the SNN in rack r, its solution vector is defined as VEC_S^r. Using Sr to denote the set of blocks assigned by the MNN to rack r, the cardinality of VEC_S^r is |Sr|. Similar to the solution vector VEC_M of the MNN, each entry in VEC_S^r indicates the storage destination allocated to the corresponding block assigned by the MNN to this rack. Using Dr to denote the number of datanodes in rack r, the value domain of each entry in VEC_S^r is [1, Dr], i.e., each block in the set Sr can be placed into any datanode in this rack. For the SNN in rack r, its cost function βr is defined in Eq. (13) as the weighted sum of the predicted reading and writing time costs over the datanodes of rack r,
where EW(i) and ER(i) are the predicted writing and reading speeds of node i; they are given by the auto-regressive integrated moving average prediction model presented in [16]. w3 and w4 are the weight coefficients for the reading and writing time costs, and Bi records the number of blocks allocated to datanode i. For datanode i, let arrays w and r record the average writing and reading speeds, respectively, after each allocation, and let two further arrays record the CPU occupation rate and the available network bandwidth of datanode i after each allocation. For the j-th allocation of datanode i, EW(i) is calculated by the auto-regressive prediction of Eq. (14), where E′W(i) is the estimated writing speed of the (j − 1)-th allocation on datanode i and AD is the regression coefficient array, computed from past occurrences in Eq. (15). In Eq. (14), the performance decrease is represented as E′W(i) − w(j − 1), and e is the weight factor for it. ER(i) is calculated in a similar way to Eq. (14), with EW replaced by ER, w replaced by r, and E′W replaced by E′R.

4.6. Apply PSO to the storage allocation problem
In the case of the storage allocation problem, a particle in the PSO is a candidate allocation plan. The structures of the particles in the MNN and SNN layers are VEC_M and VEC_S, respectively.
A particle moves within the search space from one point to another. The edge of the search space is defined by the value domain of the solution vector. Each point in the search space represents an allocation combination. Since the incoming file blocks have different interaction-intensities and the storage places in HDFS have different I/O performances, different combinations can provide different I/O throughput for an incoming batch of files. When one combination is selected (i.e., a particle moves to the point in the search space corresponding to this combination), the estimated cost of this combination (the quality of the corresponding point) can be evaluated by the objective function. In our scenario, minimum cost indicates maximum throughput. Furthermore, the terminating condition of the PSO applied in this scenario is reaching a given number of iterations.
Let array px represent the coordinates of a particle's current location, array bx represent the coordinates of the best known position in the history of this particle, and array gx denote the coordinates of the best known position in the history of the entire swarm. According to the PSO procedure, the movement of a particle between two iterations is determined by the velocity vector, denoted by array Vx, which has the same cardinality as the particle. In this array, Vx[i] represents the i-th velocity component, which is determined by

Vx[i] = ω · Vx[i] + C1 · r1 · (bx[i] − px[i]) + C2 · r2 · (gx[i] − px[i])    (16)

where ω is the inertia weight, C1 and C2 are the acceleration coefficients, and r1 and r2 are random numbers uniformly distributed in [0, 1].
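The velocity rule above is the standard PSO update; the sketch below applies it to one particle, with illustrative parameter values and a seeded random generator for reproducibility.

```python
import random

# One PSO iteration step for a single particle: the new velocity blends
# the old velocity (inertia w), the pull toward the particle's own best
# position bx (weight c1), and the pull toward the swarm's best position
# gx (weight c2). r1 and r2 are uniform random numbers in [0, 1].

def pso_step(px, vx, bx, gx, w=0.7, c1=1.5, c2=1.5, rng=random.Random(0)):
    new_v, new_p = [], []
    for i in range(len(px)):
        r1, r2 = rng.random(), rng.random()
        v = w * vx[i] + c1 * r1 * (bx[i] - px[i]) + c2 * r2 * (gx[i] - px[i])
        new_v.append(v)
        new_p.append(px[i] + v)   # position is advanced by the velocity
    return new_p, new_v
```

For the allocation problem, each coordinate would afterwards be clamped and rounded into the integer value domain of the solution vector; that discretization step is omitted here.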
4.7. Configure the PSO's parameters with evolutionary state estimation

Since the PSO applied in the storage allocation algorithm has a limited number of iterations, the speed of particle convergence is critical to the quality of the final solution. In addition, the local optima phenomenon can greatly decrease the quality of the final result. Therefore, accelerating the convergence speed and avoiding local optima have become the two most important goals for the PSO. To improve the quality of the final solution, the Evolutionary State Estimation (ESE) technique [18] is introduced into the storage allocation algorithm.
The basic concept of the ESE is to use different parameter configurations in four different PSO search phases: exploration, exploitation, convergence, and jumping out. Particles in different phases should have different behaviors, which are determined by the inertia weight ω and the two acceleration coefficients C1 and C2. To determine the search phase, an evolutionary factor f is introduced. This factor is based on the aggregation status of the whole swarm. Take the swarm in the MNN layer as an example. Denote the mean distance of each particle i to the other particles as di; it is defined as

di = (1/(N − 1)) · Σ_{j≠i} sqrt( Σ_{k=1}^{|M|} (x_i^k − x_j^k)^2 )    (17)

where N and |M| are the swarm size and the number of dimensions, respectively. Denote the di of the globally best particle as dg. Comparing all the di values, determine the maximum and minimum distances dmax and dmin. Then, the evolutionary factor f is defined by

f = (dg − dmin) / (dmax − dmin)    (18)
According to [14,18], the inertia weight ω can be determined from f as follows:

ω(f) = 1 / (1 + 1.5 · e^(−2.6·f)) ∈ [0.4, 0.9]    (19)
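The mean distances, the evolutionary factor, and the inertia-weight mapping can be sketched as follows; the sigmoid form of the weight mapping follows the adaptive-PSO formulation cited above.

```python
import math

# Evolutionary State Estimation ingredients: the mean distance of each
# particle to all others, the evolutionary factor f derived from the
# globally best particle's distance, and the inertia weight w(f).

def mean_distances(positions):
    N = len(positions)
    dists = []
    for i in range(N):
        total = 0.0
        for j in range(N):
            if i != j:
                total += math.sqrt(sum((positions[i][k] - positions[j][k]) ** 2
                                       for k in range(len(positions[i]))))
        dists.append(total / (N - 1))   # average over the other N-1 particles
    return dists

def evolutionary_factor(dists, g_best_index):
    dg, dmin, dmax = dists[g_best_index], min(dists), max(dists)
    return (dg - dmin) / (dmax - dmin)  # normalized into [0, 1]

def inertia_weight(f):
    # w(f) = 1 / (1 + 1.5 * exp(-2.6 f)), which stays within [0.4, 0.9].
    return 1.0 / (1.0 + 1.5 * math.exp(-2.6 * f))
```

A small f means the best particle sits in a tight cluster (convergence phase, small inertia); a large f means it is far from the crowd (exploration or jumping-out phase, large inertia).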
In addition, based on the f factor, the current search phase can be determined, and then a proper combination of acceleration parameters can be chosen. In this paper, the determination of the phase and the selection of the acceleration parameters are defined as in Table 2.
The PSO-based algorithm with ESE is illustrated in Algorithm 1.
4.8. Allocation algorithm implementation
The procedures of the I-I value updating, the re-allocation bounds calculation, and the storage allocation algorithm are implemented on different components of the extended multilayer HDFS structure. The MNN is in charge of updating a file's I-I value. Since all incoming requests are handled by the MNN first, the MNN has knowledge of the access status of all existing files; thus, the MNN can easily calculate the I-I value for each file. Moreover, recording the access count for a file causes only negligible overhead. As the workload of the MNN is greatly alleviated in the extended structure, the MNN can provide enough computing capacity to support the updating. In addition, the MNN is used to calculate the re-allocation bounds for an interaction-intensive file at the beginning of each time quantum. To the MNN, a rack of datanodes is considered a single storage node, so in the view of the MNN there are 2n nodes, i.e., n caches and n racks of datanodes. By sending its heartbeat report, each SNN submits both its cache's status and the average I/O speed of the datanodes in its rack to the MNN; hence the MNN has knowledge of the available I/O speed of all its 2n "nodes". As the calculation of both bounds for an interaction-intensive file needs the I/O speed information of all the storage nodes, the bound calculation procedure can only be implemented on the MNN. This implementation brings a negligible workload increase to the MNN. For the PSO-based storage allocation algorithm, both the MNN and the SNNs are involved. In each iteration of the PSO procedure, the MNN is in charge of calculating the Cost value depicted in Eq. (11) and the velocity V in Eq. (16) for each particle. To do so, both the Y value in Eq. (12) and the Z value in Eq. (13) are needed. Since the MNN can obtain the running status of the caches on the SNNs, Y can be calculated locally on the MNN. On the other hand, since the datanodes only send their heartbeat reports to an SNN, the MNN is not aware of the existence of the datanodes; hence each SNN is in charge of calculating the Z value for each particle. When a new iteration of PSO begins, the MNN broadcasts the location of each particle to all the SNNs while calculating the Y value for each particle. In the meantime, the SNNs calculate the Z value for each particle using the particle location information given by the MNN. It is easy to see that the MNN and the SNNs work in parallel. Moreover, since one block can only be allocated to one storage place, the calculation of Z on each SNN is independent, which means all the SNNs are also
working in parallel. As a result, the PSO-based storage allocation algorithm runs very efficiently on the extended HDFS structure.

After the Y values and Z values of all the particles are worked out, the MNN calculates the Cost value, the velocity V, and the new position of each particle. Then, a new iteration is launched if the stop condition is not satisfied.
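The division of labor in one iteration can be sketched as follows. This is a minimal illustration only: y_value and z_value_on_snn are hypothetical placeholders (the paper's Eqs. (12) and (13) depend on cache status and datanode I/O speeds not reproduced here), threads stand in for the SNN processes, and the Cost combination is a simple sum standing in for Eq. (11).

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder stand-ins for the paper's Eq. (12) and Eq. (13).
def y_value(particle):                  # computed locally on the MNN
    return sum(particle) * 0.5

def z_value_on_snn(snn_id, particle):   # computed independently on each SNN
    return len(particle) + snn_id

def pso_iteration(particles, num_snns):
    """One iteration: SNNs compute Z for every particle in parallel while the
    MNN computes Y; the MNN then combines them into a Cost per particle."""
    with ThreadPoolExecutor(max_workers=num_snns) as pool:
        z_futures = {(s, i): pool.submit(z_value_on_snn, s, p)
                     for s in range(num_snns)
                     for i, p in enumerate(particles)}
        y_values = [y_value(p) for p in particles]   # overlaps with SNN work
        costs = []
        for i, y in enumerate(y_values):
            z_total = sum(z_futures[(s, i)].result() for s in range(num_snns))
            costs.append(y + z_total)                # placeholder for Eq. (11)
    return costs
```

Because each Z contribution is independent, the futures never wait on one another, mirroring the claim that the SNNs work in parallel.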
II. ASSUMPTIONS AND GOALS
1) Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
2) Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications targeted for HDFS. POSIX semantics in a few key areas have been traded to increase data throughput rates.
3) Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
4) Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future.
5) "Moving Computation is Cheaper than Moving Data"
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
6) Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
Figure 1: How the HDFS architecture works.
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
HDFS represents a file system namespace and enables user data to be stored as block-based file chunks over DataNodes. Specifically, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode also keeps a reference to these blocks in a block pool. The NameNode executes HDFS namespace operations like opening, closing, and renaming files and directories. The NameNode also maintains and determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon receiving instructions from the NameNode.
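The split-and-place bookkeeping described above can be sketched as follows. This is a toy illustration, not the HDFS implementation: the block naming, the dict standing in for the NameNode's block map, and the round-robin placement are all simplifications (real HDFS uses rack-aware replica placement).

```python
BLOCK_SIZE = 64 * 1024 * 1024   # a typical HDFS block size (64 MB)

def split_file(path, file_size, block_size=BLOCK_SIZE):
    """Return the (block_id, length) pairs a file is chopped into;
    the last block may be shorter than block_size."""
    blocks, offset, seq = [], 0, 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((f"{path}#blk_{seq}", length))
        offset += length
        seq += 1
    return blocks

def place_blocks(blocks, datanodes, replication=3):
    """Toy NameNode table: block id -> the DataNodes holding a replica.
    Round-robin placement stands in for rack-aware placement."""
    block_map = {}
    for i, (block_id, _) in enumerate(blocks):
        chosen = [datanodes[(i + r) % len(datanodes)]
                  for r in range(replication)]
        block_map[block_id] = chosen
    return block_map
```

A 150 MB file, for example, becomes two full 64 MB blocks plus one 22 MB tail block, each tracked separately by the NameNode.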
The NameNode stores the HDFS namespace. The NameNode keeps a transaction log called the EditLog to continuously record every modification that takes place in the file system metadata. For instance, creating a new file in HDFS causes an event that asks the NameNode to insert a record into the EditLog representing this action. Likewise, changing the replication factor of a file causes another new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too.
The image of the entire file system namespace and the file Blockmap is kept in the NameNode's system memory (RAM). This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes this new version out into a new FsImage on disk. It can then truncate the old EditLog, because its transactions have been applied to the persistent FsImage. This procedure is known as a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up.
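The checkpoint cycle just described can be sketched with toy on-disk formats. This is an illustration only, not the real FsImage/EditLog encoding: here the FsImage is a JSON dict mapping paths to metadata, and each EditLog line is one JSON record such as {"op": "create", "path": "/b", "meta": {...}}.

```python
import json

def checkpoint(fsimage_path, editlog_path):
    """Replay the EditLog over the on-disk FsImage, write a fresh FsImage,
    then truncate the now-applied EditLog."""
    with open(fsimage_path) as f:
        namespace = json.load(f)            # in-memory image of the namespace
    with open(editlog_path) as f:
        for line in f:                      # replay every logged transaction
            rec = json.loads(line)
            if rec["op"] == "create":
                namespace[rec["path"]] = rec["meta"]
            elif rec["op"] == "delete":
                namespace.pop(rec["path"], None)
    with open(fsimage_path, "w") as f:      # flush the new FsImage to disk
        json.dump(namespace, f)
    open(editlog_path, "w").close()         # truncate the applied EditLog
    return namespace
```

After the call, the FsImage alone reflects all logged changes, which is why the old EditLog can be safely discarded.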
The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses heuristics to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory, because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to these local files, and sends this report to the NameNode: this is the Blockreport.
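The startup scan that produces the Blockreport can be sketched as a directory walk. The "blk_" file-name convention used here is an assumption for illustration; the real on-disk layout and report encoding differ.

```python
import os

def block_report(data_dir):
    """Walk a DataNode's local storage tree and report every block file found.
    Toy convention: any file whose name starts with "blk_" is one HDFS block."""
    report = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.startswith("blk_"):
                path = os.path.join(root, name)
                report.append({"block": name, "bytes": os.path.getsize(path)})
    return sorted(report, key=lambda r: r["block"])
```

Note that the walk finds blocks regardless of which subdirectory the heuristics placed them in, which is exactly why the DataNode can spread block files across many directories.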
Data Organization

9.1 Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once, but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
9.2 Staging

A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the
NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost. The above approach has been adopted after careful consideration of the target applications that run on HDFS. These applications need streaming writes to files. If a client wrote to a remote file directly without any client-side buffering, the network speed and the congestion in the network would impact throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.
9.3 Replication Pipelining

When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository, and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository, and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.
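The portion-by-portion pipeline above can be sketched as follows. This sequential sketch captures the store-and-forward order but not the concurrency: real DataNodes receive and forward simultaneously, and dicts stand in for their local repositories.

```python
def pipeline_write(block, replicas, portion_size=4 * 1024):
    """Push one block through a replica pipeline in small portions (4 KB by
    default, as in the text). Each node stores the portion in its local
    repository, then forwards it to the next node in the list."""
    for offset in range(0, len(block), portion_size):
        portion = block[offset:offset + portion_size]
        for node in replicas:               # store-and-forward down the chain
            node.setdefault("data", b"")
            node["data"] += portion
    return [node["data"] for node in replicas]
```

Because portions flow through the chain rather than being sent from the client three times, the client's upstream bandwidth is spent only once per block.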
10. Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.
10.1 FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a command-line interface called the FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to that of other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:

The FS shell is targeted at applications that need a scripting language to interact with the stored data.
10.2 DFSAdmin
The DFSAdmin command set is used for
administering an HDFS cluster. These are
commands that are used only by an HDFS
administrator. Here are some samle action!
command airs"
10.3 Browser Interface

A typical HDFS installation configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.
Space Reclamation

File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
A user can undelete a file after deleting it as long as it remains in the /trash directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory, with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from
/trash that are more than 6 hours old. In the future, this policy will be made configurable through a well-defined interface.
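The delete/undelete/expiry life cycle above can be sketched with plain dicts standing in for the namespace and the /trash directory; the function names and data shapes are illustrative, not the HDFS API.

```python
TRASH_INTERVAL = 6 * 60 * 60        # the default 6-hour policy from the text

def delete(namespace, trash, path, now):
    """Deleting a file moves its entry into /trash instead of freeing it."""
    trash[path] = (namespace.pop(path), now)

def undelete(namespace, trash, path):
    """Restore a file while its latest copy is still in /trash."""
    meta, _deleted_at = trash.pop(path)
    namespace[path] = meta

def purge_expired(trash, now, interval=TRASH_INTERVAL):
    """NameNode-side cleanup: drop /trash entries older than the interval."""
    for path in [p for p, (_, t) in trash.items() if now - t > interval]:
        del trash[path]
```

Only the purge step actually frees space, which matches the delay the text notes between deleting a file and seeing free space increase.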
Decrease Replication Factor

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks, and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.