8/18/2019 IEEERepSpec
A Study of Hadoop Distributed File System (HDFS)
Author's Name/s per 1st Affiliation (Author)
line 1 (of Affiliation): dept. name of organization
line 2: name of organization, acronyms acceptable
line 3: City, Country
line 4: e-mail address if desired
Author's Name/s per 2nd Affiliation (Author)
line 1 (of Affiliation): dept. name of organization
line 2: name of organization, acronyms acceptable
line 3: City, Country
line 4: e-mail address if desired
Abstract —
The Hadoop Distributed File System (HDFS) is designed to run on commodity hardware and can be used as a stand-alone, general-purpose distributed file system (HDFS User Guide). It provides the ability to access bulk data with high I/O throughput. As a result, this system is suitable for applications that have large I/O data sets. However, the performance of HDFS decreases dramatically when handling the operations of interaction-intensive files, i.e., files that have relatively small size but are frequently accessed. This paper analyzes the cause of the throughput degradation issue when accessing interaction-intensive files and presents an enhanced HDFS architecture, along with an associated storage allocation algorithm, that overcomes the performance degradation problem. Experiments have shown that with the proposed architecture together with the associated storage allocation algorithm, the HDFS throughput for interaction-intensive files increases 300% on average, with only a negligible performance decrease for large data set tasks.
1. HDFS
The Hadoop Distributed File System (HDFS), a subproject of the Apache Hadoop project, is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware. HDFS provides high-throughput access to application data and is suitable for applications with large data sets. This article explores the primary features of HDFS and provides a high-level view of the HDFS architecture.

The Hadoop Distributed File System (HDFS) is designed as a massive data storage framework and serves as the storage component of the Apache Hadoop platform. This file system is based on commodity hardware and provides highly reliable storage and global access with high throughput for large data sets [15]. Because of these advantages, HDFS is also used as a stand-alone, general-purpose distributed file system and serves non-Hadoop applications [6].
However, the advantage of high throughput that HDFS provides diminishes quickly when handling interaction-intensive files, i.e., files that are of small size but are accessed frequently. The reason is that before an I/O transmission starts, there are a few necessary initialization steps that need to be completed, such as data location retrieval and storage space allocation. When transferring large data, this initialization overhead becomes relatively small and can be negligible compared with the data transmission itself. However, when transferring small data, this overhead becomes significant. In addition to the initialization overhead, files with high I/O data access frequencies can also quickly overburden the regulating component of HDFS, i.e., the single namenode that supervises and manages every access to the datanodes [1]. If the number of datanodes is large, the single namenode can quickly become a bottleneck when the frequency of I/O requests is high.

In many systems, frequent file access is unavoidable; log file updating, for example, is a common procedure in many applications. Since HDFS applies the "Write-Once, Read-Many" rule, such frequent update operations make the problem even more pronounced.
In this paper, we use HDFS as a stand-alone file system and present an integrated approach to addressing the HDFS performance degradation issue for interaction-intensive tasks. In particular, we extend the HDFS architecture by adding cache support and transforming the single namenode into an extended hierarchical namenode architecture. Based on the extended architecture, we develop a Particle Swarm Optimization (PSO)-based storage allocation algorithm to improve the HDFS throughput for interaction-intensive tasks.
The rest of the paper is organized as follows. Section 2 discusses related work, focusing on the cause of throughput degradation when handling interaction-intensive tasks and possible solutions developed for the problem. Section 3 presents an enhanced HDFS namenode structure. Section 4 describes the proposed PSO-based storage allocation algorithms to be deployed on the extended structure. Experimental results are presented and analyzed in Section 5. Finally, in Section 6, we conclude the paper with a summary of our findings and point out future work.
2. Related work

As the application of HDFS increases, the pitfalls of HDFS are also being discovered and studied. Among them is the poor performance when HDFS handles small, frequently accessed files. As pointed out in [9], HDFS is designed for processing big data transmission rather than transferring a large number of small files; hence it is not suitable for interaction-intensive tasks.
Shvachko et al. [15] notice that HDFS sacrifices the immediate response time of individual I/O requests to gain better throughput for the entire system. In other words, HDFS chooses the strategy of enlarging the data volume of a single transmission to decrease the ratio of the transmission initialization overhead in the overall operation. The transmission initialization includes permission verification, access authentication, locating existing file blocks for reading tasks, allocating storage space for incoming writing tasks, and establishing or maintaining transmission connections. Another issue with the current HDFS architecture is its single-namenode structure, which can quickly become the bottleneck in the presence of interaction-intensive tasks and hence cause other incoming I/O requests to wait for a long period of time before being processed [1].
Efforts have been made to overcome these shortcomings of HDFS. For instance, Liu [13] combines several small files into a large file and then uses an extra index to record their offsets. Hence, the number of files is reduced and the efficiency of processing access requests is increased. To enhance HDFS's ability to handle small files, Chandrasekar et al. [3] also use a strategy that merges small files into a larger file and records the locations of file blocks inside the client's cache. This strategy also allows the client application to circumvent the namenode and directly contact the datanodes, which alleviates the bottleneck issue on the namenode. In addition, this strategy uses pre-fetching in its I/O process to further increase the system throughput when dealing with small files. The drawback of the strategy is that it requires client-side code changes.
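The merge-and-index strategy summarized above can be sketched in a few lines; the class and method names below are illustrative assumptions, not the API of either cited system.

```python
# Sketch of the small-file merging strategy described above: many small
# files are packed into one large container file, and a separate index
# records each file's (offset, length) so it can be read back directly.
# All names here are illustrative; this is not the cited implementations.

class MergedFile:
    def __init__(self):
        self.data = bytearray()   # contents of the large container file
        self.index = {}           # file name -> (offset, length)

    def append(self, name, content: bytes):
        # Record where this small file starts inside the container.
        self.index[name] = (len(self.data), len(content))
        self.data.extend(content)

    def read(self, name) -> bytes:
        # One index lookup replaces a per-file namenode round trip.
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])

merged = MergedFile()
merged.append("log-a", b"alpha")
merged.append("log-b", b"bravo")
```

The index makes the number of "files" seen by the namenode one, regardless of how many small files are packed inside.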
Preventing access requests from frequently interacting with the datanodes at the bottom layer of HDFS can also significantly improve the I/O performance of interaction-intensive tasks. As illustrated in [9], one way to achieve this is to add caches to the original HDFS structure. In [17], a shared cache strategy named HDCache, which is built on top of HDFS, is presented to improve the performance of small-file access.
Liao et al. [12] adapt the conventional hierarchical structure for HDFS to increase the capability of the namenode component. Borthakur [2] discusses the architecture of HDFS in detail and presents several possible ways to apply the hierarchical concept to HDFS. The authors of [5] provide a solution called EDS, which introduces middleware and hardware components between the client application and HDFS to implement a hierarchical storage structure. This structure uses multiple layers of resource allocators to ease the quick-saturation issue of the single-namenode structure.
Designing optimized storage allocation algorithms is another effective way to improve the throughput of HDFS. In [7], a load re-balancing algorithm is developed to always keep each file in an optimal location in a dynamic HDFS environment to gain the best throughput. An algorithm for optimizing file placement in the Data Grid Environment (DGE) is presented in [8]. This algorithm consists of a data placement and migration strategy that helps the regulating components (i.e., the namenode in HDFS) of a distributed data-sharing system place files in a large number of data servers.
As can be seen from the brief summary above, to overcome the throughput degradation issue for interaction-intensive tasks, most research in this area focuses on three mechanisms: optimizing metadata or adding caches, using an extended structure for the regulating component, and designing new storage allocation algorithms. However, if these mechanisms are applied individually, the bottleneck issues are only shifted from one place to another, rather than being rooted out. The solution presented in this paper integrates all three approaches and solves the performance degradation issue for interaction-intensive tasks from the root. Our experimental results have shown successful evidence of the proposed approach.
3. Extended namenode with cache support
In this section, an extended namenode architecture for HDFS with cache support is specified. The following two subsections describe the details of the structure and its functionalities.

3.1. Extended namenode structure

The original HDFS structure, which consists of a single namenode, is redesigned with a two-layer namenode structure, as shown in Fig. 1. The first layer contains a Master Name Node (MNN), which is actually the namenode in the original HDFS structure. The second layer consists of a set of Secondary Name Nodes (SNNs), which are applied to every rack of datanodes. Reliable caches, such as SSDs, are deployed on every SNN. It is worth pointing out that the SNN does not have to be a new machine; it can be configured from an existing datanode machine, though in our implementation we use a dedicated server for each SNN. To the MNN, each SNN is its datanode; to the datanodes, the SNN in their rack is their namenode. Hence, the new structure fully preserves the original HDFS relations between the MNN and SNN, and between the SNN and datanode structures. Therefore, all the original functions and mechanisms of HDFS are intact, and the number of modifications needed for the structural extension is minimal.

The addition of the new layer to the original HDFS structure changes both the reading and the writing procedures. Specifically, when a reading request arrives, the MNN first inquires the SNNs and then navigates the client application to the destination racks (Step 1 in Fig. 1). Afterwards, the SNN in the rack checks its mapping table and continues navigating the client application to the datanodes that hold the file blocks (Step 2). At the end, the client application contacts the datanodes and accesses the file blocks (Step 3-I). If a block is cached, the SNN plays the role of a datanode and lets the client application read the block from its cache (Step 3-II). Similarly, to write data into a datanode, the MNN first allocates one or more racks for the client application to write to (Step A). Then, the SNN further allocates datanodes for the client application (Step B). Afterwards, the client application begins to write file blocks into these datanodes (Step C-I). If the file block is assigned to the caches in that rack, then the SNN plays the role of a datanode and lets the client application write the file block into its cache (Step C-II). When the writing is done, the datanodes report their changes to the SNN and the SNNs report their changes to the MNN (Steps D and E), replicas of each incoming block are created on different datanodes (Step F), and a notification is sent to the client application.
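The two-layer lookup described above can be illustrated with a small sketch; the dictionary-based mapping tables and all names are simplifying assumptions, not the actual HDFS metadata structures.

```python
# Minimal sketch of the two-layer read path: the first layer maps a file
# to a rack, and the rack's second-layer node maps each block to either
# a datanode or its own cache. Plain dicts stand in for the real
# mapping tables, which are far richer in practice.

MNN_TABLE = {"fileA": "rack1"}                       # file -> rack (Step 1)
SNN_TABLES = {
    "rack1": {"blk1": "datanode3", "blk2": "cache"}  # block -> location (Step 2)
}

def locate_block(file_name, block_id):
    rack = MNN_TABLE[file_name]            # Step 1: master picks the rack
    location = SNN_TABLES[rack][block_id]  # Step 2: rack node picks the target
    if location == "cache":
        # Step 3-II: the rack node serves the block from its own cache
        return (rack, "SNN-cache")
    # Step 3-I: the client contacts the datanode directly
    return (rack, location)
```

Note that the master never sees individual datanodes; it only resolves racks, which is what shrinks its request load.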
3.2. Functionality of the SNN

As illustrated in Fig. 1, the SNN layer is deployed between the MNN layer and the datanode layer. Each rack of datanodes has an SNN that maintains a cache dedicated to this rack. This layer has three functionalities.
The first functionality is to keep the monitoring and messaging mechanisms between the original namenode and the datanodes unchanged. To maintain the original monitoring mechanism running on the MNN, the SNN submits all the information about its rack to the MNN after receiving the MNN's heartbeat inquiry. In addition, the SNN sends its own heartbeat inquiry to the datanodes in its rack in order to monitor their status. Furthermore, in order to keep HDFS's message mechanism intact, the SNN responds to all the messages generated by the MNN and sends its own control messages to the datanodes in its rack.
The second functionality of the SNN is to store interaction-intensive file blocks in its cache. Once a file is marked by the MNN as interaction-intensive and assigned to a rack, instead of relaying the writing request to the datanode layer, the SNN of this rack intercepts the request and stores the incoming blocks in its caches. As the I/O speed of a cache is much faster than that of the hard disks used in the datanodes, the throughput for interaction-intensive files is improved. Furthermore, as the number of SNNs is much smaller than the number of datanodes, file blocks in the cache can be located much faster and hence have a shorter I/O response time. Moreover, since the cost of transmission initialization on the datanode is high, the system overhead can be further reduced by using the SNN to prevent interaction-intensive tasks from frequently accessing the datanodes. When a file is no longer interaction-intensive, the SNN writes all of its cached file blocks to datanodes.
The third functionality is to increase the MNN's capability of handling frequent requests. In the original structure, the single namenode becomes the bottleneck when the datanode pool becomes large or when requests arrive too frequently. The new layer turns the single namenode component into a hierarchical namenode structure and thus significantly increases the capability of HDFS to manage a large resource pool and handle a large number of incoming requests. As a result of the new layer, the sizes of the datanode pools managed by the MNN and by individual SNNs are reduced. For the MNN, the size of its datanode pool is the number of SNNs. For each SNN, the size of its datanode pool is the size of the original datanode pool divided by the number of SNNs.
4. Storage allocation algorithm
As presented in Section 3, changing the single namenode to a hierarchical namenode structure increases HDFS's capability of handling frequent requests. In addition, the caches introduced in this structure provide faster data access with shorter response times. Under this new structure, the throughput degradation caused by access to interaction-intensive files can be further reduced by applying an optimized storage allocation strategy. The development of this strategy is done in three steps: for a given file, we need to determine (1) the file blocks' interaction-intensity value, (2) the re-allocation triggering condition, and (3) the file blocks' storage location based on the file's interaction-intensity.
4.1. Interaction-intensity
The interaction-intensity of a file, denoted as its I value, is a measurement of how frequently the file has been requested. A request can be either a read or a write request submitted by one or multiple applications. All the blocks of a file share the file's I property. Since the I property of a file varies from time to time, its value is represented as a function of time. More precisely, for an existing file, its I value at time instance t is defined as
I(f, H, t) = |Φ|    (1)
where H is a given length of a time quantum during which the interaction-intensity value is calculated; it is a constant. The set Φ in Eq. (1) is defined as below:
Φ = {(fi, ti) | fi = f ∧ max(0, t − H) ≤ ti ≤ t}    (2)
where t is the current system time and (fi, ti) denotes a file access request with file name fi and request arrival time ti.
The intuitive meaning of I(f, H, t) is that, for the file f at time instant t, it is the total number of requests for this file submitted to HDFS during the last time quantum.
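The intensity count of Eqs. (1)–(2) can be computed directly from a request log; the log format below is an assumption for illustration.

```python
# I(f, H, t): number of requests for file f that arrived in the window
# [max(0, t - H), t], per Eq. (2). The request log is a list of
# (file_name, arrival_time) pairs, i.e., the (fi, ti) pairs of the text.

def interaction_intensity(log, f, H, t):
    return sum(1 for fi, ti in log if fi == f and max(0, t - H) <= ti <= t)

log = [("f1", 1.0), ("f1", 4.5), ("f2", 5.0), ("f1", 9.0)]
# With H = 5 and t = 9, only requests arriving in [4, 9] count for f1.
```

Only the window boundary and the file-name match matter; the request type (read or write) is deliberately ignored, as in the definition.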
4.2. Trigger condition of re-allocation
The I value of a file is dynamic, and it determines the best storage place for the file, so re-allocating the file every time its I value changes could improve the throughput of the whole system. However, re-allocation is a resource-consuming procedure; using this procedure too frequently will make the performance even worse. Hence, it is necessary to set up trigger conditions for re-allocating the existing interaction-intensive files. Intuitively, there are two triggering conditions for each interaction-intensive file:
If a file's I value is larger than its upper bound Iu, it indicates that the current storage node cannot provide enough available I/O speed and thus this file should be moved to a faster storage node.

If a file's I value is smaller than its lower bound Il, it indicates that the available I/O speed provided by the current storage node is higher than the requirement of this file. In this case, the file should be moved to another storage node with relatively lower I/O speed; otherwise it causes resource waste, since there may not be enough storage space for the files with higher interaction-intensities.
Assume that there are n storage nodes, and for each of them the available I/O speed is denoted as S. Although both the access pattern and the data distribution can impact the performance of a hard disk, for simplicity, S is assumed to be the average case. The storage nodes are ordered by their available I/O speed, i.e., Si > Sj if i > j. In addition, assume that at time instance t, file f is at storage node Nj and the size of f is s. Then, within the constant time quantum H, the number of unprocessed requests U is defined as follows:

U(f, j, H, t) = I(f, H, t) − (Sj · H)/s    (3)
Suppose that the file is then re-allocated to the next storage node Nj+1, which provides a higher available I/O speed than node Nj does. Also assume that the I value of this file remains unchanged in the next time quantum, i.e., I(t + H) = I(t). Then, if the time cost of data migration (denoted as Dm) of this file is counted in, the number of expected unprocessed requests U′ for this file in the next time quantum is as follows:

U′(f, j + 1, H, t) = I(f, H, t) − Sj+1 · (H − Dm)/s    (4)

In Eq. (4), the value of the migration time cost Dm is calculated as follows:

Dm = s/d    (5)

where d represents the network bandwidth between nodes Nj and Nj+1.
Hence, the improvement rate (denoted as Cr), based on the unprocessed requests of a file on node or cache j versus node j + 1, is calculated as below:

Cr = (U − U′)/U    (6)

By comparing the Cr value with a given non-negative threshold ε, it can be determined whether a file should be re-allocated in the next time quantum. In other words, if a file's current Cr value is larger than ε, it should be re-allocated in the next time quantum; otherwise it should stay on the current node without invoking the allocation algorithm for a new storage destination. Therefore, the first triggering condition is Cr = ε. The threshold ε is an empirical value and is sensitive to the network configuration. From Eqs. (4)–(6), at time instance t, the upper bound Iu of a file's I value can be represented as

Iu = (Sj+1 · (H − s/d) − (1 − ε) · Sj · H) / (ε · s)    (7)
Clearly, if U(f, j, H, t) < 0, the node's or cache's I/O capacity is larger than the request load. Since the space of a high-speed storage node (such as a lightly loaded cache) is limited, the file should be moved to a lower-speed device right after the next time quantum starts. In other words, the second triggering condition is U = 0. Hence, the lower bound Il of a file's I value can be calculated as below:

Il = (Sj · H)/s    (8)
At the end of each time quantum, a file's I value is updated. If the new I value is either larger than its Iu value or smaller than its Il value, all the blocks of this file are re-assigned to the incoming block queue and then allocated to a new storage destination by the allocation algorithm.
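The quantum-end check can be sketched as follows. The bound formulas assume the unprocessed-request model of Section 4.2 (requests arriving in a quantum minus requests the node can serve in that quantum); the function names and the numeric values in the test are illustrative, and the units are arbitrary.

```python
# Quantum-end re-allocation check: a file whose I value leaves [Il, Iu]
# has its blocks re-queued for the allocation algorithm. Treat the
# exact bound formulas as an assumption of the sketch.

def lower_bound(S_j, H, s):
    # Il: the largest I value the current node serves exactly (U = 0).
    return S_j * H / s

def upper_bound(S_j, S_next, H, s, d, eps):
    # Iu: the I value at which moving to the faster node j+1 improves
    # the unprocessed-request count by the threshold ratio eps (Cr = eps).
    Dm = s / d  # migration time cost: file size over network bandwidth
    return (S_next * (H - Dm) - (1 - eps) * S_j * H) / (eps * s)

def needs_reallocation(I, S_j, S_next, H, s, d, eps):
    return I > upper_bound(S_j, S_next, H, s, d, eps) or I < lower_bound(S_j, H, s)
```

For example, with s = 10, H = 10, Sj = 50, Sj+1 = 100, d = 100, and eps = 0.5, the bounds evaluate to Il = 50 and Iu = 148, so an I value of 100 triggers no move while 200 does.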
4.3. Brief introduction to the Particle Swarm Optimization algorithm
Utilizing the I value, we present a Particle Swarm Optimization (PSO)-based storage allocation algorithm to decide storage locations for incoming file blocks. The concept of Particle Swarm Optimization (PSO) was first introduced in [10]. As an evolutionary algorithm, the PSO algorithm depends on the explorations of a group of independent agents (the particles) during their search for the optimum solution within a given solution domain (the search space). Each particle makes its own decision about its next movement in the search space using both its own experience and the knowledge shared across the whole swarm. Eventually, the swarm as a whole is likely to converge to an optimum solution. The PSO procedure starts with an initial swarm of randomly distributed particles, each with a random starting position and a random initial velocity. The algorithm then iteratively updates the position of each particle over a time period to simulate the movement of the swarm toward its search target. The position of each particle is updated using its velocity vector as an increment. This velocity vector is generated from two factors [11]. The first factor is the best position a particle has ever reached through the iterations. The second factor is the best position the whole swarm has ever detected. The optimization procedure is terminated when the number of iterations has reached a given value or all the particles have converged into an area within a given radius.
To apply PSO to the storage allocation problem, we need to (1) define the particle structure and the search space; (2) define the optimization objective functions for both the MNN and SNN layers; and (3) use the PSO approach to explore the solution domain and eventually derive a near-optimal solution vector. This solution vector is the allocation plan that maximizes the overall throughput of an incoming batch of file blocks.
On both the MNN and the SNN layers, a Particle Swarm Optimization (PSO)-based storage allocation algorithm is applied to search for the optimal allocation plans. The following two subsections define the solution vector as the particle structure, the value domain as the search space, and the I/O time cost function as the objective function for applying PSO in these two layers, respectively.
4.4. Storage allocation at the MNN layer
In the MNN layer, the MNN only allocates blocks to either racks or caches. The solution vector (VEC_M) at the MNN layer is used as an allocation plan that indicates the storage place for each incoming file block. In this section, we first define the structure of the solution vector VEC_M, and then introduce the cost function used to evaluate the quality of a given solution vector. Within the solution vector VEC_M, M denotes the set of IDs of incoming file blocks, and the queue BQ contains the IDs of all the file blocks in M. As HDFS uses the first-come-first-served policy to schedule tasks, the queue BQ is ordered by the blocks' arrival times. The subset MI ⊆ M contains all the interaction-intensive blocks in M. If there are n racks in the SNN layer, then to the MNN there are 2n datanodes: the first n datanodes represent the datanode pools of the racks, and the other n datanodes represent the caches of the racks. For example, assume that the MNN decides that a block be allocated to the k-th datanode: if k ≤ n, this block is placed into the datanode pool of rack k; otherwise, if k > n, the block is actually allocated to the cache on the (k − n)-th rack. We then define the solution vector at the MNN layer as VEC_M. Its size is |M|, and VEC_M[i] represents the location where block BQ[i] is allocated. The value domain of entry i in vector VEC_M is defined as follows:

VEC_M[i] ∈ [1, 2n] if BQ[i] ∈ MI;  VEC_M[i] ∈ [1, n] otherwise    (9)
As given in Eq. (9), for blocks in MI, the value domain of their corresponding entries in VEC_M is [1, 2n], which indicates that the blocks in MI may be stored in a cache. For the rest of the blocks, as their domain is [1, n], they can only be allocated to datanodes in a rack. As an example, assume that there are 5 racks, i.e., n = 5, and an incoming block BQ[i] is in the set MI; then the value domain of entry VEC_M[i] is the range [1, 10]. If VEC_M[i] = 3, it indicates that the incoming block BQ[i] is allocated to rack 3, while if VEC_M[i] = 9, the block is allocated to the cache of rack 4.
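The entry-to-location convention described above (values 1..n address rack datanode pools, values n+1..2n address rack caches) can be sketched as:

```python
# Decode one entry of the MNN-layer solution vector: values 1..n address
# the datanode pool of a rack, values n+1..2n address the cache of a rack.

def decode_entry(k, n):
    if not 1 <= k <= 2 * n:
        raise ValueError("entry outside the value domain [1, 2n]")
    if k <= n:
        return ("rack", k)       # block goes to rack k's datanode pool
    return ("cache", k - n)      # block goes to the cache of rack k - n
```

With n = 5, entry 3 decodes to the datanode pool of rack 3 and entry 9 to the cache of rack 4, matching the example in the text.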
To compare the interaction-intensity of incoming files with those already in the cache, an indicator γ is introduced for each incoming file to represent the ratio of the file's current I value to the average I value of all cached files. The definition of the indicator γ for file f is given below:

γ(f) = I(f, H, t) / ( (1/N) · Σ_{k=1}^{N} I(fk, H, t) )    (10)

where N is the total number of file blocks cached in the SNN layer.
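Computing the indicator as a ratio against the mean cached intensity is straightforward; a minimal sketch (names are illustrative):

```python
# gamma(f): ratio of file f's current intensity value to the average
# intensity value of the N blocks currently cached in the SNN layer.

def gamma(I_f, cached_I_values):
    N = len(cached_I_values)
    avg = sum(cached_I_values) / N   # mean intensity over N cached blocks
    return I_f / avg
```

A value above 1 means the incoming file is hotter than the current cache occupants, which is exactly the case where displacing a cached block pays off.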
Given the indicator γ, assume that the incoming files are allocated to rack r and that Ar blocks are assigned to rack r; the cost function Cost for the solution vector VEC_M in the MNN layer is defined as

Cost = Σ_r ( w1 · α(r) + w2 · β(r) )    (11)

where the value of Cost is determined by two factors: the estimated cost of placing blocks into caches (α) and the estimated cost of placing blocks into datanodes (β); w1 and w2 are the weight factors of these two components.

Allocating blocks with a high (low) γ value to a storage place that has a low (high) I/O workload, such as an idle cache, reduces the Cost value, while putting blocks with a low (high) γ value into a low (high) workload storage place increases the Cost value. As a smaller cost for I/O tasks brings larger throughput to the system, Eq. (11) becomes the objective function for the PSO-based algorithm.
In Eq. (11), α(r) is defined by Eq. (12); we leave the definition of β(r) to the next subsection. In Eq. (12), the array R records the size of the remaining cache space available in each rack; arrays Ract (Wact) and Ravg (Wavg) record the number of active reading (writing) channels connected to the cache and the average reading (writing) throughput of the cache in each rack, respectively; they are obtained from HDFS. B denotes the block size defined by the system. The constants w3 and w4 are the weight factors used to measure the ratio of reading and writing frequencies of the corresponding task.
P is a coefficient used to introduce a penalty into the cost function for using caches. As the total space of the caches is scarce compared to the storage volume provided by the datanodes, the penalty P^(B·Ar/Rr) increases exponentially with the ratio of the required cache space to the remaining cache space. As a result, when the cache is nearly full, the PSO is more likely to allocate the interaction-intensive blocks to the datanodes with lighter workload rather than to the cache. Furthermore, if the size of the blocks assigned to the cache of rack r is larger than its available space, i.e., B · Ar > Rr, the value of α(r) is greatly scaled up and this allocation plan is unlikely to be chosen by the PSO.
4.5. Storage allocation at the SNN layer
For each allocation solution generated by the MNN during the PSO search procedure, the SNNs calculate their own feedback factor β. Based on β, the MNN can then evaluate the quality of this possible solution using Eq. (11). In fact, the factor β itself is a quality evaluation criterion for the allocation plans generated at the SNN layer; in other words, it is the objective function for the PSO applied in this layer.
For the SNN in rack r, its solution vector is defined as VEC_S^r. Using Sr to denote the set of blocks assigned by the MNN to rack r, the cardinality of VEC_S^r is |Sr|. Similar to the solution vector VEC_M of the MNN, each entry in VEC_S^r indicates the storage destination allocated to the corresponding block assigned by the MNN to this rack. Using Dr to denote the number of datanodes in rack r, the value domain of each entry in VEC_S^r is [1, Dr], i.e., each block in the set Sr can be placed into any datanode in this rack. For the SNN in rack r, its cost function βr is defined in Eq. (13) as the weighted sum of the predicted reading and writing time costs over the datanodes of rack r,
where EW(i) and ER(i) are the predicted writing and reading speeds of node i; they are given by the auto-regressive integrated moving average prediction model presented in [16]. w3 and w4 are the weight coefficients for the reading and writing time costs, and Bi records the number of blocks allocated to datanode i. For datanode i, let arrays w and r record the average writing and reading speeds, respectively, after each allocation, and let two further arrays record the CPU occupation rate and the available network bandwidth of datanode i after each allocation. For the j-th allocation of datanode i, EW(i) is calculated by the auto-regressive prediction of Eq. (14), where E′W(i) is the estimated writing speed of the (j − 1)-th allocation on datanode i and AD is the regression coefficient array, computed from past occurrences in Eq. (15). In Eq. (14), the performance decrease is represented as E′W(i) − w(j − 1), and e is the weight factor for it. ER(i) is calculated in a similar way to Eq. (14), with EW replaced by ER, w replaced by r, and E′W replaced by E′R.

4.6. Apply PSO to the storage allocation problem
In the case of the storage allocation problem, a particle in the PSO is a candidate allocation plan. The structures of the particles in the MNN and SNN layers are VEC_M and VEC_S, respectively.
A particle moves within the search space from one point to another. The edge of the search space is defined by the value domain of the solution vector. Each point in the search space represents an allocation combination. Since the incoming file blocks have different interaction-intensities and the storage places in HDFS have different I/O performances, different combinations can provide different I/O throughput for an incoming batch of files. When one combination is selected (i.e., a particle moves to the point in the search space corresponding to this combination), the estimated cost of this combination (the quality of the corresponding point) can be evaluated by the objective function. In our scenario, minimum cost indicates maximum throughput. Furthermore, the terminating condition of the PSO applied in this scenario is reaching a given number of iterations.
Let array px represent the coordinates of a particle's current location, array bx represent the coordinates of the best known position in the history of this particle, and array gx denote the coordinates of the best known position in the history of the entire swarm. According to the PSO procedure, the movement of a particle between two iterations is determined by the velocity vector, denoted by array Vx, which has the same cardinality as the particle. In this array, Vx[i] represents the i-th velocity component, which is determined by

Vx[i] = ω · Vx[i] + C1 · r1 · (bx[i] − px[i]) + C2 · r2 · (gx[i] − px[i])    (16)

where ω is the inertia weight, C1 and C2 are the acceleration coefficients, and r1 and r2 are random numbers uniformly distributed in [0, 1].
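The velocity rule above is the standard PSO update; the sketch below applies it to one particle, with illustrative parameter values and a seeded random generator for reproducibility.

```python
import random

# One PSO iteration step for a single particle: the new velocity blends
# the old velocity (inertia w), the pull toward the particle's own best
# position bx (weight c1), and the pull toward the swarm's best position
# gx (weight c2). r1 and r2 are uniform random numbers in [0, 1].

def pso_step(px, vx, bx, gx, w=0.7, c1=1.5, c2=1.5, rng=random.Random(0)):
    new_v, new_p = [], []
    for i in range(len(px)):
        r1, r2 = rng.random(), rng.random()
        v = w * vx[i] + c1 * r1 * (bx[i] - px[i]) + c2 * r2 * (gx[i] - px[i])
        new_v.append(v)
        new_p.append(px[i] + v)   # position is advanced by the velocity
    return new_p, new_v
```

For the allocation problem, each coordinate would afterwards be clamped and rounded into the integer value domain of the solution vector; that discretization step is omitted here.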
4.7. Configure the PSO's parameters with evolutionary state estimation

Since the PSO applied in the storage allocation algorithm has a limited number of iterations, the speed of particle convergence is critical to the quality of the final solution. In addition, the local optima phenomenon can greatly decrease the quality of the final result. Therefore, accelerating the convergence speed and avoiding local optima have become the two most important goals for the PSO. To improve the quality of the final solution, the Evolutionary State Estimation (ESE) technique [18] is introduced into the storage allocation algorithm.
The basic concept of the ESE is to use different parameter configurations in four different PSO search phases: exploration, exploitation, convergence, and jumping out. Particles in different phases should have different behaviors, which are determined by the inertia weight ω and the two acceleration coefficients C1 and C2. To determine the search phase, an evolutionary factor f is introduced. This factor is based on the aggregation status of the whole swarm. Take the swarm in the MNN layer as an example. Denote the mean distance of each particle i to the other particles as di; it is defined as

di = (1/(N − 1)) · Σ_{j≠i} sqrt( Σ_{k=1}^{|M|} (x_i^k − x_j^k)^2 )    (17)

where N and |M| are the swarm size and the number of dimensions, respectively. Denote the di of the globally best particle as dg. Comparing all the di values, determine the maximum and minimum distances dmax and dmin. Then, the evolutionary factor f is defined by

f = (dg − dmin) / (dmax − dmin)    (18)
According to [14,18], the inertia weight ω can be determined from f as follows:

ω(f) = 1 / (1 + 1.5 · e^(−2.6·f)) ∈ [0.4, 0.9]    (19)
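The mean distances, the evolutionary factor, and the inertia-weight mapping can be sketched as follows; the sigmoid form of the weight mapping follows the adaptive-PSO formulation cited above.

```python
import math

# Evolutionary State Estimation ingredients: the mean distance of each
# particle to all others, the evolutionary factor f derived from the
# globally best particle's distance, and the inertia weight w(f).

def mean_distances(positions):
    N = len(positions)
    dists = []
    for i in range(N):
        total = 0.0
        for j in range(N):
            if i != j:
                total += math.sqrt(sum((positions[i][k] - positions[j][k]) ** 2
                                       for k in range(len(positions[i]))))
        dists.append(total / (N - 1))   # average over the other N-1 particles
    return dists

def evolutionary_factor(dists, g_best_index):
    dg, dmin, dmax = dists[g_best_index], min(dists), max(dists)
    return (dg - dmin) / (dmax - dmin)  # normalized into [0, 1]

def inertia_weight(f):
    # w(f) = 1 / (1 + 1.5 * exp(-2.6 f)), which stays within [0.4, 0.9].
    return 1.0 / (1.0 + 1.5 * math.exp(-2.6 * f))
```

A small f means the best particle sits in a tight cluster (convergence phase, small inertia); a large f means it is far from the crowd (exploration or jumping-out phase, large inertia).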
In addition, based on the f factor, the current search phase can be determined, and then a proper combination of acceleration parameters can be chosen. In this paper, the determination of the phase and the selection of the acceleration parameters are defined as in Table 2.
The PSO-based algorithm with ESE is illustrated in Algorithm 1.
4.8. Allocation algorithm implementation
The procedures of the I-I value updating, the re-allocation bounds calculation, and the storage allocation algorithm are implemented on different components of the extended multilayer HDFS structure. The MNN is in charge of updating a file's I-I value. Since all incoming requests are handled by the MNN first, the MNN has knowledge of the access status of all existing files; thus, the MNN can easily calculate the I-I value for each file. Moreover, recording the access count for a file causes only negligible overhead. As the workload of the MNN is greatly alleviated in the extended structure, the MNN can provide enough computing capacity to support the updating. In addition, the MNN is used to calculate the re-allocation bounds for an interaction-intensive file at the beginning of each time quantum. To the MNN, a rack of datanodes is considered a single storage node, so in the view of the MNN there are 2n nodes, i.e., n caches and n racks of datanodes. By sending its heartbeat report, each SNN submits both its cache's status and the average I/O speed of the datanodes in its rack to the MNN; hence the MNN has knowledge of the available I/O speed of all its 2n "nodes". As the calculation of both bounds for an interaction-intensive file needs the I/O speed information of all the storage nodes, the bound calculation procedure can only be implemented on the MNN. This implementation brings a negligible workload increase to the MNN. For the PSO-based storage allocation algorithm, both the MNN and the SNNs are involved. In each iteration of the PSO procedure, the MNN is in charge of calculating the Cost value depicted in Eq. (11) and the velocity V in Eq. (16) for each particle. To do so, both the Y value in Eq. (12) and the Z value in Eq. (13) are needed. Since the MNN can obtain the running status of the caches on the SNNs, Y can be calculated locally on the MNN. On the other hand, since the datanodes only send their heartbeat reports to an SNN, the MNN is not aware of the existence of the datanodes; hence each SNN is in charge of calculating the Z value for each particle. When a new iteration of PSO begins, the MNN broadcasts the location of each particle to all the SNNs while calculating the Y value for each particle. In the meantime, the SNNs calculate the Z value for each particle using the particle location information given by the MNN. It is easy to see that the MNN and the SNNs work in parallel. Moreover, since one block can only be allocated to one storage place, the calculation of Z on each SNN is independent, which means all the SNNs are also
working in parallel. As a result, the PSO-based storage allocation algorithm runs very efficiently on the extended HDFS structure.

After the Y values and Z values of all the particles are worked out, the MNN calculates the Cost value, the velocity V, and the new position of each particle. Then, a new iteration is launched if the stop condition is not satisfied.
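The division of labor in one iteration can be sketched as follows. This is a minimal illustration only: y_value and z_value_on_snn are hypothetical placeholders (the paper's Eqs. (12) and (13) depend on cache status and datanode I/O speeds not reproduced here), threads stand in for the SNN processes, and the Cost combination is a simple sum standing in for Eq. (11).

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder stand-ins for the paper's Eq. (12) and Eq. (13).
def y_value(particle):                  # computed locally on the MNN
    return sum(particle) * 0.5

def z_value_on_snn(snn_id, particle):   # computed independently on each SNN
    return len(particle) + snn_id

def pso_iteration(particles, num_snns):
    """One iteration: SNNs compute Z for every particle in parallel while the
    MNN computes Y; the MNN then combines them into a Cost per particle."""
    with ThreadPoolExecutor(max_workers=num_snns) as pool:
        z_futures = {(s, i): pool.submit(z_value_on_snn, s, p)
                     for s in range(num_snns)
                     for i, p in enumerate(particles)}
        y_values = [y_value(p) for p in particles]   # overlaps with SNN work
        costs = []
        for i, y in enumerate(y_values):
            z_total = sum(z_futures[(s, i)].result() for s in range(num_snns))
            costs.append(y + z_total)                # placeholder for Eq. (11)
    return costs
```

Because each Z contribution is independent, the futures never wait on one another, mirroring the claim that the SNNs work in parallel.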
II. ASSUMPTIONS AND GOALS
1) Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
2) Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications targeted for HDFS. POSIX semantics in a few key areas have been traded to increase data throughput rates.
3) Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
4) Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future.
5) "Moving Computation is Cheaper than Moving Data"
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
6) Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
Figure 1: How the HDFS architecture works.
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
HDFS represents a file system namespace and enables user data to be stored as block-based file chunks over DataNodes. Specifically, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode also keeps a reference to these blocks in a block pool. The NameNode executes HDFS namespace operations like opening, closing, and renaming files and directories. The NameNode also maintains and determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon receiving instructions from the NameNode.
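The split-and-place bookkeeping described above can be sketched as follows. This is a toy illustration, not the HDFS implementation: the block naming, the dict standing in for the NameNode's block map, and the round-robin placement are all simplifications (real HDFS uses rack-aware replica placement).

```python
BLOCK_SIZE = 64 * 1024 * 1024   # a typical HDFS block size (64 MB)

def split_file(path, file_size, block_size=BLOCK_SIZE):
    """Return the (block_id, length) pairs a file is chopped into;
    the last block may be shorter than block_size."""
    blocks, offset, seq = [], 0, 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((f"{path}#blk_{seq}", length))
        offset += length
        seq += 1
    return blocks

def place_blocks(blocks, datanodes, replication=3):
    """Toy NameNode table: block id -> the DataNodes holding a replica.
    Round-robin placement stands in for rack-aware placement."""
    block_map = {}
    for i, (block_id, _) in enumerate(blocks):
        chosen = [datanodes[(i + r) % len(datanodes)]
                  for r in range(replication)]
        block_map[block_id] = chosen
    return block_map
```

A 150 MB file, for example, becomes two full 64 MB blocks plus one 22 MB tail block, each tracked separately by the NameNode.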
The NameNode stores the HDFS namespace. The NameNode keeps a transaction log called the EditLog to continuously record every modification that takes place in the file system metadata. For instance, creating a new file in HDFS causes an event that asks the NameNode to insert a record into the EditLog representing this action. Likewise, changing the replication factor of a file causes another new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too.
The image of the entire file system namespace and the file Blockmap is kept in the NameNode's system memory (RAM). This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes this new version out into a new FsImage on disk. It can then truncate the old EditLog, because its transactions have been applied to the persistent FsImage. This procedure is known as a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up.
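The checkpoint cycle just described can be sketched with toy on-disk formats. This is an illustration only, not the real FsImage/EditLog encoding: here the FsImage is a JSON dict mapping paths to metadata, and each EditLog line is one JSON record such as {"op": "create", "path": "/b", "meta": {...}}.

```python
import json

def checkpoint(fsimage_path, editlog_path):
    """Replay the EditLog over the on-disk FsImage, write a fresh FsImage,
    then truncate the now-applied EditLog."""
    with open(fsimage_path) as f:
        namespace = json.load(f)            # in-memory image of the namespace
    with open(editlog_path) as f:
        for line in f:                      # replay every logged transaction
            rec = json.loads(line)
            if rec["op"] == "create":
                namespace[rec["path"]] = rec["meta"]
            elif rec["op"] == "delete":
                namespace.pop(rec["path"], None)
    with open(fsimage_path, "w") as f:      # flush the new FsImage to disk
        json.dump(namespace, f)
    open(editlog_path, "w").close()         # truncate the applied EditLog
    return namespace
```

After the call, the FsImage alone reflects all logged changes, which is why the old EditLog can be safely discarded.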
The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses heuristics to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory, because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to these local files, and sends this report to the NameNode: this is the Blockreport.
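The startup scan that produces the Blockreport can be sketched as a directory walk. The "blk_" file-name convention used here is an assumption for illustration; the real on-disk layout and report encoding differ.

```python
import os

def block_report(data_dir):
    """Walk a DataNode's local storage tree and report every block file found.
    Toy convention: any file whose name starts with "blk_" is one HDFS block."""
    report = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.startswith("blk_"):
                path = os.path.join(root, name)
                report.append({"block": name, "bytes": os.path.getsize(path)})
    return sorted(report, key=lambda r: r["block"])
```

Note that the walk finds blocks regardless of which subdirectory the heuristics placed them in, which is exactly why the DataNode can spread block files across many directories.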
Data Organization

9.1 Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once, but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
9.2 Staging

A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the
NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost. The above approach has been adopted after careful consideration of the target applications that run on HDFS. These applications need streaming writes to files. If a client wrote to a remote file directly without any client-side buffering, the network speed and the congestion in the network would impact throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.
9.3 Replication Pipelining

When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository, and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository, and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.
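The portion-by-portion pipeline above can be sketched as follows. This sequential sketch captures the store-and-forward order but not the concurrency: real DataNodes receive and forward simultaneously, and dicts stand in for their local repositories.

```python
def pipeline_write(block, replicas, portion_size=4 * 1024):
    """Push one block through a replica pipeline in small portions (4 KB by
    default, as in the text). Each node stores the portion in its local
    repository, then forwards it to the next node in the list."""
    for offset in range(0, len(block), portion_size):
        portion = block[offset:offset + portion_size]
        for node in replicas:               # store-and-forward down the chain
            node.setdefault("data", b"")
            node["data"] += portion
    return [node["data"] for node in replicas]
```

Because portions flow through the chain rather than being sent from the client three times, the client's upstream bandwidth is spent only once per block.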
10. Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.
10.1 FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a command-line interface called the FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to that of other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:

The FS shell is targeted at applications that need a scripting language to interact with the stored data.
10.2 DFSAdmin
The DFSAdmin command set is used for
administering an HDFS cluster. These are
commands that are used only by an HDFS
administrator. Here are some samle action!
command airs"
10.3 Browser Interface

A typical HDFS installation configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.
Space Reclamation

File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
A user can undelete a file after deleting it as long as it remains in the /trash directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory, with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from
/trash that are more than 6 hours old. In the future, this policy will be made configurable through a well-defined interface.
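The delete/undelete/expiry life cycle above can be sketched with plain dicts standing in for the namespace and the /trash directory; the function names and data shapes are illustrative, not the HDFS API.

```python
TRASH_INTERVAL = 6 * 60 * 60        # the default 6-hour policy from the text

def delete(namespace, trash, path, now):
    """Deleting a file moves its entry into /trash instead of freeing it."""
    trash[path] = (namespace.pop(path), now)

def undelete(namespace, trash, path):
    """Restore a file while its latest copy is still in /trash."""
    meta, _deleted_at = trash.pop(path)
    namespace[path] = meta

def purge_expired(trash, now, interval=TRASH_INTERVAL):
    """NameNode-side cleanup: drop /trash entries older than the interval."""
    for path in [p for p, (_, t) in trash.items() if now - t > interval]:
        del trash[path]
```

Only the purge step actually frees space, which matches the delay the text notes between deleting a file and seeing free space increase.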
Decrease Replication Factor

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks, and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.