
Upload: sufiyan-mohammad

Post on 06-Jul-2018


  • 8/18/2019 IEEERepSpec

    1/14

A Study of Hadoop Distributed File System (HDFS)
Subtitle as needed (paper subtitle)

Authors Name/s per 1st Affiliation (Author)
line 1 (of Affiliation): dept. name of organization
line 2: name of organization, acronyms acceptable
line 3: City, Country
line 4: e-mail address if desired

Authors Name/s per 2nd Affiliation (Author)
line 1 (of Affiliation): dept. name of organization
line 2: name of organization, acronyms acceptable
line 3: City, Country
line 4: e-mail address if desired

Abstract — The Hadoop Distributed File System (HDFS) is designed to run on commodity hardware and can be used as a stand-alone general-purpose distributed file system (HDFS user guide). It provides the ability to access bulk data with high I/O throughput. As a result, this system is suitable for applications that have large I/O data sets. However, the performance of HDFS decreases dramatically when handling the operations of interaction-intensive files, i.e., files that have relatively small size but are frequently accessed. This paper analyzes the cause of the throughput degradation issue when accessing interaction-intensive files and presents an enhanced HDFS architecture along with an associated storage allocation algorithm that overcomes the performance degradation problem. Experiments have shown that with the proposed architecture together with the associated storage allocation algorithm, the HDFS throughput for interaction-intensive files increases 300% on average with only a negligible performance decrease for large data set tasks.

1. HDFS

The Hadoop Distributed File System (HDFS), a subproject of the Apache Hadoop project, is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware. HDFS provides high-throughput access to application data and is suitable for applications with large data sets. This article explores the primary features of HDFS and provides a high-level view of the HDFS architecture.

The Hadoop Distributed File System (HDFS) is designed as a massive data storage framework and serves as the storage component for the Apache Hadoop platform. This file system is based on commodity hardware and provides highly reliable storage and global access with high throughput for large data sets [15]. Because of these advantages, the HDFS is also used as a stand-alone general-purpose distributed file system and serves non-Hadoop applications [6].

However, the advantage of high throughput that the HDFS provides diminishes quickly when handling interaction-intensive files, i.e., files that are of small sizes but are accessed frequently. The reason is that before an I/O transmission starts, there are a few necessary initialization steps which need to be completed, such as data location retrieval and storage space allocation. When transferring large data, this initialization overhead becomes relatively small and can be negligible compared with the data transmission itself. However, when transferring small-size data, this overhead becomes significant. In addition to the initialization overhead, files with high I/O data access frequencies can also quickly overburden the regulating component in the HDFS, i.e., the single namenode that supervises and manages every access to the datanodes [1]. If the number of datanodes is large, the single namenode can quickly become a bottleneck when the frequency of I/O requests is high.

In many systems, frequent file access is unavoidable; for example, log file updating is a common procedure in many applications [1]. Since the HDFS applies the rule of "Write-Once-Read-Many"


In this paper, we use the HDFS as a stand-alone file system and present an integrated approach to addressing the HDFS performance degradation issue for interaction-intensive tasks. In particular, we extend the HDFS architecture by adding cache support and transforming the single namenode into an extended hierarchical namenode architecture. Based on the extended architecture, we develop a Particle Swarm Optimization (PSO)-based storage allocation algorithm to improve the HDFS throughput for interaction-intensive tasks.

The rest of the paper is organized as follows. Section 2 discusses the related work, focusing on the cause of throughput degradation when handling interaction-intensive tasks and possible solutions developed for the problem. Section 3 presents an enhanced HDFS namenode structure. Section 4 describes the proposed PSO-based storage allocation algorithms to be deployed on the extended structure. Experimental results are presented and analyzed in Section 5. Finally, in Section 6, we conclude the paper with a summary of our findings and point out future work.

2. Related work

As the application of the HDFS increases, the pitfalls of the HDFS are also being discovered and studied. Among them is the poor performance when the HDFS handles small and frequently interacted files. As Jiang et al. [9] pointed out, the HDFS is designed for processing big data transmission rather than transferring a large number of small files; hence it is not suitable for interaction-intensive tasks.

Shvachko et al. [15] notice that the HDFS sacrifices the immediate response time of individual I/O requests to gain better throughput of the entire system. In other words, the HDFS chooses the strategy of enlarging the data volume of a single transmission to decrease the time ratio of transmission initialization overhead in the overall operation. The transmission initialization includes permission verification, access authentication, locating existing file blocks for reading tasks, allocating storage space for incoming writing tasks, and establishing or maintaining transmission connections, etc. Another issue with the current HDFS architecture is its single namenode structure, which can quickly become the bottleneck in the presence of interaction-intensive tasks and hence cause other incoming I/O requests to wait for a long period of time before being processed [1].

Efforts have been made to overcome the shortcomings of the HDFS. For instance, Liu [13] combines several small files into a large file and then uses an extra index to record their offsets. Hence, the number of files is reduced and the efficiency of processing access requests is increased. To enhance the HDFS's ability of handling small files, Chandrasekar et al. [3] also use a strategy that merges small files into a larger file and records the location of file blocks inside the client's cache. This strategy also allows the client application to circumvent the namenode and directly contact the datanodes, hence alleviating the bottleneck issue on the namenode. In addition, this strategy uses pre-fetching in its I/O process to further increase the system throughput when dealing with small files. The drawback of the strategy is that it requires client-side code changes.

Preventing access requests from frequently interacting with datanodes at the bottom layer of the HDFS can also significantly improve the I/O performance for interaction-intensive tasks. As illustrated in [9], one way to achieve this is to add caches to the original HDFS structure. In [17], a shared cache strategy named HDCache, which is built on top of the HDFS, is presented to improve the performance of small file access.

Liao et al. [12] adapt the conventional hierarchical structure for the HDFS to increase the capability of the namenode component. Borthakur [2] discusses in detail the architecture of the HDFS and presents several possible ways to apply the hierarchical concept to the HDFS. Fesehaye et al. provide a solution called EDFS [5], which introduces middleware and hardware components between the client application and the HDFS to implement a hierarchical storage structure. This structure uses multiple layers of resource allocators to ease the quick saturation issue of the single namenode structure.

Designing optimized storage allocation algorithms is another effective way to improve the throughput of the HDFS. In [7], a load re-balancing algorithm is developed to always keep each file in an optimal location in a dynamic HDFS environment to gain the best throughput. An algorithm for optimizing file placement in the Data Grid Environment (DGE) is presented in [8]. This algorithm consists of a data placement and migration strategy to help the regulating components (i.e., the namenode in the HDFS) in a distributed data sharing system to place files in a large number of data servers.

As can be seen from the brief summary above, to overcome the throughput degradation issue for interaction-intensive tasks, most of the research in this area focuses on three mechanisms: optimizing metadata or adding caches, using an extended structure for the regulating component, and designing new storage allocation algorithms. However, if these mechanisms are applied individually, the bottleneck issues are only shifted from one place to another rather than rooted out. The solution presented in this paper integrates all three approaches and solves the performance degradation issue for interaction-intensive tasks from the root. Our experiment results have shown successful evidence of the proposed approach.

    3. Extended namenode with cache support

In this section, an extended namenode architecture for the HDFS with cache support is specified. The following two subsections describe the details of the structure and its functionalities.

3.1. Extended namenode structure

The original HDFS structure, which consists of a single namenode, is redesigned with a two-layer namenode structure as shown in Fig. 1. The first layer contains a Master Name Node (MNN), which is actually the namenode in the original HDFS structure. The second layer consists of a set of Secondary Name Nodes (SNNs), which are applied to every rack of datanodes. Reliable caches, such as SSDs, are deployed to every SNN. It is worth pointing out that the SNN does not have to be a new machine; it can be configured from an existing datanode machine, though in our implementation we use a dedicated server for each SNN. To the MNN, each SNN is its datanode; to the datanodes, the SNN in their rack is their namenode. Hence, the new structure fully preserves the original HDFS relations between the MNN and the SNNs, and between the SNNs and the datanodes. Therefore all the original functions and mechanisms of the HDFS are intact and the number of modifications needed for the structural extension is minimal.

The addition of the new layer to the original HDFS structure causes changes to both the reading and the writing procedures. Specifically, when a reading request arrives, the MNN first inquires the SNNs and then navigates the client application to the destination racks (Step 1 in Fig. 1). Afterwards, the SNN in the rack checks its mapping table and continues navigating the client application to the datanodes that hold the file blocks (Step 2). At the end, the client application contacts the datanodes and accesses the file blocks (Step 3-I). If a block is cached, the SNN plays the role of a datanode and lets the client application read the block from its cache (Step 3-II). Similarly, to write data into a datanode, the MNN first allocates one or more racks for the client application to write (Step 4). Then, the SNN further allocates datanodes for the client application (Step 5). Afterwards, the client application begins to write file blocks into these datanodes (Step 6-I). If the file block is assigned to the caches in that rack, then the SNN plays the role of a datanode and lets the client application write the file block into its cache (Step 6-II). When the writing is done, the datanodes report their changes to the SNN and the SNNs report their changes to the MNN (Steps 7 and 8), replicas for each incoming block are created in different datanodes (Step 9), and a notification is sent to the client application.
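The read path above amounts to a two-level lookup: the MNN maps a block to a rack, and the rack's SNN serves the block from its cache or points the client at a datanode. The sketch below illustrates this routing; all class and field names are illustrative assumptions, since the paper gives no implementation.

```python
# Sketch of the two-layer read lookup (Steps 1-3 above). Names are
# hypothetical; only the routing logic follows the text.

class SNN:
    def __init__(self, rack_id):
        self.rack_id = rack_id
        self.cache = {}      # block_id -> cached block bytes (SSD cache)
        self.block_map = {}  # block_id -> datanode id within this rack

    def locate(self, block_id):
        # Step 3-II: a cached block is served by the SNN itself.
        if block_id in self.cache:
            return ("cache", self.rack_id)
        # Steps 2, 3-I: otherwise navigate the client to a datanode.
        return ("datanode", self.block_map[block_id])

class MNN:
    def __init__(self, snns):
        self.snns = snns     # rack_id -> SNN (to the MNN, each SNN is a datanode)
        self.rack_map = {}   # block_id -> rack_id

    def read(self, block_id):
        # Step 1: the MNN navigates the client to the destination rack.
        return self.snns[self.rack_map[block_id]].locate(block_id)

snn = SNN(rack_id=0)
snn.cache["b1"] = b"..."
snn.block_map["b2"] = 7
mnn = MNN({0: snn})
mnn.rack_map.update({"b1": 0, "b2": 0})
print(mnn.read("b1"))  # ('cache', 0)
print(mnn.read("b2"))  # ('datanode', 7)
```

Note how the MNN never touches individual datanodes: it only resolves rack membership, which is what keeps its pool size equal to the number of SNNs.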

3.2. Functionality of the SNN

As illustrated in Fig. 1, the SNN layer is deployed between the MNN layer and the datanode layer. Each rack of datanodes has an SNN that maintains a cache dedicated to this rack. This layer has three functionalities.

The first functionality is to keep the monitoring and messaging mechanisms between the original namenode and the datanodes unchanged. To keep the original monitoring mechanism running on the MNN, the SNN submits all the information about its rack to the MNN after receiving the MNN's heartbeat inquiry. In addition, the SNN sends its own heartbeat inquiry to the datanodes in its rack in order to monitor their status. Furthermore, in order to keep the HDFS's message mechanism intact, the SNN responds to all the messages generated by the MNN and sends its own control messages to the datanodes in its rack.

The second functionality of the SNN is to store interaction-intensive file blocks in its cache. Once a file is marked by the MNN as an interaction-intensive one and assigned to a rack, instead of relaying the writing request to the datanode layer, the SNN of this rack intercepts the request and stores the incoming blocks into its caches. As the I/O speed of a cache is much faster than the speed of the hard disks used in the datanodes, the throughput of interaction-intensive files is improved. Furthermore, as the number of SNNs is much smaller than the number of datanodes, file blocks in the cache can be located much faster and hence have a shorter I/O response time. Moreover, since the cost of transmission initialization on the datanode is high, the system overhead can be further reduced by using the SNN to prevent interaction-intensive tasks from frequently accessing the datanodes. When a file is no longer interaction-intensive, the SNN writes all of its cached file blocks to datanodes.

The third functionality is to increase the MNN's capability of handling frequent requests. In the original structure, the single namenode becomes the bottleneck when the datanode pool becomes large or when the requests come too frequently. The new layer turns the single namenode component into a hierarchical namenode structure and thus significantly increases the capability of the HDFS in managing a large resource pool and handling a large number of incoming requests. As a result of the new layer, the size of the datanode pool managed by the MNN and by each individual SNN is reduced. For the MNN, the size of its datanode pool is the number of SNNs. For each SNN, the size of its datanode pool is the size of the original datanode pool divided by the number of SNNs.

    4. Storage allocation algorithm

As presented in Section 3, by changing the single namenode to a hierarchical namenode structure, the HDFS's capability of handling frequent requests increases. In addition, the caches introduced in this structure provide the ability of faster data access with shorter response time. Under this new structure, the throughput degradation caused by the access to interaction-intensive files can be further reduced by applying an optimized storage allocation strategy. The development of this strategy is done in three steps, i.e., for a given file, we need to determine (1) the file blocks' interaction-intensity value, (2) its re-allocation triggering condition, and (3) the file block storage location based on the file's interaction-intensity.

    4.1. Interaction-intensity

The interaction-intensity of a file, denoted as the I value, is a measurement of how frequently this file has been requested. The request can be either a read or a write request submitted from one or multiple applications. All the blocks of a file share the same I property of this file. Since the I property of a file varies from time to time, its value is represented as a function of time. More precisely, for an existing file, its I value at time instance t is defined as

I(f, Q, t) = |R|    (1)

where Q is a given length of a time quantum during which the interaction-intensive value is calculated; it is a constant. The set R in Eq. (1) is defined as below:

R = {(fi, ti) | fi = f ∧ max(0, t − Q) ≤ ti ≤ t}    (2)

where t is the current system time and (fi, ti) denotes a file access request with file name fi and request arrival time ti.

The intuitive meaning of I(f, Q, t) is that, for the file f at the time instant t, it is the total number of requests for this file submitted to the HDFS in the last time quantum.
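Eqs. (1)-(2) amount to counting, in a request log, the entries for file f whose arrival times fall in the last quantum. A minimal sketch:

```python
# I(f, Q, t): count the requests for file f that arrived within the window
# [max(0, t - Q), t]; `requests` holds (file_name, arrival_time) pairs
# as in Eq. (2).

def interaction_intensity(requests, f, Q, t):
    return sum(1 for fi, ti in requests
               if fi == f and max(0, t - Q) <= ti <= t)

log = [("a", 1.0), ("a", 4.5), ("b", 4.8), ("a", 5.0)]
print(interaction_intensity(log, "a", Q=2.0, t=5.0))  # 2 (requests at 4.5 and 5.0)
```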

4.2. Trigger condition of re-allocation

The I value of a file is dynamic; it determines the best storage place of this file, so re-allocating the file every time its I value changes can improve the throughput of the whole system. However, re-allocation is a resource-consuming procedure; using this procedure too frequently will make the performance even worse. Hence, it is necessary to set up trigger conditions for the re-allocation of the existing interaction-intensive files. Intuitively, there are two triggering conditions for each interaction-intensive file:

If a file's I value is larger than its upper bound Iu, it indicates that the current storage node cannot provide enough available I/O speed and thus this file should be moved to a faster storage node.

If a file's I value is smaller than its lower bound Il, it indicates that the available I/O speed provided by the current storage node is higher than the requirement of this file. In this case, this file should be moved to another storage node with relatively lower I/O speed, or it will cause resource waste since there may not be enough storage space for the files with higher interaction-intensities.

Assume that there are n storage nodes. For each of them, the available I/O speed is denoted as S. Although both the access pattern and the data distribution can impact the performance of the hard disk, for simplicity, S is assumed to be the average case. The storage nodes are ordered by their available I/O speed, i.e., Si < Sj if i < j. In addition, assume that at time instance t, file f is at storage node j and the size of f is s; then, within the constant time quantum Q, the number of un-processed requests U is defined as follows:

U(f, j, Q, t) = I(f, Q, t) − Sj · Q / s    (3)

Suppose that the file is then re-allocated to the next storage node j + 1, which provides a higher available I/O speed than node j does. Also, assume that the I value of this file remains unchanged in the next time quantum, i.e., I(t + Q) = I(t). Then, if the time cost of data migration (denoted as Dm) of this file is counted in, the number of expected un-processed requests U′ for this file in the next time quantum is as follows:

U′(f, j + 1, Q, t + Q) = I(f, Q, t) − Sj+1 · (Q − Dm) / s    (4)

In Eq. (4), the value of the migration time cost Dm is calculated as follows:

Dm = s / d    (5)

where d represents the network bandwidth between node j and node j + 1.

Hence the improvement rate (denoted as Cr) based on the unprocessed requests between a file on node or cache j and on j + 1 is calculated as below:

Cr = (U − U′) / U    (6)

By comparing the Cr value with a given non-negative threshold ε, it can be determined whether a file should be re-allocated in the next time quantum. In other words, if a file's current Cr value is larger than the threshold, then the file should be re-allocated in the next time quantum; otherwise it should stay on the current node without invoking the allocation algorithm for a new storage destination. Therefore, the first triggering condition is Cr = ε. The threshold ε is an empirical value and is sensitive to the network configuration. From Eqs. (4)-(6), at the time instance t, the upper bound of a file's I value, Iu, can be represented as

Iu = (Sj+1 · (Q − Dm) − (1 − ε) · Sj · Q) / (ε · s)    (7)

    "learl if G(f , *, H, t) O L, the nodeDs or cacheDs 5/6 capacit islarger than the re4uests. Since the space of a high speed storagenode (such as the light$+orload cache) is limited, hence thefile should moe to a lo+er speed deice right after the ne-ttime 4uantum starts. 5n other +ords, the second triggering

  • 8/18/2019 IEEERepSpec

    5/14

    condition is G I L, &ence, the lo+er bound of a fileDs 5 alue,i.e. 5l , can be calculated as belo+:

    At the end of each time 4uantum, a fileDs 5 alue is updated. 5f the ne+ 5 alue is either larger than its 5u alue or smaller thanits 5l alue, all the blocs of this file are re$assigned to theincoming bloc 4ueue and then allocated to a ne+ storagedestination b the allocation algorithm.
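Assuming the unprocessed-request count, the migration time over bandwidth, and the improvement rate take the forms described in the text (requests served per quantum = S · Q / s; Dm = s / d), the two trigger bounds can be computed as follows; treat the closed forms as a sketch inferred from the surrounding derivation rather than the paper's exact equations.

```python
# Trigger bounds for re-allocation (forms inferred from the text).

def unprocessed(I, S_j, Q, s):
    # U: requests arriving in the quantum minus requests node j can serve.
    return I - S_j * Q / s

def improvement_rate(I, S_j, S_j1, Q, s, d):
    Dm = s / d                       # migration time over bandwidth d
    U = unprocessed(I, S_j, Q, s)    # backlog on the current node j
    U1 = I - S_j1 * (Q - Dm) / s     # backlog on node j+1, minus migration time
    return (U - U1) / U              # Cr: relative backlog reduction

def bounds(S_j, S_j1, Q, s, d, eps):
    # Iu comes from solving Cr = eps; Il from U = 0 (capacity matches demand).
    Dm = s / d
    Iu = (S_j1 * (Q - Dm) - (1 - eps) * S_j * Q) / (eps * s)
    Il = S_j * Q / s
    return Il, Iu

Il, Iu = bounds(S_j=1.0, S_j1=4.0, Q=10.0, s=2.0, d=100.0, eps=0.5)
print(Il, Iu)
```

A quick consistency check: plugging I = Iu back into the improvement rate returns exactly the threshold eps, as the derivation requires.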

4.3. Brief introduction to the Particle Swarm Optimization algorithm

Utilizing the I value, we present a Particle Swarm Optimization (PSO)-based storage allocation algorithm to decide storage locations for incoming file blocks. The concept of Particle Swarm Optimization (PSO) was first introduced in [10]. As an evolutionary algorithm, the PSO depends on the explorations of a group of independent agents (the particles) during their search for the optimum solution within a given solution domain (the search space). Each particle makes its own decision about its next movement in the search space using both its own experience and the knowledge across the whole swarm. Eventually the swarm as a whole is likely to converge to an optimum solution. The PSO procedure starts with an initial swarm of randomly distributed particles, each with a random starting position and a random initial velocity. The algorithm then iteratively updates the position of each particle over a time period to simulate the movement of the swarm to its search target. The position of each particle is updated using its velocity vector as an increment. This velocity vector is generated from two factors [11]. The first factor is the best position a particle has ever reached through each iteration. The second factor is the best position the whole swarm has ever detected. The optimization procedure is terminated when the number of iterations has reached a given value or all the particles have converged into an area within a given radius.
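The loop described above can be sketched generically. The toy continuous objective and all parameter values below are illustrative; the storage-allocation version replaces the objective with the cost functions of the later subsections and the box with the discrete value domains.

```python
# Generic PSO: positions px, velocities vx, per-particle bests bx, swarm
# best gx; velocities pull each particle toward bx and gx.
import random

def pso(cost, dim, lo, hi, n_particles=20, iters=200, w=0.7, c1=1.5, c2=1.5):
    random.seed(0)  # deterministic for the demo
    px = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vx = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    bx = [p[:] for p in px]      # best position each particle has reached
    gx = min(bx, key=cost)[:]    # best position the whole swarm has found
    for _ in range(iters):
        for i in range(n_particles):
            for k in range(dim):
                r1, r2 = random.random(), random.random()
                vx[i][k] = (w * vx[i][k]
                            + c1 * r1 * (bx[i][k] - px[i][k])
                            + c2 * r2 * (gx[k] - px[i][k]))
                # keep the particle inside the search space
                px[i][k] = min(hi, max(lo, px[i][k] + vx[i][k]))
            if cost(px[i]) < cost(bx[i]):
                bx[i] = px[i][:]
                if cost(bx[i]) < cost(gx):
                    gx = bx[i][:]
    return gx

best = pso(lambda x: sum(v * v for v in x), dim=3, lo=-5.0, hi=5.0)
```

On this 3-dimensional sphere function the swarm settles near the origin, i.e., near the minimum-cost point.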

To apply the PSO to the storage allocation problem we need to (1) define the particle structure and the search space; (2) define the objective functions to optimize for both the MNN and SNN layers; and (3) use the PSO approach to explore the solution domain and eventually derive a near-optimal solution vector. This solution vector is the allocation plan that maximizes the overall throughput of an incoming batch of file blocks.

On both the MNN and the SNN layers, a Particle Swarm Optimization (PSO)-based storage allocation algorithm is applied to search for the optimal allocation plans. The following two subsections define the solution vector as the particle structure, the value domain as the search space, and the I/O time cost function as the objective function for applying the PSO in these two layers, respectively.

4.4. Storage allocation at the MNN layer

In the MNN layer, the MNN only allocates blocks to either racks or caches. The solution vector SV_M at the MNN layer is used as an allocation plan to indicate the storage place for each incoming file block. In this section, we first define the structure of the solution vector SV_M, and then introduce the cost function to evaluate the quality of a given solution vector. Within the solution vector SV_M, M denotes a set of IDs of incoming file blocks, and the block queue contains the IDs of all the files whose blocks are in M. As the HDFS uses the first-come-first-serve policy to schedule tasks, the block queue is ordered by the blocks' arrival times. The subset M_I ⊆ M contains all the interaction-intensive blocks in M. If there are n racks at the SNN layer, then to the MNN there are 2n datanodes: the first n datanodes represent the datanode pools of the racks and the other n datanodes represent the caches of the racks. For example, assume that the MNN decides a block be allocated to the kth datanode; if k ≤ n, then this block is placed into the datanode pool of rack k; otherwise, if k > n, this block is actually allocated to the cache on the (k − n)th rack. The size of the solution vector SV_M is |M|, and SV_M[i] represents the location where the ith block in the queue is allocated. The value domain of entry i in vector SV_M is defined as follows:

SV_M[i] ∈ [1, 2n] if the ith block is in M_I;  SV_M[i] ∈ [1, n] otherwise.    (9)

As given in Eq. (9), for blocks in M_I, the value domain of their corresponding entries in SV_M is [1, 2n], which indicates that the blocks in M_I may be stored in a cache. For the rest of the blocks, as their domain is [1, n], they can only be allocated to datanodes in a rack. As an example, assume that there are 5 racks, i.e., n = 5, and an incoming block is in set M_I; then the value domain of its entry in SV_M is in the range [1, 10]. If the entry is 3, the incoming block is allocated to rack 3, while if the entry is 9, the block is allocated to the cache of rack 4.

To compare the interaction-intensity of incoming files with those already in the cache, the indicator δ is introduced for each incoming file to represent the ratio of the file's current I value versus the average I value of all cached files. The definition of the indicator δ for file f is given below:

δ(f) = I(f, Q, t) / ((1/N) · Σ_{i=1..N} I(fi, Q, t))    (10)

where N is the total number of file blocks cached in the SNN layer.

Given the indicator δ, assume that the incoming files are allocated to rack r and that there are Ar blocks on rack r; the cost function Cost for the solution vector SV_M in the MNN layer is defined as

Cost(SV_M) = Σ_{r=1..n} (ω1 · Φ(r) + ω2 · Ψ(r))    (11)

where the value Cost is determined by two factors: the estimated cost of placing blocks into caches (Φ) and the estimated cost of placing blocks into datanodes (Ψ); ω1 and ω2 are the weight factors of these two components.

Allocating blocks with a high (low) δ value into a storage place that has a low (high) I/O workload, such as an idle cache, reduces the Cost value, while putting blocks with a low (high) δ value into a low (high) workload storage place increases the Cost value. As a smaller cost of I/O tasks brings larger throughput to the system, Eq. (11) becomes the objective function for the PSO-based algorithm.

In Eq. (11), Φ(r) is defined from the following quantities (we leave the definition of Ψ(r) to the next subsection): an array Λ records the size of the remaining cache space available in each rack; the arrays Ract (Wact) and Ravg (Wavg) record the number of active reading (writing) channels connected to the cache and the average reading (writing) throughput of the cache in each rack, respectively; they are obtained from the HDFS. B denotes the block size defined by the system. The constants ω3 and ω4 are the weight factors used to measure the ratio of reading and writing frequencies of the corresponding task.

P is a coefficient used to introduce a penalty into the cost function for using caches. As the total space of the caches is scarce compared to the storage volume provided by the datanodes, the penalty P^(B·Ar/Λ[r]) increases exponentially with the ratio of required cache space versus remaining cache space. As a result, when the cache is nearly full, the PSO is more likely to allocate the interaction-intensive blocks to the datanodes with lighter workload rather than to the cache. Furthermore, if the size of the blocks assigned to the cache of rack r is larger than its available space, i.e., B · Ar > Λ[r], the value of Φ(r) will be greatly scaled up and this allocation plan is unlikely to be chosen by the PSO.
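A sketch of a cache-placement cost with the exponential penalty described above. The exact formula for Φ(r) is not recoverable here, so the arithmetic shape (per-block read/write time weighted by ω3/ω4, plus the penalty P raised to required-over-remaining cache space) is an assumption consistent with the text:

```python
# Assumed shape of a cache-placement cost for rack r: Ar blocks of size B,
# ract/wact active read/write channels, ravg/wavg average throughputs,
# rem remaining cache space, P the penalty coefficient.

def phi(r, Ar, B, rem, ract, ravg, wact, wavg, w3, w4, P):
    io = Ar * B * (w3 * ract[r] / ravg[r] + w4 * wact[r] / wavg[r])
    penalty = P ** (Ar * B / rem[r])  # blows up as required/remaining -> 1 and beyond
    return io + penalty

tight = phi(0, Ar=4, B=64, rem=[300.0], ract=[2], ravg=[100.0],
            wact=[2], wavg=[80.0], w3=0.5, w4=0.5, P=2.0)
roomy = phi(0, Ar=4, B=64, rem=[10000.0], ract=[2], ravg=[100.0],
            wact=[2], wavg=[80.0], w3=0.5, w4=0.5, P=2.0)
```

With identical I/O terms, the nearly full cache (`tight`) costs strictly more than the mostly empty one (`roomy`), which is the steering behavior the penalty is meant to produce.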

4.5. Storage allocation at the SNN layer

For each allocation solution generated by the MNN during the PSO search procedure, the SNNs calculate their own feedback factor Ψ. Based on Ψ, the MNN can then evaluate the quality of this possible solution using Eq. (11). In fact, the factor Ψ itself is a quality evaluation criterion for allocation plans generated at the SNN layer. In other words, it is the objective function for the PSO applied in this layer.

For the SNN in rack r, its solution vector is defined as SV_r. Using Sr to denote the set of blocks assigned from the MNN to rack r, the cardinality of SV_r is |Sr|. Similar to the solution vector SV_M of the MNN, each entry in SV_r indicates the storage destination allocated to the corresponding block assigned from the MNN to this rack. Using Dr to denote the number of datanodes in rack r, the value domain for each entry in SV_r is [1, Dr], i.e., each block in the set Sr can be placed into any datanode in this rack. For the SNN in rack r, its cost function Ψr is defined in Eq. (13), where EW(i) and ER(i) are the predicted writing and reading speeds on node i; they are given by the auto-regressive integrated moving average prediction model presented in [16]. ω3 and ω4 are the weight coefficients for reading and writing time costs. A per-node counter records the number of blocks allocated to datanode i. For datanode i, let the arrays w and r record the average writing and reading speed, respectively, after each allocation, and let two further arrays record the CPU occupation rate and the available network bandwidth of datanode i after each allocation. For the jth allocation of datanode i, EW(i) is calculated as in Eq. (14).

In Eq. (14), E′W(i) is the estimated writing speed of the (j + 1)th allocation on datanode i, and AD is the regression coefficient array that is computed based on past occurrences (Eq. (15)). The performance decrease is represented as E′W(i) − w[i][j − 1], and e is the weight factor for it. ER(i) is calculated in a similar way to Eq. (14), in which EW is replaced by ER, w is replaced by r, and E′W is replaced by E′R.

4.6. Applying the PSO to the storage allocation problem

In the case of the storage allocation problem, a particle in the PSO is a candidate allocation plan. The structures of the particles in the MNN and SNN layers are SV_M and SV_r, respectively.

A particle moves within the search space from one point to another. The edge of the search space is defined by the value domain of the solution vector. Each point in the search space represents an allocation combination. Since the incoming file blocks have different interaction-intensities and the storage places in the HDFS have different I/O performance, different combinations can provide different I/O throughput for the incoming batch of files. When one combination is selected (one particle moves to the point in the search space corresponding to this combination), the estimated cost of this combination (the quality of the corresponding point) can be evaluated by the objective function. In our scenario, minimum cost indicates maximum throughput. Furthermore, the terminating condition of the PSO applied in this scenario is reaching a given number of iterations.

Let the array px represent the coordinates of a particle's current location, the array bx represent the coordinates of the best known position within the history of this particle, and the array gx denote the coordinates of the best known position within the history of the entire swarm. According to the PSO procedure, the movement of a particle between two iterations is determined by the velocity vector, denoted by the array Vx, which has the same cardinality as the particle. In this array, Vx[i] represents the ith component velocity, which is determined by

Vx[i] = ω · Vx[i] + c1 · r1 · (bx[i] − px[i]) + c2 · r2 · (gx[i] − px[i])    (16)

where ω is the inertia weight, c1 and c2 are the acceleration coefficients, and r1 and r2 are random numbers uniformly distributed in [0, 1].

4.7. Configuring the PSO's parameters with evolutionary state estimation

Since the PSO applied to the storage allocation algorithm has a limited number of iterations, the speed of particle convergence is critical to the quality of the final solution. In addition, the local optima phenomenon can greatly decrease the quality of the final result. Therefore, accelerating the convergence speed and avoiding local optima have become the two most important and appealing goals in the PSO. To improve the quality of the final solution, the Evolutionary State Estimation (ESE) technique [18] is introduced into the storage allocation algorithm.

     %he basic concept of the @S@ is to use different parameter configurations in four different ?S6 search phases: e-ploration,e-ploitation, conergence and *umping out. ?articles indifferent phases should hae different behaiors, +hich aredetermined b the inertia +eight ] and t+o accelerationcoefficients "1 and "2. %o determine the search phase, an

This factor is based on the aggregation status of the whole swarm. Take the swarm in the MNN layer as an example. Denote the mean distance of each particle i as di; it is defined as

d_i = \frac{1}{N-1} \sum_{j=1, j \neq i}^{N} \sqrt{ \sum_{k=1}^{|M|} \left( p_i^k - p_j^k \right)^2 }

where N and |M| are the swarm size and the number of dimensions, respectively. Denote the di of the globally best particle as dg. Compare all the di's and determine the maximum and minimum distances dmax and dmin. Then, the evolutionary factor f is defined by

f = \frac{d_g - d_{\min}}{d_{\max} - d_{\min}} \in [0, 1]

According to [14,18], the inertia weight ω can be determined from f as

\omega(f) = \frac{1}{1 + 1.5\,e^{-2.6 f}} \in [0.4, 0.9]

In addition, based on the f factor, the current search phase can be determined. Then, a proper combination of acceleration parameters can be chosen. In this paper, the determination of the phase and the selection of the acceleration parameters are defined as in Table 2.
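To make the ESE machinery concrete, here is a small Python sketch of the evolutionary factor and the adaptive inertia weight. The formulas follow the adaptive-PSO literature the paper cites ([14,18]); the exact constants (1.5, 2.6) come from that literature and should be treated as assumptions about this paper's implementation.

```python
import math

def mean_distance(i, positions):
    """Mean Euclidean distance from particle i to every other particle."""
    n = len(positions)
    total = 0.0
    for j, q in enumerate(positions):
        if j == i:
            continue
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(positions[i], q)))
    return total / (n - 1)

def evolutionary_factor(positions, g_index):
    """f = (dg - dmin) / (dmax - dmin), which lands in [0, 1]."""
    d = [mean_distance(i, positions) for i in range(len(positions))]
    dmin, dmax = min(d), max(d)
    if dmax == dmin:
        return 0.0
    return (d[g_index] - dmin) / (dmax - dmin)

def inertia_weight(f):
    """Adaptive inertia weight mapping f into roughly [0.4, 0.9]."""
    return 1.0 / (1.0 + 1.5 * math.exp(-2.6 * f))
```

A small f means the best particle sits inside a tight cluster (convergence phase, small ω); a large f means it is far from the crowd (exploration/jumping-out, large ω).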

The PSO-based algorithm with ESE is illustrated in Algorithm 1.

4.8. Allocation algorithm implementation

The procedures of the I-I value updating, the re-allocation bounds calculation and the storage allocation algorithm are implemented on different components of the extended multilayer HDFS structure. The MNN is in charge of updating a file's I-I value. Since all the incoming requests are handled by the MNN first, the MNN has knowledge of the access status of all the existing files. Thus, the MNN can easily calculate the I-I value for files. On the other hand, recording the access count for a file only causes negligible overhead. As the workload of the MNN is greatly alleviated in the extended structure, the MNN can provide enough computing capacity to support the updating. In addition, the MNN is used to calculate the re-allocation bounds for an interaction-intensive file at the beginning of each time quantum. For the MNN, a rack of datanodes is considered as a single storage node, so in the view of the MNN there are 2n nodes, i.e., n caches and n racks of datanodes. On the other hand, by sending the heartbeat report, each SNN submits both its cache's status and the average I/O speed of the datanodes in its rack to the MNN; hence the MNN has knowledge of the available I/O speed of all its 2n 'nodes'. As the calculation of both bounds for an interaction-intensive file needs the I/O speed information of all the storage nodes, the bound calculation procedure can only be implemented on the MNN. This implementation brings a negligible workload increase to the MNN. For the PSO-based storage allocation algorithm, both the MNN and the SNNs are involved. In each iteration of the PSO procedure, the MNN is in charge of calculating the Cost value depicted in Eq. (11) and the velocity V in Eq. (16) for each particle. To do so, both the Y value in Eq. (12) and the Z value in Eq. (13) are needed. Since the MNN can obtain the running status of the caches on the SNNs, Y can be calculated locally on the MNN. On the other hand, since the

datanodes only send their heartbeat reports to an SNN, the MNN is not aware of the existence of the datanodes; hence the SNN is in charge of calculating the Z value for each particle. When a new iteration of the PSO begins, the MNN broadcasts the location of each particle to all the SNNs while calculating the Y value for each particle. In the meantime, the SNNs calculate the Z value for each particle with the particle's location information given by the MNN. It is easy to see that the MNN and the SNNs are working in a parallel way. Moreover, since one block can only be allocated to one storage place, the calculation of Z on each SNN is independent, which means all the SNNs are also


working in a parallel way. As a result, the PSO-based storage allocation algorithm runs very efficiently on the extended HDFS structure.

After the Y values and Z values of all the particles are worked out, the MNN calculates the Cost value, the velocity V and the new position of each particle. Then, a new iteration is launched if the stop condition is not satisfied.
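The division of labor described above can be sketched as follows. The helper names (compute_local_x, compute_rack_z) are hypothetical stand-ins, and the way the two parts combine into the Cost value is a placeholder, since the paper's Eq. (11) did not survive extraction; the point is only that the per-rack values are independent and can be computed in parallel with the MNN's local work.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_particle(position, snn_workers, compute_local_x, compute_rack_z):
    """Sketch of the MNN/SNN division of labor for one PSO iteration.

    compute_local_x: callable run on the MNN (cache status is local to it).
    compute_rack_z: callable run per SNN; each rack's value is independent,
    so the calls can proceed in parallel, as the text describes.
    """
    with ThreadPoolExecutor() as pool:
        z_futures = [pool.submit(compute_rack_z, snn, position)
                     for snn in snn_workers]
        x = compute_local_x(position)          # overlaps with the SNN work
        z_total = sum(f.result() for f in z_futures)
    return x + z_total   # placeholder combination into the Cost value
```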

II. ASSUMPTIONS AND GOALS

    1) Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

    2) Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.

    3) Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

4) Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.

5) Moving Computation is Cheaper than Moving Data

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

6) Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

Figure 1: How the HDFS architecture works.

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.


HDFS exposes a file system namespace and enables user data to be stored in block-based file chunks over the Datanodes. Specifically, a file is split into one or more blocks, and these blocks are stored in a set of Datanodes. The NameNode also keeps a reference to these blocks in a block pool. The NameNode executes HDFS namespace operations like opening, closing, and renaming files and directories. The NameNode also maintains the mapping of blocks to Datanodes. The Datanodes are responsible for serving read and write requests from the file system's clients. The Datanodes also perform block creation, deletion, and replication upon receiving instructions from the NameNode.

The NameNode stores the HDFS namespace. The NameNode keeps a transaction log called the EditLog to continuously record every modification that takes place in the file system metadata. For instance, creating a new file in HDFS causes an event that asks the NameNode to insert a record into the EditLog representing this action. Likewise, changing the replication factor of a file causes another new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too.

The image of the entire file system namespace and file block map is kept in the NameNode's system memory (RAM). This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a large number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This procedure is known as a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up.
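A minimal sketch of the checkpoint described above, with the namespace modeled as a Python dict and hypothetical EditLog record shapes (the real NameNode uses binary on-disk formats):

```python
def checkpoint(fsimage, edit_log):
    """Apply EditLog transactions to an in-memory FsImage and return
    the new image, as in the NameNode startup checkpoint.

    fsimage: dict mapping path -> metadata (a stand-in representation).
    edit_log: list of ('create', path, meta) / ('delete', path) /
              ('set_replication', path, n) records (hypothetical op names).
    """
    image = dict(fsimage)                  # in-memory representation
    for record in edit_log:
        op, path = record[0], record[1]
        if op == 'create':
            image[path] = record[2]
        elif op == 'delete':
            image.pop(path, None)
        elif op == 'set_replication':
            image[path] = dict(image[path], replication=record[2])
    return image                           # flush this as the new FsImage
```

Once the returned image is persisted, the old EditLog can be truncated, which is exactly why the checkpoint makes restart cheap.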

The Datanode stores HDFS data in files in its local file system. The Datanode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The Datanode does not create all files in the same directory. Instead, it uses heuristics to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a Datanode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode: this is the Blockreport.
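A sketch of that Blockreport scan, assuming the usual on-disk convention that block files are named blk_&lt;id&gt; with companion .meta files:

```python
import os

def generate_blockreport(data_dir):
    """Walk the Datanode's local storage and list the HDFS block files,
    mirroring the Blockreport described above. The 'blk_' prefix follows
    the naming convention HDFS uses for block files on disk.
    """
    blocks = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.startswith('blk_') and not name.endswith('.meta'):
                path = os.path.join(root, name)
                blocks.append((name, os.path.getsize(path)))
    return blocks
```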


Data Organization

9.1 Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once, but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
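The chopping described above amounts to simple fixed-size arithmetic; a sketch:

```python
def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Return (offset, length) chunks for a file, the way HDFS chops a
    file into fixed-size blocks (64 MB here, the default cited above);
    only the final block may be shorter.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks
```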

9.2 Staging

A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost. The above approach has been adopted after careful consideration of the target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly without any client-side buffering, the network speed and the congestion in the network impact throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.

9.3 Replication Pipelining

When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository, and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository, and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.
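A toy model of the pipeline, with each Datanode reduced to a byte buffer; real Datanodes receive and forward concurrently, which this sequential sketch only approximates:

```python
def pipeline_block(block, replicas, portion_size=4 * 1024):
    """Push a block through a chain of Datanode buffers in small portions
    (4 KB, matching the text), so every replica ends up with a full copy.
    Datanodes are modeled as plain byte buffers.
    """
    repositories = [bytearray() for _ in replicas]
    for start in range(0, len(block), portion_size):
        portion = block[start:start + portion_size]
        for repo in repositories:      # each node writes, then forwards
            repo.extend(portion)
    return [bytes(r) for r in repositories]
```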

10. Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.

10.1 FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a command-line interface called the FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:

The FS shell is targeted at applications that need a scripting language to interact with the stored data.
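The table of examples did not survive extraction; the sample action/command pairs given in the Hadoop HDFS documentation look like the following (these require a configured Hadoop client, so they are shown for illustration only):

```shell
# Create a directory named /foodir
bin/hadoop dfs -mkdir /foodir
# View the contents of a file named /foodir/myfile.txt
bin/hadoop dfs -cat /foodir/myfile.txt
# Remove a directory named /foodir
bin/hadoop dfs -rmr /foodir
```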

    10.2 DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These commands are used only by an HDFS administrator. Here are some sample action/command pairs:
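As with the FS shell, the table of examples was lost in extraction; the pairs listed in the Hadoop HDFS documentation look like the following (illustration only, since they require administrator access to a running cluster):

```shell
# Put the cluster in Safemode
bin/hadoop dfsadmin -safemode enter
# Generate a list of Datanodes
bin/hadoop dfsadmin -report
# Recommission or decommission Datanode(s)
bin/hadoop dfsadmin -refreshNodes
```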

10.3 Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.

Space Reclamation

File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

A user can undelete a file after deleting it as long as it remains in the /trash directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory, with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well-defined interface.
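The expiry policy can be sketched as a timestamp filter (the dict-of-timestamps representation is an assumption for illustration):

```python
import time

def expire_trash(trash, now=None, max_age_hours=6):
    """Drop /trash entries older than the policy age (6 hours is the
    default cited above). trash maps path -> deletion time in seconds.
    Returns the surviving entries.
    """
    now = time.time() if now is None else now
    limit = max_age_hours * 3600
    return {path: ts for path, ts in trash.items() if now - ts <= limit}
```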

Decrease Replication Factor

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks, and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.

