MapReduce - Cornell: Simplified Data Processing on Large Clusters (Without the Agonizing Pain). Presented by Aaron Nathan.


  • MapReduce: Simplified Data Processing on Large Clusters

    (Without the Agonizing Pain)

    Presented by Aaron Nathan

  • The Problem

    Massive amounts of data: >100 TB (the internet). Needs simple processing.

    Computers aren't perfect: slow, unreliable, misconfigured.

    Requires complex (i.e. bug-prone) code.

  • MapReduce to the Rescue!

    Common functional programming model.

    Map step: map (in_key, in_value) -> list(out_key, intermediate_value). Split a problem into a lot of smaller subproblems.

    Reduce step: reduce (out_key, list(intermediate_value)) -> list(out_value). Combine the outputs of the subproblems to give the original problem's answer.

    Each function is independent. Highly parallelizable.
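
    The two signatures above can be sketched as a tiny single-process driver. This is only an illustration of the programming model, not the distributed system: the name run_mapreduce and the even/odd example are hypothetical, and everything runs sequentially in one process.

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """Sequential sketch: map every input, group by key, reduce each group."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        # map (in_key, in_value) -> list(out_key, intermediate_value)
        for out_key, intermediate_value in mapper(in_key, in_value):
            groups[out_key].append(intermediate_value)
    # reduce (out_key, list(intermediate_value)) -> list(out_value)
    return {key: list(reducer(key, values)) for key, values in sorted(groups.items())}

# Hypothetical example: sum the even and the odd numbers across input shards.
def parity_mapper(shard_name, numbers):
    for n in numbers:
        yield ("even" if n % 2 == 0 else "odd"), n

def sum_reducer(parity, values):
    yield sum(values)

result = run_mapreduce(
    [("shard-0", [1, 2, 3]), ("shard-1", [4, 5])],
    parity_mapper,
    sum_reducer,
)
```

    Because each mapper call sees one input and each reducer call sees one key, the two loops could be spread over many machines without changing the user's functions.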

  • Algorithm Picture

    [Figure: DATA is split across several MAP tasks, which emit key/value pairs (K1:v, K2:v, K3:v). The pairs are grouped by key (K1:v,v,v; K2:v,v,v; K3:v,v,v) and handed to REDUCE tasks, whose outputs an aggregator combines into the answer.]

  • Some Example Code

    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // intermediate_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v);
      Emit(AsString(result));
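
    For readers who want to run the word count, here is a rough Python transliteration of the pseudocode above. The grouping loop in word_count stands in for the framework's shuffle step and is an assumption of this sketch; the function names are hypothetical.

```python
from collections import defaultdict

def word_count_map(input_key, input_value):
    # input_key: document name (unused); input_value: document contents
    for word in input_value.split():
        yield word, "1"

def word_count_reduce(output_key, intermediate_values):
    # output_key: a word; intermediate_values: counts encoded as strings
    result = 0
    for v in intermediate_values:
        result += int(v)
    return str(result)

def word_count(documents):
    """Stand-in for the framework: group intermediate pairs by key, then reduce."""
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in word_count_map(name, contents):
            grouped[word].append(count)
    return {word: word_count_reduce(word, counts) for word, counts in grouped.items()}

counts = word_count({"a.txt": "the quick fox", "b.txt": "the lazy dog the end"})
```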

  • Some Example Applications

    Distributed Grep, URL Access Frequency Counter, Reverse Web-Link Graph, Term Vector per Host, Distributed Sort, Inverted Index

  • The Implementation

    Google clusters: 100s to 1000s of dual-core x86 commodity machines, commodity networking (100 Mbps / 1 Gbps), GFS.

    Google job scheduler. Library linked in C++.

  • Execution

  • The Master

    Maintains the state and identity of all workers. Manages intermediate values. Receives signals from Map workers upon completion.

    Broadcasts signals to Reduce workers as they work.

    Can retask completed Map workers to Reduce workers.

  • In Case of Failure

    Periodic pings from Master -> Workers. On failure, resets the state of the dead worker's assigned task.

    Simple system proves resilient. Works in case of 80 simultaneous machine failures!

    Master failure is unhandled. Worker failure doesn't affect output (output is identical whether failure occurs or not).

    Each map writes to local disk only. If a mapper is lost, the data is just reprocessed. Nondeterministic map functions aren't guaranteed.
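
    The reset step can be sketched as follows. The Task record and handle_worker_failure are hypothetical names. The sketch assumes, as the MapReduce paper describes, that completed reduce output already sits in the shared file system and so is left alone, while completed map output on the dead worker's local disk has to be redone.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Task:
    task_id: int
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress", or "completed"
    worker: Optional[str] = None

def handle_worker_failure(tasks: List[Task], dead_worker: str) -> List[int]:
    """Reset tasks whose work was lost with dead_worker; return their ids."""
    reset = []
    for t in tasks:
        if t.worker != dead_worker:
            continue
        lost_in_progress = t.state == "in_progress"
        # Completed map output lived on the dead worker's local disk.
        lost_map_output = t.kind == "map" and t.state == "completed"
        if lost_in_progress or lost_map_output:
            t.state = "idle"
            t.worker = None
            reset.append(t.task_id)
    return reset

tasks = [
    Task(0, "map", "completed", "w1"),
    Task(1, "map", "in_progress", "w2"),
    Task(2, "reduce", "completed", "w1"),
    Task(3, "reduce", "in_progress", "w1"),
]
rescheduled = handle_worker_failure(tasks, "w1")
```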

  • Preserving Bandwidth

    Machines are in racks with small interconnects. Use location information from GFS. Attempts to put tasks for workers and input slices on the same rack.

    Usually results in LOCAL reads!
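
    A minimal sketch of locality-aware scheduling, under the assumption that the master knows which hosts hold a replica of each input slice and which rack each host is in. The function pick_worker and the rack_of table are hypothetical. Preferring the replica's own machine before its rack is an assumption of this sketch that goes slightly beyond what the slide states.

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    """Choose an idle worker for an input slice, preferring local data."""
    # First choice: a worker that already stores a replica of the slice.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    # Second choice: a worker in the same rack as some replica.
    replica_racks = {rack_of[host] for host in replica_hosts}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:
            return worker
    # Otherwise any idle worker (a remote read), or None if nobody is idle.
    return idle_workers[0] if idle_workers else None

rack_of = {"a": "r1", "b": "r1", "c": "r2", "d": "r2"}
```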

  • Backup Execution Tasks

    What if one machine is slow? Can delay the completion of the entire MR operation!

    Answer: backup (redundant) executions. Whoever finishes first completes the task! Enabled towards the end of processing.
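
    The effect of backup executions can be shown with a toy timing model. All numbers and the function completion_time are made up for illustration: each task has a primary running time, and any task still running at backup_start gets a second copy that needs backup_duration.

```python
def completion_time(primary_times, backup_start=None, backup_duration=None):
    """Job finishes when its slowest task does; a task finishes when its first copy does."""
    finish_times = []
    for t in primary_times:
        if backup_start is not None and t > backup_start:
            # Task was still running when backups were enabled: fastest copy wins.
            t = min(t, backup_start + backup_duration)
        finish_times.append(t)
    return max(finish_times)

without_backup = completion_time([10, 11, 95])
with_backup = completion_time([10, 11, 95], backup_start=12, backup_duration=10)
```

    In this toy run the single straggler dominates the job without backups, while the backup copy bounds the finish time at backup_start + backup_duration.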

  • Partitioning

    M = number of Map tasks (the number of input splits)

    R = number of Reduce tasks (the number of intermediate key splits)

    W = number of worker computers

    In general: M = sizeof(Input) / 64 MB, R = W * n (where n is a small number)

    Typical scenario: input size = 12 TB, M = 200,000, R = 5000, W = 2000
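
    The typical scenario can be checked with a few lines of arithmetic. This sketch assumes binary units (1 TB = 2^40 bytes, 64 MB = 64 * 2^20 bytes); n = 2.5 is inferred from the slide's R and W, not stated on it.

```python
def num_map_tasks(input_bytes, split_bytes=64 * 2**20):
    """M = sizeof(Input) / 64 MB, rounded up so a partial split still gets a task."""
    return -(-input_bytes // split_bytes)

def num_reduce_tasks(workers, n):
    """R = W * n for a small multiplier n."""
    return round(workers * n)

m = num_map_tasks(12 * 2**40)    # 196608, i.e. roughly the slide's 200,000
r = num_reduce_tasks(2000, 2.5)  # 5000
```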

  • Custom Partitioning

    Default: partitioned on intermediate key, Hash(intermediate_key) mod R

    What if the user has a priori knowledge about the key? Allow for a user-defined hashing function. Ex. Hash(Hostname(url_key))
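
    Both partitioners can be sketched in a few lines. CRC32 stands in for the unspecified Hash function because Python's built-in hash() for strings is salted per process; that choice and the function names are assumptions of this sketch.

```python
import zlib
from urllib.parse import urlparse

def default_partition(intermediate_key: str, r: int) -> int:
    """Hash(intermediate_key) mod R."""
    return zlib.crc32(intermediate_key.encode("utf-8")) % r

def host_partition(url_key: str, r: int) -> int:
    """Hash(Hostname(url_key)) mod R: all URLs of one host reach the same reducer."""
    host = urlparse(url_key).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % r

same_host = (
    host_partition("http://example.com/a", 5000)
    == host_partition("http://example.com/b?x=1", 5000)
)
```

    With the default partitioner the two URLs would usually land in different partitions; hashing only the hostname keeps every page of a host in one output file.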

  • The Combiner

    If the reducer is associative and commutative: (2+5)+4 = 11 or 2+(5+4) = 11; (15+x)+2 = 2+(15+x)

    Repeated intermediate keys can be merged. Saves network bandwidth. Essentially like a local reduce task.
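
    A small sketch of why the combiner helps, using word count. The function names are hypothetical. The combiner is just the reducer's addition applied locally on the map side, which is safe here only because addition is associative and commutative.

```python
from collections import Counter

def map_with_combiner(text):
    """Emit one (word, partial_count) per distinct word, not one pair per occurrence."""
    return sorted(Counter(text.split()).items())

def reduce_counts(partials):
    """Final reduce: add up the partial counts received from all map tasks."""
    total = Counter()
    for word, n in partials:
        total[word] += n
    return dict(total)

split_a = map_with_combiner("the cat the hat")  # 3 pairs cross the network instead of 4
split_b = map_with_combiner("the end")
totals = reduce_counts(split_a + split_b)
```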

  • I/O Abstractions

    How to get initial key/value pairs to map? Define an input format. Make sure splits occur in reasonable places.

    Ex: Text. Each line is a key/value pair. Can come from GFS, Bigtable, or anywhere really!

    Output works analogously.
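
    One way a text input format can keep splits in "reasonable places" is to let each split own exactly the lines that begin inside it. The sketch below assumes the key is the byte offset of the line and the value is its contents; read_text_split is a hypothetical name, and the whole file is held in memory for simplicity.

```python
def read_text_split(data: bytes, start: int, end: int):
    """Yield (offset, line) for every line that begins in [start, end)."""
    pos = start
    if start > 0 and data[start - 1:start] != b"\n":
        # The split begins mid-line; that line belongs to the previous split.
        newline = data.find(b"\n", start)
        if newline == -1:
            return
        pos = newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        # A line that starts in this split is read to its end, even past `end`.
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1

data = b"alpha\nbeta\ngamma\n"
first = list(read_text_split(data, 0, 8))
second = list(read_text_split(data, 8, 17))
```

    Byte 8 falls in the middle of "beta", yet every line is produced exactly once across the two splits.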

  • Skipping Bad Records

    What if a user makes a mistake in map/reduce, and it is only apparent on a few records?

    Worker sends a message to the Master. Skip the record after more than one worker failure on it, and tell others to ignore this record.
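
    The protocol can be sketched in a single process. SkipMaster, run_task and run_with_retries are hypothetical names, and a Python exception stands in for a worker crash. The master counts failures per record and tells re-executions to skip a record once it has failed more than once.

```python
from collections import Counter

class SkipMaster:
    """Counts failures per record and decides which records to skip."""
    def __init__(self, max_failures=1):
        self.failures = Counter()
        self.max_failures = max_failures

    def report_failure(self, record_id):
        self.failures[record_id] += 1

    def should_skip(self, record_id):
        return self.failures[record_id] > self.max_failures

def run_task(records, user_fn, master):
    out = []
    for record_id, record in records:
        if master.should_skip(record_id):
            continue
        try:
            out.append(user_fn(record))
        except Exception:
            master.report_failure(record_id)  # the worker's last message before dying
            raise
    return out

def run_with_retries(records, user_fn, master, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            return run_task(records, user_fn, master), attempt
        except Exception:
            continue  # the master re-executes the task
    raise RuntimeError("task kept failing")

# The record "x" crashes int(); after two failures it is skipped.
output, attempts = run_with_retries([(0, "1"), (1, "x"), (2, "3")], int, SkipMaster())
```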

  • Removing Unnecessary Development Pain

    Local MapReduce implementation that runs on the development machine.

    Master has an HTTP page with the status of the entire operation. Shows bad records.

    Provide a counter facility. Master aggregates the counts, which are displayed on the Master HTTP page.

  • A look at the UI (in 1994)

    http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0013.html

  • Performance Benchmarks

    Sorting AND Searching

  • Search (Grep)

    Scan through 10^10 100-byte records (1 TB)

    M = 15000, R = 1

    Startup time: GFS localization, program propagation

    Peak > 30 GB/sec!
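
    A quick back-of-the-envelope check of these figures, assuming decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes). The lower bound on scan time assumes the whole job runs at the peak rate, which the slide does not claim.

```python
records = 10**10
record_bytes = 100
total_bytes = records * record_bytes           # 10**12 bytes, i.e. 1 TB

map_tasks = 15000
bytes_per_map_task = total_bytes / map_tasks   # about 66.7 million bytes, close to a 64 MB split

peak_bytes_per_sec = 30 * 10**9
min_scan_seconds = total_bytes / peak_bytes_per_sec  # about 33 s if the peak rate were sustained
```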

  • Sort

    50 lines of code. Map -> key + text line. Reduce -> Identity. M = 15000, R = 4000

    Partition on the initial bytes of the intermediate key

    Sorts in 891 sec!
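
    Partitioning on the leading bytes is what makes the concatenated reduce outputs globally sorted. The sketch below is a single-process illustration with assumptions: it partitions on the first byte only, takes a 10-byte key prefix (the key size the MapReduce paper reports for this benchmark), and uses sorted() in place of the framework's per-partition sort.

```python
def range_partition(key: bytes, r: int) -> int:
    """Map the first byte of the key (0..255) onto r ordered partitions."""
    first = key[0] if key else 0
    return first * r // 256

def mapreduce_sort(lines, r=16):
    partitions = [[] for _ in range(r)]
    for line in lines:
        key = line[:10]  # map: emit (key, text line)
        partitions[range_partition(key, r)].append((key, line))
    output = []
    for part in partitions:
        # The framework sorts each partition by key; reduce is the identity.
        output.extend(line for _, line in sorted(part))
    return output

lines = [b"pear", b"apple", b"zebra", b"mango", b"Apple"]
sorted_lines = mapreduce_sort(lines)
```

    Every key in partition i is smaller than every key in partition i+1, so reading the R output files in order yields one sorted sequence.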

  • What about Backup Tasks?

  • And wait, it's useful!

    NB: August 2004

  • Open Source Implementation

    Hadoop: http://hadoop.apache.org/core/

    Relies on HDFS. All interfaces look almost exactly like the MapReduce paper.

    There is even a talk about it today! 4:15, B17 CS Colloquium: Mike Cafarella (UWash)

  • Active Disks for Large-Scale Data Processing

  • The Concept

    Use aggregate processing power. Networked disks allow for higher throughput.

    Why not move part of the application onto the disk device? Reduce data traffic. Increase parallelism further.

  • Shrinking Support Hardware

  • Example Applications

    Media database: find similar media data by fingerprint.

    Real-time applications: collect multiple sensor data quickly.

    Data mining: POS analysis requiring ad hoc database queries.

  • Approach

    Leverage the parallelism available in systems with many disks.

    Operate with a small amount of state, processing data as it streams off the disk.

    Execute relatively few instructions per byte of data.

  • Results: Nearest Neighbor Search

    Problem: determine the k items closest to a particular item in a database. Perform comparisons on the drive. Each disk returns its closest matches. Server does the final merge.
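
    The split between disk-side comparison and server-side merge can be sketched as below. Records are plain numbers and distance is absolute difference purely for illustration; disk_top_k and server_merge are hypothetical names, and real active disks would run the first function in drive firmware.

```python
import heapq

def disk_top_k(records, query, k):
    """Runs on each disk: return its k closest records as (distance, record) pairs."""
    return heapq.nsmallest(k, ((abs(r - query), r) for r in records))

def server_merge(per_disk_results, k):
    """Runs on the server: merge the small per-disk lists into the global top k."""
    candidates = (pair for result in per_disk_results for pair in result)
    return [record for _, record in heapq.nsmallest(k, candidates)]

disks = [[1, 50, 52], [48, 90], [49, 100, 51]]
nearest = server_merge([disk_top_k(d, query=50, k=3) for d in disks], k=3)
```

    Each disk ships at most k candidates instead of its full contents, which is where the reduction in data traffic comes from.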

  • Media Mining Example

    Perform low-level image tasks on the disk!

    Edge detection performed on disk. Sent to server as edge image.

    Server does higher-level processing.

  • Why not just use a bunch of PCs?

    The performance increase is similar. In fact, the paper essentially used this setup to actually benchmark their results!

    Supposedly this could be cheaper. The paper doesn't really give a good argument for this. Possibly reduced bandwidth on the disk I/O channel. But who cares?

  • Some Questions

    What could a disk possibly do better than the host processor?

    What added cost is associated with this mediocre processor on the HDD?

    Are new dependencies introduced on hardware and software?

    Perhaps other (better) places to do this type of local parallel processing?

    Maybe in 2001 this made more sense?