cs 378 – big data programmingdfranke/courses/2017fall/lecture5.pdf · cs 378 – big data...

CS378–BigDataProgramming

Lecture5Summariza9onPa:erns

CS378–Fall2017 BigDataProgramming 1

Review

•  Assignment2–Ques9ons?

•  mrunit–Howdoyoutestmap()orreduce()callsthatproducemul9pleoutputs?

•  Issueswithcalcula9ngvariance


Summariza9on

•  Othersummariza9onsofinterest– Min,max,median

•  Supposeweareinterestedinthesemetricsforparagraphlength(Assignment2data)–  Ifparagraphlengthsarenormallydistributed,thenthemedianwillbeverynearthemean

–  Ifthedistribu9onofparagraphlengthsisskewed,thenthemeanandmedianwillbeverydifferent


Summariza9on

•  MinandmaxarestraighWorward•  Foreachparagraph,outputtwovalues– Minlength(thelengthofthecurrentparagraph)– Maxlength(thelengthofthecurrentparagraph)–  Key?

•  Combinerwillgetakeyandlistofvaluespair–  Selectthemin,maxfromthelist,outputthevalues–  Key?

•  Reducerdoesthesame


Summariza9on

•  Median–  Getallthevalues,sortthem,thenfindthemiddle

•  Sinceourcomputa9onisdistributed,nomapperseesallthevalues

•  Shouldwesendthemalltoonereducer?–  Notu9lizingmap-reduce(computa9onnotdistributed)–  Datasizeslikelytoolargetokeepinmemory


Summariza9on

•  Median–  Keeptheuniqueparagraphlengths,and–  Thefrequencyofeachlength

•  Mapoutput:–  <paragraphlength,1>

•  Combinergetsalistofthesepairsandupdatesthecountforrecurringlengths

•  Reducerdoesthesame,thencomputesthemedian


Summariza9on

•  Median–  HadoopprovidestheSortedMapWritableclass–  Canassociateafrequencycountwithaparagraphlength–  Keepsthelengthsinsortedorder

•  SeetheexampleinChapter2ofMap-ReduceDesignPa1erns

•  Howcouldwecomputeallinonepassoverthedata? –  min,max,median


Counters

•  Hadoopmap-reduceinfrastructureprovidescounters–  Accessedbygroupname–  Cannothavealargenumberofcounters

•  Forexample,can’tusecounterstosolveWordCount

– Afewtensofcounterscanbeused

•  CountersarestoredinmemoryonJobTracker


CountersFigure2-6,MapReduceDesignPa:erns


HowHadoopMapReduceWorks

•  We’veseensometermslike:–  Job,JobTracker,TaskTracker(MapReduce1)–  Job,ResourceManager,NodeManager(YARN,MapReduce2)

•  Let’slookatwhattheydo

•  DetailsfromChapter7,Hadoop:TheDefini9veGuide4thEdi9on


HowHadoopMapReduceWorksFigure7-1,Hadoop:TheDefini9veGuide4thEdi9on


JobSubmission

•  Jobsubmission–  Inputfilesexist?Cansplitsbecomputed?– Outputdirectoryexist?

•  Ifyes,itfails.Hadoopexpectstocreatethisdirectory– CopyresourcestoHDFS

•  JARfiles•  Configura9onfile•  Computedfilesplits


ResourceManager•  Createstasks(worktobedone)

–  Maptaskforeachinputsplit–  Requestednumberofreducertasks–  Jobsetup,jobcleanuptasks

•  Maptasksareassignedtotasktrackersthatare“close”totheinputsplitloca9on–  Datalocalpreferred–  Racklocalnext

•  Reducetaskcangoanywhere.Why?•  Schedulingalgorithmordersthetasks


TaskExecu9on

•  Configuredforseveralmapandreducetasks•  Eachtaskhasstatusinfo(state,progress,counters)•  PeriodicallysendsinfotoMRAppMaster –  Running,successfulcomple9on,failed–  Progress(%complete)

•  Foranewtask–  Copyfilestolocalfilesystem(JAR,configura9on)–  LaunchanewJVM(YarnChild drivesexecu9on)–  Loadthemapper/reducerclassandrunthetask


TaskProgress

•  Readininputpair(mapperorreducer)•  Writeanoutputpair(mapperorreducer)•  Setthestatusdescrip9on•  Incrementacounter•  Repor9ngprogress


TaskProgress

•  Mapper–straighWorward– Howmuchoftheinputhasbeenprocessed

•  Reducer–morecomplicated– Sort,shuffleandreduceareconsideredhere– Progressisanes9mateofhowmuchofthetotalworkhasbeendone

– One-thirdallocatedtoeach


ShuffleFigure7-4,Hadoop:TheDefini9veGuide4thEdi9on


MapReduceinHadoopFigure2.4,Hadoop-TheDefini9veGuide

The number of reduce tasks is not governed by the size of the input, but instead isspecified independently. In “The Default MapReduce Job” on page 227, you will seehow to choose the number of reduce tasks for a given job.

When there are multiple reducers, the map tasks partition their output, each creatingone partition for each reduce task. There can be many keys (and their associated values)in each partition, but the records for any given key are all in a single partition. Thepartitioning can be controlled by a user-defined partitioning function, but normally thedefault partitioner—which buckets keys using a hash function—works very well.

The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4.This diagram makes it clear why the data flow between map and reduce tasks is collo-quially known as “the shuffle,” as each reduce task is fed by many map tasks. Theshuffle is more complicated than this diagram suggests, and tuning it can have a bigimpact on job execution time, as you will see in “Shuffle and Sort” on page 208.

Figure 2-4. MapReduce data flow with multiple reduce tasks

Finally, it’s also possible to have zero reduce tasks. This can be appropriate when youdon’t need the shuffle because the processing can be carried out entirely in parallel (afew examples are discussed in “NLineInputFormat” on page 247). In this case, theonly off-node data transfer is when the map tasks write to HDFS (see Figure 2-5).

Combiner FunctionsMany MapReduce jobs are limited by the bandwidth available on the cluster, so it paysto minimize the data transferred between map and reduce tasks. Hadoop allows theuser to specify a combiner function to be run on the map output, and the combiner

Scaling Out | 33


cs 378 – big data programmingdfranke/courses/2017fall/lecture5.pdf · cs 378 – big data...

Documents