cs 378 – big data programmingdfranke/courses/2017fall/lecture5.pdf · cs 378 – big data...

18
CS 378 – Big Data Programming Lecture 5 Summariza9on Pa:erns CS 378 – Fall 2017 Big Data Programming 1

Upload: vuongphuc

Post on 14-May-2018

230 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

CS378–BigDataProgramming

Lecture5Summariza9onPa:erns

CS378–Fall2017 BigDataProgramming 1

Page 2: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

Review

•  Assignment2–Ques9ons?

•  mrunit–Howdoyoutestmap()orreduce()callsthatproducemul9pleoutputs?

•  Issueswithcalcula9ngvariance

CS378–Fall2017 BigDataProgramming 2

Page 3: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

Summariza9on

•  Othersummariza9onsofinterest– Min,max,median

•  Supposeweareinterestedinthesemetricsforparagraphlength(Assignment2data)–  Ifparagraphlengthsarenormallydistributed,thenthemedianwillbeverynearthemean

–  Ifthedistribu9onofparagraphlengthsisskewed,thenthemeanandmedianwillbeverydifferent

CS378–Fall2017 BigDataProgramming 3

Page 4: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

Summariza9on

•  MinandmaxarestraighWorward•  Foreachparagraph,outputtwovalues– Minlength(thelengthofthecurrentparagraph)– Maxlength(thelengthofthecurrentparagraph)–  Key?

•  Combinerwillgetakeyandlistofvaluespair–  Selectthemin,maxfromthelist,outputthevalues–  Key?

•  Reducerdoesthesame

CS378–Fall2017 BigDataProgramming 4

Page 5: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

Summariza9on

•  Median–  Getallthevalues,sortthem,thenfindthemiddle

•  Sinceourcomputa9onisdistributed,nomapperseesallthevalues

•  Shouldwesendthemalltoonereducer?–  Notu9lizingmap-reduce(computa9onnotdistributed)–  Datasizeslikelytoolargetokeepinmemory

CS378–Fall2017 BigDataProgramming 5

Page 6: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

Summariza9on

•  Median–  Keeptheuniqueparagraphlengths,and–  Thefrequencyofeachlength

•  Mapoutput:–  <paragraphlength,1>

•  Combinergetsalistofthesepairsandupdatesthecountforrecurringlengths

•  Reducerdoesthesame,thencomputesthemedian

CS378–Fall2017 BigDataProgramming 6

Page 7: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

Summariza9on

•  Median–  HadoopprovidestheSortedMapWritableclass–  Canassociateafrequencycountwithaparagraphlength–  Keepsthelengthsinsortedorder

•  SeetheexampleinChapter2ofMap-ReduceDesignPa1erns

•  Howcouldwecomputeallinonepassoverthedata? –  min,max,median

CS378–Fall2017 BigDataProgramming 7

Page 8: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

Counters

•  Hadoopmap-reduceinfrastructureprovidescounters–  Accessedbygroupname–  Cannothavealargenumberofcounters

•  Forexample,can’tusecounterstosolveWordCount

– Afewtensofcounterscanbeused

•  CountersarestoredinmemoryonJobTracker

CS378–Fall2017 BigDataProgramming 8

Page 9: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

CountersFigure2-6,MapReduceDesignPa:erns

CS378–Fall2017 BigDataProgramming 9

Page 10: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

HowHadoopMapReduceWorks

•  We’veseensometermslike:–  Job,JobTracker,TaskTracker(MapReduce1)–  Job,ResourceManager,NodeManager(YARN,MapReduce2)

•  Let’slookatwhattheydo

•  DetailsfromChapter7,Hadoop:TheDefini9veGuide4thEdi9on

CS378–Fall2017 BigDataProgramming 10

Page 11: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

HowHadoopMapReduceWorksFigure7-1,Hadoop:TheDefini9veGuide4thEdi9on

CS378–Fall2017 BigDataProgramming 11

Page 12: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

JobSubmission

•  Jobsubmission–  Inputfilesexist?Cansplitsbecomputed?– Outputdirectoryexist?

•  Ifyes,itfails.Hadoopexpectstocreatethisdirectory– CopyresourcestoHDFS

•  JARfiles•  Configura9onfile•  Computedfilesplits

CS378–Fall2017 BigDataProgramming 12

Page 13: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

ResourceManager•  Createstasks(worktobedone)

–  Maptaskforeachinputsplit–  Requestednumberofreducertasks–  Jobsetup,jobcleanuptasks

•  Maptasksareassignedtotasktrackersthatare“close”totheinputsplitloca9on–  Datalocalpreferred–  Racklocalnext

•  Reducetaskcangoanywhere.Why?•  Schedulingalgorithmordersthetasks

CS378–Fall2017 BigDataProgramming 13

Page 14: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

TaskExecu9on

•  Configuredforseveralmapandreducetasks•  Eachtaskhasstatusinfo(state,progress,counters)•  PeriodicallysendsinfotoMRAppMaster –  Running,successfulcomple9on,failed–  Progress(%complete)

•  Foranewtask–  Copyfilestolocalfilesystem(JAR,configura9on)–  LaunchanewJVM(YarnChild drivesexecu9on)–  Loadthemapper/reducerclassandrunthetask

CS378–Fall2017 BigDataProgramming 14

Page 15: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

TaskProgress

•  Readininputpair(mapperorreducer)•  Writeanoutputpair(mapperorreducer)•  Setthestatusdescrip9on•  Incrementacounter•  Repor9ngprogress

CS378–Fall2017 BigDataProgramming 15

Page 16: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

TaskProgress

•  Mapper–straighWorward– Howmuchoftheinputhasbeenprocessed

•  Reducer–morecomplicated– Sort,shuffleandreduceareconsideredhere– Progressisanes9mateofhowmuchofthetotalworkhasbeendone

– One-thirdallocatedtoeach

CS378–Fall2017 BigDataProgramming 16

Page 17: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

ShuffleFigure7-4,Hadoop:TheDefini9veGuide4thEdi9on

CS378–Fall2017 BigDataProgramming 17

Page 18: CS 378 – Big Data Programmingdfranke/courses/2017fall/Lecture5.pdf · CS 378 – Big Data Programming Lecture 5 ... count for recurring lengths ... – Job, ResourceManager, NodeManager

MapReduceinHadoopFigure2.4,Hadoop-TheDefini9veGuide

The number of reduce tasks is not governed by the size of the input, but instead isspecified independently. In “The Default MapReduce Job” on page 227, you will seehow to choose the number of reduce tasks for a given job.

When there are multiple reducers, the map tasks partition their output, each creatingone partition for each reduce task. There can be many keys (and their associated values)in each partition, but the records for any given key are all in a single partition. Thepartitioning can be controlled by a user-defined partitioning function, but normally thedefault partitioner—which buckets keys using a hash function—works very well.

The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4.This diagram makes it clear why the data flow between map and reduce tasks is collo-quially known as “the shuffle,” as each reduce task is fed by many map tasks. Theshuffle is more complicated than this diagram suggests, and tuning it can have a bigimpact on job execution time, as you will see in “Shuffle and Sort” on page 208.

Figure 2-4. MapReduce data flow with multiple reduce tasks

Finally, it’s also possible to have zero reduce tasks. This can be appropriate when youdon’t need the shuffle because the processing can be carried out entirely in parallel (afew examples are discussed in “NLineInputFormat” on page 247). In this case, theonly off-node data transfer is when the map tasks write to HDFS (see Figure 2-5).

Combiner FunctionsMany MapReduce jobs are limited by the bandwidth available on the cluster, so it paysto minimize the data transferred between map and reduce tasks. Hadoop allows theuser to specify a combiner function to be run on the map output, and the combiner

Scaling Out | 33

CS378–Fall2017 BigDataProgramming 18