Learning Hadoop 2
Table of Contents
Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Common building blocks
Storage
Computation
Better together
Hadoop 2 – what's the big deal?
Storage in Hadoop 2
Computation in Hadoop 2
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Simple Storage Service (S3)
Elastic MapReduce (EMR)
Getting started
Cloudera QuickStart VM
Amazon EMR
Creating an AWS account
Signing up for the necessary services
Using Elastic MapReduce
Getting Hadoop up and running
How to use EMR
AWS credentials
The AWS command-line interface
Running the examples
Data processing with Hadoop
Why Twitter?
Building our first dataset
One service, multiple APIs
Anatomy of a Tweet
Twitter credentials
Programmatic access with Python
Summary
2. Storage
The inner workings of HDFS
Cluster startup
NameNode startup
DataNode startup
Block replication
Command-line access to the HDFS filesystem
Exploring the HDFS filesystem
Protecting the filesystem metadata
Secondary NameNode not to the rescue
Hadoop 2 NameNode HA
Keeping the HA NameNodes in sync
Client configuration
How a failover works
Apache ZooKeeper – a different type of filesystem
Implementing a distributed lock with sequential ZNodes
Implementing group membership and leader election using ephemeral ZNodes
Java API
Building blocks
Further reading
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Hadoop interfaces
Java FileSystem API
Libhdfs
Thrift
Managing and serializing data
The Writable interface
Introducing the wrapper classes
Array wrapper classes
The Comparable and WritableComparable interfaces
Storing data
Serialization and Containers
Compression
General-purpose file formats
Column-oriented data formats
RCFile
ORC
Parquet
Avro
Using the Java API
Summary
3. Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
The Mapper class
The Reducer class
The Driver class
Combiner
Partitioning
The optional partition function
Hadoop-provided mapper and reducer implementations
Sharing reference data
Writing MapReduce programs
Getting started
Running the examples
Local cluster
Elastic MapReduce
WordCount, the Hello World of MapReduce
Word co-occurrences
Trending topics
The Top N pattern
Sentiment of hashtags
Text cleanup using chain mapper
Walking through a run of a MapReduce job
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reducer input
Reducer input
Reducer execution
Reducer output
Shutdown
Input/Output
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Sequence files
YARN
YARN architecture
The components of YARN
Anatomy of a YARN application
Lifecycle of a YARN application
Fault tolerance and monitoring
Thinking in layers
Execution models
YARN in the real world – Computation beyond MapReduce
The problem with MapReduce
Tez
Hive-on-tez
Apache Spark
Apache Samza
YARN-independent frameworks
YARN today and beyond
Summary
4. Real-time Computation with Samza
Stream processing with Samza
How Samza works
Samza high-level architecture
Samza's best friend – Apache Kafka
YARN integration
An independent model
Hello Samza!
Building a tweet parsing job
The configuration file
Getting Twitter data into Kafka
Running a Samza job
Samza and HDFS
Windowing functions
Multijob workflows
Tweet sentiment analysis
Bootstrap streams
Stateful tasks
Summary
5. Iterative Computation with Spark
Apache Spark
Cluster computing with working sets
Resilient Distributed Datasets (RDDs)
Actions
Deployment
Spark on YARN
Spark on EC2
Getting started with Spark
Writing and running standalone applications
Scala API
Java API
WordCount in Java
Python API
The Spark ecosystem
Spark Streaming
GraphX
MLlib
Spark SQL
Processing data with Apache Spark
Building and running the examples
Running the examples on YARN
Finding popular topics
Assigning a sentiment to topics
Data processing on streams
State management
Data analysis with Spark SQL
SQL on data streams
Comparing Samza and Spark Streaming
Summary
6. Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Grunt – the Pig interactive shell
Elastic MapReduce
Fundamentals of Apache Pig
Programming Pig
Pig data types
Pig functions
Load/store
Eval
The tuple, bag, and map functions
The math, string, and datetime functions
Dynamic invokers
Macros
Working with data
Filtering
Aggregation
Foreach
Join
Extending Pig (UDFs)
Contributed UDFs
Piggybank
Elephant Bird
Apache DataFu
Analyzing the Twitter stream
Prerequisites
Dataset exploration
Tweet metadata
Data preparation
Top n statistics
Datetime manipulation
Sessions
Capturing user interactions
Link analysis
Influential users
Summary
7. Hadoop and SQL
Why SQL on Hadoop
Other SQL-on-Hadoop solutions
Prerequisites
Overview of Hive
The nature of Hive tables
Hive architecture
Data types
DDL statements
File formats and storage
JSON
Avro
Columnar stores
Queries
Structuring Hive tables for given workloads
Partitioning a table
Overwriting and updating data
Bucketing and sorting
Sampling data
Writing scripts
Hive and Amazon Web Services
Hive and S3
Hive on Elastic MapReduce
Extending HiveQL
Programmatic interfaces
JDBC
Thrift
Stinger initiative
Impala
The architecture of Impala
Co-existing with Hive
A different philosophy
Drill, Tajo, and beyond
Summary
8. Data Lifecycle Management
What data lifecycle management is
Importance of data lifecycle management
Tools to help
Building a tweet analysis capability
Getting the tweet data
Introducing Oozie
A note on HDFS file permissions
Making development a little easier
Extracting data and ingesting into Hive
A note on workflow directory structure
Introducing HCatalog
Using HCatalog
The Oozie sharelib
HCatalog and partitioned tables
Producing derived data
Performing multiple actions in parallel
Calling a subworkflow
Adding global settings
Challenges of external data
Data validation
Validation actions
Handling format changes
Handling schema evolution with Avro
Final thoughts on using Avro schema evolution
Only make additive changes
Manage schema versions explicitly
Think about schema distribution
Collecting additional data
Scheduling workflows
Other Oozie triggers
Pulling it all together
Other tools to help
Summary
9. Making Development Easier
Choosing a framework
Hadoop streaming
Streaming word count in Python
Differences in jobs when using streaming
Finding important words in text
Calculate term frequency
Calculate document frequency
Putting it all together – TF-IDF
Kite Data
Data Core
Data HCatalog
Data Hive
Data MapReduce
Data Spark
Data Crunch
Apache Crunch
Getting started
Concepts
Data serialization
Data processing patterns
Aggregation and sorting
Joining data
Pipelines implementation and execution
SparkPipeline
MemPipeline
Crunch examples
Word co-occurrence
TF-IDF
Kite Morphlines
Concepts
Morphline commands
Summary
10. Running a Hadoop Cluster
I'm a developer – I don't care about operations!
Hadoop and DevOps practices
Cloudera Manager
To pay or not to pay
Cluster management using Cloudera Manager
Cloudera Manager and other management tools
Monitoring with Cloudera Manager
Finding configuration files
Cloudera Manager API
Cloudera Manager lock-in
Ambari – the open source alternative
Operations in the Hadoop 2 world
Sharing resources
Building a physical cluster
Physical layout
Rack awareness
Service layout
Upgrading a service
Building a cluster on EMR
Considerations about filesystems
Getting data into EMR
EC2 instances and tuning
Cluster tuning
JVM considerations
The small files problem
Map and reduce optimizations
Security
Evolution of the Hadoop security model
Beyond basic authorization
The future of Hadoop security
Consequences of using a secured cluster
Monitoring
Hadoop – where failures don't matter
Monitoring integration
Application-level metrics
Troubleshooting
Logging levels
Access to log files
ResourceManager, NodeManager, and ApplicationManager
Applications
Nodes
Scheduler
MapReduce
MapReduce v1
MapReduce v2 (YARN)
JobHistory Server
NameNode and DataNode
Summary
11. Where to Go Next
Alternative distributions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
And the rest…
Choosing a distribution
Other computational frameworks
Apache Storm
Apache Giraph
Apache HAMA
Other interesting projects
HBase
Sqoop
Whirr
Mahout
Hue
Other programming abstractions
Cascading
AWS resources
SimpleDB and DynamoDB
Kinesis
Data Pipeline
Sources of information
Source code
Mailing lists and forums
LinkedIn groups
HUGs
Conferences
Summary
Index
Learning Hadoop 2
Learning Hadoop 2

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2015

Production reference: 1060215

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78328-551-8

www.packtpub.com
Credits

Authors
Garry Turkington
Gabriele Modena

Reviewers
Atdhe Buja
Amit Gurdasani
Jakob Homan
James Lampton
Davide Setti
Valerie Parham-Thompson

Commissioning Editor
Edward Gordon

Acquisition Editor
Joanne Fitzpatrick

Content Development Editor
Vaibhav Pawar

Technical Editors
Indrajit A. Das
Menza Mathew

Copy Editors
Roshni Banerjee
Sarang Chari
Pranjali Chury

Project Coordinator
Kranti Berde

Proofreaders
Simran Bhogal
Martin Diver
Lawrence A. Herman
Paul Hindle

Indexer
Hemangini Bari

Graphics
Abhinash Sahu

Production Coordinator
Nitesh Thakur

Cover Work
Nitesh Thakur
About the Authors

Garry Turkington has over 15 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current role as the CTO at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams, building systems that process the Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and the USA.

He has BSc and PhD degrees in Computer Science from Queen's University Belfast in Northern Ireland, and a Master's degree in Engineering in Systems Engineering from Stevens Institute of Technology in the USA. He is the author of Hadoop Beginner's Guide, published by Packt Publishing in 2013, and is a committer on the Apache Samza project.

I would like to thank my wife Lea and mother Sarah for their support and patience through the writing of another book, and my daughter Maya for frequently cheering me up and asking me hard questions. I would also like to thank Gabriele for being such an amazing co-author on this project.

Gabriele Modena is a data scientist at Improve Digital. In his current position, he uses Hadoop to manage, process, and analyze behavioral and machine-generated data. Gabriele enjoys using statistical and computational methods to look for patterns in large amounts of data. Prior to his current job in ad tech, he held a number of positions in academia and industry where he did research in machine learning and artificial intelligence.

He holds a BSc degree in Computer Science from the University of Trento, Italy, and a Research MSc degree in Artificial Intelligence: Learning Systems from the University of Amsterdam in the Netherlands.

First and foremost, I want to thank Laura for her support, constant encouragement, and endless patience putting up with far too many "can't do, I'm working on the Hadoop book". She is my rock and I dedicate this book to her.

A special thank you goes to Amit, Atdhe, Davide, Jakob, James, and Valerie, whose invaluable feedback and commentary made this work possible.

Finally, I'd like to thank my co-author, Garry, for bringing me on board with this project; it has been a pleasure working together.
About the Reviewers

Atdhe Buja is a certified ethical hacker, DBA (MCITP, OCA 11g), and developer with good management skills. He is a DBA at the Agency for Information Society/Ministry of Public Administration, where he also manages some projects of e-governance, and has more than 10 years' experience working on SQL Server.

Atdhe is a regular columnist for UBT News. He holds an MSc degree in computer science and engineering and a bachelor's degree in management and information. He specializes in and is certified in many technologies, such as SQL Server (all versions), Oracle 11g, CEH, Windows Server, MS Project, SCOM 2012 R2, BizTalk, and integration business processes.

He was the reviewer of the book Microsoft SQL Server 2012 with Hadoop, published by Packt Publishing. His capabilities go beyond the aforementioned knowledge!

I thank Donika and my family for all the encouragement and support.

Amit Gurdasani is a software engineer at Amazon. He architects distributed systems to process product catalogue data. Prior to building high-throughput systems at Amazon, he worked on the entire software stack, both as a systems-level developer at Ericsson and IBM and as an application developer at Manhattan Associates. He maintains a strong interest in bulk data processing, data streaming, and service-oriented software architectures.

Jakob Homan has been involved with big data and the Apache Hadoop ecosystem for more than 5 years. He is a Hadoop committer as well as a committer for the Apache Giraph, Spark, Kafka, and Tajo projects, and is a PMC member. He has worked on bringing all these systems to scale at Yahoo! and LinkedIn.

James Lampton is a seasoned practitioner of all things data (big or small) with 10 years of hands-on experience in building and using large-scale data storage and processing platforms. He is a believer in holistic approaches to solving problems using the right tool for the right job. His favorite tools include Python, Java, Hadoop, Pig, Storm, and SQL (which sometimes I like and sometimes I don't). He has recently completed his PhD from the University of Maryland with the release of Pig Squeal: a mechanism for running Pig scripts on Storm.

I would like to thank my spouse, Andrea, and my son, Henry, for giving me time to read work-related things at home. I would also like to thank Garry, Gabriele, and the folks at Packt Publishing for the opportunity to review this manuscript, and for their patience and understanding, as my free time was consumed when writing my dissertation.
Davide Setti, after graduating in physics from the University of Trento, joined the SoNet research unit at the Fondazione Bruno Kessler in Trento, where he applied large-scale data analysis techniques to understand people's behaviors in social networks and in large collaborative projects such as Wikipedia.
In 2010, Davide moved to Fondazione, where he led the development of data analytic tools to support research on civic media, citizen journalism, and digital media.
In 2013, Davide became the CTO of SpazioDati, where he leads the development of tools to perform semantic analysis of massive amounts of data in the business information sector.

When not solving hard problems, Davide enjoys taking care of his family vineyard and playing with his two children.
www.PacktPub.com
Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?

- Fully searchable across every book published by Packt
- Copy and paste, print, and bookmark content
- On demand and accessible via a web browser
Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface

This book will take you on a hands-on exploration of the wonderful world that is Hadoop 2 and its rapidly growing ecosystem. Building on the solid foundation from the earlier versions of the platform, Hadoop 2 allows multiple data processing frameworks to be executed on a single Hadoop cluster.

To give an understanding of this significant evolution, we will explore both how these new models work and also show their applications in processing large data volumes with batch, iterative, and near-real-time algorithms.
What this book covers

Chapter 1, Introduction, gives the background to Hadoop and the Big Data problems it looks to solve. We also highlight the areas in which Hadoop 1 had room for improvement.

Chapter 2, Storage, delves into the Hadoop Distributed File System, where most data processed by Hadoop is stored. We examine the particular characteristics of HDFS, show how to use it, and discuss how it has improved in Hadoop 2. We also introduce ZooKeeper, another storage system within Hadoop, upon which many of its high-availability features rely.

Chapter 3, Processing – MapReduce and Beyond, first discusses the traditional Hadoop processing model and how it is used. We then discuss how Hadoop 2 has generalized the platform to use multiple computational models, of which MapReduce is merely one.

Chapter 4, Real-time Computation with Samza, takes a deeper look at one of these alternative processing models enabled by Hadoop 2. In particular, we look at how to process real-time streaming data with Apache Samza.

Chapter 5, Iterative Computation with Spark, delves into a very different alternative processing model. In this chapter, we look at how Apache Spark provides the means to do iterative processing.

Chapter 6, Data Analysis with Pig, demonstrates how Apache Pig makes the traditional computational model of MapReduce easier to use by providing a language to describe data flows.

Chapter 7, Hadoop and SQL, looks at how the familiar SQL language has been implemented atop data stored in Hadoop. Through the use of Apache Hive, and by describing alternatives such as Cloudera Impala, we show how Big Data processing can be made possible using existing skills and tools.

Chapter 8, Data Lifecycle Management, takes a look at the bigger picture of just how to manage all that data that is to be processed in Hadoop. Using Apache Oozie, we show how to build up workflows to ingest, process, and manage data.

Chapter 9, Making Development Easier, focuses on a selection of tools aimed at helping a developer get results quickly. Through the use of Hadoop streaming, Apache Crunch, and Kite, we show how the use of the right tool can speed up the development loop or provide new APIs with richer semantics and less boilerplate.

Chapter 10, Running a Hadoop Cluster, takes a look at the operational side of Hadoop. By focusing on the areas of interest to developers, such as cluster management, monitoring, and security, this chapter should help you to work better with your operations staff.

Chapter 11, Where to Go Next, takes you on a whirlwind tour through a number of other projects and tools that we feel are useful, but could not cover in detail in the book due to space constraints. We also give some pointers on where to find additional sources of information and how to engage with the various open source communities.
What you need for this book

Because most people don't have a large number of spare machines sitting around, we use the Cloudera QuickStart virtual machine for most of the examples in this book. This is a single machine image with all the components of a full Hadoop cluster pre-installed. It can be run on any host machine supporting either the VMware or the VirtualBox virtualization technology.

We also explore Amazon Web Services and how some of the Hadoop technologies can be run on the AWS Elastic MapReduce service. The AWS services can be managed through a web browser or a Linux command-line interface.
Who this book is for

This book is primarily aimed at application and system developers interested in learning how to solve practical problems using the Hadoop framework and related components. Although we show examples in a few programming languages, a strong foundation in Java is the main prerequisite.

Data engineers and architects might also find the material concerning data lifecycle, file formats, and computational models useful.
Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce .jar file to our environment before accessing individual fields."
A block of code is set as follows:

topic_edges_grouped = FOREACH topic_edges_grouped {
  GENERATE
    group.topic_id as topic,
    group.source_id as source,
    topic_edges.(destination_id, w) as edges;
}
Any command-line input or output is written as follows:

$ hdfs dfs -put target/elephant-bird-pig-4.5.jar hdfs:///jar/
$ hdfs dfs -put target/elephant-bird-hadoop-compat-4.5.jar hdfs:///jar/
$ hdfs dfs -put elephant-bird-core-4.5.jar hdfs:///jar/
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes, appear in the text like this: "Once the form is filled in, we need to review and accept the terms of service and click on the Create Application button in the bottom-left corner of the page."

Note
Warnings or important notes appear in a box like this.

Tip
Tips and tricks appear like this.
Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code

The source code for this book can be found on GitHub at https://github.com/learninghadoop2/book-examples. The authors will be applying any errata to this code and keeping it up to date as the technologies evolve. In addition, you can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code) we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Introduction
This book will teach you how to build amazing systems using the latest release of Hadoop. Before you change the world though, we need to do some groundwork, which is where this chapter comes in.
In this introductory chapter, we will cover the following topics:
A brief refresher on the background to Hadoop
A walk-through of Hadoop's evolution
The key elements in Hadoop 2
The Hadoop distributions we'll use in this book
The dataset we'll use for examples
A note on versioning
In Hadoop 1, the version history was somewhat convoluted, with multiple forked branches in the 0.2x range, leading to odd situations where a 1.x version could, in some situations, have fewer features than a 0.23 release. In the version 2 codebase, this is fortunately much more straightforward, but it's important to clarify exactly which version we will use in this book.
Hadoop 2.0 was released in alpha and beta versions, and along the way, several incompatible changes were introduced. There was, in particular, a major API stabilization effort between the beta and final release stages.
Hadoop 2.2.0 was the first general availability (GA) release of the Hadoop 2 codebase, and its interfaces are now declared stable and forward compatible. We will therefore use the 2.2 product and interfaces in this book. Though the principles will be usable on a 2.0 beta, there will, in particular, be API incompatibilities in the beta. This is particularly important as MapReduce v2 was back-ported to Hadoop 1 by several distribution vendors, but these products were based on the beta and not the GA APIs. If you are using such a product, then you will encounter these incompatible changes. It is recommended that a release based upon Hadoop 2.2 or later is used for both the development and the production deployments of any Hadoop 2 workloads.
The background of Hadoop
We're assuming that most readers will have a little familiarity with Hadoop, or at the very least, with big data-processing systems. Consequently, we won't give a detailed background in this book as to why Hadoop is successful or the types of problem it helps to solve. However, particularly because of some aspects of Hadoop 2 and the other products we will use in later chapters, it is useful to sketch how we see Hadoop fitting into the technology landscape and the particular problem areas where we believe it gives the most benefit.
In ancient times, before the term "big data" came into the picture (which equates to maybe a decade ago), there were few options to process datasets of sizes in terabytes and beyond. Some commercial databases could, with very specific and expensive hardware setups, be scaled to this level, but the expertise and capital expenditure required made it an option for only the largest organizations. Alternatively, one could build a custom system aimed at the specific problem at hand. This suffered from some of the same problems (expertise and cost) and added the risk inherent in any cutting-edge system. On the other hand, if a system was successfully constructed, it was likely a very good fit to the need.
Few small- to mid-size companies even worried about this space, not only because the solutions were out of their reach, but also because they generally didn't have anything close to the data volumes that required such solutions. As the ability to generate very large datasets became more common, so did the need to process that data.
Even though large data became more democratized and was no longer the domain of the privileged few, major architectural changes were required if data-processing systems were to be made affordable to smaller companies. The first big change was to reduce the required upfront capital expenditure on the system; that means no high-end hardware or expensive software licenses. Previously, high-end hardware would have been utilized most commonly in a relatively small number of very large servers and storage systems, each of which had multiple approaches to avoid hardware failures. Though very impressive, such systems are hugely expensive, and moving to a larger number of lower-end servers would be the quickest way to dramatically reduce the hardware cost of a new system. Moving more toward commodity hardware instead of the traditional enterprise-grade equipment would also mean a reduction in capabilities in the area of resilience and fault tolerance. Those responsibilities would need to be taken up by the software layer: smarter software, dumber hardware.
Google started the change that would eventually be known as Hadoop when, in 2003 and 2004, it released two academic papers describing the Google File System (GFS) (http://research.google.com/archive/gfs.html) and MapReduce (http://research.google.com/archive/mapreduce.html). The two together provided a platform for very large-scale data processing in a highly efficient manner. Google had taken the build-it-yourself approach, but instead of constructing something aimed at one specific problem or dataset, it created a platform on which multiple processing applications could be implemented. In particular, it utilized large numbers of commodity servers and built GFS and MapReduce in a way that assumed hardware failures would be commonplace and were simply something that the software needed to deal with.
At the same time, Doug Cutting was working on the Nutch open source web crawler. He was working on elements within the system that resonated strongly once the Google GFS and MapReduce papers were published. Doug started work on open source implementations of these Google ideas, and Hadoop was soon born, firstly as a subproject of Lucene, and then as its own top-level project within the Apache Software Foundation.
Yahoo! hired Doug Cutting in 2006 and quickly became one of the most prominent supporters of the Hadoop project. In addition to often publicizing some of the largest Hadoop deployments in the world, Yahoo! allowed Doug and other engineers to contribute to Hadoop while employed by the company, not to mention contributing back some of its own internally developed Hadoop improvements and extensions.
Components of Hadoop
The broad Hadoop umbrella project has many component subprojects, and we'll discuss several of them in this book. At its core, Hadoop provides two services: storage and computation. A typical Hadoop workflow consists of loading data into the Hadoop Distributed File System (HDFS) and processing it using the MapReduce API or one of the several tools that rely on MapReduce as an execution framework.
Hadoop 1: HDFS and MapReduce
Both layers are direct implementations of Google's own GFS and MapReduce technologies.
Common building blocks
Both HDFS and MapReduce exhibit several of the architectural principles described in the previous section. In particular, the common principles are as follows:
Both are designed to run on clusters of commodity (that is, low-to-medium specification) servers
Both scale their capacity by adding more servers (scale-out), as opposed to the previous models of using larger hardware (scale-up)
Both have mechanisms to identify and work around failures
Both provide most of their services transparently, allowing the user to concentrate on the problem at hand
Both have an architecture where a software cluster sits on the physical servers and manages aspects such as application load balancing and fault tolerance, without relying on high-end hardware to deliver these capabilities
Storage
HDFS is a filesystem, though not a POSIX-compliant one. This basically means that it does not display the same characteristics as a regular filesystem. In particular, the characteristics are as follows:
HDFS stores files in blocks that are typically at least 64 MB or (more commonly now) 128 MB in size, much larger than the 4-32 KB seen in most filesystems
HDFS is optimized for throughput over latency; it is very efficient at streaming reads of large files but poor when seeking for many small ones
HDFS is optimized for workloads that are generally write-once and read-many
Instead of handling disk failures by having physical redundancies in disk arrays or similar strategies, HDFS uses replication. Each of the blocks comprising a file is stored on multiple nodes within the cluster, and a service called the NameNode constantly monitors to ensure that failures have not dropped any block below the desired replication factor. If this does happen, then it schedules the making of another copy within the cluster.
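The block and replication model above is easy to quantify. The following plain-Python sketch (our own illustration, not any HDFS API) shows how a file is divided into fixed-size blocks and how replication multiplies the raw storage consumed across the cluster:

```python
# Conceptual sketch (not the HDFS API): how a large file is divided
# into fixed-size blocks, and how replication multiplies storage.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS block size
REPLICATION = 3                  # the HDFS default replication factor

def plan_blocks(file_size_bytes):
    """Return the number of HDFS blocks a file of this size occupies."""
    # Integer ceiling division: the final block may be partially filled.
    return (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE

one_gb = 1024 ** 3
blocks = plan_blocks(one_gb)     # a 1 GB file -> 8 blocks of 128 MB
stored = one_gb * REPLICATION    # raw bytes written across the cluster
print(blocks, stored // one_gb)  # 8 3
```

Note that a block only occupies the space its data actually needs; unlike a traditional filesystem, a file smaller than the block size does not consume a full block on disk.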
Computation
MapReduce is an API, an execution engine, and a processing paradigm; it provides a series of transformations from a source into a result dataset. In the simplest case, the input data is fed through a map function and the resultant temporary data is then fed through a reduce function.
MapReduce works best on semistructured or unstructured data. Instead of data conforming to rigid schemas, the requirement is instead that the data can be provided to the map function as a series of key-value pairs. The output of the map function is a set of other key-value pairs, and the reduce function performs aggregation to collect the final set of results.
Hadoop provides a standard specification (that is, interface) for the map and reduce phases, and the implementations of these are often referred to as mappers and reducers. A typical MapReduce application will comprise a number of mappers and reducers, and it's not unusual for several of these to be extremely simple. The developer focuses on expressing the transformation between the source and the resultant data, and the Hadoop framework manages all aspects of job execution and coordination.
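The map, shuffle, and reduce phases described above can be mimicked in a few lines of plain Python. This is a conceptual sketch of the paradigm (the classic word count), not the Hadoop mapper/reducer API itself; the function names are our own:

```python
# A minimal word count in plain Python, mimicking the MapReduce flow:
# map emits (key, value) pairs, a shuffle groups them by key, and
# reduce aggregates each group. Illustrative only, not Hadoop code.
from collections import defaultdict

def mapper(line):
    # Emit one ("word", 1) pair per word in the input line
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    # Aggregate all values seen for one key
    return key, sum(values)

def run_job(lines):
    # Shuffle phase: group intermediate values by key
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key
    return dict(reducer(k, v) for k, v in groups.items())

counts = run_job(["Hadoop stores data", "Hadoop processes data"])
print(counts["hadoop"], counts["data"])  # 2 2
```

In real Hadoop, the framework performs the shuffle and runs many mapper and reducer instances in parallel across the cluster; the developer supplies only the two functions.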
Better together
It is possible to appreciate the individual merits of HDFS and MapReduce, but they are even more powerful when combined. They can be used individually, but when they are together, they bring out the best in each other, and this close interworking was a major factor in the success and acceptance of Hadoop 1.
When a MapReduce job is being planned, Hadoop needs to decide on which host to execute the code in order to process the dataset most efficiently. If the MapReduce cluster hosts are all pulling their data from a single storage host or array, then this largely doesn't matter as the storage system is a shared resource that will cause contention. If the storage system were more transparent and allowed MapReduce to manipulate its data more directly, then there would be an opportunity to perform the processing closer to the data, building on the principle of it being less expensive to move processing than data.
The most common deployment model for Hadoop sees the HDFS and MapReduce clusters deployed on the same set of servers. Each host that contains data and the HDFS component to manage the data also hosts a MapReduce component that can schedule and execute data processing. When a job is submitted to Hadoop, it can use the locality optimization to schedule data processing on the hosts where the data resides as much as possible, thus minimizing network traffic and maximizing performance.
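The locality optimization amounts to a simple preference rule: run each task on a host that already holds a replica of its data, and fall back to a remote host (paying the network cost) only when that is impossible. The function and data structures below are hypothetical illustrations of the idea, not Hadoop's actual scheduler:

```python
# Sketch of the locality optimization: prefer a host that stores the
# block locally; fall back to any free host otherwise. The structures
# here are our own illustration, not Hadoop's scheduler internals.
def schedule(block_locations, free_hosts):
    """Map each block to a host, preferring hosts holding a replica."""
    assignments = {}
    for block, replicas in block_locations.items():
        # Free hosts that already store a replica of this block
        local = [h for h in replicas if h in free_hosts]
        # Local execution if possible, otherwise a (network-bound) remote host
        assignments[block] = local[0] if local else next(iter(free_hosts))
    return assignments

blocks = {"b1": ["host1", "host2"], "b2": ["host3"]}
print(schedule(blocks, {"host1", "host3"}))
```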
Hadoop 2 – what's the big deal?
If we look at the two main components of the core Hadoop distribution, storage and computation, we see that Hadoop 2 has a very different impact on each of them. Whereas the HDFS found in Hadoop 2 is mostly a much more feature-rich and resilient product than the HDFS in Hadoop 1, for MapReduce the changes are much more profound and have, in fact, altered how Hadoop is perceived as a processing platform in general. Let's look at HDFS in Hadoop 2 first.
Storage in Hadoop 2
We'll discuss the HDFS architecture in more detail in Chapter 2, Storage, but for now it's sufficient to think of a master-slave model. The slave nodes (called DataNodes) hold the actual filesystem data. In particular, each host running a DataNode will typically have one or more disks onto which files containing the data for each HDFS block are written. The DataNode itself has no understanding of the overall filesystem; its role is to store, serve, and ensure the integrity of the data for which it is responsible.
The master node (called the NameNode) is responsible for knowing which of the DataNodes holds which block and how these blocks are structured to form the filesystem. When a client looks at the filesystem and wishes to retrieve a file, it's via a request to the NameNode that the list of required blocks is retrieved.
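The division of responsibilities between the NameNode and the DataNodes can be modeled in miniature. Everything below (class and attribute names included) is an illustrative toy of the metadata model, not HDFS code:

```python
# Toy model of the HDFS master-slave split: the NameNode holds only
# file -> block metadata; the DataNodes hold the block bytes.
# Illustrative sketch, not the real HDFS implementation.
class NameNode:
    def __init__(self):
        self.files = {}            # filename -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of DataNode names

    def add_file(self, name, block_ids, datanodes):
        self.files[name] = list(block_ids)
        for b in block_ids:
            self.block_locations[b] = set(datanodes)

    def lookup(self, name):
        # A client asks the NameNode only for metadata; it then reads
        # the block bytes directly from the DataNodes returned here.
        return [(b, self.block_locations[b]) for b in self.files[name]]

nn = NameNode()
nn.add_file("/logs/day1", ["blk_1", "blk_2"], ["dn1", "dn2", "dn3"])
print(nn.lookup("/logs/day1")[0][0])  # blk_1
```

The sketch makes the resiliency risk discussed below concrete: lose the `files` and `block_locations` dictionaries and the block bytes on the DataNodes become unreadable, because nothing else records which blocks form which files.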
This model works well and has been scaled to clusters with tens of thousands of nodes at companies such as Yahoo!. So, though it is scalable, there is a resiliency risk; if the NameNode becomes unavailable, then the entire cluster is rendered effectively useless. No HDFS operations can be performed, and since the vast majority of installations use HDFS as the storage layer for services such as MapReduce, these also become unavailable even if they are still running without problems.
More catastrophically, the NameNode stores the filesystem metadata in a persistent file on its local filesystem. If the NameNode host crashes in a way that leaves this data unrecoverable, then all data on the cluster is effectively lost forever. The data will still exist on the various DataNodes, but the mapping of which blocks comprise which files is lost. This is why, in Hadoop 1, the best practice was to have the NameNode synchronously write its filesystem metadata to both local disks and at least one remote network volume (typically via NFS).
Several NameNode high-availability (HA) solutions have been made available by third-party suppliers, but the core Hadoop product did not offer such resilience in version 1. Given this architectural single point of failure and the risk of data loss, it won't be a surprise to hear that NameNode HA is one of the major features of HDFS in Hadoop 2 and is something we'll discuss in detail in later chapters. The feature not only provides a standby NameNode that can be automatically promoted to service all requests should the active NameNode fail, but also builds additional resilience for the critical filesystem metadata atop this mechanism.
HDFS in Hadoop 2 is still a non-POSIX filesystem; it still has a very large block size and it still trades latency for throughput. However, it does now have a few capabilities that can make it look a little more like a traditional filesystem. In particular, the core HDFS in Hadoop 2 can now be remotely mounted as an NFS volume. This is another feature that was previously offered as a proprietary capability by third-party suppliers but is now in the main Apache codebase.
Overall, the HDFS in Hadoop 2 is more resilient and can be more easily integrated into existing workflows and processes. It's a strong evolution of the product found in Hadoop 1.
Computation in Hadoop 2
The work on HDFS 2 was started before a direction for MapReduce crystallized. This was likely due to the fact that features such as NameNode HA were such an obvious path that the community knew the most critical areas to address. However, MapReduce didn't really have a similar list of areas for improvement, and that's why, when the MRv2 initiative started, it wasn't completely clear where it would lead.
Perhaps the most frequent criticism of MapReduce in Hadoop 1 was how its batch-processing model was ill-suited to problem domains where faster response times were required. Hive, for example, which we'll discuss in Chapter 7, Hadoop and SQL, provides a SQL-like interface onto HDFS data, but behind the scenes the statements are converted into MapReduce jobs that are then executed like any other. A number of other products and tools took a similar approach, providing a specific user-facing interface that hid a MapReduce translation layer.
Though this approach has been very successful, and some amazing products have been built, the fact remains that in many cases there is a mismatch: all of these interfaces, some of which expect a certain type of responsiveness, are, behind the scenes, being executed on a batch-processing platform. When looking to enhance MapReduce, improvements could be made to make it a better fit for these use cases, but the fundamental mismatch would remain. This situation led to a significant change of focus for the MRv2 initiative; perhaps MapReduce itself didn't need to change, and the real need was instead to enable different processing models on the Hadoop platform. Thus was born Yet Another Resource Negotiator (YARN).
Looking at MapReduce in Hadoop 1, the product actually did two quite different things: it provided the processing framework to execute MapReduce computations, but it also managed the allocation of this computation across the cluster. Not only did it direct data to and between the specific map and reduce tasks, but it also determined where each task would run and managed the full job lifecycle, monitoring the health of each task and node, rescheduling if any failed, and so on.
This is not a trivial task, and the automated parallelization of workloads has always been one of the main benefits of Hadoop. If we look at MapReduce in Hadoop 1, we see that after the user defines the key criteria for the job, everything else is the responsibility of the system. Critically, from a scale perspective, the same MapReduce job can be applied to datasets of any volume hosted on clusters of any size. If the data is 1 GB in size and on a single host, then Hadoop will schedule the processing accordingly. If the data is instead 1 PB in size and hosted across 1,000 machines, then it does likewise. From the user's perspective, the actual scale of the data and cluster is transparent, and aside from affecting the time taken to process the job, it does not change the interface with which they interact with the system.
In Hadoop 2, this role of job scheduling and resource management is separated from that of executing the actual application, and is implemented by YARN.
YARN is responsible for managing the cluster resources, and so MapReduce exists as an application that runs atop the YARN framework. The MapReduce interface in Hadoop 2 is completely compatible with that in Hadoop 1, both semantically and practically. However, under the covers, MapReduce has become a hosted application on the YARN framework.
The significance of this split is that other applications can be written that provide processing models more focused on the actual problem domain and can offload all the resource management and scheduling responsibilities to YARN. The latest versions of many different execution engines have been ported onto YARN, either in a production-ready or experimental state, and this has shown that the approach can allow a single Hadoop cluster to run everything from batch-oriented MapReduce jobs through fast-response SQL queries to continuous data streaming, and even to implement models such as graph processing and the Message Passing Interface (MPI) from the High Performance Computing (HPC) world. The following diagram shows the architecture of Hadoop 2:
Hadoop 2
This is why much of the attention and excitement around Hadoop 2 has been focused on YARN and the frameworks that sit on top of it, such as Apache Tez and Apache Spark. With YARN, the Hadoop cluster is no longer just a batch-processing engine; it is the single platform on which a vast array of processing techniques can be applied to the enormous data volumes stored in HDFS. Moreover, applications can build on these computation paradigms and execution models.
The analogy that is achieving some traction is to think of YARN as the processing kernel upon which other domain-specific applications can be built. We'll discuss YARN in more detail in this book, particularly in Chapter 3, Processing – MapReduce and Beyond, Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark.
Distributions of Apache Hadoop
In the very early days of Hadoop, the burden of installing (often building from source) and managing each component and its dependencies fell on the user. As the system became more popular and the ecosystem of third-party tools and libraries started to grow, the complexity of installing and managing a Hadoop deployment increased dramatically, to the point where providing a coherent offering of software packages, documentation, and training built around the core Apache Hadoop has become a business model. Enter the world of distributions for Apache Hadoop.
Hadoop distributions are conceptually similar to how Linux distributions provide a set of integrated software around a common core. They take on the burden of bundling and packaging software themselves and provide the user with an easy way to install, manage, and deploy Apache Hadoop and a selected number of third-party libraries. In particular, the distribution releases deliver a series of product versions that are certified to be mutually compatible. Historically, putting together a Hadoop-based platform was often greatly complicated by the various version interdependencies.
Cloudera (http://www.cloudera.com), Hortonworks (http://www.hortonworks.com), and MapR (http://www.mapr.com) are amongst the first to have reached the market, each characterized by different approaches and selling points. Hortonworks positions itself as the open source player; Cloudera is also committed to open source but adds proprietary bits for configuring and managing Hadoop; MapR provides a hybrid open source/proprietary Hadoop distribution characterized by a proprietary NFS layer instead of HDFS and a focus on providing services.
Another strong player in the distributions ecosystem is Amazon, which offers a version of Hadoop called Elastic MapReduce (EMR) on top of the Amazon Web Services (AWS) infrastructure.
With the advent of Hadoop 2, the number of available distributions for Hadoop has increased dramatically, far in excess of the four we mentioned. A possibly incomplete list of software offerings that include Apache Hadoop can be found at http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support.
A dual approach
In this book, we will discuss both the building and the management of local Hadoop clusters, in addition to showing how to push the processing into the cloud via EMR.
The reason for this is twofold: firstly, though EMR makes Hadoop much more accessible, there are aspects of the technology that only become apparent when manually administering the cluster. Although it is also possible to use EMR in a more manual mode, we'll generally use a local cluster for such explorations. Secondly, though it isn't necessarily an either/or decision, many organizations use a mixture of in-house and cloud-hosted capacities, sometimes due to a concern about over-reliance on a single external provider; but practically speaking, it's often convenient to do development and small-scale tests on local capacity and then deploy at production scale into the cloud.
In a few of the later chapters, where we discuss additional products that integrate with Hadoop, we'll mostly give examples on local clusters, as there is no difference in how the products work regardless of where they are deployed.
AWS – infrastructure on demand from Amazon
AWS is a set of cloud-computing services offered by Amazon. We will use several of these services in this book.
Simple Storage Service (S3)
Amazon's Simple Storage Service (S3), found at http://aws.amazon.com/s3/, is a storage service that provides a simple key-value storage model. Using web, command-line, or programmatic interfaces to create objects, which can be anything from text files to images to MP3s, you can store and retrieve your data based on a hierarchical model. In this model, you create buckets that contain objects. Each bucket has a unique identifier, and within each bucket, every object is uniquely named. This simple strategy enables an extremely powerful service for which Amazon takes complete responsibility (for service scaling, in addition to reliability and availability of data).
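The bucket/object model just described can be captured in a few lines. The following is a plain-Python toy (not the AWS SDK; all class and method names here are our own) showing uniquely identified buckets containing uniquely named objects:

```python
# Toy sketch of the S3 data model: unique buckets, each holding
# uniquely keyed objects. Illustration only, not the AWS SDK.
class ToyS3:
    def __init__(self):
        self.buckets = {}

    def create_bucket(self, bucket):
        # Bucket identifiers must be unique
        if bucket in self.buckets:
            raise ValueError("bucket names must be unique")
        self.buckets[bucket] = {}

    def put_object(self, bucket, key, data):
        # Within a bucket, each object is uniquely named by its key
        self.buckets[bucket][key] = data

    def get_object(self, bucket, key):
        return self.buckets[bucket][key]

s3 = ToyS3()
s3.create_bucket("my-data")
s3.put_object("my-data", "input/file1.txt", b"hello")
print(s3.get_object("my-data", "input/file1.txt"))  # b'hello'
```

Keys such as "input/file1.txt" merely look hierarchical; in the real service, as in this sketch, a key is just a flat string, and the path-like naming is a convention layered on top.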
Elastic MapReduce (EMR)
Amazon's Elastic MapReduce, found at http://aws.amazon.com/elasticmapreduce/, is basically Hadoop in the cloud. Using any of the multiple interfaces (web console, CLI, or API), a Hadoop workflow is defined with attributes such as the number of Hadoop hosts required and the location of the source data. The Hadoop code implementing the MapReduce jobs is provided, and the virtual Go button is pressed.
In its most impressive mode, EMR can pull source data from S3, process it on a Hadoop cluster it creates on Amazon's virtual host on-demand service, EC2, push the results back into S3, and terminate the Hadoop cluster and the EC2 virtual machines hosting it. Naturally, each of these services has a cost (usually on a per-GB-stored and server-time-usage basis), but the ability to access such powerful data-processing capabilities with no need for dedicated hardware is a powerful one.
Getting started
We will now describe the two environments we will use throughout the book: Cloudera's QuickStart virtual machine will be our reference system on which we will show all examples, but we will additionally demonstrate some examples on Amazon's EMR when there is some particularly valuable aspect to running the example in the on-demand service.
Although the examples and code provided are aimed at being as general-purpose and portable as possible, our reference setup, when talking about a local cluster, will be Cloudera running atop CentOS Linux.
For the most part, we will show examples that make use of, or are executed from, a terminal prompt. Although Hadoop's graphical interfaces have improved significantly over the years (for example, the excellent HUE and Cloudera Manager), when it comes to development, automation, and programmatic access to the system, the command line is still the most powerful tool for the job.
All examples and source code presented in this book can be downloaded from https://github.com/learninghadoop2/book-examples. In addition, we have a homepage for the book where we will publish updates and related material at http://learninghadoop2.com.
Cloudera QuickStart VM
One of the advantages of Hadoop distributions is that they give access to easy-to-install, packaged software. Cloudera takes this one step further and provides a freely downloadable Virtual Machine instance of its latest distribution, known as the CDH QuickStart VM, deployed on top of CentOS Linux.
In the remaining parts of this book, we will use the CDH 5.0.0 VM as the reference and baseline system to run examples and source code. Images of the VM are available for the VMware (http://www.vmware.com/nl/products/player/), KVM (http://www.linux-kvm.org/page/Main_Page), and VirtualBox (https://www.virtualbox.org/) virtualization systems.
AmazonEMRBeforeusingElasticMapReduce,weneedtosetupanAWSaccountandregisteritwiththenecessaryservices.
CreatinganAWSaccountAmazonhasintegrateditsgeneralaccountswithAWS,whichmeansthat,ifyoualreadyhaveanaccountforanyoftheAmazonretailwebsites,thisistheonlyaccountyouwillneedtouseAWSservices.
Note

Note that AWS services have a cost; you will need an active credit card associated with the account to which charges can be made.
If you require a new Amazon account, go to http://aws.amazon.com, select Create a new AWS account, and follow the prompts. Amazon has added a free tier for some services, so you might find that, in the early days of testing and exploration, you are keeping many of your activities within the noncharged tier. The scope of the free tier has been expanding, so make sure you know what you will and won't be charged for.
Signing up for the necessary services

Once you have an Amazon account, you will need to register it for use with the required AWS services, that is, Simple Storage Service (S3), Elastic Compute Cloud (EC2), and Elastic MapReduce. There is no cost to simply sign up to any AWS service; the process just makes the service available to your account.
Go to the S3, EC2, and EMR pages linked from http://aws.amazon.com, click on the Sign up button on each page, and then follow the prompts.
Using Elastic MapReduce

Having created an account with AWS and registered all the required services, we can proceed to configure programmatic access to EMR.
Getting Hadoop up and running

Note

Caution! This costs real money!
Before going any further, it is critical to understand that use of AWS services will incur charges that will appear on the credit card associated with your Amazon account. Most of the charges are quite small and increase with the amount of infrastructure consumed; storing 10 GB of data in S3 costs 10 times more than 1 GB, and running 20 EC2 instances costs 20 times as much as a single one. There are tiered cost models, so the actual costs tend to have smaller marginal increases at higher levels. But you should read carefully through the pricing sections for each service before using any of them. Note also that currently data transfer out of AWS services, such as EC2 and S3, is chargeable, but data transfer between services is not. This means it is often most cost-effective to carefully design your use of AWS to keep data within AWS through as much of the data processing as possible. For information regarding AWS and EMR, consult http://aws.amazon.com/elasticmapreduce/#pricing.
How to use EMR

Amazon provides both web and command-line interfaces to EMR. Both interfaces are just a frontend to the very same system; a cluster created with the command-line interface can be inspected and managed with the web tools, and vice versa.
For the most part, we will be using the command-line tools to create and manage clusters programmatically, and will fall back on the web interface in cases where it makes sense to do so.
AWS credentials

Before using either programmatic or command-line tools, we need to look at how an account holder authenticates to AWS to make such requests.
Each AWS account has several identifiers, such as the following, that are used when accessing the various services:
Account ID: each AWS account has a numeric ID.
Access key: the associated access key is used to identify the account making the request.
Secret access key: the partner to the access key is the secret access key. The access key is not a secret and could be exposed in service requests, but the secret access key is what you use to validate yourself as the account owner. Treat it like your credit card.
Key pairs: these are the key pairs used to log in to EC2 hosts. It is possible to either generate public/private key pairs within EC2 or to import externally generated keys into the system.
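The access-key/secret-key split can be illustrated with a small sketch: the access key travels with the request, while the secret key is only ever used locally to compute a signature that the service verifies. This is a conceptual toy, not the real AWS Signature Version 4 algorithm; the credential values and header names are made up:

```python
import hmac
import hashlib

# Hypothetical credentials -- in practice these come from IAM.
ACCESS_KEY = "AKIAEXAMPLE"
SECRET_KEY = "wJalrXUtnFEMI/EXAMPLEKEY"

def sign_request(secret_key, payload):
    """Compute an HMAC-SHA256 signature over the request payload.

    Only the signature is sent; the secret key never leaves the client.
    """
    return hmac.new(secret_key.encode("utf-8"),
                    payload.encode("utf-8"),
                    hashlib.sha256).hexdigest()

# The request carries the (public) access key and the signature;
# the server looks up the secret for that access key and verifies it.
payload = "GET /learninghadoop2?list-type=2"
signature = sign_request(SECRET_KEY, payload)
request_headers = {
    "X-Access-Key": ACCESS_KEY,   # identifies the account
    "X-Signature": signature,     # proves possession of the secret
}
```

This is why the secret access key must be treated like a credit card: anyone holding it can produce valid signatures for your account.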
User credentials and permissions are managed via a web service called Identity and Access Management (IAM), which you need to sign up to in order to obtain access and secret keys.
If this sounds confusing, it's because it is, at least at first. When using a tool to access an AWS service, there's usually the single, upfront step of adding the right credentials to a configuration file, and then everything just works. However, if you do decide to explore programmatic or command-line tools, it will be worth investing a little time to read the documentation for each service to understand how its security works. More information on creating an AWS account and obtaining access credentials can be found at http://docs.aws.amazon.com/iam.
The AWS command-line interface

Each AWS service historically had its own set of command-line tools. Recently though, Amazon has created a single, unified command-line tool that allows access to most services. The Amazon CLI can be found at http://aws.amazon.com/cli.
It can be installed from a tarball or via the pip or easy_install package managers.
On the CDH QuickStart VM, we can install awscli using the following command:
$ pip install awscli
In order to access the API, we need to configure the software to authenticate to AWS using our access and secret keys.
This is also a good moment to set up an EC2 key pair by following the instructions provided at https://console.aws.amazon.com/ec2/home?region=us-east-1#c=EC2&s=KeyPairs.
Although a key pair is not strictly necessary to run an EMR cluster, it will give us the capability to remotely log in to the master node and gain low-level access to the cluster.
The following command will guide you through a series of configuration steps and store the resulting configuration in the ~/.aws/credentials file:
$ aws configure
Once the CLI is configured, we can query AWS with aws <service> <arguments>. To create and query an S3 bucket, use something like the following commands. Note that S3 bucket names need to be globally unique across all AWS accounts, so most common names, such as s3://mybucket, will not be available:
$ aws s3 mb s3://learninghadoop2
$ aws s3 ls
We can provision an EMR cluster with five m1.xlarge nodes using the following command:
$ aws emr create-cluster --name "EMR cluster" \
    --ami-version 3.2.0 \
    --instance-type m1.xlarge \
    --instance-count 5 \
    --log-uri s3://learninghadoop2/emr-logs
Here, --ami-version is the ID of an Amazon Machine Image template (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html), and --log-uri instructs EMR to collect logs and store them in the learninghadoop2 S3 bucket.
Note

If you did not specify a default region when setting up the AWS CLI, then you will also have to add one to most EMR commands using the --region argument; for example, --region eu-west-1 to use the EU (Ireland) region. You can find details of all available AWS regions at http://docs.aws.amazon.com/general/latest/gr/rande.html.
We can submit workflows by adding steps to a running cluster using the following command:
$ aws emr add-steps --cluster-id <cluster> --steps <steps>
To terminate the cluster, use the following command line:
$ aws emr terminate-clusters --cluster-id <cluster>
In later chapters, we will show you how to add steps to execute MapReduce jobs and Pig scripts.
More information on using the AWS CLI can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage.html.
Running the examples

The source code of all examples is available at https://github.com/learninghadoop2/book-examples.
Gradle (http://www.gradle.org/) scripts and configurations are provided to compile most of the Java code. The gradlew script included with the examples will bootstrap Gradle and use it to fetch dependencies and compile code.
JAR files can be created by invoking the jar task via the gradlew script, as follows:
./gradlew jar
Jobs are usually executed by submitting a JAR file using the hadoop jar command, as follows:
$ hadoop jar example.jar <MainClass> [-libjars $LIBJARS] arg1 arg2 … argN
The optional -libjars parameter specifies runtime third-party dependencies to ship to remote nodes.
Note

Some of the frameworks we will work with, such as Apache Spark, come with their own build and package management tools. Additional information and resources will be provided for these particular cases.
The copyJar Gradle task can be used to download third-party dependencies into build/libjars/<example>/lib, as follows:
./gradlew copyJar
For convenience, we provide a fatJar Gradle task that bundles the example classes and their dependencies into a single JAR file. Although this approach is discouraged in favor of using -libjars, it might come in handy when dealing with dependency issues.
The following command will generate build/libs/<example>-all.jar:
$ ./gradlew fatJar
Data processing with Hadoop

In the remaining chapters of this book, we will introduce the core components of the Hadoop ecosystem as well as a number of third-party tools and libraries that will make writing robust, distributed code an accessible and hopefully enjoyable task. While reading this book, you will learn how to collect, process, store, and extract information from large amounts of structured and unstructured data.
We will use a dataset generated from Twitter's (http://www.twitter.com) real-time firehose. This approach will allow us to experiment with relatively small datasets locally and, once ready, scale the examples up to production-level data sizes.
Why Twitter?

Thanks to its programmatic APIs, Twitter provides an easy way to generate datasets of arbitrary size and inject them into our local- or cloud-based Hadoop clusters. Other than the sheer size, the dataset that we will use has a number of properties that fit several interesting data modeling and processing use cases.
Twitter data possesses the following properties:
Unstructured: each status update is a text message that can contain references to media content such as URLs and images
Structured: tweets are timestamped, sequential records
Graph: relationships such as replies and mentions can be modeled as a network of interactions
Geolocated: the location where a tweet was posted or where a user resides
Real time: all data generated on Twitter is available via a real-time firehose
These properties will be reflected in the types of application that we can build with Hadoop. These include examples of sentiment analysis, social network analysis, and trend analysis.
Building our first dataset

Twitter's terms of service prohibit redistribution of user-generated data in any form; for this reason, we cannot make available a common dataset. Instead, we will use a Python script to programmatically access the platform and create a dump of user tweets collected from a live stream.
One service, multiple APIs

Twitter users share more than 200 million tweets, also known as status updates, a day. The platform offers access to this corpus of data via four types of APIs, each of which represents a facet of Twitter and aims at satisfying specific use cases, such as linking and interacting with Twitter content from third-party sources (Twitter for Products), programmatic access to specific users' or sites' content (REST), search capabilities across users' or sites' timelines (Search), and access to all content created on the Twitter network in real time (Streaming).
The Streaming API allows direct access to the Twitter stream, tracking keywords, retrieving geotagged tweets from a certain region, and much more. In this book, we will make use of this API as a data source to illustrate both the batch and real-time capabilities of Hadoop. We will not, however, interact with the API itself; rather, we will make use of third-party libraries to offload chores such as authentication and connection management.
Anatomy of a Tweet

Each tweet object returned by a call to the real-time APIs is represented as a serialized JSON string that contains a set of attributes and metadata in addition to a textual message. This additional content includes a numerical ID that uniquely identifies the tweet, the location where the tweet was shared, the user who shared it (user object), whether it was republished by other users (retweeted) and how many times (retweet count), the machine-detected language of its text, whether the tweet was posted in reply to someone and, if so, the user and tweet IDs it replied to, and so on.
The structure of a Tweet, and any other object exposed by the API, is constantly evolving. An up-to-date reference can be found at https://dev.twitter.com/docs/platform-objects/tweets.
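To make this structure concrete, the snippet below parses a small, hand-crafted, tweet-like JSON document. The field names follow the attributes described above, but the object is heavily trimmed and the values are invented; real tweets carry many more fields:

```python
import json

# A hypothetical, heavily trimmed tweet object for illustration only.
raw = '''{
  "id": 123456789012345678,
  "text": "Learning Hadoop 2!",
  "lang": "en",
  "retweet_count": 2,
  "in_reply_to_status_id": null,
  "user": {"id": 42, "screen_name": "example_user"},
  "coordinates": null
}'''

tweet = json.loads(raw)
print(tweet['id'])                     # numerical ID identifying the tweet
print(tweet['text'])                   # the unstructured message body
print(tweet['user']['screen_name'])    # nested user object
print(tweet['in_reply_to_status_id'])  # None when the tweet is not a reply
```

This parse-then-extract pattern is exactly what stream.py does, later in this chapter, for each message arriving on the stream.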
Twitter credentials

Twitter makes use of the OAuth protocol to authenticate and authorize access from third-party software to its platform.
The application obtains, through an external channel such as a web form, the following pair of credentials:
Consumer key
Consumer secret
The consumer secret is never directly transmitted to the third party, as it is used to sign each request.
The user authorizes the application to access the service via a three-way process that, once completed, grants the application a token consisting of the following:
Access token
Access secret
Similarly to the consumer secret, the access secret is never directly transmitted to the third party, and it is used to sign each request.
In order to use the Streaming API, we will first need to register an application and grant it programmatic access to the system. If you require a new Twitter account, proceed to the signup page at https://twitter.com/signup, and fill in the required information. Once this step is completed, we need to create a sample application that will access the API on our behalf and grant it the proper authorization rights. We will do so using the web form found at https://dev.twitter.com/apps.
When creating a new app, we are asked to give it a name, a description, and a URL. The following screenshot shows the settings of a sample application named Learning Hadoop 2 Book Dataset. For the purpose of this book, we do not need to specify a valid URL, so we used a placeholder instead.
Once the form is filled in, we need to review and accept the terms of service and click on the Create Application button in the bottom-left corner of the page.
We are now presented with a page that summarizes our application details, as seen in the following screenshot; the authentication and authorization credentials can be found under the OAuth Tool tab.
We are finally ready to generate our very first Twitter dataset.
Programmatic access with Python

In this section, we will use Python and the tweepy library, found at https://github.com/tweepy/tweepy, to collect Twitter's data. The stream.py file found in the ch1 directory of the book code archive instantiates a listener to the real-time firehose, grabs a data sample, and echoes each tweet's text to standard output.
The tweepy library can be installed using either the easy_install or pip package managers, or by cloning the repository at https://github.com/tweepy/tweepy.
On the CDH QuickStart VM, we can install tweepy using the following command line:
$ pip install tweepy
When invoked with the -j parameter, the script will output a JSON tweet to standard output; -t extracts and prints the text field. We specify how many tweets to print with -n <num tweets>. When -n is not specified, the script will run indefinitely. Execution can be terminated by pressing Ctrl + C.
The script expects OAuth credentials to be stored as shell environment variables; the following credentials will have to be set in the terminal session from where stream.py will be executed:
$ export TWITTER_CONSUMER_KEY="your_consumer_key"
$ export TWITTER_CONSUMER_SECRET="your_consumer_secret"
$ export TWITTER_ACCESS_KEY="your_access_key"
$ export TWITTER_ACCESS_SECRET="your_access_secret"
Once the required dependency has been installed and the OAuth data in the shell environment has been set, we can run the program as follows:
$ python stream.py -t -n 1000 > tweets.txt
We are relying on Linux's shell I/O to redirect the output of stream.py with the > operator to a file called tweets.txt. If everything was executed correctly, you should see a wall of text, where each line is a tweet.
Notice that in this example, we did not make use of Hadoop at all. In the next chapters, we will show how to import a dataset generated from the Streaming API into Hadoop and analyze its contents on the local cluster and Amazon EMR.
Fornow,let’stakealookatthesourcecodeofstream.py,whichcanbefoundathttps://github.com/learninghadoop2/book-examples/blob/master/ch1/stream.py:
import tweepy
import os
import json
import argparse

consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_key = os.environ['TWITTER_ACCESS_KEY']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

class EchoStreamListener(tweepy.StreamListener):
    def __init__(self, api, dump_json=False, numtweets=0):
        self.api = api
        self.dump_json = dump_json
        self.count = 0
        self.limit = int(numtweets)
        super(tweepy.StreamListener, self).__init__()

    def on_data(self, tweet):
        tweet_data = json.loads(tweet)
        if 'text' in tweet_data:
            if self.dump_json:
                print tweet.rstrip()
            else:
                print tweet_data['text'].encode("utf-8").rstrip()
            self.count = self.count + 1
        return False if self.count == self.limit else True

    def on_error(self, status_code):
        return True

    def on_timeout(self):
        return True

…

if __name__ == '__main__':
    parser = get_parser()
    args = parser.parse_args()

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    sapi = tweepy.streaming.Stream(
        auth, EchoStreamListener(
            api=api,
            dump_json=args.json,
            numtweets=args.numtweets))
    sapi.sample()
First, we import our dependencies: tweepy, plus the os, json, and argparse modules from the Python standard library.
We then define a class, EchoStreamListener, that inherits and extends StreamListener from tweepy. As the name suggests, StreamListener listens for events and tweets being published on the real-time stream and performs actions accordingly.
Whenever a new event is detected, it triggers a call to on_data(). In this method, we extract the text field from a tweet object and print it to standard output with UTF-8 encoding. Alternatively, if the script is invoked with -j, we print the whole JSON tweet.

When the script is executed, we instantiate a tweepy.OAuthHandler object with the OAuth credentials that identify our Twitter account, and then we use this object to authenticate with the application access and secret key. We then use the auth object to create an instance of the tweepy.API class (api).
Upon successful authentication, we tell Python to listen for events on the real-time stream using EchoStreamListener.
An HTTP GET request to the statuses/sample endpoint is performed by sample(). The request returns a random sample of all public statuses.
Note

Beware! By default, sample() will run indefinitely. Remember to explicitly terminate the method call by pressing Ctrl + C.
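The get_parser() helper is elided from the listing above. A minimal version, written here as a guess matching the -j, -t, and -n flags described earlier (the real helper in the book repository may differ), might look like this:

```python
import argparse

def get_parser():
    """A hypothetical reconstruction of the elided get_parser() helper,
    exposing the -j, -t, and -n options described in the text."""
    parser = argparse.ArgumentParser(
        description='Dump tweets from the Twitter stream')
    parser.add_argument('-j', '--json', action='store_true',
                        help='print the whole JSON tweet')
    parser.add_argument('-t', '--text', action='store_true',
                        help='print only the text field')
    parser.add_argument('-n', '--numtweets', default=0,
                        help='number of tweets to fetch (0 = run forever)')
    return parser

# Mirrors the invocation: python stream.py -t -n 1000
args = get_parser().parse_args(['-t', '-n', '1000'])
```

Note that numtweets is kept as a string here; EchoStreamListener converts it with int(numtweets) in its constructor.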
Summary

This chapter gave a whirlwind tour of where Hadoop came from, its evolution, and why the version 2 release is such a major milestone. We also described the emerging market in Hadoop distributions and how we will use a combination of local and cloud distributions in the book.
Finally, we described how to set up the needed software, accounts, and environments required in subsequent chapters and demonstrated how to pull data from the Twitter stream that we will use for examples.
With this background out of the way, we will now move on to a detailed examination of the storage layer within Hadoop.
Chapter 2. Storage

After the overview of Hadoop in the previous chapter, we will now start looking at its various component parts in more detail. We will start at the conceptual bottom of the stack in this chapter: the means and mechanisms for storing data within Hadoop. In particular, we will discuss the following topics:
Describe the architecture of the Hadoop Distributed File System (HDFS)
Show what enhancements to HDFS have been made in Hadoop 2
Explore how to access HDFS using command-line tools and the Java API
Give a brief description of ZooKeeper, another (sort of) filesystem within Hadoop
Survey considerations for storing data in Hadoop and the available file formats
In Chapter 3, Processing – MapReduce and Beyond, we will describe how Hadoop provides the framework to allow data to be processed.
The inner workings of HDFS

In Chapter 1, Introduction, we gave a very high-level overview of HDFS; we will now explore it in a little more detail. As mentioned in that chapter, HDFS can be viewed as a filesystem, though one with very specific performance characteristics and semantics. It's implemented with two main server processes: the NameNode and the DataNodes, configured in a master/slave setup. If you view the NameNode as holding all the filesystem metadata and the DataNodes as holding the actual filesystem data (blocks), then this is a good starting point. Every file placed onto HDFS will be split into multiple blocks that might reside on numerous DataNodes, and it's the NameNode that understands how these blocks can be combined to construct the files.
ClusterstartupLet’sexplorethevariousresponsibilitiesofthesenodesandthecommunicationbetweenthembyassumingwehaveanHDFSclusterthatwaspreviouslyshutdownandthenexaminingthestartupbehavior.
NameNodestartupWe’llfirstlyconsiderthestartupoftheNameNode(thoughthereisnoactualorderingrequirementforthisandwearedoingitfornarrativereasonsalone).TheNameNodeactuallystorestwotypesofdataaboutthefilesystem:
The structure of the filesystem, that is, directory names, filenames, locations, and attributes
The blocks that comprise each file on the filesystem
This data is stored in files that the NameNode reads at startup. Note that the NameNode does not persistently store the mapping of the blocks that are stored on particular DataNodes; we'll see how that information is communicated shortly.
Because the NameNode relies on this in-memory representation of the filesystem, it tends to have quite different hardware requirements compared to the DataNodes. We'll explore hardware selection in more detail in Chapter 10, Running a Hadoop Cluster; for now, just remember that the NameNode tends to be quite memory hungry. This is particularly true on very large clusters with many (millions or more) files, particularly if these files have very long names. This scaling limitation on the NameNode has also led to an additional Hadoop 2 feature that we will not explore in much detail: NameNode federation, whereby multiple NameNodes (or NameNode HA pairs) work collaboratively to provide the overall metadata for the full filesystem.
The main file written by the NameNode is called fsimage; this is the single most important piece of data in the entire cluster, as without it, the knowledge of how to reconstruct all the data blocks into the usable filesystem is lost. This file is read into memory and all future modifications to the filesystem are applied to this in-memory representation of the filesystem. The NameNode does not write out new versions of fsimage as changes are applied while it is running; instead, it writes another file called edits, which is a list of the changes that have been made since the last version of fsimage was written.
The NameNode startup process is to first read the fsimage file, then to read the edits file and apply all the changes stored in the edits file to the in-memory copy of fsimage. It then writes to disk a new up-to-date version of the fsimage file and is ready to receive client requests.
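The fsimage-plus-edits startup sequence is a classic checkpointing pattern. The sketch below models it conceptually, with the metadata as a plain dictionary and the edit log as a list of operations; the real on-disk formats are binary and far richer:

```python
# Conceptual model of NameNode startup: load the last checkpoint (fsimage),
# replay the edit log, then write a fresh checkpoint.

fsimage = {'/user': 'dir', '/user/a.txt': 'file'}   # last saved snapshot
edits = [                                            # changes made since then
    ('mkdir', '/tmp'),
    ('create', '/tmp/b.txt'),
    ('delete', '/user/a.txt'),
]

def replay(image, log):
    """Apply each logged operation to an in-memory copy of the image."""
    image = dict(image)
    for op, path in log:
        if op == 'mkdir':
            image[path] = 'dir'
        elif op == 'create':
            image[path] = 'file'
        elif op == 'delete':
            image.pop(path, None)
    return image

fsimage = replay(fsimage, edits)   # up-to-date image, ready to be written out
edits = []                         # a new, empty edit log starts accumulating
```

The same replay step is what can make a restart slow on a busy cluster, which is the motivation for the periodic checkpointing discussed later in this chapter.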
DataNode startup

When the DataNodes start up, they first catalog the blocks for which they hold copies. Typically, these blocks will be written simply as files on the local DataNode filesystem.
The DataNode will perform some block consistency checking and then report to the NameNode the list of blocks for which it has valid copies. This is how the NameNode constructs the final mapping it requires: by learning which blocks are stored on which DataNodes. Once the DataNode has registered itself with the NameNode, an ongoing series of heartbeat requests will be sent between the nodes to allow the NameNode to detect DataNodes that have shut down, become unreachable, or have newly entered the cluster.
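The block-report mechanism amounts to inverting per-node lists into a block-to-locations map. A minimal sketch, with made-up node and block names:

```python
from collections import defaultdict

# Hypothetical block reports: each DataNode lists the block IDs it holds.
block_reports = {
    'datanode1': ['blk_1', 'blk_2'],
    'datanode2': ['blk_2', 'blk_3'],
    'datanode3': ['blk_1', 'blk_3'],
}

# The NameNode inverts the reports into the mapping it never persists:
# block ID -> set of DataNodes holding a valid copy.
block_map = defaultdict(set)
for node, blocks in block_reports.items():
    for blk in blocks:
        block_map[blk].add(node)

print(sorted(block_map['blk_2']))   # ['datanode1', 'datanode2']
```

Because this map is rebuilt from reports on every startup, losing a DataNode never corrupts the NameNode's metadata; the map simply converges to whatever the surviving nodes report.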
Block replication

HDFS replicates each block onto multiple DataNodes; the default replication factor is 3, but this is configurable on a per-file level. HDFS can also be configured to be able to determine whether given DataNodes are in the same physical hardware rack or not. Given smart block placement and this knowledge of the cluster topology, HDFS will attempt to place the second replica on a different host but in the same equipment rack as the first, and the third on a host outside the rack. In this way, the system can survive the failure of as much as a full rack of equipment and still have at least one live replica for each block. As we'll see in Chapter 3, Processing – MapReduce and Beyond, knowledge of block placement also allows Hadoop to schedule processing as near as possible to a replica of each block, which can greatly improve performance.
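The placement policy described above can be sketched in a few lines. This is a conceptual illustration with an invented topology, not HDFS's actual placement code, which weighs many more factors such as node load and available space:

```python
# Hypothetical cluster topology: host -> rack.
topology = {
    'host1': 'rack1', 'host2': 'rack1',
    'host3': 'rack2', 'host4': 'rack2',
}

def place_replicas(writer, topology):
    """Pick three hosts following the placement described in the text:
    first replica on the writer's host, second on a different host in
    the same rack, third on a host outside that rack."""
    local_rack = topology[writer]
    same_rack = [h for h, r in topology.items()
                 if r == local_rack and h != writer]
    other_rack = [h for h, r in topology.items() if r != local_rack]
    return [writer, same_rack[0], other_rack[0]]

replicas = place_replicas('host1', topology)
```

With this layout, losing all of rack1 still leaves the copy on the rack2 host, which is exactly the failure mode the policy is designed to survive.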
Remember that replication is a strategy for resilience, but it is not a backup mechanism; if you have data mastered in HDFS that is critical, then you need to consider backup or other approaches that give protection against errors, such as accidentally deleted files, against which replication will not defend.
When the NameNode starts up and is receiving the block reports from the DataNodes, it will remain in safe mode until a configurable threshold of blocks (the default is 99.9 percent) have been reported as live. While in safe mode, clients cannot make any modifications to the filesystem.
Command-line access to the HDFS filesystem

Within the Hadoop distribution, there is a command-line utility called hdfs, which is the primary way to interact with the filesystem from the command line. Run this without any arguments to see the various subcommands available. There are many, though; several are used to do things like starting or stopping various HDFS components. The general form of the hdfs command is:
hdfs <sub-command> <command> [arguments]
The two main subcommands we will use in this book are:
dfs: This is used for general filesystem access and manipulation, including reading/writing and accessing files and directories.
dfsadmin: This is used for administration and maintenance of the filesystem. We will not cover this command in detail, though. Have a look at the -report command, which gives a listing of the state of the filesystem and all DataNodes:
$ hdfs dfsadmin -report
Note

Note that the dfs and dfsadmin commands can also be used with the main Hadoop command-line utility, for example, hadoop fs -ls /. This was the approach in earlier versions of Hadoop but is now deprecated in favor of the hdfs command.
Exploring the HDFS filesystem

Run the following to get a list of the available commands provided by the dfs subcommand:
$ hdfs dfs
As will be seen from the output of the preceding command, many of these look similar to standard Unix filesystem commands and, not surprisingly, they work as would be expected. In our test VM, we have a user account called cloudera. Using this user, we can list the root of the filesystem as follows:
$ hdfs dfs -ls /
Found 7 items
drwxr-xr-x   - hbase hbase               0 2014-04-04 15:18 /hbase
drwxr-xr-x   - hdfs  supergroup          0 2014-10-21 13:16 /jar
drwxr-xr-x   - hdfs  supergroup          0 2014-10-15 15:26 /schema
drwxr-xr-x   - solr  solr                0 2014-04-04 15:16 /solr
drwxrwxrwt   - hdfs  supergroup          0 2014-11-12 11:29 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2014-07-13 09:05 /user
drwxr-xr-x   - hdfs  supergroup          0 2014-04-04 15:15 /var
The output is very similar to the Unix ls command. The file attributes work the same as the user/group/world attributes on a Unix filesystem (including the t sticky bit, as can be seen) plus details of the owner, group, and modification time of the directories. The column between the group name and the modified date is the size; this is 0 for directories but will have a value for files, as we'll see in the code following the next information box.
Note

If relative paths are used, they are taken from the home directory of the user. If there is no home directory, we can create it using the following commands:

$ sudo -u hdfs hdfs dfs -mkdir /user/cloudera
$ sudo -u hdfs hdfs dfs -chown cloudera:cloudera /user/cloudera

The mkdir and chown steps require superuser privileges (sudo -u hdfs).
$ hdfs dfs -mkdir testdir
$ hdfs dfs -ls
Found 1 items
drwxr-xr-x   - cloudera cloudera          0 2014-11-13 11:21 testdir
Then, we can create a file, copy it to HDFS, and read its contents directly from its location on HDFS, as follows:
$ echo "Hello world" > testfile.txt
$ hdfs dfs -put testfile.txt testdir
Note that there is an older command called -copyFromLocal, which works in the same way as -put; you might see it in older documentation online. Now, run the following command and check the output:
$ hdfs dfs -ls testdir
Found 1 items
-rw-r--r--   3 cloudera cloudera         12 2014-11-13 11:21 testdir/testfile.txt
Note the new column between the file attributes and the owner; this is the replication factor of the file. Now, finally, run the following command:
$ hdfs dfs -tail testdir/testfile.txt
Hello world
Most of the remaining dfs subcommands are pretty intuitive; play around. We'll explore snapshots and programmatic access to HDFS later in this chapter.
Protecting the filesystem metadata

Because the fsimage file is so critical to the filesystem, its loss is a catastrophic failure. In Hadoop 1, where the NameNode was a single point of failure, the best practice was to configure the NameNode to synchronously write the fsimage and edits files to both local storage plus at least one other location on a remote filesystem (often NFS). In the event of NameNode failure, a replacement NameNode could be started using this up-to-date copy of the filesystem metadata. The process would require non-trivial manual intervention, however, and would result in a period of complete cluster unavailability.
Secondary NameNode not to the rescue

The most unfortunately named component in all of Hadoop 1 was the SecondaryNameNode, which, not unreasonably, many people expect to be some sort of backup or standby NameNode. It is not; instead, the SecondaryNameNode was responsible only for periodically reading the latest version of the fsimage and edits files and creating a new up-to-date fsimage with the outstanding edits applied. On a busy cluster, this checkpoint could significantly speed up the restart of the NameNode by reducing the number of edits it had to apply before being able to service clients.

In Hadoop 2, the naming is clearer; there are Checkpoint nodes, which perform the role previously played by the SecondaryNameNode, plus Backup NameNodes, which keep a local up-to-date copy of the filesystem metadata, even though the process to promote a Backup node to be the primary NameNode is still a multistage manual process.
Hadoop 2 NameNode HA

In most production Hadoop 2 clusters, however, it makes more sense to use the full High Availability (HA) solution instead of relying on Checkpoint and Backup nodes. It is actually an error to try to combine NameNode HA with the Checkpoint and Backup node mechanisms.

The core idea is for a pair (currently no more than two are supported) of NameNodes configured in an active/passive cluster. One NameNode acts as the live master that services all client requests, and the second remains ready to take over should the primary fail. In particular, Hadoop 2 HDFS enables this HA through two mechanisms:

Providing a means for both NameNodes to have consistent views of the filesystem
Providing a means for clients to always connect to the master NameNode

Keeping the HA NameNodes in sync

There are actually two mechanisms by which the active and standby NameNodes keep their views of the filesystem consistent: the use of an NFS share, or the Quorum Journal Manager (QJM).

In the NFS case, there is an obvious requirement on an external remote NFS file share; note that, as the use of NFS was best practice in Hadoop 1 for a second copy of filesystem metadata, many clusters already have one. If high availability is a concern, though, it should be borne in mind that making NFS highly available often requires high-end and expensive hardware. In Hadoop 2's NFS-based HA, the NFS location becomes the primary location for the filesystem metadata. As the active NameNode writes all filesystem changes to the NFS share, the standby node detects these changes and updates its copy of the filesystem metadata accordingly.
The QJM mechanism uses an external service (the Journal Managers) instead of a filesystem. The Journal Manager cluster is an odd number of services (3, 5, and 7 are the most common) running on that number of hosts. All changes to the filesystem are submitted to the QJM service, and a change is treated as committed only when a majority of the QJM nodes have committed the change. The standby NameNode receives change updates from the QJM service and uses this information to keep its copy of the filesystem metadata up to date.
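The majority-commit rule at the heart of the QJM can be sketched in a few lines of Java. This is a hypothetical illustration of the rule itself, not Hadoop code (the class and method names are our own): a change counts as committed only once a strict majority of journal nodes have acknowledged it.

```java
import java.util.List;

// Hypothetical sketch of the QJM majority-commit rule; not Hadoop code.
public class QuorumCommit {

    // A change is committed only when a strict majority of the
    // journal nodes have acknowledged (committed) it.
    public static boolean isCommitted(List<Boolean> acks) {
        long acked = acks.stream().filter(a -> a).count();
        return acked > acks.size() / 2;
    }

    public static void main(String[] args) {
        // Three journal nodes: two acks out of three is a majority.
        System.out.println(isCommitted(List.of(true, true, false)));
        // One ack out of three is not.
        System.out.println(isCommitted(List.of(true, false, false)));
    }
}
```

An odd ensemble size is preferred because it maximizes the number of failures tolerated per node: with 3 journal nodes one failure is survivable, with 5 two failures, and so on.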
The QJM mechanism does not require additional hardware, as the journal nodes are lightweight and can be co-located with other services. There is also no single point of failure in the model. Consequently, QJM-based HA is usually the preferred option.
In either case, both in NFS-based HA and QJM-based HA, the DataNodes send block status reports to both NameNodes to ensure that both have up-to-date information on the mapping of blocks to DataNodes. Remember that this block assignment information is not held in the fsimage/edits data.
Client configuration

The clients to the HDFS cluster remain mostly unaware of the fact that NameNode HA is being used. The configuration files need to include the details of both NameNodes, but the mechanisms for determining which is the active NameNode, and when to switch to the standby, are fully encapsulated in the client libraries. The fundamental concept, though, is that instead of referring to an explicit NameNode host as in Hadoop 1, HDFS in Hadoop 2 identifies a nameservice ID for the NameNode, within which multiple individual NameNodes (each with its own NameNode ID) are defined for HA. Note that the concept of the nameservice ID is also used by NameNode federation, which we briefly mentioned earlier.
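As an illustration, here is a minimal hdfs-site.xml fragment declaring an HA nameservice. The property names are real Hadoop 2 configuration keys, but the nameservice ID (mycluster), NameNode IDs (nn1, nn2), and hostnames are example values of our choosing:

```xml
<!-- Example values only: "mycluster", "nn1"/"nn2", and the hostnames
     are illustrative; the property names are real Hadoop 2 keys. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
```

Clients then address the filesystem as hdfs://mycluster, and the client library resolves whichever NameNode is currently active.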
How a failover works

Failover can be either manual or automatic. A manual failover requires an administrator to trigger the switch that promotes the standby to the currently active NameNode. Though automatic failover has the greatest impact on maintaining system availability, there might be conditions in which it is not always desirable. Triggering a manual failover requires running only a few commands and, therefore, even in this mode, the failover is significantly easier than in the case of Hadoop 1 or with Hadoop 2 Backup nodes, where the transition to a new NameNode requires substantial manual effort.

Regardless of whether the failover is triggered manually or automatically, it has two main phases: confirmation that the previous master is no longer serving requests, and the promotion of the standby to be the master.

The greatest risk in a failover is to have a period in which both NameNodes are servicing requests. In such a situation, it is possible that conflicting changes might be made to the filesystem on the two NameNodes or that they might become out of sync. Even though this should not be possible if the QJM is being used (it only ever accepts connections from a single client), out-of-date information might be served to clients, who might then try to make incorrect decisions based on this stale metadata. This is, of course, particularly likely if the previous master NameNode is behaving incorrectly in some way, which is why the need for the failover was identified in the first place.

To ensure only one NameNode is active at any time, a fencing mechanism is used to validate that the existing NameNode master has been shut down. The simplest included mechanism will try to ssh into the NameNode host and actively kill the process, though a custom script can also be executed, so the mechanism is flexible. The failover will not continue until the fencing is successful and the system has confirmed that the previous master NameNode is now dead and has released any required resources.

Once fencing succeeds, the standby NameNode becomes the master and will start writing to the NFS-mounted fsimage and edits logs if NFS is being used for HA, or will become the single client to the QJM if that is the HA mechanism.

Before discussing automatic failover, we need a slight segue to introduce another Apache project that is used to enable this feature.
Apache ZooKeeper – a different type of filesystem

Within Hadoop, we will mostly talk about HDFS when discussing filesystems and data storage. But, inside almost all Hadoop 2 installations, there is another service that looks somewhat like a filesystem, but which provides significant capability crucial to the proper functioning of distributed systems. This service is Apache ZooKeeper (http://zookeeper.apache.org) and, as it is a key part of the implementation of HDFS HA, we will introduce it in this chapter. It is, however, also used by multiple other Hadoop components and related projects, so we will touch on it several more times throughout the book.

ZooKeeper started out as a subcomponent of HBase and was used to enable several operational capabilities of the service. When any complex distributed system is built, there are a series of activities that are almost always required and which are always difficult to get right. These activities include things such as handling shared locks, detecting component failure, and supporting leader election within a group of collaborating services. ZooKeeper was created as the coordination service that would provide a series of primitive operations upon which HBase could implement these types of operationally critical features. Note that ZooKeeper also takes inspiration from the Google Chubby system described at http://research.google.com/archive/chubby-osdi06.pdf.

ZooKeeper runs as a cluster of instances referred to as an ensemble. The ensemble provides a data structure, which is somewhat analogous to a filesystem. Each location in the structure is called a ZNode and can have children as if it were a directory but can also have content as if it were a file. Note that ZooKeeper is not a suitable place to store very large amounts of data, and by default, the maximum amount of data in a ZNode is 1 MB. At any point in time, one server in the ensemble is the master and makes all decisions about client requests. There are very well-defined rules around the responsibilities of the master, including that it has to ensure that a request is only committed when a majority of the ensemble have committed the change, and that once committed any conflicting change is rejected.

You should have ZooKeeper installed within your Cloudera Virtual Machine. If not, use Cloudera Manager to install it as a single node on the host. In production systems, ZooKeeper has very specific semantics around absolute majority voting, so some of the logic only makes sense in a larger ensemble (3, 5, or 7 nodes are the most common sizes).

There is a command-line client to ZooKeeper called zookeeper-client in the Cloudera VM; note that in the vanilla ZooKeeper distribution it is called zkCli.sh. If you run it with no arguments, it will connect to the ZooKeeper server running on the local machine. From here, you can type help to get a list of commands.
The most immediately interesting commands will be create, ls, and get. As the names suggest, these create a ZNode, list the ZNodes at a particular point in the filesystem, and get the data stored at a particular ZNode. Here are some examples of usage.
Create a ZNode with no data:

$ create /zk-test ''

Create a child of the first ZNode and store some text in it:

$ create /zk-test/child1 'sample data'

Retrieve the data associated with a particular ZNode:

$ get /zk-test/child1

The client can also register a watcher on a given ZNode; this will raise an alert if the ZNode in question changes, with either its data or children being modified.

This might not sound very useful, but ZNodes can additionally be created as both sequential and ephemeral nodes, and this is where the magic starts.
Implementing a distributed lock with sequential ZNodes

If a ZNode is created within the CLI with the -s option, it will be created as a sequential node. ZooKeeper will suffix the supplied name with a 10-digit integer guaranteed to be unique and greater than any other sequential children of the same ZNode. We can use this mechanism to create a distributed lock. ZooKeeper itself is not holding the actual lock; the clients need to understand what particular states in ZooKeeper mean in terms of their mapping to the application locks in question.

If we create a (non-sequential) ZNode at /zk-lock, then any client wishing to hold the lock will create a sequential child node. For example, the create -s /zk-lock/locknode command might create the node /zk-lock/locknode-0000000001 in the first case, with increasing integer suffixes for subsequent calls. When a client creates a ZNode under the lock, it will then check if its sequential node has the lowest integer suffix. If it does, then it is treated as having the lock. If not, then it will need to wait until the node holding the lock is deleted. The client will usually put a watch on the node with the next lowest suffix and then be alerted when that node is deleted, indicating that it now holds the lock.
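The decision each client makes (do I hold the lock, and if not, which node should I watch?) can be modelled without a ZooKeeper server at all. The following sketch is our own illustration: the node names mimic what create -s would produce, and an in-memory sorted set stands in for the ensemble.

```java
import java.util.Optional;
import java.util.TreeSet;

// Sketch of the sequential-ZNode lock protocol, using an in-memory
// sorted set in place of a real ZooKeeper ensemble.
public class SequentialLock {

    // Lexicographic order of the zero-padded suffixes matches
    // numeric order, as with real sequential ZNodes.
    private final TreeSet<String> children = new TreeSet<>();

    public String createSequential(int seq) {
        String name = String.format("locknode-%010d", seq);
        children.add(name);
        return name;
    }

    // A client holds the lock iff its node has the lowest suffix.
    public boolean holdsLock(String myNode) {
        return myNode.equals(children.first());
    }

    // Otherwise it watches the node immediately below its own.
    public Optional<String> nodeToWatch(String myNode) {
        return Optional.ofNullable(children.lower(myNode));
    }

    // Deleting a node is how a client releases the lock.
    public void delete(String node) {
        children.remove(node);
    }
}
```

When the holder deletes its node (releasing the lock), the watch on the next-lowest node fires, and that client re-checks and finds it now holds the lock. Watching only the immediately preceding node, rather than the lock root, avoids a "herd effect" where every waiter wakes on every release.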
Implementing group membership and leader election using ephemeral ZNodes

Any ZooKeeper client will send heartbeats to the server throughout the session, showing that it is alive. For the ZNodes we have discussed until now, we can say that they are persistent and will survive across sessions. We can, however, create a ZNode as ephemeral, meaning it will disappear once the client that created it either disconnects or is detected as being dead by the ZooKeeper server. Within the CLI, an ephemeral ZNode is created by adding the -e flag to the create command.

Ephemeral ZNodes are a good mechanism to implement group membership discovery within a distributed system. For any system where nodes can fail, join, and leave without notice, knowing which nodes are alive at any point in time is often a difficult task. Within ZooKeeper, we can provide the basis for such discovery by having each node create an ephemeral ZNode at a certain location in the ZooKeeper filesystem. The ZNodes can hold data about the service nodes, such as hostname, IP address, port number, and so on. To get a list of live nodes, we can simply list the child nodes of the parent group ZNode. Because of the nature of ephemeral nodes, we can have confidence that the list of live nodes retrieved at any time is up to date.

If we have each service node create ZNode children that are not just ephemeral but also sequential, then we can also build a mechanism for leader election for services that need to have a single master node at any one time. The mechanism is the same as for locks; the client service node creates the sequential and ephemeral ZNode and then checks if it has the lowest sequence number. If so, then it is the master. If not, then it will register a watcher on the next lowest sequence node to be alerted when it might become the master.
Java API

The org.apache.zookeeper.ZooKeeper class is the main programmatic client to access a ZooKeeper ensemble. Refer to the javadocs for the full details, but the basic interface is relatively straightforward, with an obvious one-to-one correspondence to commands in the CLI. For example:

create: is equivalent to CLI create
getChildren: is equivalent to CLI ls
getData: is equivalent to CLI get
Building blocks

As can be seen, ZooKeeper provides a small number of well-defined operations with very strong semantic guarantees that can be built into higher-level services, such as the locks, group membership, and leader election we discussed earlier. It's best to think of ZooKeeper as a toolkit of well-engineered and reliable functions critical to distributed systems that can be built upon without having to worry about the intricacies of their implementation. The provided ZooKeeper interface is quite low-level, though, and there are a few higher-level interfaces emerging that provide more of a mapping of the low-level primitives onto application-level logic. The Curator project (http://curator.apache.org/) is a good example of this.

ZooKeeper was used sparingly within Hadoop 1, but it's now quite ubiquitous. It's used by both MapReduce and HDFS for the high availability of their JobTracker and NameNode components. Hive and Impala, which we will explore later, use it to place locks on data tables that are being accessed by multiple concurrent jobs. Kafka, which we'll discuss in the context of Samza, uses ZooKeeper for node (broker, in Kafka terminology) membership, leader election, and state management.
Further reading

We have not described ZooKeeper in much detail and have completely omitted aspects such as its ability to apply quotas and access control lists to ZNodes within the filesystem and the mechanisms to build callbacks. Our purpose here was to give enough of the details so that you would have some idea of how it is being used within the Hadoop services we explore in this book. For more information, consult the project home page.
Automatic NameNode failover

Now that we have introduced ZooKeeper, we can show how it is used to enable automatic NameNode failover.

Automatic NameNode failover introduces two new components to the system: a ZooKeeper quorum, and the ZooKeeperFailoverController (ZKFC), which runs on each NameNode host. The ZKFC creates an ephemeral ZNode in ZooKeeper and holds this ZNode for as long as it detects the local NameNode to be alive and functioning correctly. It determines this by continuously sending simple health-check requests to the NameNode, and if the NameNode fails to respond correctly over a short period of time, the ZKFC will assume the NameNode has failed. If a NameNode machine crashes or otherwise fails, the ZKFC session in ZooKeeper will be closed and the ephemeral ZNode will also be automatically removed.

The ZKFC processes are also monitoring the ZNodes of the other NameNodes in the cluster. If the ZKFC on the standby NameNode host sees the existing master ZNode disappear, it will assume the master has failed and will attempt a failover. It does this by trying to acquire the lock for the NameNode (through the protocol described in the ZooKeeper section) and, if successful, will initiate a failover through the same fencing/promotion mechanism described earlier.
HDFS snapshots

We mentioned earlier that HDFS replication alone is not a suitable backup strategy. In the Hadoop 2 filesystem, snapshots have been added, which bring another level of data protection to HDFS.

Filesystem snapshots have been used for some time across a variety of technologies. The basic idea is that it becomes possible to view the exact state of the filesystem at particular points in time. This is achieved by taking a copy of the filesystem metadata at the point the snapshot is made and making this available to be viewed in the future.

As changes to the filesystem are made, any change that would affect the snapshot is treated specially. For example, if a file that exists in the snapshot is deleted then, even though it will be removed from the current state of the filesystem, its metadata will remain in the snapshot, and the blocks associated with its data will remain on the filesystem, though not accessible through any view of the system other than the snapshot.

An example might illustrate this point. Say you have a filesystem containing the following files:
/data1 (5 blocks)
/data2 (10 blocks)

You take a snapshot and then delete the file /data2. If you view the current state of the filesystem, then only /data1 will be visible. If you examine the snapshot, you will see both files. Behind the scenes, all 15 blocks still exist, but only those associated with the un-deleted file /data1 are part of the current filesystem. The blocks for the file /data2 will be released only when the snapshot is itself removed; snapshots are read-only views.
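The block accounting in this example can be expressed as simple set arithmetic: the blocks physically retained are those referenced by the live filesystem plus those referenced by any snapshot. The following is our own illustrative model of that rule, not HDFS internals:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative model of snapshot block retention; not HDFS internals.
// Each map goes from file path to the number of blocks that file holds.
public class SnapshotBlocks {

    public static int blocksOnDisk(Map<String, Integer> liveFiles,
                                   Map<String, Integer> snapshotFiles) {
        // Union of all file paths referenced by either view.
        Set<String> referenced = new HashSet<>(liveFiles.keySet());
        referenced.addAll(snapshotFiles.keySet());

        int total = 0;
        for (String file : referenced) {
            // A file's blocks are retained while any view references it.
            total += liveFiles.containsKey(file)
                    ? liveFiles.get(file)
                    : snapshotFiles.get(file);
        }
        return total;
    }
}
```

With live = {/data1: 5} and snapshot = {/data1: 5, /data2: 10}, 15 blocks remain on disk; once the snapshot is deleted, the retained count drops to 5.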
Snapshots in Hadoop 2 can be applied at either the full filesystem level or only on particular paths. A path needs to be set as snapshottable, and note that you cannot have a path snapshottable if any of its children or parent paths are themselves snapshottable.

Let's take a simple example based on the directory we created earlier to illustrate the use of snapshots. The commands we are going to illustrate need to be executed with superuser privileges, which can be obtained with sudo -u hdfs.

First, use the dfsadmin subcommand of the hdfs CLI utility to enable snapshots of a directory, as follows:

$ sudo -u hdfs hdfs dfsadmin -allowSnapshot \
/user/cloudera/testdir
Allowing snapshot on testdir succeeded

Now, we create the snapshot and examine it; snapshots are available through the .snapshot subdirectory of the snapshottable directory. Note that the .snapshot directory will not be visible in a normal listing of the directory. Here's how we create a snapshot and examine it:
$ sudo -u hdfs hdfs dfs -createSnapshot \
/user/cloudera/testdir sn1
Created snapshot /user/cloudera/testdir/.snapshot/sn1
$ sudo -u hdfs hdfs dfs -ls \
/user/cloudera/testdir/.snapshot/sn1
Found 1 items
-rw-r--r--   1 cloudera cloudera         12 2014-11-13 11:21
/user/cloudera/testdir/.snapshot/sn1/testfile.txt
Now, we remove the test file from the main directory and verify that it is now empty:

$ sudo -u hdfs hdfs dfs -rm \
/user/cloudera/testdir/testfile.txt
14/11/13 13:13:51 INFO fs.TrashPolicyDefault: Namenode trash configuration:
Deletion interval = 1440 minutes, Emptier interval = 0 minutes. Moved:
'hdfs://localhost.localdomain:8020/user/cloudera/testdir/testfile.txt' to
trash at: hdfs://localhost.localdomain:8020/user/hdfs/.Trash/Current
$ hdfs dfs -ls /user/cloudera/testdir
$

Note the mention of trash directories; by default, HDFS will copy any deleted files into a .Trash directory in the user's home directory, which helps to defend against slipping fingers. These files can be removed through hdfs dfs -expunge or will be automatically purged in 7 days by default.
Now, we examine the snapshot, where the now-deleted file is still available:

$ hdfs dfs -ls testdir/.snapshot/sn1
Found 1 items
drwxr-xr-x   - cloudera cloudera          0 2014-11-13 13:12
testdir/.snapshot/sn1
$ hdfs dfs -tail testdir/.snapshot/sn1/testfile.txt
Hello world

Then, we can delete the snapshot, freeing up any blocks held by it, as follows:

$ sudo -u hdfs hdfs dfs -deleteSnapshot \
/user/cloudera/testdir sn1
$ hdfs dfs -ls testdir/.snapshot
$

As can be seen, the files within a snapshot are fully available to be read and copied, providing access to the historical state of the filesystem at the point when the snapshot was made. Each directory can have up to 65,535 snapshots, and HDFS manages snapshots in such a way that they are quite efficient in terms of impact on normal filesystem operations. They are a great mechanism to use prior to any activity that might have adverse effects, such as trying a new version of an application that accesses the filesystem. If the new software corrupts files, the old state of the directory can be restored. If, after a period of validation, the software is accepted, then the snapshot can instead be deleted.
Hadoop filesystems

Until now, we referred to HDFS as the Hadoop filesystem. In reality, Hadoop has a rather abstract notion of filesystem. HDFS is only one of several implementations of the org.apache.hadoop.fs.FileSystem Java abstract class. A list of available filesystems can be found at https://hadoop.apache.org/docs/r2.5.0/api/org/apache/hadoop/fs/FileSystem.html. The following table summarizes some of these filesystems, along with the corresponding URI scheme and Java implementation class.

Filesystem        URI scheme  Java implementation
Local             file        org.apache.hadoop.fs.LocalFileSystem
HDFS              hdfs        org.apache.hadoop.hdfs.DistributedFileSystem
S3 (native)       s3n         org.apache.hadoop.fs.s3native.NativeS3FileSystem
S3 (block-based)  s3          org.apache.hadoop.fs.s3.S3FileSystem

There exist two implementations of the S3 filesystem. The native one, s3n, is used to read and write regular files. Data stored using s3n can be accessed by any tool and, conversely, it can be used to read data generated by other S3 tools. s3n cannot handle files larger than 5 TB, nor rename operations.

Much like HDFS, the block-based S3 filesystem stores files in blocks and requires an S3 bucket to be dedicated to the filesystem. Files stored in an S3 filesystem can be larger than 5 TB, but they will not be interoperable with other S3 tools. Additionally, block-based S3 supports rename operations.
Hadoop interfaces

Hadoop is written in Java and, not surprisingly, all interaction with the system happens via the Java API. The command-line interface we used through the hdfs command in previous examples is a Java application that uses the FileSystem class to carry out input/output operations on the available filesystems.

Java FileSystem API

The Java API, provided by the org.apache.hadoop.fs package, exposes Apache Hadoop filesystems.

org.apache.hadoop.fs.FileSystem is the abstract class each filesystem implements and provides a general interface to interact with data in Hadoop. All code that uses HDFS should be written with the capability of handling a FileSystem object.

Libhdfs

Libhdfs is a C library that, despite its name, can be used to access any Hadoop filesystem and not just HDFS. It is written using the Java Native Interface (JNI) and mimics the Java FileSystem class.

Thrift

Apache Thrift (http://thrift.apache.org) is a framework for building cross-language software through data serialization and remote method invocation mechanisms. The Hadoop Thrift API, available in contrib, exposes Hadoop filesystems as a Thrift service. This interface makes it easy for non-Java code to access data stored in a Hadoop filesystem.

Other than the aforementioned interfaces, there exist other interfaces that allow access to Hadoop filesystems via HTTP and FTP (these for HDFS only), as well as WebDAV.
Managing and serializing data

Having a filesystem is all well and good, but we also need mechanisms to represent data and store it on the filesystems. We will explore some of these mechanisms now.
The Writable interface

It is useful to us as developers if we can manipulate higher-level data types and have Hadoop look after the processes required to serialize them into bytes to write to a filesystem and reconstruct them from a stream of bytes when they are read from the filesystem.

The org.apache.hadoop.io package contains the Writable interface, which provides this mechanism and is specified as follows:

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

The main purpose of this interface is to provide mechanisms for the serialization and deserialization of data as it is passed across the network or read from and written to the disk.

When we explore processing frameworks on Hadoop in later chapters, we will often see instances where the requirement is for a data argument to be of the type Writable. If we use data structures that provide a suitable implementation of this interface, then the Hadoop machinery can automatically manage the serialization and deserialization of the data type without knowing anything about what it represents or how it is used.
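To see what implementing this contract involves, here is a sketch of a hypothetical pair type following the same write/readFields pattern. It uses only java.io, so it compiles without Hadoop on the classpath; real code would implement org.apache.hadoop.io.Writable itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical type following the Writable write/readFields contract,
// using only java.io so it runs without Hadoop on the classpath.
public class TextIntPair {
    private String text = "";
    private int value;

    public TextIntPair() {}              // Writables need a no-arg constructor
    public TextIntPair(String t, int v) { text = t; value = v; }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(text);
        out.writeInt(value);
    }

    public void readFields(DataInput in) throws IOException {
        text = in.readUTF();             // fields read in exact write order
        value = in.readInt();
    }

    public String getText() { return text; }
    public int getValue() { return value; }

    // Round-trip demonstration: serialize to bytes, deserialize a copy.
    public static TextIntPair roundTrip(TextIntPair p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));
        TextIntPair copy = new TextIntPair();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }
}
```

Note that readFields must read the fields in exactly the order write emitted them, and the no-argument constructor exists so a framework can instantiate an empty object before populating it from a stream.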
Introducing the wrapper classes

Fortunately, you don't have to start from scratch and build Writable variants of all the data types you will use. Hadoop provides classes that wrap the Java primitive types and implement the Writable interface. They are provided in the org.apache.hadoop.io package.

These classes are conceptually similar to the primitive wrapper classes, such as Integer and Long, found in java.lang. They hold a single primitive value that can be set either at construction or via a setter method. They are as follows:
BooleanWritable
ByteWritable
DoubleWritable
FloatWritable
IntWritable
LongWritable
VIntWritable: a variable-length integer type
VLongWritable: a variable-length long type

There is also Text, which wraps java.lang.String.
Array wrapper classes

Hadoop also provides some collection-based wrapper classes. These classes provide Writable wrappers for arrays of other Writable objects. For example, an instance could either hold an array of IntWritable or DoubleWritable, but not arrays of the raw int or float types. A specific subclass for the required Writable class will be required. They are as follows:
ArrayWritable
TwoDArrayWritable
The Comparable and WritableComparable interfaces

We were slightly inaccurate when we said that the wrapper classes implement Writable; they actually implement a composite interface called WritableComparable in the org.apache.hadoop.io package that combines Writable with the standard java.lang.Comparable interface:

public interface WritableComparable extends Writable, Comparable {
}

The need for Comparable will only become apparent when we explore MapReduce in the next chapter, but for now, just remember that the wrapper classes provide mechanisms for them to be both serialized and sorted by Hadoop or any of its frameworks.
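Why Comparable matters is easiest to see in miniature: any type that implements it can be sorted by a framework that knows nothing about what the data means, which is how Hadoop sorts WritableComparable keys. A plain-Java sketch (no Hadoop types; the class is our own invention):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A minimal Comparable type; a framework can sort such keys
// without knowing what they represent, exactly as Hadoop does
// with WritableComparable keys.
public class WordCount implements Comparable<WordCount> {
    final String word;
    final int count;

    public WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public int compareTo(WordCount other) {
        return Integer.compare(this.count, other.count);
    }

    public static List<WordCount> sorted(List<WordCount> in) {
        List<WordCount> out = new ArrayList<>(in);
        Collections.sort(out);   // uses compareTo, nothing else
        return out;
    }
}
```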
Storing data

Until now, we introduced the architecture of HDFS and how to store and retrieve data using the command-line tools and the Java API. In the examples seen until now, we have implicitly assumed that our data was stored as a text file. In reality, some applications and datasets will require ad hoc data structures to hold the file's contents. Over the years, file formats have been created both to address the requirements of MapReduce processing (for instance, we want data to be splittable) and to satisfy the need to model both structured and unstructured data. Currently, a lot of focus has been dedicated to better capturing the use cases of relational data storage and modeling. In the remainder of this chapter, we will introduce some of the popular file format choices available within the Hadoop ecosystem.
Serialization and Containers

When talking about file formats, we are assuming two types of scenarios, which are as follows:

Serialization: we want to encode data structures generated and manipulated at processing time into a format we can store to a file, transmit, and, at a later stage, retrieve and translate back for further manipulation
Containers: once data is serialized to files, containers provide means to group multiple files together and add additional metadata
Compression

When working with data, file compression can often lead to significant savings, both in terms of the space necessary to store files and in the data I/O across the network and to and from local disks.

In broad terms, when using a processing framework, compression can occur at three points in the processing pipeline:

- Input files to be processed
- Output files that result after processing is completed
- Intermediate/temporary files produced internally within the pipeline
When we add compression at any of these stages, we have an opportunity to dramatically reduce the amount of data to be read from or written to disk or across the network. This is particularly useful with frameworks such as MapReduce that can, for example, produce volumes of temporary data that are larger than either the input or output datasets.

Apache Hadoop comes with a number of compression codecs: gzip, bzip2, LZO, and Snappy, each with its own trade-offs. Picking a codec is an educated choice that should consider both the kind of data being processed and the nature of the processing framework itself.

Other than the general space/time trade-off, where the largest space savings come at the expense of compression and decompression speed (and vice versa), we need to take into account that data stored in HDFS will be accessed by parallel, distributed software; some of this software will also add its own particular requirements on file formats. MapReduce, for example, is most efficient on files that can be split into valid subfiles.

This can complicate decisions, such as whether to compress and, if so, which codec to use, as most compression codecs (such as gzip) do not support splittable files, whereas a few (such as LZO) do.
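The space/time trade-off is easy to see outside Hadoop. The following self-contained sketch (plain Java with no Hadoop dependencies; the class and method names are ours) round-trips some repetitive, log-like text through gzip, one of the codecs mentioned above:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Toy demonstration of the space saving a codec such as gzip can give on
// repetitive (log-like) data. Hadoop wraps the same family of codecs; this
// is only an illustration of the general trade-off, not Hadoop API usage.
public class GzipDemo {
    public static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static byte[] decompress(byte[] data) throws IOException {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(data))) {
            return gz.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("2014-01-01 INFO job started on node ")
              .append(i % 10).append('\n');
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        // Repetitive text shrinks dramatically; the round trip is lossless.
        System.out.println(raw.length + " -> " + packed.length);
        System.out.println(new String(decompress(packed),
                StandardCharsets.UTF_8).equals(sb.toString()));
    }
}
```

On repetitive input the compressed form is much smaller; on random or already-compressed data, gzip can even grow the payload, which is one reason codec choice should consider the data itself.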
General-purpose file formats

The first class of file formats comprises general-purpose ones that can be applied to any application domain and make no assumptions about data structure or access patterns.

- Text: the simplest approach to storing data on HDFS is to use flat files. Text files can hold both unstructured data (a web page or a tweet) and structured data (a CSV file that is a few million rows long). Text files are splittable, though one needs to consider how to handle boundaries between multiple elements (for example, lines) in the file.
- SequenceFile: a SequenceFile is a flat data structure consisting of binary key/value pairs, introduced to address specific requirements of MapReduce-based processing. It is still extensively used in MapReduce as an input/output format. As we will see in Chapter 3, Processing – MapReduce and Beyond, internally the temporary outputs of maps are stored using SequenceFile.

SequenceFile provides Writer, Reader, and Sorter classes to write, read, and sort data, respectively.

Depending on the compression mechanism in use, three variations of SequenceFile can be distinguished:

- Uncompressed key/value records
- Record-compressed key/value records, where only the values are compressed
- Block-compressed key/value records, where keys and values are collected in blocks of arbitrary size and compressed separately

In each case, however, the SequenceFile remains splittable, which is one of its biggest strengths.
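The gap between record and block compression can also be illustrated without Hadoop. This sketch (our own class names, plain java.util.zip rather than the SequenceFile API) compresses 500 similar values one at a time and then as a single block:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Illustrates, outside Hadoop, why SequenceFile's block compression tends to
// beat record compression: per-record codec overhead plus lost cross-record
// redundancy. This mirrors the idea only; it is not the SequenceFile API.
public class BlockVsRecord {
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String[] records = new String[500];
        for (int i = 0; i < records.length; i++) {
            records[i] = "user=" + (i % 7) + " action=page_view status=200";
        }
        // Record compression: each value compressed on its own.
        int recordTotal = 0;
        StringBuilder block = new StringBuilder();
        for (String r : records) {
            recordTotal += gzip(r.getBytes(StandardCharsets.UTF_8)).length;
            block.append(r).append('\n');
        }
        // Block compression: many records compressed together.
        int blockTotal =
            gzip(block.toString().getBytes(StandardCharsets.UTF_8)).length;
        System.out.println(blockTotal < recordTotal);
    }
}
```

Each gzip stream carries fixed header overhead and cannot exploit redundancy across records, so the block-compressed total comes out far smaller; SequenceFile's block variant wins for the same reason.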
Column-oriented data formats

In the relational database world, column-oriented data stores organize and store tables based on columns; generally speaking, the data for each column will be stored together. This is a significantly different approach compared to most relational DBMSes, which organize data per row. Column-oriented storage has significant performance advantages; for example, if a query needs to read only two columns from a very wide table containing hundreds of columns, then only the required column data files are accessed. A traditional row-oriented database would have to read all columns for each row for which data was required. This has the greatest impact on workloads where aggregate functions are computed over large numbers of similar items, such as the OLAP workloads typical of data warehouse systems.

In Chapter 7, Hadoop and SQL, we will see how Hadoop is becoming a SQL backend for the data warehouse world thanks to projects such as Apache Hive and Cloudera Impala. As part of the expansion into this domain, a number of file formats have been developed to account for both relational modeling and data warehousing needs.

RCFile, ORC, and Parquet are three state-of-the-art column-oriented file formats developed with these use cases in mind.
RCFile

Row Columnar File (RCFile) was originally developed by Facebook as the backend storage for their Hive data warehouse system, which was the first mainstream SQL-on-Hadoop system available as open source.

RCFile aims to provide the following:

- Fast data loading
- Fast query processing
- Efficient storage utilization
- Adaptability to dynamic workloads

More information on RCFile can be found at http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/abs11-4.html.
ORC

The Optimized Row Columnar (ORC) file format aims to combine the performance of RCFile with the flexibility of Avro. It is primarily intended to work with Apache Hive and was initially developed by Hortonworks to overcome the perceived limitations of other available file formats.

More details can be found at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html.
Parquet

Parquet, found at http://parquet.incubator.apache.org, was originally a joint effort of
Cloudera, Twitter, and Criteo, and has now been donated to the Apache Software Foundation. The goals of Parquet are to provide a modern, performant, columnar file format to be used with Cloudera Impala. As with Impala, Parquet has been inspired by the Dremel paper (http://research.google.com/pubs/pub36632.html). It allows complex, nested data structures and efficient encoding on a per-column level.
Avro

Apache Avro (http://avro.apache.org) is a schema-oriented binary data serialization format and file container. Avro will be our preferred binary data format throughout this book. It is both splittable and compressible, making it an efficient format for data processing with frameworks such as MapReduce.

Numerous other projects also have built-in Avro support and integration, so it is very widely applicable. When data is stored in an Avro file, its schema, defined as a JSON object, is stored with it. A file can later be processed by a third party with no a priori notion of how the data is encoded. This makes the data self-describing and facilitates use with dynamic and scripting languages. The schema-on-read model also helps Avro records be efficient to store, as there is no need for the individual fields to be tagged.

In later chapters, you will see how these properties can make data lifecycle management easier and allow non-trivial operations such as schema migrations.
Using the Java API

We'll now demonstrate the use of the Java API to parse Avro schemas, read and write Avro files, and use Avro's code generation facilities. Note that the format is intrinsically language independent; there are APIs for most languages, and files created in Java will seamlessly be read from any other language.

Avro schemas are described as JSON documents and represented by the org.apache.avro.Schema class. To demonstrate the API for manipulating Avro documents, we'll look ahead to an Avro specification we use for a Hive table in Chapter 7, Hadoop and SQL. The following code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch2/src/main/java/com/learninghadoop2/avro/AvroParse.java.

In the following code, we will use the Avro Java API to create an Avro file containing a tweet record and then re-read the file, using the schema in the file to extract the details of the stored records:
public static void testGenericRecord() {
    try {
        Schema schema = new Schema.Parser()
            .parse(new File("tweets_avro.avsc"));
        GenericRecord tweet =
            new GenericData.Record(schema);
        tweet.put("text", "The generic tweet text");
        File file = new File("tweets.avro");
        DatumWriter<GenericRecord> datumWriter =
            new GenericDatumWriter<>(schema);
        DataFileWriter<GenericRecord> fileWriter =
            new DataFileWriter<>(datumWriter);
        fileWriter.create(schema, file);
        fileWriter.append(tweet);
        fileWriter.close();

        DatumReader<GenericRecord> datumReader =
            new GenericDatumReader<>(schema);
        DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(file, datumReader);
        GenericRecord genericTweet = null;
        while (fileReader.hasNext()) {
            genericTweet = fileReader.next(genericTweet);
            for (Schema.Field field :
                    genericTweet.getSchema().getFields()) {
                Object val = genericTweet.get(field.name());
                if (val != null) {
                    System.out.println(val);
                }
            }
        }
    } catch (IOException ie) {
        System.out.println("Error parsing or writing file.");
    }
}
The tweets_avro.avsc schema, found at https://github.com/learninghadoop2/book-examples/blob/master/ch2/tweets_avro.avsc, describes a tweet with multiple fields. To create an Avro object of this type, we first parse the schema file. We then use Avro's concept of a GenericRecord to build an Avro document that complies with this schema. In this case, we only set a single attribute: the tweet text itself.

To write this Avro file, containing a single object, we then use Avro's I/O capabilities. To read the file, we do not need to start with the schema, as we can extract it from the GenericRecord we read from the file. We then walk through the schema structure and dynamically process the document based on the discovered fields. This is particularly powerful, as it is the key enabler of clients remaining independent of the Avro schema and how it evolves over time.

If we have the schema file in advance, however, we can use Avro code generation to create a customized class that makes manipulating Avro records much easier. To generate the code, we will use the compile class in avro-tools.jar, passing it the name of the schema file and the desired output directory:
$ java -jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/avro/avro-tools.jar compile schema tweets_avro.avsc src/main/java
The class will be placed in a directory structure based on any namespace defined in the schema. Since we created this schema in the com.learninghadoop2.avrotables namespace, we see the following:

$ ls src/main/java/com/learninghadoop2/avrotables/tweets_avro.java

With this class, let's revisit the creation, reading, and writing of Avro objects, as follows:
public static void testGeneratedCode() {
    tweets_avro tweet = new tweets_avro();
    tweet.setText("The code generated tweet text");

    try {
        File file = new File("tweets.avro");
        DatumWriter<tweets_avro> datumWriter =
            new SpecificDatumWriter<>(tweets_avro.class);
        DataFileWriter<tweets_avro> fileWriter =
            new DataFileWriter<>(datumWriter);
        fileWriter.create(tweet.getSchema(), file);
        fileWriter.append(tweet);
        fileWriter.close();

        DatumReader<tweets_avro> datumReader =
            new SpecificDatumReader<>(tweets_avro.class);
        DataFileReader<tweets_avro> fileReader =
            new DataFileReader<>(file, datumReader);
        while (fileReader.hasNext()) {
            tweet = fileReader.next(tweet);
            System.out.println(tweet.getText());
        }
    } catch (IOException ie) {
        System.out.println("Error in parsing or writing files.");
    }
}
Because we used code generation, we now use the Avro SpecificRecord mechanism alongside the generated class that represents the object in our domain model. Consequently, we can directly instantiate the object and access its attributes through familiar get/set methods.

Writing the file is similar to the action performed before, except that we use the specific classes and also retrieve the schema directly from the tweet object when needed. Reading is similarly eased through the ability to create instances of a specific class and use get/set methods.
Summary

This chapter has given a whistle-stop tour through storage on a Hadoop cluster. In particular, we covered:

- The high-level architecture of HDFS, the main filesystem used in Hadoop
- How HDFS works under the covers and, in particular, its approach to reliability
- How Hadoop 2 has added significantly to HDFS, particularly in the form of NameNode HA and filesystem snapshots
- What ZooKeeper is and how it is used by Hadoop to enable features such as NameNode automatic failover
- An overview of the command-line tools used to access HDFS
- The API for filesystems in Hadoop and how, at a code level, HDFS is just one implementation of a more flexible filesystem abstraction
- How data can be serialized onto a Hadoop filesystem and some of the support provided in the core classes
- The various file formats in which data is most frequently stored in Hadoop and some of their particular use cases
In the next chapter, we will look in detail at how Hadoop provides processing frameworks that can be used to process the data stored within it.
Chapter 3. Processing – MapReduce and Beyond

In Hadoop 1, the platform had two clear components: HDFS for data storage and MapReduce for data processing. The previous chapter described the evolution of HDFS in Hadoop 2, and in this chapter we'll discuss data processing.

The picture with processing in Hadoop 2 has changed more significantly than that of storage, and Hadoop now supports multiple processing models as first-class citizens. In this chapter, we'll explore both MapReduce and other computational models in Hadoop 2. In particular, we'll cover:

- What MapReduce is and the Java API required to write applications for it
- How MapReduce is implemented in practice
- How Hadoop reads data into and out of its processing jobs
- YARN, the Hadoop 2 component that allows processing beyond MapReduce on the platform
- An introduction to several computational models implemented on YARN
MapReduce

MapReduce is the primary processing model supported in Hadoop 1. It follows a divide-and-conquer model for processing data made popular by a 2004 paper by Google (http://research.google.com/archive/mapreduce.html) and has foundations both in functional programming and database research. The name itself refers to two distinct steps applied to all input data: a map function and a reduce function.
Every MapReduce application is a sequence of jobs that build atop this very simple model. Sometimes the overall application may require multiple jobs, where the output of the reduce stage from one is the input to the map stage of another, and sometimes there might be multiple map or reduce functions, but the core concepts remain the same.

We will introduce the MapReduce model by looking at the nature of the map and reduce functions and then describe the Java API required to build implementations of these functions. After showing some examples, we will walk through a MapReduce execution to give more insight into how the actual MapReduce framework executes code at runtime.

Learning the MapReduce model can be a little counter-intuitive; it's often difficult to appreciate how very simple functions can, when combined, provide very rich processing of enormous datasets. But it does work, trust us!

As we explore the nature of the map and reduce functions, think of them as being applied to a stream of records retrieved from the source dataset. We'll describe how that happens later; for now, think of the source data as being sliced into smaller chunks, each of which gets fed to a dedicated instance of the map function. Each record has the map function applied to it, producing a set of intermediary data. Records are retrieved from this temporary dataset, and all associated records are fed together through the reduce function. The final output of the reduce function for all the sets of records is the overall result for the complete job.
From a functional perspective, MapReduce transforms data structures from one list of (key, value) pairs into another. During the map phase, data is loaded from HDFS, a function is applied in parallel to every input (key, value) pair, and a new list of (key, value) pairs is the output:

map(k1, v1) -> list(k2, v2)

The framework then collects all pairs with the same key from all lists and groups them together, creating one group for each key. A reduce function is applied in parallel to each group, which in turn produces a list of values:

reduce(k2, list(v2)) -> (k3, list(v3))

The output is then written back to HDFS, as shown in the following figure:
[Figure: Map and Reduce phases]
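Before looking at the Hadoop API, it can help to see the whole model in miniature. The following toy word count (plain, single-process Java; the class and method names are ours, not Hadoop's) implements map, the grouping step, and reduce exactly as in the formulas above:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A toy, single-process word count mirroring (k1,v1) -> list(k2,v2) and
// (k2, list(v2)) -> result. Hadoop's job is to run the same two functions
// distributed across a cluster; the logic itself is this small.
public class ToyWordCount {
    // map: (lineNo, line) -> list of (word, 1)
    static List<SimpleEntry<String, Integer>> map(long lineNo, String line) {
        List<SimpleEntry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    // reduce: (word, list of counts) -> total count
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static Map<String, Integer> run(List<String> lines) {
        // Stand-in for the shuffle: group intermediate values by key.
        Map<String, List<Integer>> groups = new TreeMap<>();
        long lineNo = 0;
        for (String line : lines) {
            for (SimpleEntry<String, Integer> kv : map(lineNo++, line)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        groups.forEach((k, v) -> result.put(k, reduce(k, v)));
        return result;
    }
}
```

Hadoop's contribution is not these few lines of logic but running them in parallel over terabytes, with the grouping (the "shuffle") performed across machines.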
Java API to MapReduce

The Java API to MapReduce is exposed by the org.apache.hadoop.mapreduce package. Writing a MapReduce program, at its core, is a matter of subclassing the Hadoop-provided Mapper and Reducer base classes and overriding the map() and reduce() methods with our own implementations.
The Mapper class

For our own Mapper implementations, we will subclass the Mapper base class and override the map() method, as follows:

class Mapper<K1, V1, K2, V2>
{
    void map(K1 key, V1 value, Mapper.Context context)
        throws IOException, InterruptedException
    ...
}
The class is defined in terms of the key/value input and output types, and then the map method takes an input key/value pair as its parameters. The other parameter is an instance of the Context class, which provides various mechanisms to communicate with the Hadoop framework, one of which is to output the results of a map or reduce method.

Notice that the map method only refers to a single instance of K1 and V1 key/value pairs. This is a critical aspect of the MapReduce paradigm: you write classes that process single records, and the framework is responsible for all the work required to turn an enormous dataset into a stream of key/value pairs. You will never have to write map or reduce classes that try to deal with the full dataset. Hadoop also provides mechanisms through its InputFormat and OutputFormat classes that provide implementations of common file formats and likewise remove the need to write file parsers for anything but custom file types.

There are three additional methods that may sometimes need to be overridden:
protected void setup(Mapper.Context context)
    throws IOException, InterruptedException

This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing.

protected void cleanup(Mapper.Context context)
    throws IOException, InterruptedException

This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing.

protected void run(Mapper.Context context)
    throws IOException, InterruptedException

This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once before repeatedly calling the map method for each key/value pair in the split, and then finally calls the cleanup method.
The Reducer class

The Reducer base class works very similarly to the Mapper class and usually requires only that subclasses override a single reduce() method. Here is the cut-down class definition:

public class Reducer<K2, V2, K3, V3>
{
    void reduce(K2 key, Iterable<V2> values,
                Reducer.Context context)
        throws IOException, InterruptedException
    ...
}
Again, notice the class definition in terms of the broader data flow (the reduce method accepts K2/V2 as input and provides K3/V3 as output), whereas the actual reduce method takes only a single key and its associated list of values. The Context object is again the mechanism used to output the result of the method.

This class also has setup, run, and cleanup methods with similar default implementations to the Mapper class, which can optionally be overridden:
protected void setup(Reducer.Context context)
    throws IOException, InterruptedException

The setup() method is called once before any key/list-of-values pairs are presented to the reduce method. The default implementation does nothing.

protected void cleanup(Reducer.Context context)
    throws IOException, InterruptedException

The cleanup() method is called once after all key/list-of-values pairs have been presented to the reduce method. The default implementation does nothing.

protected void run(Reducer.Context context)
    throws IOException, InterruptedException

The run() method controls the overall flow of processing the task within the JVM. The default implementation calls the setup method before repeatedly, and potentially concurrently, calling the reduce method for as many key/value pairs as are provided to the Reducer class, and then finally calls the cleanup method.
The Driver class

The Driver class communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it.

The driver logic usually exists in the main method of the class written to encapsulate a MapReduce job. There is no default parent Driver class to subclass:
public class ExampleDriver extends Configured implements Tool
{
    ...
    public int run(String[] args) throws Exception
    {
        // Create a Configuration object that is used to set other options
        Configuration conf = getConf();
        // Get command-line arguments
        args = new GenericOptionsParser(conf, args)
            .getRemainingArgs();
        // Create the object representing the job
        Job job = new Job(conf, "ExampleJob");
        // Set the name of the main class in the job jar file
        job.setJarByClass(ExampleDriver.class);
        // Set the mapper class
        job.setMapperClass(ExampleMapper.class);
        // Set the reducer class
        job.setReducerClass(ExampleReducer.class);
        // Set the types for the final output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set input and output file paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Execute the job and wait for it to complete
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        int exitCode = ToolRunner.run(new ExampleDriver(), args);
        System.exit(exitCode);
    }
}
In the preceding lines of code, org.apache.hadoop.util.Tool is an interface for handling command-line options. The actual handling is delegated to ToolRunner.run, which runs the Tool with the given Configuration, used to get and set a job's configuration options. By subclassing org.apache.hadoop.conf.Configured, we can set the Configuration object directly from command-line options via GenericOptionsParser.
Given our previous talk of jobs, it's not surprising that much of the setup involves operations on a Job object. This includes setting the job name and specifying which classes are to be used for the mapper and reducer implementations.

Certain input/output configurations are set and, finally, the arguments passed to the main method are used to specify the input and output locations for the job. This is a very common model that you will see often.

There are a number of default values for configuration options, and we are implicitly using some of them in the preceding class. Most notably, we don't say anything about the format of the input files or how the output files are to be written. These are defined through the InputFormat and OutputFormat classes mentioned earlier; we will explore them in detail later. The default input and output formats are text files, which suit our examples. There are multiple ways of expressing the format within text files, in addition to particularly optimized binary formats.

A common model for less complex MapReduce jobs is to have the Mapper and Reducer classes as inner classes within the driver. This allows everything to be kept in a single file, which simplifies code distribution.
Combiner

Hadoop allows the use of a combiner class to perform some early aggregation of the output from the map method before it's retrieved by the reducer.

Much of Hadoop's design is predicated on reducing the expensive parts of a job, which usually equate to disk and network I/O. The output of the mapper is often large; it's not infrequent to see it many times the size of the original input. Hadoop does allow configuration options to help reduce the impact of the reducers transferring such large chunks of data across the network. The combiner takes a different approach: where possible, it performs early aggregation so that less data needs to be transferred in the first place.

The combiner does not have its own interface; a combiner must have the same signature as the reducer, and hence also subclasses the Reducer class from the org.apache.hadoop.mapreduce package. The effect of this is basically to perform a mini-reduce on the mapper's output destined for each reducer.

Hadoop does not guarantee whether the combiner will be executed. Sometimes it may not be executed at all, while at other times it may be used once, twice, or more times, depending on the size and number of output files generated by the mapper for each reducer.
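To see why a sum-style combiner is both safe and useful, consider this standalone sketch (plain Java; the class and method names are ours): a single mapper's stream of (word, 1) pairs is aggregated locally, as a combiner would do before any shuffle:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of what a combiner buys us: summing one mapper's (word, 1) output
// locally before the shuffle. The final counts are identical either way,
// which is why Hadoop is free to run a sum combiner zero or more times.
public class CombinerEffect {
    // Local "mini-reduce" over one mapper's output keys.
    static Map<String, Integer> combine(List<String> mapperOutputKeys) {
        Map<String, Integer> local = new HashMap<>();
        for (String word : mapperOutputKeys) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        List<String> mapperOutput = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            mapperOutput.add(i % 2 == 0 ? "hadoop" : "yarn");
        }
        Map<String, Integer> combined = combine(mapperOutput);
        // 10,000 intermediate records shrink to 2 before crossing the network.
        System.out.println(mapperOutput.size() + " -> " + combined.size());
    }
}
```

Because addition is associative and commutative, applying this mini-reduce zero, one, or several times leaves the final per-key totals unchanged; that is exactly the property a combiner must have.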
Partitioning

One of the implicit guarantees of the Reduce interface is that a single reducer will be given all the values associated with a given key. With multiple reduce tasks running across a cluster, each mapper output must therefore be partitioned into separate outputs destined for each reducer. These partitioned files are stored on the local node filesystem.

The number of reduce tasks across the cluster is not as dynamic as that of mappers, and indeed we can specify the value as part of our job submission. Hadoop therefore knows how many reducers will be needed to complete the job and, from this, into how many partitions the mapper output should be split.
The optional partition function

Within the org.apache.hadoop.mapreduce package is the Partitioner class, an abstract class with the following signature:

public abstract class Partitioner<Key, Value>
{
    public abstract int getPartition(Key key, Value value,
        int numPartitions);
}
By default, Hadoop will use a strategy that hashes the output key to perform the partitioning. This functionality is provided by the HashPartitioner class within the org.apache.hadoop.mapreduce.lib.partition package, but in some cases it's necessary to provide a custom subclass of Partitioner with application-specific partitioning logic. Notice that the getPartition function takes the key, value, and number of partitions as parameters, any of which can be used by the custom partitioning logic.

A custom partitioning strategy would be particularly necessary if, for example, the data gave a very uneven distribution when the standard hash function was applied. Uneven partitioning can result in some tasks having to perform significantly more work than others, leading to a much longer overall job execution time.
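The default hash strategy boils down to one line of arithmetic. This standalone sketch (our own class, mirroring the hash-then-modulus approach described above rather than reproducing the Hadoop source) shows the two properties any partitioner must provide: determinism and a result in the range [0, numPartitions):

```java
// The masking with Integer.MAX_VALUE clears the sign bit so that the
// modulus can never be negative, even for keys whose hashCode() is.
// A custom Partitioner subclass replaces exactly this decision.
public class HashPartitioning {
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        for (String key : new String[] {"hadoop", "hdfs", "yarn"}) {
            // The same key always lands on the same reducer.
            System.out.println(key + " -> partition "
                + getPartition(key, numReducers));
        }
    }
}
```

A skewed key distribution (say, one key carrying half the records) defeats this scheme regardless of the hash function, which is when an application-specific getPartition becomes worthwhile.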
Hadoop-provided mapper and reducer implementations

We don't always have to write our own Mapper and Reducer classes from scratch. Hadoop provides several common Mapper and Reducer implementations that can be used in our jobs. If we don't override any of the methods in the Mapper and Reducer classes, the default implementations are the identity Mapper and Reducer classes, which simply output the input unchanged.

The mappers are found in the org.apache.hadoop.mapreduce.lib.map package and include the following:

- InverseMapper: returns (value, key) as output; that is, the input key is output as the value and the input value is output as the key
- TokenCounterMapper: counts the number of discrete tokens in each line of input
- IdentityMapper: implements the identity function, mapping inputs directly to outputs

The reducers are found in the org.apache.hadoop.mapreduce.lib.reduce package and currently include the following:

- IntSumReducer: outputs the sum of the list of integer values per key
- LongSumReducer: outputs the sum of the list of long values per key
- IdentityReducer: implements the identity function, mapping inputs directly to outputs
Sharing reference data

Occasionally, we might want to share data across tasks. For instance, if we need to perform a lookup operation on an ID-to-string translation table, we might want such a data source to be accessible by the mapper or reducer. A straightforward approach is to store the data we want to access on HDFS and use the FileSystem API to query it as part of the map or reduce steps.

Hadoop gives us an alternative mechanism to achieve the goal of sharing reference data across all tasks in the job: the DistributedCache, defined by the org.apache.hadoop.mapreduce.filecache.DistributedCache class. This can be used to efficiently make common read-only files that are used by the map or reduce tasks available to all nodes.

The files can be text data, as in this case, but could also be additional JARs, binary data, or archives; anything is possible. The files to be distributed are placed on HDFS and added to the DistributedCache within the job driver. Hadoop copies the files onto the local filesystem of each node prior to job execution, meaning every task has local access to the files.

An alternative is to bundle needed files into the job JAR submitted to Hadoop. This ties the data to the job JAR, making it more difficult to share across jobs, and requires the JAR to be rebuilt if the data changes.
Writing MapReduce programs

In this chapter, we will be focusing on batch workloads; given a set of historical data, we will look at properties of that dataset. In Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark, we will show how a similar type of analysis can be performed over a stream of text collected in real time.
Getting started

In the following examples, we will assume a dataset generated by collecting 1,000 tweets using the stream.py script, as shown in Chapter 1, Introduction:

$ python stream.py -t -n 1000 > tweets.txt
We can then copy the dataset into HDFS with:

$ hdfs dfs -put tweets.txt <destination>

Tip
Note that until now we have been working only with the text of tweets. In the remainder of this book, we'll extend stream.py to output additional tweet metadata in JSON format. Keep this in mind before dumping terabytes of messages with stream.py.

Our first MapReduce program will be the canonical WordCount example. A variation of this program will be used to determine trending topics. We will then analyze the text associated with topics to determine whether it expresses a "positive" or "negative" sentiment. Finally, we will make use of a MapReduce pattern, ChainMapper, to pull things together and present a data pipeline to clean and prepare the textual data we'll feed to the trending topic and sentiment analysis models.
Running the examples

The full source code of the examples described in this section can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch3.

Before we run our job in Hadoop, we must compile our code and collect the required class files into a single JAR file that we will submit to the system. Using Gradle, you can build the needed JAR file with:

$ ./gradlew jar

Local cluster

Jobs are executed on Hadoop using the jar option to the hadoop command-line utility. To use this, we specify the name of the JAR file, the main class within it, and any arguments that will be passed to the main class, as shown in the following command:

$ hadoop jar <job jar file> <main class> <argument1> … <argumentN>
Elastic MapReduce

Recall from Chapter 1, Introduction, that Elastic MapReduce expects the job JAR file and its input data to be located in an S3 bucket and conversely will dump its output back into S3.

Note

Be careful: this will cost money! For this example, we will use the smallest possible cluster configuration available for EMR: a single-node cluster.

First of all, we will copy the tweet dataset and the job JAR to S3 using the aws command-line utility:

$ aws s3 cp tweets.txt s3://<bucket>/input
$ aws s3 cp job.jar s3://<bucket>

We can then execute a job using the EMR command-line tool by adding a CUSTOM_JAR step with the aws CLI:

$ aws emr add-steps --cluster-id <cluster-id> --steps \
Type=CUSTOM_JAR,\
Name=CustomJAR,\
Jar=s3://<bucket>/job.jar,\
MainClass=<classname>,\
Args=arg1,arg2,…argN

Here, <cluster-id> is the ID of a running EMR cluster, <classname> is the fully qualified name of the main class, and arg1,arg2,…,argN are the job arguments.
WordCount, the Hello World of MapReduce

WordCount counts word occurrences in a dataset. The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/WordCount.java. Consider the following block of code, for example:

public class WordCount extends Configured implements Tool
{
    public static class WordCountMapper
            extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String str : words)
            {
                word.set(str);
                context.write(word, one);
            }
        }
    }

    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context
                ) throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable val : values) {
                total++;
            }
            context.write(key, new IntWritable(total));
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args)
                .getRemainingArgs();

        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }
}
This is our first complete MapReduce job. Look at the structure, and you should recognize the elements we have previously discussed: the overall Job class with the driver configuration in its main method and the Mapper and Reducer implementations defined as static nested classes.
We'll do a more detailed walkthrough of the mechanics of MapReduce in the next section, but for now, let's look at the preceding code and think of how it realizes the key/value transformations we discussed earlier.
The input to the Mapper class is arguably the hardest to understand, as the key is not actually used. The job specifies TextInputFormat as the format of the input data and, by default, this delivers to the mapper data where the key is the byte offset in the file and the value is the text of that line. In reality, you may never actually see a mapper that uses that byte offset key, but it's provided.
The mapper is executed once for each line of text in the input source, and every time it takes the line and breaks it into words. It then uses the Context object to output (more commonly known as emitting) each new key/value of the form (word, 1). These are our K2/V2 values.
We said before that the input to the reducer is a key and a corresponding list of values, and there is some magic that happens between the map and reduce methods to collect the values for each key that facilitates this. It's called the shuffle stage, which we won't describe right now. Hadoop executes the reducer once for each key, and the preceding reducer implementation simply counts the numbers in the Iterable object and gives output for each word in the form of (word, count). These are our K3/V3 values.
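The group-by-key behavior of the shuffle can be simulated in plain Java, with no Hadoop dependency; this sketch (class and method names are hypothetical, for illustration only) mirrors the (word, 1) to (word, count) flow described above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSim {
    // "Map" phase: emit a (word, 1) pair for every token in every line. These are the K2/V2 pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                emitted.add(Map.entry(word, 1));
            }
        }
        return emitted;
    }

    // "Shuffle" plus "reduce": group the emitted values by key, then sum per key. These are K3/V3.
    static Map<String, Integer> shuffleAndReduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();  // sorted by key, like MapReduce output
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(shuffleAndReduce(map(List.of("a b a", "b a"))));  // {a=3, b=2}
    }
}
```

The real shuffle does this grouping across machines, but the contract is the same: every value emitted under one key reaches a single reduce call.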
Take a look at the signatures of our mapper and reducer classes: the WordCountMapper class accepts Object and Text as input and provides Text and IntWritable as output, while the WordCountReducer class accepts Text and IntWritable as both input and output. This is quite a common pattern, where the map method discards the incoming key and emits a series of data pairs derived from the value, on which the reducer performs aggregation.
The driver is more meaningful here, as we have real values for the parameters. We use arguments passed to the class to specify the input and output locations.
Run the job with:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.WordCount \
twitter.txt output
Examine the output with a command such as the following; the actual filename might be different, so just look inside the directory called output in your home directory on HDFS:

$ hdfs dfs -cat output/part-r-00000
Word co-occurrences

Words occurring together are likely to be phrases, and common (frequently occurring) phrases are likely to be important. In Natural Language Processing, a sequence of co-occurring terms is called an N-Gram. N-Grams are the foundation of several statistical methods for text analytics. We will give an example of the special case of an N-Gram composed of two terms (a bigram), a metric often encountered in analytics applications.

A naïve implementation in MapReduce would be an extension of WordCount that emits a multi-field key composed of two tab-separated words:
public class BiGramCount extends Configured implements Tool
{
    public static class BiGramMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            Text bigram = new Text();
            String prev = null;

            for (String s : words) {
                if (prev != null) {
                    bigram.set(prev + "\t" + s);
                    context.write(bigram, one);
                }
                prev = s;
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();

        Job job = Job.getInstance(conf);
        job.setJarByClass(BiGramCount.class);
        job.setMapperClass(BiGramMapper.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new BiGramCount(), args);
        System.exit(exitCode);
    }
}
In this job, we replace WordCountReducer with org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer, which implements the same logic. The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/BiGramCount.java.
Trending topics

The # symbol, called a hashtag, is used to mark keywords or topics in a tweet. It was created organically by Twitter users as a way to categorize messages. Twitter Search (found at https://twitter.com/search-home) popularized the use of hashtags as a method to connect and find content related to specific topics, as well as the people talking about such topics. By counting the frequency with which a hashtag is mentioned over a given time period, we can determine which topics are trending in the social network.
public class HashTagCount extends Configured implements Tool
{
    public static class HashTagCountMapper
            extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private String hashtagRegExp =
                "(?:\\s|\\A|^)[#]+([A-Za-z0-9-_]+)";

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String str : words)
            {
                if (str.matches(hashtagRegExp)) {
                    word.set(str);
                    context.write(word, one);
                }
            }
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args)
                .getRemainingArgs();

        Job job = Job.getInstance(conf);
        job.setJarByClass(HashTagCount.class);
        job.setMapperClass(HashTagCountMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new HashTagCount(), args);
        System.exit(exitCode);
    }
}
As in the WordCount example, we tokenize text in the Mapper. We use a regular expression, hashtagRegExp, to detect the presence of a hashtag in the tweet's text, and we emit the hashtag and the number 1 when a hashtag is found. In the Reducer step, we then count the total number of emitted hashtag occurrences using IntSumReducer.
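The filtering behavior of such a pattern is easy to verify in plain Java; this sketch uses a simplified version of the hashtag expression (an assumption for illustration, not the exact one compiled into HashTagCount) to show how matches() keeps only hashtag tokens:

```java
import java.util.ArrayList;
import java.util.List;

public class HashTagFilter {
    // Simplified hashtag pattern: one or more '#' followed by word characters.
    static final String HASHTAG = "[#]+([A-Za-z0-9-_]+)";

    // Keep only tokens that are hashtags, as the mapper does with matches().
    static List<String> hashtags(String text) {
        List<String> found = new ArrayList<>();
        for (String token : text.split(" ")) {
            if (token.matches(HASHTAG)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(hashtags("big win today #win #women yay"));  // [#win, #women]
    }
}
```

Note that String.matches() requires the whole token to match, which is why tokenizing first and then matching each token works here.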
The full source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagCount.java.
This compiled class will be in the JAR file we built with Gradle earlier, so now we execute HashTagCount with the following command:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.HashTagCount twitter.txt output

Let's examine the output as before:

$ hdfs dfs -cat output/part-r-00000
You should see output similar to the following:

#whey	1
#willpower	1
#win	2
#winterblues	1
#winterstorm	1
#wipolitics	1
#women	6
#woodgrain	1

Each line is composed of a hashtag and the number of times it appears in the tweets dataset. As you can see, the MapReduce job orders results by key. If we want to find the most mentioned topics, we need to order the result set. The naïve approach would be to perform a total ordering of the aggregated values and select the top 10.
If the output dataset is small, we can pipe it to standard output and sort it using the sort utility:

$ hdfs dfs -cat output/part-r-00000 | sort -k2 -n -r | head -n 10

Another solution would be to write another MapReduce job to traverse the whole result set and sort by value. When data becomes large, this type of global sorting can become quite expensive. In the following section, we will illustrate an efficient design pattern to sort aggregated data.
The Top N pattern
In the Top N pattern, we keep data sorted in a local data structure. Each mapper calculates a list of the top N records in its split and sends its list to the reducer. A single reducer task then finds the top N global records.
We will apply this design pattern to implement a TopTenHashTag job that finds the top ten topics in our dataset. The job takes as input the output data generated by HashTagCount and returns a list of the ten most frequently mentioned hashtags.
In TopTenMapper, we use a TreeMap to keep a sorted list, in ascending order, of hashtags. The key of this map is the number of occurrences; the value is a tab-separated string of hashtags and their frequency. In map(), for each value, we update the topN map. When topN has more than ten items, we remove the smallest:
public static class TopTenMapper extends Mapper<Object, Text,
        NullWritable, Text> {
    private TreeMap<Integer, Text> topN = new TreeMap<Integer, Text>();
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws
            IOException, InterruptedException {
        String[] words = value.toString().split("\t");
        if (words.length < 2) {
            return;
        }
        topN.put(Integer.parseInt(words[1]), new Text(value));
        if (topN.size() > 10) {
            topN.remove(topN.firstKey());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        for (Text t : topN.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}
Wedon’temitanykey/valueinthemapfunction.Weimplementacleanup()methodthat,oncethemapperhasconsumedallitsinput,emitsthe(hashtag,count)valuesintopN.WeuseaNullWritablekeybecausewewantallvaluestobeassociatedwiththesamekeysothatwecanperformaglobalorderoverallmappers’topnlists.Thisimpliesthatourjobwillexecuteonlyonereducer.
Thereducerimplementslogicsimilartowhatwehaveinmap().WeinstantiateTreeMapanduseittokeepanorderedlistofthetop10values:
public static class TopTenReducer extends
        Reducer<NullWritable, Text, NullWritable, Text> {
    private TreeMap<Integer, Text> topN = new TreeMap<Integer, Text>();

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context
            context) throws IOException, InterruptedException {
        for (Text value : values) {
            String[] words = value.toString().split("\t");
            topN.put(Integer.parseInt(words[1]),
                    new Text(value));
            if (topN.size() > 10) {
                topN.remove(topN.firstKey());
            }
        }

        for (Text word : topN.descendingMap().values()) {
            context.write(NullWritable.get(), word);
        }
    }
}
Finally, we traverse topN in descending order to generate the list of trending topics.

Note

Note that in this implementation, we override hashtags that have a frequency value already present in the TreeMap when calling topN.put(). Depending on the use case, it's advisable to use a different data structure, such as the ones offered by the Guava library (https://code.google.com/p/guava-libraries/), or to adjust the updating strategy.
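The collision issue mentioned in the note is easy to demonstrate in plain Java, and one possible workaround (an illustrative assumption, not the book's implementation) is to key the TreeMap on a count-plus-tag composite so that equal counts no longer overwrite each other:

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

public class TopNDemo {
    // Keying on the bare count: of two tags with the same frequency, only one survives.
    static TreeMap<Integer, String> byCountOnly(Map<String, Integer> freqs) {
        TreeMap<Integer, String> topN = new TreeMap<>();
        freqs.forEach((tag, count) -> topN.put(count, tag));
        return topN;
    }

    // Composite "count\ttag" key, ordered by count then tag: ties are preserved.
    static TreeMap<String, Integer> byCompositeKey(Map<String, Integer> freqs) {
        Comparator<String> byCountThenTag = Comparator
                .comparingInt((String k) -> Integer.parseInt(k.split("\t")[0]))
                .thenComparing(k -> k.split("\t")[1]);
        TreeMap<String, Integer> topN = new TreeMap<>(byCountThenTag);
        freqs.forEach((tag, count) -> topN.put(count + "\t" + tag, count));
        return topN;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = Map.of("#win", 2, "#women", 2, "#whey", 1);
        System.out.println(byCountOnly(freqs).size());      // 2 -- one of the tied tags was lost
        System.out.println(byCompositeKey(freqs).size());   // 3 -- all entries kept
    }
}
```

The same eviction logic (remove firstKey() once the map exceeds ten entries) works unchanged with the composite key, since the smallest composite key still belongs to the lowest count.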
In the driver, we enforce a single reducer by setting job.setNumReduceTasks(1). Run the job with:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.TopTenHashTag \
output/part-r-00000 \
top-ten

We can inspect the top ten to list trending topics:

$ hdfs dfs -cat top-ten/part-r-00000
#Stalker48	150
#gameinsight	55
#12M	52
#KCA	46
#LORDJASONJEROME	29
#Valencia	19
#LesAnges6	16
#VoteLuan	15
#hadoop2	12
#Gameinsight	11
The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/TopTenHashTag.java.
Sentiment of hashtags

The process of identifying subjective information in a data source is commonly referred to as sentiment analysis. In the previous example, we showed how to detect trending topics in a social network; we'll now analyze the text shared around those topics to determine whether it expresses a mostly positive or negative sentiment.

A list of positive and negative words for the English language, a so-called opinion lexicon, can be found at http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar.

Note

These resources, and many more, have been collected by Prof. Bing Liu's group at the University of Illinois at Chicago and have been used, among others, in Bing Liu, Minqing Hu, and Junsheng Cheng, "Opinion Observer: Analyzing and Comparing Opinions on the Web," Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
Inthisexample,we’llpresentabag-of-wordsmethodthat,althoughsimplisticinnature,canbeusedasabaselinetomineopinionintext.Foreachtweetandeachhashtag,wewillcountthenumberoftimesapositiveoranegativewordappearsandnormalizethiscountbythetextlength.
NoteThebag-of-wordsmodelisanapproachusedinNaturalLanguageProcessingandInformationRetrievaltorepresenttextualdocuments.Inthismodel,textisrepresentedasthesetorbag—withmultiplicity—ofitswords,disregardinggrammarandmorphologicalpropertiesandevenwordorder.
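The per-tweet score just described (positive minus negative word count, normalized by length) reduces to a few lines of plain Java; in this sketch, the tiny word sets are stand-ins for the opinion lexicon, and the class name is hypothetical:

```java
import java.util.Set;

public class BagOfWordsSentiment {
    // Miniature stand-ins for the positive/negative opinion lexicon files.
    static final Set<String> POSITIVE = Set.of("good", "great", "win");
    static final Set<String> NEGATIVE = Set.of("bad", "sad", "lose");

    // (positive count - negative count) / total words, as in the mapper/reducer pair below.
    static double score(String tweet) {
        String[] words = tweet.toLowerCase().split(" ");
        int diff = 0;
        for (String w : words) {
            if (POSITIVE.contains(w)) diff++;
            else if (NEGATIVE.contains(w)) diff--;
        }
        return (double) diff / words.length;
    }

    public static void main(String[] args) {
        System.out.println(score("great win today"));    // positive score
        System.out.println(score("bad sad day today"));  // -0.5
    }
}
```

Normalizing by length keeps a long rambling tweet from dominating a short, strongly worded one when scores are later aggregated per hashtag.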
Uncompress the archive and place the word lists into HDFS with the following command lines:

$ hdfs dfs -put positive-words.txt <destination>
$ hdfs dfs -put negative-words.txt <destination>
In the Mapper class, we define two objects that will hold the word lists, positiveWords and negativeWords, as Set<String>:

private Set<String> positiveWords = null;
private Set<String> negativeWords = null;

We override the default setup() method of the Mapper so that a list of positive and negative words, specified by two configuration properties, job.positivewords.path and job.negativewords.path, is read from HDFS using the filesystem API we discussed in the previous chapter. We could have also used the DistributedCache to share this data across the cluster. The helper method, parseWordsList, reads a list of words, strips out comments, and loads the words into a HashSet<String>:
private HashSet<String> parseWordsList(FileSystem fs, Path wordsListPath)
{
    HashSet<String> words = new HashSet<String>();
    try {
        if (fs.exists(wordsListPath)) {
            FSDataInputStream fi = fs.open(wordsListPath);
            BufferedReader br =
                    new BufferedReader(new InputStreamReader(fi));
            String line = null;
            while ((line = br.readLine()) != null) {
                if (line.length() > 0 && !line.startsWith(BEGIN_COMMENT)) {
                    words.add(line);
                }
            }
            fi.close();
        }
    }
    catch (IOException e) {
        e.printStackTrace();
    }
    return words;
}
In the Mapper step, for each hashtag in the tweet, we emit the overall sentiment of the tweet (simply the positive word count minus the negative word count) and the length of the tweet.
We'll use these in the reducer to calculate an overall sentiment ratio, weighted by the length of the tweets, to estimate the sentiment expressed by a tweet on a hashtag, as follows:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] words = value.toString().split(" ");
    Integer positiveCount = new Integer(0);
    Integer negativeCount = new Integer(0);
    Integer wordsCount = new Integer(0);

    for (String str : words)
    {
        if (str.matches(HASHTAG_PATTERN)) {
            hashtags.add(str);
        }
        if (positiveWords.contains(str)) {
            positiveCount += 1;
        } else if (negativeWords.contains(str)) {
            negativeCount += 1;
        }
        wordsCount += 1;
    }

    Integer sentimentDifference = 0;
    if (wordsCount > 0) {
        sentimentDifference = positiveCount - negativeCount;
    }

    String stats;
    for (String hashtag : hashtags) {
        word.set(hashtag);
        stats = String.format("%d %d", sentimentDifference,
                wordsCount);
        context.write(word, new Text(stats));
    }
}
}
In the Reducer step, we add together the sentiment scores given to each instance of the hashtag and divide by the total size of all the tweets in which it occurred:
public static class HashTagSentimentReducer
        extends Reducer<Text, Text, Text, DoubleWritable> {
    public void reduce(Text key, Iterable<Text> values,
            Context context
            ) throws IOException, InterruptedException {
        double totalDifference = 0;
        double totalWords = 0;
        for (Text val : values) {
            String[] parts = val.toString().split(" ");
            totalDifference += Double.parseDouble(parts[0]);
            totalWords += Double.parseDouble(parts[1]);
        }
        context.write(key,
                new DoubleWritable(totalDifference / totalWords));
    }
}
The full source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagSentiment.java.
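The reducer's weighted average can be sanity-checked in plain Java; given per-tweet (sentimentDifference, wordsCount) pairs for one hashtag, the polarity is the sum of differences over the sum of lengths (the class name here is hypothetical):

```java
public class PolarityCheck {
    // Aggregate per-tweet (difference, length) pairs exactly as the reducer does.
    static double polarity(int[][] pairs) {
        double totalDifference = 0;
        double totalWords = 0;
        for (int[] p : pairs) {
            totalDifference += p[0];  // positive minus negative words in one tweet
            totalWords += p[1];       // length of that tweet
        }
        return totalDifference / totalWords;
    }

    public static void main(String[] args) {
        // Two tweets mentioning the same hashtag: +2 over 10 words, then -1 over 10 words.
        System.out.println(polarity(new int[][]{{2, 10}, {-1, 10}}));  // 0.05
    }
}
```

Because the division happens once over the summed totals, longer tweets contribute proportionally more weight than short ones, which is the intended weighting.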
Having built the JAR file as before, execute HashTagSentiment with the following command:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.HashTagSentiment twitter.txt output-sentiment \
<positive words> <negative words>
You can examine the output with the following command:

$ hdfs dfs -cat output-sentiment/part-r-00000

You should see an output similar to the following:
#1068	0.011861271213042056
#10YearsOfLove	0.012285135487494233
#11	0.011941109121333999
#12	0.011938693593171155
#12F	0.012339242266249566
#12M	0.011864286953783268
#12MCalleEnPazYaTeVasNicolas
In the preceding output, each line is composed of a hashtag and the sentiment polarity associated with it. This number is a heuristic that tells us whether a hashtag is associated mostly with positive (polarity > 0) or negative (polarity < 0) sentiment, and the magnitude of such a sentiment: the higher or lower the number, the stronger the sentiment.
Text cleanup using chain mapper

In the examples presented until now, we ignored a key step of essentially every application built around text processing, which is the normalization and cleanup of the input data. Three common components of this normalization step are:

- Changing the letter case to either lower or upper
- Removal of stop words
- Stemming
In this section, we will show how the ChainMapper class, found at org.apache.hadoop.mapreduce.lib.chain.ChainMapper, allows us to sequentially combine a series of Mappers as the first step of a data cleanup pipeline. Mappers are added to the configured job using the following:

ChainMapper.addMapper(
    Job job,
    Class<? extends Mapper> klass,
    Class<?> inputKeyClass,
    Class<?> inputValueClass,
    Class<?> outputKeyClass,
    Class<?> outputValueClass,
    Configuration mapperConf)

The static method, addMapper, requires the following arguments to be passed:

- job: the Job to add the Mapper class to
- klass: the Mapper class to add
- inputKeyClass: the mapper input key class
- inputValueClass: the mapper input value class
- outputKeyClass: the mapper output key class
- outputValueClass: the mapper output value class
- mapperConf: a Configuration with the configuration for the Mapper class
In this example, we will take care of the first item listed above: before computing the sentiment of each tweet, we will convert to lowercase each word present in its text. This will allow us to more accurately ascertain the sentiment of hashtags by ignoring differences in capitalization across tweets.
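Conceptually, ChainMapper is function composition over key/value streams. This plain-Java sketch (hypothetical names, no Hadoop types) chains a lowercasing step into a tokenizing step, the way LowerCaseMapper will feed HashTagSentimentMapper below:

```java
import java.util.List;
import java.util.function.Function;

public class ChainSketch {
    // Stage 1: normalize case, as LowerCaseMapper does.
    static final Function<String, String> LOWERCASE = s -> s.toLowerCase();
    // Stage 2: tokenize, as a downstream mapper might.
    static final Function<String, List<String>> TOKENIZE = s -> List.of(s.split(" "));

    // The chain: the output of the first stage becomes the input of the second.
    static final Function<String, List<String>> PIPELINE = LOWERCASE.andThen(TOKENIZE);

    public static void main(String[] args) {
        System.out.println(PIPELINE.apply("Great WIN Today"));  // [great, win, today]
    }
}
```

As with ChainMapper, the output type of each stage must match the input type of the next; the compiler enforces it here, while in Hadoop a mismatch only surfaces at runtime.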
First of all, we define a new Mapper, LowerCaseMapper, whose map() function calls Java String's toLowerCase() method on its input value and emits the lowercased text:

public class LowerCaseMapper extends Mapper<LongWritable, Text,
        IntWritable, Text> {
    private Text lowercased = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        lowercased.set(value.toString().toLowerCase());
        context.write(new IntWritable(1), lowercased);
    }
}
In the HashTagSentimentChain driver, we configure the Job object so that both Mappers will be chained together and executed:
public class HashTagSentimentChain
        extends Configured implements Tool
{
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();

        // location (on hdfs) of the positive words list
        conf.set("job.positivewords.path", args[2]);
        conf.set("job.negativewords.path", args[3]);

        Job job = Job.getInstance(conf);
        job.setJarByClass(HashTagSentimentChain.class);

        Configuration lowerCaseMapperConf = new Configuration(false);
        ChainMapper.addMapper(job,
                LowerCaseMapper.class,
                LongWritable.class, Text.class,
                IntWritable.class, Text.class,
                lowerCaseMapperConf);

        Configuration hashTagSentimentConf = new Configuration(false);
        ChainMapper.addMapper(job,
                HashTagSentiment.HashTagSentimentMapper.class,
                IntWritable.class, Text.class,
                Text.class, Text.class,
                hashTagSentimentConf);

        job.setReducerClass(HashTagSentiment.HashTagSentimentReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(
                new HashTagSentimentChain(), args);
        System.exit(exitCode);
    }
}
The LowerCaseMapper and HashTagSentimentMapper classes are invoked in a pipeline, where the output of the first becomes the input of the second. The output of the last Mapper will be written to the task's output. An immediate benefit of this design is a reduction of disk I/O operations. Mappers do not need to be aware that they are chained.
It’sthereforepossibletoreusespecializedMappersthatcanbecombinedwithinasingletask.NotethatthispatternassumesthatallMappers—andtheReduce—usematchingoutputandinput(key,value)pairs.NocastingorconversionisdonebyChainMapperitself.
Finally,noticethattheaddMappercallforthelastmapperinthechainspecifiestheoutputkey/valueclassesapplicabletothewholemapperpipelinewhenusedasacomposite.
Thefullsourcecodeofthisexamplecanbefoundathttps://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagSentimentChain.java
Execute HashTagSentimentChain with the command:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.HashTagSentimentChain twitter.txt output \
<positive words> <negative words>
You should see an output similar to the previous example. Notice that this time, the hashtag in each line is lowercased.
Walking through a run of a MapReduce job

To explore the relationship between mapper and reducer in more detail, and to expose some of Hadoop's inner workings, we'll now go through how a MapReduce job is executed. This applies to MapReduce in both Hadoop 1 and Hadoop 2, even though the latter is implemented very differently using YARN, which we'll discuss later in this chapter. Additional information on the services described in this section, as well as suggestions for troubleshooting MapReduce applications, can be found in Chapter 10, Running a Hadoop Cluster.
Startup

The driver is the only piece of code that runs on our local machine, and the call to Job.waitForCompletion() starts the communication with the JobTracker, which is the master node in the MapReduce system. The JobTracker is responsible for all aspects of job scheduling and execution, so it becomes our primary interface when performing any task related to job management.
To share resources on the cluster, the JobTracker can use one of several scheduling approaches to handle incoming jobs. The general model is to have a number of queues to which jobs can be submitted, along with policies to assign resources across the queues. The most commonly used implementations of these policies are the Capacity and Fair Schedulers.
The JobTracker communicates with the NameNode on our behalf and manages all interactions relating to the data stored on HDFS.
Splitting the input

The first of these interactions happens when the JobTracker looks at the input data and determines how to assign it to map tasks. Recall that HDFS files are usually split into blocks of at least 64 MB, and the JobTracker will assign each block to one map task. Our WordCount example, of course, used a trivial amount of data that was well within a single block. Picture a much larger input file measured in terabytes, and the split model makes more sense. Each segment of the file, or split, in MapReduce terminology, is processed uniquely by one map task. Once it has computed the splits, the JobTracker places them, and the JAR file containing the Mapper and Reducer classes, into a job-specific directory on HDFS, whose path will be passed to each task as it starts.
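The number of map tasks therefore follows directly from file size and block size. As a back-of-the-envelope sketch (assuming one split per block and a 64 MB block size):

```java
public class SplitCount {
    // One map task per block: the ceiling of fileSize / blockSize.
    static long splits(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;          // 64 MB block
        long fileSize = 1024L * 1024 * 1024 * 1024;  // a 1 TB input file
        System.out.println(splits(fileSize, blockSize));  // 16384 map tasks
    }
}
```

A terabyte-scale input thus fans out into thousands of map tasks, which is why the scheduling and locality decisions described next matter so much.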
Task assignment

The TaskTracker service is responsible for allocating resources and for executing and tracking the status of the map and reduce tasks running on a node. Once the JobTracker has determined how many map tasks will be needed, it looks at the number of hosts in the cluster, how many TaskTrackers are working, and how many map tasks each can concurrently execute (a user-definable configuration variable). The JobTracker also looks to see where the various input data blocks are located across the cluster and attempts to define an execution plan that maximizes the cases when a TaskTracker processes a split/block located on the same physical host or, failing that, processes at least one in the same hardware rack. This data locality optimization is a huge reason behind Hadoop's ability to efficiently process such large datasets. Recall also that, by default, each block is replicated across three different hosts, so the likelihood of producing a task/host plan that sees most blocks processed locally is higher than it might seem at first.
Task startup

Each TaskTracker then starts up a separate Java virtual machine to execute the tasks. This does add a startup time penalty, but it isolates the TaskTracker from problems caused by misbehaving map or reduce tasks, and the JVM can be configured to be shared between subsequently executed tasks.
If the cluster has enough capacity to execute all the map tasks at once, they will all be started and given a reference to the split they are to process and the job JAR file. If there are more tasks than the cluster capacity, the JobTracker will keep a queue of pending tasks and assign them to nodes as they complete their initially assigned map tasks.
We are now ready to see the execution of the map tasks. If all this sounds like a lot of work, it is; it explains why, when running any MapReduce job, there is always a non-trivial amount of time taken as the system gets started and performs all these steps.
Ongoing JobTracker monitoring

The JobTracker doesn't just stop work now and wait for the TaskTrackers to execute all the mappers and reducers. It's constantly exchanging heartbeat and status messages with the TaskTrackers, looking for evidence of progress or problems. It also collects metrics from the tasks throughout the job execution, some provided by Hadoop and others specified by the developer of the map and reduce tasks, although we don't use any in this example.
Mapper input

The driver class specifies the format and structure of the input file using TextInputFormat, and from this, Hadoop knows to treat this as text with the byte offset as the key and line contents as the value. Assume that our dataset contains the following text:
This is a test
Yes it is
The two invocations of the mapper will therefore be given the following input, with the byte offset of each line as the key:

(0, This is a test)
(15, Yes it is)
Mapper execution

The key/value pairs received by the mapper are the offset in the file of the line and the line contents, respectively, because of how the job is configured. Our implementation of the map method in WordCountMapper discards the key, as we do not care where each line occurred in the file, and splits the provided value into words using the split method on the standard Java String class. Note that better tokenization could be provided by use of regular expressions or the StringTokenizer class, but for our purposes this simple approach will suffice. For each individual word, the mapper then emits a key comprised of the actual word itself, and a value of 1.
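The per-line behavior just described can be mimicked in plain Java. This is an illustrative simulation only; the real WordCountMapper extends Hadoop's Mapper class and writes to a Context rather than returning a list:

```java
import java.util.*;

public class MapStep {
    // Emit a (word, 1) pair for each whitespace-separated token in a line,
    // discarding the byte-offset key just as the mapper described above does.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0, "This is a test")); // [This=1, is=1, a=1, test=1]
    }
}
```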
Mapper output and reducer input

The output of the mapper is a series of pairs of the form (word, 1); in our example, these will be:

(This, 1), (is, 1), (a, 1), (test, 1), (Yes, 1), (it, 1), (is, 1)
These output pairs from the mapper are not passed directly to the reducer. Between mapping and reducing is the shuffle stage, where much of the magic of MapReduce occurs.
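The grouping that the shuffle performs can be simulated in plain Java. This is a sketch of the observable behavior only, not Hadoop's implementation, which also sorts, partitions, and moves data across the network:

```java
import java.util.*;

public class ShuffleStep {
    // Group (word, 1) pairs by key, as the shuffle does before handing each
    // key and its list of values to a single reduce invocation.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // shuffle also sorts keys
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : new String[]{"This", "is", "a", "test", "Yes", "it", "is"}) {
            pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        System.out.println(shuffle(pairs).get("is")); // [1, 1]
    }
}
```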
Reducer input

The reducer TaskTracker receives updates from the JobTracker that tell it which nodes in the cluster hold map output partitions that need to be processed by its local reduce task. It then retrieves these from the various nodes and merges them into a single file that will be fed to the reduce task.
Reducer execution

Our WordCountReducer class is very simple; for each word, it simply counts the number of elements in the list of values and emits the final (word, count) output for each word. For our invocation of WordCount on our sample input, all but one word has only one value in its list of values; is has two.
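The counting itself is trivial, as this plain-Java sketch shows. The real WordCountReducer sums IntWritable values through a Context; here, because every emitted value is 1, counting the grouped values gives the same total:

```java
import java.util.*;

public class ReduceStep {
    // For WordCount with a value of 1 per occurrence, counting the values
    // grouped under a word gives that word's total.
    static int reduce(String word, List<Integer> values) {
        return values.size();
    }

    public static void main(String[] args) {
        System.out.println(reduce("is", Arrays.asList(1, 1))); // 2
        System.out.println(reduce("test", Arrays.asList(1)));  // 1
    }
}
```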
Reducer output

The final set of reducer output for our example is therefore:

(This, 1), (is, 2), (a, 1), (test, 1), (Yes, 1), (it, 1)
This data will be output to partition files within the output directory specified in the driver, formatted using the specified OutputFormat implementation. Each reduce task writes to a single file with the filename part-r-nnnnn, where nnnnn starts at 00000 and is incremented.
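The naming convention is simply a zero-padded five-digit reducer index, easily reproduced with a format string:

```java
public class PartFileName {
    // Reducer output files are named part-r-00000, part-r-00001, and so on.
    static String partFileName(int reducerIndex) {
        return String.format("part-r-%05d", reducerIndex);
    }

    public static void main(String[] args) {
        System.out.println(partFileName(0));  // part-r-00000
        System.out.println(partFileName(12)); // part-r-00012
    }
}
```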
Shutdown

Once all tasks have completed successfully, the JobTracker outputs the final state of the job to the client, along with the final aggregates of some of the more important counters that it has been aggregating along the way. The full job and task history is available in the log directory on each node or, more accessibly, via the JobTracker web UI; point your browser to port 50030 on the JobTracker node.
Input/Output

We have talked about files being broken into splits as part of the job startup and the data in a split being sent to the mapper implementation. However, this overlooks two aspects: how the data is stored in the file and how the individual keys and values are passed to the mapper.
InputFormat and RecordReader

Hadoop has the concept of InputFormat for the first of these responsibilities. The InputFormat abstract class in the org.apache.hadoop.mapreduce package provides two methods, as shown in the following code:

public abstract class InputFormat<K, V>
{
    public abstract List<InputSplit> getSplits(JobContext context);
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
        TaskAttemptContext context);
}
These methods display the two responsibilities of the InputFormat class:

- To provide details on how to divide an input file into the splits required for map processing
- To create a RecordReader that will generate the series of key/value pairs from a split
The RecordReader class is also an abstract class within the org.apache.hadoop.mapreduce package:

public abstract class RecordReader<Key, Value> implements Closeable
{
    public abstract void initialize(InputSplit split,
        TaskAttemptContext context);
    public abstract boolean nextKeyValue()
        throws IOException, InterruptedException;
    public abstract Key getCurrentKey()
        throws IOException, InterruptedException;
    public abstract Value getCurrentValue()
        throws IOException, InterruptedException;
    public abstract float getProgress()
        throws IOException, InterruptedException;
    public abstract void close() throws IOException;
}
A RecordReader instance is created for each split. The framework calls nextKeyValue, which returns a Boolean indicating whether another key/value pair is available, and, if so, the getCurrentKey and getCurrentValue methods are used to access the key and value respectively.
The combination of the InputFormat and RecordReader classes is therefore all that is required to bridge between any kind of input data and the key/value pairs required by MapReduce.
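As a plain-Java analog of this contract (not Hadoop's LineRecordReader, which reads from HDFS streams and handles records spanning split boundaries), a reader over an in-memory split might look like this:

```java
public class SimpleLineReader {
    private final String[] lines;
    private int index = -1;
    private long offset = 0;
    private long currentKey;
    private String currentValue;

    SimpleLineReader(String split) {
        this.lines = split.split("\n");
    }

    // Mirrors RecordReader.nextKeyValue(): advance and report availability.
    boolean nextKeyValue() {
        index++;
        if (index >= lines.length) {
            return false;
        }
        currentKey = offset;                  // byte offset of the line start
        currentValue = lines[index];
        offset += lines[index].length() + 1;  // +1 for the newline character
        return true;
    }

    long getCurrentKey()     { return currentKey; }
    String getCurrentValue() { return currentValue; }
}
```

Walking our two-line sample with this reader yields exactly the (0, This is a test) and (15, Yes it is) pairs described in the Mapper input section.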
Hadoop-provided InputFormat

There are some Hadoop-provided InputFormat implementations within the org.apache.hadoop.mapreduce.lib.input package:

- FileInputFormat: an abstract base class that can be the parent of any file-based input
- SequenceFileInputFormat: reads the efficient binary SequenceFile format, which will be discussed in an upcoming section
- TextInputFormat: used for plain text files
- KeyValueTextInputFormat: also used for plain text files, but each line is divided into key and value parts by a separator byte
Note that input formats are not restricted to reading from files; FileInputFormat is itself a subclass of InputFormat. It's possible to have Hadoop use data that is not based on files as the input to MapReduce jobs; common sources are relational databases or column-oriented databases, such as Amazon DynamoDB or HBase.
Hadoop-provided RecordReader

Hadoop provides a few common RecordReader implementations, which are also present within the org.apache.hadoop.mapreduce.lib.input package:

- LineRecordReader: the default RecordReader class for text files; it presents the byte offset in the file as the key and the line contents as the value
- SequenceFileRecordReader: reads the key/value pairs from the binary SequenceFile container
OutputFormat and RecordWriter

There is a similar pattern for writing the output of a job, coordinated by subclasses of OutputFormat and RecordWriter from the org.apache.hadoop.mapreduce package. We won't explore these in any detail here, but the general approach is similar, although OutputFormat does have a more involved API, as it has methods for tasks such as validation of the output specification.

It's this validation step that causes a job to fail if a specified output directory already exists. If you wanted different behavior, it would require a subclass of OutputFormat that overrides this method.
Hadoop-provided OutputFormat

The following output formats are provided in the org.apache.hadoop.mapreduce.lib.output package:

- FileOutputFormat: the base class for all file-based OutputFormats
- NullOutputFormat: a dummy implementation that discards the output and writes nothing
- SequenceFileOutputFormat: writes to the binary SequenceFile format
- TextOutputFormat: writes a plain text file
Note that these classes define their required RecordWriter implementations as static nested classes, so there are no separately provided RecordWriter implementations.
Sequence files

The SequenceFile class within the org.apache.hadoop.io package provides an efficient binary file format that is often useful as an output from a MapReduce job. This is especially true if the output from the job is processed as the input of another job. Sequence files have several advantages, as follows:

- As binary files, they are intrinsically more compact than text files
- They additionally support optional compression, which can be applied at different levels, that is, to each record or to an entire block
- They can be split and processed in parallel
This last characteristic is important, as most binary formats (particularly those that are compressed or encrypted) cannot be split and must be read as a single linear stream of data. Using such files as input to a MapReduce job means that a single mapper will be used to process the entire file, causing a potentially large performance hit. In such a situation, it's preferable to use a splittable format, such as SequenceFile, or, if you cannot avoid receiving the file in another format, do a preprocessing step that converts it into a splittable format. This will be a tradeoff, as the conversion will take time, but in many cases (especially with complex map tasks) this will be outweighed by the time saved through increased parallelism.
YARN

YARN started out as part of the MapReduce v2 (MRv2) initiative but is now an independent sub-project within Hadoop (that is, it's at the same level as MapReduce). It grew out of a realization that MapReduce in Hadoop 1 conflated two related but distinct responsibilities: resource management and application execution.
Although it has enabled previously unimagined processing on enormous datasets, the MapReduce model at a conceptual level has an impact on performance and scalability. Implicit in the MapReduce model is that any application can only be composed of a series of largely linear MapReduce jobs, each of which follows a model of one or more maps followed by one or more reduces. This model is a great fit for some applications, but not all. In particular, it's a poor fit for workloads requiring very low-latency response times; the MapReduce startup times and sometimes lengthy job chains often greatly exceed the tolerance for a user-facing process. The model has also been found to be very inefficient for jobs that would more naturally be represented as a directed acyclic graph (DAG) of tasks, where the nodes on the graph are processing steps and the edges are data flows. If analyzed and executed as a DAG, the application may be performed in one step with high parallelism across the processing steps, but when viewed through the MapReduce lens, the result is usually an inefficient series of interdependent MapReduce jobs.
Numerous projects have built different types of processing atop MapReduce and, although many are wildly successful (Apache Hive and Pig are two standout examples), the close coupling of MapReduce as a processing paradigm with the job scheduling mechanism in Hadoop 1 made it very difficult for any new project to tailor either of these areas to its specific needs.
The result is Yet Another Resource Negotiator (YARN), which provides a highly capable job scheduling mechanism within Hadoop and well-defined interfaces for different processing models to be implemented within it.
YARN architecture

To understand how YARN works, it's important to stop thinking about MapReduce and how it processes data. YARN itself says nothing about the nature of the applications that run atop it; rather, it's focused on providing the machinery for the scheduling and execution of these jobs. As we'll see, YARN is just as capable of hosting long-running stream processing or low-latency, user-facing workloads as it is of hosting batch-processing workloads, such as MapReduce.
The components of YARN

YARN is comprised of two main components: the ResourceManager (RM), which manages resources across the cluster, and the NodeManager (NM), which runs on each host and manages the resources on the individual machine. The ResourceManager and NodeManagers deal with the scheduling and management of containers, an abstract notion of the memory, CPU, and I/O that will be dedicated to running a particular piece of application code. Using MapReduce as an example, when running atop YARN, the JobTracker and each TaskTracker all run in their own dedicated containers. Note, though, that in YARN, each MapReduce job has its own dedicated JobTracker; there is no single instance that manages all jobs, as in Hadoop 1.
YARN itself is responsible only for the scheduling of tasks across the cluster; all notions of application-level progress, monitoring, and fault tolerance are handled in the application code. This is a very explicit design decision; by making YARN as independent as possible, it has a very clear set of responsibilities and does not artificially constrain the types of application that can be implemented on YARN.
As the arbiter of all cluster resources, YARN has the ability to efficiently manage the cluster as a whole and not focus on application-level resource requirements. It has a pluggable scheduling policy, with the provided implementations similar to the existing Hadoop Capacity and Fair Schedulers. YARN also treats all application code as inherently untrusted, and all application management and control tasks are performed in user space.
Anatomy of a YARN application

A submitted YARN application has two components: the ApplicationMaster (AM), which coordinates the overall application flow, and the specification of the code that will run on the worker nodes. For MapReduce atop YARN, the JobTracker implements the ApplicationMaster functionality and the TaskTrackers are the application custom code deployed on the worker nodes.
As mentioned in the previous section, the responsibilities of application management, progress monitoring, and fault tolerance are pushed to the application level in YARN. It's the ApplicationMaster that performs these tasks; YARN itself says nothing about the mechanisms for communication between the ApplicationMaster and the code running in the worker containers, for example.
This genericity allows YARN applications to not be tied to Java classes. The ApplicationMaster can instead request a NodeManager to execute shell scripts, native applications, or any other type of processing that is made available on each node.
Lifecycle of a YARN application

As with MapReduce jobs in Hadoop 1, YARN applications are submitted to the cluster by a client. When a YARN application is started, the client first calls the ResourceManager (more specifically, the ApplicationsManager portion of the ResourceManager) and requests the initial container within which to execute the ApplicationMaster. In most cases, the ApplicationMaster will run from a hosted container in the cluster, just as will the rest of the application code. The ApplicationsManager communicates with the other main component of the ResourceManager, the scheduler itself, which has the ultimate responsibility of managing all resources across the cluster.
The ApplicationMaster starts up in the provided container, registers itself with the ResourceManager, and begins the process of negotiating its required resources. The ApplicationMaster communicates with the ResourceManager and requests the containers it requires. The specification of the requested containers can also include additional information, such as desired placement within the cluster and concrete resource requirements, such as a particular amount of memory or CPU.
The ResourceManager provides the ApplicationMaster with the details of the containers it has been allocated, and the ApplicationMaster then communicates with the NodeManagers to start the application-specific task for each container. This is done by providing the NodeManager with the specification of the application to be executed which, as mentioned, may be a JAR file, a script, a path to a local executable, or anything else that the NodeManager can invoke. Each NodeManager instantiates the container for the application code and starts the application based on the provided specification.
Fault tolerance and monitoring

From this point onward, the behavior is largely application specific. YARN will not manage application progress, but it does perform a number of ongoing tasks. The AMLivelinessMonitor within the ResourceManager receives heartbeats from all ApplicationMasters, and if it determines that an ApplicationMaster has failed or stopped working, it will de-register the failed ApplicationMaster and release all its allocated containers. The ResourceManager will then reschedule the application a configurable number of times.
Alongside this process, the NMLivelinessMonitor within the ResourceManager receives heartbeats from the NodeManagers and keeps track of the health of each NodeManager in the cluster. Similar to the monitoring of ApplicationMaster health, a NodeManager will be marked as dead if no heartbeats are received from it for a default time of 10 minutes, after which all its allocated containers are marked as dead and the node is excluded from future resource allocation.
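The 10-minute expiry is a configurable default; assuming the Hadoop 2 property naming from yarn-default.xml, it can be overridden in yarn-site.xml (value in milliseconds):

```xml
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <!-- 600000 ms = 10 minutes, the default NodeManager expiry -->
  <value>600000</value>
</property>
```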
At the same time, the NodeManager will actively monitor the resource utilization of each allocated container and, for those resources not constrained by hard limits, will kill containers that exceed their resource allocation.
At a higher level, the YARN scheduler will always be looking to maximize cluster utilization within the constraints of the sharing policy being employed. As with Hadoop 1, this will allow low-priority applications to use more cluster resources if contention is low, but the scheduler will then preempt these additional containers (that is, request them to be terminated) if higher-priority applications are submitted.
The rest of the responsibility for application-level fault tolerance and progress monitoring must be implemented within the application code. For MapReduce on YARN, for example, all the management of task scheduling and retries is provided at the application level and is not in any way delivered by YARN.
Thinking in layers

These last statements may suggest that writing applications to run on YARN is a lot of work, and this is true. The YARN API is quite low-level and likely intimidating for most developers who just want to run some processing tasks on their data. If all we had was YARN, and every new Hadoop application had to have its own ApplicationMaster implemented, then YARN would not look quite as interesting as it does.
What makes the picture better is that, in general, the requirement isn't to implement each and every application on YARN, but instead to use it for a smaller number of processing frameworks that provide much friendlier interfaces. The first of these was MapReduce; with it hosted on YARN, the developer writes to the usual map and reduce interfaces and is largely unaware of the YARN mechanics.
But on the same cluster, another developer may be running a job that uses a different framework with significantly different processing characteristics, and YARN will manage both at the same time.
We'll give some more detail on several YARN processing models currently available, but they run the gamut from batch processing through low-latency queries to stream and graph processing and beyond.
As the YARN experience grows, however, there are a number of initiatives to make the development of these processing frameworks easier. On the one hand, there are higher-level interfaces, such as Cloudera Kitten (https://github.com/cloudera/kitten) or Apache Twill (http://twill.incubator.apache.org/), that give friendlier abstractions above the YARN APIs. Perhaps a more significant development model, though, is the emergence of frameworks that provide richer tools to more easily construct applications with a common general class of performance characteristics.
Execution models

We have mentioned different YARN applications having distinct processing characteristics, but an emerging pattern has seen their execution models in general being a source of differentiation. By this, we refer to how the YARN application lifecycle is managed, and we identify three main types: per-job, per-session, and always-on.
Batch processing, such as MapReduce on YARN, sees the lifecycle of the MapReduce framework tied to that of the submitted application. If we submit a MapReduce job, then the JobTracker and TaskTrackers that execute it are created specifically for the job and are terminated when the job completes. This works well for batch, but if we wish to provide a more interactive model, then the startup overhead of establishing the YARN application and all its resource allocations will severely impact the user experience if every command issued suffers this penalty. A more interactive, or session-based, lifecycle will see the YARN application start and then be available to service a number of submitted requests or commands. The YARN application terminates only when the session is exited.
Finally, we have the concept of long-running applications that process continuous data streams independent of any interactive input. For these, it makes most sense for the YARN application to start and continuously process data that is retrieved through some external mechanism. The application will only exit when explicitly shut down or if an abnormal situation occurs.
YARN in the real world – Computation beyond MapReduce

The previous discussions have been a little abstract, so in this section, we will explore a few existing YARN applications to see just how they use the framework and how they provide a breadth of processing capability. Of particular interest is how the YARN frameworks take very different approaches to resource management, I/O pipelining, and fault tolerance.
The problem with MapReduce

Until now, we have looked at MapReduce in terms of its API. MapReduce in Hadoop is more than that; up until Hadoop 2, it was the default execution engine for a number of tools, among which were Hive and Pig, which we will discuss in more detail later in this book. We have seen how MapReduce applications are, in fact, chains of jobs. This very aspect is one of the biggest pain points and constraining factors of the framework. MapReduce checkpoints data to HDFS for inter-job communication:
A chain of MapReduce jobs
At the end of each reduce phase, output is written to disk so that it can then be loaded by the mappers of the next job and used as its input. This I/O overhead introduces latency, especially when we have applications that require multiple passes on a dataset (hence multiple writes). Unfortunately, this type of iterative computation is at the core of many analytics applications.
Apache Tez and Apache Spark are two frameworks that address this problem by generalizing the MapReduce paradigm. We will briefly discuss them in the remainder of this section, alongside Apache Samza, a framework that takes an entirely different approach to real-time processing.
Tez

Tez (http://tez.apache.org) is a low-level API and execution engine focused on providing low-latency processing, and is being used as the basis of the latest evolution of Hive, Pig, and several other frameworks that implement standard join, filter, merge, and group operations. Tez is an implementation and evolution of a programming model presented by Microsoft in the 2009 Dryad paper (http://research.microsoft.com/en-us/projects/dryad/). Tez is a generalization of MapReduce as dataflow that strives to achieve fast, interactive computing by pipelining I/O operations over a queue for inter-process communication. This avoids the expensive writes to disk that affect MapReduce. The API provides primitives expressing dependencies between jobs as a DAG. The full DAG is then submitted to a planner that can optimize the execution flow. The same application depicted in the preceding diagram would be executed in Tez as a single job, with I/O pipelined from reducers to reducers without HDFS writes and subsequent reads by mappers. An example can be seen in the following diagram:
A Tez DAG is a generalization of MapReduce
The canonical WordCount example can be found at https://github.com/apache/incubator-tez/blob/master/tez-mapreduce-examples/src/main/java/org/apache/tez/mapreduce/examples/WordCount.java.
DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
   .addVertex(summerVertex)
   .addEdge(new Edge(tokenizerVertex, summerVertex,
       edgeConf.createDefaultEdgeProperty()));
Even though the graph topology (the dag object) can be expressed with a few lines of code, the boilerplate required to execute the job is considerable. This code handles many of the low-level scheduling and execution responsibilities, including fault tolerance. When Tez detects a failed task, it walks back up the processing graph to find the point from which to re-execute the failed tasks.
Hive-on-Tez

Hive 0.13 is the first high-profile project to use Tez as its execution engine. We'll discuss Hive in a lot more detail in Chapter 7, Hadoop and SQL, but for now we will just touch on how it's implemented on YARN.
Hive (http://hive.apache.org) is an engine for querying data stored on HDFS through standard SQL syntax. It has been enormously successful, as this type of capability greatly reduces the barriers to starting analytic exploration of data in Hadoop.
In Hadoop 1, Hive had no choice but to implement its SQL statements as a series of MapReduce jobs. When SQL is submitted to Hive, it generates the required MapReduce jobs behind the scenes and executes them on the cluster. This approach has two main drawbacks: there is a non-trivial startup penalty each time, and the constrained MapReduce model means that seemingly simple SQL statements are often translated into a lengthy series of multiple dependent MapReduce jobs. This is an example of the type of processing more naturally conceptualized as a DAG of tasks, as described earlier in this chapter.
Although some benefits are achieved when Hive executes its MapReduce jobs within YARN, the major benefits come in Hive 0.13, when the project is fully re-implemented using Tez. By exploiting the Tez APIs, which are focused on providing low-latency processing, Hive gains even more performance while making its code base simpler.
Since Tez treats its workloads as DAGs, which provide a much better fit for translated SQL queries, Hive on Tez can perform any SQL statement as a single job with maximized parallelism.
Tez helps Hive support interactive queries by providing an always-running service instead of requiring the application to be instantiated from scratch for each SQL submission. This is important because, even though queries that process huge data volumes will simply take some time, the goal is for Hive to become less of a batch tool and instead to be as much of an interactive tool as possible.
Apache Spark

Spark (http://spark.apache.org) is a processing framework that excels at iterative and near real-time processing. Created at UC Berkeley, it has been donated as an Apache project. Spark provides an abstraction that allows data in Hadoop to be viewed as a distributed data structure upon which a series of operations can be performed. The framework is based on the same concepts Tez draws inspiration from (Dryad), but excels with jobs that allow data to be held and processed in memory, and it can very efficiently schedule processing on the in-memory dataset across the cluster. Spark automatically controls replication of data across the cluster, ensuring that each element of the distributed dataset is held in memory on at least two machines, and provides replication-based fault tolerance somewhat akin to HDFS.

Spark started as a standalone system, but was ported to also run on YARN as of its 0.8 release. Spark is particularly interesting because, although its classic processing model is batch-oriented, it provides an interactive frontend through the Spark shell and near real-time processing of data streams through the Spark Streaming sub-project. Spark is different things to different people; it’s both a high-level API and an execution engine. At the time of writing, ports of Hive and Pig to Spark are in progress.
Apache Samza

Samza (http://samza.apache.org) is a stream-processing framework developed at LinkedIn and donated to the Apache Software Foundation. Samza processes conceptually infinite streams of data, which are seen by the application as a series of messages.

Samza currently integrates most tightly with Apache Kafka (http://kafka.apache.org), although it does have a pluggable architecture. Kafka itself is a messaging system that excels at large data volumes and provides a topic-based abstraction similar to most other messaging platforms, such as RabbitMQ. Publishers send messages to topics, and interested clients consume messages from the topics as they arrive. Kafka has multiple aspects that set it apart from other messaging platforms, but for this discussion, the most interesting one is that Kafka stores messages for a period of time, which allows messages in topics to be replayed. Topics are partitioned across multiple hosts, and partitions can be replicated across hosts to protect from node failure.

Samza builds its processing flow on its concept of streams, which, when using Kafka, map directly to Kafka partitions. A typical Samza job may listen to one topic for incoming messages, perform some transformations, and then write the output to a different topic. Multiple Samza jobs can then be composed to provide more complex processing structures.

As a YARN application, the Samza ApplicationMaster monitors the health of all running Samza tasks. If a task fails, then a replacement task is instantiated in a new container. Samza achieves fault tolerance by having each task write its progress to a new stream (again modeled as a Kafka topic), so any replacement task just needs to read the latest task state from this checkpoint topic and then replay the main message topic from the last processed position. Samza additionally offers support for local task state, which can be very useful for join and aggregation type workloads. This local state is again built atop the stream abstraction and hence is intrinsically resilient to host failure.
YARN-independent frameworks

An interesting point to note is that two of the preceding projects (Samza and Spark) run atop YARN but are not specific to YARN. Spark started out as a standalone service and has implementations for other schedulers, such as Apache Mesos, as well as the ability to run on Amazon EC2. Though Samza runs only on YARN today, its architecture is explicitly not YARN-specific, and there are discussions about providing realizations on other platforms.

If the YARN model of pushing as much as possible into the application has its downsides in implementation complexity, then this decoupling is one of its major benefits. An application written to use YARN need not be tied to it; by definition, all the functionality for the actual application logic and management is encapsulated within the application code and is independent of YARN or any other framework. This is, of course, not to say that designing a scheduler-independent application is a trivial task, but it’s now a tractable task; this was absolutely not the case pre-YARN.
YARN today and beyond

Though YARN has been used in production (at Yahoo! in particular) for some time, the final GA version was not released until late 2013. The interfaces to YARN were also somewhat fluid until quite late in the development cycle. Consequently, the fully forward-compatible YARN as of Hadoop 2.2 is still relatively new.
YARN is fully functional today, and the future direction will see extensions to its current capabilities. Perhaps most notable among these will be the ability to specify and control container resources on more dimensions. Currently, only location, memory, and CPU specifications are possible, and this will be expanded into areas such as storage and network I/O.

In addition, the ApplicationMaster currently has little control over the management of how containers are co-located or not. Finer-grained control here will allow the ApplicationMaster to specify policies around when containers may or may not be scheduled on the same node. In addition, the current resource allocation model is quite static, and it will be useful to allow an application to dynamically change the resources allocated to a running container.
Summary

This chapter explored how to process the large volumes of data that we discussed so much in the previous chapter. In particular, we covered:

- How MapReduce was the only processing model available in Hadoop 1, and its conceptual model
- The Java API to MapReduce, and how to use this to build some examples, from a word count to sentiment analysis of Twitter hashtags
- The details of how MapReduce is implemented in practice, and a walkthrough of the execution of a MapReduce job
- How Hadoop stores data, and the classes involved to represent input and output formats and record readers and writers
- The limitations of MapReduce that led to the development of YARN, opening the door to multiple computational models on the Hadoop platform
- The YARN architecture and how applications are built atop it
In the next two chapters, we will move away from strictly batch processing and delve into the world of near real-time and iterative processing, using two of the YARN-hosted frameworks we introduced in this chapter, namely Samza and Spark.
Chapter 4. Real-time Computation with Samza

The previous chapter discussed YARN, and frequently mentioned the breadth of computational models and processing frameworks outside of traditional batch-based MapReduce that it enables on the Hadoop platform. In this chapter and the next, we will explore two such projects in some depth, namely Apache Samza and Apache Spark. We chose these frameworks as they demonstrate the usage of stream and iterative processing and also provide interesting mechanisms to combine processing paradigms. In this chapter we will explore Samza and cover the following topics:

- What Samza is and how it integrates with YARN and other projects such as Apache Kafka
- How Samza provides a simple callback-based interface for stream processing
- How Samza composes multiple stream processing jobs into more complex workflows
- How Samza supports persistent local state within tasks, and how this greatly enriches what it can enable
Stream processing with Samza

To explore a pure stream-processing platform, we will use Samza, which is available at https://samza.apache.org. The code shown here was tested with the current 0.8 release, and we’ll keep the GitHub repository updated as the project continues to evolve.

Samza was built at LinkedIn and donated to the Apache Software Foundation in September 2013. Over the years, LinkedIn has built a model that conceptualizes much of their data as streams, and from this they saw the need for a framework that can provide a developer-friendly mechanism to process these ubiquitous data streams.
The team at LinkedIn realized that when it came to data processing, much of the attention went to the extreme ends of the spectrum: RPC workloads, which are usually implemented as synchronous systems with very low latency requirements, and batch systems, where the periodicity of jobs is often measured in hours. The ground in between has been relatively poorly supported, and this is the area that Samza is targeted at; most of its jobs expect response times ranging from milliseconds to minutes. They also assume that data arrives in a theoretically infinite stream of continuous messages.
How Samza works

There are numerous stream-processing systems in the open source world, such as Storm (http://storm.apache.org), and many other (mostly commercial) tools, such as complex event processing (CEP) systems, that also target processing on continuous message streams. These systems have many similarities but also some major differences.

For Samza, perhaps the most significant difference is its assumptions about message delivery. Many systems work very hard to reduce the latency of each message, sometimes with an assumption that the goal is to get the message into and out of the system as fast as possible. Samza assumes almost the opposite; its streams are persistent and resilient, and any message written to a stream can be re-read for a period of time after its first arrival. As we will see, this gives significant capability around fault tolerance. Samza also builds on this model to allow each of its tasks to hold resilient local state.

Samza is mostly implemented in Scala, even though its public APIs are written in Java. We’ll show Java examples in this chapter, but any JVM language can be used to implement Samza applications. We’ll discuss Scala when we explore Spark in the next chapter.
Samza high-level architecture

Samza views the world as having three main layers or components: the streaming, execution, and processing layers.

Samza architecture

The streaming layer provides access to the data streams, both for consumption and publication. The execution layer provides the means by which Samza applications can be run, have resources such as CPU and memory allocated, and have their lifecycles managed. The processing layer is the actual Samza framework itself, and its interfaces allow per-message functionality.

Samza provides pluggable interfaces to support the first two layers, though the current main implementations use Kafka for streaming and YARN for execution. We’ll discuss these further in the following sections.
Samza’s best friend – Apache Kafka

Samza itself does not implement the actual message stream. Instead, it provides an interface for a message system with which it then integrates. The default stream implementation is built upon Apache Kafka (http://kafka.apache.org), a messaging system also built at LinkedIn but now a successful and widely adopted open source project.

Kafka can be viewed as a message broker akin to something like RabbitMQ or ActiveMQ, but as mentioned earlier, it writes all messages to disk and scales out across multiple hosts as a core part of its design. Kafka uses the concept of a publish/subscribe model through named topics, to which producers write messages and from which consumers read messages. These work much like topics in any other messaging system.

Because Kafka writes all messages to disk, it might not have the same ultra-low-latency message throughput as other messaging systems, which focus on getting the message processed as fast as possible and don’t aim to store the message long term. Kafka can, however, scale exceptionally well, and its ability to replay a message stream can be extremely useful. For example, if a consuming client fails, then it can re-read messages from a known good point in time, or if a downstream algorithm changes, then traffic can be replayed to utilize the new functionality.

When scaling across hosts, Kafka partitions topics and supports partition replication for fault tolerance. Each Kafka message has a key associated with the message, and this is used to decide to which partition a given message is sent. This allows semantically useful partitioning; for example, if the key is a user ID in the system, then all messages for a given user will be sent to the same partition. Kafka guarantees ordered delivery within each partition, so that any client reading a partition can know that they are receiving all messages for each key in that partition in the order in which they are written by the producer.
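The key-to-partition mapping and the per-partition ordering guarantee can be sketched in a few lines (a toy model in Python; Kafka’s real partitioner is pluggable, and this is not its implementation):

```python
# Toy model of key-based topic partitioning; not Kafka's actual code.
import zlib

class PartitionedTopic:
    def __init__(self, num_partitions):
        # Each partition is an append-only, ordered log of messages.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # A stable hash of the key chooses the partition, so every
        # message with the same key lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, message):
        p = self.partition_for(key)
        self.partitions[p].append(message)
        return p

topic = PartitionedTopic(num_partitions=4)
for i in range(3):
    topic.send("user-42", "event-%d" % i)

# All of user-42's messages are in a single partition, in send order.
print(topic.partitions[topic.partition_for("user-42")])
```

Because ordering is only guaranteed within a partition, a consumer that needs all events for a key in order simply reads the partition that key hashes to.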
Samza periodically writes out checkpoints of the position up to which it has read in all the streams it is consuming. These checkpoint messages are themselves written to a Kafka topic. Thus, when a Samza job starts up, each task can re-read its checkpoint stream to know from which position in the stream to start processing messages. This means that, in effect, Kafka also acts as a buffer; if a Samza job crashes or is taken down for upgrade, no messages will be lost. Instead, the job will just restart from the last checkpointed position when it restarts. This buffer functionality is also important, as it makes it easier for multiple Samza jobs to run as part of a complex workflow. When Kafka topics are the points of coordination between the jobs, one job might consume a topic being written to by another; in such cases, Kafka can help smooth out issues caused by any given job running slower than others. Traditionally, the backpressure caused by a slow-running job can be a real issue in a system composed of multiple job stages, but Kafka as the resilient buffer allows each job to read and write at its own rate. Note that this is analogous to how multiple coordinating MapReduce jobs will use HDFS for similar purposes.
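The checkpoint mechanism amounts to resumable consumption of a replayable log, which a short sketch can make concrete (a toy Python model, not Samza’s implementation; the dictionary stands in for the checkpoint topic):

```python
# Toy model of checkpointed consumption from a persistent log; not Samza code.
log = ["msg-%d" % i for i in range(10)]  # the topic: a replayable, ordered log
checkpoints = {}                         # stands in for the checkpoint topic

def run_task(task_name, log, stop_after=None):
    """Resume from the last checkpointed offset and process messages,
    checkpointing progress as we go."""
    start = checkpoints.get(task_name, 0)
    end = len(log) if stop_after is None else min(stop_after, len(log))
    processed = []
    for offset in range(start, end):
        processed.append(log[offset])         # process the message
        checkpoints[task_name] = offset + 1   # record progress
    return processed

first = run_task("parser", log, stop_after=4)  # job "crashes" after 4 messages
second = run_task("parser", log)               # restart resumes at offset 4
print(first)   # → ['msg-0', 'msg-1', 'msg-2', 'msg-3']
print(second)  # → ['msg-4', 'msg-5', 'msg-6', 'msg-7', 'msg-8', 'msg-9']
```

Nothing is lost across the restart, and nothing before the checkpoint is re-read; the log absorbs the downtime, which is exactly the buffering role described above.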
Kafka provides at-least-once message delivery semantics; that is to say, any message written to Kafka is guaranteed to be available to a client of the particular partition. If a job fails between checkpoints, however, the messages processed since the last checkpoint will be reprocessed on restart, so it is possible for duplicate messages to be received by the client. There are application-specific mechanisms to mitigate this, and both Kafka and Samza have exactly-once semantics on their roadmaps, but for now it is something you should take into consideration when designing jobs.
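A common application-specific mitigation is to make consumption idempotent, for example by tracking which message offsets have already been applied. A minimal sketch (illustrative Python, not a Kafka or Samza facility):

```python
# Illustrative sketch: deduplicating an at-least-once stream by offset.
def process_once(messages, seen_offsets, apply_fn):
    """Apply apply_fn once per unique offset, skipping redeliveries."""
    for offset, payload in messages:
        if offset in seen_offsets:
            continue  # a redelivery, e.g. after a crash between checkpoints
        apply_fn(payload)
        seen_offsets.add(offset)

results, seen = [], set()
# Offsets 2 and 3 arrive twice, as can happen after a restart.
stream = [(1, "a"), (2, "b"), (3, "c"), (2, "b"), (3, "c"), (4, "d")]
process_once(stream, seen, results.append)
print(results)  # → ['a', 'b', 'c', 'd']
```

The trade-off is that the seen-offset state must itself be stored somewhere resilient, which is exactly the kind of local state Samza’s stream-backed storage is designed to hold.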
We won’t explain Kafka further beyond what we need to demonstrate Samza. If you are interested, check out its website and wiki; there is a lot of good information, including some excellent papers and presentations.
YARN integration

As mentioned earlier, just as Samza utilizes Kafka for its streaming layer implementation, it uses YARN for the execution layer. Just like any YARN application described in Chapter 3, Processing – MapReduce and Beyond, Samza provides both an implementation of an ApplicationMaster, which controls the lifecycle of the overall job, and implementations of Samza-specific functionality (called tasks) that are executed in each container. Just as Kafka partitions its topics, tasks are the mechanism by which Samza partitions its processing. Each Kafka partition will be read by a single Samza task. If a Samza job consumes multiple streams, then a given task will be the only consumer within the job for every stream partition assigned to it.

The Samza framework is told by each job configuration about the Kafka streams that are of interest to the job, and Samza continuously polls these streams to determine if any new messages have arrived. When a new message is available, the Samza task invokes a user-defined callback to process the message, a model that shouldn’t look too alien to MapReduce developers. This method is defined in an interface called StreamTask and has the following signature:
public void process(IncomingMessageEnvelope envelope,
                    MessageCollector collector,
                    TaskCoordinator coordinator)
This is the core of each Samza task and defines the functionality to be applied to received messages. The received message that is to be processed is wrapped in the IncomingMessageEnvelope; output messages can be written to the MessageCollector, and task management (such as shutdown) can be performed via the TaskCoordinator.

As mentioned, Samza creates one task instance for each partition in the underlying Kafka topic. Each YARN container will manage one or more of these tasks. The overall model, then, is of the Samza ApplicationMaster coordinating multiple containers, each of which is responsible for one or more StreamTask instances.
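The resulting fan-out of partitions to tasks to containers can be sketched as a simple assignment (illustrative Python with made-up numbers; Samza’s actual task grouping is more configurable than this round-robin):

```python
# Illustrative sketch: one task per Kafka partition, tasks spread
# round-robin across a fixed number of YARN containers.
def assign_tasks(num_partitions, num_containers):
    containers = {c: [] for c in range(num_containers)}
    for partition in range(num_partitions):
        task = "task-%d" % partition            # task <-> partition, 1:1
        containers[partition % num_containers].append(task)
    return containers

# Eight partitions spread over three containers.
print(assign_tasks(num_partitions=8, num_containers=3))
```

Because the task count is fixed by the partition count, adding containers redistributes tasks across more processes but never changes how the stream itself is divided.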
An independent model

Though we will talk exclusively of Kafka and YARN as the providers of Samza’s streaming and execution layers in this chapter, it is important to remember that the core Samza system uses well-defined interfaces for both the stream and execution systems. There are implementations of multiple stream sources (we’ll see one in the next section), and alongside the YARN support, Samza ships with a LocalJobRunner class. This alternative method of running tasks can execute StreamTask instances in-process on the JVM instead of requiring a full YARN cluster, which can sometimes be a useful testing and debugging tool. There is also discussion of Samza implementations on top of other cluster managers or virtualization frameworks.
Hello Samza!

Since not everyone already has ZooKeeper, Kafka, and YARN clusters ready to be used, the Samza team has created a wonderful way to get started with the product. Instead of just having a Hello world! program, there is a repository called Hello Samza, which is available by cloning git://git.apache.org/samza-hello-samza.git.

This will download and install dedicated instances of ZooKeeper, Kafka, and YARN (the three major prerequisites for Samza), creating a full stack upon which you can submit Samza jobs.

There are also a number of example Samza jobs that process data from Wikipedia edit notifications. Take a look at the page at http://samza.apache.org/startup/hello-samza/0.8/ and follow the instructions given there. (At the time of writing, Samza is still a relatively young project, and we’d rather not include direct information about the examples, which might be subject to change.)

For the remainder of the Samza examples in this chapter, we’ll assume you are either using the Hello Samza package to provide the necessary components (ZooKeeper/Kafka/YARN) or you have integrated with other instances of each.

This example has three different Samza jobs that build upon each other. The first reads the Wikipedia edits, the second parses these records, and the third produces statistics based on the processed records. We’ll build our own multistream workflow shortly.

One interesting point is the WikipediaFeed example here; it uses Wikipedia as its message source instead of Kafka. Specifically, it provides another implementation of the Samza SystemConsumer interface to allow Samza to read messages from an external system. As mentioned earlier, Samza is not tied to Kafka, and, as this example shows, building a new stream implementation does not have to be against a generic infrastructure component; it can be quite job-specific, as the work required is not huge.
Tip

Note that the default configuration for both ZooKeeper and Kafka will write system data to directories under /tmp, which will be what you have set if you use Hello Samza. Be careful if you are using a Linux distribution that purges the contents of this directory on a reboot. If you plan to carry out any significant testing, then it’s best to reconfigure these components to use less ephemeral locations. Change the relevant config files for each service; they are located in the service directory under the hello-samza/deploy directory.
Building a tweet parsing job

Let’s build our own simple job implementation to show the full code required. We’ll use parsing of the Twitter stream as the example in this chapter, and will later set up a pipe from our client consuming messages from the Twitter API into a Kafka topic. So, we need a Samza task that will read the stream of JSON messages, extract the actual tweet text, and write these to a topic of tweets.

Here is the main code from TwitterParseStreamTask.java, available at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterParseStreamTask.java:
package com.learninghadoop2.samza.tasks;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

public class TwitterParseStreamTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String msg = (String) envelope.getMessage();
        try {
            JSONParser parser = new JSONParser();
            Object obj = parser.parse(msg);
            JSONObject jsonObj = (JSONObject) obj;
            String text = (String) jsonObj.get("text");
            collector.send(new OutgoingMessageEnvelope(
                    new SystemStream("kafka", "tweets-parsed"), text));
        } catch (ParseException pe) {
            // Drop messages that are not valid JSON
        }
    }
}
The code is largely self-explanatory, but there are a few points of interest. We use JSON Simple (http://code.google.com/p/json-simple/) for our relatively straightforward JSON parsing requirements; we’ll also use it later in this book.

The IncomingMessageEnvelope and its corresponding OutgoingMessageEnvelope are the main structures concerned with the actual message data. Along with the message payload, the envelope will also have data concerning the system, topic name, and (optionally) partition number, in addition to other metadata. For our purposes, we just extract the message body from the incoming message and send the tweet text we extract from it, via a new OutgoingMessageEnvelope, to a topic called tweets-parsed within a system called kafka. Note the lowercase name; we’ll explain this in a moment.

The type of message in the IncomingMessageEnvelope is java.lang.Object. Samza does not currently enforce a data model and hence does not have strongly-typed message bodies. Therefore, when extracting the message contents, an explicit cast is usually required. Since each task needs to know the expected message format of the streams it processes, this is not the oddity that it may appear to be.
The configuration file

There was nothing in the preceding code that said where the messages came from; the framework just presents them to the StreamTask implementation, but obviously Samza needs to know from where to fetch messages. There is a configuration file for each job that defines this and more. The following can be found as twitter-parser.properties at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/twitter-parser.properties:
# Job
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=twitter-parser

# YARN
yarn.package.path=file:///home/gturkington/samza/build/distributions/learninghadoop2-0.1.tar.gz

# Task
task.class=com.learninghadoop2.samza.tasks.TwitterParseStreamTask
task.inputs=kafka.tweets
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
# Normally, this would be 3, but we have only one broker.
task.checkpoint.replication.factor=1

# Serializers
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory

# Systems
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.streams.tweets.samza.msg.serde=string
systems.kafka.streams.tweets-parsed.samza.msg.serde=string
systems.kafka.consumer.zookeeper.connect=localhost:2181/
systems.kafka.consumer.auto.offset.reset=largest
systems.kafka.producer.metadata.broker.list=localhost:9092
systems.kafka.producer.producer.type=sync
systems.kafka.producer.batch.num.messages=1
This may look like a lot, but for now we’ll just consider the high-level structure and some key settings. The job section sets YARN as the execution framework (as opposed to the local job runner class) and gives the job a name. If we were to run multiple copies of this same job, we would also give each copy a unique ID. The task section specifies the implementation class of our task and also the name of the streams for which it should receive messages. Serializers tell Samza how to read and write messages to and from the stream, and the systems section defines systems by name and associates implementation classes with them.
In our case, we define only one system called kafka, and we refer to this system when sending our message in the preceding task. Note that this name is arbitrary, and we could call it whatever we want. Obviously, for clarity it makes sense to call the Kafka system by that name, but this is only a convention. In particular, sometimes you will need to give different names when dealing with multiple systems that are similar to each other, or sometimes even when treating the same system differently in different parts of a configuration file.
In this section, we will also specify the SerDe to be associated with the streams used by the task. Recall that Kafka messages have a body and an optional key that is used to determine to which partition the message is sent. Samza needs to know how to treat the contents of the keys and messages for these streams. Samza has support to treat these as raw bytes or specific types such as string, integer, and JSON, as mentioned earlier.
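A serde is simply a paired serializer and deserializer applied to raw message bytes on the way into and out of a stream. Conceptually (a Python sketch of the idea, not Samza’s StringSerdeFactory or its JSON serde):

```python
# Conceptual sketch of serdes; not Samza's implementation.
import json

class StringSerde:
    def to_bytes(self, obj):
        return obj.encode("utf-8")
    def from_bytes(self, data):
        return data.decode("utf-8")

class JsonSerde:
    def to_bytes(self, obj):
        return json.dumps(obj).encode("utf-8")
    def from_bytes(self, data):
        return json.loads(data.decode("utf-8"))

# The framework looks up the serde configured for a stream and applies it
# to each message body (and, separately, to each key if one is configured).
serde = JsonSerde()
wire_bytes = serde.to_bytes({"text": "hello samza"})
print(serde.from_bytes(wire_bytes)["text"])  # → hello samza
```

Declaring the serde per stream in the configuration is what lets the task code receive a ready-to-cast object in process() rather than raw bytes.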
The rest of the configuration will be mostly unchanged from job to job, as it includes things such as the location of the ZooKeeper ensemble and Kafka clusters, and specifies how streams are to be checkpointed. Samza allows a wide variety of customizations, and the full configuration options are detailed at http://samza.apache.org/learn/documentation/0.8/jobs/configuration-table.html.
Getting Twitter data into Kafka

Before we run the job, we do need to get some tweets into Kafka. Let’s create a new Kafka topic called tweets to which we’ll write the tweets.

To perform this and other Kafka-related operations, we’ll use command-line tools located within the bin directory of the Kafka distribution. If you are running a job from within the stack created as part of the Hello Samza application, this will be deploy/kafka/bin.

kafka-topics.sh is a general-purpose tool that can be used to create, update, and describe topics. Most of its usages require arguments to specify the location of the local ZooKeeper cluster (where Kafka brokers store their details) and the name of the topic to be operated upon. To create a new topic, run the following command:
$ kafka-topics.sh --zookeeper localhost:2181 --create --topic tweets --partitions 1 --replication-factor 1
This creates a topic called tweets and explicitly sets its number of partitions and replication factor to 1. This is suitable if you are running Kafka within a local test VM, but clearly production deployments will have more partitions to scale out the load across multiple brokers, and a replication factor of at least 2 to provide fault tolerance.

Use the list option of the kafka-topics.sh tool to simply show the topics in the system, or use describe to get more detailed information on specific topics:
$ kafka-topics.sh --zookeeper localhost:2181 --describe --topic tweets
Topic:tweets    PartitionCount:1    ReplicationFactor:1    Configs:
    Topic: tweets    Partition: 0    Leader: 0    Replicas: 0    Isr: 0
The multiple 0s are possibly confusing, as these are labels and not counts. Each broker in the system has an ID that usually starts from 0, as do the partitions within each topic. The preceding output is telling us that the topic called tweets has a single partition with ID 0, the broker acting as the leader for that partition is broker 0, and the set of in-sync replicas (ISR) for this partition is again only broker 0. This last value is particularly important when dealing with replication.

We’ll use our Python utility from previous chapters to pull JSON tweets from the Twitter feed, and then use a Kafka CLI message producer to write the messages to a Kafka topic. This isn’t a terribly efficient way of doing things, but it is suitable for illustration purposes. Assuming our Python script is in our home directory, run the following command from within the Kafka bin directory:
$ python ~/stream.py -j | ./kafka-console-producer.sh --broker-list localhost:9092 --topic tweets
This will run indefinitely, so be careful not to leave it running overnight on a test VM with little disk space; not that the authors have ever done such a thing.
Running a Samza job

To run a Samza job, we need our code to be packaged, along with the Samza components required to execute it, into a .tar.gz archive that will be read by the YARN NodeManager. This is the file referred to by the yarn.package.path property in the Samza task configuration file.
When using the single-node Hello Samza, we can just use an absolute path on the filesystem, as seen in the previous configuration example. For jobs on larger YARN grids, the easiest way is to put the package onto HDFS and refer to it by an hdfs:// URI, or on a web server (Samza provides a mechanism to allow YARN to read the file via HTTP).

Because Samza has multiple subcomponents, and each subcomponent has its own dependencies, the full YARN package can end up containing a lot of JAR files (over 100!). In addition, you need to include your custom code for the Samza task as well as some scripts from within the Samza distribution. It’s not something to be done by hand. In the sample code for this chapter, found at https://github.com/learninghadoop2/book-examples/tree/master/ch4, we have set up a sample structure to hold the code and config files, and provided some automation via Gradle to build the necessary task archive and start the tasks.

When in the root of the Samza example code directory for this book, perform the following command to build a single file archive containing all the classes of this chapter compiled together and bundled with all the other required files:
$ ./gradlew targz
This Gradle task will not only create the necessary .tar.gz archive in the build/distributions directory, but will also store an expanded version of the archive under build/samza-package. This will be useful, as we will use Samza scripts stored in the bin directory of the archive to actually submit the task to YARN.
So now, let’s run our job. We need to have file paths for two things: the Samza run-job.sh script to submit a job to YARN, and the configuration file for our job. Since our created job package has all the compiled tasks bundled together, we tell Samza which task to run by using a configuration file that names the specific task implementation class in the task.class property. To actually run the task, we can run the following command from within the exploded project archive under build/samza-package:
$ bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=config/twitter-parser.properties
For convenience, we added a Gradle task to run this job:

$ ./gradlew runTwitterParser

To see the output of the job, we’ll use the Kafka CLI client to consume messages:
$ ./kafka-console-consumer.sh --zookeeper localhost:2181 --topic tweets-parsed
Youshouldseeacontinuousstreamoftweetsappearingontheclient.
Note: We did not explicitly create the topic called tweets-parsed. Kafka allows topics to be created dynamically when either a producer or consumer tries to use the topic. In many situations, though, the default partitioning and replication values may not be suitable, and explicit topic creation will be required to ensure these critical topic attributes are correctly defined.
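Where the defaults are not appropriate, the topic can be created explicitly before any producer or consumer touches it. A minimal sketch using the standard Kafka CLI against a local ZooKeeper (the partition and replication values shown are illustrative, not recommendations):

```shell
# Create the topic up front so partitioning and replication are under
# our control rather than the broker defaults (values are examples only).
$ ./kafka-topics.sh --create --zookeeper localhost:2181 \
    --partitions 4 --replication-factor 2 --topic tweets-parsed
```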
Samza and HDFS
You may have noticed that we just mentioned HDFS for the first time in our discussion of Samza. Though Samza integrates tightly with YARN, it has no direct integration with HDFS. At a logical level, Samza's stream-implementing systems (such as Kafka) provide the storage layer that is usually provided by HDFS for traditional Hadoop workloads. In the terminology of Samza's architecture, as described earlier, YARN is the execution layer in both models; but whereas Samza uses a streaming layer for its source and destination data, frameworks such as MapReduce use HDFS. This is a good example of how YARN enables alternative computational models that not only process data very differently from batch-oriented MapReduce, but can also use entirely different storage systems for their source data.
Windowing functions
It's frequently useful to generate some data based on the messages received on a stream over a certain time window. An example of this may be to record the top n attribute values measured every minute. Samza supports this through the WindowableTask interface, which has the following single method to be implemented:
public void window(MessageCollector collector, TaskCoordinator coordinator);
This should look similar to the process method in the StreamTask interface. However, because the method is called on a time schedule, its invocation is not associated with a received message. The MessageCollector and TaskCoordinator parameters are still there, however, as most windowable tasks will produce output messages and may also wish to perform some task management actions.
Let's take our previous task and add a window function that will output the number of tweets received in each windowed time period. This is the main class implementation of TwitterStatisticsStreamTask.java, found at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterStatisticsStreamTask.java:
public class TwitterStatisticsStreamTask implements StreamTask, WindowableTask {
    private int tweets = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        tweets++;
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-stats"), "" + tweets));
        // Reset counts after windowing.
        tweets = 0;
    }
}
The TwitterStatisticsStreamTask class has a private member variable called tweets that is initialized to 0 and incremented in every call to the process method. We therefore know that this variable will be incremented for each message passed to the task from the underlying stream implementation. Each Samza container has a single thread running in a loop that executes the process and window methods on all the tasks within the container. This means that we do not need to guard instance variables against concurrent modification; only one method on each task within a container will be executing at any time.
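The single-threaded execution model can be sketched in plain Java (this is an illustrative toy, not Samza's actual container code): one loop delivers a batch of messages and then fires the window callback, so the task's instance variables are never touched by two threads at once:

```java
import java.util.List;

// Toy sketch of the container's single-threaded loop: process() and
// window() are always invoked from one thread, so task instance
// variables need no synchronization.
class CountingTask {
    private int tweets = 0;

    void process(String message) {
        tweets++;
    }

    String window() {
        String out = "Number of tweets: " + tweets;
        tweets = 0; // reset counts after windowing
        return out;
    }
}

class ContainerLoop {
    // Deliver a batch of messages, then fire the window callback,
    // mimicking task.window.ms expiring.
    static String runOnce(CountingTask task, List<String> batch) {
        for (String msg : batch) {
            task.process(msg);
        }
        return task.window();
    }
}
```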
In our window method, we send a message to a new topic we call tweet-stats and then reset the tweets variable. This is pretty straightforward, and the only missing piece is how Samza will know when to call the window method. We specify this in the configuration file:
task.window.ms=5000
This tells Samza to call the window method on each task instance every 5 seconds. To run the window task, there is a Gradle task:
$ ./gradlew runTwitterStatistics
If we use kafka-console-consumer.sh to listen on the tweet-stats stream now, we will see output like the following:
Number of tweets: 5012
Number of tweets: 5398
Note: The term window in this context refers to Samza conceptually slicing the stream of messages into time ranges and providing a mechanism to perform processing at each range boundary. Samza does not directly provide an implementation of the other use of the term with regard to sliding windows, where a series of values is held and processed over time. However, the WindowableTask interface does provide the plumbing to implement such sliding windows.
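As a sketch of the plumbing the note mentions, a task could keep the last few per-window counts itself and derive a sliding-window aggregate at each window boundary; the class below is a hypothetical helper, not part of the Samza API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: building a sliding average over the last N
// per-window counts on top of the window() callback, which Samza
// itself does not provide out of the box.
class SlidingWindowAverage {
    private final Deque<Integer> lastCounts = new ArrayDeque<>();
    private final int windowSlots;

    SlidingWindowAverage(int windowSlots) {
        this.windowSlots = windowSlots;
    }

    // Called once per window() invocation with that window's count;
    // returns the average over the retained slots.
    double onWindow(int countThisWindow) {
        lastCounts.addLast(countThisWindow);
        if (lastCounts.size() > windowSlots) {
            lastCounts.removeFirst(); // slide: drop the oldest slot
        }
        int sum = 0;
        for (int c : lastCounts) {
            sum += c;
        }
        return (double) sum / lastCounts.size();
    }
}
```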
Multijob workflows
As we saw with the Hello Samza examples, some of the real power of Samza comes from the composition of multiple jobs, and we'll use a text cleanup job to start demonstrating this capability.
In the following section, we'll perform tweet sentiment analysis by comparing tweets with a set of English positive and negative words. Simply applying this to the raw Twitter feed will give very patchy results, however, given how richly multilingual the Twitter stream is. We also need to consider things such as text cleanup, capitalization, frequent contractions, and so on. As anyone who has worked with a non-trivial dataset knows, the act of making the data fit for processing is usually where a large amount of the effort (often the majority!) goes.
So before we try to detect tweet sentiment, let's do some simple text cleanup; in particular, we'll select only English-language tweets, and we will force their text to lowercase before sending them to a new output stream.
Language detection is a difficult problem, and for this we'll use a feature of the Apache Tika library (http://tika.apache.org). Tika provides a wide array of functionality to extract text from various sources and then to extract further information from that text. If you are using our Gradle scripts, the Tika dependency is already specified and will automatically be included in the generated job package. If building through another mechanism, you will need to download the Tika JAR file from the homepage and add it to your YARN job package. The following code can be found as TextCleanupStreamTask.java at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TextCleanupStreamTask.java:
public class TextCleanupStreamTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        String rawtext = (String) envelope.getMessage();
        if ("en".equals(detectLanguage(rawtext))) {
            collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "english-tweets"), rawtext.toLowerCase()));
        }
    }

    private String detectLanguage(String text) {
        LanguageIdentifier li = new LanguageIdentifier(text);
        return li.getLanguage();
    }
}
This task is quite straightforward, thanks to the heavy lifting performed by Tika. We create a utility method that wraps the creation and use of a Tika LanguageIdentifier, and then we call this method on the message body of each incoming message in the process method. We only write to the output stream if the result of applying this utility method is "en", that is, the two-letter code for English.
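The shape of this logic can be exercised in isolation by abstracting the detector behind a function, with a stub standing in for Tika's LanguageIdentifier (the class below is an illustrative sketch, not the book's task code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of the cleanup logic with the language detector abstracted
// out; in the real task, Tika's LanguageIdentifier plays this role.
class TextCleanup {
    private final Function<String, String> detectLanguage;
    final List<String> outputStream = new ArrayList<>();

    TextCleanup(Function<String, String> detectLanguage) {
        this.detectLanguage = detectLanguage;
    }

    void process(String rawText) {
        // Forward only English text, lowercased, to the output stream.
        if ("en".equals(detectLanguage.apply(rawText))) {
            outputStream.add(rawText.toLowerCase());
        }
    }
}
```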
The configuration file for this task is similar to that of our previous task, with specific values for the task name and implementing class. It is in the repository as textcleanup.properties at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/textcleanup.properties. We also need to specify the input stream:
task.inputs=kafka.tweets-parsed
This is important because we need this task to consume the tweet text that was extracted by the earlier task, avoiding duplication of the JSON parsing logic, which is best encapsulated in one place. We can run this task with the following command:
$ ./gradlew runTextCleanup
Now, we can run all three tasks together; TwitterParseStreamTask and TwitterStatisticsStreamTask will consume the raw tweet stream, while TextCleanupStreamTask will consume the output of TwitterParseStreamTask.
Data processing on streams
Tweet sentiment analysis
We'll now implement a task to perform tweet sentiment analysis, similar to what we did using MapReduce in the previous chapter. This will also show us a useful mechanism offered by Samza: bootstrap streams.
Bootstrap streams
Generally speaking, most stream-processing jobs (in Samza or another framework) start processing messages that arrive after they start up and ignore historical messages. Because of its concept of replayable streams, Samza doesn't have this limitation.
In our sentiment analysis job, we have two sets of reference terms: positive and negative words. Though we've not shown it so far, Samza can consume messages from multiple streams; the underlying machinery will poll all named streams and provide their messages, one at a time, to the process method. We can therefore create streams for the positive and negative words and push the datasets onto those streams. At first glance, we could plan to rewind these two streams to their earliest point and read tweets as they arrive. The problem is that Samza doesn't guarantee the ordering of messages across multiple streams, and even though there is a mechanism to give streams higher priority, we can't assume that all negative and positive words will be processed before the first tweet arrives.
For such scenarios, Samza has the concept of bootstrap streams. If a task has any bootstrap streams defined, it will read these streams from the earliest offset until they are fully processed (technically, it will read the streams until they are caught up, so any new words sent to either stream will be treated without priority and will arrive interleaved between tweets).
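The bootstrap contract can be illustrated with a toy sketch (plain Java, not Samza internals): streams flagged as bootstrap are drained first, and only then do messages from the remaining streams flow to the task:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the bootstrap-stream contract: streams flagged
// as bootstrap are drained from their earliest offset before any message
// from the remaining streams is delivered to process().
class BootstrapReader {
    static List<String> deliveryOrder(Map<String, List<String>> streams,
                                      List<String> bootstrapStreams) {
        List<String> delivered = new ArrayList<>();
        // First, fully catch up on every bootstrap stream.
        for (String name : bootstrapStreams) {
            delivered.addAll(streams.get(name));
        }
        // Only then deliver messages from the other streams.
        for (Map.Entry<String, List<String>> e : streams.entrySet()) {
            if (!bootstrapStreams.contains(e.getKey())) {
                delivered.addAll(e.getValue());
            }
        }
        return delivered;
    }
}
```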
We’llnowcreateanewjobcalledTweetSentimentStreamTaskthatreadstwobootstrapstreams,collectstheircontentsintoHashMaps,gathersrunningcountsforsentimenttrends,andusesawindowfunctiontooutputthisdataatintervals.Thiscodecanbefoundathttps://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterSentimentStreamTask.java
public class TwitterSentimentStreamTask implements StreamTask, WindowableTask {
    private Set<String> positiveWords = new HashSet<String>();
    private Set<String> negativeWords = new HashSet<String>();
    private int tweets = 0;
    private int positiveTweets = 0;
    private int negativeTweets = 0;
    private int maxPositive = 0;
    private int maxNegative = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        if ("positive-words".equals(envelope.getSystemStreamPartition().getStream())) {
            positiveWords.add((String) envelope.getMessage());
        } else if ("negative-words".equals(envelope.getSystemStreamPartition().getStream())) {
            negativeWords.add((String) envelope.getMessage());
        } else if ("english-tweets".equals(envelope.getSystemStreamPartition().getStream())) {
            tweets++;
            int positive = 0;
            int negative = 0;
            String words = (String) envelope.getMessage();
            for (String word : words.split(" ")) {
                if (positiveWords.contains(word)) {
                    positive++;
                } else if (negativeWords.contains(word)) {
                    negative++;
                }
            }
            if (positive > negative) {
                positiveTweets++;
            }
            if (negative > positive) {
                negativeTweets++;
            }
            if (positive > maxPositive) {
                maxPositive = positive;
            }
            if (negative > maxNegative) {
                maxNegative = negative;
            }
        }
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        String msg = String.format("Tweets: %d Positive: %d Negative: %d MaxPositive: %d MaxNegative: %d",
                tweets, positiveTweets, negativeTweets, maxPositive, maxNegative);
        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-sentiment-stats"), msg));
        // Reset counts after windowing.
        tweets = 0;
        positiveTweets = 0;
        negativeTweets = 0;
        maxPositive = 0;
        maxNegative = 0;
    }
}
In this task, we add a number of private member variables that we use to keep a running count of the overall number of tweets, how many were positive and negative, and the maximum positive and negative counts seen in a single tweet.
This task consumes from three Kafka topics. Even though we will configure two of them to be used as bootstrap streams, they are all still exactly the same type of Kafka topic from which messages are received; the only difference with bootstrap streams is that we tell Samza to use Kafka's rewinding capabilities to fully re-read each message in the stream. For the other stream of tweets, we just start reading new messages as they arrive.
As hinted earlier, if a task subscribes to multiple streams, the same process method will receive messages from each stream. That is why we use envelope.getSystemStreamPartition().getStream() to extract the stream name of each given message and then act accordingly. If a message is from either of the bootstrapped streams, we add its contents to the appropriate set. For a tweet message, we break it into its constituent words, test each word for positive or negative sentiment, and then update the counts accordingly. As you can see, this task doesn't output the received tweets to another topic.
Since we don't perform any direct processing on the tweets, there is no point in doing so; any other task that wishes to consume the messages can just subscribe directly to the incoming tweet stream. However, a possible modification could be to write positive and negative sentiment tweets to dedicated streams.
The window method outputs a series of counts and then resets the variables (as it did before). Note that Samza does have support to directly expose metrics through JMX, which could be a better fit for such simple windowing examples. However, we won't have space to cover that aspect of the project in this book.
To run this job, we need to modify the configuration file by setting the job and task names as usual, but we now also need to specify multiple input streams:
task.inputs=kafka.english-tweets,kafka.positive-words,kafka.negative-words
Then, we need to specify that two of our streams are bootstrap streams that should be read from the earliest offset. Specifically, we set three properties for each of these streams: we mark them as bootstrap streams, that is, to be fully read before other streams, and we specify that the offset on each stream is to be reset to the oldest (first) position:
systems.kafka.streams.positive-words.samza.bootstrap=true
systems.kafka.streams.positive-words.samza.reset.offset=true
systems.kafka.streams.positive-words.samza.offset.default=oldest
systems.kafka.streams.negative-words.samza.bootstrap=true
systems.kafka.streams.negative-words.samza.reset.offset=true
systems.kafka.streams.negative-words.samza.offset.default=oldest
We can run this job with the following command:
$ ./gradlew runTwitterSentiment
After starting the job, look at the output of the messages on the tweet-sentiment-stats topic.
The sentiment detection job will bootstrap the positive and negative word streams before reading any of our newly detected lowercase English tweets.
With the sentiment detection job in place, we can now visualize our four collaborating jobs as shown in the following diagram:
Bootstrap streams and collaborating tasks
Tip: To correctly run the jobs, it may seem necessary to start the JSON parser job, followed by the cleanup job, before finally starting the sentiment job, but this is not the case. Any unread messages remain buffered in Kafka, so it doesn't matter in which order the jobs of a multi-job workflow are started. Of course, the sentiment job will output counts of 0 tweets until it starts receiving data, but nothing will break if a stream job starts before those it depends on.
Stateful tasks
The final aspect of Samza that we will explore is how it allows the tasks processing stream partitions to have persistent local state. In the previous example, we used private variables to keep track of running totals, but sometimes it is useful for a task to have richer local state. An example could be performing a logical join on two streams, where it is useful to build up a state model from one stream and compare it with the other.
Note: Samza can utilize its concept of partitioned streams to greatly optimize the act of joining streams. If each stream to be joined uses the same partition key (for example, a user ID), then each task consuming these streams will receive all messages associated with each ID across all the streams.
Samza has another abstraction, similar to its notion of the framework that manages its jobs and that which implements its tasks: it defines an abstract key-value store that can have multiple concrete implementations. Samza uses existing open source projects for the on-disk implementations, using LevelDB as of v0.7 and adding RocksDB as of v0.8. There is also an in-memory store that does not persist the key-value data, but that may be useful in testing or potentially in very specific production workloads.
Each task can write to this key-value store, and Samza manages its persistence to the local implementation. To support persistent state, the store is also modeled as a stream, and all writes to the store are pushed into a backing stream. If a task fails, then on restart it can recover the state of its local key-value store by replaying the messages in the backing topic. An obvious concern here is the number of messages that need to be replayed; however, when using Kafka, for example, log compaction retains only the latest update for each key in the topic.
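The changelog mechanism can be illustrated with a toy sketch (not Samza's actual store implementation): every put is mirrored to a backing map that stands in for a compacted Kafka topic, and recovery is simply a replay of that map:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the changelog idea: every put is mirrored to a log
// topic, and a restarted task rebuilds its local store by replaying
// that log. Kafka's log compaction keeps only the latest value per
// key, which is modeled here by using a map as the "compacted" log.
class ChangelogBackedStore {
    private final Map<String, Integer> local = new HashMap<>();
    private final Map<String, Integer> changelog; // compacted backing topic

    ChangelogBackedStore(Map<String, Integer> changelog) {
        this.changelog = changelog;
    }

    void put(String key, Integer value) {
        local.put(key, value);
        changelog.put(key, value); // mirror the write to the changelog
    }

    Integer get(String key) {
        return local.get(key);
    }

    // Simulate recovery after a task failure: rebuild local state
    // from the compacted changelog.
    static ChangelogBackedStore restoreFrom(Map<String, Integer> changelog) {
        ChangelogBackedStore store = new ChangelogBackedStore(changelog);
        store.local.putAll(changelog);
        return store;
    }
}
```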
We’llmodifyourprevioustweetsentimentexampletoaddalifetimecountofthemaximumpositiveandnegativesentimentseeninanytweet.ThefollowingcodecanbefoundasTwitterStatefulSentimentStateTask.javaathttps://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterStatefulSentimentStreamTask.javaNotethattheprocessmethodisthesameasTwitterSentimentStateTask.java,sowehaveomittedithereforspacereasons:
public class TwitterStatefulSentimentStreamTask implements StreamTask, WindowableTask, InitableTask {
    private Set<String> positiveWords = new HashSet<String>();
    private Set<String> negativeWords = new HashSet<String>();
    private int tweets = 0;
    private int positiveTweets = 0;
    private int negativeTweets = 0;
    private int maxPositive = 0;
    private int maxNegative = 0;
    private KeyValueStore<String, Integer> store;

    @SuppressWarnings("unchecked")
    @Override
    public void init(Config config, TaskContext context) {
        this.store = (KeyValueStore<String, Integer>) context.getStore("tweet-store");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        ...
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        Integer lifetimeMaxPositive = store.get("lifetimeMaxPositive");
        Integer lifetimeMaxNegative = store.get("lifetimeMaxNegative");
        if ((lifetimeMaxPositive == null) || (maxPositive > lifetimeMaxPositive)) {
            lifetimeMaxPositive = maxPositive;
            store.put("lifetimeMaxPositive", lifetimeMaxPositive);
        }
        if ((lifetimeMaxNegative == null) || (maxNegative > lifetimeMaxNegative)) {
            lifetimeMaxNegative = maxNegative;
            store.put("lifetimeMaxNegative", lifetimeMaxNegative);
        }
        String msg = String.format(
                "Tweets: %d Positive: %d Negative: %d MaxPositive: %d MaxNegative: %d LifetimeMaxPositive: %d LifetimeMaxNegative: %d",
                tweets, positiveTweets, negativeTweets, maxPositive,
                maxNegative, lifetimeMaxPositive, lifetimeMaxNegative);
        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-stateful-sentiment-stats"), msg));
        // Reset counts after windowing.
        tweets = 0;
        positiveTweets = 0;
        negativeTweets = 0;
        maxPositive = 0;
        maxNegative = 0;
    }
}
This class implements a new interface called InitableTask. It has a single method called init, which is used when a task needs to configure aspects of its state before it begins execution. We use the init() method here to retrieve an instance of the KeyValueStore class and store it in a private member variable.
KeyValueStore, as the name suggests, provides a familiar put/get interface. In this case, we specify that the keys are of type String and the values are Integers. In our window method, we retrieve any previously stored values for the maximum positive and negative sentiment and, if the count in the current window is higher, update the store accordingly. Then, we just output the results of the window method as before.
As you can see, the user does not need to deal with the details of either the local or remote persistence of the KeyValueStore instance; this is all handled by Samza. The efficiency of the mechanism also makes it tractable for tasks to hold sizeable amounts of local state, which can be particularly valuable in cases such as long-running aggregations or stream joins.
The configuration file for the job can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/twitter-stateful-sentiment.properties. It needs to have a few entries added, which are as follows:
stores.tweet-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
stores.tweet-store.changelog=kafka.twitter-stats-state
stores.tweet-store.key.serde=string
stores.tweet-store.msg.serde=integer
The first line specifies the implementation class for the store, the second line specifies the Kafka topic to be used for persistent state, and the last two lines specify the types of the store key and value.
To run this job, use the following command:
$ ./gradlew runTwitterStatefulSentiment
For convenience, the following command will start up four jobs: the JSON parser, the text cleanup, the statistics job, and the stateful sentiment job:
$ ./gradlew runTasks
Samza is a pure stream-processing system that provides pluggable implementations of its storage and execution layers. The most commonly used plugins are Kafka and YARN, which demonstrate how Samza can integrate tightly with Hadoop YARN while using a completely different storage layer. Samza is still a relatively new project, and its current features are only a subset of what is envisaged. It is recommended to consult its web page for the latest information on its current status.
Summary
This chapter focused much more on what can be done on Hadoop 2, and in particular on YARN, than on the details of Hadoop internals. This is almost certainly a good thing, as it demonstrates that Hadoop is realizing its goal of becoming a much more flexible and generic data processing platform that is no longer tied to batch processing. In particular, we highlighted how Samza shows that the processing frameworks implemented on YARN can innovate and enable functionality vastly different from that available in Hadoop 1.
We saw how Samza sits at the opposite end of the latency spectrum from batch processing and enables processing of individual messages as they arrive.
We also saw how Samza provides a callback mechanism that MapReduce developers will find familiar, but uses it for a very different processing model, and we discussed the ways in which Samza utilizes YARN as its main execution framework, implementing the model described in Chapter 3, Processing – MapReduce and Beyond.
In the next chapter, we will switch gears and explore Apache Spark. Though it has a very different data model than Samza, we'll see that it also has an extension that supports processing of real-time data streams, including the option of Kafka integration. However, the two projects are so different that they are complementary rather than in competition.
Chapter 5. Iterative Computation with Spark
In the previous chapter, we saw how Samza can enable near real-time stream data processing within Hadoop. This is quite a step away from the traditional batch processing model of MapReduce, but it still keeps with the model of providing a well-defined interface against which business logic tasks can be implemented. In this chapter, we will explore Apache Spark, which can be viewed both as a framework on which applications can be built and as a processing framework in its own right. Not only are applications being built on Spark, but entire components within the Hadoop ecosystem are also being reimplemented to use Spark as their underlying processing framework. In particular, we will cover the following topics:
What Spark is and how its core system can run on YARN
The data model provided by Spark that enables hugely scalable and highly efficient data processing
The breadth of additional Spark components and related projects
It's important to note up front that although Spark has its own mechanism to process streaming data, this is but one part of what Spark has to offer. It's best to think of it as a much broader initiative.
Apache Spark
Apache Spark (https://spark.apache.org/) is a data processing framework based on a generalization of MapReduce. It was originally developed by the AMPLab at UC Berkeley (https://amplab.cs.berkeley.edu/). Like Tez, Spark acts as an execution engine that models data transformations as DAGs and strives to eliminate the I/O overhead of MapReduce in order to perform iterative computation at scale. While Tez's main goal was to provide a faster execution engine for MapReduce on Hadoop, Spark has been designed both as a standalone framework and as an API for application development. The system is designed to perform general-purpose in-memory data processing and stream workflows, as well as interactive and iterative computation.
Spark is implemented in Scala, a statically typed programming language for the Java VM, and exposes native programming interfaces for Java and Python in addition to Scala itself. Note that though Java code can call the Scala interface directly, there are some aspects of the type system that make such code pretty unwieldy, and hence we use the native Java API.
Scala ships with an interactive shell similar to those of Ruby and Python; this allows users to run Spark interactively from the interpreter to query any dataset.
The Scala interpreter operates by compiling a class for each line typed by the user, loading it into the JVM, and invoking a function on it. This class includes a singleton object that contains the variables or functions on that line and runs the line's code in an initialize method. In addition to its rich programming interfaces, Spark is becoming established as an execution engine, with popular tools of the Hadoop ecosystem (such as Pig and Hive) being ported to the framework.
Cluster computing with working sets
Spark's architecture is centered around the concept of Resilient Distributed Datasets (RDDs): read-only collections of Scala objects that are partitioned across a set of machines and can persist in memory. This abstraction was proposed in a 2012 research paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, which can be found at https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf.
A Spark application consists of a driver program that executes parallel operations on a cluster of workers, which are long-lived processes that can store data partitions in memory; the driver dispatches functions to the workers, where they run as parallel tasks, as shown in the following diagram:
Spark cluster architecture
Processes are coordinated via a SparkContext instance. SparkContext connects to a resource manager (such as YARN), requests executors on worker nodes, and sends tasks to be executed. Executors are responsible for running tasks and managing memory locally.
Spark allows you to share variables between tasks, or between tasks and the driver, using an abstraction known as shared variables. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are additive variables such as counters and sums.
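The accumulator semantics can be sketched in a few lines of plain Java (an illustration of the concept, not Spark's API): tasks may only add to the variable, and the driver reads the final value:

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy sketch of the accumulator idea (not the Spark API): tasks can
// only add to the variable; the driver reads the aggregated result.
class Accumulator {
    private final AtomicLong value = new AtomicLong();

    void add(long amount) {   // called from tasks, possibly concurrently
        value.addAndGet(amount);
    }

    long value() {            // read on the driver after the job completes
        return value.get();
    }
}
```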
Resilient Distributed Datasets (RDDs)
An RDD is stored in memory, shared across machines, and used in MapReduce-like parallel operations. Fault tolerance is achieved through the notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition. An RDD can be built in four ways:
By reading data from a file stored in HDFS
By dividing (parallelizing) a Scala collection into a number of partitions that are sent to workers
By transforming an existing RDD using parallel operators
By changing the persistence of an existing RDD
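The lineage idea behind RDD fault tolerance can be illustrated with a toy sketch in plain Java (not Spark code): a derived partition records its parent data and the transformation used to produce it, so a lost partition can be rebuilt on demand:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy illustration of lineage-based recovery (not Spark code): a
// derived partition remembers its parent data and the transformation
// used to compute it, so losing the materialized result is never fatal.
class LineagePartition<A, B> {
    private final List<A> parent;            // parent partition's data
    private final Function<A, B> transform;  // how this partition is derived
    private List<B> materialized;            // may be lost (e.g. node failure)

    LineagePartition(List<A> parent, Function<A, B> transform) {
        this.parent = parent;
        this.transform = transform;
    }

    void lose() {
        materialized = null; // simulate losing the in-memory partition
    }

    List<B> get() {
        if (materialized == null) {
            // Rebuild just this partition from its lineage.
            materialized = new ArrayList<>();
            for (A a : parent) {
                materialized.add(transform.apply(a));
            }
        }
        return materialized;
    }
}
```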
Spark shines when RDDs can fit in memory and be cached across operations. The API exposes methods to persist RDDs and allows for several persistence strategies and storage levels, allowing for spill to disk as well as space-efficient binary serialization.
Actions
Operations are invoked by passing functions to Spark. The system deals with variables and side effects according to the functional programming paradigm. Closures can refer to variables in the scope where they are created. Examples of actions are count (which returns the number of elements in the dataset) and save (which outputs the dataset to storage). Other parallel operations on RDDs include the following:
map: applies a function to each element of the dataset
filter: selects elements from a dataset based on user-provided criteria
reduce: combines dataset elements using an associative function
collect: sends all elements of the dataset to the driver program
foreach: passes each element through a user-provided function
groupByKey: groups items together by a provided key
sortByKey: sorts items by key
Deployment
Spark can run either in local mode, similar to a Hadoop single-node setup, or on top of a resource manager. Currently supported resource managers include the following:
Spark standalone cluster mode
YARN
Apache Mesos
Spark on YARN
An ad hoc consolidated JAR needs to be built in order to deploy Spark on YARN. Spark launches an instance of the standalone deploy cluster within the ResourceManager. Cloudera and MapR both ship with Spark on YARN as part of their software distributions. At the time of writing, Spark is available for Hortonworks' HDP as a technology preview (http://hortonworks.com/hadoop/spark/).
Spark on EC2
Spark comes with a deployment script, spark-ec2, located in the ec2 directory. This script automatically sets up Spark and HDFS on a cluster of EC2 instances. In order to launch a Spark cluster on the Amazon cloud, go to the ec2 directory and run the following command:
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
Here, <keypair> is the name of your EC2 key pair, <key-file> is the private key file for the key pair, <num-slaves> is the number of slave nodes to be launched, and <cluster-name> is the name to be given to your cluster. See Chapter 1, Introduction, for more details regarding the setup of key pairs. Verify that the cluster scheduler is up and sees all the slaves by going to its web UI, the address of which will be printed once the script completes.
You can specify a path in S3 as the input through a URI of the form s3n://<bucket>/path. You will also need to set your Amazon security credentials, either by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before your program is executed, or through SparkContext.hadoopConfiguration.
Getting started with Spark

Spark binaries and source code are available on the project website at http://spark.apache.org/. The examples in the following sections have been tested using Spark 1.1.0 built from source on the Cloudera CDH 5.0 QuickStart VM.

Download and uncompress the gzip archive with the following commands:

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
$ tar xvzf spark-1.1.0.tgz
$ cd spark-1.1.0
Spark is built on Scala 2.10 and uses sbt (https://github.com/sbt/sbt) to build the source core and related examples:

$ ./sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

With the -Dhadoop.version=2.2.0 and -Pyarn options, we instruct sbt to build against Hadoop version 2.2.0 or higher and to enable YARN support.
Start Spark in standalone mode with the following command:

$ ./sbin/start-all.sh

This command will launch a local master instance at spark://localhost:7077 as well as a worker node.

A web interface to the master node can be accessed at http://localhost:8080/ and can be seen in the following screenshot:

Master node web interface

Spark can run interactively through spark-shell, which is a modified version of the Scala shell. As a first example, we will implement a word count of the Twitter dataset we used in Chapter 3, Processing - MapReduce and Beyond, using the Scala API.

Start an interactive spark-shell session by running the following command:

$ ./bin/spark-shell

The shell instantiates a SparkContext object, sc, that is responsible for handling driver connections to workers. We will describe its semantics later in this chapter.
To make things a bit easier, let's create a sample textual dataset that contains one status update per line:

$ stream.py -t -n 1000 > sample.txt

Then, copy it to HDFS:

$ hdfs dfs -put sample.txt /tmp

Within spark-shell, we first create an RDD, file, from the sample data:

val file = sc.textFile("/tmp/sample.txt")

Then, we apply a series of transformations to count the word occurrences in the file. Note that the output of the transformation chain, counts, is still an RDD:

val counts = file.flatMap(line => line.split(" "))
             .map(word => (word, 1))
             .reduceByKey((m, n) => m + n)

This chain of transformations corresponds to the map and reduce phases that we are familiar with. In the map phase, we load each line of the dataset, tokenize each tweet into a sequence of words (flatMap), count the occurrence of each word (map), and emit (key, value) pairs. In the reduce phase, we group by key (word) and sum values (m, n) together to obtain word counts.

Finally, we print the first ten elements, counts.take(10), to the console:

counts.take(10).foreach(println)
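The same flatMap, map, and reduce-by-key chain can be traced on a plain Scala collection; in this Spark-free sketch, groupBy plus a sum stands in for Spark's distributed reduceByKey (the sample lines are made up):

```scala
// Word count over a local collection, mirroring the Spark transformation chain.
val lines = List("to be or not to be", "to do is to be")

val counts = lines
  .flatMap(line => line.split(" "))   // tokenize each line into words
  .map(word => (word, 1))             // emit (word, 1) pairs
  .groupBy(_._1)                      // group the pairs by word...
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // ...and sum the 1s

println(counts)
```

The result is a map from each word to its frequency, exactly the (key, value) pairs the RDD version writes out.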
Writing and running standalone applications

Spark allows standalone applications to be written using three APIs: Scala, Java, and Python.
Scala API

The first thing a Spark driver must do is create a SparkContext object, which tells Spark how to access a cluster. After importing classes and implicit conversions into a program, as in the following:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

the SparkContext object can be created with the following constructor:

new SparkContext(master, appName, [sparkHome])

It can also be created through SparkContext(conf), which takes a SparkConf object.

The master parameter is a string that specifies a cluster URI to connect to (such as spark://localhost:7077) or a local string to run in local mode. The appName parameter is the application name that will be shown in the cluster web UI.
It is not possible to override the default SparkContext class, nor is it possible to create a new one within a running Spark shell. It is, however, possible to specify which master the context connects to using the MASTER environment variable. For example, to run spark-shell on four cores, use the following:

$ MASTER=local[4] ./bin/spark-shell
Java API

The org.apache.spark.api.java package exposes all the Spark features available in the Scala version to Java. The Java API has a JavaSparkContext class that returns instances of org.apache.spark.api.java.JavaRDD and works with Java collections instead of Scala ones.

There are a few key differences between the Java and Scala APIs:

Java 7 does not support anonymous or first-class functions; therefore, functions must be implemented by extending the org.apache.spark.api.java.function.Function, Function2, and other classes. As of Spark version 1.0, the API has been refactored to support Java 8 lambda expressions; with Java 8, Function classes can be replaced with inline expressions that act as a shorthand for anonymous functions.
The RDD methods return Java collections.
Key-value pairs, which are simply written as (key, value) in Scala, are represented by the scala.Tuple2 class.
To maintain type safety, some RDD and function methods, such as those that handle key pairs and doubles, are implemented as specialized classes.
WordCount in Java

An example of WordCount in Java is included with the Spark source code distribution at examples/src/main/java/org/apache/spark/examples/JavaWordCount.java.

First of all, we create a context using the JavaSparkContext class:

JavaSparkContext sc = new JavaSparkContext(master, "JavaWordCount",
    System.getenv("SPARK_HOME"),
    JavaSparkContext.jarOfClass(JavaWordCount.class));

JavaRDD<String> data = sc.textFile(infile, 1);

JavaRDD<String> words = data.flatMap(
    new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
      }
    });

JavaPairRDD<String, Integer> ones = words.mapToPair(
    new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

JavaPairRDD<String, Integer> counts = ones.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });
We then build an RDD from the HDFS location infile. In the first step of the transformation chain, we tokenize each tweet in the dataset and return a list of words. We use an instance of JavaPairRDD<String, Integer> to count occurrences of each word. Finally, we reduce the RDD to a new JavaPairRDD<String, Integer> instance that contains a list of tuples, each representing a word and the number of times it was found in the dataset.
Python API

PySpark requires Python version 2.6 or higher. RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types. Lambda syntax (https://docs.python.org/2/reference/expressions.html) is used to pass functions to RDDs.

The word count in PySpark is relatively similar to its Scala counterpart:
tweets = sc.textFile("/tmp/sample.txt")
counts = tweets.flatMap(lambda tweet: tweet.split(' ')) \
               .map(lambda word: (word, 1)) \
               .reduceByKey(lambda m, n: m + n)
The lambda construct creates anonymous functions at runtime. lambda tweet: tweet.split(' ') creates a function that takes a string, tweet, as the input and outputs a list of strings split by whitespace. Spark's flatMap applies this function to each line of the tweets dataset. In the map phase, for each word token, lambda word: (word, 1) returns a (word, 1) tuple that indicates the occurrence of a word in the dataset. In reduceByKey, we group these tuples by key (word) and sum the values together to obtain the word count with lambda m, n: m + n.
The Spark ecosystem

Apache Spark powers a number of tools, both as a library and as an execution engine.
Spark Streaming

Spark Streaming (found at http://spark.apache.org/docs/latest/streaming-programming-guide.html) is an extension of the Scala API that allows data ingestion from streams such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets.

Spark Streaming receives live input data streams and divides the data into batches (arbitrarily sized time windows), which are then processed by the Spark core engine to generate the final stream of results in batches. This high-level abstraction is called a DStream (org.apache.spark.streaming.dstream.DStream) and is implemented as a sequence of RDDs. DStreams allow for two kinds of operations: transformations and output operations. Transformations work on one or more DStreams to create new DStreams. As part of a chain of transformations, data can be persisted either to a storage layer (HDFS) or to an output channel. Spark Streaming also allows for transformations over a sliding window of data. A window-based operation needs to specify two parameters: the window length, which is the duration of the window, and the slide interval, which is the interval at which the window-based operation is performed.
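Window length and slide interval are easy to illustrate without a cluster: Scala's sliding(size, step) on a plain collection produces exactly the batch groupings a windowed DStream operation would see (the batch values below are made up):

```scala
// Each element stands for one batch of a stream. A window three batches
// long, sliding by two batches, yields overlapping groups - the same
// shape as Spark Streaming's window(windowLength, slideInterval).
val batches = List(1, 2, 3, 4, 5, 6)

val windows = batches.sliding(3, 2).toList
// windows: List(List(1, 2, 3), List(3, 4, 5), List(5, 6))

val windowSums = windows.map(_.sum)   // a windowed reduction over each group
println(windowSums)
```

Note how batch 3 appears in two consecutive windows: with a slide interval shorter than the window length, consecutive windows overlap.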
GraphX

GraphX (found at https://spark.apache.org/docs/latest/graphx-programming-guide.html) is an API for graph computation that exposes a set of operators and algorithms for graph-oriented computation as well as an optimized variant of Pregel.
MLlib

MLlib (found at http://spark.apache.org/docs/latest/mllib-guide.html) provides common Machine Learning (ML) functionality, including tests and data generators. MLlib currently supports four types of algorithms: binary classification, regression, clustering, and collaborative filtering.
Spark SQL

Spark SQL is derived from Shark, an implementation of the Hive data warehousing system that uses Spark as an execution engine. We will discuss Hive in Chapter 7, Hadoop and SQL. With Spark SQL, it is possible to mix SQL-like queries with Scala or Python code. The result sets returned by a query are themselves RDDs, and as such, they can be manipulated by Spark core methods or by MLlib and GraphX.
Processing data with Apache Spark

In this section, we will implement the examples from Chapter 3, Processing - MapReduce and Beyond, using the Scala API. We will consider both the batch and real-time processing scenarios, and we will show you how Spark Streaming can be used to compute statistics on the live Twitter stream.
Building and running the examples

Scala source code for the examples can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch5. We will be using sbt to build, manage, and execute code.

The build.sbt file controls the codebase metadata and software dependencies; these include the version of the Scala interpreter that Spark links to, a link to the Akka package repository used to resolve implicit dependencies, and the dependencies on the Spark and Hadoop libraries.
The source code for all examples can be compiled with:

$ sbt compile

Or, it can be packaged into a JAR file with:

$ sbt package

A helper script to execute compiled classes can be generated with:

$ sbt add-start-script-tasks
$ sbt start-script

The helper can be invoked as follows:

$ target/start <classname> <master> <param1> … <paramn>

Here, <master> is the URI of the master node. An interactive Scala session can be invoked via sbt with the following command:

$ sbt console
This console is not the same as the Spark interactive shell; rather, it is an alternative way to execute code. In order to run Spark code in it, we will need to manually import and instantiate a SparkContext object. All examples presented in this section expect a twitter4j.properties file, containing the consumer key and secret and the access tokens, to be present in the same directory where sbt or spark-shell is being invoked:

oauth.consumerKey=
oauth.consumerSecret=
oauth.accessToken=
oauth.accessTokenSecret=
Running the examples on YARN

To run the examples on a YARN grid, we first build a JAR file using:

$ sbt package

Then, we ship it to the ResourceManager using the spark-submit command:

./bin/spark-submit --class application.to.execute --master yarn-cluster \
    [options] target/scala-2.10/chapter-4_2.10-1.0.jar [<param1> … <paramn>]
Unlike in standalone mode, we don't need to specify a <master> URI; in YARN, the ResourceManager is selected from the cluster configuration. More information on launching Spark on YARN can be found at http://spark.apache.org/docs/latest/running-on-yarn.html.
Finding popular topics

Unlike the earlier examples with the Spark shell, we initialize a SparkContext as part of the program. We pass three arguments to the SparkContext constructor: the type of scheduler we want to use, a name for the application, and the directory where Spark is installed:

import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext

import scala.util.matching.Regex

object HashtagCount {
  def main(args: Array[String]) {
    […]
    val sc = new SparkContext(master,
      "HashtagCount",
      System.getenv("SPARK_HOME"))

    val file = sc.textFile(inputFile)
    val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")

    val counts = file.flatMap(line =>
        (pattern findAllIn line).toList)
      .map(word => (word, 1))
      .reduceByKey((m, n) => m + n)

    counts.saveAsTextFile(outputPath)
  }
}
We create an initial RDD from a dataset stored in HDFS, inputFile, and apply logic that is similar to the WordCount example.

For each tweet in the dataset, we extract a list of strings that match the hashtag pattern, (pattern findAllIn line).toList, and we count an occurrence of each string using the map operator. This generates a new RDD as a list of tuples of the form:

(word, 1), (word2, 1), (word, 1)

Finally, we combine the elements of this RDD using the reduceByKey() method and store the RDD generated by this last step back into HDFS with saveAsTextFile.
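The hashtag regex itself can be exercised without Spark. A quick plain-Scala check (the sample tweet text is made up; here we pull out the capture group for readability, whereas HashtagCount counts the full regex match):

```scala
import scala.util.matching.Regex

// Same idea as HashtagCount's pattern: a '#' preceded by whitespace or
// the start of the string, with the tag body in a capture group.
val pattern = new Regex("(?:\\s|\\A|^)[#]+([A-Za-z0-9-_]+)")

val tweet = "Loving #Spark and #Hadoop2 today"
val tags = pattern.findAllMatchIn(tweet).map(_.group(1)).toList
println(tags)   // List(Spark, Hadoop2)
```

Because the pattern anchors on the preceding whitespace, a '#' embedded mid-word (for example in a URL fragment) is not counted as a hashtag.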
The code for the standalone driver can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/HashTagCount.scala.
Assigning a sentiment to topics

The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/HashTagSentiment.scala, and the code is as follows:
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext

import scala.util.matching.Regex
import scala.io.Source

object HashtagSentiment {
  def main(args: Array[String]) {
    […]
    val sc = new SparkContext(master,
      "HashtagSentiment",
      System.getenv("SPARK_HOME"))

    val file = sc.textFile(inputFile)

    val positive = Source.fromFile(positiveWordsPath)
      .getLines
      .filterNot(_ startsWith ";")
      .toSet
    val negative = Source.fromFile(negativeWordsPath)
      .getLines
      .filterNot(_ startsWith ";")
      .toSet

    val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")

    val counts = file.flatMap(line => (pattern findAllIn line).map({
      word => (word, sentimentScore(line, positive, negative))
    })).reduceByKey({ (m, n) => (m._1 + n._1, m._2 + n._2) })

    val sentiment = counts.map({ hashtagScore =>
      val hashtag = hashtagScore._1
      val score = hashtagScore._2
      val normalizedScore = score._1 / score._2
      (hashtag, normalizedScore)
    })

    sentiment.saveAsTextFile(outputPath)
  }
}
First, we read the lists of positive and negative words into Scala Set objects and filter out comments (strings beginning with ;).

When a hashtag is found, we call a function, sentimentScore, to estimate the sentiment expressed by the given text. This function implements the same logic we used in Chapter 3, Processing - MapReduce and Beyond, to estimate the sentiment of a tweet. It takes as input parameters the tweet's text, str, and the lists of positive and negative words as Set[String] objects. The return value is a pair containing the difference between the positive and negative scores and the number of words in the tweet. In Spark, we represent this return value as a pair of Double and Integer objects:
def sentimentScore(str: String, positive: Set[String],
    negative: Set[String]): (Double, Int) = {
  var positiveScore = 0; var negativeScore = 0
  str.split("""\s+""").foreach { w =>
    if (positive.contains(w)) { positiveScore += 1 }
    if (negative.contains(w)) { negativeScore += 1 }
  }
  ((positiveScore - negativeScore).toDouble,
    str.split("""\s+""").length)
}
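A quick, Spark-free check of the function's behavior; the word lists and tweet text below are made up for illustration:

```scala
// sentimentScore as defined above, exercised on a toy example.
def sentimentScore(str: String, positive: Set[String],
    negative: Set[String]): (Double, Int) = {
  var positiveScore = 0; var negativeScore = 0
  str.split("""\s+""").foreach { w =>
    if (positive.contains(w)) positiveScore += 1
    if (negative.contains(w)) negativeScore += 1
  }
  ((positiveScore - negativeScore).toDouble, str.split("""\s+""").length)
}

val positive = Set("good", "great")
val negative = Set("bad", "awful")

val score = sentimentScore("a great day not a bad day", positive, negative)
println(score)   // (0.0, 7): one positive, one negative, seven words
```

One positive and one negative hit cancel out, so the tweet scores 0.0 over its 7 words; the word count is carried along so that scores can be normalized later.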
We reduce the map output by aggregating by key (the hashtag). In this phase, we emit a triple made of the hashtag, the sum of the differences between positive and negative scores, and the number of words per tweet. We use an additional map step to normalize the sentiment score and store the resulting list of hashtag and sentiment pairs in HDFS.
Data processing on streams

The previous example can be easily adjusted to work on a real-time stream of data. In this and the following section, we will use spark-streaming-twitter to perform some simple analytics tasks on the real-time firehose:

val window = 10
val ssc = new StreamingContext(master, "TwitterStreamEcho",
  Seconds(window), System.getenv("SPARK_HOME"))

val stream = TwitterUtils.createStream(ssc, auth)
val tweets = stream.map(tweet => tweet.getText())
tweets.print()

ssc.start()
ssc.awaitTermination()
}
The Scala source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/TwitterStreamEcho.scala.

The two key packages we need to import are:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter._

We initialize a new StreamingContext, ssc, on a local cluster using a 10-second window and use this context to create a DStream of tweets, whose text we print.

Upon successful execution, Twitter's real-time firehose will be echoed in the terminal in batches of 10 seconds' worth of data. Notice that the computation will continue indefinitely, but it can be interrupted at any moment by pressing Ctrl + C.

The TwitterUtils object is a wrapper for the Twitter4j library (http://twitter4j.org/en/index.html) that ships with spark-streaming-twitter. A successful call to TwitterUtils.createStream will return a DStream of Twitter4j objects (TwitterInputDStream). In the preceding example, we used the getText() method to extract the tweet text; however, notice that the twitter4j object exposes the full Twitter API. For instance, we can print a stream of users with the following call:

val users = stream.map(tweet => (tweet.getUser().getId(),
  tweet.getUser().getName()))
users.print()
State management

Spark Streaming provides an ad hoc DStream to keep the state of each key in an RDD, and the updateStateByKey method to mutate that state.

We can reuse the code of the batch example to assign and update sentiment scores on streams:
object StreamingHashTagSentiment {
  […]
  val counts = text.flatMap(line => (pattern findAllIn line)
      .toList
      .map(word => (word, sentimentScore(line, positive, negative))))
    .reduceByKey({ (m, n) => (m._1 + n._1, m._2 + n._2) })

  val sentiment = counts.map({ hashtagScore =>
    val hashtag = hashtagScore._1
    val score = hashtagScore._2
    val normalizedScore = score._1 / score._2
    (hashtag, normalizedScore)
  })

  val stateDstream = sentiment
    .updateStateByKey[Double](updateFunc)

  stateDstream.print

  ssc.checkpoint("/tmp/checkpoint")
  ssc.start()
}
A state DStream is created by calling sentiment.updateStateByKey.

The updateFunc function implements the state mutation logic, which is a cumulative sum of sentiment scores over a period of time:
val updateFunc = (values: Seq[Double], state: Option[Double]) => {
  val currentScore = values.sum
  val previousScore = state.getOrElse(0.0)
  Some((currentScore + previousScore) * decayFactor)
}
decayFactor is a constant value between 0 and 1 that we use to proportionally decrease the score over time. Intuitively, this will fade out hashtags that are no longer trending. Spark Streaming writes intermediate data for stateful operations to HDFS, so we need to checkpoint the streaming context with ssc.checkpoint.
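The decay behavior is easy to check in isolation. A plain-Scala sketch of updateFunc with an illustrative decayFactor of 0.5 (the batch scores are made up):

```scala
// updateFunc as above, with an illustrative decay factor.
val decayFactor = 0.5

val updateFunc = (values: Seq[Double], state: Option[Double]) => {
  val currentScore = values.sum
  val previousScore = state.getOrElse(0.0)
  Some((currentScore + previousScore) * decayFactor)
}

// First batch: two sentiment scores arrive; there is no previous state.
val s1 = updateFunc(Seq(1.0, 3.0), None)   // Some(2.0)
// Second batch: no new scores; the accumulated state simply decays.
val s2 = updateFunc(Seq.empty, s1)         // Some(1.0)
println(s"$s1 $s2")
```

With no new mentions, a hashtag's score halves on every batch, so stale topics decay geometrically towards zero.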
The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/StreamingHashTagSentiment.scala.
Data analysis with Spark SQL

Spark SQL can ease the task of representing and manipulating structured data. We will load a JSON file into a temporary table and calculate simple statistics by blending SQL statements and Scala code:

object SparkJson {
  […]
  val file = sc.textFile(inputFile)

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._

  val tweets = sqlContext.jsonFile(inFile)
  tweets.printSchema()

  // Register the SchemaRDD as a table
  tweets.registerTempTable("tweets")

  val text = sqlContext.sql("SELECT text, user.id FROM tweets")

  // Find the ten most popular hashtags
  val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")
  val counts = text.flatMap(sqlRow => (pattern findAllIn
      sqlRow(0).toString).toList)
    .map(word => (word, 1))
    .reduceByKey((m, n) => m + n)

  counts.registerTempTable("hashtag_frequency")
  counts.printSchema

  val top10 = sqlContext.sql("SELECT _1 AS hashtag, _2 AS frequency FROM
    hashtag_frequency ORDER BY frequency DESC LIMIT 10")

  top10.foreach(println)
}
As with previous examples, we instantiate a SparkContext, sc, and load the dataset of JSON tweets. We then create an instance of org.apache.spark.sql.SQLContext based on the existing sc. The import sqlContext._ statement gives access to all functions and implicit conversions for sqlContext.

We load the tweets' JSON dataset using sqlContext.jsonFile. The resulting tweets object is an instance of SchemaRDD, a new type of RDD introduced by Spark SQL. The SchemaRDD class is conceptually similar to a table in a relational database; it is composed of Row objects and a schema that describes the content of each Row. We can see the schema for a tweet by calling tweets.printSchema().

Before we're able to manipulate tweets with SQL statements, we need to register the SchemaRDD as a table in the SQLContext. We then extract the text field of a JSON tweet with an SQL query. Note that the output of sqlContext.sql is again an RDD; as such, we can manipulate it using Spark core methods. In our case, we reuse the logic of previous examples to extract hashtags and count their occurrences. Finally, we register the resulting RDD as a table, hashtag_frequency, and order hashtags by frequency with a SQL query.
The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/SparkJson.scala.
SQL on data streams

At the time of writing, a SQLContext cannot be directly instantiated from a StreamingContext object. It is, however, possible to query a DStream by registering a SchemaRDD for each RDD in a given stream:
object SqlOnStream {
  […]
  val ssc = new StreamingContext(sc, Seconds(window))

  val gson = new Gson()
  val dstream = TwitterUtils
    .createStream(ssc, auth)
    .map(gson.toJson(_))

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._

  dstream.foreachRDD(rdd => {
    rdd.foreach(println)
    val jsonRDD = sqlContext.jsonRDD(rdd)
    jsonRDD.registerTempTable("tweets")
    jsonRDD.printSchema
    sqlContext.sql(query)
  })

  ssc.checkpoint("/tmp/checkpoint")
  ssc.start()
  ssc.awaitTermination()
}
In order to get the two working together, we first create a SparkContext, sc, that we use to initialize both a StreamingContext, ssc, and a SQLContext. As in previous examples, we use TwitterUtils.createStream to create a DStream, dstream. In this example, we use Google's Gson JSON parser to serialize each twitter4j object to a JSON string. To execute Spark SQL queries on the stream, we register a SchemaRDD, jsonRDD, within a dstream.foreachRDD loop. We use the sqlContext.jsonRDD method to create an RDD from each batch of JSON tweets. At this point, we can query the SchemaRDD using the sqlContext.sql method.
The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/SqlOnStream.scala.
Comparing Samza and Spark Streaming

It is useful to compare Samza and Spark Streaming to help identify the areas in which each can best be applied. As has hopefully been made clear in this book, these technologies are very much complementary. Even though Spark Streaming might appear competitive with Samza, we feel both products offer compelling advantages in certain areas.

Samza shines when the input data is truly a stream of discrete events and you wish to build processing that operates on this type of input. Samza jobs running on Kafka can have latencies on the order of milliseconds. This provides a programming model focused on individual messages and is the better fit for true near-real-time processing applications. Though it lacks support for building topologies of collaborating jobs, its simple model allows similar constructs to be built and, perhaps more importantly, to be easily reasoned about. Its model of partitioning and scaling also focuses on simplicity, which again makes a Samza application very easy to understand and gives it a significant advantage when dealing with something as intrinsically complex as real-time data.

Spark is much more than a streaming product. Its support for building distributed data structures from existing datasets, and for manipulating these with powerful primitives, gives it the ability to process large datasets at a higher level of granularity. Other products in the Spark ecosystem build additional interfaces or abstractions upon this common batch-processing core. This is very much a different focus to the message stream model of Samza.

This batch model is also demonstrated in Spark Streaming; instead of a per-message processing model, it slices the message stream into a series of RDDs. With a fast execution engine, this means latencies as low as 1 second (http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf). For workloads that wish to analyze the stream in such a way, this will be a better fit than Samza's per-message model, which requires additional logic to provide such windowing.
Summary

This chapter explored Spark and showed how it provides a rich framework for iterative processing upon which applications can be built atop YARN. In particular, we highlighted:

The distributed data-structure-based processing model of Spark and how it allows very efficient in-memory data processing
The broader Spark ecosystem and how multiple additional projects are built atop it to specialize the computational model even further

In the next chapter, we will explore Apache Pig and its programming language, Pig Latin. We will see how this tool can greatly simplify software development for Hadoop by abstracting away some of the MapReduce and Spark complexity.
Chapter 6. Data Analysis with Apache Pig

In the previous chapters, we explored a number of APIs for data processing. MapReduce, Spark, Tez, and Samza are rather low-level, and writing non-trivial business logic with them often requires significant Java development. Moreover, different users will have different needs. It might be impractical for an analyst to write MapReduce code or build a DAG of inputs and outputs to answer some simple queries. At the same time, a software engineer or a researcher might want to prototype ideas and algorithms using high-level abstractions before jumping into low-level implementation details.

In this chapter and the following one, we will explore some tools that provide a way to process data on HDFS using higher-level abstractions. In this chapter, we will explore Apache Pig and, in particular, we will cover the following topics:

What Apache Pig is and the dataflow model it provides
Pig Latin's data types and functions
How Pig can be easily enhanced using custom user code
How we can use Pig to analyze the Twitter stream
An overview of Pig

Historically, the Pig toolkit consisted of a compiler that generated MapReduce programs, bundled their dependencies, and executed them on Hadoop. Pig jobs are written in a language called Pig Latin and can be executed in both interactive and batch fashions. Furthermore, Pig Latin can be extended using User Defined Functions (UDFs) written in Java, Python, Ruby, Groovy, or JavaScript.

Pig use cases include the following:

Data processing
Ad hoc analytical queries
Rapid prototyping of algorithms
Extract Transform Load pipelines
Following a trend we have seen in previous chapters, Pig is moving towards a general-purpose computing architecture. As of version 0.13, the ExecutionEngine interface (org.apache.pig.backend.executionengine) acts as a bridge between the frontend and the backend of Pig, allowing Pig Latin scripts to be compiled and executed on frameworks other than MapReduce. At the time of writing, version 0.13 ships with MRExecutionEngine (org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRExecutionEngine), and work on a low-latency backend based on Tez (org.apache.pig.backend.hadoop.executionengine.tez.*) is expected to be included in version 0.14 (see https://issues.apache.org/jira/browse/PIG-3446). Work on integrating Spark is currently in progress in the development branch (see https://issues.apache.org/jira/browse/PIG-4059).
Pig 0.13 comes with a number of performance enhancements for the MapReduce backend, in particular two features that reduce the latency of small jobs: direct HDFS access (https://issues.apache.org/jira/browse/PIG-3642) and auto local mode (https://issues.apache.org/jira/browse/PIG-3463). Direct HDFS access, controlled by the opt.fetch property, is turned on by default. When doing a DUMP in a simple (map-only) script that contains only LIMIT, FILTER, UNION, STREAM, or FOREACH operators, input data is fetched from HDFS and the query is executed directly in Pig, bypassing MapReduce. With auto local mode, controlled by the pig.auto.local.enabled property, Pig will run a query in Hadoop local mode when the input data size is smaller than pig.auto.local.input.maxbytes. Auto local mode is off by default.
Pig will launch MapReduce jobs if both modes are off or if the query is not eligible for either. If both modes are on, Pig will check whether the query is eligible for direct access and, if not, fall back to auto local mode. Failing that, it will execute the query on MapReduce.
Getting started

We will use the stream.py script to extract JSON data and retrieve a specific number of tweets; we can run it with a command such as the following:
$ python stream.py -j -n 10000 > tweets.json
The tweets.json file will contain one JSON string on each line, each representing a tweet.
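The layout can be sketched in Python as line-delimited JSON: one object per line, each parsed independently. The field subset shown here is an illustrative assumption; real tweets carry many more fields.

```python
import json

# Illustrative tweets with a subset of the fields used later in the chapter.
tweets = [
    {"created_at": "Mon Sep 01 12:00:00 +0000 2014", "id": 1,
     "id_str": "1", "text": "Learning #hadoop"},
    {"created_at": "Mon Sep 01 12:00:01 +0000 2014", "id": 2,
     "id_str": "2", "text": "Pig is fun"},
]

# Serialize: one JSON object per line, the format Pig's JsonLoader expects.
dump = "\n".join(json.dumps(t) for t in tweets)

# Each line parses back independently.
parsed = [json.loads(line) for line in dump.splitlines()]
print(parsed[1]["text"])  # Pig is fun
```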
Remember that the Twitter API credentials need to be made available as environment variables or hardcoded in the script itself.
Running Pig

Pig is a tool that translates statements written in Pig Latin and executes them either on a single machine, in standalone mode, or on a full Hadoop cluster, in distributed mode. Even in the latter case, Pig's role is to translate Pig Latin statements into MapReduce jobs, and it therefore doesn't require the installation of additional services or daemons; it is used as a command-line tool with its associated libraries.
Cloudera CDH ships with Apache Pig version 0.12. Alternatively, the Pig source code and binary distributions can be obtained at https://pig.apache.org/releases.html.
As can be expected, MapReduce mode requires access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default when running the pig command at the command-line prompt. Scripts can be executed with the following command:
$ pig -f <script>
Parameters can be passed via the command line using -param <param>=<val>, as follows:
$ pig -param input=tweets.txt
Parameters can also be specified in a parameter file that is passed to Pig using the -param_file <file> option. Multiple files can be specified. If a parameter is present multiple times in the files, the last value will be used and a warning will be displayed. A parameter file contains one parameter per line. Empty lines and comments (lines starting with #) are allowed. Within a Pig script, parameters are referenced in the form $<parameter>. A default value can be assigned using the %default statement, for example %default input 'tweets.json'. The %default command will not work within a Grunt session; we'll discuss Grunt in the next section.
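A hypothetical parameter file illustrating the rules above (the file and parameter names are assumptions for illustration):

```
# params.txt -- one parameter per line; empty lines and
# #-comments are allowed.

input=tweets.json
output=top_hashtags

# Invocation (script name is illustrative):
# $ pig -param_file params.txt -f top_hashtags.pig
```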
In local mode, all files are installed and run using the local host and filesystem. Specify local mode using the -x flag:
$ pig -x local
In both execution modes, Pig programs can be run either in an interactive shell or in batch mode.
Grunt – the Pig interactive shell

Pig can run in an interactive mode using the Grunt shell, which is invoked when we use the pig command at the terminal prompt. In the rest of this chapter, we will assume that examples are executed within a Grunt session. Other than executing Pig Latin statements, Grunt offers a number of utilities and access to shell commands:
fs: allows users to manipulate Hadoop filesystem objects and has the same semantics as the Hadoop CLI
sh: executes commands via the operating system shell
exec: launches a Pig script within an interactive Grunt session
kill: kills a MapReduce job
help: prints a list of all available commands
Elastic MapReduce

Pig scripts can be executed on EMR by creating a cluster with --applications Name=Pig,Args=--version,<version>, as follows:
$ aws emr create-cluster \
  --name "Pig cluster" \
  --ami-version <ami version> \
  --instance-type <EC2 instance> \
  --instance-count <number of nodes> \
  --applications Name=Pig,Args=--version,<version> \
  --log-uri <S3 bucket> \
  --steps Type=PIG,\
Name="Pig script",\
Args=[-f,s3://<script location>,\
-p,input=<input param>,\
-p,output=<output param>]
The preceding command will provision a new EMR cluster and execute s3://<script location>. Notice that the scripts to be executed and the input (-p input) and output (-p output) paths are expected to be located on S3.
As an alternative to creating a new EMR cluster, it is possible to add Pig steps to an already-instantiated EMR cluster using the following command:
$ aws emr add-steps \
  --cluster-id <cluster id> \
  --steps Type=PIG,\
Name="Other Pig script",\
Args=[-f,s3://<script location>,\
-p,input=<input param>,\
-p,output=<output param>]
In the preceding command, <cluster id> is the ID of the instantiated cluster.
It is also possible to ssh into the master node and run Pig Latin statements within a Grunt session with the following command:
$ aws emr ssh --cluster-id <cluster id> --key-pair-file <key pair>
Fundamentals of Apache Pig

The primary interface to program Apache Pig is Pig Latin, a procedural language that implements ideas of the dataflow paradigm.
Pig Latin programs are generally organized as follows:
A LOAD statement reads data from HDFS
A series of statements aggregates and manipulates data
A STORE statement writes output to the filesystem
Alternatively, a DUMP statement displays the output to the terminal
The following example shows a sequence of statements that outputs the top 10 hashtags, ordered by frequency, extracted from the dataset of tweets:
tweets = LOAD 'tweets.json'
    USING JsonLoader('created_at:chararray,
        id:long,
        id_str:chararray,
        text:chararray');
hashtags = FOREACH tweets {
    GENERATE FLATTEN(
        REGEX_EXTRACT(
            text,
            '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)', 1)
    ) AS tag;
}
hashtags_grpd = GROUP hashtags BY tag;
hashtags_count = FOREACH hashtags_grpd {
    GENERATE
        group,
        COUNT(hashtags) AS occurrencies;
}
hashtags_count_sorted = ORDER hashtags_count BY occurrencies DESC;
top_10_hashtags = LIMIT hashtags_count_sorted 10;
DUMP top_10_hashtags;
First, we load the tweets.json dataset from HDFS, de-serialize the JSON file, and map it to a four-column schema that contains a tweet's creation time, its ID in numerical and string form, and its text. For each tweet, we extract hashtags from its text using a regular expression. We aggregate on hashtag, count the number of occurrences, and order by frequency. Finally, we limit the ordered records to the 10 most frequent hashtags.
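The same dataflow can be sketched in Python to make each step concrete. The regex mirrors the one in the script above (with the ASCII # only); the tweet texts are made-up illustrative data.

```python
import re
from collections import Counter

# Illustrative tweet texts standing in for the 'text' column.
texts = [
    "Loving #hadoop and #pig",
    "More #hadoop today",
    "#pig #hadoop #bigdata",
    "no tags here",
]

# FOREACH ... GENERATE FLATTEN(REGEX_EXTRACT(...)): one row per hashtag.
hashtags = [tag for text in texts
            for tag in re.findall(r'(?:\s|\A|^)[#]+([A-Za-z0-9-_]+)', text)]

# GROUP ... / COUNT ... / ORDER ... DESC / LIMIT 10, all in one call.
top_10 = Counter(hashtags).most_common(10)
print(top_10)  # [('hadoop', 3), ('pig', 2), ('bigdata', 1)]
```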
A series of statements like the preceding one is picked up by the Pig compiler, transformed into MapReduce jobs, and executed on a Hadoop cluster. The planner and optimizer will resolve dependencies on input and output relations and parallelize the execution of statements wherever possible.
Statements are the building blocks of processing data with Pig. They take a relation as input and produce another relation as output. In Pig Latin terms, a relation can be defined as a bag of tuples, two data types we will use throughout the remainder of this chapter.
Users experienced with SQL and the relational data model might find Pig Latin's syntax somewhat familiar. While there are indeed similarities in the syntax itself, Pig Latin implements an entirely different computational model. Pig Latin is procedural: it specifies the actual data transformations to be performed, whereas SQL is declarative and describes the nature of the problem without specifying the actual runtime processing. In terms of organizing data, a relation can be thought of as a table in a relational database, where the tuples in a bag correspond to the rows in a table. Relations are unordered, and therefore easily parallelizable, and they are less constrained than relational tables: Pig relations can contain tuples with different numbers of fields, and tuples with the same field count can have fields of different types in corresponding positions.
A key difference between SQL and the dataflow model adopted by Pig Latin lies in how splits in a data pipeline are managed. In the relational world, a declarative language such as SQL implements and executes queries that generate a single result. The dataflow model instead sees data transformations as a graph where inputs and outputs are nodes connected by operators. For instance, intermediate steps of a query might require the input to be grouped by a number of keys, resulting in multiple outputs (GROUP BY). Pig has built-in mechanisms to manage multiple dataflows in such a graph, executing operators as soon as their inputs are available and potentially applying different operators to each flow. For instance, Pig's implementation of the GROUP BY operator uses the parallel feature (http://pig.apache.org/docs/r0.12.0/perf.html#parallel) to allow a user to increase the number of reduce tasks for the generated MapReduce jobs and hence increase concurrency. An additional side effect of this property is that when multiple operators can be executed in parallel in the same program, Pig does so (more details on Pig's multi-query implementation can be found at http://pig.apache.org/docs/r0.12.0/perf.html#multi-query-execution). Another consequence of Pig Latin's approach to computation is that it allows the persistence of data at any point in the pipeline, and it allows the developer to select specific operator implementations and execution plans when necessary, effectively overriding the optimizer.
Pig Latin allows and even encourages developers to insert their own code almost anywhere in a pipeline by means of User Defined Functions (UDFs) as well as by utilizing Hadoop streaming. UDFs allow users to specify custom business logic for how data is loaded, stored, and processed, whereas streaming allows users to launch executables at any point in the dataflow.
Programming Pig

Pig Latin comes with a number of built-in functions (the eval, load/store, math, string, bag, and tuple functions) and a number of scalar and complex data types. Additionally, Pig allows function and data-type extension by means of UDFs and dynamic invocation of Java methods.
Pig data types

Pig supports the following scalar data types:
int: a signed 32-bit integer
long: a signed 64-bit integer
float: a 32-bit floating point
double: a 64-bit floating point
chararray: a character array (string) in Unicode UTF-8 format
bytearray: a byte array (blob)
boolean: a boolean
datetime: a datetime
biginteger: a Java BigInteger
bigdecimal: a Java BigDecimal
Pig supports the following complex data types:
map: an associative array enclosed by [], with the key and value separated by # and items separated by commas
tuple: an ordered list of data, where elements can be of any scalar or complex type, enclosed by () with items separated by commas
bag: an unordered collection of tuples enclosed by {} and separated by commas
By default, Pig treats data as untyped. The user can declare the types of data at load time or cast it manually when necessary. If a data type is not declared but a script implicitly treats a value as a certain type, Pig will assume it is of that type and cast it accordingly. The fields of a bag or tuple can be referred to by name (tuple.field) or by position ($<index>). Pig counts from 0, so the first element is denoted as $0.
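A few illustrative literal forms may help; the values shown are made up for this sketch:

```
-- map:   [name#Pig, version#0.12]
-- tuple: (1415, 'hadoop', 2.0)
-- bag:   {(1415, 'hadoop'), (1416, 'pig')}

-- Given a tuple t with schema (id:long, tag:chararray),
-- t.tag and t.$1 refer to the same field; $0 is the first field.
```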
Pig functions

Built-in functions are implemented in Java, and they try to follow standard Java conventions. There are, however, a number of differences to keep in mind, which are as follows:
Function names are case sensitive and uppercase
If the result value is null, empty, or not a number (NaN), Pig returns null
If Pig is unable to process the expression, it returns an exception
A list of all built-in functions can be found at http://pig.apache.org/docs/r0.12.0/func.html.
Load/store

Load/store functions determine how data goes into and comes out of Pig. The PigStorage, TextLoader, and BinStorage functions can be used to read and write UTF-8 delimited text, unstructured text, and binary data, respectively. Support for compression is determined by the load/store function. The PigStorage and TextLoader functions support gzip and bzip2 compression for both read (load) and write (store). The BinStorage function does not support compression.
As of version 0.12, Pig includes built-in support for loading and storing Avro and JSON data via AvroStorage (load/store), JsonStorage (store), and JsonLoader (load). At the time of writing, JSON support is still somewhat limited. In particular, Pig expects a schema for the data to be provided as an argument to JsonLoader/JsonStorage, or it assumes that a .pig_schema file (produced by JsonStorage) is present in the directory containing the input data. In practice, this makes it difficult to work with JSON dumps not generated by Pig itself.
As seen in the following example, we can load the JSON dataset with JsonLoader:
tweets = LOAD 'tweets.json' USING JsonLoader(
    'created_at:chararray,
    id:long,
    id_str:chararray,
    text:chararray,
    source:chararray');
We provide a schema so that the first five elements of a JSON object (created_at, id, id_str, text, and source) are mapped. We can look at the schema of tweets by using DESCRIBE tweets, which returns the following:
tweets: {created_at: chararray, id: long, id_str: chararray, text: chararray, source: chararray}
Eval

Eval functions implement a set of operations to be applied on an expression that returns a bag or map data type. The expression result is evaluated within the function context.
AVG(expression): computes the average of the numeric values in a single-column bag
COUNT(expression): counts all elements with non-null values in the first position in a bag
COUNT_STAR(expression): counts all elements in a bag
IsEmpty(expression): checks whether a bag or map is empty
MAX(expression), MIN(expression), and SUM(expression): return the max, min, or sum of the elements in a bag
TOKENIZE(expression): splits a string and outputs a bag of words
The tuple, bag, and map functions

These functions allow conversion from and to the bag, tuple, and map types. They include the following:
TOTUPLE(expression), TOMAP(expression), and TOBAG(expression): coerce an expression to a tuple, map, or bag
TOP(n, column, relation): returns the top n tuples from a bag of tuples
The math, string, and datetime functions

Pig exposes a number of functions provided by the java.lang.Math, java.lang.String, and java.util.Date classes and by the Joda-Time DateTime class (found at http://www.joda.org/joda-time/).
Dynamic invokers

Dynamic invokers allow the execution of Java functions without having to wrap them in a UDF. They can be used for any static function that:
accepts no arguments, or accepts a combination of string, int, long, double, float, or arrays of these same types
returns a string, int, long, double, or float value
Only primitives can be used for numbers; Java boxed classes (such as Integer) cannot be used as arguments. Depending on the return type, a specific kind of invoker must be used: InvokeForString, InvokeForInt, InvokeForLong, InvokeForDouble, or InvokeForFloat. More details regarding dynamic invokers can be found at http://pig.apache.org/docs/r0.12.0/func.html#dynamic-invokers.
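As a sketch based on the example in the Pig documentation, a static Java method such as java.net.URLDecoder.decode(String, String) can be wrapped and called like any other function. The relation encoded and its url field are assumptions for illustration:

```
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
decoded = FOREACH encoded GENERATE UrlDecode(url, 'UTF-8');
```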
Macros

As of version 0.9, Pig Latin's preprocessor supports macro expansion. Macros are defined using the DEFINE statement:
DEFINE macro_name(param1, ..., paramN) RETURNS output_bag {
    pig_latin_statements
};
The macro is expanded inline, and its parameters are referenced in the Pig Latin block within {}.
The macro's output relation is given in the RETURNS statement (output_bag). RETURNS void is used for a macro with no output relation.
We can define a macro to count the number of rows in a relation, as follows:
DEFINE count_rows(X) RETURNS cnt {
    grpd = GROUP $X ALL;
    $cnt = FOREACH grpd GENERATE COUNT($X);
};
We can use it in a Pig script or Grunt session to count the number of tweets:
tweets_count = count_rows(tweets);
DUMP tweets_count;
Macros allow us to make scripts modular by housing code in separate files and importing them where needed. For example, we can save count_rows in a file called count_rows.macro and later import it with the command import 'count_rows.macro'.
Macros have a number of limitations; in particular, only Pig Latin statements are allowed inside a macro. It is not possible to use REGISTER statements or shell commands, UDFs are not allowed, and parameter substitution inside the macro is not supported.
Working with data

Pig Latin provides a number of relational operators to combine functions and apply transformations on data. Typical operations in a data pipeline consist of filtering relations (FILTER), aggregating inputs based on keys (GROUP), generating transformations based on columns of data (FOREACH), and joining relations (JOIN) based on shared keys.
In the following sections, we will illustrate these operators on a dataset of tweets generated by loading JSON data.
Filtering

The FILTER operator selects tuples from a relation based on an expression, as follows:
relation = FILTER relation BY expression;
We can use this operator to filter tweets whose text matches the hashtag regular expression, as follows:
tweets_with_tag = FILTER tweets BY (
    text MATCHES '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)'
);
Aggregation

The GROUP operator groups together data in one or more relations based on an expression or a key, as follows:
relation = GROUP relation BY expression;
We can group tweets by the source field into a new relation, grpd, as follows:
grpd = GROUP tweets BY source;
It is possible to group on multiple dimensions by specifying a tuple as the key, as follows:
grpd = GROUP tweets BY (created_at, source);
The result of a GROUP operation is a relation that includes one tuple per unique value of the group expression. This tuple contains two fields. The first field is named group and is of the same type as the group key. The second field takes the name of the original relation and is of type bag. The names of both fields are generated by the system.
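The shape of GROUP's output can be sketched in Python: one entry per distinct key, pairing the group value with a bag of the original tuples. The tweet tuples below are illustrative.

```python
from collections import defaultdict

# Illustrative (source, text) tuples standing in for the tweets relation.
tweets = [
    ("web", "tweet one"),
    ("android", "tweet two"),
    ("web", "tweet three"),
]

# Collect the original tuples under their group key.
grouped = defaultdict(list)
for source, text in tweets:
    grouped[source].append((source, text))

# Each entry mirrors Pig's (group, bag-named-after-the-relation) pair.
grpd = [(group, bag) for group, bag in grouped.items()]
print(sorted(grpd)[0])  # ('android', [('android', 'tweet two')])
```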
Using the ALL keyword, Pig will aggregate across the whole relation; GROUP tweets ALL will place all tuples in a single group.
As previously mentioned, Pig allows explicit handling of the concurrency level of the GROUP operator using the PARALLEL clause:
grpd = GROUP tweets BY (created_at, id) PARALLEL 10;
In the preceding example, the MapReduce job generated by the compiler will run 10 concurrent reduce tasks. Otherwise, Pig uses a heuristic to estimate how many reducers to use.
Another way of globally enforcing the number of reduce tasks is to use the set default_parallel <n> command.
Foreach

The FOREACH operator applies functions to columns, as follows:
relation = FOREACH relation GENERATE transformation;
The output of FOREACH depends on the transformation applied.
We can use the operator to project the text of all tweets that contain a hashtag, as follows:
t = FOREACH tweets_with_tag GENERATE text;
We can also apply a function to the projected columns. For instance, we can use the TOKENIZE function to split each tweet into words, as follows:
t = FOREACH tweets_with_tag GENERATE FLATTEN(TOKENIZE(text)) AS word;
The FLATTEN modifier further un-nests the bag generated by TOKENIZE, producing one row per word.
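The TOKENIZE-then-FLATTEN combination can be sketched in Python: tokenizing yields one bag of words per input row, and flattening un-nests those bags into individual rows. The texts are illustrative.

```python
# Illustrative input texts, one per row of the relation.
texts = ["pig is fun", "hadoop scales"]

# TOKENIZE: one bag of words per input row.
bags = [text.split() for text in texts]

# FLATTEN: un-nest each bag, producing one row per word.
words = [word for bag in bags for word in bag]
print(words)  # ['pig', 'is', 'fun', 'hadoop', 'scales']
```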
Join

The JOIN operator performs an inner join of two or more relations based on common field values. Its syntax is as follows:
relation = JOIN relation1 BY expression1, relation2 BY expression2;
We can use a join operation to detect tweets that contain positive words, as follows:
positive = LOAD 'positive-words.txt' USING PigStorage() AS (w:chararray);
We filter out the comments, as follows:
positive_words = FILTER positive BY NOT w MATCHES '^;.*';
positive_words is a bag of tuples, each containing a word. We then tokenize the tweets' text and create a new bag of (id_str, word) tuples, as follows:
id_words = FOREACH tweets {
    GENERATE
        id_str,
        FLATTEN(TOKENIZE(text)) AS word;
}
We join the two relations on the word field and obtain a relation of all tweets that contain one or more positive words, as follows:
positive_tweets = JOIN positive_words BY w, id_words BY word;
In this statement, we join positive_words and id_words on the condition that id_words.word is a positive word. The positive_tweets relation is a bag in the form {w: chararray, id_str: chararray, word: chararray} that contains all elements of positive_words and id_words that match the join condition.
We can combine the GROUP and FOREACH operators to calculate the number of positive words per tweet (with at least one positive word). First, we group the relation of positive tweets by the tweet ID, and then we count the number of occurrences of each ID in the relation, as follows:
grpd = GROUP positive_tweets BY id_str;
score = FOREACH grpd GENERATE FLATTEN(group), COUNT(positive_tweets);
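The whole tokenize-join-group-count pipeline can be sketched in Python. The positive-word list and tweets below are made-up illustrative data.

```python
from collections import Counter

# Illustrative positive-word list and (id_str, text) tweets.
positive_words = {"good", "great", "happy"}
tweets = [
    ("1", "what a good great day"),
    ("2", "nothing to report"),
    ("3", "happy now"),
]

# FOREACH ... FLATTEN(TOKENIZE(text)): (id_str, word) pairs.
id_words = [(id_str, word) for id_str, text in tweets
            for word in text.split()]

# JOIN ... BY word, then GROUP BY id_str and COUNT:
# number of positive words per tweet.
score = Counter(id_str for id_str, word in id_words
                if word in positive_words)
print(dict(score))  # {'1': 2, '3': 1}
```

Note that, like the Pig pipeline, the inner join drops tweets with no positive words (tweet 2 never appears in the result).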
The JOIN operator can make use of the PARALLEL feature as well, as follows:
positive_tweets = JOIN positive_words BY w, id_words BY word PARALLEL 10;
The preceding command will execute the join with 10 reduce tasks.
It is possible to specify the operator's behavior with the USING keyword followed by the ID of a specialized join. More details can be found at http://pig.apache.org/docs/r0.12.0/perf.html#specialized-joins.
Extending Pig (UDFs)

Functions can be a part of almost every operator in Pig. There are two main differences between UDFs and built-in functions. First, UDFs need to be registered using the REGISTER keyword in order to make them available to Pig. Second, they need to be qualified when used. Pig UDFs can currently be implemented in Java, Python, Ruby, JavaScript, and Groovy. The most extensive support is provided for Java functions, which allow you to customize all parts of the process, including data load/store, transformation, and aggregation. Additionally, Java functions are more efficient because they are implemented in the same language as Pig and because additional interfaces, such as the Algebraic and Accumulator interfaces, are supported. On the other hand, the Ruby and Python APIs allow more rapid prototyping.
The integration of UDFs with the Pig environment is mainly managed by two statements, REGISTER and DEFINE:
REGISTER registers a JAR file so that the UDFs in the file can be used, as follows:

REGISTER 'piggybank.jar';

DEFINE creates an alias to a function or a streaming command, as follows:

DEFINE MyFunction my.package.uri.MyFunction();
Version 0.12 of Pig introduced streaming UDFs, a mechanism for writing functions in languages with no JVM implementation.
Contributed UDFs

Pig's codebase hosts a UDF repository called Piggybank. Other popular contributed repositories are Twitter's Elephant Bird (found at https://github.com/kevinweil/elephant-bird/) and Apache DataFu (found at http://datafu.incubator.apache.org/).
Piggybank

Piggybank is a place for Pig users to share their functions. Shared code is located in the official Pig Subversion repository, found at http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/. The API documentation can be found at http://pig.apache.org/docs/r0.12.0/api/ under the contrib section. Piggybank UDFs can be obtained by checking out and compiling the sources from the Subversion repository or by using the JAR file that ships with binary releases of Pig. In Cloudera CDH, piggybank.jar is available at /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar.
Elephant Bird

Elephant Bird is an open source library of all things Hadoop used in production at Twitter. This library contains a number of serialization tools, custom input and output formats, writables, Pig load/store functions, and more miscellanea.
Elephant Bird ships with an extremely flexible JSON loader function which, at the time of writing, is the go-to resource for manipulating JSON data in Pig.
Apache DataFu

Apache DataFu Pig collects a number of analytical functions developed and contributed by LinkedIn. These include statistical and estimation functions, bag and set operations, sampling, hashing, and link analysis.
Analyzing the Twitter stream

In the following examples, we will use the implementation of JsonLoader provided by Elephant Bird to load and manipulate JSON data. We will use Pig to explore tweet metadata and analyze trends in the dataset. Finally, we will model the interaction between users as a graph and use Apache DataFu to analyze this social network.
Prerequisites

Download the elephant-bird-pig (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-pig/4.5/elephant-bird-pig-4.5.jar), elephant-bird-hadoop-compat (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-hadoop-compat/4.5/elephant-bird-hadoop-compat-4.5.jar), and elephant-bird-core (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-core/4.5/elephant-bird-core-4.5.jar) JAR files from the Maven central repository and copy them onto HDFS using the following commands:
$ hdfs dfs -put target/elephant-bird-pig-4.5.jar hdfs:///jar/
$ hdfs dfs -put target/elephant-bird-hadoop-compat-4.5.jar hdfs:///jar/
$ hdfs dfs -put elephant-bird-core-4.5.jar hdfs:///jar/
Dataset exploration

Before diving deeper into the dataset, we need to register the dependencies on Elephant Bird and DataFu, as follows:
REGISTER /opt/cloudera/parcels/CDH/lib/pig/datafu-1.1.0-cdh5.0.0.jar
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/json-simple-1.1.jar
REGISTER hdfs:///jar/elephant-bird-pig-4.5.jar
REGISTER hdfs:///jar/elephant-bird-hadoop-compat-4.5.jar
REGISTER hdfs:///jar/elephant-bird-core-4.5.jar
Then, load the JSON dataset of tweets using com.twitter.elephantbird.pig.load.JsonLoader, as follows:
tweets = LOAD 'tweets.json' USING
    com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
com.twitter.elephantbird.pig.load.JsonLoader decodes each line of the input file as JSON and passes the resulting map of values to Pig as a single-element tuple. This enables access to elements of the JSON object without having to specify a schema upfront. The '-nestedLoad' argument instructs the class to load nested data structures.
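What '-nestedLoad' exposes can be sketched in Python: each line becomes a nested map, and a chained lookup such as $0#'entities'#'hashtags' corresponds to chained dictionary access. The tweet below is illustrative.

```python
import json

# An illustrative tweet line with nested entities and user objects.
line = json.dumps({
    "text": "hello #pig",
    "entities": {"hashtags": [{"text": "pig", "indices": [6, 10]}]},
    "user": {"id": 42},
})

tweet = json.loads(line)

# $0#'entities'#'hashtags' in Pig ~ tweet['entities']['hashtags'] here.
tags = [h["text"] for h in tweet["entities"]["hashtags"]]

# $0#'user'#'id' ~ tweet['user']['id'].
user_id = tweet["user"]["id"]
print(tags, user_id)  # ['pig'] 42
```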
Tweet metadata

In the remainder of the chapter, we will use metadata from the JSON dataset to model the tweet stream. One example of metadata attached to a tweet is the Place object, which contains geographical information about the user's location. Place contains fields that describe its name, ID, country, country code, and more. A full description can be found at https://dev.twitter.com/docs/platform-objects/places.
place = FOREACH tweets GENERATE (chararray)$0#'place' AS place;
Entities provide structured data from tweets, such as URLs, hashtags, and mentions, without our having to extract them from the text. A description of entities can be found at https://dev.twitter.com/docs/entities. The hashtag entity is an array of tags extracted from a tweet. Each entity has the following two attributes:
Text: the hashtag text
Indices: the character position from which the hashtag was extracted
The following code uses entities:
hashtags_bag = FOREACH tweets {
    GENERATE
        FLATTEN($0#'entities'#'hashtags') AS tag;
}
We then flatten hashtags_bag to extract each hashtag's text:
hashtags = FOREACH hashtags_bag GENERATE tag#'text' AS topic;
Entities for user objects contain information that appears in the user profile and description fields. We can extract the tweet author's ID via the user field in the tweet map:
users = FOREACH tweets GENERATE $0#'user'#'id' AS id;
Data preparation

The SAMPLE built-in operator selects tuples out of the dataset, each with probability p, as follows:
sampled = SAMPLE tweets 0.01;
The preceding command will select approximately 1 percent of the dataset. Given that SAMPLE is probabilistic (http://en.wikipedia.org/wiki/Bernoulli_sampling), there is no guarantee that the sample size will be exact. Moreover, the function samples with replacement, which means that each item might appear more than once.
Apache DataFu implements a number of sampling methods for cases where an exact sample size and no replacement is desired (SimpleRandomSample), for sampling with replacement (SimpleRandomSampleWithReplacementVote and SimpleRandomSampleWithReplacementElect), for when we want to account for sample bias (WeightedRandomSampling), or to sample across multiple relations (SampleByKey).
We can create a sample of exactly 1 percent of the dataset, with each item having the same probability of being selected, using SimpleRandomSample.
Note

The actual guarantee is a sample of size ceil(p*n) with a probability of at least 99 percent.
First, we pass a sampling probability of 0.01 to the UDF constructor:
DEFINE SRS datafu.pig.sampling.SimpleRandomSample('0.01');
and then apply it to the bag to be sampled, created with (GROUP tweets ALL):
sampled = FOREACH (GROUP tweets ALL) GENERATE FLATTEN(SRS(tweets));
The SimpleRandomSample UDF selects without replacement, which means that each item will appear only once.
Note

Which sampling method to use depends on the data we are working with, on assumptions about how items are distributed, on the size of the dataset, and on what we practically want to achieve. In general, when we want to explore a dataset to formulate hypotheses, SimpleRandomSample can be a good choice. However, in several analytics applications, it is common to use methods that assume replacement (for example, bootstrapping).
Note that when working with very large datasets, sampling with replacement and sampling without replacement tend to behave similarly: the probability of an item being selected twice out of a population of billions of items will be low.
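The contrast between the two sampling styles can be sketched in Python on made-up data: SAMPLE-style probabilistic selection produces a sample whose size varies, while an exact-size simple random sample does not.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible
population = list(range(10000))
p = 0.01

# SAMPLE 0.01 style: each tuple kept independently with probability p,
# so the sample size is only approximately p * n.
bernoulli = [x for x in population if random.random() < p]

# SimpleRandomSample style: exactly 100 items, without replacement.
exact = random.sample(population, k=100)

print(len(bernoulli), len(exact))  # roughly 100, exactly 100
```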
Top n statistics

One of the first questions we might want to ask is how frequent certain things are. For instance, we might want to create a histogram of the top 10 topics by the number of mentions. Similarly, we might want to find the top 50 countries or the top 10 users. Before looking at the tweet data, we will define a macro so that we can apply the same selection logic to different collections of items:

DEFINE top_n(rel, col, n)
RETURNS top_n_items {
    grpd = GROUP $rel BY $col;
    cnt_items = FOREACH grpd
        GENERATE FLATTEN(group), COUNT($rel) AS cnt;
    cnt_items_sorted = ORDER cnt_items BY cnt DESC;
    $top_n_items = LIMIT cnt_items_sorted $n;
};
The top_n macro takes as parameters a relation rel, the column col we want to count, and the number of items to return, n. In the Pig Latin block, we first group rel by the items in col, count the number of occurrences of each item, sort them, and select the most frequent n.
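The same group-count-sort-limit logic can be sketched in Python using the standard library; this is a hypothetical analogue of the macro, not how Pig executes it.

```python
from collections import Counter

# Group the items, count occurrences, and keep the n most frequent:
# the same logic as the top_n Pig macro.
def top_n(items, n):
    return Counter(items).most_common(n)

hashtags = ["bigdata", "hadoop", "pig", "hadoop", "bigdata", "hadoop"]
print(top_n(hashtags, 2))  # [('hadoop', 3), ('bigdata', 2)]
```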
To find the top 10 English hashtags, we filter tweets by language and extract the hashtag text:
tweets_en = FILTER tweets BY $0#'lang' == 'en';

hashtags_bag = FOREACH tweets_en {
    GENERATE
        FLATTEN($0#'entities'#'hashtags') AS tag;
}

hashtags = FOREACH hashtags_bag GENERATE tag#'text' AS tag;

Then, we apply the top_n macro:

top_10_hashtags = top_n(hashtags, tag, 10);
In order to better characterize what is trending and make this information more relevant to users, we can drill down into the dataset and look at hashtags per geographic location.

First, we generate a bag of (place, hashtag) tuples, as follows:
hashtags_country_bag = FOREACH tweets {
    GENERATE
        $0#'place' AS place,
        FLATTEN($0#'entities'#'hashtags') AS tag;
}

Then, we extract the country code and hashtag text, as follows:

hashtags_country = FOREACH hashtags_country_bag {
    GENERATE
        place#'country_code' AS co,
        tag#'text' AS tag;
}
Then, we count how many times each country code and hashtag appear together, as follows:

hashtags_country_frequency = FOREACH (GROUP hashtags_country BY (co, tag)) {
    GENERATE
        FLATTEN(group) AS (co, tag),
        COUNT(hashtags_country) AS cnt;
}
Finally, we select the top 10 countries per hashtag with the TOP function, as follows:

hashtags_country_regrouped = GROUP hashtags_country_frequency BY tag;

top_results = FOREACH hashtags_country_regrouped {
    result = TOP(10, 2, hashtags_country_frequency);
    GENERATE FLATTEN(result);
}
TOP's parameters are the number of tuples to return, the index of the column to compare, and the relation containing said column:

top_results = FOREACH D {
    result = TOP(10, 1, C);
    GENERATE FLATTEN(result);
}
The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/topn.pig.
Datetime manipulation

The created_at field in the JSON tweets gives us time-stamped information about when the tweet was posted. Unfortunately, its format is not compatible with Pig's built-in datetime type.

Piggybank comes to the rescue with a number of time manipulation UDFs contained in org.apache.pig.piggybank.evaluation.datetime.convert. One of them is CustomFormatToISO, which converts an arbitrarily formatted timestamp into an ISO 8601 datetime string.

In order to access these UDFs, we first need to register the piggybank.jar file, as follows:

REGISTER /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar

To make our code less verbose, we create an alias for the CustomFormatToISO class's fully qualified Java name:

DEFINE CustomFormatToISO
    org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
Knowing how to manipulate timestamps, we can calculate statistics at different time intervals. For instance, we can look at how many tweets are created per hour. Pig has a built-in GetHour function that extracts the hour out of a datetime type. To use it, we first convert the timestamp string to ISO 8601 with CustomFormatToISO and then convert the resulting chararray to datetime using the built-in ToDate function, as follows:

hourly_tweets = FOREACH tweets {
    GENERATE
        GetHour(
            ToDate(
                CustomFormatToISO(
                    $0#'created_at', 'EEE MMMM d HH:mm:ss Z y')
            )
        ) AS hour;
}
Now, it is just a matter of grouping hourly_tweets by hour and then generating a count of tweets per group, as follows:

hourly_tweets_count = FOREACH (GROUP hourly_tweets BY hour) {
    GENERATE FLATTEN(group), COUNT(hourly_tweets);
}
Sessions

DataFu's Sessionize class can help us better capture user activity over time. A session represents the activity of a user within a given period of time. For instance, we can look at each user's tweet stream at intervals of 15 minutes and measure these sessions to determine both network volumes and user activity:

DEFINE Sessionize datafu.pig.sessions.Sessionize('15m');

users_activity = FOREACH tweets {
    GENERATE
        CustomFormatToISO($0#'created_at',
            'EEE MMMM d HH:mm:ss Z y') AS dt,
        (chararray)$0#'user'#'id' AS user_id;
}
users_activity_sessionized = FOREACH
    (GROUP users_activity BY user_id) {
    ordered = ORDER users_activity BY dt;
    GENERATE FLATTEN(Sessionize(ordered))
        AS (dt, user_id, session_id);
}
users_activity simply records the time dt at which a given user_id posted a status update.

Sessionize takes the session timeout and a bag as input. The first field of each tuple in the input bag must be an ISO 8601 timestamp, and the bag must be sorted by this timestamp. Events that are within 15 minutes of each other will belong to the same session.
It returns the input bag with a new field, session_id, that uniquely identifies a session. With this data, we can calculate each session's length and some other statistics. More examples of Sessionize usage can be found at http://datafu.incubator.apache.org/docs/datafu/guide/sessions.html.
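A minimal sketch of what Sessionize computes, not DataFu's actual code: walk one user's time-ordered events and start a new session ID whenever the gap from the previous event exceeds the timeout.

```python
from datetime import datetime, timedelta

# Assign a session id to each timestamp: a new session starts when the gap
# from the previous event is larger than the timeout.
def sessionize(timestamps, timeout=timedelta(minutes=15)):
    session_id, sessions, previous = 0, [], None
    for ts in sorted(timestamps):
        if previous is not None and ts - previous > timeout:
            session_id += 1
        sessions.append((ts, session_id))
        previous = ts
    return sessions

events = [datetime(2014, 8, 27, 13, 0),
          datetime(2014, 8, 27, 13, 10),  # within 15 minutes: same session
          datetime(2014, 8, 27, 14, 0)]   # 50 minutes later: new session
print([sid for _, sid in sessionize(events)])  # [0, 0, 1]
```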
Capturing user interactions

In the remainder of the chapter, we will look at how to capture patterns from user interactions. As a first step in this direction, we will create a dataset suitable to model a social network. This dataset will contain a timestamp, the ID of the tweet, the user who posted the tweet, the user and tweet being replied to, and the hashtags in the tweet.

Twitter considers as a reply (in_reply_to_status_id_str) any message beginning with the @ character. Such tweets are interpreted as a direct message to that person. Placing an @ character anywhere else in the tweet is interpreted as a mention ('entities'#'user_mentions') and not a reply. The difference is that mentions are immediately broadcast to a person's followers, whereas replies are not. Replies are, however, also counted as mentions.

When working with personally identifiable information, it is a good idea to anonymize, if not remove entirely, sensitive data such as IP addresses, names, and user IDs. A commonly used technique involves a hash function that takes as input the data we want to anonymize, concatenated with additional random data called a salt. The following code shows an example of such anonymization:
DEFINE SHA datafu.pig.hash.SHA();

from_to_bag = FOREACH tweets {
    dt = $0#'created_at';
    user_id = (chararray)$0#'user'#'id';
    tweet_id = (chararray)$0#'id_str';
    reply_to_tweet = (chararray)$0#'in_reply_to_status_id_str';
    reply_to = (chararray)$0#'in_reply_to_user_id_str';
    place = $0#'place';
    topics = $0#'entities'#'hashtags';
    GENERATE
        CustomFormatToISO(dt, 'EEE MMMM d HH:mm:ss Z y') AS dt,
        SHA((chararray)CONCAT('SALT', user_id)) AS source,
        SHA((chararray)CONCAT('SALT', tweet_id)) AS tweet_id,
        ((reply_to_tweet IS NULL)
            ? NULL
            : SHA((chararray)CONCAT('SALT', reply_to_tweet)))
            AS reply_to_tweet_id,
        ((reply_to IS NULL)
            ? NULL
            : SHA((chararray)CONCAT('SALT', reply_to)))
            AS destination,
        (chararray)place#'country_code' AS country,
        FLATTEN(topics) AS topic;
}
-- extract the hashtag text
from_to = FOREACH from_to_bag {
    GENERATE
        dt,
        tweet_id,
        reply_to_tweet_id,
        source,
        destination,
        country,
        (chararray)topic#'text' AS topic;
}
In this example, we use CONCAT to append a (not so random) salt string to personal data. We then generate a hash of the salted IDs with DataFu's SHA function. The SHA function requires its input parameters to be non-null, and we enforce this condition using if-then-else statements; in Pig Latin, this is expressed as &lt;condition&gt; ? &lt;true branch&gt; : &lt;false branch&gt;. If the string is null, we return NULL; if not, we return the salted hash. To make the code more readable, we use aliases for the tweet JSON fields and reference them in the GENERATE block.
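The same salt-then-hash idiom can be sketched in Python with hashlib (SHA-256 is used here as an assumption about DataFu's default). As in the Pig code, the hard-coded 'SALT' is a placeholder; in real use, the salt should be random and kept secret, or the scheme offers little protection against dictionary attacks.

```python
import hashlib

SALT = "SALT"  # placeholder salt, mirroring the Pig example above

# Hash a salted value, passing None through unchanged (the equivalent of the
# Pig null check before calling SHA).
def anonymize(value):
    if value is None:
        return None
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

print(anonymize(None))     # None
print(anonymize("12345"))  # deterministic, so anonymized IDs stay joinable
```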
Link analysis

We can redefine our approach to determining trending topics to include users' reactions. A first, naïve approach could be to consider a topic important if it caused a number of replies larger than a threshold value.

A problem with this approach is that tweets generate relatively few replies, so the volume of the resulting dataset will be low. Hence, it requires a very large amount of data to contain tweets being replied to and produce any result. In practice, we would likely want to combine this metric with other ones (for example, mentions) in order to perform more meaningful analyses.

To satisfy this query, we will create a new dataset that includes the hashtags extracted from both the tweet and the one a user is replying to:
tweet_hashtag = FOREACH from_to GENERATE tweet_id, topic;

from_to_self_joined = JOIN from_to BY reply_to_tweet_id LEFT,
    tweet_hashtag BY tweet_id;

twitter_graph = FOREACH from_to_self_joined {
    GENERATE
        from_to::dt AS dt,
        from_to::tweet_id AS tweet_id,
        from_to::reply_to_tweet_id AS reply_to_tweet_id,
        from_to::source AS source,
        from_to::destination AS destination,
        from_to::topic AS topic,
        from_to::country AS country,
        tweet_hashtag::topic AS topic_replied;
}
Note that Pig does not allow a relation to be joined with itself, hence we have to create tweet_hashtag for the right-hand side of the join. Here, we use the :: operator to disambiguate which relation and column we want to select records from.
Once again, we can look for the top 10 topics by number of replies using the top_n macro:

top_10_topics = top_n(twitter_graph, topic_replied, 10);

Counting things will only take us so far. We can compute more descriptive statistics on this dataset with DataFu. Using the Quantile function, we can calculate the median, the 90th, 95th, and 99th percentiles of the number of hashtag reactions, as follows:

DEFINE Quantile datafu.pig.stats.Quantile('0.5', '0.90', '0.95', '0.99');
Since the UDF expects an ordered bag of integer values as input, we first count the frequency of each topic_replied entry, as follows:

topics_with_replies_grpd = GROUP twitter_graph BY topic_replied;

topics_with_replies_cnt = FOREACH topics_with_replies_grpd {
    GENERATE
        COUNT(twitter_graph) AS cnt;
}
Then, we apply Quantile to the bag of frequencies, as follows:

quantiles = FOREACH (GROUP topics_with_replies_cnt ALL) {
    sorted = ORDER topics_with_replies_cnt BY cnt;
    GENERATE Quantile(sorted);
}
The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/graph.pig.
Influential users

We will use PageRank, an algorithm developed by Google to rank web pages (http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf), to identify influential users in the Twitter graph we generated in the previous section.

This type of analysis has a number of use cases, such as targeted and contextual advertisement, recommendation systems, spam detection, and obviously measuring the importance of web pages. A similar approach, used by Twitter to implement the Who to Follow feature, is described in the research paper WTF: The Who to Follow Service at Twitter, found at http://stanford.edu/~rezab/papers/wtf_overview.pdf.
Informally, PageRank determines the importance of a page based on the importance of the pages linking to it and assigns it a score between 0 and 1. A high PageRank score indicates that a lot of pages point to that page; intuitively, being linked to by pages with a high PageRank is a quality endorsement. In terms of the Twitter graph, we assume that users receiving a lot of replies are important or influential within the social network. In Twitter's case, we consider an extended definition of PageRank, where a link between two users is given by a direct reply and labeled by any hashtags present in the message. Heuristically, we want to identify influential users on a given topic.

In DataFu's implementation, each graph is represented as a bag of (source, edges) tuples. source is an integer ID representing the source node. edges is a bag of (destination, weight) tuples, where destination is an integer ID representing the destination node and weight is a double representing how much the edge should be weighted. The output of the UDF is a bag of (source, rank) pairs, where rank is the PageRank value for the source user in the graph. Notice that we have talked about nodes, edges, and graphs as abstract concepts. In Google's case, nodes are web pages, edges are links from one page to another, and graphs are groups of pages connected directly and indirectly.
In our case, nodes represent users, edges represent in_reply_to_user_id_str mentions, and edges are labeled by the hashtags in tweets. The output of PageRank should suggest which users are influential on a given topic, given their interaction patterns.
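To make the abstract description concrete, here is a compact power-iteration PageRank sketch. The input layout mirrors DataFu's: every source node maps to a bag of (destination, weight) edges. This is an illustrative sketch with a toy graph, not DataFu's implementation; rank mass reaching dangling nodes is simply dropped here for brevity, whereas DataFu can redistribute it.

```python
# edges: {source: [(destination, weight), ...]}
def pagerank(edges, damping=0.85, iterations=50):
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(dst for dst, _ in targets)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # every node gets the teleport share, then weighted contributions
        # from each of its in-neighbours
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in edges.items():
            total_weight = sum(w for _, w in targets)
            for dst, w in targets:
                new_rank[dst] += damping * rank[src] * w / total_weight
        rank = new_rank
    return rank

# Toy reply graph: users 1 and 2 both reply to user 3, so user 3 should
# come out as the most influential node.
ranks = pagerank({1: [(3, 1.0)], 2: [(3, 1.0)]})
print(ranks)
```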
In this section, we will write a pipeline to:

- Represent data as a graph where each node is a user and a hashtag labels the edge
- Map IDs and hashtags to integers so that they can be consumed by PageRank
- Apply PageRank
- Store the results into HDFS in an interoperable format (Avro)

We represent the graph as a bag of tuples in the form (source, destination, topic), where each tuple represents an interaction between nodes. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/pagerank.pig.

We will map users' and hashtags' text to numerical IDs. We use the Java String hashCode() method to perform this conversion step and wrap the logic in an Eval UDF.
Note: The size of an integer is effectively the upper bound for the number of nodes and edges in the graph. For production code, it is recommended that you use a more robust hash function.

The StringToInt class takes a string as input, calls the hashCode() method, and returns the method output to Pig. The UDF code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/udf/com/learninghadoop2/pig/udf/StringToInt.java.
package com.learninghadoop2.pig.udf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class StringToInt extends EvalFunc<Integer> {
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.hashCode();
        } catch (Exception e) {
            throw new IOException("Cannot convert String to Int", e);
        }
    }
}
We extend org.apache.pig.EvalFunc and override the exec method to return str.hashCode() on the function input. The EvalFunc&lt;Integer&gt; class is parameterized with the return type of the UDF (Integer).
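What the UDF computes can be sketched in Python: Java's String.hashCode() accumulates h = 31 * h + char left to right and wraps the result to a signed 32-bit integer, which is why the resulting IDs can collide and can be negative.

```python
# Reimplementation of Java's String.hashCode() for illustration.
def java_string_hashcode(s):
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # reinterpret the 32-bit result as signed, as Java does
    return h - 0x100000000 if h >= 0x80000000 else h

print(java_string_hashcode("abc"))  # 96354, the same value Java returns
```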
Next, we compile the class and archive it into a JAR, as follows:

$ javac -classpath /opt/cloudera/parcels/CDH/lib/pig/pig.jar:$(hadoop classpath) \
    com/learninghadoop2/pig/udf/StringToInt.java
$ jar cvf myudfs-pig.jar com/learninghadoop2/pig/udf/StringToInt.class

We can now register the UDF in Pig and create an alias to StringToInt, as follows:

REGISTER myudfs-pig.jar
DEFINE StringToInt com.learninghadoop2.pig.udf.StringToInt();
We filter out tweets with no destination or no topic, as follows:

tweets_graph_filtered = FILTER twitter_graph BY
    (destination IS NOT NULL) AND
    (topic IS NOT NULL);
Then, we convert the source, destination, and topic to integer IDs:

from_to = FOREACH tweets_graph_filtered {
    GENERATE
        StringToInt(source) AS source_id,
        StringToInt(destination) AS destination_id,
        StringToInt(topic) AS topic_id;
}
Once data is in the appropriate format, we can reuse the implementation of PageRank and the example code (found at https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/linkanalysis/PageRank.java) provided by DataFu, as shown in the following code:

DEFINE PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes', 'true');

We begin by grouping the (source_id, destination_id, topic_id) tuples, as follows:

reply_to = GROUP from_to BY (source_id, destination_id, topic_id);
We count the occurrences of each tuple, that is, how many times two people talked about a topic, as follows:

topic_edges = FOREACH reply_to {
    GENERATE FLATTEN(group), ((double)COUNT(from_to.topic_id)) AS w;
}
Remember that the topic labels the edges of our graph; we begin by creating an association between the source node and the topic, as follows:

topic_edges_grouped = GROUP topic_edges BY (topic_id, source_id);
Then, we regroup it with the purpose of adding the destination nodes and edge weights, as follows:

topic_edges_grouped = FOREACH topic_edges_grouped {
    GENERATE
        group.topic_id AS topic,
        group.source_id AS source,
        topic_edges.(destination_id, w) AS edges;
}
Once we have created the Twitter graph, we calculate the PageRank of all users (source_id):

topic_rank = FOREACH (GROUP topic_edges_grouped BY topic) {
    GENERATE
        group AS topic,
        FLATTEN(PageRank(topic_edges_grouped.(source, edges)))
            AS (source, rank);
}

topic_rank = FOREACH topic_rank GENERATE topic, source, rank;
We store the result in HDFS in Avro format. If the Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR file to our environment before accessing individual fields. Within Pig, for example, on the Cloudera CDH5 VM:

REGISTER /opt/cloudera/parcels/CDH/lib/avro/avro.jar
REGISTER /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar

STORE topic_rank INTO 'replies-pagerank' USING AvroStorage();
Note: In these last two sections, we made a number of implicit assumptions about what a Twitter graph might look like and what the concepts of topic and user interaction mean. Given the constraints that we posed, the resulting social network we analyzed is relatively small and not necessarily representative of the entire Twitter social network. Extrapolating results from this dataset is discouraged. In practice, there are many other factors that should be taken into account to generate a robust model of social interaction.
Summary

In this chapter, we introduced Apache Pig, a platform for large-scale data analysis on Hadoop. In particular, we covered the following topics:

- The goals of Pig as a way of providing a dataflow-like abstraction that does not require hands-on MapReduce development
- How Pig's approach to processing data compares to SQL, where Pig is procedural while SQL is declarative
- Getting started with Pig, which is an easy task, as it is a library that generates custom code and doesn't require additional services
- An overview of the data types, core functions, and extension mechanisms provided by Pig
- Examples of applying Pig to analyze the Twitter dataset in detail, which demonstrated its ability to express complex concepts in a very concise fashion
- How libraries such as Piggybank, Elephant Bird, and DataFu provide repositories of numerous useful prewritten Pig functions

In the next chapter, we will revisit the SQL comparison by exploring tools that expose a SQL-like abstraction over data stored in HDFS.
Chapter 7. Hadoop and SQL

MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. As discussed in earlier chapters, however, it does require a different mindset and some training and experience in the model of breaking analytics processing into a series of map and reduce steps. There are several products built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This chapter will explore the most common of the other abstractions implemented atop Hadoop: SQL.
In this chapter, we will cover the following topics:

- What the use cases for SQL on Hadoop are and why it is so popular
- HiveQL, the SQL dialect introduced by Apache Hive
- Using HiveQL to perform SQL-like analysis of the Twitter dataset
- How HiveQL can approximate common features of relational databases, such as joins and views
- How HiveQL allows the incorporation of user-defined functions into its queries
- How SQL on Hadoop complements Pig
- Other SQL-on-Hadoop products, such as Impala, and how they differ from Hive
Why SQL on Hadoop

So far, we have seen how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL.

Back in 2008, Facebook released Hive, the first widely used implementation of SQL on Hadoop.

Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user.

This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who is familiar with SQL can use Hive.
The combination of these attributes means that HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored in HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of user-defined functions, enabling the base SQL dialect to be customized with business-specific functionality.
Other SQL-on-Hadoop solutions

Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. Later in this chapter, we will also discuss Impala, released in 2013 and already a very popular tool, particularly for low-latency queries. There are others, but we will mostly discuss Hive and Impala, as they have been the most successful.

While introducing the core features and capabilities of SQL on Hadoop, however, we will give examples using Hive. Even though Hive and Impala share many SQL features, they also have numerous differences, and we don't want to constantly caveat each new feature with exactly how it is supported in Hive compared to Impala. We will generally look at aspects of the feature set that are common to both, but if you use both products, it is important to read the latest release notes to understand the differences.
Prerequisites

Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this chapter. We'll create a modified version of an earlier Pig script as the main functionality for this. The script in this chapter assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/blob/master/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows:

-- load JSON data
tweets = load '$inputDir' using
    com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- Tweets
tweets_tsv = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at',
            'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str',
        (chararray)$0#'text' as text,
        (chararray)$0#'in_reply_to',
        (boolean)$0#'retweeted' as is_retweeted,
        (chararray)$0#'user'#'id_str' as user_id,
        (chararray)$0#'place'#'id' as place_id;
}
store tweets_tsv into '$outputDir/tweets'
    using PigStorage('\u0001');

-- Places
needed_fields = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at',
            'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str' as id_str,
        $0#'place' as place;
}
place_fields = foreach needed_fields {
    generate
        (chararray)place#'id' as place_id,
        (chararray)place#'country_code' as co,
        (chararray)place#'country' as country,
        (chararray)place#'name' as place_name,
        (chararray)place#'full_name' as place_full_name,
        (chararray)place#'place_type' as place_type;
}
filtered_places = filter place_fields by co != '';
unique_places = distinct filtered_places;
store unique_places into '$outputDir/places'
    using PigStorage('\u0001');

-- Users
users = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at',
            'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str' as id_str,
        $0#'user' as user;
}
user_fields = foreach users {
    generate
        (chararray)CustomFormatToISO(user#'created_at',
            'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)user#'id_str' as user_id,
        (chararray)user#'location' as user_location,
        (chararray)user#'name' as user_name,
        (chararray)user#'description' as user_description,
        (int)user#'followers_count' as followers_count,
        (int)user#'friends_count' as friends_count,
        (int)user#'favourites_count' as favourites_count,
        (chararray)user#'screen_name' as screen_name,
        (int)user#'listed_count' as listed_count;
}
unique_users = distinct user_fields;
store unique_users into '$outputDir/users'
    using PigStorage('\u0001');
Run this script as follows:

$ pig -f extract_for_hive.pig -param inputDir=&lt;json input&gt; \
    -param outputDir=&lt;output path&gt;
The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to the Unicode value U+0001 (which can also be typed as Ctrl + A). This character is often used as a separator in Hive tables and will be particularly useful to us, as our tweet data could contain tabs in other fields.
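A quick sketch shows why the U+0001 separator matters when field values may themselves contain tabs (the record values below are hypothetical):

```python
# One record in the same layout the Pig script produces: fields joined by
# the U+0001 character.
record = ["2014-08-27T13:08:45.000Z", "42", "a tweet\twith a tab in it"]
line = "\u0001".join(record)

# Splitting on tab would mangle the text field; splitting on U+0001 recovers
# the original fields intact.
print(line.split("\u0001") == record)            # True
print(len(line.split("\t")) == len(record))      # False
```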
Overview of Hive

We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. In this example, and in the remainder of the chapter, we will assume that queries are typed into the shell that can be invoked by executing the hive command.

Recently, a client called Beeline also became available and will likely be the preferred CLI client in the near future.
When importing any new data into Hive, there is generally a three-stage process:

1. Create the specification of the table into which the data is to be imported.
2. Import the data into the created table.
3. Execute HiveQL queries against the table.

Most of the HiveQL statements are direct analogues of similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this chapter, but if you need a refresher, there are numerous good online learning resources.
Hive gives a structured query view of our data; to enable this, we must first define the specification of the table's columns and import the data into the table before we can execute any queries. A table specification is generated using a CREATE statement that specifies the table name, the names and types of its columns, and some metadata about how the table is stored:

CREATE table tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;
The statement creates a new table, tweets, defined by a list of names for the columns in the dataset and their datatypes. We specify that fields are delimited by the Unicode U+0001 character and that the format used to store data is TEXTFILE.

Data can be imported from a location in HDFS (tweets/) using the LOAD DATA statement:

LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets;

By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later.
Once data has been imported into Hive, we can run queries against it. For instance:

SELECT COUNT(*) FROM tweets;

The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase.

If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times.
ThenatureofHivetablesAlthoughHivecopiesthedatafileintoitsworkingdirectory,itdoesnotactuallyprocesstheinputdataintorowsatthatpoint.
BoththeCREATETABLEandLOADDATAstatementsdonottrulycreateconcretetabledataassuch;instead,theyproducethemetadatathatwillbeusedwhenHivegeneratesMapReducejobstoaccessthedataconceptuallystoredinthetablebutactuallyresidingonHDFS.EventhoughtheHiveQLstatementsrefertoaspecifictablestructure,itisHive’sresponsibilitytogeneratecodethatcorrectlymapsthistotheactualon-diskformatinwhichthedatafilesarestored.
This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises.
Hive architecture

Until version 2, Hadoop was primarily a batch system. As we saw in previous chapters, MapReduce jobs tend to have high latency and overhead derived from submission and scheduling. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later.
Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster, either to native MapReduce in Hadoop 1 or to the MapReduce ApplicationMaster running on YARN in Hadoop 2.
Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password.
HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2.
Note

HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2.
Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively.
In the examples we saw before and in the remainder of this chapter, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with its own client, Beeline. For compatibility and maturity reasons, Beeline being relatively new, both tools are available on Cloudera and most other major distributions. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded mode with the following command:
$ beeline -u jdbc:hive2://
Data types

HiveQL supports many of the common data types provided by standard database systems. These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. Hive data types fall into the following five broad categories:
Numeric: tinyint, smallint, int, bigint, float, double, and decimal
Date and time: timestamp and date
String: string, varchar, and char
Collections: array, map, struct, and uniontype
Misc: boolean, binary, and NULL
DDL statements

HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. SHOW [DATABASES, TABLES, VIEWS] displays the databases currently available within a data warehouse and which table and view metadata is present within the database currently in use:
CREATE DATABASE twitter;
SHOW databases;
USE twitter;
SHOW TABLES;
The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception.
Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports unicode characters in column names. Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally.
The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code:
CREATE EXTERNAL TABLE tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/tweets';
This table will be created in the metastore, but the data will not be copied into the /user/hive/warehouse directory.
Tip
Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse.
The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we can create a view to isolate retweets from other messages, as follows:
CREATE VIEW retweets
COMMENT 'Tweets that have been retweeted'
AS SELECT * FROM tweets WHERE retweeted = true;
Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views.
The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected.
Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment or to add and replace columns.
When adding columns, it is important to remember that only metadata will be changed and not the dataset itself. This means that if we were to add a column in the middle of the table which didn't exist in older files, then while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format. We will discuss data and schema migrations in Chapter 8, Data Lifecycle Management, when discussing Avro.
Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view.
File formats and storage

The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH.
Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause.
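As an illustration of what a delimited SerDe does, the following Python sketch (my own illustration, not Hive's implementation) splits one Ctrl-A-delimited text line into the columns of the tweets table used in this chapter:

```python
# Illustrative sketch only: mimic a delimited SerDe splitting one text
# line into column values. The '\u0001' (Ctrl-A) separator matches the
# FIELDS TERMINATED BY clause used for the tweets table in this chapter.
TWEET_COLUMNS = ["created_at", "tweet_id", "text", "in_reply_to",
                 "retweeted", "user_id", "place_id"]

def parse_delimited_row(line, columns=TWEET_COLUMNS, sep="\u0001"):
    """Split a delimited line and pair each value with its column name."""
    values = line.rstrip("\n").split(sep)
    return dict(zip(columns, values))

row = parse_delimited_row(
    "2014-01-01\u0001id1\u0001hello\u0001\u0001false\u0001u1\u0001p1")
```

Note that every field comes back as a string; it is the table's declared column types that tell Hive how to interpret each value.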
Hive currently uses the following FileFormat classes to read and write HDFS files:
TextInputFormat and HiveIgnoreKeyTextOutputFormat: read/write data in plain text file format
SequenceFileInputFormat and SequenceFileOutputFormat: read/write data in the Hadoop SequenceFile format
Additionally, the following SerDe classes can be used to serialize and deserialize data:
MetadataTypedColumnsetSerDe: reads/writes delimited records such as CSV or tab-separated records
ThriftSerDe and DynamicSerDe: read/write Thrift objects
JSON

As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules.
We can use either module to load JSON tweets without any need for preprocessing, just defining a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde.
As with any third-party module, we load the SerDe JAR into Hive with the following code:

ADD JAR json-serde-1.3-jar-with-dependencies.jar;
Then, we issue the usual CREATE statement, as follows:
CREATE EXTERNAL TABLE tweets (
    contributors string,
    coordinates struct<
        coordinates: array<float>,
        type: string>,
    created_at string,
    entities struct<
        hashtags: array<struct<
            indices: array<tinyint>,
            text: string>>,
    …
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'tweets';
With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'.
Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary.
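The core idea behind such schema generation can be sketched in a few lines of Python. This toy version (my own illustration, far simpler than hive-json) walks a single JSON document and emits a Hive-style type for each value:

```python
import json

# Toy sketch of schema inference in the spirit of hive-json: map each
# JSON value to a Hive-style type string. A real tool must merge many
# documents and handle nulls and mixed types; this shows only the idea.
def hive_type(value):
    if isinstance(value, bool):   # check bool first: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        return "array<%s>" % (hive_type(value[0]) if value else "string")
    if isinstance(value, dict):
        fields = ", ".join("%s: %s" % (k, hive_type(v))
                           for k, v in value.items())
        return "struct<%s>" % fields
    return "string"

doc = json.loads('{"created_at": "x", '
                 '"entities": {"hashtags": [{"text": "hi"}]}}')
schema = {name: hive_type(value) for name, value in doc.items()}
```

Running this over one tweet-like document produces nested struct and array types similar to those in the CREATE statement above.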
In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen_name and description fields of a user object with the following code:
SELECT user.screen_name, user.description FROM tweets_json LIMIT 10;
Avro

AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format. Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose.
As an example, let's load into Hive the PageRank dataset we generated in Chapter 6, Data Analysis with Apache Pig. This dataset was created using Pig's AvroStorage class, and has the following schema:
{
    "type": "record",
    "name": "record",
    "fields": [
        {"name": "topic", "type": ["null", "int"]},
        {"name": "source", "type": ["null", "int"]},
        {"name": "rank", "type": ["null", "float"]}
    ]
}
The table structure is captured in an Avro record, which contains header information (a name and an optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string.
For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and it is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use the following schema.
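A quick Python sketch (illustrative only, not the Avro library's API) shows how such a union accepts a value that matches either branch:

```python
# Illustrative sketch, not the Avro library API: a value satisfies a
# union such as ["null", "int"] if it matches any branch. This mirrors
# the nullable-column idiom in the PageRank schema above.
CHECKS = {
    "null": lambda v: v is None,
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "float": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}

def matches_union(value, union):
    """Return True if the value is valid for at least one union branch."""
    return any(CHECKS[branch](value) for branch in union)
```

So None and 42 both satisfy ["null", "int"], while a string satisfies neither branch and is rejected.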
With this definition, we can now create a Hive table that uses this schema for its table specification, as follows:
CREATE EXTERNAL TABLE tweets_pagerank
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{
    "type": "record",
    "name": "record",
    "fields": [
        {"name": "topic", "type": ["null", "int"]},
        {"name": "source", "type": ["null", "int"]},
        {"name": "rank", "type": ["null", "float"]}
    ]
}')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '${data}/ch5-pagerank';
Then, look at the following table definition from within Hive (note that HCatalog, which we'll introduce in Chapter 8, Data Lifecycle Management, also supports such definitions):
DESCRIBE tweets_pagerank;
OK
topic    int      from deserializer
source   int      from deserializer
rank     float    from deserializer
In the DDL, we told Hive that data is stored in Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro schema embedded in avro.schema.literal.
Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. Create the preceding schema in a file called pagerank.avsc (this is the standard file extension for Avro schemas). Then place it on HDFS; we prefer to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property: WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc').
If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. Within Hive, on the Cloudera CDH5 VM:

ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar;
We can also use this table like any other. For instance, we can query the data to select the source and topic pairs with a high PageRank:

SELECT source, topic FROM tweets_pagerank WHERE rank >= 0.9;
In Chapter 8, Data Lifecycle Management, we will see how Avro and avro.schema.url play an instrumental role in enabling schema migrations.
Columnar stores

Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats.
If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in a SequenceFile, each full row and all its columns will be read from disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest.
Traditional relational databases also store data on a row basis, and a type of database called columnar changed this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns will be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries.
Queries

Tables can be queried using the familiar SELECT…FROM statement. The WHERE clause allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records. For instance, the following code returns the top 10 most prolific users in the dataset:
SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt
DESC LIMIT 10;
The output is as follows:
2263949659    4
1332188053    4
959468857     3
1367752118    3
362562944     3
58646041      3
2375296688    3
1468188529    3
37114209      3
2385040940    3
We can improve the readability of the Hive output by setting the following:

SET hive.cli.print.header=true;

This will instruct hive, though not beeline, to print column names as part of the output.
Tip

You can add the command to the .hiverc file, usually found in the root of the executing user's home directory, to have it applied to all hive CLI sessions.
HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into Hive using external tables.
We first create a user table to store user data, as follows:
CREATE EXTERNAL TABLE user (
    created_at string,
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/users';
We then create a place table to store location data, as follows:
CREATE EXTERNAL TABLE place (
    place_id string,
    country_code string,
    country string,
    `name` string,
    full_name string,
    place_type string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/places';
We can use the JOIN operator to display the names of the 10 most prolific users, as follows:
SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt
FROM tweets
JOIN user ON user.user_id = tweets.user_id
GROUP BY tweets.user_id, user.user_id, user.name
ORDER BY cnt DESC LIMIT 10;
Tip

Only equality, outer, and left (semi) joins are supported in Hive.
Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets table.
We can rewrite the previous query as follows:
SELECT tweets.user_id, u.name, COUNT(*) AS cnt
FROM tweets
JOIN (SELECT user_id, name FROM user GROUP BY user_id, name) u
ON u.user_id = tweets.user_id
GROUP BY tweets.user_id, u.name
ORDER BY cnt DESC LIMIT 10;
Instead of directly joining the user table, we execute a subquery, as follows:

SELECT user_id, name FROM user GROUP BY user_id, name;
The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries, historically only permitting a subquery in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause also.
HiveQL is an ever-evolving, rich language, a full exposition of which is beyond the scope of this chapter. A description of its query and DDL capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.
Structuring Hive tables for given workloads

Often Hive isn't used in isolation; instead, tables are created with particular workloads in mind or are invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios.
Partitioning a table

With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning.
When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS.
It’simportanttounderstandthecardinalityofthepartitioncolumn.Withtoofewdistinctvalues,thebenefitsarereducedasthefilesarestillverylarge.Iftherearetoomanyvalues,thenqueriesmightneedalargenumberoffilestobescannedtoaccessalltherequireddata.Perhapsthemostcommonpartitionkeyisonebasedondate.Wecould,forexample,partitionourusertablefromearlierbasedonthecreated_atcolumn,thatis,thedatetheuserwasfirstregistered.Notethatsincepartitioningatablebydefinitionaffectsitsfilestructure,wecreatethistablenowasanon-externalone,asfollows:
CREATE TABLE partitioned_user (
    created_at string,
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
) PARTITIONED BY (created_at_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;
To load data into a partition, we can explicitly give a value for the partition into which to insert the data, as follows:
INSERT INTO TABLE partitioned_user
PARTITION (created_at_date = '2014-01-01')
SELECT
    created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count
FROM user;
This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=5000;
The first two statements enable all partitions (nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node.
We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row:
INSERT INTO TABLE partitioned_user
PARTITION (created_at_date)
SELECT
    created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count,
    to_date(created_at) as created_at_date
FROM user;
Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause.
Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table. In the preceding code, we use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string.
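A rough Python equivalent of that conversion is shown below; the input timestamp format here is an assumption for illustration, while Hive's to_date accepts its own timestamp formats:

```python
from datetime import datetime

# Rough stand-in for Hive's to_date(): keep only the date part of a
# 'yyyy-MM-dd HH:mm:ss' timestamp string. The input format is an
# assumption for illustration, not Hive's parsing logic.
def to_date(timestamp):
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date().isoformat()
```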
Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/<key>=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-01.
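This layout rule can be sketched as a small path-building helper (my own illustration of the convention, not Hive code):

```python
# Sketch of Hive's partition layout convention: one key=value
# subdirectory per partition column under the table directory.
def partition_path(warehouse, database, table, partition_spec):
    """partition_spec is an ordered list of (column, value) pairs."""
    parts = "/".join("%s=%s" % (k, v) for k, v in partition_spec)
    return "/".join([warehouse, database, table, parts])

path = partition_path("/user/hive/warehouse", "default", "partitioned_user",
                      [("created_at_date", "2014-04-01")])
```

With multiple partition columns, each additional key simply nests one directory level deeper.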
If data is added directly to the filesystem, for instance by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition:
ALTER TABLE <table_name> ADD PARTITION (<partition_spec>) LOCATION '<location>';
To add metadata for all partitions not currently present in the metastore, we can use the MSCK REPAIR TABLE <table_name>; statement. On EMR, this is equivalent to executing the following statement:

ALTER TABLE <table_name> RECOVER PARTITIONS;
Notice that both statements will also work with EXTERNAL tables. In the following chapter, we will see how this pattern can be exploited to create flexible and interoperable pipelines.
Overwriting and updating data

Partitioning is also useful when we need to update a portion of a table. Normally, a statement of the following form will replace all the data for the destination table:
INSERT OVERWRITE TABLE <table> …
If OVERWRITE is omitted, then each INSERT statement will add additional data to the table. Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched.
If we perform an INSERT OVERWRITE statement (or a LOAD OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement.
We can also add caveats to the SELECT statement. Say, for example, we only wanted to update data for a certain month:
INSERT INTO TABLE partitioned_user
PARTITION (created_at_date)
SELECT created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count,
    to_date(created_at) as created_at_date
FROM user
WHERE to_date(created_at) BETWEEN '2014-03-01' AND '2014-03-31';
Bucketing and sorting

Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure.
Let’screatebucketedversionsofourtweetsandusertables;notethefollowingadditionalCLUSTERBYandSORTBYstatementsintheCREATETABLEstatements:
CREATE TABLE bucketed_tweets (
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) PARTITIONED BY (created_at string)
CLUSTERED BY (user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;
CREATE TABLE bucketed_user (
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
) PARTITIONED BY (created_at string)
CLUSTERED BY (user_id) SORTED BY (name) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;
Note that we changed the tweets table to also be partitioned; you can only bucket a table that is partitioned.
Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table:
SET hive.enforce.bucketing=true;
Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT…SELECT… syntax to populate the bucketed table.
When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause.
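The allocation rule can be sketched as follows. This is an illustration of the principle only; the Java-String-style hash below is a deterministic stand-in, not Hive's exact hash function:

```python
# Sketch of bucket allocation: bucket = hash(clustered-by value) mod
# number of buckets. The Java-String-style hash is a stand-in for
# illustration; Hive's actual hash function differs.
def string_hash(s):
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0x7FFFFFFF  # keep the value non-negative
    return h

def bucket_for(user_id, num_buckets=64):
    return string_hash(user_id) % num_buckets
```

The important property is that the same key always lands in the same bucket, which is what makes the bucketed join optimization below possible.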
One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example. So, for example, any query of the following form would be vastly improved:
SET hive.optimize.bucketmapjoin=true;
SELECT …
FROM bucketed_user u JOIN bucketed_tweets t
ON u.user_id = t.user_id;
With the join being performed on the column used to bucket the table, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id values in both tables. While determining which rows to match, only those in the bucket need to be compared against, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second.
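This congruence is easy to check numerically. The sketch below (my own illustration, using plain modular arithmetic rather than Hive's hash function) lists which buckets of the larger table can hold keys from a given bucket of the smaller one:

```python
# With a modulo-based bucket function, a key in bucket b of a table with
# `small` buckets can only land in buckets {b, b + small, ...} of a table
# with `large` buckets, provided large is a multiple of small.
def buckets_to_probe(bucket, small, large):
    """Buckets of the large table that can contain keys from the given
    bucket of the small table."""
    return sorted(b for b in range(large) if b % small == bucket)
```

For bucket 3 of a 32-bucket table joined against a 64-bucket table, this yields exactly buckets 3 and 35, matching the example above.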
Sampling data

Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size.
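The scaling caveat amounts to simple arithmetic; a sketch with made-up numbers:

```python
# Illustrative arithmetic for the scaling caveat above: a count taken
# from a 1-of-64 bucket sample estimates the full-table count only
# after multiplying by the inverse of the sampled fraction.
def estimate_total(sample_count, buckets_sampled, total_buckets):
    return sample_count * total_buckets // buckets_sampled

estimate = estimate_total(150, 1, 64)  # 150 rows seen in 1 of 64 buckets
```

This yields only an estimate, of course; its accuracy depends on how evenly the sampled column distributes rows across buckets.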
For a non-bucketed table, you can sample in a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table:
SELECT max(friends_count)
FROM user TABLESAMPLE(BUCKET 2 OUT OF 64 ON name);
In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column. It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function.
Though successful, this is highly inefficient, as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure the number of buckets sampled is equal to or a multiple of the buckets in the table, then Hive will only read the buckets in question. For example:
SELECT MAX(friends_count)
FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);
In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition.
A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as either a percentage of the table, as an absolute data size, or as a number of rows (in each block). The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data, or 100 rows per split, respectively:
TABLESAMPLE(0.5 PERCENT)
TABLESAMPLE(1G)
TABLESAMPLE(100 ROWS)
If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input format and file formats that are supported.
Writing scripts

We can place Hive commands in a file and run them with the -f option in the hive CLI utility:
$ cat show_tables.hql
show tables;
$ hive -f show_tables.hql
We can parameterize HiveQL statements by means of the hiveconf mechanism. This allows us to specify an environment variable name at the point it is used rather than at the point of invocation. For example:
$catshow_tables2.hql
showtableslike'${hiveconf:TABLENAME}';
$hive-hiveconfTABLENAME=user-fshow_tables2.hql
The variable can also be set within the Hive script or an interactive session:
SET TABLENAME='user';
The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows:
$ cat show_tables3.hql
show tables like '${hivevar:TABLENAME}';
$ hive -hivevar TABLENAME=user -f show_tables3.hql
Or we can write the command interactively:
SET hivevar:TABLENAME='user';
Hive and Amazon Web Services
With Elastic MapReduce as the AWS Hadoop-on-demand service, it is of course possible to run Hive on an EMR cluster. But it is also possible to use Amazon storage services, particularly S3, from any Hadoop cluster, be it within EMR or your own local cluster.
Hive and S3
As mentioned in Chapter 2, Storage, it is possible to specify a default filesystem other than HDFS for Hadoop, and S3 is one option. But it doesn't have to be an all-or-nothing thing; it is possible to have specific tables stored in S3. The data for these tables will be retrieved into the cluster to be processed, and any resulting data can either be written to a different S3 location (the same table cannot be the source and destination of a single query) or onto HDFS.
We can take a file of our tweet data and place it onto a location in S3 with a command such as the following:
$ aws s3 cp tweets.tsv s3://<bucket-name>/tweets/
We firstly need to specify the access key and secret access key that can access the bucket. This can be done in three ways:
- Set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey to the appropriate values in the Hive CLI
- Set the same values in hive-site.xml, though note this limits use of S3 to a single set of credentials
- Specify the table location explicitly in the table URL, that is, s3n://<access key>:<secret access key>@<bucket>/<path>
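As an illustration of the second option, a hive-site.xml fragment might look like the following sketch; the key values are placeholders, and the exact file location depends on your installation:

```xml
<!-- hive-site.xml: one set of S3 credentials shared by all sessions.
     The values below are placeholders, not real credentials. -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

The per-session equivalent of the first option is to issue SET fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY; and the corresponding SET for the secret key in the Hive CLI.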
Then we can create a table referencing this data, as follows:
CREATE TABLE remote_tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) CLUSTERED BY (user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3n://<bucket-name>/tweets';
This can be an incredibly effective way of pulling S3 data into a local Hadoop cluster for processing.
Note
In order to use AWS credentials in the URI of an S3 location, regardless of how the parameters are passed, the secret and access keys must not contain /, +, =, or \ characters. If necessary, a new set of credentials can be generated from the IAM console at https://console.aws.amazon.com/iam/.
In theory, you can just leave the data in the external table and refer to it when needed, even though it often makes sense to pull the data into a local table and do future processing from there, avoiding WAN data transfer latencies (and costs). If the table is partitioned, then you might find yourself retrieving a new partition each day, for example.
Hive on Elastic MapReduce
On one level, using Hive within Amazon Elastic MapReduce is just the same as everything discussed in this chapter. You can create a persistent cluster, log into the master node, and use the Hive CLI to create tables and submit queries. Doing all this will use the local storage on the EC2 instances for the table data.
Not surprisingly, jobs on EMR clusters can also refer to tables whose data is stored on S3 (or DynamoDB). And also not surprisingly, Amazon has made extensions to its version of Hive to make all this very seamless. It is quite simple from within an EMR job to pull data from a table stored in S3, process it, write any intermediate data to the EMR local storage, and then write the output results into S3, DynamoDB, or one of a growing list of other AWS services.
The pattern mentioned earlier, where new data is added to a new partition directory for a table each day, has proved very effective in S3; it is often the storage location of choice for large and incrementally growing datasets. There is a syntax difference when using EMR; instead of the MSCK command mentioned earlier, the command to update a Hive table with new data added to a partition directory is as follows:
ALTER TABLE <table-name> RECOVER PARTITIONS;
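For reference, the MSCK command mentioned earlier, which performs the equivalent partition discovery on stock Apache Hive, takes the following form:

```sql
MSCK REPAIR TABLE <table-name>;
```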
Consult the EMR documentation for the latest enhancements at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html. Also, consult the broader EMR documentation; in particular, the integration points with other AWS services are an area of rapid growth.
Extending HiveQL
The HiveQL language can be extended by means of plugins and third-party functions. In Hive, there are three types of functions, characterized by the number of rows they take as input and produce as output:
- User Defined Functions (UDFs): simple functions that act on one row at a time.
- User Defined Aggregate Functions (UDAFs): take multiple rows as input and generate a single row as output. These are aggregate functions to be used in conjunction with a GROUP BY statement (similar to COUNT(), AVG(), MIN(), MAX(), and so on).
- User Defined Table Functions (UDTFs): take one row as input and generate a logical table comprised of multiple rows that can be used in join expressions.
Tip
These APIs are provided only in Java. For other languages, it is possible to stream data through a user-defined script using the TRANSFORM, MAP, and REDUCE clauses that act as a frontend to Hadoop's streaming capabilities.
Two APIs are available to write UDFs. A simple API, org.apache.hadoop.hive.ql.exec.UDF, can be used for functions that take and return basic writable types. A richer API, which provides support for data types other than writables, is available in the org.apache.hadoop.hive.ql.udf.generic.GenericUDF package. We'll now illustrate how org.apache.hadoop.hive.ql.exec.UDF can be used to implement a string-to-ID function similar to the one we used in Chapter 6, Data Analysis with Apache Pig, to map hashtags to integers. Building a UDF with this API only requires extending the UDF class and writing an evaluate() method, as follows:
package com.learninghadoop2.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class StringToInt extends UDF {
    public Integer evaluate(Text input) {
        if (input == null)
            return null;
        String str = input.toString();
        return str.hashCode();
    }
}
The function takes a Text object as input and maps it to an integer value with the hashCode() method. The source code of this function can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/udf/com/learninghadoop2/hive/udf/StringToInt.java.
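Stripped of its Hive dependencies, the core logic is easy to exercise on its own. The following sketch uses a hypothetical stringToInt helper that mirrors evaluate():

```java
// Sketch of the UDF's core logic without Hive dependencies.
// stringToInt is a hypothetical stand-in for evaluate(): same null
// handling, same reliance on String.hashCode().
public class StringToIntDemo {
    static Integer stringToInt(String input) {
        if (input == null) {
            return null;             // a NULL column value maps to NULL
        }
        return input.hashCode();     // deterministic, but not collision-free
    }

    public static void main(String[] args) {
        // The mapping is stable across calls, which is what makes it
        // usable as a lookup-table key; collisions remain possible.
        System.out.println(stringToInt("#hadoop").equals(stringToInt("#hadoop")));
        System.out.println(stringToInt(null) == null);
    }
}
```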
Tip
As noted in Chapter 6, Data Analysis with Apache Pig, a more robust hash function should be used in production.
We compile the class and archive it into a JAR file, as follows:
$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* \
    com/learninghadoop2/hive/udf/StringToInt.java
$ jar cvf myudfs-hive.jar com/learninghadoop2/hive/udf/StringToInt.class
Before being able to use it, a UDF must be registered in Hive with the following commands:
ADD JAR myudfs-hive.jar;
CREATE TEMPORARY FUNCTION string_to_int AS
'com.learninghadoop2.hive.udf.StringToInt';
The ADD JAR statement adds a JAR file to the distributed cache. The CREATE TEMPORARY FUNCTION <function> AS <class> statement registers a function in Hive that implements a given Java class. The function will be dropped once the Hive session is closed. As of Hive 0.13, it is possible to create permanent functions, whose definitions are kept in the metastore, using CREATE FUNCTION ….
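A sketch of the permanent form, assuming Hive 0.13 or later and that the JAR has been copied to a location accessible to the whole cluster (the HDFS path below is a placeholder):

```sql
CREATE FUNCTION string_to_int
AS 'com.learninghadoop2.hive.udf.StringToInt'
USING JAR 'hdfs:///path/to/myudfs-hive.jar';
```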
Once registered, string_to_int can be used in a query just like any other function. In the following example, we first extract a list of hashtags from the tweet's text by applying regexp_extract. Then, we use string_to_int to map each tag to a numerical ID:
SELECT unique_hashtags.hashtag,
       string_to_int(unique_hashtags.hashtag) AS tag_id
FROM (
    SELECT regexp_extract(text,
        '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag
    FROM tweets
    GROUP BY regexp_extract(text,
        '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags
GROUP BY unique_hashtags.hashtag,
         string_to_int(unique_hashtags.hashtag);
Just as we did in the previous chapter, we can use the preceding query to create a lookup table:
CREATE TABLE lookuptable (tag string, tag_id bigint);
INSERT OVERWRITE TABLE lookuptable
SELECT unique_hashtags.hashtag,
       string_to_int(unique_hashtags.hashtag) AS tag_id
FROM (
    SELECT regexp_extract(text,
        '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag
    FROM tweets
    GROUP BY regexp_extract(text,
        '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags
GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag);
Programmatic interfaces
In addition to the hive and beeline command-line tools, it is possible to submit HiveQL queries to the system via the JDBC and Thrift programmatic interfaces. Support for ODBC was bundled in older versions of Hive, but as of Hive 0.12, it needs to be built from scratch. More information on this process can be found at https://cwiki.apache.org/confluence/display/Hive/HiveODBC.
JDBC
A Hive client written using JDBC APIs looks exactly the same as a client program written for other database systems (for example, MySQL). The following is a sample Hive client program using JDBC APIs. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/clients/com/learninghadoop2/hive/client/HiveJdbcClient.java.
package com.learninghadoop2.hive.client;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";
    // Connection string
    public static String URL = "jdbc:hive2://localhost:10000";
    // Show all tables in the default database
    public static String QUERY = "show tables";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection(URL);
        Statement stmt = con.createStatement();
        ResultSet resultSet = stmt.executeQuery(QUERY);
        while (resultSet.next()) {
            System.out.println(resultSet.getString(1));
        }
    }
}
The URL part is the JDBC URI that describes the connection endpoint. The format for establishing a remote connection is jdbc:hive2://<host>:<port>/<database>. Connections in embedded mode can be established by not specifying a host or port, as in jdbc:hive2://.
hive and hive2 are the drivers to be used when connecting to HiveServer and HiveServer2, respectively. QUERY contains the HiveQL query to be executed.
TipHive’sJDBCinterfaceexposesonlythedefaultdatabase.Inordertoaccessotherdatabases,youneedtoreferencethemexplicitlyintheunderlyingqueriesusingthe<database>.<table>notation.
FirstweloadtheHiveServer2JDBCdriverorg.apache.hive.jdbc.HiveDriver.
Tip
Use org.apache.hadoop.hive.jdbc.HiveDriver to connect to HiveServer.
Then, as with any other JDBC program, we establish a connection to URL and use it to instantiate a Statement object. We execute QUERY, with no authentication, and store the output dataset in the ResultSet object. Finally, we scan resultSet and print its contents to the command line.
Compile and execute the example with the following commands:
$ javac HiveJdbcClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar: \
    com.learninghadoop2.hive.client.HiveJdbcClient
Thrift
Thrift provides lower-level access to Hive and has a number of advantages over the JDBC implementation of HiveServer. Primarily, it allows multiple connections from the same client, and it allows programming languages other than Java to be used with ease. With HiveServer2, it is a less commonly used option, but still worth mentioning for compatibility. A sample Thrift client implemented using the Java API can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/clients/com/learninghadoop2/hive/client/HiveThriftClient.java. This client can be used to connect to HiveServer, but due to protocol differences, the client won't work with HiveServer2.
In the example, we define a getClient() method that takes as input the host and port of a HiveServer service and returns an instance of org.apache.hadoop.hive.service.ThriftHive.Client.
A client is obtained by first instantiating a socket connection, org.apache.thrift.transport.TSocket, to the HiveServer service, and by specifying a protocol, org.apache.thrift.protocol.TBinaryProtocol, to serialize and transmit data, as follows:
TSocket transport = new TSocket(host, port);
transport.setTimeout(TIMEOUT);
transport.open();
TBinaryProtocol protocol = new TBinaryProtocol(transport);
client = new ThriftHive.Client(protocol);
We call getClient() from the main method and use the client to execute a query against an instance of HiveServer running on localhost on port 11111, as follows:
public static void main(String[] args) throws Exception {
    Client client = getClient("localhost", 11111);
    client.execute("show tables");
    List<String> results = client.fetchAll();
    for (String result : results) {
        System.out.println(result);
    }
}
Make sure that HiveServer is running on port 11111, and if not, start an instance with the following command:
$ sudo hive --service hiveserver -p 11111
Compile and execute the HiveThriftClient.java example with:
$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* \
    com/learninghadoop2/hive/client/HiveThriftClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*: \
    com.learninghadoop2.hive.client.HiveThriftClient
Stinger initiative
Hive has remained very successful and capable since its earliest releases, particularly in its ability to provide SQL-like processing on enormous datasets. But other technologies did not stand still, and Hive acquired a reputation of being relatively slow, particularly in regard to lengthy startup times on large jobs and its inability to give quick responses to conceptually simple queries.
These perceived limitations were less due to Hive itself and more a consequence of how the translation of SQL queries into the MapReduce model has much built-in inefficiency when compared to other ways of implementing a SQL query. Particularly in regard to very large datasets, MapReduce saw lots of I/O (and consequently time) spent writing out the results of one MapReduce job just to have them read by another. As discussed in Chapter 3, Processing – MapReduce and Beyond, this is a major driver in the design of Tez, which can schedule jobs on a Hadoop cluster as a graph of tasks that does not require inefficient writes and reads between them.
The following is a query we will use to compare execution on the MapReduce framework versus Tez:
SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;
The following figure contrasts the execution plan for the preceding query on the MapReduce framework versus Tez:
Hive on MapReduce versus Tez
In plain MapReduce, two jobs are created for the GROUP BY and JOIN clauses. The first job is composed of a set of MapReduce tasks that read data from the disk to carry out grouping. The reducers write intermediate results to the disk so that output can be synchronized. The mappers in the second job read the intermediate results from the disk, as well as data from table b. The combined dataset is then passed to the reducers, where shared keys are joined. Were we to execute an ORDER BY statement, this would have resulted in a third job and further MapReduce passes. The same query is executed on Tez as a single job by a single set of map tasks that read data from the disk; grouping and joining are pipelined across the reducers without intermediate I/O.
Alongside these architectural limitations, there were quite a few areas around SQL language support that could also provide better efficiency, and in early 2013, the Stinger initiative was launched with an explicit goal of making Hive over 100 times as fast and with much richer SQL support. Hive 0.13 has all the features of the three phases of Stinger, resulting in a much more complete SQL dialect. Also, Tez is offered as an execution framework in addition to a MapReduce-based implementation atop YARN, which is more efficient than previous implementations on Hadoop 1 MapReduce.
With Tez as the execution engine, Hive is no longer limited to a series of linear MapReduce jobs and can instead build a processing graph where any given step can, for example, stream results to multiple sub-steps.
To take advantage of the Tez framework, there is a new hive variable setting:
set hive.execution.engine=tez;
This setting relies on Tez being installed on the cluster; it is available in source form from http://tez.apache.org or in several distributions, though at the time of writing, not Cloudera's.
The alternative value is mr, which uses the classic MapReduce model (atop YARN), so it is possible in a single installation to compare the performance of Hive with and without Tez.
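A session sketch of such a comparison, reusing the join query from earlier (this assumes Tez is installed and that the place and tweets tables exist):

```sql
-- Run once on Tez...
SET hive.execution.engine=tez;
SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;

-- ...then again on classic MapReduce, and compare the reported times.
SET hive.execution.engine=mr;
SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;
```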
Impala
Hive is not the only product providing SQL-on-Hadoop capability. The second most widely used is likely Impala, announced in late 2012 and released in spring 2013. Though originally developed internally within Cloudera, its source code is periodically pushed to an open source Git repository (https://github.com/cloudera/impala).
Impala was created out of the same perception of Hive's weaknesses that led to the Stinger initiative.
Impala also took some inspiration from Google Dremel (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf), which was first openly described in a paper published in 2010. Dremel was built at Google to address the gap between the need for very fast queries on very large datasets and the high latency inherent in the existing MapReduce model underpinning Hive at the time. Dremel was a sophisticated approach to this problem that, rather than building mitigations atop MapReduce such as those implemented by Hive, instead created a new service that accessed the same data stored in HDFS. Dremel also benefited from significant work to optimize the storage format of the data in a way that made it more amenable to very fast analytic queries.
The architecture of Impala
The basic architecture has three main components: the Impala daemons, the statestore, and the clients. Recent versions have added additional components that improve the service, but we'll focus on the high-level architecture.
The Impala daemon (impalad) should be run on each host where a DataNode process is managing HDFS data. Note that impalad does not access the filesystem blocks through the full HDFS FileSystem API; instead, it uses a feature called short-circuit reads to make data access more efficient.
When a client submits a query, it can do so to any of the running impalad processes, and this one will become the coordinator for the execution of that query. The key aspect of Impala's performance is that for each query, it generates custom native code, which is then pushed to and executed by all the impalad processes on the system. This highly optimized code performs the query on the local data, and each impalad then returns its subset of the result set to the coordinator node, which performs the final data consolidation to produce the final result. This type of architecture should be familiar to anyone who has worked with any of the (usually commercial and expensive) Massively Parallel Processing (MPP) data warehouse solutions available today (MPP being the term used for this type of shared-nothing scale-out architecture). As the cluster runs, the statestore daemon ensures that each impalad process is aware of all the others and provides a view of the overall cluster health.
Co-existing with Hive
Impala, as a newer product, tends to have a more restricted set of SQL data types and supports a more constrained dialect of SQL than Hive. It is, however, expanding this support with each new release. Refer to the Impala documentation (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/impala.html) to get an overview of the current level of support.
Impala supports the metastore mechanism used by Hive to persistently store the metadata surrounding its table structure and storage. This means that on a cluster with an existing Hive setup, it should be immediately possible to use Impala, as it will access the same metastore and therefore provide access to the same tables available in Hive.
But be warned that the differences in SQL dialect and data types might cause unexpected results when working in a combined Hive and Impala environment. Some queries might work on one but not the other, they might show very different performance characteristics (more on this later), or they might actually give different results. This last point might become apparent when using data types such as float and double, which are simply treated differently in the underlying systems (Hive is implemented in Java while Impala is written in C++).
As of version 1.2, Impala supports UDFs written in both C++ and Java, although C++ is strongly recommended as a much faster solution. Keep this in mind if you are looking to share custom functions between Hive and Impala.
A different philosophy
When Impala was first released, its greatest benefit was in how it truly enabled what is often called speed-of-thought analysis. Queries could be returned sufficiently fast that an analyst could explore a thread of analysis in a completely interactive fashion, without having to wait for minutes at a time for each query to complete. It's fair to say that most adopters of Impala were at times stunned by its performance, especially when compared to the version of Hive shipping at the time.
The Impala focus has remained mostly on these shorter queries, and this does impose some limitations on the system. Impala tends to be quite memory-heavy, as it relies on in-memory processing to achieve much of its performance. If a query requires a dataset to be held in memory that is larger than the memory available on the executing node, then that query will simply fail in versions of Impala before 2.0.
Comparing the work on Stinger to Impala, it could be argued that Impala has a much stronger focus on excelling at the shorter (and arguably more common) queries that support interactive data analysis. Many business intelligence tools and services are now certified to run directly on Impala. The Stinger initiative has put less effort into making Hive just as fast in the area where Impala excels but has instead improved Hive (to varying degrees) for all workloads. Impala is still developing at a fast pace, and Stinger has put additional momentum into Hive, so it is most likely wise to consider both products and determine which best meets the performance and functionality requirements of your projects and workflows.
It should also be kept in mind that there are competitive commercial pressures shaping the direction of Impala and Hive. Impala was created and is still driven by Cloudera, the most popular vendor of Hadoop distributions. The Stinger initiative, though contributed to by many companies as diverse as Microsoft (yes, really!) and Intel, was led by Hortonworks, probably the second largest vendor of Hadoop distributions. The fact is that if you are using the Cloudera distribution of Hadoop, then some of the core features of Hive might be slower to arrive, whereas Impala will always be up-to-date. Conversely, if you use another distribution, you might get the latest Hive release, but that might either have an older Impala or, as is currently the case, you might have to download and install it yourself.
A similar situation has arisen with the Parquet and ORC file formats mentioned earlier. Parquet is preferred by Impala and developed by a group of companies led by Cloudera, while ORC is preferred by Hive and is championed by Hortonworks.
Unfortunately, the reality is that Parquet support is often very quick to arrive in the Cloudera distribution but less so in, say, the Hortonworks distribution, where the ORC file format is preferred.
These themes are a little concerning since, although competition in this space is a good thing, and arguably the announcement of Impala helped energize the Hive community, there is a greater risk that your choice of distribution might have a larger impact on the tools and file formats that will be fully supported, unlike in the past. Hopefully, the current situation is just an artifact of where we are in the development cycles of all these new and improved technologies, but do consider your choice of distribution carefully in relation to your SQL-on-Hadoop needs.
Drill, Tajo, and beyond
You should also consider that SQL on Hadoop no longer refers only to Hive or Impala. Apache Drill (http://drill.apache.org) is a fuller implementation of the Dremel model first described by Google. Although Impala implements the Dremel architecture across HDFS data, Drill looks to provide similar functionality across multiple data sources. It is still in its early stages, but if your needs are broader than what Hive or Impala provides, it might be worth considering.
Tajo (http://tajo.apache.org) is another Apache project that seeks to be a full data warehouse system on Hadoop data. With an architecture similar to that of Impala, it offers a much richer system, with components such as multiple optimizers and ETL tools that are commonplace in traditional data warehouses but less frequently bundled in the Hadoop world. It has a much smaller user base but has been used by certain companies very successfully for a significant length of time, and it might be worth considering if you need a fuller data warehousing solution.
Other products are also emerging in this space, and it's a good idea to do some research. Hive and Impala are awesome tools, but if you find that they don't meet your needs, then look around; something else might.
Summary
In its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world.
HiveQL is an implementation of SQL on Hadoop and was the primary focus of this chapter. In regard to HiveQL and its implementations, we covered the following topics:
- How HiveQL provides a logical model atop data stored in HDFS, in contrast to relational databases where the table structure is enforced in advance
- How HiveQL supports many standard SQL data types and commands, including joins and views
- The ETL-like features offered by HiveQL, including the ability to import data into tables and optimize the table structure through partitioning and similar mechanisms
- How HiveQL offers the ability to extend its core set of operators with user-defined code, and how this contrasts with the Pig UDF mechanism
- The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez
- The broader ecosystem around HiveQL that now includes products such as Impala, Tajo, and Drill, and how each of these focuses on specific areas in which to excel
With Pig and Hive, we've introduced alternative models to process MapReduce data, but so far we've not looked at another question: what approaches and tools are required to actually allow this massive dataset being collected in Hadoop to remain useful and manageable over time? In the next chapter, we'll take a slight step up the abstraction hierarchy and look at how to manage the lifecycle of this enormous data asset.
Chapter 8. Data Lifecycle Management
Our previous chapters were quite technology-focused, describing particular tools or techniques and how they can be used. In this and the next chapter, we are going to take a more top-down approach, whereby we will describe a problem space you are likely to encounter and then explore how to address it. In particular, we'll cover the following topics:
- What we mean by the term data lifecycle management
- Why data lifecycle management is something to think about
- The categories of tools that can be used to address the problem
- How to use these tools to build the first half of a Twitter sentiment analysis pipeline
What data lifecycle management is
Data doesn't exist only at a point in time. Particularly for long-running production workflows, you are likely to acquire a significant quantity of data in a Hadoop cluster. Requirements rarely stay static for long, so alongside new logic you might also see the format of that data change or require multiple data sources to be used to provide the dataset processed in your application. We use the term data lifecycle management to describe an approach to handling the collection, storage, and transformation of data that ensures data is where it needs to be, in the format it needs to be in, in a way that allows data and system evolution over time.
ImportanceofdatalifecyclemanagementIfyoubuilddataprocessingapplications,youarebydefinitionreliantonthedatathatisprocessed.Justasweconsiderthereliabilityofapplicationsandsystems,itbecomesnecessarytoensurethatthedataisalsoproduction-ready.
DataatsomepointneedstobeingestedintoHadoop.Itisonepartofanenterpriseandoftenhasmultiplepointsofintegrationwithexternalsystems.Iftheingestofdatacomingfromthosesystemsisnotreliable,thentheimpactonthejobsthatprocessthatdataisoftenasdisruptiveasamajorsystemfailure.Dataingestbecomesacriticalcomponentinitsownright.Andwhenwesaytheingestneedstobereliable,wedon’tjustmeanthatdataisarriving;italsohastobearrivinginaformatthatisusableandthroughamechanismthatcanhandleevolutionovertime.
The problem with many of these issues is that they do not arise in a significant fashion until the flows are large, the system is critical, and the business impact of any problems is non-trivial. Ad hoc approaches that worked for a less critical data flow often will simply not scale, and will be very painful to replace on a live system.
Tools to help

But don't panic! There are a number of categories of tools that can help with the data lifecycle management problem. We'll give examples of the following three broad categories in this chapter:
- Orchestration services: building an ingest pipeline usually has multiple discrete stages, and we will use an orchestration tool to allow these to be described, executed, and managed
- Connectors: given the importance of integration with external systems, we will look at how we can use connectors to simplify the abstractions provided by Hadoop storage
- File formats: how we store the data impacts how we manage format evolution over time, and several rich storage formats have ways of supporting this
Building a tweet analysis capability

In earlier chapters, we used various implementations of Twitter data analysis to describe several concepts. We will take this capability to a deeper level and approach it as a major case study.
In this chapter, we will build a data ingest pipeline, constructing a production-ready dataflow that is designed with reliability and future evolution in mind.
We'll build out the pipeline incrementally throughout the chapter. At each stage, we'll highlight what has changed, but we can't include full listings at each stage without trebling the size of the chapter. The source code for this chapter, however, has every iteration in its full glory.
Getting the tweet data

The first thing we need to do is get the actual tweet data. As in previous examples, we can pass the -j and -n arguments to stream.py to dump JSON tweets to stdout:
$ stream.py -j -n 10000 > tweets.json
Since we have this tool that can create a batch of sample tweets on demand, we could start our ingest pipeline by having this job run on a periodic basis. But how?
Introducing Oozie

We could, of course, bang rocks together and use something like cron for simple job scheduling, but recall that we want an ingest pipeline that is built with reliability in mind. So, we really want a scheduling tool that we can use to detect failures and otherwise respond to exceptional situations.
The tool we will use here is Oozie (http://oozie.apache.org), a workflow engine and scheduler built with a focus on the Hadoop ecosystem.
Oozie provides a means to define a workflow as a series of nodes with configurable parameters and controlled transition from one node to the next. It is installed as part of the Cloudera QuickStart VM, and the main command-line client is, not surprisingly, called oozie.
Note
We've tested the workflows in this chapter against version 5.0 of the Cloudera QuickStart VM, and at the time of writing Oozie in the latest version, 5.1, has some issues. There's nothing particularly version-specific in our workflows, however, so they should be compatible with any correctly working Oozie v4 implementation.
Though powerful and flexible, Oozie can take a little getting used to, so we'll give some examples and describe what we are doing along the way.
The most common node in an Oozie workflow is an action. It is within action nodes that the steps of the workflow are actually executed; the other node types handle management of the workflow in terms of decisions, parallelism, and failure detection. Oozie has multiple types of actions that it can perform. One of these is the shell action, which can be used to execute any command on the system, such as native binaries, shell scripts, or any other command-line utility. Let's create a script to generate a file of tweets and copy this to HDFS:
set -e
source twitter.keys
python stream.py -j -n 500 > /tmp/tweets.out
hdfs dfs -put /tmp/tweets.out /tmp/tweets/tweets.out
rm -f /tmp/tweets.out
Note that the first line will cause the entire script to fail should any of the included commands fail. We use an environment file to provide the Twitter keys to our script in twitter.keys, which is of the following form:
export TWITTER_CONSUMER_KEY=<value>
export TWITTER_CONSUMER_SECRET=<value>
export TWITTER_ACCESS_KEY=<value>
export TWITTER_ACCESS_SECRET=<value>
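The effect of that set -e line can be seen without any Hadoop machinery at all. The following Python sketch runs two tiny scripts through sh to contrast the default behavior with the fail-fast behavior:

```python
import subprocess

# Without set -e, the shell continues past the failing command.
without = subprocess.run(
    ["sh", "-c", "false; echo reached"],
    capture_output=True, text=True
)

# With set -e, the failing command aborts the script immediately,
# so 'reached' is never printed and the exit code is non-zero.
with_e = subprocess.run(
    ["sh", "-c", "set -e; false; echo reached"],
    capture_output=True, text=True
)

print(without.stdout.strip())   # -> reached
print(with_e.stdout.strip())    # -> (empty)
print(with_e.returncode != 0)   # -> True
```

This is why a single failing hdfs dfs -put in the script above aborts the whole step rather than silently deleting the local file anyway.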
Oozie uses XML to describe its workflows, usually stored in a file called workflow.xml. Let's walk through the definition for an Oozie workflow that calls a shell command.
The schema for an Oozie workflow is called workflow-app, and we can give the workflow a specific name. This is useful when viewing job history in the CLI or Oozie web UI. In the examples in this book, we'll use an increasing version number to allow us to more easily separate the iterations within the source repository. This is how we give the workflow-app a specific name:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="v1">
Oozie workflows are made up of a series of connected nodes, each of which represents a step in the process, and which are represented by XML nodes in the workflow definition. Oozie has a number of nodes that deal with the transition of the workflow from one step to the next. The first of these is the start node, which simply states the name of the first node to be executed as part of the workflow, as follows:
<start to="fs-node"/>
We then have the definition for the named start node. In this case, it is an action node, which is the generic node type for most Oozie nodes that actually perform some processing, as follows:
<action name="fs-node">
Action is a broad category of nodes, and we will typically specialize it with the particular processing for this given node. In this case, we are using the fs node type, which allows us to perform filesystem operations:
<fs>
We want to ensure that the directory on HDFS to which we wish to copy the file of tweet data exists, is empty, and has suitable permissions. We do this by trying to delete the directory if it exists, then creating it, and finally applying the required permissions, as follows:
<delete path="${nameNode}/tmp/tweets"/>
<mkdir path="${nameNode}/tmp/tweets"/>
<chmod path="${nameNode}/tmp/tweets" permissions="777"/>
</fs>
We’llseeanalternativewayofsettingupdirectorieslater.Afterperformingthefunctionalityofthenode,Oozieneedsknowhowtoproceedwiththeworkflow.Inmostcases,thiswillcomprisemovingtoanotheractionnodeifthisnodewassuccessfulandabortingtheworkflowotherwise.Thisisspecifiedbythenextelements.Theoknodegivesthenameofthenodetowhichtotransitioniftheexecutionwassuccessful;theerrornodenamesthedestinationnodeforfailurescenarios.Here’showtheokandfailnodesareused:
<okto="shell-node"/>
<errorto="fail"/>
</action>
<action name="shell-node">
The second action node is again specialized with its specific processing type; in this case, we have a shell node:
<shell xmlns="uri:oozie:shell-action:0.2">
The shell action then has the Hadoop JobTracker and NameNode locations specified. Note that the actual values are given by variables; we'll explain where they come from later. The JobTracker and NameNode are specified as follows:
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
As mentioned in Chapter 3, Processing – MapReduce and Beyond, MapReduce uses multiple queues to provide support for different approaches to resource scheduling. The next element specifies the MapReduce queue to which the workflow should be submitted:
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
Now that the shell node is fully configured, we can specify the command to invoke, again via a variable, as follows:
<exec>${EXEC}</exec>
The various steps of Oozie workflows are executed as MapReduce jobs; this shell action will therefore be executed as a specific task instance on a particular TaskTracker. We need to specify which files should be copied to the local working directory on the TaskTracker machine before the action can be performed. In this case, we need to copy the main shell script, the Python tweet generator, and the Twitter config file, as follows:
<file>${workflowRoot}/${EXEC}</file>
<file>${workflowRoot}/twitter.keys</file>
<file>${workflowRoot}/stream.py</file>
After closing the shell element, we again specify what to do depending on whether the action completed successfully or not. Because MapReduce is used for job execution, the majority of node types by definition have built-in retry and recovery logic, though this is not the case for shell nodes:
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
If the workflow fails, let's just kill it in this case. The kill node type does exactly that: it stops the workflow from proceeding to any further steps, usually logging error messages along the way. Here's how the kill node type is used:
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
The end node, on the other hand, simply halts the workflow and logs it as a successful completion within Oozie:
<end name="end"/>
</workflow-app>
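With the workflow assembled, a quick development-time sanity check can catch broken transitions before submitting anything. The following sketch is our own convenience, not part of Oozie: it parses a workflow document and reports any to="…" target that doesn't name a defined node. The sample below deliberately omits shell-node so that the check has something to find:

```python
import xml.etree.ElementTree as ET

def undefined_transitions(workflow_xml):
    """Return transition targets that don't name a defined node.

    An empty set means every to="..." attribute points at a node that
    exists, so the workflow graph is at least wired up correctly.
    """
    root = ET.fromstring(workflow_xml)
    # Node definitions are the named top-level children (action, kill, end).
    defined = {el.get("name") for el in root if el.get("name")}
    # Transitions are any to="..." attribute (on start, ok, error elements).
    targets = {el.get("to") for el in root.iter() if el.get("to")}
    return targets - defined

sample = """
<workflow-app xmlns="uri:oozie:workflow:0.4" name="v1">
  <start to="fs-node"/>
  <action name="fs-node">
    <fs><mkdir path="/tmp/tweets"/></fs>
    <ok to="shell-node"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>failed</message></kill>
  <end name="end"/>
</workflow-app>
"""
print(undefined_transitions(sample))   # -> {'shell-node'}
```

Oozie itself rejects such workflows at submission time; checking locally just shortens the feedback loop.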
The obvious question is what the preceding variables represent and from where they get their concrete values. They are examples of the Oozie Expression Language, often referred to as EL.
Alongside the workflow definition file (workflow.xml), which describes the steps in the flow, we also need to create a configuration file that gives the specific values for a given execution of the workflow. This separation of functionality and configuration allows us to write workflows that can be used on different clusters, on different file locations, or with different variable values without having to recreate the workflow itself. By convention, this file is usually named job.properties. For the preceding workflow, here's a sample job.properties file.
Firstly, we specify the location of the JobTracker, the NameNode, and the MapReduce queue to which to submit the workflow. The following should work on the Cloudera 5.0 QuickStart VM, though in v5.1 the hostname has been changed to quickstart.cloudera. The important thing is that the specified NameNode and JobTracker addresses need to be in the Oozie whitelist; the local services on the VM are added automatically:
jobTracker=localhost.localdomain:8032
nameNode=hdfs://localhost.localdomain:8020
queueName=default
Next, we set some values for where the workflow definitions and associated files can be found on the HDFS filesystem. Note the use of a variable representing the username running the job. This allows a single workflow to be applied to different paths depending on the submitting user, as follows:
tasksRoot=book
workflowRoot=${nameNode}/user/${user.name}/${tasksRoot}/v1
oozie.wf.application.path=${nameNode}/user/${user.name}/${tasksRoot}/v1
Next, we name the command to be executed in the workflow as ${EXEC}:
EXEC=gettweets.sh
More complex workflows will require additional entries in the job.properties file; the preceding workflow is as simple as it gets.
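To make the ${var} references concrete, here is a minimal Python model of how such property values can be resolved. This is a simplification for illustration only, not Oozie's actual EL implementation (which also supports functions such as wf:errorMessage()); values such as user.name are supplied by the server at runtime, modeled here as "builtins":

```python
import re

def resolve_properties(text, builtins=None):
    """Parse name=value lines and expand ${var} references against
    earlier definitions and supplied built-in values."""
    props = dict(builtins or {})
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Replace each ${ref} with an already-known value; unknown
        # references are left untouched.
        value = re.sub(
            r"\$\{([^}]+)\}",
            lambda m: props.get(m.group(1), m.group(0)),
            value,
        )
        props[name.strip()] = value
    return props

sample = """
nameNode=hdfs://localhost.localdomain:8020
tasksRoot=book
workflowRoot=${nameNode}/user/${user.name}/${tasksRoot}/v1
"""
props = resolve_properties(sample, builtins={"user.name": "cloudera"})
print(props["workflowRoot"])
# -> hdfs://localhost.localdomain:8020/user/cloudera/book/v1
```

Note how workflowRoot composes three earlier values; this layering is what lets one workflow.xml serve many clusters and users.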
The oozie command-line tool needs to know where the Oozie server is running. This can be added as an argument to every Oozie shell command, but that gets unwieldy very quickly. Instead, you can set the shell environment variable, as follows:
$ export OOZIE_URL='http://localhost:11000/oozie'
After all that work, we can now actually run an Oozie workflow. Create a directory on HDFS as specified by the values in the job.properties file. With the preceding configuration, we'd be creating this as book/v1 under our home directory on HDFS. Copy the stream.py, gettweets.sh, and twitter.properties files to that directory; these are the files required to perform the actual execution of the shell command. Then, add the workflow.xml file to the same directory.
To run the workflow, we then do the following:
$ oozie job -run -config <path-to-job.properties>
If submitted successfully, Oozie will print the job name to the screen. You can see the current status of this workflow with:
$ oozie job -info <job-id>
You can also check the logs for the job:
$ oozie job -log <job-id>
In addition, all current and recent jobs can be viewed with:
$ oozie jobs
A note on HDFS file permissions

There is a subtle aspect in the shell command that can catch the unwary. As an alternative to having the fs node, we could instead include a prepare element within the shell node to create the directory we need on the filesystem. It would look like the following:
<prepare>
<mkdir path="${nameNode}/tmp/tweets"/>
</prepare>
The prepare stage is executed by the user who submitted the workflow, but since the actual script execution is performed on YARN, it is usually executed as the yarn user. You might hit a problem where the script generates the tweets, the /tmp/tweets directory is created on HDFS, but the script then fails to have permission to write to that directory. You can either resolve this through assigning permissions more precisely or, as shown earlier, add a filesystem node to encapsulate the needed operations. We'll use a mixture of both techniques in this chapter; for non-shell nodes, we'll use prepare elements, particularly if the needed directory is manipulated only by that node. For cases where a shell node is involved or where the created directories will be used across multiple nodes, we'll be safe and use the more explicit fs node.
Making development a little easier

It can sometimes get awkward to manage the files and resources for an Oozie job during development. Some need to be on HDFS, while some need to be local, and changes to some files require changes to others. The easiest approach is often to develop or make changes in a complete clone of the workflow directory on the local filesystem and push changes from there to the similarly named directory in HDFS, not forgetting, of course, to ensure that all changes are under revision control! For operational execution of the workflow, the job.properties file is the only thing that needs to be on the local filesystem and, conversely, all the other files need to be on HDFS. Always remember this: it's all too easy to make changes to a local copy of a workflow, forget to push the changes to HDFS, and then be confused as to why the workflow isn't reflecting the changes.
Extracting data and ingesting into Hive

With our data on HDFS, we can now extract the separate datasets for tweets and users, and place data as in previous chapters. We can reuse extract_for_hive.pig to parse the raw tweet JSON into separate files, store them again on HDFS, and then follow up with a Hive step that ingests these new files into Hive tables for tweets, users, and places.
To do this within Oozie, we'll need to add two new nodes to our workflow: a Pig action for the first step and a Hive action for the second.
For our Hive action, we'll just create three external tables that point to the files generated by Pig. This would then allow us to follow our previously described model of ingesting into temporary or external tables and using HiveQL INSERT statements from there to insert into the operational, and often partitioned, tables. This create.hql script can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch8/v2/hive/create.hql but is simply of the following form:
CREATE DATABASE IF NOT EXISTS twttr;
USE twttr;
DROP TABLE IF EXISTS tweets;
CREATE EXTERNAL TABLE tweets(
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/tweets';
DROP TABLE IF EXISTS user;
CREATE EXTERNAL TABLE user(
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/users';
DROP TABLE IF EXISTS place;
CREATE EXTERNAL TABLE place(
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/places';
Note that the file separator on each table is also explicitly set to match what we are outputting from Pig. In addition to this, locations in both scripts are specified by variables for which we will provide concrete values in our job.properties file.
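Since agreement on the separator is what ties the Pig output to the Hive table definitions, it is worth seeing the round trip in miniature. This Python sketch (with made-up field values) writes and re-reads a record using the same \u0001 delimiter:

```python
# Hive's default field delimiter is \x01 (Ctrl-A); the tables above
# declare it explicitly so they match what Pig writes out.
DELIM = "\u0001"

record = ["1234567890", "some tweet text", "en"]
line = DELIM.join(record)        # one row as stored on HDFS

# Reading the row back means splitting on the same delimiter:
fields = line.split(DELIM)
print(fields)   # -> ['1234567890', 'some tweet text', 'en']

# Ctrl-A is a good choice because, unlike tabs or commas, it almost
# never appears in free-form text such as tweets, so no escaping is
# needed for the fields themselves.
assert DELIM not in "some tweet text"
```

If the two sides disagreed on the delimiter, Hive would silently parse every row into a single mangled column rather than failing loudly.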
With the preceding statements, we can create the Pig node for our workflow, found in the source code as v2 of the pipeline. Much of the node definition looks similar to the shell node used previously, as we set the same configuration elements; also notice our use of the prepare element to create the needed output directory:
<action name="pig-node">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/${outputDir}"/>
<mkdir path="${nameNode}/${outputDir}"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
As with the shell command, we need to tell the Pig action the location of the actual Pig script. This is specified in the following script element:
<script>${workflowRoot}/pig/extract_for_hive.pig</script>
We also need to modify the command line used to invoke the Pig script to add several parameters. The following elements do this; note the construction pattern wherein one element adds the actual parameter name and the next its value (we'll see an alternative mechanism for passing arguments in the next section):
<argument>-param</argument>
<argument>inputDir=${inputDir}</argument>
<argument>-param</argument>
<argument>outputDir=${outputDir}</argument>
</pig>
Because we want to move from this step to the Hive node, we need to set the following elements appropriately:
<ok to="hive-node"/>
<error to="fail"/>
</action>
The Hive action itself is a little different from the previous nodes; even though it starts in a similar fashion, it specifies the Hive action-specific namespace, as follows:
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
The Hive action needs many of the configuration elements used by Hive itself and, in most cases, we copy the hive-site.xml file into the workflow directory and specify its location, as shown in the following XML; note that this mechanism is not Hive-specific and can also be used for custom actions:
<job-xml>${workflowRoot}/hive-site.xml</job-xml>
In addition, we might need to override some MapReduce default configuration properties, as shown in the following XML, where we specify that intermediate compression should be used for our job:
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
After configuring the Hive environment, we now specify the location of the Hive script:
<script>${workflowRoot}/hive/create.hql</script>
We also have to provide the mechanism to pass arguments to the Hive script. But instead of building out the command line one component at a time, we'll add the param elements that map the name of a configuration element in the job.properties file to variables specified in the Hive script; this mechanism is also supported with Pig actions:
<param>dbName=${dbName}</param>
<param>ingestDir=${ingestDir}</param>
</hive>
The Hive node then closes in the same way as the others:
<ok to="end"/>
<error to="fail"/>
</action>
We now need to put all this together to run the multistage workflow in Oozie. The full workflow.xml file can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch8/v2, and the workflow is visualized in the following diagram:
Data ingestion workflow v2
This workflow performs all the steps discussed before; it generates tweet data, extracts subsets of data via Pig, and then ingests these into Hive.
A note on workflow directory structure

We now have quite a few files in our workflow directory, and it is best to adopt some structure and naming conventions. For the current workflow, our directory on HDFS looks like the following:
/hive/
/hive/create.hql
/lib/
/pig/
/pig/extract_for_hive.pig
/scripts/
/scripts/gettweets.sh
/scripts/stream-json-batch.py
/scripts/twitter-keys
/hive-site.xml
/job.properties
/workflow.xml
The model we follow is to keep configuration files in the top-level directory but to keep files related to a given action type in dedicated subdirectories. Note that it is useful to have a lib directory even if empty, as some node types look for it.
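A small local sanity check, run before pushing the directory to HDFS, can catch a missing file or the easily forgotten empty lib directory. This is our own convenience script, not part of the book's toolchain; the required names are taken from the listing above:

```python
import os
import tempfile

REQUIRED_FILES = ["workflow.xml", "job.properties", "hive-site.xml"]
REQUIRED_DIRS = ["hive", "pig", "scripts", "lib"]

def check_workflow_dir(path):
    """List anything missing from the conventional layout shown above."""
    missing = [f for f in REQUIRED_FILES
               if not os.path.isfile(os.path.join(path, f))]
    missing += [d for d in REQUIRED_DIRS
                if not os.path.isdir(os.path.join(path, d))]
    return missing

# Demo against an empty scratch directory: everything is reported missing.
scratch = tempfile.mkdtemp()
print(sorted(check_workflow_dir(scratch)))
```

Running this against the local clone before each push to HDFS takes a second and avoids a failed Oozie submission several minutes later.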
With the preceding structure, the job.properties file for our combined job is now the following:
jobTracker=localhost.localdomain:8032
nameNode=hdfs://localhost.localdomain:8020
queueName=default
tasksRoot=book
workflowRoot=${nameNode}/user/${user.name}/${tasksRoot}/v2
oozie.wf.application.path=${nameNode}/user/${user.name}/${tasksRoot}/v2
oozie.use.system.libpath=true
EXEC=gettweets.sh
inputDir=/tmp/tweets
outputDir=/tmp/tweetdata
ingestDir=/tmp/tweetdata
dbName=twttr
In the preceding code, we've fully updated the workflow.xml definition to include all the steps described so far, including an initial fs node to create the required directory without worrying about user permissions.
Introducing HCatalog

If we look at our current workflow, there is inefficiency in how we use HDFS as the interface between Pig and Hive. We need to output the result of our Pig script onto HDFS, where the Hive script can then use it as the location of some new tables. What this highlights is that it is often very useful to have data stored in Hive, but this is limited, as few tools (primarily Hive) can access the Hive metastore and hence read and write such data. If we think about it, Hive has two main layers: its tools for accessing and manipulating its data, plus the execution framework to run queries on that data.
The HCatalog subproject of Hive effectively provides an independent implementation of the first of these layers: the means to access and manipulate data in the Hive metastore. HCatalog provides mechanisms for other tools, such as Pig and MapReduce, to natively read and write table-structured data that is stored on HDFS.
Remember, of course, that the data is stored on HDFS in one format or another. The Hive metastore provides the models to abstract these files into the relational table structure familiar from Hive. So when we say we are storing data in HCatalog, what we really mean is that we are storing data on HDFS in such a way that it can then be exposed by table structures specified within the Hive metastore. Conversely, when we refer to Hive data, what we really mean is data whose metadata is stored in the Hive metastore, and which can be accessed by any metastore-aware tool, such as HCatalog.
Using HCatalog

The HCatalog command-line tool is called hcat and will be preinstalled on the Cloudera QuickStart VM; it is installed, in fact, with any version of Hive from 0.11 onward.
The hcat utility doesn't have an interactive mode, so generally you will use it with explicit command-line arguments or by pointing it at a file of commands, as follows:
$ hcat -e "use default; show tables"
$ hcat -f commands.hql
Though the hcat tool is useful and can be incorporated into scripts, the more interesting element of HCatalog for our purposes here is its integration with Pig. HCatalog defines a new Pig loader called HCatLoader and a storer called HCatStorer. As the names suggest, these allow Pig scripts to read from or write to Hive tables directly. We can use this mechanism to replace our previous Pig and Hive actions in our Oozie workflow with a single HCatalog-based Pig action that writes the output of the Pig job directly into our tables in Hive.
For clarity, we'll create new tables named tweets_hcat, places_hcat, and users_hcat into which we'll insert this data; note that these are no longer external tables:
CREATE TABLE tweets_hcat …
CREATE TABLE places_hcat …
CREATE TABLE users_hcat …
Note that if we had these commands in a script file, we could use the hcat CLI tool to execute them, as follows:
$ hcat -f create.hql
The hcat CLI tool does not, however, offer an interactive shell akin to the Hive CLI. We can now use our previous Pig script and need only change the store commands, replacing the use of PigStorage with HCatStorer. Our updated Pig script, extract_to_hcat.pig, therefore includes store commands such as the following:
store tweets_tsv into 'twttr.tweets_hcat' using
org.apache.hive.hcatalog.pig.HCatStorer();
Note that the package name for the HCatStorer class has the org.apache.hive.hcatalog prefix; when HCatalog was in the Apache incubator, it used org.apache.hcatalog for its package prefix. This older form is now deprecated, and the new form that explicitly shows HCatalog as a subproject of Hive should be used instead.
With this new Pig script, we can now replace our previous Pig and Hive actions with an updated Pig action using HCatalog. This also requires the first usage of the Oozie sharelib, which we'll discuss in the next section. In our workflow definition, the pig element of this action will be defined as shown in the following XML and can be found as v3 of the pipeline in the source bundle; in v3, we've also added a utility Hive node to run before the Pig node to ensure that all necessary tables exist before the Pig script that requires them is executed.
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>${workflowRoot}/hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>oozie.action.sharelib.for.pig</name>
<value>pig,hcatalog</value>
</property>
</configuration>
<script>${workflowRoot}/pig/extract_to_hcat.pig</script>
<argument>-param</argument>
<argument>inputDir=${inputDir}</argument>
</pig>
The two changes of note are the addition of the explicit reference to the hive-site.xml file, which is required by HCatalog, and the new configuration element that tells Oozie to include the required HCatalog JARs.
The Oozie sharelib

That last addition touched on an important aspect of Oozie we've not mentioned thus far: the Oozie sharelib. When Oozie runs its various action types, it requires multiple JARs to access Hadoop and to invoke various tools, such as Hive and Pig. As part of the Oozie installation, a large number of dependent JARs have been placed on HDFS to be used by Oozie and its various action types: this is the Oozie sharelib.
For most usages of Oozie, it's enough to know that the sharelib exists, usually under /user/oozie/share/lib on HDFS, and that sometimes, as in the previous example, some explicit configuration values need to be added. When using a Pig action, the Pig JARs will automatically get picked up, but when the Pig script uses something like HCatalog, this dependency will not be explicitly known to Oozie.
The Oozie CLI allows manipulation of the sharelib, though the scenarios where this will be required are outside the scope of this book. The following command can be useful, though, to see which components are included in the Oozie sharelib:
$ oozie admin -shareliblist
The following command is useful to see the individual JARs comprising a particular component within the sharelib, in this case HCatalog:
$ oozie admin -shareliblist hcat
These commands can be useful to verify that the required JARs are being included and to see which specific versions are being used.
HCatalog and partitioned tables

If you rerun the previous workflow a second time, it will fail; dig into the logs, and you will see HCatalog complaining that it cannot write to a table that already contains data. This is a current limitation of HCatalog; it views tables, and partitions within tables, as immutable by default. Hive, on the other hand, will add new data to a table or partition; its default view of a table is that it is mutable.
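The difference between the two defaults can be modeled in a few lines of Python. This is purely an illustration of the semantics just described, not the HCatalog API:

```python
class PartitionedTable:
    """Toy model of immutable-partition vs append semantics."""

    def __init__(self, immutable=True):
        self.immutable = immutable
        self.partitions = {}   # partition_key -> list of rows

    def write(self, partition_key, rows):
        if self.immutable and partition_key in self.partitions:
            # Mirrors the failure described above: a second write to a
            # populated partition is rejected outright.
            raise ValueError("partition %r already contains data" % partition_key)
        self.partitions.setdefault(partition_key, []).extend(rows)

# HCatalog-style: the second write to partition 1 fails.
hcat_style = PartitionedTable(immutable=True)
hcat_style.write(1, ["row-a"])
try:
    hcat_style.write(1, ["row-b"])
except ValueError as e:
    print("rejected:", e)

# Hive-style: the second write appends.
hive_style = PartitionedTable(immutable=False)
hive_style.write(1, ["row-a"])
hive_style.write(1, ["row-b"])
print(hive_style.partitions[1])   # -> ['row-a', 'row-b']
```

The workaround the chapter adopts next follows directly from this model: if populated partitions cannot be rewritten, give each run a partition of its own.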
Upcoming changes to Hive and HCatalog will see the support of a new table property that will control this behavior in either tool; for example, the following added to a table definition would allow table appends as supported in Hive today:
TBLPROPERTIES("immutable"="false")
This is currently not available in the shipping version of Hive and HCatalog, however. For us to have a workflow that adds more and more data into our tables, we therefore need to create a new partition for each new run of the workflow. We've made these changes in v4 of our pipeline, where we first recreate the tables with an integer partition key, as follows:
CREATE TABLE tweets_hcat(
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE;
CREATE TABLE `places_hcat`(
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE
TBLPROPERTIES("immutable"="false");
CREATE TABLE `users_hcat`(
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE
TBLPROPERTIES("immutable"="false");
The Pig HCatStorer takes an optional partition definition, and we modify the store statements in our Pig script accordingly; for example:
store tweets_tsv into 'twttr.tweets_hcat'
using org.apache.hive.hcatalog.pig.HCatStorer(
'partition_key=$partitionKey');
We then modify our Pig action in the workflow.xml file to include this additional parameter:
<script>${workflowRoot}/pig/extract_to_hcat.pig</script>
<param>inputDir=${inputDir}</param>
<param>partitionKey=${partitionKey}</param>
The question is then how we pass this partition key to the workflow. We could specify it in the job.properties file, but by doing so we would hit the same problem of trying to write to an existing partition on the next re-run.
Ingestion workflow v4
For now, we'll pass this as an explicit argument to the invocation of the Oozie CLI and explore better ways to do this later:
$ oozie job -run -config v4/job.properties -DpartitionKey=12345
Note
Note that a consequence of this behavior is that rerunning an HCat workflow with the same arguments will fail. Be aware of this when testing workflows or playing with the sample code from this book.
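One common way to avoid picking the key by hand is to derive it from the submission time, so that successive runs never collide. This is our own illustrative suggestion (the helper name make_partition_key is made up), not the mechanism the book settles on later:

```python
import time

def make_partition_key(now=None):
    """Derive an integer partition key such as 20140704123000 from the
    UTC time; runs more than a second apart get distinct partitions.
    'now' is seconds since the epoch, defaulting to the current time."""
    return int(time.strftime("%Y%m%d%H%M%S", time.gmtime(now)))

# The computed key would then be passed through to the Oozie CLI:
#   oozie job -run -config v4/job.properties -DpartitionKey=<key>
print(make_partition_key(0))   # -> 19700101000000 (the Unix epoch)
```

Keeping the key time-ordered also means newer partitions sort after older ones, which can simplify later housekeeping queries.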
Producing derived data

Now that we have our main data pipeline established, there is most likely a series of actions that we wish to take after we add each new additional dataset. As a simple example, note that with our previous mechanism of adding each set of user data to a separate partition, the users_hcat table will contain users multiple times. Let's create a new table for unique users and regenerate this each time we add new user data.
Note that, given the aforementioned limitations of HCatalog, we'll use a Hive action for this purpose, as we need to replace the data in a table.
First, we'll create a new table for unique user information, as follows:
CREATE TABLE IF NOT EXISTS `unique_users`(
`user_id` string,
`name` string,
`description` string,
`screen_name` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;
In this table, we'll only store the attributes of a user that either never change (ID) or change rarely (the screen name, and so on). We can then write a simple Hive statement to populate this table from the full users_hcat table:

USE twttr;
INSERT OVERWRITE TABLE unique_users
SELECT DISTINCT user_id, name, description, screen_name
FROM users_hcat;
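As an illustrative sketch of what the SELECT DISTINCT statement does, here is the same deduplication expressed in plain Python (the sample records are made up for the example; Hive performs this at scale across the cluster):

```python
def distinct_users(rows):
    # Keep one row per distinct combination of the selected columns,
    # mirroring SELECT DISTINCT user_id, name, description, screen_name.
    seen = set()
    out = []
    for row in rows:
        key = (row["user_id"], row["name"], row["description"], row["screen_name"])
        if key not in seen:
            seen.add(key)
            out.append(dict(row))
    return out

users = [
    {"user_id": "1", "name": "a", "description": "d", "screen_name": "s"},
    {"user_id": "1", "name": "a", "description": "d", "screen_name": "s"},
    {"user_id": "2", "name": "b", "description": "e", "screen_name": "t"},
]
print(len(distinct_users(users)))  # 2
```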
We can then add an additional Hive action node that comes after our previous Pig node in the workflow. When doing this, we discover that our pattern of simply giving nodes names such as hive-node is a really bad idea, as we now have two Hive-based nodes. In v5 of the workflow, we add this new node and also change our nodes to have more descriptive names:
Ingestion workflow v5

Performing multiple actions in parallel

Our workflow has two types of activity: initial setup with the nodes that initialize the filesystem and Hive tables, and the functional nodes that perform actual processing. If we look at the two setup nodes we have been using, it is obvious that they are quite distinct and not interdependent. We can therefore take advantage of an Oozie feature called fork and join nodes to execute these actions in parallel. The start of our workflow.xml file now becomes:
<start to="setup-fork-node"/>

The Oozie fork node contains a number of path elements, each of which specifies a starting node. Each of these will be launched in parallel:

<fork name="setup-fork-node">
    <path start="setup-filesystem-node"/>
    <path start="create-tables-node"/>
</fork>
Each of the specified action nodes is no different from any we have used previously. An action node can link to a series of other nodes; the only requirement is that each parallel series of actions must end with a transition to the join node associated with the fork node, as follows:

<action name="setup-filesystem-node">
…
    <ok to="setup-join-node"/>
    <error to="fail"/>
</action>
<action name="create-tables-node">
…
    <ok to="setup-join-node"/>
    <error to="fail"/>
</action>
The join node itself acts as the point of coordination; any path that has completed will wait until all the paths specified in the fork node reach this point. The workflow then continues at the node specified within the join node. Here's how the join node is used:

<join name="create-join-node" to="gettweets-node"/>

In the preceding code we omitted the action definitions for space purposes, but the full workflow definition is in v6:
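The fork/join semantics — launch every path in parallel, continue only once all of them have finished — can be sketched in plain Python with concurrent.futures. The node names mirror the workflow above, but the functions are stand-ins, not real setup logic:

```python
from concurrent.futures import ThreadPoolExecutor

def setup_filesystem_node():
    return "filesystem ready"

def create_tables_node():
    return "tables ready"

def run_fork_join(actions):
    # Launch all fork paths in parallel; collecting every result before
    # returning is the implicit barrier that a join node provides.
    with ThreadPoolExecutor(max_workers=len(actions)) as pool:
        futures = [pool.submit(action) for action in actions]
        return [f.result() for f in futures]

results = run_fork_join([setup_filesystem_node, create_tables_node])
print(results)
```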
Ingestion workflow v6

Calling a subworkflow

Though the fork/join mechanism makes the processing of parallel actions more efficient, it does still add significant verbosity if we include it in our main workflow.xml definition. Conceptually, we have a series of actions that are performing related tasks required by our workflow but not necessarily part of it. For this and similar cases, Oozie offers the ability to invoke a subworkflow. The parent workflow will execute the child and wait for it to complete, with the ability to pass configuration elements from one workflow to the other.

The child workflow will be a full workflow in its own right, usually stored in a directory on HDFS with all the usual structure we expect for a workflow: the main workflow.xml file and any required Hive, Pig, or similar files.

We can create a new directory on HDFS called setup-workflow, and in this create the files required only for our filesystem and Hive creation actions. The subworkflow configuration file will look like the following:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="create-workflow">
    <start to="setup-fork-node"/>
    <fork name="setup-fork-node">
        <path start="setup-filesystem-node"/>
        <path start="create-tables-node"/>
    </fork>
    <action name="setup-filesystem-node">
…
    </action>
    <action name="create-tables-node">
…
    </action>
    <join name="create-join-node" to="end"/>
    <kill name="fail">
        <message>Action failed, error message [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
With this subworkflow defined, we then modify the first nodes of our main workflow to use a subworkflow node, as in the following:

<start to="create-subworkflow-node"/>
<action name="create-subworkflow-node">
    <sub-workflow>
        <app-path>${subWorkflowRoot}</app-path>
        <propagate-configuration/>
    </sub-workflow>
    <ok to="gettweets-node"/>
    <error to="fail"/>
</action>
We will specify subWorkflowRoot in the job.properties of our parent workflow, and the propagate-configuration element will pass the configuration of the parent workflow to the child.
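Conceptually, the configuration the child sees is the parent's properties merged with its own. The following sketch illustrates that merge with plain dictionaries; the precedence rule (child values win over propagated parent values) is an assumption of this sketch, not a statement of Oozie's exact behavior:

```python
def effective_child_config(parent_conf, child_conf):
    # Start from the propagated parent configuration, then let the child's
    # own properties add to or override it (assumed override order).
    merged = dict(parent_conf)
    merged.update(child_conf)
    return merged

parent = {"nameNode": "hdfs://nn:8020", "dbName": "twttr"}
child = {"dbName": "twttr_setup"}
merged = effective_child_config(parent, child)
print(merged)
```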
Adding global settings

By extracting utility nodes into subworkflows, we can significantly reduce clutter and complexity in our main workflow definition. In v7 of our ingest pipeline, we'll make one additional simplification and add a global configuration section, as in the following:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="v7">
    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <job-xml>${workflowRoot}/hive-site.xml</job-xml>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
    </global>
    <start to="create-subworkflow-node"/>
By adding this global configuration section, we remove the need to specify any of these values in the Hive and Pig nodes in the remaining workflow (note that currently the shell node does not support the global configuration mechanism). This can dramatically simplify some of our nodes; for example, our Pig node is now as follows:

<action name="hcat-ingest-node">
    <pig>
        <configuration>
            <property>
                <name>oozie.action.sharelib.for.pig</name>
                <value>pig,hcatalog</value>
            </property>
        </configuration>
        <script>${workflowRoot}/pig/extract_to_hcat.pig</script>
        <param>inputDir=${inputDir}</param>
        <param>dbName=${dbName}</param>
        <param>partitionKey=${partitionKey}</param>
    </pig>
    <ok to="derived-data-node"/>
    <error to="fail"/>
</action>
As can be seen, we can add additional configuration elements, or indeed override those specified in the global section, resulting in a much cleaner action definition that focuses only on the information specific to the action in question. Workflow v7 has had both a global section and the subworkflow added, and this makes a significant improvement in the workflow's readability:

Ingestion workflow v7
Challenges of external data

When we rely on external data to drive our application, we are implicitly dependent on the quality and stability of that data. This is, of course, true for any data, but when the data is generated by an external source over which we do not have control, the risks are most likely higher. Regardless, when building what we expect to be reliable applications on top of such data feeds, and especially when our data volumes grow, we need to think about how to mitigate these risks.
Data validation

We use the general term data validation to refer to the act of ensuring that incoming data complies with our expectations, potentially applying normalization to modify it accordingly or even deleting malformed or corrupt input. What this actually involves will be very application-specific. In some cases, the important thing is ensuring the system only ingests data that conforms to a given definition of accurate or clean. For our tweet data, we don't care about every single record and could very easily adopt a policy such as dropping records that don't have values in particular fields we care about. For other applications, however, it is imperative to capture every input record, and this might drive the implementation of logic to reformat every record to make sure it complies with the requirements. In yet other cases, only correct records will be ingested, but the rest, instead of being discarded, might be stored elsewhere for later analysis.

The bottom line is that trying to define a generic approach to data validation is vastly beyond the scope of this chapter. However, we can offer some thoughts on where in the pipeline to incorporate various types of validation logic.
Validation actions

Logic to do any necessary validation or cleanup can be incorporated directly into other actions. A shell node running a script to gather data can have commands added to handle malformed records differently. Pig and Hive actions that load data into tables can either perform filtering on ingest (more easily done in Pig) or add caveats when copying data from an ingest table to the operational store.

There is an argument, though, for the addition of a validation node into the workflow, even if initially it performs no actual logic. This could, for instance, be a Pig action that reads the data, applies the validation, and writes the validated data to a new location to be read by follow-on nodes. The advantage here is that we can later update the validation logic without altering our other actions, which should reduce the risk of accidentally breaking the rest of the pipeline and also make nodes more cleanly defined in terms of responsibilities. The natural extension of this train of thought is that a new subworkflow for validation is most likely a good model as well, as it not only provides separation of responsibilities, but also makes the validation logic easier to test and update.

The obvious disadvantage of this approach is that it adds additional processing and another cycle of reading the data and writing it all again. This is, of course, directly working against one of the advantages we highlighted when considering the use of HCatalog from Pig.

In the end, it will come down to a trade-off of performance against workflow complexity and maintainability. When considering how to perform validation and just what that means for your workflow, take all these elements into account before deciding on an implementation.
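The record-splitting policy described above — pass records that meet our expectations onward, route the rest elsewhere for later analysis — can be sketched in a few lines of Python. The required fields here are assumptions for the example, not the book's actual validation rules:

```python
def validate_tweets(records, required_fields=("tweet_id_str", "text")):
    # Split incoming records into those that satisfy our expectations
    # and those set aside for later analysis rather than discarded.
    valid, rejected = [], []
    for rec in records:
        if all(rec.get(field) for field in required_fields):
            valid.append(rec)
        else:
            rejected.append(rec)
    return valid, rejected

records = [
    {"tweet_id_str": "1", "text": "hello"},
    {"tweet_id_str": "2", "text": ""},  # malformed: empty text field
]
valid, rejected = validate_tweets(records)
print(len(valid), len(rejected))  # 1 1
```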
Handling format changes

We can't declare victory just because we have data flowing into our system and are confident the data is sufficiently validated. Particularly when the data comes from an external source, we have to think about how the structure of the data might change over time.

Remember that systems such as Hive only apply the table schema when the data is being read. This is a huge benefit in enabling flexible data storage and ingest, but can lead to user-facing queries or workloads failing suddenly when the ingested data no longer matches the queries being executed against it. A relational database, which applies schemas on write, would not even allow such data to be ingested into the system.

The obvious approach to handling changes made to the data format would be to reprocess existing data into the new format. Though this is tractable on smaller datasets, it quickly becomes infeasible on the sort of volumes seen in large Hadoop clusters.
Handling schema evolution with Avro

Avro has some features with respect to its integration with Hive that help us with this problem. If we take our table for tweets data, we could represent the structure of a tweet record by the following Avro schema:

{
    "namespace": "com.learninghadoop2.avrotables",
    "type": "record",
    "name": "tweets_avro",
    "fields": [
        {"name": "created_at", "type": ["null", "string"]},
        {"name": "tweet_id_str", "type": ["null", "string"]},
        {"name": "text", "type": ["null", "string"]},
        {"name": "in_reply_to", "type": ["null", "string"]},
        {"name": "is_retweeted", "type": ["null", "string"]},
        {"name": "user_id", "type": ["null", "string"]},
        {"name": "place_id", "type": ["null", "string"]}
    ]
}
Create the preceding schema in a file called tweets_avro.avsc (the standard file extension for Avro schemas). Then, place it on HDFS; we like to have a common location for schema files such as /schema/avro.

With this definition, we can now create a Hive table that uses this schema for its table specification, as follows:

CREATE TABLE tweets_avro
PARTITIONED BY (`partition_key` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='hdfs://localhost.localdomain:8020/schema/avro/tweets_avro.avsc'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
Then, look at the table definition from within Hive (or HCatalog, which also supports such definitions):

describe tweets_avro;
OK
created_at      string  from deserializer
tweet_id_str    string  from deserializer
text            string  from deserializer
in_reply_to     string  from deserializer
is_retweeted    string  from deserializer
user_id         string  from deserializer
place_id        string  from deserializer
partition_key   int     None
We can also use this table like any other, for example, to copy the data from the non-Avro table into the Avro table, as follows:

SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE tweets_avro
PARTITION (partition_key)
SELECT * FROM tweets_hcat;

Note: just as in previous examples, if the Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before being able to select from the table.
We now have a new tweets table specified by an Avro schema; so far it just looks like other tables. But the real benefits for our purposes in this chapter are in how we can use the Avro mechanism to handle schema evolution. Let's add a new field to our table schema, as follows:

{
    "namespace": "com.learninghadoop2.avrotables",
    "type": "record",
    "name": "tweets_avro",
    "fields": [
        {"name": "created_at", "type": ["null", "string"]},
        {"name": "tweet_id_str", "type": ["null", "string"]},
        {"name": "text", "type": ["null", "string"]},
        {"name": "in_reply_to", "type": ["null", "string"]},
        {"name": "is_retweeted", "type": ["null", "string"]},
        {"name": "user_id", "type": ["null", "string"]},
        {"name": "place_id", "type": ["null", "string"]},
        {"name": "new_feature", "type": "string", "default": "wow!"}
    ]
}
With this new schema in place, we can validate that the table definition has also been updated, as follows:

describe tweets_avro;
OK
created_at      string  from deserializer
tweet_id_str    string  from deserializer
text            string  from deserializer
in_reply_to     string  from deserializer
is_retweeted    string  from deserializer
user_id         string  from deserializer
place_id        string  from deserializer
new_feature     string  from deserializer
partition_key   int     None
Without adding any new data, we can run queries on the new field that will return the default value for our existing data, as follows:

SELECT new_feature FROM tweets_avro LIMIT 5;
...
OK
wow!
wow!
wow!
wow!
wow!
Even more impressive is the fact that the new column doesn't need to be added at the end; it can be anywhere in the record. With this mechanism, we can now update our Avro schemas to represent the new data structure and see these changes automatically reflected in our Hive table definitions. Any queries that refer to the new column will retrieve the default value for all our existing data that does not have that field present.

Note that the default mechanism we are using here is core to Avro and is not specific to Hive. Avro is a very powerful and flexible format that has applications in many areas and is definitely worth deeper examination than we are giving it here.
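To make the default mechanism concrete, here is a plain-Python sketch of the resolution rule Avro applies when a reader's schema has a field the stored record lacks. This is not the Avro library itself; the simplified field dicts are stand-ins for real schema objects:

```python
def read_with_schema(record, fields):
    # Mimic one case of Avro schema resolution: a field missing from the
    # stored record takes the reader schema's default value, if it has one.
    out = {}
    for field in fields:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError("no value and no default for " + field["name"])
    return out

fields = [
    {"name": "tweet_id_str"},
    {"name": "new_feature", "default": "wow!"},
]
old_record = {"tweet_id_str": "12345"}  # written before new_feature existed
print(read_with_schema(old_record, fields))
```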
Technically, what this provides us with is forward compatibility. We can make changes to our table schema and have all our existing data remain automatically compliant with the new structure. We can't, however, continue to ingest data of the old format into the updated tables, since the mechanism does not provide backward compatibility:

INSERT INTO TABLE tweets_avro
PARTITION (partition_key)
SELECT * FROM tweets_hcat;
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into
target table because column number/types are different 'tweets_avro': Table
insclause-0 has 8 columns, but query has 7 columns.
Supporting schema evolution with Avro allows data changes to be handled as part of normal business instead of the firefighting emergency they all too often turn into. But plainly, it's not for free; there is still a need to make the changes in the pipeline and roll these into production. Having Hive tables that provide forward compatibility does, however, allow the process to be performed in more manageable steps; otherwise, you would need to synchronize changes across every stage of the pipeline. If the changes are made from ingest up to the point records are inserted into Avro-backed Hive tables, then all users of those tables can remain unchanged (as long as they don't do things like SELECT *, which is usually a terrible idea anyway) and continue to run existing queries against the new data. These applications can then be changed on a different timetable to the ingestion mechanism. In v8 of our ingest pipeline, we show how to fully use Avro tables for all of our existing functionality.

Note: Hive 0.14, unreleased at the time of writing, will likely include more built-in support for Avro that might simplify the process of schema evolution even further. If Hive 0.14 is available when you read this, then do check out the final implementation.
Final thoughts on using Avro schema evolution

With this discussion of Avro, we have touched on some aspects of much broader topics, in particular data management on a broader scale and policies around data versioning and retention. Much of this area becomes very specific to an organization, but here are a few parting thoughts that we feel are more broadly applicable.
Only make additive changes

We discussed adding columns in the preceding example. Sometimes, though more rarely, your source data drops columns or you discover you no longer need a column. Avro doesn't really provide tools to help with this, and we feel it is often undesirable. Instead of dropping old columns, we tend to maintain the old data and simply do not use the empty columns in all the new data. This is much easier to manage if you control the data format; if you are ingesting external sources, then to follow this approach you will either need to reprocess data to remove the old column or change the ingest mechanism to add a default value for all new data.
Manage schema versions explicitly

In the preceding examples, we had a single schema file to which we made changes directly. This is likely a very bad idea, as it removes our ability to track schema changes over time. In addition to treating schemas as artifacts to be kept under version control (your schemas are in Git too, aren't they?), it is often useful to tag each schema with an explicit version. This is particularly useful when the incoming data is also explicitly versioned. Then, instead of overwriting the existing schema file, you can add the new file and use an ALTER TABLE statement to point the Hive table definition at the new schema. We are, of course, assuming here that you don't have the option of using a different query for the old data with the different format. Though there is no automatic mechanism for Hive to select a schema, there might be cases where you can control this manually and sidestep the evolution question.
Think about schema distribution

When using a schema file, think about how it will be distributed to the clients. If, as in the previous example, the file is on HDFS, then it likely makes sense to give it a high replication factor. The file will be retrieved by each mapper in every MapReduce job that queries the table.

The Avro URL can also be specified as a local filesystem location (file://), which is useful for development, and also as a web resource (http://). Though the latter is very useful as a convenient mechanism to distribute the schema to non-Hadoop clients, remember that the load on the web server might be high. With modern hardware and efficient web servers, this is most likely not a huge concern, but if you have a cluster of thousands of machines running many parallel jobs where each mapper needs to hit the web server, then be careful.
Collecting additional data

Many data processing systems don't have a single data ingest source; often, one primary source is enriched by other secondary sources. We will now look at how to incorporate the retrieval of such reference data into our data warehouse.

At a high level, the problem isn't very different from our retrieval of the raw tweet data, as we wish to pull data from an external source, possibly do some processing on it, and store it somewhere it can be used later. But this does highlight an aspect we need to consider: do we really want to retrieve this data every time we ingest new tweets? The answer is certainly no. The reference data changes very rarely, and we could easily fetch it much less frequently than new tweet data. This raises a question we've skirted until now: just how do we schedule Oozie workflows?
Scheduling workflows

Until now, we've run all our Oozie workflows on demand from the CLI. Oozie also has a scheduler that allows jobs to be started either on a timed basis or when external criteria, such as data appearing in HDFS, are met. It would be a good fit for our workflows to have our main tweet pipeline run, say, every 10 minutes but the reference data only refreshed daily.

Tip: regardless of when data is retrieved, think carefully about how to handle datasets that perform a delete/replace operation. In particular, don't do the delete before retrieving and validating the new data; otherwise, any jobs that require the reference data will fail until the next run of the retrieval succeeds. It could be a good option to include the destructive operations in a subworkflow that is only triggered after successful completion of the retrieval steps.

Oozie actually defines two types of applications that it can run: workflows such as we've used so far, and coordinators, which schedule workflows to be executed based on various criteria. A coordinator job is conceptually similar to our other workflows; we push an XML configuration file onto HDFS and use a parameterized properties file to configure it at runtime. In addition, coordinator jobs have the facility to receive additional parameterization from the events that trigger their execution.

This is possibly best described by an example. Let's say we wish to do as previously mentioned and create a coordinator that executes v7 of our ingest workflow every 10 minutes. Here's the coordinator.xml file (the standard name for the coordinator XML definition):
<coordinator-app name="tweets-10min-coordinator" frequency="${freq}"
    start="${startTime}" end="${endTime}" timezone="UTC"
    xmlns="uri:oozie:coordinator:0.2">
The main action node in a coordinator is the workflow, for which we need to specify its root location on HDFS and all required properties, as follows:

<action>
    <workflow>
        <app-path>${workflowPath}</app-path>
        <configuration>
            <property>
                <name>workflowRoot</name>
                <value>${workflowRoot}</value>
            </property>
…

We also need to include any properties required by any action in the workflow or by any subworkflow it triggers; in effect, this means that any user-defined variables present in any of the workflows to be triggered need to be included here, as follows:
            <property>
                <name>dbName</name>
                <value>${dbName}</value>
            </property>
            <property>
                <name>partitionKey</name>
                <value>${coord:formatTime(coord:nominalTime(), 'yyyyMMddhhmm')}</value>
            </property>
            <property>
                <name>exec</name>
                <value>gettweets.sh</value>
            </property>
            <property>
                <name>inputDir</name>
                <value>/tmp/tweets</value>
            </property>
            <property>
                <name>subWorkflowRoot</name>
                <value>${subWorkflowRoot}</value>
            </property>
        </configuration>
    </workflow>
</action>
</coordinator-app>
We used a few coordinator-specific features in the preceding XML. Note the specification of the starting and ending time of the coordinator and also its frequency (in minutes). We are using the simplest form here; Oozie also has a set of functions that allow quite rich specifications of the frequency.

We use coordinator EL functions in our definition of the partitionKey variable. Earlier, when running workflows from the CLI, we specified these explicitly but mentioned there was a better way: this is it. The following expression generates a formatted output containing the year, month, day, hour, and minute:

${coord:formatTime(coord:nominalTime(), 'yyyyMMddhhmm')}

If we then use this as the value for our partition key, we can ensure that each invocation of the workflow correctly creates a unique partition in our HCatalog tables.
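The formatting the EL expression performs can be reproduced in Python to check what partition keys to expect (this sketch uses a 24-hour hour field and a made-up nominal time):

```python
from datetime import datetime, timezone

def partition_key_for(nominal_time):
    # Equivalent of coord:formatTime(coord:nominalTime(), ...): a string
    # built from the year, month, day, hour, and minute of the run.
    return nominal_time.strftime("%Y%m%d%H%M")

# Hypothetical nominal time for one coordinator materialization.
run = datetime(2014, 7, 1, 10, 30, tzinfo=timezone.utc)
print(partition_key_for(run))  # 201407011030
```

Because the nominal time differs for every materialization, each run yields a distinct key, which is what makes the partition unique.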
The corresponding job.properties for the coordinator job looks much like our previous config files, with the usual entries for the NameNode and similar variables as well as values for the application-specific variables, such as dbName. In addition, we need to specify the root of the coordinator location on HDFS, as follows:
oozie.coord.application.path=${nameNode}/user/${user.name}/${tasksRoot}/tweets_10min

Note the oozie.coord namespace prefix instead of the previously used oozie.wf. With the coordinator definition on HDFS, we can submit the file to Oozie just as with the previous jobs. But in this case, the job will only run for a given time period. Specifically, it will run at the configured frequency (every 10 minutes in our case, set via the freq variable) whenever the system clock is between startTime and endTime.
We've included the full configuration in the tweets_10min directory in the source code for this chapter.
Other Oozie triggers

The preceding coordinator has a very simple trigger; it starts periodically within a specified time range. Oozie has an additional capability called datasets, where it can be triggered by the availability of new data.

This isn't a great fit for how we've defined our pipeline until now, but imagine that, instead of our workflow collecting tweets as its first step, an external system was pushing new files of tweets onto HDFS on a continuous basis. Oozie can be configured either to look for the presence of new data based on a directory pattern or to specifically trigger when a ready file appears on HDFS. This latter configuration provides a very convenient mechanism with which to integrate the output of MapReduce jobs, which by default write a _SUCCESS file into their output directory.

Oozie datasets are arguably one of the most powerful parts of the whole system, and we cannot do them justice here for space reasons. But we do strongly recommend that you consult the Oozie homepage for more information.
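The ready-file convention is simple enough to sketch: a dataset instance is considered available once the _SUCCESS marker exists in its output directory. This sketch checks a local directory rather than HDFS, purely for illustration:

```python
import os
import tempfile

def ready_to_trigger(output_dir):
    # A dataset instance is ready when the job's _SUCCESS marker file
    # exists in its output directory (the default MapReduce behaviour).
    return os.path.exists(os.path.join(output_dir, "_SUCCESS"))

with tempfile.TemporaryDirectory() as d:
    print(ready_to_trigger(d))   # False: job output not yet complete
    open(os.path.join(d, "_SUCCESS"), "w").close()
    print(ready_to_trigger(d))   # True: marker written, safe to consume
```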
Pulling it all together

Let's review what we've discussed until now and how we can use Oozie to build a sophisticated series of workflows that implement an approach to data lifecycle management by putting together all the discussed techniques.

First, it's important to define clear responsibilities and implement parts of the system using good design and separation-of-concerns principles. By applying this, we end up with several different workflows:

- A subworkflow to ensure the environment (mainly HDFS and Hive metadata) is correctly configured
- A subworkflow to perform data validation
- The main workflow that triggers both the preceding subworkflows and then pulls new data through a multistep ingest pipeline
- A coordinator that executes the preceding workflows every 10 minutes
- A second coordinator that ingests reference data that will be useful to the application pipeline

We also define all our tables with Avro schemas and use them wherever possible to help manage schema evolution and changing data formats over time.

We present the full source code of these components in the final version of the workflow in the source code of this chapter.
Other tools to help

Though Oozie is a very powerful tool, it can sometimes be difficult to correctly write workflow definition files. As pipelines get sizeable, managing complexity becomes a challenge even with good functional partitioning into multiple workflows. At a simpler level, XML is just never fun for a human to write! There are a few tools that can help. Hue, the tool calling itself the Hadoop UI (http://gethue.com/), provides some graphical tools to help compose, execute, and manage Oozie workflows. Though powerful, Hue is not a beginner tool; we'll mention it a little more in Chapter 11, Where to Go Next.

A new Apache project called Falcon (http://falcon.incubator.apache.org) might also be of interest. Falcon uses Oozie to build a range of much higher-level dataflows and actions. For example, Falcon provides recipes to enable and ensure cross-site replication across multiple Hadoop clusters. The Falcon team is working on much better interfaces to build their workflows, so the project might well be worth watching.
Summary

Hopefully, this chapter presented the topic of data lifecycle management as something other than a dry abstract concept. We covered a lot, particularly:

- The definition of data lifecycle management and how it covers a number of issues and techniques that usually become important with large data volumes
- The concept of building a data ingest pipeline along good data lifecycle management principles that can then be utilized by higher-level analytic tools
- Oozie as a Hadoop-focused workflow manager and how we can use it to compose a series of actions into a unified workflow
- Various Oozie tools, such as subworkflows, parallel action execution, and global variables, that allow us to apply true design principles to our workflows
- HCatalog and how it provides the means for tools other than Hive to read and write table-structured data; we showed its great promise and integration with tools such as Pig but also highlighted some current weaknesses
- Avro as our tool of choice to handle schema evolution over time
- Using Oozie coordinators to build scheduled workflows based either on time intervals or data availability to drive the execution of multiple ingest pipelines
- Some other tools that can make these tasks easier, namely, Hue and Falcon

In the next chapter, we'll look at several of the higher-level analytic tools and frameworks that can build sophisticated application logic upon the data collected in an ingest pipeline.
Chapter 9. Making Development Easier

In this chapter, we will look at how, depending on use cases and end goals, application development in Hadoop can be simplified using a number of abstractions and frameworks built on top of the Java APIs. In particular, we will learn about the following topics:

- How the Streaming API allows us to write MapReduce jobs using dynamic languages such as Python and Ruby
- How frameworks such as Apache Crunch and Kite Morphlines allow us to express data transformation pipelines using higher-level abstractions
- How Kite Data, a promising framework developed by Cloudera, provides us with the ability to apply design patterns and boilerplate to ease integration and interoperability of different components within the Hadoop ecosystem
Choosing a framework

In the previous chapters, we looked at the MapReduce and Spark programming APIs to write distributed applications. Although very powerful and flexible, these APIs come with a certain level of complexity and possibly require significant development time.
In an effort to reduce verbosity, we introduced the Pig and Hive frameworks, which compile domain-specific languages, Pig Latin and HiveQL, into a number of MapReduce jobs or Spark DAGs, effectively abstracting the APIs away. Both languages can be extended with UDFs, which are a way of mapping complex logic to the Pig and Hive data models.
At times when we need a certain degree of flexibility and modularity, things can get tricky. Depending on the use case and developer needs, the Hadoop ecosystem presents a vast choice of APIs, frameworks, and libraries. In this chapter, we identify four categories of users and match them with the following relevant tools:
Developers that want to avoid Java in favor of scripting MapReduce jobs using dynamic languages, or use languages not implemented on the JVM. A typical use case would be upfront analysis and rapid prototyping: Hadoop streaming
Java developers that need to integrate components of the Hadoop ecosystem and could benefit from codified design patterns and boilerplate: Kite Data
Java developers who want to write modular data pipelines using a familiar API: Apache Crunch
Developers who would rather configure chains of data transformations. For instance, a data engineer that wants to embed existing code in an ETL pipeline: Kite Morphlines
Hadoop streaming

We have mentioned previously that MapReduce programs don't have to be written in Java. There are several reasons why you might want or need to write your map and reduce tasks in another language. Perhaps you have existing code to leverage or need to use third-party binaries; the reasons are varied and valid.
Hadoop provides a number of mechanisms to aid non-Java development, primary amongst which are Hadoop Pipes, which provides a native C++ interface, and Hadoop streaming, which allows any program that uses standard input and output to be used for map and reduce tasks. With the MapReduce Java API, both map and reduce tasks provide implementations for methods that contain the task functionality. These methods receive the input to the task as method arguments and then output results via the Context object. This is a clear and type-safe interface, but it is by definition Java-specific.
Hadoop streaming takes a different approach. With streaming, you write a map task that reads its input from standard input, one line at a time, and gives the output of its results to standard output. The reduce task then does the same, again using only standard input and output for its data flow.
Any program that reads and writes from standard input and output can be used in streaming, such as compiled binaries, Unix shell scripts, or programs written in a dynamic language such as Python or Ruby. The biggest advantage of streaming is that it allows you to try out ideas and iterate on them more quickly than with Java. Instead of a compile/JAR/submit cycle, you just write the scripts and pass them as arguments to the streaming JAR file. Especially when doing initial analysis on a new dataset or trying out new ideas, this can significantly speed up development.
The classic debate regarding dynamic versus static languages balances the benefits of swift development against runtime performance and type checking. These dynamic downsides also apply when using streaming. Consequently, we favor the use of streaming for upfront analysis and Java for the implementation of jobs that will be executed on the production cluster.
Streaming word count in Python

We'll demonstrate Hadoop streaming by re-implementing our familiar word count example using Python. First, we create a script that will be our mapper. It consumes UTF-8 encoded rows of text from standard input with a for loop, splits each into words, and uses the print function to write each word to standard output, as follows:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    # skip empty lines
    if line == '\n':
        continue
    # preserve utf-8 encoding
    try:
        line = line.encode('utf-8')
    except UnicodeDecodeError:
        continue
    # newline characters can appear within the text
    line = line.replace('\n', '')
    # lowercase and tokenize
    line = line.lower().split()
    for term in line:
        if not term:
            continue
        try:
            print(
                u"%s" % (
                    term.decode('utf-8')))
        except UnicodeEncodeError:
            continue
The reducer counts the number of occurrences of each word read from standard input and writes each word with its final count to standard output, as follows:
#!/usr/bin/env python
import sys

count = 1
current = None
for word in sys.stdin:
    word = word.strip()
    if word == current:
        count += 1
    else:
        if current:
            print "%s\t%s" % (current.decode('utf-8'), count)
        current = word
        count = 1
if current == word:
    print "%s\t%s" % (current.decode('utf-8'), count)
Note

In both cases, we are implicitly using the Hadoop input and output formats discussed in the earlier chapters. It is the TextInputFormat that processes the source file and provides each line, one at a time, to the map script. Conversely, the TextOutputFormat will ensure that the output of reduce tasks is also correctly written as text.
Copy map.py and reduce.py to HDFS, and execute the scripts as a streaming job using the sample data from the previous chapters, as follows:
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file map.py \
    -mapper "python map.py" \
    -file reduce.py \
    -reducer "python reduce.py" \
    -input sample.txt \
    -output output.txt
Note

Tweets are UTF-8 encoded. Make sure that PYTHONIOENCODING is set accordingly in order to pipe data in a UNIX terminal:

$ export PYTHONIOENCODING='UTF-8'
The same code can be executed from the command-line prompt; note that a sort between the two scripts emulates the shuffle phase that groups identical words together:

$ cat sample.txt | python map.py | sort | python reduce.py > out.txt
The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/wc/python/map.py.
Differences in jobs when using streaming

In Java, we know that our map() method will be invoked once for each input key/value pair and our reduce() method will be invoked once for each key and its set of values.
With streaming, we don't have the concept of the map or reduce methods anymore; instead, we have written scripts that process streams of received data. This changes how we need to write our reducer. In Java, the grouping of values for each key was performed by Hadoop; each invocation of the reduce method would receive a single key and all its values. In streaming, each instance of the reduce task is given the individual, ungathered values one at a time.
Hadoop streaming does sort the keys; for example, if a mapper emitted the following data:
First 1
Word 1
Word 1
A 1
First 1
The streaming reducer would receive it in the following order:
A 1
First 1
First 1
Word 1
Word 1
Hadoop still collects the values for each key and ensures that each key is passed only to a single reducer. In other words, a reducer gets all the values for a number of keys, and they are grouped together; however, they are not packaged into individual executions of the reducer, that is, one per key, as with the Java API. Since Hadoop streaming uses the stdin and stdout channels to exchange data between tasks, debug and error messages should not be printed to standard output. In the following example, we will use the Python logging (https://docs.python.org/2/library/logging.html) package to log warning statements to a file.
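A common idiom for streaming reducers is to rebuild the per-key grouping with itertools.groupby, relying on the sorted order illustrated above. The following is a sketch with names of our own choosing (reduce_stream and parse are not part of any Hadoop API), not code from the book's repository:

```python
import io
from itertools import groupby

def parse(stream):
    # each input line is "key<TAB>value", as emitted by a streaming mapper
    for line in stream:
        key, value = line.rstrip('\n').split('\t', 1)
        yield key, value

def reduce_stream(stream, out):
    # Hadoop sorts map output by key, so all values for a key arrive
    # consecutively; groupby rebuilds the per-key batches that a Java
    # reduce() invocation would receive in a single call
    for key, group in groupby(parse(stream), key=lambda kv: kv[0]):
        out.write("%s\t%d\n" % (key, sum(int(v) for _, v in group)))

# in a real job this would be reduce_stream(sys.stdin, sys.stdout);
# here we feed it the sorted sample shown above
sorted_input = io.StringIO("A\t1\nFirst\t1\nFirst\t1\nWord\t1\nWord\t1\n")
result = io.StringIO()
reduce_stream(sorted_input, result)
```

Because groupby only merges adjacent items, this idiom is correct only on sorted input, which is exactly the guarantee the streaming shuffle provides.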
Finding important words in text

We will now implement a metric, Term Frequency-Inverse Document Frequency (TF-IDF), that will help us determine the importance of words based on how frequently they appear across a set of documents (tweets, in our case).
Intuitively, if a word appears frequently in a document, it is important and should be given a high score. However, if a word appears in many documents, we should penalize it with a lower score, as it is a common word and its frequency is not unique to this document.
Therefore, common words such as "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single tweet will be scaled up. Uses of TF-IDF, often in combination with other metrics and techniques, include stop word removal and text classification. Note that this technique has shortcomings when dealing with short documents, such as tweets; in such cases, the term frequency component will tend to become one. Conversely, one could exploit this property to detect outliers.
The definition of TF-IDF we will use in our example is the following:
tf = # of times a term appears in a document (raw frequency)
idf = 1 + log(# of documents / # of documents with the term in it)
tf-idf = tf * idf
We will implement the algorithm in Python using three MapReduce jobs:

The first one calculates term frequency
The second one calculates document frequency (the denominator of IDF)
The third one calculates per-tweet TF-IDF
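Before distributing the computation, the formula can be sanity-checked locally on a toy corpus. The following single-process sketch is our own illustration, independent of the three MapReduce jobs; it computes the same tf * (1 + log(N / df)) score:

```python
import math
from collections import Counter

def tf_idf(docs):
    # score each (term, doc_id) pair with tf * (1 + log(N / df))
    n_docs = float(len(docs))
    df = Counter()  # document frequency: in how many docs each term occurs
    for doc in docs:
        df.update(set(doc))
    scores = {}
    for doc_id, doc in enumerate(docs):
        for term, tf in Counter(doc).items():
            scores[(term, doc_id)] = tf * (1 + math.log(n_docs / df[term]))
    return scores

docs = [["the", "cat", "sat"],
        ["the", "dog"],
        ["the", "the", "end"]]
scores = tf_idf(docs)
# "the" occurs in every document, so idf = 1 + log(3/3) = 1 and its score
# collapses to the raw term frequency
```

Running the same data through the three jobs below should reproduce these scores, which makes the local function a useful check while developing the pipeline.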
Calculate term frequency

The term frequency part is very similar to the word count example. The main difference is that we will be using a multi-field, tab-separated key to keep track of co-occurrences of terms and document IDs. For each tweet, in JSON format, the mapper extracts the id_str and text fields, tokenizes text, and emits a (term, doc_id) tuple:
import json
import logging
import sys

# log warnings to a local file; stdout is reserved for map output
logging.basicConfig(filename='map-tf.log', level=logging.WARNING)
logger = logging.getLogger(__name__)

for tweet in sys.stdin:
    # skip empty lines
    if tweet == '\n':
        continue
    try:
        tweet = json.loads(tweet)
    except:
        logger.warn("Invalid input %s" % tweet)
        continue
    # In our example one tweet corresponds to one document.
    doc_id = tweet['id_str']
    if not doc_id:
        continue
    # preserve utf-8 encoding
    text = tweet['text'].encode('utf-8')
    # newline characters can appear within the text
    text = text.replace('\n', '')
    # lowercase and tokenize
    text = text.lower().split()
    for term in text:
        try:
            print(
                u"%s\t%s" % (
                    term.decode('utf-8'), doc_id.decode('utf-8'))
            )
        except UnicodeEncodeError:
            logger.warn("Invalid term %s" % term)
In the reducer, we emit the frequency of each term in a document as a tab-separated string:
freq = 1
cur_term, cur_doc_id = sys.stdin.readline().split()
for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id = line.split('\t')
    except:
        logger.warn("Invalid record %s" % line)
        continue
    # the key is a (doc_id, term) pair
    if (doc_id == cur_doc_id) and (term == cur_term):
        freq += 1
    else:
        print(
            u"%s\t%s\t%s" % (
                cur_term.decode('utf-8'), cur_doc_id.decode('utf-8'),
                freq))
        cur_doc_id = doc_id
        cur_term = term
        freq = 1
print(
    u"%s\t%s\t%s" % (
        cur_term.decode('utf-8'), cur_doc_id.decode('utf-8'), freq))
For this implementation to work, it is crucial that the reducer input is sorted by term and document ID. We can test both scripts from the command line with the following pipe:
$ cat tweets.json | python map-tf.py | sort -k1,2 | \
    python reduce-tf.py
Whereas at the command line we use the sort utility, in MapReduce we will use org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator. This comparator implements a subset of the features provided by the sort command. In particular, ordering by field can be specified with the -k<position> option. To sort by term, the first field of our key, we would set -D mapreduce.text.key.comparator.options=-k1:
$ /usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D map.output.key.field.separator=\t \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator \
    -D mapreduce.text.key.comparator.options=-k1,2 \
    -input tweets.json \
    -output /tmp/tf-out.tsv \
    -file map-tf.py \
    -mapper "python map-tf.py" \
    -file reduce-tf.py \
    -reducer "python reduce-tf.py"
Note

We specify which fields belong to the key (for shuffling) in the comparator options.
The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/map-tf.py.
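To see why the comparator options matter, the effect of sort -k1,2 (and of KeyFieldBasedComparator with -k1,2) can be emulated in a few lines of Python: sorting on the first two tab-separated fields makes identical (term, doc_id) pairs adjacent, which is what the reducer's grouping logic relies on. The records below are made-up illustrations:

```python
# made-up map output: "term<TAB>doc_id" records in arbitrary (unsorted) order
map_output = [
    "hadoop\tdoc2",
    "big\tdoc1",
    "hadoop\tdoc1",
    "hadoop\tdoc1",
]

# order by the first two tab-separated fields, as `sort -k1,2` does;
# identical (term, doc_id) pairs become adjacent, so a streaming reducer
# can count them with a simple "compare to previous record" loop
shuffled = sorted(map_output, key=lambda rec: rec.split('\t')[:2])
```

Without this ordering, occurrences of the same (term, doc_id) pair could be interleaved with other records and the single-pass reducer would undercount them.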
Calculate document frequency

The main logic to calculate document frequency is in the reducer, while the mapper is just an identity function that loads and pipes the (ordered by term) output of the TF job. In the reducer, for each term, we count how many times it occurs across all documents. For each term, we keep a buffer, key_cache, of (term, doc_id, tf) tuples; when a new term is found, we flush the buffer to standard output, together with the accumulated document frequency df:
# Cache the (term, doc_id, tf) tuples.
key_cache = []

line = sys.stdin.readline().strip()
cur_term, cur_doc_id, cur_tf = line.split('\t')
cur_tf = int(cur_tf)
cur_df = 1

for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id, tf = line.strip().split('\t')
        tf = int(tf)
    except:
        logger.warn("Invalid record: %s" % line)
        continue
    # term is the only key for this input
    if (term == cur_term):
        # increment document frequency
        cur_df += 1
        key_cache.append(
            u"%s\t%s\t%s" % (term.decode('utf-8'), doc_id.decode('utf-8'),
                             tf))
    else:
        for key in key_cache:
            print("%s\t%s" % (key, cur_df))
        print(
            u"%s\t%s\t%s\t%s" % (
                cur_term.decode('utf-8'),
                cur_doc_id.decode('utf-8'),
                cur_tf, cur_df)
        )
        # flush the cache
        key_cache = []
        cur_doc_id = doc_id
        cur_term = term
        cur_tf = tf
        cur_df = 1

for key in key_cache:
    print(u"%s\t%s" % (key.decode('utf-8'), cur_df))
print(
    u"%s\t%s\t%s\t%s" % (
        cur_term.decode('utf-8'),
        cur_doc_id.decode('utf-8'),
        cur_tf, cur_df))
We can test the scripts from the command line with:

$ cat /tmp/tf-out.tsv | python map-df.py | python reduce-df.py > /tmp/df-out.tsv
And we can test the scripts on Hadoop streaming with:
$ /usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D map.output.key.field.separator=\t \
    -D stream.num.map.output.key.fields=3 \
    -D mapreduce.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator \
    -D mapreduce.text.key.comparator.options=-k1 \
    -input /tmp/tf-out.tsv/part-00000 \
    -output /tmp/df-out.tsv \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -file reduce-df.py \
    -reducer "python reduce-df.py"
On Hadoop, we use org.apache.hadoop.mapred.lib.IdentityMapper, which provides the same logic as the map-df.py script.
The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/map-df.py.
Putting it all together – TF-IDF

To calculate TF-IDF, we only need a mapper that consumes the output of the previous step:
num_doc = sys.argv[1]
for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id, tf, df = line.split('\t')
        tf = float(tf)
        df = float(df)
        num_doc = float(num_doc)
    except:
        logger.warn("Invalid record %s" % line)
        continue
    # idf = num_doc / df
    tf_idf = tf * (1 + math.log(num_doc / df))
    print("%s\t%s\t%s" % (term, doc_id, tf_idf))
The number of documents in the collection is passed as a parameter to tf-idf.py:
$ /usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.reduce.tasks=0 \
    -input /tmp/df-out.tsv/part-00000 \
    -output /tmp/tf-idf.out \
    -file tf-idf.py \
    -mapper "python tf-idf.py 15578"
To calculate the total number of tweets, we can use the cat and wc Unix utilities in combination with Hadoop streaming:
$ /usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input tweets.json \
    -output tweets.cnt \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
The mapper source code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/tf-idf.py.
Kite Data

The Kite SDK (http://www.kitesdk.org) is a collection of classes, command-line tools, and examples that aims at easing the process of building applications on top of Hadoop.
In this section, we will look at how Kite Data, a subproject of Kite, can ease integration with several components of a Hadoop data warehouse. Kite examples can be found at https://github.com/kite-sdk/kite-examples.
On Cloudera's QuickStart VM, Kite JARs can be found at /opt/cloudera/parcels/CDH/lib/kite/.
Kite Data is organized into a number of subprojects, some of which we'll describe in the following sections.
Data Core

As the name suggests, the core is the building block for all capabilities provided in the Data module. Its principal abstractions are datasets and repositories.

The org.kitesdk.data.Dataset interface is used to represent an immutable set of data:
@Immutable
public interface Dataset<E> extends RefinableView<E> {
  String getName();
  DatasetDescriptor getDescriptor();
  Dataset<E> getPartition(PartitionKey key, boolean autoCreate);
  void dropPartition(PartitionKey key);
  Iterable<Dataset<E>> getPartitions();
  URI getUri();
}
Each dataset is identified by a name and an instance of the org.kitesdk.data.DatasetDescriptor interface, which is the structural description of a dataset and provides its schema (org.apache.avro.Schema) and partitioning strategy.
Implementations of the DatasetReader<E> interface are used to read data from an underlying storage system and produce deserialized entities of type E. The newReader() method can be used to get an appropriate implementation for a given dataset:
public interface DatasetReader<E> extends Iterator<E>, Iterable<E>,
    Closeable {
  void open();
  boolean hasNext();
  E next();
  void remove();
  void close();
  boolean isOpen();
}
An instance of DatasetReader will provide methods to read and iterate over streams of data. Similarly, org.kitesdk.data.DatasetWriter provides an interface to write streams of data to Dataset objects:
public interface DatasetWriter<E> extends Flushable, Closeable {
  void open();
  void write(E entity);
  void flush();
  void close();
  boolean isOpen();
}
Like readers, writers are use-once objects. They serialize instances of entities of type E and write them to the underlying storage system. Writers are usually not instantiated directly; rather, an appropriate implementation can be created by the newWriter() factory method. Implementations of DatasetWriter will hold resources until close() is called and expect the caller to invoke close() in a finally block when the writer is no longer in use. Finally, note that implementations of DatasetWriter are typically not thread-safe; the behavior of a writer being accessed from multiple threads is undefined.
A particular case of a dataset is the View interface, which is as follows:
public interface View<E> {
  Dataset<E> getDataset();
  DatasetReader<E> newReader();
  DatasetWriter<E> newWriter();
  boolean includes(E entity);
  public boolean deleteAll();
}
Views carry subsets of the keys and partitions of an existing dataset; they are conceptually similar to the notion of a view in the relational model.
A View interface can be created from ranges of data, ranges of keys, or as a union of other views.
Data HCatalog

Data HCatalog is a module that enables access to HCatalog repositories. The core abstractions of this module are org.kitesdk.data.hcatalog.HCatalogAbstractDatasetRepository and its concrete implementation, org.kitesdk.data.hcatalog.HCatalogDatasetRepository.

They describe a DatasetRepository that uses HCatalog to manage metadata and HDFS for storage, as follows:
public class HCatalogDatasetRepository extends
    HCatalogAbstractDatasetRepository {

  HCatalogDatasetRepository(Configuration conf) {
    super(conf, new HCatalogManagedMetadataProvider(conf));
  }

  HCatalogDatasetRepository(Configuration conf, MetadataProvider provider) {
    super(conf, provider);
  }

  public <E> Dataset<E> create(String name, DatasetDescriptor descriptor) {
    getMetadataProvider().create(name, descriptor);
    return load(name);
  }

  public boolean delete(String name) {
    return getMetadataProvider().delete(name);
  }

  public static class Builder {
    …
  }
}
Note

As of Kite 0.17, Data HCatalog is deprecated in favor of the new Data Hive module.
The location of the data directory is either chosen by Hive/HCatalog (so-called managed tables) or specified when creating an instance of this class by providing a filesystem and a root directory in the constructor (external tables).
Data Hive

The kite-data-hive module exposes Hive schemas via the Dataset interface. As of Kite 0.17, this package supersedes Data HCatalog.
Data MapReduce

The org.kitesdk.data.mapreduce package provides interfaces to read and write data to and from a Dataset with MapReduce.
Data Spark

The org.kitesdk.data.spark package provides interfaces for reading and writing data to and from a Dataset with Apache Spark.
Data Crunch

org.kitesdk.data.crunch.CrunchDatasets is a helper class that exposes datasets and views as Crunch ReadableSource or Target classes:
public class CrunchDatasets {

  public static <E> ReadableSource<E> asSource(View<E> view, Class<E> type) {
    return new DatasetSourceTarget<E>(view, type);
  }

  public static <E> ReadableSource<E> asSource(URI uri, Class<E> type) {
    return new DatasetSourceTarget<E>(uri, type);
  }

  public static <E> ReadableSource<E> asSource(String uri, Class<E> type) {
    return asSource(URI.create(uri), type);
  }

  public static <E> Target asTarget(View<E> view) {
    return new DatasetTarget<E>(view);
  }

  public static Target asTarget(String uri) {
    return asTarget(URI.create(uri));
  }

  public static Target asTarget(URI uri) {
    return new DatasetTarget<Object>(uri);
  }
}
Apache Crunch

Apache Crunch (http://crunch.apache.org) is a Java and Scala library for creating pipelines of MapReduce jobs. It is based on Google's FlumeJava paper (http://dl.acm.org/citation.cfm?id=1806638) and library. The project goal is to make the task of writing MapReduce jobs as straightforward as possible for anybody familiar with the Java programming language, by exposing a number of patterns that implement operations such as aggregating, joining, filtering, and sorting records.
Similar to tools such as Pig, Crunch pipelines are created by composing immutable, distributed data structures and running all processing operations on such structures; these operations are expressed and implemented as user-defined functions. Pipelines are compiled into a DAG of MapReduce jobs, whose execution is managed by the library's planner. Crunch allows us to write iterative code and abstracts away the complexity of thinking in terms of map and reduce operations, while at the same time avoiding the need for an ad hoc programming language such as Pig Latin. In addition, Crunch offers a highly customizable type system that allows us to work with, and mix, Hadoop Writables, HBase, and Avro serialized objects.
FlumeJava's main assumption is that MapReduce is the wrong level of abstraction for several classes of problems, where computations are often made up of multiple, chained jobs. Frequently, we need to compose logically independent operations (for example, filtering, projecting, grouping, and other transformations) into a single physical MapReduce job for performance reasons. This aspect also has implications for code testability. Although we won't cover this aspect in this chapter, the reader is encouraged to look further into it by consulting Crunch's documentation.
Getting started

Crunch JARs are already installed on the QuickStart VM. By default, the JARs are found in /opt/cloudera/parcels/CDH/lib/crunch.

Alternatively, recent Crunch libraries can be downloaded from https://crunch.apache.org/download.html, from Maven Central, or from Cloudera-specific repositories.
Concepts

Crunch pipelines are created by composing two abstractions: PCollection and PTable.

The PCollection<T> interface is a distributed, immutable collection of objects of type T. The PTable<Key, Value> interface is a distributed, immutable hash table (a sub-interface of PCollection) of keys of the Key type and values of the Value type that exposes methods to work with key-value pairs.
These two abstractions support the following four primitive operations:

parallelDo: applies a user-defined function, DoFn, to a given PCollection and returns a new PCollection
union: merges two or more PCollections into a single virtual PCollection
groupByKey: sorts and groups the elements of a PTable by their keys
combineValues: aggregates the values from a groupByKey operation
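Although Crunch is a Java API, the semantics of the four primitives are easy to sketch in single-process Python. The toy functions below are our own illustration, not part of Crunch:

```python
from itertools import chain, groupby

# toy, in-memory analogues of Crunch's four primitives (illustrative only)
def parallel_do(pcollection, do_fn):
    # do_fn may emit zero or more outputs per element, like a DoFn
    return list(chain.from_iterable(do_fn(x) for x in pcollection))

def union(*pcollections):
    return list(chain(*pcollections))

def group_by_key(ptable):
    # sort so identical keys are adjacent, then gather their values
    ordered = sorted(ptable, key=lambda kv: kv[0])
    return [(k, [v for _, v in grp])
            for k, grp in groupby(ordered, key=lambda kv: kv[0])]

def combine_values(grouped, combine_fn):
    return [(k, combine_fn(vs)) for k, vs in grouped]

# word count expressed in terms of the four primitives
lines = ["big data", "big hadoop"]
pairs = parallel_do(lines, lambda line: [(w, 1) for w in line.split()])
counts = combine_values(group_by_key(pairs), sum)
```

In Crunch, of course, each primitive runs distributed across the cluster, and groupByKey maps onto the MapReduce shuffle rather than an in-memory sort.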
The example at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/HashtagCount.java implements a Crunch MapReduce pipeline that counts hashtag occurrences:
Pipeline pipeline = new MRPipeline(HashtagCount.class, getConf());
pipeline.enableDebug();

PCollection<String> lines = pipeline.readTextFile(args[0]);

PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
  public void process(String line, Emitter<String> emitter) {
    for (String word : line.split("\\s+")) {
      if (word.matches("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")) {
        emitter.emit(word);
      }
    }
  }
}, Writables.strings());

PTable<String, Long> counts = words.count();
pipeline.writeTextFile(counts, args[1]);

// Execute the pipeline as a MapReduce.
pipeline.done();
In this example, we first create an MRPipeline and use it to read the content of sample.txt, created with stream.py -t, into a collection of strings, where each element of the collection represents a tweet. We tokenize each tweet into words with line.split("\\s+"), and we emit each word that matches the hashtag regular expression, serialized as a Writable. Note that the tokenizing and filtering operations are executed in parallel by MapReduce jobs created by the parallelDo call. We then create a PTable that associates each hashtag, represented as a string, with the number of times it occurred in the dataset. Finally, we write the PTable counts to HDFS as a text file. The pipeline is executed with pipeline.done().
To compile and execute the pipeline, we can use Gradle to manage the needed dependencies, as follows:

$ ./gradlew jar
$ ./gradlew copyJars
Add the Crunch and Avro dependencies downloaded with copyJars to the LIBJARS environment variable:

$ export CRUNCH_DEPS=build/libjars/crunch-example/lib
$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/crunch-core-0.9.0-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-1.7.5-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-mapred-1.7.5-cdh5.0.3-hadoop2.jar
Then, run the example on Hadoop:

$ hadoop jar build/libs/crunch-example.jar \
    com.learninghadoop2.crunch.HashtagCount \
    tweets.json count-out \
    -libjars $LIBJARS
Data serialization

One of the framework's goals is to make it easy to process complex records containing nested and repeated data structures, such as protocol buffers and Thrift records.
The org.apache.crunch.types.PType interface defines the mapping between a data type used in a Crunch pipeline and a serialization and storage format used to read/write data from/to HDFS. Every PCollection has an associated PType that tells Crunch how to read/write data.
The org.apache.crunch.types.PTypeFamily interface provides an abstract factory to implement instances of PType that share the same serialization format. Currently, Crunch supports two type families: one based on the Writable interface and the other on Apache Avro.
Note

Although Crunch permits mixing and matching PCollection interfaces that use different instances of PType in the same pipeline, each PCollection's PType must belong to a single family. For instance, it is not possible to have a PTable with a key serialized as a Writable and its value serialized using Avro.
Both type families support a common set of primitive types (strings, longs, integers, floats, doubles, booleans, and bytes) as well as more complex PType interfaces that can be constructed out of other PTypes. These include tuples and collections of other PTypes. A particularly important complex PType is tableOf, which determines whether the return type of parallelDo will be a PCollection or a PTable.
New PTypes can be created by inheriting from and extending the built-ins of the Avro and Writable families. This requires implementing inputMapFn<S, T> and outputMapFn<T, S> classes, where S is the original type and T is the new type.
Derived PTypes can be found in the PTypes class. These include serialization support for protocol buffers, Thrift records, Java Enums, BigInteger, and UUIDs. The Elephant Bird library we discussed in Chapter 6, Data Analysis with Apache Pig, contains additional examples.
Data processing patterns

org.apache.crunch.lib implements a number of design patterns for common data manipulation operations.

Aggregation and sorting

Most of the data processing patterns provided by org.apache.crunch.lib rely on the PTable's groupByKey method. The method has three different overloaded forms:
groupByKey():letstheplannerdeterminethenumberofpartitionsgroupByKey(intnumPartitions):isusedtosetthenumberofpartitionsspecifiedbythedevelopergroupByKey(GroupingOptionsoptions):allowsustospecifycustompartitionsandcomparatorsforshuffling
Theorg.apache.crunch.GroupingOptionsclasstakesinstancesofHadoop’sPartitionerandRawComparatorclassestoimplementcustompartitioningandsortingoperations.
ThegroupByKeymethodreturnsaninstanceofPGroupedTable,Crunch’srepresentationofagroupedtable.ItcorrespondstotheoutputoftheshufflephaseofaMapReducejobandallowsvaluestobecombinedwiththecombineValuemethod.
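Conceptually, groupByKey followed by combineValues behaves like the shuffle and combine steps of MapReduce. A stdlib-only sketch of that behavior (our own illustration, not Crunch code) might look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of groupByKey + combineValues: group (key, value) pairs by key
// (the "PGroupedTable"), then fold each group's values with a combiner.
public class GroupBySketch {
    public static Map<String, Integer> groupAndSum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // groupByKey step
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        Map<String, Integer> combined = new TreeMap<>();      // combineValues step
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            combined.put(e.getKey(), sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        System.out.println(groupAndSum(List.of(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3)))); // {a=4, b=2}
    }
}
```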
The org.apache.crunch.lib.Aggregate package exposes methods to perform simple aggregations (count, max, top, and length) on PCollection instances.

Sort provides an API to sort PCollection and PTable instances whose contents implement the Comparable interface.

By default, Crunch sorts data using one reducer. This behavior can be modified by passing the required number of partitions to the sort method. The Sort.Order parameter signals the order in which a sort should be done.
The following shows how different sort options can be specified for collections:

public static <T> PCollection<T> sort(PCollection<T> collection)
public static <T> PCollection<T> sort(PCollection<T> collection, Sort.Order order)
public static <T> PCollection<T> sort(PCollection<T> collection, int numReducers, Sort.Order order)
The following shows how different sort options can be specified for tables:

public static <K, V> PTable<K, V> sort(PTable<K, V> table)
public static <K, V> PTable<K, V> sort(PTable<K, V> table, Sort.Order key)
public static <K, V> PTable<K, V> sort(PTable<K, V> table, int numReducers, Sort.Order key)

Finally, sortPairs sorts a PCollection of pairs using the column order specified in Sort.ColumnOrder:
sortPairs(PCollection<Pair<U, V>> collection, Sort.ColumnOrder... columnOrders)
Joining data

The org.apache.crunch.lib.Join package is an API to join PTables based on a common key. The following four join operations are supported:

fullJoin
join (defaults to innerJoin)
leftJoin
rightJoin

The methods have a common return type and signature. For reference, we will describe the commonly used join method, which implements an inner join:
public static <K, U, V> PTable<K, Pair<U, V>> join(PTable<K, U> left, PTable<K, V> right)
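An inner join keeps only the keys present in both tables and pairs their values. A stdlib sketch of those semantics (ours, not the Crunch implementation):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of inner-join semantics: for each key present in both the left
// and the right table, emit the key with the pair of matching values.
public class InnerJoinSketch {
    public static Map<String, List<String>> innerJoin(Map<String, String> left,
                                                      Map<String, String> right) {
        Map<String, List<String>> joined = new TreeMap<>();
        for (Map.Entry<String, String> l : left.entrySet()) {
            String r = right.get(l.getKey());
            if (r != null) { // inner join: drop keys missing from either side
                joined.put(l.getKey(), List.of(l.getValue(), r));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> tf = Map.of("hadoop", "3", "crunch", "1");
        Map<String, String> df = Map.of("hadoop", "2", "pig", "5");
        System.out.println(innerJoin(tf, df)); // {hadoop=[3, 2]}
    }
}
```

A fullJoin would additionally emit unmatched keys from both sides, a leftJoin those from the left only, and a rightJoin those from the right only.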
The JoinStrategy interface in org.apache.crunch.lib provides a way to define custom join strategies. Crunch's default strategy is to join data reduce-side.
Pipelines implementation and execution

Crunch comes with three implementations of the Pipeline interface. The oldest one, implicitly used in this chapter, is org.apache.crunch.impl.mr.MRPipeline, which uses Hadoop's MapReduce as its execution engine. org.apache.crunch.impl.mem.MemPipeline allows all operations to be performed in memory, with no serialization to disk. Crunch 0.10 introduced org.apache.crunch.impl.spark.SparkPipeline, which compiles and runs a DAG of PCollections on Apache Spark.

SparkPipeline

With SparkPipeline, Crunch delegates much of the execution to Spark and performs relatively few of the planning tasks itself, with the following exceptions:

Multiple inputs
Multiple outputs
Data serialization
Checkpointing

At the time of writing, SparkPipeline is still under heavy development and might not handle all of the use cases of a standard MRPipeline. The Crunch community is actively working to ensure complete compatibility between the two implementations.

MemPipeline

MemPipeline executes in-memory on a client. Unlike MRPipeline, a MemPipeline is not explicitly created but is obtained by calling the static method MemPipeline.getInstance(). All operations are performed in memory, and the use of PTypes is minimal.
Crunch examples

We will now use Apache Crunch to reimplement some of the MapReduce code written so far in a more modular fashion.

Word co-occurrence

In Chapter 3, Processing – MapReduce and Beyond, we showed a MapReduce job, BiGramCount, to count co-occurrences of words in tweets. That same logic can be implemented as a DoFn. Instead of emitting a multi-field key and having to parse it at a later stage, with Crunch we can use the complex type Pair<String, String>, as follows:
class BiGram extends DoFn<String, Pair<String, String>> {
    @Override
    public void process(String tweet,
            Emitter<Pair<String, String>> emitter) {
        String[] words = tweet.split(" ");
        String prev = null;
        for (String s : words) {
            if (prev != null) {
                emitter.emit(Pair.of(prev, s));
            }
            prev = s;
        }
    }
}
Notice how, compared to MapReduce, the BiGram Crunch implementation is a standalone class, easily reusable in any other codebase. The code for this example is included at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/DataPreparationPipeline.java
TF-IDF

We can implement the TF-IDF chain of jobs with an MRPipeline, as follows:
public class CrunchTermFrequencyInvertedDocumentFrequency
        extends Configured implements Tool, Serializable {

    private Long numDocs;

    @SuppressWarnings("deprecation")
    public static class TF {
        String term;
        String docId;
        int frequency;

        public TF() {}
        public TF(String term,
                String docId, Integer frequency) {
            this.term = term;
            this.docId = docId;
            this.frequency = (int) frequency;
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println();
            System.err.println("Usage: " + this.getClass().getName() +
                    " [generic options] input output");
            return 1;
        }

        // Create an object to coordinate pipeline creation and execution.
        Pipeline pipeline =
                new MRPipeline(TermFrequencyInvertedDocumentFrequency.class, getConf());

        // Enable debug options.
        pipeline.enableDebug();

        // Reference a given text file as a collection of Strings.
        PCollection<String> tweets = pipeline.readTextFile(args[0]);
        numDocs = tweets.length().getValue();

        // We use Avro reflection to map the TF POJO to an avsc schema.
        PTable<String, TF> tf = tweets.parallelDo(new TermFrequencyAvro(),
                Avros.tableOf(Avros.strings(), Avros.reflects(TF.class)));

        // Calculate DF.
        PTable<String, Long> df = Aggregate.count(tf.parallelDo(new
                DocumentFrequencyString(), Avros.strings()));

        // Finally, we calculate TF-IDF.
        PTable<String, Pair<TF, Long>> tfDf = Join.join(tf, df);
        PCollection<Tuple3<String, String, Double>> tfIdf =
                tfDf.parallelDo(new TermFrequencyInvertedDocumentFrequency(),
                        Avros.triples(
                                Avros.strings(),
                                Avros.strings(),
                                Avros.doubles()));

        // Serialize as Avro.
        tfIdf.write(To.avroFile(args[1]));

        // Execute the pipeline as a MapReduce job.
        PipelineResult result = pipeline.done();
        return result.succeeded() ? 0 : 1;
    }
    …
}
The approach that we follow here has a number of advantages compared to Streaming. First of all, we don't need to manually chain MapReduce jobs using a separate script; this task is Crunch's main purpose. Secondly, we can express each component of the metric as a distinct class, making it easier to reuse in future applications.

To implement term frequency, we create a DoFn class that takes a tweet as input and emits Pair<String, TF>. The first element is a term, and the second is an instance of the POJO class that will be serialized using Avro. The TF part contains three variables: term, docId, and frequency. In the reference implementation, we expect input data to be a JSON string that we deserialize and parse. We also include tokenizing as a subtask of the process method.

Depending on the use case, we could abstract both operations into separate DoFns, as follows:
class TermFrequencyAvro extends DoFn<String, Pair<String, TF>> {
    public void process(String jsonTweet,
            Emitter<Pair<String, TF>> emitter) {
        Map<String, Integer> termCount = new HashMap<>();
        String tweet;
        String docId;

        JSONParser parser = new JSONParser();
        try {
            Object obj = parser.parse(jsonTweet);
            JSONObject jsonObject = (JSONObject) obj;

            tweet = (String) jsonObject.get("text");
            docId = (String) jsonObject.get("id_str");

            for (String term : tweet.split("\\s+")) {
                String lower = term.toLowerCase();
                if (termCount.containsKey(lower)) {
                    termCount.put(lower, termCount.get(lower) + 1);
                } else {
                    termCount.put(lower, 1);
                }
            }

            for (Entry<String, Integer> entry : termCount.entrySet()) {
                emitter.emit(Pair.of(entry.getKey(), new TF(entry.getKey(),
                        docId, entry.getValue())));
            }
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}
Document frequency is straightforward. For each Pair<String, TF> generated in the term frequency step, we emit the term (the first element of the pair). We aggregate and count the resulting PCollection of terms to obtain document frequency, as follows:
class DocumentFrequencyString extends DoFn<Pair<String, TF>, String> {
    @Override
    public void process(Pair<String, TF> tfAvro,
            Emitter<String> emitter) {
        emitter.emit(tfAvro.first());
    }
}
We finally join the TF PTable with the DF PTable on the shared key (the term) and feed the resulting Pair<String, Pair<TF, Long>> object to TermFrequencyInvertedDocumentFrequency. For each term and document, we calculate TF-IDF and return a (term, docId, tfIdf) triple:
class TermFrequencyInvertedDocumentFrequency extends MapFn<Pair<String,
        Pair<TF, Long>>, Tuple3<String, String, Double>> {
    @Override
    public Tuple3<String, String, Double> map(
            Pair<String, Pair<TF, Long>> input) {
        Pair<TF, Long> tfDf = input.second();
        Long df = tfDf.second();
        TF tf = tfDf.first();

        // Cast to double so the ratio is not truncated by long division.
        double idf = 1.0 + Math.log((double) numDocs / df);
        double tfIdf = idf * tf.frequency;
        return Tuple3.of(tf.term, tf.docId, tfIdf);
    }
}
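The arithmetic inside the map method can be checked in isolation. The following stdlib sketch (ours, with illustrative sample numbers) computes the same quantity, casting to double so the numDocs/df ratio is not truncated by integer division:

```java
public class TfIdfSketch {
    // Mirrors the MapFn arithmetic: idf = 1 + ln(numDocs / df),
    // tfIdf = idf * term frequency.
    public static double tfIdf(int frequency, long numDocs, long df) {
        double idf = 1.0 + Math.log((double) numDocs / df);
        return idf * frequency;
    }

    public static void main(String[] args) {
        // A term appearing 3 times in a document and in 10 of 100 documents:
        System.out.println(tfIdf(3, 100, 10)); // 3 * (1 + ln 10), roughly 9.91
    }
}
```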
We use MapFn because we are going to output one record for each input. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/CrunchTermFrequencyInvertedDocumentFrequency.java

The example can be compiled and executed with the following commands:
$ ./gradlew jar
$ ./gradlew copyJars
If not already done, add the Crunch and Avro dependencies downloaded with copyJars to the LIBJARS environment variable, as follows:
$ export CRUNCH_DEPS=build/libjars/crunch-example/lib
$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/crunch-core-0.9.0-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-1.7.5-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-mapred-1.7.5-cdh5.0.3-hadoop2.jar
Furthermore, add the json-simple JAR to LIBJARS:

$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/json-simple-1.1.1.jar
Finally, run CrunchTermFrequencyInvertedDocumentFrequency as a MapReduce job, as follows:

$ hadoop jar build/libs/crunch-example.jar \
com.learninghadoop2.crunch.CrunchTermFrequencyInvertedDocumentFrequency \
-libjars ${LIBJARS} \
tweets.json tweets.avro-out
Kite Morphlines

Kite Morphlines is a data transformation library, inspired by Unix pipes, originally developed as part of Cloudera Search. A morphline is an in-memory chain of transformation commands that relies on a plugin structure to tap heterogeneous data sources. It uses declarative commands to carry out ETL operations on records. Commands are defined in a configuration file, which is later fed to a driver class.

The goal is to make embedding ETL logic into any Java codebase a trivial task by providing a library that allows developers to replace programming with a series of configuration settings.
Concepts

Morphlines are built around two abstractions: Command and Record.

Records are instances of the org.kitesdk.morphline.api.Record class:
public final class Record {
    private ArrayListMultimap<String, Object> fields;
    …
    private Record(ArrayListMultimap<String, Object> fields) { … }

    public ListMultimap<String, Object> getFields() { … }
    public List get(String key) { … }
    public void put(String key, Object value) { … }
    …
}
A record is a set of named fields, where each field has a list of one or more values. Record is implemented on top of Google Guava's ListMultimap and ArrayListMultimap classes. Note that a value can be any Java object, fields can be multivalued, and two records don't need to use common field names. A record can contain an _attachment_body field that can be a java.io.InputStream or a byte array.
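The multivalued-field behavior can be approximated with a map of lists from the standard library. This sketch is ours and simply stands in for Guava's ArrayListMultimap:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a multivalued record: each named field maps to a list of
// values, so put() appends a value instead of overwriting the field.
public class RecordSketch {
    private final Map<String, List<Object>> fields = new HashMap<>();

    public void put(String key, Object value) {
        fields.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    public List<Object> get(String key) {
        return fields.getOrDefault(key, List.of());
    }

    public static void main(String[] args) {
        RecordSketch r = new RecordSketch();
        r.put("tag", "hadoop");
        r.put("tag", "crunch"); // second value for the same field
        System.out.println(r.get("tag")); // [hadoop, crunch]
    }
}
```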
Commands implement the org.kitesdk.morphline.api.Command interface:
public interface Command {
    void notify(Record notification);
    boolean process(Record record);
    Command getParent();
}
A command transforms a record into zero or more records. Commands can call the methods on the Record instance provided for read and write operations, as well as to add or remove fields.

Commands are chained together, and at each step of a morphline the parent command sends records to its child, which in turn processes them. Information between parents and children is exchanged using two communication channels (planes): notifications are sent via a control plane, and records are sent over a data plane. Records are processed by the process() method, which returns a Boolean value to indicate whether the morphline should proceed or not.
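The parent-to-child data plane can be sketched with a minimal command interface of our own. The names below are illustrative, not the Kite API, and a plain String stands in for a Record:

```java
// Sketch of a morphline-style chain: each command processes a record
// (a plain String here) and forwards the result to its child, with the
// boolean return signalling whether the chain should continue.
public class CommandChainSketch {
    interface Cmd {
        boolean process(String record);
    }

    static final StringBuilder sink = new StringBuilder(); // terminal command output

    static final Cmd collect = record -> {
        sink.append(record);
        return true;
    };

    // A transforming command wired to its child, acting as a parent.
    static Cmd uppercase(Cmd child) {
        return record -> child.process(record.toUpperCase());
    }

    public static void main(String[] args) {
        Cmd chain = uppercase(collect); // parent -> child wiring
        boolean proceed = chain.process("hello");
        System.out.println(proceed + " " + sink); // true HELLO
    }
}
```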
Commands are not instantiated directly, but via an implementation of the org.kitesdk.morphline.api.CommandBuilder interface:
public interface CommandBuilder {
    Collection<String> getNames();
    Command build(Config config,
                  Command parent,
                  Command child,
                  MorphlineContext context);
}
The getNames method returns the names with which the command can be invoked. Multiple names are supported to allow backwards-compatible name changes. The build() method creates and returns a command rooted at the given morphline configuration.

The org.kitesdk.morphline.api.MorphlineContext class allows additional parameters to be passed to all morphline commands.

The data model of morphlines is structured following a source-pipe-sink pattern, where data is captured from a source, piped through a number of processing steps, and its output is then delivered into a sink.
Morphline commands

Kite Morphlines comes with a number of default commands that implement data transformations on common serialization formats (plain text, Avro, JSON). Currently available commands are organized as subprojects of morphlines and include:

kite-morphlines-core-stdio: reads data from binary large objects (BLOBs) and text
kite-morphlines-core-stdlib: wraps Java data types for data manipulation and representation
kite-morphlines-avro: serializes data into and deserializes data from the Avro format
kite-morphlines-json: serializes and deserializes data in JSON format
kite-morphlines-hadoop-core: accesses HDFS
kite-morphlines-hadoop-parquet-avro: serializes and deserializes data in the Parquet format
kite-morphlines-hadoop-sequencefile: serializes and deserializes data in the SequenceFile format
kite-morphlines-hadoop-rcfile: serializes and deserializes data in the RCFile format
A list of all available commands can be found at http://kitesdk.org/docs/0.17.0/kite-morphlines/morphlinesReferenceGuide.html.

Commands are defined by declaring a chain of transformations in a configuration file, morphline.conf, which is then compiled and executed by a driver program. For instance, we can specify a read_tweets morphline that will load tweets stored as JSON data, serialize and deserialize them using Jackson, and print the first 10, by combining the default readJson and head commands contained in the org.kitesdk.morphline package, as follows:
morphlines : [{
  id : read_tweets
  importCommands : ["org.kitesdk.morphline.**"]
  commands : [
    {
      readJson {
        outputClass : com.fasterxml.jackson.databind.JsonNode
      }
    }
    {
      head {
        limit : 10
      }
    }
  ]
}]
We will now show how this morphline can be executed both from a standalone Java program and from MapReduce.

MorphlineDriver.java shows how to use the library embedded in a host system. The first step that we carry out in the main method is to load the morphline configuration, build a MorphlineContext object, and compile it into an instance of Command that acts as the starting node of the morphline. Note that Compiler.compile() takes a finalChild parameter; in this case, it is RecordEmitter. We use RecordEmitter to act as a sink for the morphline, by either printing a record to stdout or storing it in HDFS. In the MorphlineDriver example, we use org.kitesdk.morphline.base.Notifications to manage and monitor the morphline lifecycle in a transactional fashion.

A call to Notifications.notifyStartSession(morphline) starts the transformation chain within a transaction defined by calling Notifications.notifyBeginTransaction. Upon success, we terminate the pipeline with Notifications.notifyShutdown(morphline). In the event of failure, we roll back the transaction with Notifications.notifyRollbackTransaction(morphline) and pass an exception handler from the morphline context to the calling Java code:
public class MorphlineDriver {
    private static final class RecordEmitter implements Command {
        private final Text line = new Text();

        @Override
        public Command getParent() {
            return null;
        }

        @Override
        public void notify(Record record) {
        }

        @Override
        public boolean process(Record record) {
            line.set(record.get("_attachment_body").toString());
            System.out.println(line);
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        /* Load a morphline conf and set it up. */
        File morphlineFile = new File(args[0]);
        String morphlineId = args[1];
        MorphlineContext morphlineContext = new
                MorphlineContext.Builder().build();
        Command morphline = new Compiler().compile(morphlineFile,
                morphlineId, morphlineContext, new RecordEmitter());

        /* Prepare the morphline for execution.
         *
         * Notifications are sent through the communication channel.
         */
        Notifications.notifyBeginTransaction(morphline);

        /* Note that we are using the local filesystem, not HDFS. */
        InputStream in = new BufferedInputStream(new
                FileInputStream(args[2]));

        /* Fill in a record and pass it over. */
        Record record = new Record();
        record.put(Fields.ATTACHMENT_BODY, in);
        try {
            Notifications.notifyStartSession(morphline);
            boolean success = morphline.process(record);
            if (!success) {
                System.out.println("Morphline failed to process record: " +
                        record);
            }
            /* Commit the morphline. */
        } catch (RuntimeException e) {
            Notifications.notifyRollbackTransaction(morphline);
            morphlineContext.getExceptionHandler().handleException(e, null);
        } finally {
            in.close();
        }

        /* Shut it down. */
        Notifications.notifyShutdown(morphline);
    }
}
In this example, we load data in JSON format from the local filesystem into an InputStream object and use it to initialize a new Record instance. The RecordEmitter class contains the last processed record instance of the chain, from which we extract _attachment_body and print it to standard output. The source code for MorphlineDriver can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/kite/src/main/java/com/learninghadoop2/kite/morphlines/MorphlineDriver.java
Using the same morphline from a MapReduce job is straightforward. During the setup phase of the Mapper, we build a context that contains the instantiation logic, while the map method sets up the Record object and fires off the processing logic, as follows:
public static class ReadTweets
        extends Mapper<Object, Text, Text, NullWritable> {
    private final Record record = new Record();
    private Command morphline;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        File morphlineConf = new File(context.getConfiguration()
                .get(MORPHLINE_CONF));
        String morphlineId = context.getConfiguration()
                .get(MORPHLINE_ID);
        MorphlineContext morphlineContext =
                new MorphlineContext.Builder()
                        .build();
        morphline = new org.kitesdk.morphline.base.Compiler()
                .compile(morphlineConf,
                        morphlineId,
                        morphlineContext,
                        new RecordEmitter(context));
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        record.put(Fields.ATTACHMENT_BODY,
                new ByteArrayInputStream(
                        value.toString().getBytes("UTF8")));
        if (!morphline.process(record)) {
            System.out.println(
                    "Morphline failed to process record: " + record);
        }
        record.removeAll(Fields.ATTACHMENT_BODY);
    }
}
In the MapReduce code, we modify RecordEmitter to extract the field payload from post-processed records and store it in the Context. This allows us to write data into HDFS by specifying a FileOutputFormat in the MapReduce configuration boilerplate:
private static final class RecordEmitter implements Command {
    private final Text line = new Text();
    private final Mapper.Context context;

    private RecordEmitter(Mapper.Context context) {
        this.context = context;
    }

    @Override
    public void notify(Record notification) {
    }

    @Override
    public Command getParent() {
        return null;
    }

    @Override
    public boolean process(Record record) {
        line.set(record.get(Fields.ATTACHMENT_BODY).toString());
        try {
            context.write(line, null);
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }
        return true;
    }
}
Notice that we can now change the processing pipeline's behavior and add further data transformations by modifying morphline.conf, without needing to alter the instantiation and processing logic. The MapReduce driver source code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/kite/src/main/java/com/learninghadoop2/kite/morphlines/MorphlineDriverMapReduce.java

Both examples can be compiled from ch9/kite/ with the following commands:
$ ./gradlew jar
$ ./gradlew copyJars
We add the runtime dependencies to LIBJARS, as follows:

$ export KITE_DEPS=build/libjars/kite-example/lib
$ export LIBJARS=${LIBJARS},${KITE_DEPS}/kite-morphlines-core-0.17.0.jar,${KITE_DEPS}/kite-morphlines-json-0.17.0.jar,${KITE_DEPS}/metrics-core-3.0.2.jar,${KITE_DEPS}/metrics-healthchecks-3.0.2.jar,${KITE_DEPS}/config-1.0.2.jar,${KITE_DEPS}/jackson-databind-2.3.1.jar,${KITE_DEPS}/jackson-core-2.3.1.jar,${KITE_DEPS}/jackson-annotations-2.3.0.jar
We can run the MapReduce driver with the following:

$ hadoop jar build/libs/kite-example.jar \
com.learninghadoop2.kite.morphlines.MorphlineDriverMapReduce \
-libjars ${LIBJARS} \
morphline.conf \
read_tweets \
tweets.json \
morphlines-out
The Java standalone driver can be executed with the following command:

$ export CLASSPATH=${CLASSPATH}:${KITE_DEPS}/kite-morphlines-core-0.17.0.jar:${KITE_DEPS}/kite-morphlines-json-0.17.0.jar:${KITE_DEPS}/metrics-core-3.0.2.jar:${KITE_DEPS}/metrics-healthchecks-3.0.2.jar:${KITE_DEPS}/config-1.0.2.jar:${KITE_DEPS}/jackson-databind-2.3.1.jar:${KITE_DEPS}/jackson-core-2.3.1.jar:${KITE_DEPS}/jackson-annotations-2.3.0.jar:${KITE_DEPS}/slf4j-api-1.7.5.jar:${KITE_DEPS}/guava-11.0.2.jar:${KITE_DEPS}/hadoop-common-2.3.0-cdh5.0.3.jar
$ java -cp $CLASSPATH:./build/libs/kite-example.jar \
com.learninghadoop2.kite.morphlines.MorphlineDriver \
morphline.conf \
read_tweets tweets.json \
morphlines-out
Summary

In this chapter, we introduced four tools to ease development on Hadoop. In particular, we covered:

How Hadoop Streaming allows the writing of MapReduce jobs using dynamic languages
How Kite Data simplifies interfacing with heterogeneous data sources
How Apache Crunch provides a high-level abstraction to write pipelines of Spark and MapReduce jobs that implement common design patterns
How Morphlines allows us to declare chains of commands and data transformations that can then be embedded in any Java codebase

In Chapter 10, Running a Hadoop Cluster, we will shift our focus from the domain of software development to system administration. We will discuss how to set up, manage, and scale a Hadoop cluster, while taking aspects such as monitoring and security into consideration.
Chapter 10. Running a Hadoop Cluster

In this chapter, we will change our focus a little and look at some of the considerations you will face when running an operational Hadoop cluster. In particular, we will cover the following topics:

Why a developer should care about operations and why Hadoop operations are different
More detail on Cloudera Manager and its capabilities and limitations
Designing a cluster for use on both physical hardware and EMR
Securing a Hadoop cluster
Hadoop monitoring
Troubleshooting problems with an application running on Hadoop
I'm a developer – I don't care about operations!

Before going any further, we need to explain why we are putting a chapter about systems operations in a book squarely aimed at developers. For anyone who has developed for more traditional platforms (for example, web apps, database programming, and so on), the norm might well have been a very clear delineation between development and operations: the first group builds the code and packages it up, and the second group controls and operates the environment in which it runs.

In recent years, the DevOps movement has gained momentum, with a belief that it is best for everyone if these silos are removed and the teams work more closely together. When it comes to running systems and services based on Hadoop, we believe this is absolutely essential.
Hadoop and DevOps practices

Even though a developer can conceptually build an application ready to be dropped into YARN and forgotten about, the reality is often more nuanced. How many resources are allocated to the application at runtime is most likely something the developer wishes to influence. Once the application is running, the operations staff will likely want some insight into the application when they are trying to optimize the cluster. There really isn't the same clear-cut split of responsibilities seen in traditional enterprise IT. And that's likely a really good thing.

In other words, developers need to be more aware of the operations aspects, and the operations staff need to be more aware of what the developers are doing. So consider this chapter our contribution to helping you have those discussions with your operations staff. We don't intend to make you an expert Hadoop administrator by the end of this chapter; that really is emerging as a dedicated role and skill set in itself. Instead, we will give a whistle-stop tour of issues you do need some awareness of and that will make your life easier once your applications are running on live clusters.

By the nature of this coverage, we will be touching on a lot of topics and covering them only lightly; where any are of deeper interest, we provide links for further investigation. Just make sure you keep your operations staff involved!
Cloudera Manager

In this book, we used the Cloudera Hadoop Distribution (CDH) as the most common platform, with its convenient QuickStart virtual machine and the powerful Cloudera Manager application. With a Cloudera-based cluster, Cloudera Manager will become (at least initially) your primary interface into the system to manage and monitor the cluster, so let's explore it a little.

Note that Cloudera Manager has extensive and high-quality online documentation. We won't duplicate this documentation here; instead, we'll attempt to highlight where Cloudera Manager fits into your development and operational workflows and how it might or might not be something you want to embrace. Documentation for the latest and previous versions of Cloudera Manager can be accessed via the main Cloudera documentation page at http://www.cloudera.com/content/support/en/documentation.html.
To pay or not to pay

Before getting all excited about Cloudera Manager, it's important to consult the current documentation concerning which features are available in the free version and which ones require subscription to a paid-for Cloudera offering. If you absolutely want some of the features offered only in the paid-for version but either can't or don't wish to pay for subscription services, then Cloudera Manager, and possibly the entire Cloudera distribution, might not be a good fit for you. We'll return to this topic in Chapter 11, Where to Go Next.
ClustermanagementusingClouderaManagerUsingtheQuickStartVM,itwon’tbeobvious,butClouderaManageristheprimarytooltobeusedformanagementofallservicesinthecluster.Ifyouwanttoenableanewservice,you’lluseClouderaManager.Tochangeaconfiguration,youwillneedClouderaManager.Toupgradetothelatestrelease,youwillagainrequireClouderaManager.
Eveniftheprimarymanagementoftheclusterishandledbyoperationalstaff,asadeveloperyou’lllikelystillwanttobecomefamiliarwiththeClouderaManagerinterfacejusttolooktoseeexactlyhowtheclusterisconfigured.Ifyourjobsarerunningslowly,thenlookingintoClouderaManagertoseejusthowthingsarecurrentlyconfiguredwilllikelybeyourfirststart.ThedefaultportfortheClouderaManagerwebinterfaceis7180,sothehomepagewillusuallybeconnectedtoviaaURLsuchashttp://<hostname>:7180/cmf/home,andcanbeseeninthefollowingscreenshot:
ClouderaManagerhomepage
It’sworthpokingaroundtheinterface;however,ifyouareconnectingwithauseraccountwithadminprivileges,becareful!
Click on the Clusters link, and this will expand to give a list of the clusters currently managed by this instance of Cloudera Manager. This should tell you that a single Cloudera Manager instance can manage multiple clusters, which is very useful, especially if you have many clusters spread across development and production.
For each expanded cluster, there will be a list of the services currently running on the cluster. Click on a service, and you will see a list of additional choices. Select Configuration, and you can start browsing the detailed configuration of that particular service. Click on Actions, and you will get some service-specific options; this will usually include stopping, starting, restarting, and otherwise managing the service.
Click on the Hosts option instead of Clusters, and you can start drilling down into the servers managed by Cloudera Manager and, from there, see which service components are deployed on each.
Cloudera Manager and other management tools

That last comment might raise a question: how does Cloudera Manager integrate with other systems management tools? Given our earlier comments regarding the importance of DevOps philosophies, how well does it integrate with the tools favored in DevOps environments?
The honest answer: not always very well. Though the main Cloudera Manager server can itself be managed by automation tools, such as Puppet or Chef, there is an explicit assumption that Cloudera Manager will control the installation and configuration of all the software it needs on all the hosts that will be included in its clusters. To some administrators, this makes the hardware behind Cloudera Manager look like a big black box; they might control the installation of the base operating system, but the management of the configuration baseline going forward is entirely handled by Cloudera Manager. There's nothing much to be done here; it is what it is: to get the benefits of Cloudera Manager, it will add itself as a new management system in your infrastructure, and how well that fits in with your broader environment will be determined on a case-by-case basis.
Monitoring with Cloudera Manager

A similar point can be made regarding systems monitoring, as Cloudera Manager is also conceptually a point of duplication here. But start clicking around the interface, and it will become apparent very quickly that Cloudera Manager provides an exceptionally rich set of tools to assess the health and performance of managed clusters.
From graphing the relative performance of Impala queries, through showing the job status for YARN applications, to giving low-level data on the blocks stored on HDFS, it is all there in a single interface. We'll discuss later in this chapter how troubleshooting on Hadoop can be challenging, but the single point of visibility provided by Cloudera Manager is a great tool when looking to assess cluster health or performance. We'll discuss monitoring in a little more detail later in this chapter.
Finding configuration files

One of the first points of confusion when running a cluster managed by Cloudera Manager is trying to find the configuration files used by the cluster. In the vanilla Apache releases of products such as core Hadoop, these files would typically be stored in /etc/hadoop; similarly, /etc/hive for Hive, /etc/oozie for Oozie, and so on.
In a Cloudera Manager managed cluster, however, the config files are regenerated each time a service is restarted and, instead of sitting in the /etc locations on the filesystem, will be found at /var/run/cloudera-scm-agent/process/<pid>-<taskname>/, where the last directory might have a name such as 7007-yarn-NODEMANAGER. This might seem odd to anyone used to working on earlier Hadoop clusters or other distributions that don't do such a thing. But in a Cloudera Manager-controlled cluster, it might often be easier to use the web interface to browse the configuration instead of looking for the underlying config files. Which approach is best? This is a little philosophical, and each team needs to decide what works best for them.
Cloudera Manager API

We've only given the highest-level overview of Cloudera Manager and, in doing so, have completely ignored one area that might be very useful for some organizations: Cloudera Manager offers an API that allows integration of its capabilities into other systems and tools. Consult the documentation if this might be of interest to you.
Cloudera Manager lock-in

This brings us to the point that is implicit in the whole discussion around Cloudera Manager: it does cause a degree of lock-in to Cloudera and their distribution. That lock-in might only exist in certain ways; code, for example, should be portable across clusters, modulo the usual caveats about different underlying versions, but the cluster itself might not easily be reconfigured to use a different distribution. Assume that switching distributions would be a complete remove/reformat/reinstall activity.
Wearen’tsayingdon’tuseit,ratherthatyouneedtobeawareofthelock-inthatcomeswiththeuseofClouderaManager.Forsmallteamswithlittlededicatedoperationssupportorexistinginfrastructure,theimpactofsuchalock-inislikelyoutweighedbythesignificantcapabilitiesthatClouderaManagergivesyou.
For larger teams, or ones working in an environment where integration with existing tools and processes carries more weight, the decision might be less clear. Look at Cloudera Manager, discuss with your operations people, and determine what is right for you.
Note that it is possible to manually download and install the various components of the Cloudera distribution without using Cloudera Manager to manage the cluster and its hosts. This might be an attractive middle ground for some users, as the Cloudera software can be used, but deployment and management can be built into the existing deployment and management tools. This is also potentially a way of avoiding the additional expense of the paid-for levels of Cloudera support mentioned earlier.
Ambari – the open source alternative

Ambari is an Apache project (http://ambari.apache.org) that, in theory, provides an open source alternative to Cloudera Manager. It is the administration console for the Hortonworks distribution. At the time of writing, Hortonworks employees also make up the vast majority of the project's contributors.
Ambari, as one would expect given its open source nature, relies on other open source products, such as Puppet and Nagios, to provide the management and monitoring of its managed clusters. It also has high-level functionality similar to Cloudera Manager, that is, the installation, configuration, management, and monitoring of a Hadoop cluster and the component services within it.
It is good to be aware of the Ambari project, as the choice is not just between full lock-in to Cloudera and Cloudera Manager or a manually managed cluster. Ambari provides a graphical tool that might be worth consideration, or indeed involvement, as it matures. On an HDP cluster, the Ambari UI equivalent to the Cloudera Manager home page shown earlier can be reached at http://<hostname>:8080/#/main/dashboard and looks like the following screenshot:
Ambari
Operations in the Hadoop 2 world

As mentioned in Chapter 2, Storage, some of the most significant changes made to HDFS in Hadoop 2 involve its fault tolerance and better integration with external systems. This is not just a curiosity; the NameNode High Availability features, in particular, have made a massive difference to the management of clusters since Hadoop 1. In the bad old days of 2012 or so, a significant part of the operational preparedness of a Hadoop cluster was built around mitigations for, and restoration processes around, failure of the NameNode. If the NameNode died in Hadoop 1 and you didn't have a backup of the HDFS fsimage metadata file, then you basically lost access to all your data. If the metadata was permanently lost, then so was the data.
Hadoop 2 has added the built-in NameNode HA and the machinery to make it work. In addition, there are components such as the NFS gateway into HDFS, which make it a much more flexible system. But this additional capability does come at the expense of more moving parts. To enable NameNode HA, there are additional components in the JournalNodes and the FailoverController, and the NFS gateway requires Hadoop-specific implementations of the portmap and nfsd services.
Hadoop 2 also now has extensive other integration points with external services, as well as a much broader selection of applications and services that run atop it. Consequently, it might be useful to view Hadoop 2, in operational terms, as having traded the simplicity of Hadoop 1 for additional complexity that delivers a substantially more capable platform.
Sharing resources

In Hadoop 1, the only time one had to consider resource sharing was in deciding which scheduler to use for the MapReduce JobTracker. Since all jobs were eventually translated into MapReduce code, having a policy for resource sharing at the MapReduce level was usually sufficient to manage cluster workloads in the large.
Hadoop 2 and YARN changed this picture. As well as running many MapReduce jobs, a cluster might also be running many other applications atop other YARN ApplicationMasters. Tez and Spark are frameworks in their own right that run additional applications atop their provided interfaces.
If everything runs on YARN, then it provides ways of configuring the maximum resource allocation (in terms of CPU, memory, I/O, and so on) consumed by each container allocated to an application. The primary goal here is to ensure that enough resources are allocated to keep the hardware fully utilized, without either having unused capacity or overloading it.
Things get somewhat more interesting when non-YARN applications, such as Impala, are running on the cluster and want to grab allocated slices of capacity (particularly memory in the case of Impala). This could also happen if, say, you were running Spark on the same hosts in its non-YARN mode, or indeed any other distributed application that might benefit from co-location on the Hadoop machines.
Basically, in Hadoop 2, you need to think of the cluster as much more of a multi-tenancy environment, one that requires more attention to be given to the allocation of resources to the various tenants.
There really is no silver-bullet recommendation here; the right configuration will be entirely dependent on the services co-located and the workloads they are running. This is another example where you want to work closely with your operations team to do a series of load tests with thresholds to determine just what the resource requirements of the various clients are and which approach will give the maximum utilization and performance. The following blog post from Cloudera engineers gives a good overview of how they approach this very issue in having Impala and MapReduce coexist effectively: http://blog.cloudera.com/blog/2013/06/configuring-impala-and-mapreduce-for-multi-tenant-performance/.
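As a toy illustration of the kind of static carve-up such load tests might converge on, consider splitting a worker node's memory between the OS and daemons, a non-YARN tenant such as Impala, and YARN. All numbers here are invented for illustration, not a recommendation:

```python
def carve_up_node_memory(total_gb, os_and_daemons_gb, impala_gb):
    """Toy static partition of a worker node's memory between the OS,
    a non-YARN tenant such as Impala, and YARN's NodeManager.
    Whatever remains after the fixed reservations goes to YARN."""
    yarn_gb = total_gb - os_and_daemons_gb - impala_gb
    if yarn_gb <= 0:
        raise ValueError("node is oversubscribed")
    return {"os": os_and_daemons_gb, "impala": impala_gb, "yarn": yarn_gb}

# A hypothetical 64 GB worker co-hosting Impala:
print(carve_up_node_memory(64, 8, 16))   # {'os': 8, 'impala': 16, 'yarn': 40}
```

In practice, the right split only emerges from load testing the actual workloads, as the text stresses; a static carve-up like this is merely the starting point you then tune.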
Building a physical cluster

There is one minor prerequisite before thinking about the allocation of hardware resources: defining and selecting the hardware used for your cluster. In this section, we'll discuss a physical cluster and move on to Amazon EMR in the next.
Any specific hardware advice will be out of date the moment it is written. We advise perusing the websites of the various Hadoop distribution vendors, as they regularly write new articles on the currently recommended configurations.
Instead of telling you how many cores or GB of memory you need, we'll look at hardware selection at a slightly higher level. The first thing to realize is that the hosts running your Hadoop cluster will most likely look very different from the rest of your enterprise. Hadoop is optimized for low(er)-cost hardware, so instead of seeing a small number of very large servers, expect to see a larger number of machines with fewer enterprise reliability features. But don't think that Hadoop will run great on any junk you have lying around. It might, but recently the profile of typical Hadoop servers has been moving away from the bottom end of the market; instead, the sweet spot would seem to be mid-range servers where the maximum cores/disks/memory can be achieved at a given price point.
You should also expect to have different resource requirements for the hosts running services such as the HDFS NameNode or the YARN ResourceManager, as opposed to the worker nodes storing data and executing the application logic. For the former, there is usually much less requirement for lots of storage but, frequently, a need for more memory and possibly faster disks.
For Hadoop worker nodes, the ratio between the three main hardware categories of cores, memory, and I/O is often the most important thing to get right, and this will directly inform the decisions you make regarding workload and resource allocation.
For example, many workloads tend to become I/O bound, and having many times as many containers allocated on a host as there are physical disks might actually cause an overall slowdown due to contention for the spinning disks. At the time of writing, current recommendations here are for the number of YARN containers to be no more than 1.8 times the number of disks. If you have workloads that are I/O bound, then you will most likely get much better performance by adding more hosts to the cluster instead of trying to get more containers running, or faster processors or more memory, on the current hosts.
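The 1.8x rule of thumb above is easy to mechanize as a quick sizing check. A minimal sketch (the disk counts below are invented examples, not recommendations):

```python
def max_yarn_containers(num_disks, ratio=1.8):
    """Rule-of-thumb cap on concurrent YARN containers per host for
    I/O-bound workloads: no more than ~1.8x the number of physical disks."""
    return int(num_disks * ratio)

# A hypothetical worker node with 12 spinning disks:
print(max_yarn_containers(12))   # 21
# And one with 10 disks:
print(max_yarn_containers(10))   # 18
```

If a host is already at this cap and jobs are still I/O bound, the text's advice applies: add hosts rather than containers.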
Conversely, if you expect to run lots of concurrent Impala, Spark, and other memory-hungry jobs, then memory might quickly become the resource most under pressure. This is why, even though you can get current hardware recommendations for general-purpose clusters from the distribution vendors, you still need to validate against your expected workloads and tailor accordingly. There is really no substitute for benchmarking on a small test cluster, or indeed on EMR, which can be a great platform to explore the resource requirements of multiple applications and thereby inform hardware acquisition decisions. Perhaps EMR might be your main environment; if so, we'll discuss that in a later section.
Physical layout

If you do use a physical cluster, there are a few things you will need to consider that are largely transparent on EMR.
Rack awareness

The first of these aspects, for clusters large enough to consume more than one rack of data center space, is building rack awareness. As mentioned in Chapter 2, Storage, when HDFS places replicas of new files, it attempts to place the second replica on a different host than the first, and the third in a different rack of equipment in a multi-rack system. This heuristic is aimed at maximizing resilience; there will be at least one replica available even if an entire rack of equipment fails. MapReduce uses similar logic to attempt to get a better-balanced task spread.
If you do nothing, then each host will be specified as being in the single default rack. But if the cluster grows beyond a single rack, you will need to update the rack names.
Under the covers, Hadoop discovers a node's rack by executing a user-supplied script that maps node hostnames to rack names. Cloudera Manager allows rack names to be set on a given host, and these are then returned when its rack awareness scripts are called by Hadoop. To set the rack for a host, click on Hosts -> <hostname> -> Assign Rack from the Cloudera Manager home page, and then assign the rack.
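Outside of Cloudera Manager, that user-supplied script is wired in via Hadoop's topology-script configuration: it is invoked with one or more hostnames or IPs as arguments and must print one rack path per line. The sketch below is a minimal example; the hostnames and rack names are invented, and a real script would be generated from your host inventory:

```python
#!/usr/bin/env python
# Minimal rack-topology script sketch: Hadoop invokes it with one or
# more hostnames/IPs as arguments and reads one rack path per line.
import sys

# Hypothetical host-to-rack mapping (invented names for illustration).
RACKS = {
    "worker01.example.com": "/dc1/rack1",
    "worker02.example.com": "/dc1/rack2",
}

def rack_for(host):
    # Hosts not listed fall back to Hadoop's default rack.
    return RACKS.get(host, "/default-rack")

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```

Hosts that the script cannot identify must still produce an answer, hence the /default-rack fallback; a script that prints nothing for a host will cause errors.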
Service layout

As mentioned earlier, you are likely to have two types of hardware in your cluster: the machines running the workers and those running the servers. When deploying a physical cluster, you will need to decide which services, and which subcomponents of those services, run on which physical machines.
For the workers, this is usually pretty straightforward; most, though not all, services have a model of a worker agent on all worker hosts. But for the master/server components, it requires a little thought. If you have three master nodes, then how do you spread your primary and backup NameNodes, the YARN ResourceManager, maybe Hue, a few Hive servers, and an Oozie manager? Some of these services are highly available, while others are not. As you add more and more services to your cluster, you'll also see this list of master services grow substantially.
In an ideal world, you might have a host per service master, but that is only tractable for very large clusters; in smaller installations, it is prohibitively expensive, and it might always be a little wasteful. There are no hard-and-fast rules here either, but do look at your available hardware and try to spread the services across the nodes as much as possible. Don't, for example, have two nodes for the two NameNodes and then put everything else on a third. Think about the impact of a single host failure and manage the layout to minimize it. As the cluster grows across multiple racks of equipment, the considerations will also need to include how to survive single-rack failures. Hadoop itself helps with this, since HDFS will attempt to ensure each block of data has replicas across at least two racks. But this type of resilience is undermined if, for example, all the master nodes reside in a single rack.
Upgrading a service

Upgrading Hadoop has historically been a time-consuming and somewhat risky task. This remains the case on a manually deployed cluster, that is, one not managed by a tool such as Cloudera Manager.
If you are using Cloudera Manager, then it takes the time-consuming part out of the activity, but not necessarily the risk. Any upgrade should always be viewed as an activity with a high chance of unexpected issues, and you should arrange enough cluster downtime to account for this surprise excitement. There's really no substitute for doing a test upgrade on a test cluster, which underlines the importance of thinking about Hadoop as a component of your environment that needs to be treated with a deployment lifecycle like any other.
Sometimes an upgrade requires modification to the HDFS metadata or might otherwise affect the filesystem. This is, of course, where the real risks lie. In addition to running a test upgrade, be aware of the ability to put HDFS into upgrade mode, which effectively takes a snapshot of the filesystem state prior to the upgrade; this is retained until the upgrade is finalized. This can be really helpful, as even an upgrade that goes badly wrong and corrupts data can potentially be fully rolled back.
Building a cluster on EMR

Elastic MapReduce is a flexible solution that, depending on requirements and workloads, can sit next to, or replace, a physical Hadoop cluster. As we've seen so far, EMR provides clusters preloaded and configured with Hive, Streaming, and Pig, as well as custom JAR clusters that allow the execution of MapReduce applications.
A second distinction to make is between transient and long-running lifecycles. A transient EMR cluster is generated on demand; data is loaded into S3 or HDFS, some processing workflow is executed, output results are stored, and the cluster is automatically shut down. A long-running cluster is kept alive once the workflow terminates, and the cluster remains available for new data to be copied over and new workflows to be executed. Long-running clusters are typically well suited to data warehousing, or to working with datasets large enough that repeatedly loading and processing the data on a transient instance would be inefficient.
In a must-read whitepaper for prospective users (found at https://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf), Amazon gives a heuristic to estimate which cluster type is a better fit, as follows:
If number of jobs per day * (time to set up cluster, including Amazon S3 data load time if using Amazon S3, + data processing time) < 24 hours, consider transient Amazon EMR clusters or physical instances.

Long-running instances are instantiated by passing the --alive argument to the elastic-mapreduce command, which enables the KeepAlive option and disables auto-termination.
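Amazon's heuristic is easy to mechanize as a planning aid. The sketch below encodes it directly; the job counts and durations are invented inputs for illustration, not figures from the whitepaper:

```python
def prefer_transient(jobs_per_day, setup_hours, processing_hours):
    """Amazon's rule of thumb: if the total daily cluster time (setup,
    including any S3 data load, plus processing) is under 24 hours,
    a transient cluster is likely the better fit."""
    return jobs_per_day * (setup_hours + processing_hours) < 24

# 4 jobs a day, ~15 minutes setup and 2 hours processing each (9 hours total):
print(prefer_transient(4, 0.25, 2.0))    # True - transient makes sense
# 10 jobs a day at ~3 hours each keeps the cluster busy past 24 hours:
print(prefer_transient(10, 0.5, 2.5))    # False - consider long-running
```

The crossover point is simply whether the workload would keep a cluster busy around the clock; past that, paying to keep the cluster alive costs no more than repeatedly provisioning it.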
Note that transient and long-running clusters share the same properties and limitations; in particular, data on HDFS is not persisted once the cluster is shut down.
Considerations about filesystems

In our examples so far, we assumed data to be available in S3. In this case, a bucket is mounted in EMR as an s3n filesystem, and it is used as an input source as well as a temporary filesystem to store intermediate data in computations. With S3, we introduce potential I/O overhead; operations such as reads and writes fire off GET and PUT HTTP requests.
Note

Note that EMR does not support S3 block storage. The s3 URI maps to s3n.
Another option would be to load data into the cluster's HDFS and run processing from there. In this case, we do have faster I/O and data locality, but we lose persistence: when the cluster is shut down, our data disappears. As a rule of thumb, if you are running a transient cluster, it makes sense to use S3 as a backend. In practice, one should monitor and take decisions based on the workflow characteristics. Iterative, multi-pass MapReduce jobs would greatly benefit from HDFS; one could argue that, for those types of workflows, an execution engine like Tez or Spark would be more appropriate.
![Page 415: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,](https://reader034.vdocuments.mx/reader034/viewer/2022042601/5f6d81b6e74f844b7d70c95b/html5/thumbnails/415.jpg)
Getting data into EMR

When copying data from HDFS to S3, it is recommended to use s3distcp (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html) instead of Apache distcp or Hadoop distcp. This approach is also suitable for transferring data within EMR and from S3 to HDFS. To move very large amounts of data from local disk into S3, Amazon recommends parallelizing the workload using Jets3t or GNU Parallel. In general, it's important to be aware that PUT requests to S3 are capped at 5 GB per file. To upload larger files, one needs to rely on Multipart Upload (https://aws.amazon.com/about-aws/whats-new/2010/11/10/Amazon-S3-Introducing-Multipart-Upload/), an API that allows splitting large files into smaller parts and reassembling them once uploaded. Files can also be copied with tools such as the AWS CLI or the popular S3CMD utility, but these do not have the parallelism advantages of s3distcp.
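The 5 GB single-PUT cap makes it easy to check ahead of time whether Multipart Upload is needed, and how many parts a chosen part size implies. A back-of-the-envelope sketch (the file and part sizes are invented examples):

```python
import math

SINGLE_PUT_CAP = 5 * 1024**3   # S3 caps a single PUT at 5 GB

def multipart_plan(file_size_bytes, part_size_bytes):
    """Return (needs_multipart, number_of_parts) for an S3 upload:
    multipart is required past the single-PUT cap, and the part count
    is simply the file size divided by the part size, rounded up."""
    needs_multipart = file_size_bytes > SINGLE_PUT_CAP
    parts = math.ceil(file_size_bytes / part_size_bytes)
    return needs_multipart, parts

# A hypothetical 12 GB file uploaded in 100 MB parts:
print(multipart_plan(12 * 1024**3, 100 * 1024**2))   # (True, 123)
```

Tools such as s3distcp and the AWS CLI handle this splitting for you; the arithmetic is only worth doing by hand when scripting uploads directly against the API.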
EC2 instances and tuning

The size of an EMR cluster depends on the dataset size, the number of files and blocks (which determines the number of splits), and the type of workload (try to avoid spilling to disk when a task runs out of memory). As a rule of thumb, a good size is one that maximizes parallelism. The number of mappers and reducers per instance, as well as the heap size per JVM daemon, is generally configured by EMR when the cluster is provisioned and tuned in the event of changes in the available resources.
Cluster tuning

In addition to the previous comments specific to a cluster run on EMR, there are some general points to keep in mind when running workloads on any type of cluster. These will, of course, be more explicit when running outside of EMR, as EMR often abstracts away some of the details.
JVM considerations

You should be running the 64-bit version of a JVM and using server mode. This can take longer to produce optimized code, but it also uses more aggressive strategies and will re-optimize code over time. This makes it a much better fit for long-running services, such as Hadoop processes.
Ensure that you allocate enough memory to the JVM to prevent overly frequent garbage collection (GC) pauses. The concurrent mark-and-sweep collector is currently the most tested and recommended for Hadoop. The Garbage First (G1) collector has become the GC option of choice in numerous other workloads since its introduction with JDK 7, so it's worth monitoring recommended best practice as it evolves. These options can be configured as custom Java arguments within each service's configuration section of Cloudera Manager.
The small files problem

Heap allocation to Java processes on worker nodes will be something you consider when thinking about service co-location. But there is a particular situation regarding the NameNode that you should be aware of: the small files problem.
Hadoop is optimized for very large files with large block sizes. But sometimes, particular workloads or data sources push many small files onto HDFS. This is most likely suboptimal, as it means each task processing a block at a time will read only a small amount of data before completing, causing inefficiency.
Having many small files also consumes more NameNode memory; it holds in memory the mapping from files to blocks and consequently holds metadata for each file and block. If the number of files, and hence blocks, increases quickly, then so will the NameNode memory usage. This is likely to hit only a subset of systems as, at the time of writing, 1 GB of memory can support 2 million files or blocks, but with a default heap size of 2 or 4 GB, this limit can easily be reached. If the NameNode needs to start running garbage collection very aggressively, or eventually runs out of memory, then your cluster will be very unhealthy. The short-term mitigation is to assign more heap to the JVM; the longer-term approach is to combine many small files into a smaller number of larger ones, ideally compressed with a splittable compression codec.
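The 1 GB per 2 million files-or-blocks figure quoted above can be turned into a rough capacity check. This is strictly a back-of-the-envelope sketch using that single ratio; real NameNode usage also varies with path lengths and replication, and the file counts below are invented:

```python
# Rough NameNode sizing from the ~2 million files-or-blocks per 1 GB
# of heap figure quoted in the text.
OBJECTS_PER_GB = 2000000

def namenode_heap_gb(num_files, num_blocks):
    """Estimated NameNode heap (in GB) needed for a given object count."""
    return (num_files + num_blocks) / OBJECTS_PER_GB

# 10 million small files, each small enough to occupy a single block:
print(namenode_heap_gb(10000000, 10000000))   # 10.0
```

Ten gigabytes of heap for ten million small files is well past a 2 or 4 GB default, which is exactly why consolidating small files matters more than it first appears.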
Map and reduce optimizations

Mappers and reducers both provide areas for optimizing performance; here are a few pointers to consider:
- The number of mappers depends on the number of splits. When files are smaller than the default block size or are compressed using a non-splittable format, the number of mappers will equal the number of files. Otherwise, the number of mappers is given by the total size of each file divided by the block size.
- Compress the mappers' output to reduce writes to disk and improve I/O. LZO is a good format for this task.
- Avoid spilling to disk: the mappers should have enough memory to retain as much data as possible.
- Number of reducers: it is recommended that you use fewer reducers than the total reducer capacity (this avoids execution waits).
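The first pointer above can be expressed as a quick estimate. A minimal sketch, assuming a 128 MB HDFS block size; the file sizes below are invented examples:

```python
import math

BLOCK_SIZE = 128 * 1024**2   # assume a 128 MB HDFS block size

def estimate_mappers(file_sizes, splittable=True):
    """Approximate mapper count: one per file for non-splittable inputs,
    otherwise one per block of each file (small files still get one)."""
    if not splittable:
        return len(file_sizes)
    return sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)

# One 1 GB splittable file -> 8 blocks -> 8 mappers:
print(estimate_mappers([1024**3]))   # 8
# The same 1 GB as 1000 tiny non-splittable (e.g. gzip) files -> 1000 mappers:
print(estimate_mappers([1024**2] * 1000, splittable=False))   # 1000
```

The second case illustrates both the small files problem and the splittability point at once: the same volume of data spawns over a hundred times as many tasks.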
Security

Once you built a cluster, the first thing you thought about was how to secure it, right? Don't worry; most people don't. But as Hadoop has moved on from running in-house analysis in the research department to directly driving critical systems, security is not something to ignore for too long.
Securing Hadoop is not something to be done on a whim or without significant testing. We cannot give detailed advice on this topic, and we cannot stress strongly enough the need to take it seriously and do it properly. It might consume time, it might cost money, but weigh this against the cost of having your cluster compromised.
Security is also a much bigger topic than just the Hadoop cluster. We'll explore some of the security features available in Hadoop, but you do need a coherent security strategy into which these discrete components fit.
Evolution of the Hadoop security model

In Hadoop 1, there was effectively no security protection, as the provided security model had obvious attack vectors. The Unix user ID with which you connected to the cluster was assumed to be valid, and you had all the privileges of that user. Plainly, this meant that anyone with administrative access on a host that could access the cluster could effectively impersonate any other user.
This led to the development of the so-called "head node" access model, whereby the Hadoop cluster was firewalled off from every host except one, the head node, and all access to the cluster was mediated through this centrally controlled node. This was an effective mitigation for the lack of a real security model and can still be useful even in situations where richer security schemes are utilized.
Beyond basic authorization

Core Hadoop has had additional security features added which address the previous concerns. In particular, they address the following:
- A cluster can require a user to authenticate via Kerberos and prove they are who they say they are.
- In secure mode, the cluster can also use Kerberos for all node-to-node communications, ensuring that all communicating nodes are authenticated and preventing malicious nodes from attempting to join the cluster.
- To ease management, users can be collected into groups against which data-access privileges can be defined. This is called Role-Based Access Control (RBAC) and is a prerequisite for a secure cluster with more than a handful of users. The user-group mappings can be retrieved from corporate systems, such as LDAP or Active Directory.
- HDFS can apply ACLs to replace the current Unix-inspired owner/group/world model.
These capabilities give Hadoop a significantly stronger security posture than in the past, but the community is moving fast, and additional dedicated Apache projects have emerged to address specific areas of security.
Apache Sentry (https://sentry.incubator.apache.org) is a system to provide much finer-grained authorization to Hadoop data and services. Other services build Sentry mappings, and this allows, for example, specific restrictions to be placed not only on particular HDFS directories, but also on entities such as Hive tables.
Whereas Sentry focuses on providing much richer tools for the internal, fine-grained aspects of Hadoop security, Apache Knox (http://knox.apache.org) provides a secure gateway to Hadoop that integrates with external identity management systems and provides access control mechanisms to allow or disallow access to specific Hadoop services and operations. It does this by presenting a REST-only interface to Hadoop and securing all calls to this API.
The future of Hadoop security

There are many other developments happening in the Hadoop world. Core Hadoop 2.5 added extended file attributes to HDFS, which can be used as the basis of additional access control mechanisms. Future versions will incorporate capabilities for better support of encryption for data in transit as well as at rest, and the Project Rhino initiative led by Intel (https://github.com/intel-hadoop/project-rhino/) is building out richer support for filesystem cryptographic modules, a secure filesystem, and, at some point, a fuller key-management infrastructure.
The Hadoop distribution vendors are moving fast to add these capabilities to their releases, so if you care about security (you do, don't you?), consult the documentation for the latest release of your distribution. New security features are being added even in point updates rather than being delayed until major upgrades.
Consequences of using a secured cluster

After teasing you with all the security goodness that is now available and still to come, it's only fair to give some words of warning. Security is often hard to do correctly, and a false sense of security from a buggy deployment is often worse than knowing you have no security at all.
However, even if you do it right, there are consequences to running a secure cluster. It certainly makes things harder for administrators, and often for users, so there is definitely an overhead. Specific Hadoop tools and services will also work differently depending on what security is employed on a cluster.
Oozie, which we discussed in Chapter 8, Data Lifecycle Management, uses its own delegation tokens behind the scenes. This allows the oozie user to submit jobs that are then executed on behalf of the originally submitting user. In a cluster using only the basic authorization mechanism, this is very easily configured, but using Oozie in a secure cluster will require additional logic to be added to the workflow definitions and the general Oozie configuration. This isn't a problem with Hadoop or Oozie; just as with the additional complexity resulting from the much better HA features of HDFS in Hadoop 2, better security mechanisms simply have costs and consequences that you need to take into consideration.
Monitoring

Earlier in this chapter, we discussed Cloudera Manager as a visual monitoring tool and hinted that it could also be programmatically integrated with other monitoring systems. But before plugging Hadoop into any monitoring framework, it's worth considering just what it means to operationally monitor a Hadoop cluster.
Hadoop – where failures don't matter

Traditional systems monitoring tends to be quite a binary tool; generally speaking, either something is working or it isn't. A host is alive or dead, and a web server is responding or it isn't. But in the Hadoop world, things are a little different; what matters is service availability, and a service can still be treated as live even if particular pieces of hardware or software have failed. No Hadoop cluster should be in trouble if a single worker node fails. As of Hadoop 2, even the failure of the server processes, such as the NameNode, shouldn't really be a concern if HA is configured. So, any monitoring of Hadoop needs to take into account the health of services and not that of specific host machines, which should be unimportant. Operations people on a 24/7 pager are not going to be happy getting paged at 3 a.m. to discover that one worker node in a cluster of 10,000 has failed. Indeed, once the scale of the cluster increases beyond a certain point, the failure of individual pieces of hardware becomes an almost commonplace occurrence.
Monitoring integration

You won't be building your own monitoring tools; instead, you will most likely want to integrate with existing tools and frameworks. For popular open source monitoring tools, such as Nagios and Zabbix, there are multiple sample templates to integrate Hadoop's service-wide and node-specific metrics.
This can give the sort of separation hinted at previously; the failure of the YARN ResourceManager would be a high-criticality event that should most likely cause alerts to be sent to operations staff, but a high load on specific hosts should only be captured and should not cause alerts to be fired. This provides the duality of firing alerts when bad things happen in addition to capturing the information needed to delve into system data over time for trend analysis.
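The routing policy can be made concrete with a toy sketch: service-level failures page the on-call staff, while host-level events are only recorded for later trend analysis. All event names and the class below are invented for illustration; a real deployment would encode this in the monitoring tool's configuration, not in application code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the alert-versus-capture duality described above: only
// service-wide problems page anyone; host-level events are merely recorded.
public class EventRouter {

    enum Scope { SERVICE, HOST }

    static String route(Scope scope, String event) {
        // Only a service-wide failure justifies waking someone at 3 a.m.
        return scope == Scope.SERVICE ? "PAGE: " + event : "RECORD: " + event;
    }

    public static void main(String[] args) {
        List<String> actions = new ArrayList<>();
        actions.add(route(Scope.SERVICE, "ResourceManager down"));
        actions.add(route(Scope.HOST, "High load on one worker"));
        actions.forEach(System.out::println);
    }
}
```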
Cloudera Manager provides a REST interface, which is another point of integration; tools such as Nagios can pull the Cloudera Manager-defined service-level metrics instead of having to define their own.
For heavier-weight enterprise monitoring infrastructure built on frameworks such as IBM Tivoli or HP OpenView, Cloudera Manager can also deliver events via SNMP traps that will be collected by these systems.
Application-level metrics

At times, you might also want your applications to gather metrics that can be centrally captured within the system. The mechanisms for this differ from one computational model to another, but the most well known are the application counters available within MapReduce.
When a MapReduce job completes, it outputs a number of counters, gathered by the system throughout the job execution, that deal with metrics such as the number of map tasks, bytes written, failed tasks, and so on. You can also write application-specific metrics that will be available alongside the system counters and that are automatically aggregated across the map/reduce execution. First, define a Java enum and name your desired metrics within it, as follows:
public enum AppMetrics {
    MAX_SEEN,
    MIN_SEEN,
    BAD_RECORDS
};
Then, within the map, reduce, setup, and cleanup methods of your Map or Reduce implementations, you can do something like the following to increment a counter by one:
context.getCounter(AppMetrics.BAD_RECORDS).increment(1);
Refer to the JavaDoc of the org.apache.hadoop.mapreduce.Counter interface for more details of this mechanism.
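The aggregation semantics are simple: each task attempt increments its own local counters, and the framework sums them into job-wide totals. The following stdlib-only sketch (no Hadoop dependency; the per-task increments are invented sample data) illustrates that summation, which Hadoop performs for you automatically:

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Stdlib-only illustration of how per-task counter increments are summed
// into job-level totals; this is the aggregation Hadoop applies itself.
public class CounterAggregation {

    enum AppMetrics { MAX_SEEN, MIN_SEEN, BAD_RECORDS }

    // Sum one task's local counters into the job-wide totals.
    static void merge(Map<AppMetrics, Long> totals, Map<AppMetrics, Long> task) {
        task.forEach((metric, value) -> totals.merge(metric, value, Long::sum));
    }

    static Map<AppMetrics, Long> aggregate(List<Map<AppMetrics, Long>> tasks) {
        Map<AppMetrics, Long> totals = new EnumMap<>(AppMetrics.class);
        tasks.forEach(task -> merge(totals, task));
        return totals;
    }

    public static void main(String[] args) {
        // Two map tasks, each having counted some bad records locally.
        Map<AppMetrics, Long> task1 = new EnumMap<>(AppMetrics.class);
        task1.put(AppMetrics.BAD_RECORDS, 3L);
        Map<AppMetrics, Long> task2 = new EnumMap<>(AppMetrics.class);
        task2.put(AppMetrics.BAD_RECORDS, 2L);
        System.out.println(aggregate(List.of(task1, task2))); // {BAD_RECORDS=5}
    }
}
```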
Troubleshooting

Monitoring and logging counters or additional information is all well and good, but it can be daunting to actually find the information you need when troubleshooting a problem with an application. In this section, we will look at how Hadoop stores logs and system information. We can distinguish three types of logs, as follows:

- YARN applications, including MapReduce jobs
- Daemon logs (NameNode and ResourceManager)
- Services that log non-distributed workloads, for example, HiveServer2 logging to /var/log
Alongside these log types, Hadoop exposes a number of metrics at the filesystem level (storage availability, replication factor, and number of blocks) and at the system level. As mentioned, both Apache Ambari and Cloudera Manager do a nice job as frontends that centralize access to debug information. However, under the hood, each service logs to either HDFS or the single-node filesystem. Furthermore, YARN, MapReduce, and HDFS expose their log files and metrics via web interfaces and programmatic APIs.
Logging levels

Hadoop logs messages via Log4j by default. Log4j is configured via log4j.properties in the classpath. This file defines both what is logged and with which layout:
log4j.rootLogger=${root.logger}
root.logger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
The default root logger is INFO,console, which logs all messages at level INFO and above to the console's stderr. Individual applications deployed on Hadoop can ship their own log4j.properties and set the level and other properties of their emitted logs as required.
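For example, an application could ship a log4j.properties that raises only its own classes to DEBUG while keeping the root logger at INFO. The package name below is, of course, hypothetical:

```properties
# Root logger stays at INFO to avoid flooding the logs
log4j.rootLogger=INFO,console
# Hypothetical application package; only its messages are logged at DEBUG
log4j.logger.com.example.myjob=DEBUG
```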
Hadoop daemons have a web page to get and set the log level for any Log4j property. This interface is exposed by the /logLevel endpoint in each service web UI. To enable debug logging for the ResourceManager class, we visit http://resourcemanagerhost:8088/logLevel, as shown in the following screenshot:
Getting and setting the log level on ResourceManager
Alternatively, the hadoop daemonlog <host:port> command interfaces with the service's /logLevel endpoint. We can inspect the level associated with mapreduce.map.log.level for the ResourceManager using the -getlevel <property> parameter, as follows:
$ hadoop daemonlog -getlevel localhost.localdomain:8088 mapreduce.map.log.level
Connecting to http://localhost.localdomain:8088/logLevel?log=mapreduce.map.log.level
Submitted Log Name: mapreduce.map.log.level
Log Class: org.apache.commons.logging.impl.Log4JLogger
Effective level: INFO
The effective level can be modified using the -setlevel <property> <level> option:
$ hadoop daemonlog -setlevel localhost.localdomain:8088 mapreduce.map.log.level DEBUG
Connecting to http://localhost.localdomain:8088/logLevel?log=mapreduce.map.log.level&level=DEBUG
Submitted Log Name: mapreduce.map.log.level
Log Class: org.apache.commons.logging.impl.Log4JLogger
Submitted Level: DEBUG
Setting Level to DEBUG ...
Effective level: DEBUG
Note that this setting will affect all logs produced by the ResourceManager class. This includes system-generated entries as well as those generated by applications running on YARN.
Access to log files

Log file locations and naming conventions are likely to differ between distributions. Apache Ambari and Cloudera Manager centralize access to log files, both for services and individual applications. On Cloudera's QuickStart VM, an overview of the currently running processes, with links to their log files and the stderr and stdout channels, can be found at http://localhost.localdomain:7180/cmf/hardware/hosts/1/processes, as shown in the following screenshot:
Access to log resources in Cloudera Manager
Ambari provides a similar overview via the Services dashboard, found at http://127.0.0.1:8080/#/main/services on the HDP Sandbox, as shown in the following screenshot:
Access to log resources on Apache Ambari
Non-distributed logs are usually found under /var/log/<service> on each cluster node. The locations of YARN container and MRv2 logs also depend on the distribution. On CDH5, these resources are available in HDFS under /tmp/logs/<user>.
The standard way to access distributed logs is either via command-line tools or through the services' web UIs.
For instance, the command is as follows:
$ yarn application -list -appStates ALL
The preceding command will list all running and retired YARN applications. The URL in the Tracking-URL column points to a web interface that exposes the task log, as follows:
14/08/03 14:44:38 INFO client.RMProxy: Connecting to ResourceManager at localhost.localdomain/127.0.0.1:8032
Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]): 4
Application-Id                  Application-Name         Application-Type  User      Queue          State     Final-State  Progress  Tracking-URL
application_1405630696162_0002  PigLatin:DefaultJobName  MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0002
application_1405630696162_0004  PigLatin:DefaultJobName  MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0004
application_1405630696162_0003  PigLatin:DefaultJobName  MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0003
application_1405630696162_0005  PigLatin:DefaultJobName  MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0005
For instance, http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0002, a link to a task belonging to user cloudera, is a frontend to the content stored under hdfs:///tmp/logs/cloudera/logs/application_1405630696162_0002/.
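Based on that layout (aggregated logs under /tmp/logs/<user>/logs/<application-id>, the CDH5 default; other distributions may use a different prefix), the mapping from a MapReduce job ID to its aggregated-log directory can be sketched in a few lines:

```java
// Sketch of the job-ID to aggregated-log-path mapping described above.
// The /tmp/logs prefix is the CDH5 default and varies between distributions.
public class LogPaths {

    // A MapReduce job_<cluster-ts>_<seq> ID maps to application_<cluster-ts>_<seq>.
    static String applicationId(String jobId) {
        return jobId.replaceFirst("^job_", "application_");
    }

    static String aggregatedLogDir(String user, String jobId) {
        return "hdfs:///tmp/logs/" + user + "/logs/" + applicationId(jobId) + "/";
    }

    public static void main(String[] args) {
        System.out.println(aggregatedLogDir("cloudera", "job_1405630696162_0002"));
        // hdfs:///tmp/logs/cloudera/logs/application_1405630696162_0002/
    }
}
```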
In the following sections, we will give an overview of the available UIs for different services.
Note: Provisioning an EMR cluster with the --log-uri s3://<bucket> option will ensure that Hadoop logs are copied into the s3://<bucket> location.
ResourceManager, NodeManager, and ApplicationManager

On YARN, the ResourceManager web UI provides information and general job statistics of the Hadoop cluster, running/completed/failed jobs, and a job history log file. By default, the UI is exposed at http://<resourcemanagerhost>:8088/ and can be seen in the following screenshot:
ResourceManager
Applications

On the left-hand sidebar, it is possible to review applications by the status of interest: NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHING, FINISHED, FAILED, or KILLED. Depending on the application status, the following information is available:
- The application ID
- The submitting user
- The application name
- The scheduler queue in which the application is placed
- Start/finish times and state
- A link to the Tracking UI for application history
In addition, the Cluster Metrics view gives you information on the following:
- Overall application status
- Number of running containers
- Memory usage
- Node status
Nodes

The Nodes view is a frontend to the NodeManager service menu, which shows health and location information on each node's running applications, as follows:
Nodes status
Each individual node of the cluster exposes further information and statistics at the host level via its own UI. These include which version of Hadoop is running on the node, how much memory is available on the node, the node status, and a list of running applications and containers, as shown in the following screenshot:
Single node info
Scheduler

The following screenshot shows the Scheduler window:
Scheduler
MapReduce

Though the same information and logging details are available in MapReduce v1 and MapReduce v2, the access modality is slightly different.
MapReduce v1

The following screenshot shows the MapReduce JobTracker UI:
The JobTracker UI
The JobTracker UI, available by default at http://<jobtracker>:50030, exposes information on all currently running as well as retired MapReduce jobs, a summary of the cluster resources and health, as well as scheduling information and completion percentage, as shown in the following screenshot:
Job details
For each running and retired job, details are available, including its ID, owner, priority, task assignment, and task launch for the mapper. Clicking on a job ID link will lead to a job details page, the same URL exposed by the mapred job -list command. This resource gives details about both the map and reduce tasks as well as general counter statistics at the job, filesystem, and MapReduce levels; these include the memory used, the number of read/write operations, and the number of bytes read and written.
For each map and reduce operation, the JobTracker exposes the total, pending, running, completed, and failed tasks, as shown in the following screenshot:
Job tasks overview
Clicking on the links in the Job table will lead to a further overview at the task and task-attempt levels, as shown in the following screenshot:
Task attempts
From this last page, we can access the logs of each task attempt, both for successful and failed/killed tasks, on each individual TaskTracker host. This log contains the most granular information about the status of the MapReduce job, including the output of Log4j appenders as well as output piped to the stdout and stderr channels and syslog, as shown in the following screenshot:
TaskTracker logs
MapReduce v2 (YARN)

As we have seen in Chapter 3, Processing – MapReduce and Beyond, with YARN, MapReduce is only one of many processing frameworks that can be deployed. Recall from previous chapters that the JobTracker and TaskTracker services have been replaced by the ResourceManager and NodeManager, respectively. As such, both the service UIs and the log files from YARN are more generic than those of MapReduce v1.
The application_1405630696162_0002 name shown in the ResourceManager corresponds to a MapReduce job with the job_1405630696162_0002 ID. That application ID belongs to the task running inside the container, and clicking on it will reveal an overview of the MapReduce job and allow a drill-down to the individual tasks of either phase, until the single-task log is reached, as shown in the following screenshot:
A YARN application containing a MapReduce job
JobHistory Server

YARN ships with a JobHistory REST service that exposes details on finished applications. Currently, it only supports MapReduce and provides information on finished jobs. This includes the job's final status (SUCCEEDED or FAILED), who submitted the job, the total number of map and reduce tasks, and timing information.
A UI is available at http://<jobhistoryhost>:19888/jobhistory, as shown in the following screenshot:
JobHistory UI
Clicking on each job ID will lead to the MapReduce job UI shown in the YARN application screenshot.
NameNode and DataNode

The web interface for the Hadoop Distributed File System (HDFS) shows information about the NameNode itself as well as the filesystem in general.
By default, it is located at http://<namenodehost>:50070/, as shown in the following screenshot:
NameNode UI
The Overview menu exposes NameNode information about DFS capacity and usage and the block pool status, and it gives a summary of the status of DataNode health and availability. The information contained in this page is for the most part equivalent to what is shown at the command-line prompt:
$ hdfs dfsadmin -report
The DataNodes menu gives more detailed information about the status of each node and offers a drill-down at the single-host level, both for available and decommissioned nodes, as shown in the following screenshot:
DataNode UI
Summary

This has been quite a whistle-stop tour around the considerations of running an operational Hadoop cluster. We didn't try to turn developers into administrators, but hopefully, the broader perspective will help you to help your operations staff. In particular, we covered the following topics:
- How Hadoop is a natural fit for DevOps approaches, as its multilayered complexity means it's neither possible nor desirable to have substantial knowledge gaps between development and operations staff
- Cloudera Manager, and how it can be a great management and monitoring tool; it might cause integration problems, though, if you have other enterprise tools, and it comes with a vendor lock-in risk
- Ambari, the Apache open source alternative to Cloudera Manager, and how it is used in the Hortonworks distribution
- How to think about selecting hardware for a physical Hadoop cluster, and how this naturally fits into the considerations of how the multiple workloads possible in the world of Hadoop 2 can peacefully coexist on shared resources
- The different considerations for firing up and using EMR clusters, and how this can be both an adjunct to, as well as an alternative to, a physical cluster
- The Hadoop security ecosystem, how it is a very fast-moving area, and how the features available today are vastly better than some years ago, with still more around the corner
- Monitoring of a Hadoop cluster, considering what events are important in the Hadoop model of embracing failure, and how these alerts and metrics can be integrated into other enterprise-monitoring frameworks
- How to troubleshoot issues with a Hadoop cluster, both in terms of what might have happened and how to find the information to inform your analysis
- A quick tour of the various web UIs provided by Hadoop, which can give very good overviews of happenings within various components in the system
This concludes our treatment of Hadoop in depth. In the final chapter, we will express some thoughts on the broader Hadoop ecosystem, give some pointers to useful and interesting tools and products that we didn't have a chance to cover in the book, and suggest how to get involved with the community.
Chapter 11. Where to Go Next

In the previous chapters, we have examined many parts of Hadoop 2 and the ecosystem around it. However, we have necessarily been limited by page count; some areas we didn't cover in as much depth as we could have, and other areas we referred to only in passing or did not mention at all.
The Hadoop ecosystem, with its distributions and Apache and non-Apache projects, is an incredibly vibrant and healthy place to be right now. In this chapter, we hope to complement the more detailed material discussed previously with a travel guide, if you will, to other interesting destinations. We will discuss the following topics:
- Hadoop distributions
- Other significant Apache and non-Apache projects
- Sources of information and help
Of course, note that any overview of the ecosystem is both skewed by our interests and preferences, and outdated the moment it is written. In other words, don't for a moment think this is all that's available; consider it instead a whetting of the appetite.
Alternative distributions

We've generally used the Cloudera distribution of Hadoop in this book, but have attempted to keep the coverage as distribution-independent as possible. We've also mentioned the Hortonworks Data Platform (HDP) throughout this book, but these are certainly not the only distribution choices available to you.
Before taking a look around, let's consider whether you need a distribution at all. It is entirely possible to go to the Apache website, download the source tarballs of the projects in which you are interested, and then work to build them all together. However, given version dependencies, this is likely to consume more time than you would expect; potentially, vastly more so. In addition, the end product will likely lack some polish in terms of tools or scripts for operational deployment and management. For most users, these areas are why employing an existing Hadoop distribution is the natural choice.
A note on free and commercial extensions: since Hadoop is an open source project with quite a liberal license, distribution creators are also free to enhance it with proprietary extensions that are made available either as free open source or as commercial products.
This can be a controversial issue, as some open source advocates dislike any commercialization of successful open source projects; to them, it appears that the commercial entity is freeloading by taking the fruits of the open source community's work without having to build it for themselves. Others see this as a healthy aspect of the flexible Apache license; the base product will always be free, and individuals and companies can choose whether or not to go with commercial extensions. We don't pass judgment either way, but be aware that this is another of the controversies you will almost certainly encounter.
So you need to decide whether you need a distribution and, if so, for what reasons: which specific aspects will benefit you most over rolling your own? Do you want a fully open source product, or are you willing to pay for commercial extensions? With these questions in mind, let's look at a few of the main distributions.
Cloudera Distribution for Hadoop

You will be familiar with the Cloudera distribution (http://www.cloudera.com), as it has been used throughout this book. CDH was the first widely available alternative distribution, and its breadth of available software, proven level of quality, and free cost have made it a very popular choice.
Recently, Cloudera has been actively extending the products it adds to its distribution beyond the core Hadoop projects. In addition to Cloudera Manager and Impala (both Cloudera-developed products), it has also added other tools, such as Cloudera Search (based on Apache Solr) and Cloudera Navigator (a data governance solution). While CDH versions prior to 5 were focused more on the integration benefits of a distribution, version 5 (and presumably beyond) is adding more and more capability atop the base Apache Hadoop projects.
Cloudera also offers commercial support for its products, in addition to training and consultancy services. Details can be found on the company web page.
Hortonworks Data Platform

In 2011, the Yahoo! division responsible for so much of the development of Hadoop was spun off into a new company called Hortonworks. They have also produced their own pre-integrated Hadoop distribution, called the Hortonworks Data Platform (HDP), available at http://hortonworks.com/products/hortonworksdataplatform/.
HDP is conceptually similar to CDH, but the two products differ in their focus. Hortonworks makes much of the fact that HDP is fully open source, including the management tool Ambari, which we discussed briefly in Chapter 10, Running a Hadoop Cluster. They have also positioned HDP as a key integration platform through its support for tools such as Talend Open Studio. Hortonworks does not offer proprietary software; its business model focuses instead on offering professional services and support for the platform.
Both Cloudera and Hortonworks are venture-backed companies with significant engineering expertise; both employ many of the most prolific contributors to Hadoop. The underlying technology is, however, built from the same Apache projects; the distinguishing factors are how they are packaged, the versions employed, and the additional value-added offerings provided by the companies.
MapR

A different type of distribution is offered by MapR Technologies, although the company and distribution are usually referred to simply as MapR. The distribution, available from http://www.mapr.com, is based on Hadoop but has a number of changes and enhancements.
The focus of the MapR distribution is on performance and availability. For example, it was the first distribution to offer a high-availability solution for the Hadoop NameNode and JobTracker, which, as you will remember from Chapter 2, Storage, was a significant weakness in core Hadoop 1. It also offered native integration with NFS filesystems long before Hadoop 2, which makes processing of existing data much easier. To achieve these features, MapR replaced HDFS with a fully POSIX-compliant filesystem that has no NameNode, resulting in a truly distributed system with no master, and a claim of much better hardware utilization than Apache HDFS.
MapR provides both a community and an enterprise edition of its distribution; not all the extensions are available in the free product. The company also offers support services as part of the enterprise product subscription, in addition to training and consultancy.
And the rest…

Hadoop distributions are not just the territory of young start-ups, nor are they a static marketplace. Intel had its own distribution until early 2014, when it decided to fold its changes into CDH instead. IBM has its own distribution, called IBM InfoSphere BigInsights, available in both free and commercial editions. There are also various stories of numerous large enterprises rolling their own distributions, some of which are made openly available while others are not. You will have no shortage of options, with so many high-quality distributions available.
Choosing a distribution

This raises the question: how do you choose a distribution? As can be seen, the available distributions (and we didn't cover them all) range from convenient packaging and integration of fully open source products through to entirely bespoke integration and analysis layers atop them. There is no overall best distribution; think carefully about your requirements and consider the alternatives. Since all of these offer a free download of at least a basic version, it's good to simply play with and experience the options for yourself.
Other computational frameworks

We've frequently discussed the myriad possibilities brought to the Hadoop platform by YARN. We went into the details of two new models, Samza and Spark. Additionally, other, more established frameworks, such as Pig, are also being ported to YARN.
Togiveaviewofthemuchbiggerpictureinthissection,wewillillustratethebreadthofprocessingpossibleusingYARNbypresentingasetofcomputationalmodelsthatarecurrentlybeingportedtoHadoopontopofYARN.
Apache Storm
Storm (http://storm.apache.org) is a distributed computation framework written (mainly) in the Clojure programming language. It uses custom-created spouts and bolts to define information sources and manipulations, allowing distributed processing of streaming data. A Storm application is designed as a topology of interfaces that creates a stream of transformations. It provides functionality similar to a MapReduce job, with the exception that the topology will, in theory, run indefinitely until it is manually terminated.

Though initially built distinct from Hadoop, a YARN port is being developed by Yahoo! and can be found at https://github.com/yahoo/storm-yarn.
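The spout/bolt idea can be illustrated with a small, self-contained sketch. This is plain Python, not the actual Storm API; the `SentenceSpout`, `SplitBolt`, and `CountBolt` names are hypothetical, and a real topology would run continuously over an unbounded stream rather than a fixed list:

```python
# Minimal illustration of Storm's spout/bolt data flow in plain Python.
# This is NOT the Storm API; it only mirrors the shape of a topology.

class SentenceSpout:
    """A spout is a source of tuples; here, a fixed list of sentences."""
    def __init__(self, sentences):
        self.sentences = sentences

    def emit(self):
        for sentence in self.sentences:
            yield sentence

class SplitBolt:
    """A bolt transforms the incoming stream; here, splitting into words."""
    def process(self, stream):
        for sentence in stream:
            for word in sentence.split():
                yield word

class CountBolt:
    """A terminal bolt that accumulates word counts."""
    def __init__(self):
        self.counts = {}

    def process(self, stream):
        for word in stream:
            self.counts[word] = self.counts.get(word, 0) + 1
        return self.counts

# Wire the "topology": spout -> split bolt -> count bolt.
spout = SentenceSpout(["hello hadoop", "hello storm"])
counts = CountBolt().process(SplitBolt().process(spout.emit()))
print(counts)  # {'hello': 2, 'hadoop': 1, 'storm': 1}
```

In real Storm the framework distributes each spout and bolt across worker processes and handles the tuple routing between them; the chaining of generators above stands in for that plumbing.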
Apache Giraph
Giraph originated as the open source implementation of Google's Pregel paper (which can be found at http://kowshik.github.io/JPregel/pregel_paper.pdf). Both Giraph and Pregel are inspired by the Bulk Synchronous Parallel (BSP) model of distributed computation introduced by Valiant in 1990. Giraph adds several features, including master computation, sharded aggregators, edge-oriented input, and out-of-core computation. The YARN port can be found at https://issues.apache.org/jira/browse/GIRAPH-13.
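A rough sketch of the BSP model underlying Pregel and Giraph (plain Python, not the Giraph API): computation proceeds in supersteps separated by a barrier, messages sent in one superstep are only delivered in the next, and a vertex that has nothing new to say votes to halt. The example below is the classic maximum-value propagation from the Pregel paper:

```python
# Sketch of a Bulk Synchronous Parallel (BSP) computation, Pregel-style.
# Messages sent in superstep S are delivered at the barrier before S+1.

def max_value_bsp(values, edges):
    """values: {vertex: number}; edges: {vertex: [neighbour, ...]}.
    Converges with every vertex holding the maximum value reachable
    from it in the (undirected) graph."""
    values = dict(values)
    # Superstep 0: every vertex broadcasts its own value.
    messages = {}
    for v in values:
        for n in edges.get(v, []):
            messages.setdefault(n, []).append(values[v])
    # Subsequent supersteps: update from the inbox; re-broadcast only
    # on change, otherwise the vertex votes to halt (sends nothing).
    while messages:                      # barrier between supersteps
        next_messages = {}
        for v, inbox in messages.items():
            best = max(inbox)
            if best > values[v]:         # value improved: stay active
                values[v] = best
                for n in edges.get(v, []):
                    next_messages.setdefault(n, []).append(best)
        messages = next_messages         # computation ends when no messages
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
vals = max_value_bsp({"a": 3, "b": 6, "c": 2}, graph)
print(vals)  # {'a': 6, 'b': 6, 'c': 6}
```

Giraph distributes the vertices across workers and synchronizes the barrier across the cluster; the single-process loop above only illustrates the superstep/message/halt mechanics.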
Apache HAMA
Hama is a top-level Apache project that aims, like other approaches we've encountered so far, to address the weakness of MapReduce with regard to iterative programming. Similar to the aforementioned Giraph, Hama implements BSP techniques and has been heavily inspired by the Pregel paper. The YARN port can be found at https://issues.apache.org/jira/browse/HAMA-431.
Other interesting projects
Whether you use a bundled distribution or stick with the base Apache Hadoop download, you will encounter many references to other related projects. We've covered several of these, such as Hive, Samza, and Crunch, in this book; we'll now highlight some of the others.

Note that this coverage seeks to point out the highlights (from the authors' perspective) as well as give a taste of the breadth of the types of projects available. As mentioned earlier, keep looking out, as new ones are launching all the time.
HBase
Perhaps the most popular Apache Hadoop-related project that we didn't cover in this book is HBase (http://hbase.apache.org). Based on the BigTable model of data storage publicized by Google in an academic paper (sound familiar?), HBase is a non-relational data store sitting atop HDFS.

While both MapReduce and Hive focus on batch-like data access patterns, HBase instead seeks to provide very low-latency access to data. Consequently, HBase can, unlike the aforementioned technologies, directly support user-facing services.

The HBase data model is not the relational approach used in Hive and other RDBMSs, nor does it offer the full ACID guarantees that are taken for granted with relational stores. Instead, it is a schema-less key-value solution that takes a column-oriented view of data; columns can be added at runtime and depend on the values inserted into HBase. Each lookup operation is then very fast, as it is effectively a key-value mapping from the row key to the desired column. HBase also treats timestamps as another dimension on the data, so one can directly retrieve data from a point in time.

The data model is very powerful but does not suit all use cases, just as the relational model isn't universally applicable. But if you have a requirement for structured low-latency views on large-scale data stored in Hadoop, then HBase is absolutely something you should look at.
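The data model described above can be pictured as a nested map. The toy class below (plain Python, not the HBase API; the table and column names are made up) shows the essential shape: row key, then column, then timestamped versions of a value, with reads either taking the latest version or a version as of a point in time:

```python
# Toy model of HBase's logical storage view (not the HBase API):
# a schema-less map of row key -> column -> {timestamp: value} versions.

class ToyTable:
    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, ts):
        # Columns need no schema; they spring into existence on write.
        cell = self.rows.setdefault(row, {}).setdefault(column, {})
        cell[ts] = value

    def get(self, row, column, ts=None):
        """Latest version at or before ts (latest overall if ts is None)."""
        cell = self.rows.get(row, {}).get(column, {})
        stamps = sorted(t for t in cell if ts is None or t <= ts)
        return cell[stamps[-1]] if stamps else None

t = ToyTable()
t.put("user1", "info:email", "old@example.com", ts=100)
t.put("user1", "info:email", "new@example.com", ts=200)
print(t.get("user1", "info:email"))          # new@example.com
print(t.get("user1", "info:email", ts=150))  # old@example.com
```

A real HBase region server adds sorted on-disk storage, block caching, and distribution across the cluster, but the lookup is conceptually exactly this map traversal, which is why single-row reads are so fast.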
Sqoop
In Chapter 7, Hadoop and SQL, we looked at tools for presenting a relational-like interface to data stored on HDFS. Often, such data either needs to be retrieved from an existing relational database, or the output of its processing needs to be stored back into one.

Apache Sqoop (http://sqoop.apache.org) provides a mechanism for declaratively specifying data movement between relational databases and Hadoop. It takes a task definition and from this generates MapReduce jobs to execute the required data retrieval or storage. It will also generate code to help manipulate relational records with custom Java classes. In addition, it can integrate with HBase and HCatalog/Hive, providing a very rich set of integration possibilities.

At the time of writing, Sqoop is slightly in flux. Its original version, Sqoop 1, was a pure client-side application. Much like the original Hive command-line tool, Sqoop 1 has no server and generates all code on the client. This unfortunately means that each client needs to know a lot of details about the physical data sources, including exact hostnames as well as authentication credentials.

Sqoop 2 provides a centralized Sqoop server that encapsulates all these details and offers the various configured data sources to connecting clients. It is a superior model, but at the time of writing the general community recommendation is to stick with Sqoop 1 until the new version evolves further. Check the current status if you are interested in this type of tool.
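To make the "declarative task definition" idea concrete, here is a small helper that assembles a typical Sqoop 1 import invocation from such a definition. The JDBC URL, table, username, and target directory below are invented for illustration; the flags themselves (`--connect`, `--table`, `--username`, `--target-dir`, `--num-mappers`) are standard Sqoop 1 import options:

```python
# Assemble a Sqoop 1 "import" command line from a declarative task
# definition. All connection details below are illustrative only.

def sqoop_import_command(task):
    args = ["sqoop", "import",
            "--connect", task["jdbc_url"],
            "--table", task["table"],
            "--username", task["username"],
            "--target-dir", task["target_dir"]]
    if "num_mappers" in task:
        # Controls the parallelism of the generated MapReduce job.
        args += ["--num-mappers", str(task["num_mappers"])]
    return " ".join(args)

cmd = sqoop_import_command({
    "jdbc_url": "jdbc:mysql://dbhost/sales",
    "table": "orders",
    "username": "etl",
    "target_dir": "/data/sales/orders",
    "num_mappers": 4,
})
print(cmd)
```

Note how the task definition names *what* to move (table, destination) rather than *how*; Sqoop turns this into a MapReduce job whose mappers each pull a slice of the table over JDBC. The need for the client to hold the hostname and credentials here is exactly the Sqoop 1 limitation the text describes.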
Whirr
When looking to use cloud services such as Amazon AWS for Hadoop deployments, it is usually a lot easier to use a higher-level service such as Elastic MapReduce than to set up your own cluster on EC2. Though there are scripts to help, the fact is that the overhead of Hadoop-based deployments on cloud infrastructures can be involved. That's where Apache Whirr (https://whirr.apache.org/) comes in.

Whirr isn't focused on Hadoop; it's about supplier-independent instantiation of cloud services, of which Hadoop is a single example. Whirr aims to provide a programmatic way of specifying and creating Hadoop-based deployments on cloud infrastructures in a way that handles all the underlying service aspects for you. It does this in a provider-independent fashion, so that once you've launched on, say, EC2, you can use the same code to create an identical setup on another provider such as Rightscale or Eucalyptus. This makes vendor lock-in, often a concern with cloud deployments, less of an issue.

Whirr isn't quite there yet. Today, it is limited in the services it can create and the providers it supports; however, if you are interested in cloud deployment with less pain, then it's worth watching its progress.

Note
If you are building out your full infrastructure on Amazon Web Services, then you might find CloudFormation gives much of the same ability to define application requirements, though obviously in an AWS-specific fashion.
Mahout
Apache Mahout (http://mahout.apache.org/) is a collection of distributed algorithms, Java classes, and tools for performing advanced analytics on top of Hadoop. Similar to Spark's MLlib, briefly mentioned in Chapter 5, Iterative Computation with Spark, Mahout ships with a number of algorithms for common use cases: recommendation, clustering, regression, and feature engineering. Although the system is focused on natural language processing and text-mining tasks, its building blocks (linear algebra operations) are applicable to a number of domains. As of version 0.9, the project is being decoupled from the MapReduce framework in favor of richer programming models such as Spark. The community's end goal is a platform-independent library based on a Scala DSL.
Hue
Initially developed by Cloudera and marketed as the "User Interface for Hadoop", Hue (http://gethue.com/) is a collection of applications, bundled together under a common web interface, that act as clients for core services and a number of components of the Hadoop ecosystem:

(Figure: The Hue Query Editor for Hive)

Hue leverages many of the tools we discussed in previous chapters and provides an integrated interface for analyzing and visualizing data. Two components are particularly interesting. On one hand, there is a query editor that allows the user to create and save Hive (or Impala) queries, export the result set in CSV or Microsoft Excel format, and plot it in the browser. The editor supports sharing both HiveQL and result sets, thus facilitating collaboration within an organization. On the other hand, there is an Oozie workflow and coordinator editor that allows a user to create and deploy Oozie jobs manually, automating the generation of XML configuration and boilerplate.

Both the Cloudera and Hortonworks distributions ship with Hue, which typically includes the following:

A file manager for HDFS
A Job Browser for YARN (MapReduce)
An Apache HBase browser
A Hive metastore explorer
Query editors for Hive and Impala
A script editor for Pig
A job editor for MapReduce and Spark
An editor for Sqoop 2 jobs
An Oozie workflow editor and dashboard
An Apache ZooKeeper browser

On top of this, Hue is a framework with an SDK that contains a number of web assets, APIs, and patterns for developing third-party applications that interact with Hadoop.
Other programming abstractions
Hadoop isn't just extended by additional functionality; there are also tools that provide entirely different paradigms for writing the code used to process your data within Hadoop.
Cascading
Developed by Concurrent, and open sourced under an Apache license, Cascading (http://www.cascading.org/) is a popular framework that abstracts away the complexity of MapReduce and allows the creation of complex workflows on top of Hadoop. Cascading jobs can compile to, and be executed on, MapReduce, Tez, and Spark. Conceptually, the framework is similar to Apache Crunch, covered in Chapter 9, Making Development Easier, though in practice there are differences in terms of data abstractions and end goals. Cascading adopts a tuple data model (similar to Pig) rather than arbitrary objects, and encourages the user to rely on a higher-level DSL, powerful built-in types, and tools to manipulate data.

Put in simple terms, Cascading is to Pig Latin and HiveQL what Crunch is to a user-defined function.

Like Morphlines, which we also saw in Chapter 9, Making Development Easier, the Cascading data model follows a source-pipe-sink approach, where data is captured from a source, piped through a number of processing steps, and its output is then delivered into a sink, ready to be picked up by another application.
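The source-pipe-sink model with a tuple stream can be sketched in a few lines. This is plain Python, not the Cascading API; each "pipe" below is just a generator stage, and the field names are invented:

```python
# Sketch of the source-pipe-sink model over a tuple stream (plain
# Python, not the Cascading API). Each pipe is a lazy generator stage.

def source(records):
    """A source captures data; here, from an in-memory list of tuples."""
    for r in records:
        yield r

def filter_pipe(stream, predicate):
    """A pipe that keeps only tuples matching the predicate."""
    return (t for t in stream if predicate(t))

def map_pipe(stream, fn):
    """A pipe that transforms each tuple."""
    return (fn(t) for t in stream)

def sink(stream):
    """A sink materializes the output; a real one would write to HDFS."""
    return list(stream)

# Tuples of (user, score); keep scores over 50 and project the user.
raw = [("alice", 72), ("bob", 31), ("carol", 95)]
out = sink(map_pipe(filter_pipe(source(raw), lambda t: t[1] > 50),
                    lambda t: t[0]))
print(out)  # ['alice', 'carol']
```

The point of the model is that the pipe assembly is declared independently of where the source and sink actually live; Cascading then plans the assembly into MapReduce (or Tez, or Spark) jobs.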
Cascading encourages developers to write code in a number of JVM languages. Ports of the framework exist for Python (PyCascading), JRuby (Cascading.jruby), Clojure (Cascalog), and Scala (Scalding). Cascalog and Scalding in particular have gained a lot of traction and spawned their very own ecosystems.

An area where Cascading excels is documentation. The project provides comprehensive javadocs of the API, extensive tutorials (http://www.cascading.org/documentation/tutorials/), and an interactive, exercise-based learning environment (https://github.com/Cascading/Impatient).

Another strong selling point of Cascading is its integration with third-party environments. Amazon EMR supports Cascading as a first-class processing framework and allows us to launch Cascading clusters with both the command-line and web interfaces (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/CreateCascading.html). Plugins for the SDK exist for both the IntelliJ IDEA and Eclipse integrated development environments. One of the framework's top projects, Cascading Patterns, a collection of machine-learning algorithms, features a utility for translating Predictive Model Markup Language (PMML) documents into applications on Apache Hadoop, thus facilitating interoperability with popular statistical environments and scientific tools such as R (http://cran.r-project.org/web/packages/pmml/index.html).
AWS resources
Many Hadoop technologies can be deployed on AWS as part of a self-managed cluster. However, just as Amazon offers Elastic MapReduce, which provides Hadoop as a managed service, there are a few other services that are worth mentioning.
SimpleDB and DynamoDB
For some time, AWS has offered SimpleDB as a hosted service providing an HBase-like data model.

It has, however, largely been superseded by a more recent service from AWS, DynamoDB, located at http://aws.amazon.com/dynamodb. Though its data model is very similar to that of SimpleDB and HBase, it is aimed at a very different type of application. Where SimpleDB has quite a rich search API but is very limited in terms of size, DynamoDB provides a more constrained, though constantly evolving, API, but with a service guarantee of near-unlimited scalability.

The DynamoDB pricing model is particularly interesting; instead of paying for a certain number of servers hosting the service, you allocate a certain capacity for read and write operations, and DynamoDB manages the resources required to meet this provisioned capacity. This is an interesting development, as it is a purer service model, where the mechanism of delivering the desired performance is kept completely opaque to the service user. Have a look at DynamoDB if you need a much larger scale of data store than SimpleDB can offer; however, do consider the pricing model carefully, as provisioning too much capacity can become very expensive very quickly. Amazon provides some good best practices for DynamoDB at the following URL, which illustrate that minimizing the service costs can result in additional application-layer complexity: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BestPractices.html.
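The provisioned-capacity billing described above is simple arithmetic, which is also why over-provisioning gets expensive so quickly. The sketch below uses placeholder per-unit hourly rates (they are assumptions for illustration, not real AWS prices, which vary by region and change over time):

```python
# Back-of-envelope cost of provisioned DynamoDB-style throughput.
# The hourly rates below are PLACEHOLDERS, not actual AWS pricing.

READ_RATE_PER_UNIT_HOUR = 0.00013   # assumed $ per read-capacity-unit/hour
WRITE_RATE_PER_UNIT_HOUR = 0.00065  # assumed $ per write-capacity-unit/hour

def monthly_cost(read_units, write_units, hours=730):
    """Cost of keeping the given capacity provisioned for a month.
    You pay for the allocation whether or not you use it."""
    return hours * (read_units * READ_RATE_PER_UNIT_HOUR
                    + write_units * WRITE_RATE_PER_UNIT_HOUR)

# Provisioning 1000 read units and 200 write units of capacity:
cost = monthly_cost(read_units=1000, write_units=200)
print(round(cost, 2))  # 189.8
```

Because the charge accrues on the allocation rather than on actual traffic, the best-practices advice the text links to is largely about shaping the application so that a lower provisioned capacity suffices.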
Note
Of course, the discussion of DynamoDB and SimpleDB assumes a non-relational data model; for a relational database in the cloud, there is the Amazon Relational Database Service (Amazon RDS).
Kinesis
Just as EMR is hosted Hadoop and DynamoDB has similarities to a hosted HBase, it wasn't surprising to see AWS announce Kinesis, a hosted streaming data service, in 2013. It can be found at http://aws.amazon.com/kinesis and has very similar conceptual building blocks to the stack of Samza atop Kafka. Kinesis provides a partitioned view of messages as a stream of data, plus an API for callbacks that execute when messages arrive. As with most AWS services, there is tight integration with other services, making it easy to get data into and out of locations such as S3.
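The two building blocks named above, partitioning by key and per-record callbacks, can be sketched in miniature. This is plain Python in the spirit of Kinesis/Kafka, not the AWS API; `ToyStream` and its shard routing are invented for illustration:

```python
# Sketch of a partitioned stream with per-record callbacks (not the
# AWS Kinesis API). Records are routed to a shard by hashing their
# partition key; consumers register callbacks fired on arrival.

class ToyStream:
    def __init__(self, num_shards):
        self.shards = [[] for _ in range(num_shards)]
        self.callbacks = []

    def put_record(self, partition_key, data):
        # Same key always lands in the same shard, preserving per-key order.
        shard = hash(partition_key) % len(self.shards)
        self.shards[shard].append((partition_key, data))
        for cb in self.callbacks:
            cb(shard, partition_key, data)

    def on_record(self, callback):
        self.callbacks.append(callback)

stream = ToyStream(num_shards=2)
seen = []
stream.on_record(lambda shard, key, data: seen.append((key, data)))
stream.put_record("sensor-1", "t=21.5")
stream.put_record("sensor-2", "t=19.0")
print(seen)  # [('sensor-1', 't=21.5'), ('sensor-2', 't=19.0')]
```

The real service durably stores records in each shard for a retention window and lets consumers poll from a position in the shard; the synchronous callback here just mirrors the programming model of reacting to records as they arrive.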
Data Pipeline
The final AWS service that we'll mention is Data Pipeline, which can be found at http://aws.amazon.com/datapipeline. As the name suggests, it is a framework for building up data-processing jobs that involve multiple steps, data movements, and transformations. It has quite a conceptual overlap with Oozie, but with a few twists. Firstly, Data Pipeline has the expected deep integration with many other AWS services, enabling easy definition of data workflows that incorporate diverse repositories such as RDS, S3, and DynamoDB. In addition, however, Data Pipeline has the ability to integrate agents installed on local infrastructure, providing an interesting avenue for building workflows that span the AWS and on-premises environments.
Sources of information
You don't just need new technologies and tools, even if they are cool. Sometimes, a little help from a more experienced source can pull you out of a hole. In this regard, you are well covered, as the Hadoop community is extremely strong in many areas.
Source code
It's sometimes easy to overlook, but Hadoop and all the other Apache projects are, after all, fully open source. The actual source code is the ultimate source (pardon the pun) of information about how the system works. Becoming familiar with the source and tracing through some of the functionality can be hugely informative, not to mention helpful when you are hitting unexpected behavior.
Mailing lists and forums
Almost all the projects and services listed in this chapter have their own mailing lists and/or forums; check out the homepages for the specific links. Most distributions also have their own forums and other mechanisms to share knowledge and get (non-commercial) help from the community. Additionally, if using AWS, make sure to check out the AWS developer forums at https://forums.aws.amazon.com.

Always remember to read posting guidelines carefully and understand the expected etiquette. These are tremendous sources of information; the lists and forums are often frequented by the developers of the particular project. Expect to see the core Hadoop developers on the Hadoop lists, Hive developers on the Hive list, EMR developers on the EMR forums, and so on.
LinkedIn groups
There are a number of Hadoop and related groups on the professional social network LinkedIn. Do a search for your particular areas of interest, but a good starting point might be the general Hadoop users' group at http://www.linkedin.com/groups/Hadoop-Users-988957.
HUGs
If you want more face-to-face interaction, then look for a Hadoop User Group (HUG) in your area; most of them are listed at http://wiki.apache.org/hadoop/HadoopUserGroups. These tend to arrange semi-regular get-togethers that combine quality presentations, the ability to discuss technology with like-minded individuals, and often pizza and drinks.

No HUG near where you live? Consider starting one.
Conferences
Though some industries take decades to build up a conference circuit, Hadoop already has significant conference activity involving the open source, academic, and commercial worlds. Events such as Hadoop Summit and Strata are pretty big; these and some others are linked from http://wiki.apache.org/hadoop/Conferences.
Summary
In this chapter, we took a quick gallop around the broader Hadoop ecosystem, looking at the following topics:

Why alternative Hadoop distributions exist, and some of the more popular ones
Other projects that provide capabilities, extensions, or Hadoop supporting tools
Alternative ways of writing or creating Hadoop jobs
Sources of information and how to connect with other enthusiasts

Now, go have fun and build something amazing!
Index

A

additional data, collecting
    about / Collecting additional data
    workflows, scheduling / Scheduling workflows
    Oozie triggers / Other Oozie triggers
addMapper method, arguments
    job / Text cleanup using chain mapper
    class / Text cleanup using chain mapper
    inputKeyClass / Text cleanup using chain mapper
    inputValueClass / Text cleanup using chain mapper
    outputKeyClass / Text cleanup using chain mapper
    outputValueClass / Text cleanup using chain mapper
    mapperConf / Text cleanup using chain mapper
alternative distributions
    about / Alternative distributions
    Cloudera Distribution / Cloudera Distribution for Hadoop
    Hortonworks Data Platform (HDP) / Hortonworks Data Platform
    MapR / MapR
    selecting / Choosing a distribution
Amazon account
    reference link / Creating an AWS account
Amazon CLI
    reference link / The AWS command-line interface
Amazon EMR
    about / Amazon EMR
    AWS account, creating / Creating an AWS account
    required services, signing up / Signing up for the necessary services
Amazon Relational Database Service (Amazon RDS) / SimpleDB and DynamoDB
Amazon Web Services
    Hive, working with / Hive and Amazon Web Services
Ambari
    about / Ambari – the open source alternative
    URL / Ambari – the open source alternative
AMPLab at UC Berkeley, URL / Apache Spark
Apache Avro
    about / Avro
    URL / Avro
Apache Crunch
    about / Apache Crunch
    URL / Apache Crunch
    JARs / Getting started
    libraries / Getting started
    concepts / Concepts
    PCollection<T> interface / Concepts
    PTable<Key, Value> interface / Concepts
    data serialization / Data serialization
    data processing patterns / Data processing patterns
    Pipelines implementation / Pipelines implementation and execution
    execution / Pipelines implementation and execution
    examples / Crunch examples
    Kite Morphlines / Kite Morphlines
Apache DataFu
    reference link / Contributed UDFs, Apache DataFu
    about / Apache DataFu
Apache Giraph
    about / Apache Giraph
    URL / Apache Giraph
Apache HAMA
    about / Apache HAMA
Apache Kafka
    URL / Apache Samza, Samza's best friend – Apache Kafka
    about / Samza's best friend – Apache Kafka
    Twitter data, getting into / Getting Twitter data into Kafka
Apache Knox
    about / Beyond basic authorization
    URL / Beyond basic authorization
Apache Sentry
    URL / Beyond basic authorization
Apache Spark
    about / Apache Spark, Getting started with Spark
    URL / Apache Spark, Getting started with Spark
    cluster computing, with working sets / Cluster computing with working sets
    Resilient Distributed Datasets (RDDs) / Resilient Distributed Datasets (RDDs)
    actions / Actions
    deployment / Deployment
    on YARN / Spark on YARN
    on EC2 / Spark on EC2
    standalone applications, writing / Writing and running standalone applications
    Scala API / Scala API
    Java API / Java API
    WordCount, in Java / WordCount in Java
    Python API / Python API
    data, processing / Processing data with Apache Spark
Apache Spark, ecosystem
    about / The Spark ecosystem
    Spark Streaming / Spark Streaming
    GraphX / GraphX
    MLlib / MLlib
    Spark SQL / Spark SQL
Apache Storm
    about / Apache Storm
    URL / Apache Storm
Apache Thrift
    about / Thrift
    URL / Thrift
Apache Tika
    about / Multijob workflows
    URL / Multijob workflows
Apache Twill
    URL / Thinking in layers
Apache ZooKeeper
    about / Apache ZooKeeper – a different type of filesystem
    URL / Apache ZooKeeper – a different type of filesystem
    distributed lock, implementing with sequential ZNodes / Implementing a distributed lock with sequential ZNodes
    group membership, implementing / Implementing group membership and leader election using ephemeral ZNodes
    leader election, implementing with ephemeral ZNodes / Implementing group membership and leader election using ephemeral ZNodes
    Java API / Java API
    blocks, building / Building blocks
    used, for enabling automatic NameNode failover / Automatic NameNode failover
application development
    framework, selecting / Choosing a framework
ApplicationManager
    about / ResourceManager, NodeManager, and ApplicationManager
ApplicationMaster (AM)
    about / Anatomy of a YARN application
architectural principles, HDFS and MapReduce / Common building blocks
Array wrapper classes
    about / Array wrapper classes
automatic NameNode failover
    enabling / Automatic NameNode failover
Avro
    about / Avro
Avro schema evolution, using
    thoughts / Final thoughts on using Avro schema evolution
    additive changes, making / Only make additive changes
    schema versions, managing explicitly / Manage schema versions explicitly
    schema distribution / Think about schema distribution
Avro schemas
    about / Using the Java API
AvroSerde
    URL / Avro
    about / Avro
AWS
    about / Distributions of Apache Hadoop, AWS – infrastructure on demand from Amazon
    Simple Storage Service (S3) / Simple Storage Service (S3)
    Elastic MapReduce (EMR) / Elastic MapReduce (EMR)
AWS command-line interface
    about / The AWS command-line interface
    reference link / The AWS command-line interface
AWS credentials
    about / AWS credentials
    account ID / AWS credentials
    access key / AWS credentials
    secret access key / AWS credentials
    key pairs / AWS credentials
    reference link / AWS credentials
AWS developer forums
    URL / Mailing lists and forums
AWS resources
    about / AWS resources
    SimpleDB / SimpleDB and DynamoDB
    DynamoDB / SimpleDB and DynamoDB
    Data Pipeline / Data Pipeline
B

block replication
    about / Block replication
Bulk Synchronous Parallel (BSP) model
    about / Apache Giraph
C

Cascading
    about / Cascading
    URL / Cascading
    reference links / Cascading
Cloudera
    URL / Distributions of Apache Hadoop
    URL, for documentation / Cloudera Manager
    URL, for blog post / Sharing resources
Cloudera distribution
    about / Cloudera Distribution for Hadoop
    URL / Cloudera Distribution for Hadoop
Cloudera Hadoop Distribution (CDH)
    about / Cloudera Manager
Cloudera Kitten
    URL / Thinking in layers
Cloudera Manager
    about / Cloudera Manager
    payment, for subscription services / To pay or not to pay
    cluster management, performing / Cluster management using Cloudera Manager
    integrating, with systems management tools / Cloudera Manager and other management tools
    monitoring with / Monitoring with Cloudera Manager
    log files, finding / Finding configuration files
Cloudera Manager API
    about / Cloudera Manager API
Cloudera Manager lock-in
    about / Cloudera Manager lock-in
Cloudera QuickStart VM
    about / Cloudera QuickStart VM
    advantages / Cloudera QuickStart VM
cluster
    building, on EMR / Building a cluster on EMR
cluster, Apache Spark
    computing, with working sets / Cluster computing with working sets
cluster, on EMR
    filesystem, considerations / Considerations about filesystems
    data, obtaining into EMR / Getting data into EMR
    EC2 instances / EC2 instances and tuning
    EC2 tuning / EC2 instances and tuning
cluster management
    performing, Cloudera Manager used / Cluster management using Cloudera Manager
cluster startup, HDFS
    about / Cluster startup
    NameNode startup / NameNode startup
    DataNode startup / DataNode startup
cluster tuning
    about / Cluster tuning
    JVM considerations / JVM considerations
    map optimization / Map and reduce optimizations
    reduce optimization / Map and reduce optimizations
column-oriented data formats
    about / Column-oriented data formats
    RCFile / RCFile
    ORC / ORC
    Parquet / Parquet
    Avro / Avro
    Java API, using / Using the Java API
columnar
    about / Columnar stores
columnar stores / Columnar stores
combiner class, Java API to MapReduce
    about / Combiner
combineValues operation
    about / Concepts
command-line access, HDFS filesystem
    about / Command-line access to the HDFS filesystem
    hdfs command / Command-line access to the HDFS filesystem
    dfs command / Command-line access to the HDFS filesystem
    dfsadmin command / Command-line access to the HDFS filesystem
Comparable interface
    about / The Comparable and WritableComparable interfaces
complex data types
    map / Pig data types
    tuple / Pig data types
    bag / Pig data types
complex event processing (CEP)
    about / How Samza works
components, Hadoop
    about / Components of Hadoop
    common building blocks / Common building blocks
    storage / Storage
    computation / Computation
components, YARN
    about / The components of YARN
    ResourceManager (RM) / The components of YARN
    NodeManager (NM) / The components of YARN
computation
    about / Computation
computation, Hadoop 2
    about / Computation in Hadoop 2
computational frameworks
    about / Other computational frameworks
    Apache Storm / Apache Storm
    Apache Giraph / Apache Giraph, Apache HAMA
conferences
    about / Conferences
    reference link / Conferences
configuration file, Samza
    about / The configuration file
containers
    about / Serialization and Containers
contributed UDFs
    about / Contributed UDFs
    Piggybank / Piggybank
    Elephant Bird / Elephant Bird
    Apache DataFu / Apache DataFu
create.hql script
    reference link / Extracting data and ingesting into Hive
Crunch examples
    about / Crunch examples
    word co-occurrence / Word co-occurrence
    TF-IDF / TF-IDF
Curator project
    reference link / Building blocks
D
data, managing
  about / Managing and serializing data
  Writable interface / The Writable interface
  wrapper classes / Introducing the wrapper classes
  Array wrapper classes / Array wrapper classes
  Comparable interface / The Comparable and WritableComparable interfaces
  WritableComparable interface / The Comparable and WritableComparable interfaces
data, Pig
  working with / Working with data
  FILTER operator / Filtering
  aggregation / Aggregation
  FOREACH operator / Foreach
  JOIN operator / Join
data, storing
  about / Storing data
  serialization file format / Serialization and Containers
  containers file format / Serialization and Containers
  file compression / Compression
  general-purpose file formats / General-purpose file formats
  column-oriented data formats / Column-oriented data formats
Data Core
  about / Data Core
Data Crunch
  about / Data Crunch
Data HCatalog
  about / Data HCatalog
Data Hive
  about / Data Hive
data lifecycle management
  about / What data lifecycle management is
  importance / Importance of data lifecycle management
  tools / Tools to help
Data MapReduce
  about / Data MapReduce
DataNode / NameNode and DataNode
DataNodes
  about / Storage in Hadoop 2
DataNode startup
  about / DataNode startup
DataPipeline
  about / DataPipeline
  reference link / DataPipeline
data processing
  about / Data processing with Hadoop
  dataset, generating from Twitter / Why Twitter?
  dataset, building / Building our first dataset
  programmatic access, with Python / Programmatic access with Python
data processing, Apache Spark
  about / Processing data with Apache Spark
  examples, running / Building and running the examples
  examples, building / Building and running the examples
  examples, running on YARN / Running the examples on YARN
  popular topics, finding / Finding popular topics
  sentiment, assigning to topics / Assigning a sentiment to topics
  on streams / Data processing on streams
  state management / State management
  data analysis, with Spark SQL / Data analysis with Spark SQL
  SQL, on data streams / SQL on data streams
data processing patterns, Crunch
  about / Data processing patterns
  aggregation and sorting / Aggregation and sorting
  joining data / Joining data
data serialization, Crunch
  about / Data serialization
dataset, building with Twitter
  about / Building our first dataset
  multiple APIs, using / One service, multiple APIs
  anatomy, of Tweet / Anatomy of a Tweet
  Twitter credentials / Twitter credentials
Data Spark
  about / Data Spark
data types, Hive
  numeric / Data types
  date and time / Data types
  string / Data types
  collections / Data types
  misc / Data types
data types, Pig
  scalar data types / Pig data types
  complex data types / Pig data types
DDL statements, Hive / DDL statements
decayFactor function / State management
DEFINE operator
  about / Extending Pig (UDFs)
derived data, producing
  about / Producing derived data
  multiple actions, performing in parallel / Performing multiple actions in parallel
  subworkflow, calling / Calling a subworkflow
  global settings, adding / Adding global settings
DevOps practices / Hadoop and DevOps practices
directed acyclic graph (DAG)
  about / YARN
document frequency
  about / Calculate document frequency
  calculating, TF-IDF used / Calculate document frequency
Drill
  URL / Drill, Tajo, and beyond
  about / Drill, Tajo, and beyond
Driver class, Java API to MapReduce
  about / The Driver class
dynamic invokers
  about / Dynamic invokers
  reference link / Dynamic invokers
DynamoDB
  URL / SimpleDB and DynamoDB
  about / SimpleDB and DynamoDB
E
EC2
  Apache Spark on / Spark on EC2
EC2 key-value pair
  reference link / The AWS command-line interface
Elastic MapReduce
  Hive, using with / Hive on Elastic MapReduce
Elastic MapReduce (EMR)
  about / Distributions of Apache Hadoop, Elastic MapReduce (EMR)
  URL / Elastic MapReduce (EMR)
  using / Using Elastic MapReduce
Elephant Bird
  reference link / Contributed UDFs, Elephant Bird
EMR
  cluster, building on / Building a cluster on EMR
  URL, for best practices / Building a cluster on EMR
EMR documentation
  URL / Hive on Elastic MapReduce
entities
  about / Tweet metadata
ephemeral ZNodes
  about / Implementing group membership and leader election using ephemeral ZNodes
eval functions, Pig
  AVG(expression) / Eval
  COUNT(expression) / Eval
  COUNT_STAR(expression) / Eval
  IsEmpty(expression) / Eval
  MAX(expression) / Eval
  MIN(expression) / Eval
  SUM(expression) / Eval
  TOKENIZE(expression) / Eval
examples
  running / Running the examples
examples, MapReduce programs
  reference link / Running the examples
  local cluster / Local cluster
  Elastic MapReduce / Elastic MapReduce
examples and source code
  download link / Getting started
ExecutionEngine interface / An overview of Pig
external data, challenges
  about / Challenges of external data
  data validation / Data validation
  validation actions / Validation actions
  format changes, handling / Handling format changes
  schema evolution, handling with Avro / Handling schema evolution with Avro
EXTERNAL keyword / DDL statements
Extract-Transform-Load (ETL) / DDL statements
extract_for_hive.pig
  URL, for source code / Prerequisites
F
Falcon
  URL / Other tools to help
  about / Other tools to help
file format, Hive
  about / File formats and storage
  JSON / JSON
FileFormat classes, Hive
  TextInputFormat / File formats and storage
  HiveIgnoreKeyTextOutputFormat / File formats and storage
  SequenceFileInputFormat / File formats and storage
  SequenceFileOutputFormat / File formats and storage
filesystem metadata, HDFS
  protecting / Protecting the filesystem metadata
  Secondary NameNode, demerits / Secondary NameNode not to the rescue
  Hadoop 2 NameNode HA / Hadoop 2 NameNode HA
  client configuration / Client configuration
  failover, working / How a failover works
FILTER operator
  about / Filtering
FlumeJava
  reference link / Apache Crunch
FOREACH operator
  about / Foreach
fork node
  about / Performing multiple actions in parallel
functions, Pig
  about / Pig functions
  built-in functions / Pig functions
  reference link, for built-in functions / Pig functions
  load/store functions / Load/store
  eval / Eval
  tuple / The tuple, bag, and map functions
  bag / The tuple, bag, and map functions
  map / The tuple, bag, and map functions
  string / The math, string, and datetime functions
  math / The math, string, and datetime functions
  datetime / The math, string, and datetime functions
  dynamic invokers / Dynamic invokers
  macros / Macros
G
Garbage Collection (GC) / JVM considerations
Garbage First (G1) collector / JVM considerations
general-purpose file formats
  about / General-purpose file formats
  Text files / General-purpose file formats
  SequenceFile / General-purpose file formats
general availability (GA) / A note on versioning
Google Chubby system
  reference link / Apache ZooKeeper – a different type of filesystem
Google File System (GFS)
  reference link / The background of Hadoop
Gradle
  URL / Running the examples
GraphX
  about / GraphX
  URL / GraphX
groupByKey() method / Aggregation and sorting
groupByKey(GroupingOptions options) method / Aggregation and sorting
groupByKey(int numPartitions) method / Aggregation and sorting
groupByKey operation
  about / Concepts
GROUP operator
  about / Aggregation
Grunt
  about / Grunt – the Pig interactive shell
  sh command / Grunt – the Pig interactive shell
  help command / Grunt – the Pig interactive shell
Guava library
  URL / The TopN pattern
H
Hadoop
  versioning / A note on versioning
  background / The background of Hadoop
  components / Components of Hadoop
  dual approach / A dual approach
  about / Getting started
  using / Getting Hadoop up and running
  EMR, using / How to use EMR
  AWS credentials / AWS credentials
  data processing / Data processing with Hadoop
  practices / Hadoop and DevOps practices
  alternative distributions / Alternative distributions
  computational frameworks / Other computational frameworks
  interesting projects / Other interesting projects
  programming abstractions / Other programming abstractions
  AWS resources / AWS resources
  sources of information / Sources of information
Hadoop-provided InputFormat, MapReduce job
  about / Hadoop-provided InputFormat
  FileInputFormat / Hadoop-provided InputFormat
  SequenceFileInputFormat / Hadoop-provided InputFormat
  TextInputFormat / Hadoop-provided InputFormat
  KeyValueTextInputFormat / Hadoop-provided InputFormat
Hadoop-provided Mapper and Reducer implementations, Java API to MapReduce
  about / Hadoop-provided mapper and reducer implementations
  mappers / Hadoop-provided mapper and reducer implementations
  reducers / Hadoop-provided mapper and reducer implementations
Hadoop-provided OutputFormat, MapReduce job
  about / Hadoop-provided OutputFormat
  FileOutputFormat / Hadoop-provided OutputFormat
  NullOutputFormat / Hadoop-provided OutputFormat
  SequenceFileOutputFormat / Hadoop-provided OutputFormat
  TextOutputFormat / Hadoop-provided OutputFormat
Hadoop-provided RecordReader, MapReduce job
  about / Hadoop-provided RecordReader
  LineRecordReader / Hadoop-provided RecordReader
  SequenceFileRecordReader / Hadoop-provided RecordReader
Hadoop 2
  about / Hadoop 2 – what's the big deal?
  storage / Storage in Hadoop 2
  computation / Computation in Hadoop 2
  diagrammatic representation, architecture / Computation in Hadoop 2
  reference link / Getting started
  operations / Operations in the Hadoop 2 world
Hadoop 2 NameNode HA
  about / Hadoop 2 NameNode HA
  enabling / Hadoop 2 NameNode HA
  keeping, in sync / Keeping the HA NameNodes in sync
Hadoop Distributed File System (HDFS) / NameNode and DataNode
Hadoop distributions
  about / Distributions of Apache Hadoop
  Hortonworks / Distributions of Apache Hadoop
  Cloudera / Distributions of Apache Hadoop
  MapR / Distributions of Apache Hadoop
  reference link / Distributions of Apache Hadoop
Hadoop filesystems
  about / Hadoop filesystems
  reference link / Hadoop filesystems
  Hadoop interfaces / Hadoop interfaces
Hadoop interfaces
  about / Hadoop interfaces
  Java FileSystem API / Java FileSystem API
  Libhdfs / Libhdfs
  Apache Thrift / Thrift
Hadoop operations
  about / I'm a developer – I don't care about operations!
Hadoop security
  future / The future of Hadoop security
Hadoop security model
  evolution / Evolution of the Hadoop security model
  additional security features / Beyond basic authorization
Hadoop streaming
  about / Hadoop streaming
  wordcount, streaming in Python / Streaming word count in Python
  differences in jobs / Differences in jobs when using streaming
  importance of words, determining / Finding important words in text
Hadoop UI
  URL / Other tools to help
  about / Other tools to help
Hadoop User Group (HUG) / HUGs
hashtagRegExp / Trending topics
hashtags
  about / Sentiment of hashtags
HBase
  about / HBase
  URL / HBase
HCatalog
  about / Introducing HCatalog
  using / Using HCatalog
HCat CLI tool
  about / Using HCatalog
hcat utility
  about / Using HCatalog
HDFS
  about / Components of Hadoop, Storage, Samza and HDFS
  characteristics / Storage
  architecture / The inner workings of HDFS
  NameNode / The inner workings of HDFS
  DataNodes / The inner workings of HDFS
  cluster startup / Cluster startup
  block replication / Block replication
HDFS and MapReduce
  merits / Better together
HDFS filesystem
  command-line access / Command-line access to the HDFS filesystem
  exploring / Exploring the HDFS filesystem
HDFS snapshots
  about / HDFS snapshots
Hello Samza
  about / Hello Samza!
  URL / Hello Samza!
high-availability (HA)
  about / Storage in Hadoop 2
High Performance Computing (HPC) / Computation in Hadoop 2
Hive
  about / Hive-on-tez
  URL / Hive-on-tez
  overview / Overview of Hive
  data types / Data types
  DDL statements / DDL statements
  file formats / File formats and storage
  storage / File formats and storage
  queries / Queries
  scripts, writing / Writing scripts
  working, with Amazon Web Services / Hive and Amazon Web Services
  using, with S3 / Hive and S3
  using, with Elastic MapReduce / Hive on Elastic MapReduce
  URL, for source code of JDBC client / JDBC
  URL, for source code of Thrift client / Thrift
Hive-JSON-Serde
  URL / JSON
hive-json module
  URL / JSON
  about / JSON
Hive-on-tez
  about / Hive-on-tez
Hive 0.13
  about / Hive-on-tez
Hive architecture
  about / Hive architecture
HiveQL
  about / Why SQL on Hadoop, Queries
  extending / Extending HiveQL
HiveServer2
  about / Hive architecture
  URL / Hive architecture
Hive tables
  about / The nature of Hive tables
  structuring, from workloads / Structuring Hive tables for given workloads
Hortonworks' HDP
  URL / Spark on YARN
Hortonworks
  URL / Distributions of Apache Hadoop
Hortonworks Data Platform (HDP)
  about / Alternative distributions, Hortonworks Data Platform
  URL / Hortonworks Data Platform
Hue
  about / Hue
  URL / Hue
HUGs
  about / HUGs
  reference link / HUGs
I
IAM console
  URL / Hive and S3
IBM InfoSphere BigInsights
  about / And the rest…
Identity and Access Management (IAM) / AWS credentials
Impala
  about / Impala
  references / Impala, Co-existing with Hive
  architecture / The architecture of Impala
  co-existing, with Hive / Co-existing with Hive
in-sync replicas (ISR)
  about / Getting Twitter data into Kafka
indices attribute, entity
  about / Tweet metadata
input/output, MapReduce job
  about / Input/Output
InputFormat, MapReduce job
  about / InputFormat and RecordReader
J
Java
  WordCount / WordCount in Java
Java API
  about / Java API
  and Scala API, differences / Java API
Java API to MapReduce
  about / Java API to MapReduce
  Mapper class / The Mapper class
  Reducer class / The Reducer class
  Driver class / The Driver class
  combiner class / Combiner
  partitioning / Partitioning
  Hadoop-provided Mapper and Reducer implementations / Hadoop-provided mapper and reducer implementations
  reference data, sharing / Sharing reference data
Java FileSystem API
  about / Java FileSystem API
JDBC
  about / JDBC
JobTracker monitoring, MapReduce job
  about / Ongoing JobTracker monitoring
join node
  about / Performing multiple actions in parallel
JOIN operator
  about / Join, Queries
JSON
  about / JSON
JSON Simple
  URL / Building a tweet parsing job
JVM considerations, cluster tuning
  about / JVM considerations
  small files problem / The small files problem
K
kite-morphlines-avro command / Morphline commands
kite-morphlines-core-stdio command / Morphline commands
kite-morphlines-core-stdlib command / Morphline commands
kite-morphlines-hadoop-core command / Morphline commands
kite-morphlines-hadoop-parquet-avro command / Morphline commands
kite-morphlines-hadoop-rcfile command / Morphline commands
kite-morphlines-hadoop-sequencefile command / Morphline commands
kite-morphlines-json command / Morphline commands
Kite Data
  about / Kite Data
  Data Core / Data Core
  Data HCatalog / Data HCatalog
  Data Hive / Data Hive
  Data MapReduce / Data MapReduce
  Data Spark / Data Spark
  Data Crunch / Data Crunch
Kite examples
  reference link / Kite Data
Kite JARs
  reference link / Kite Data
Kite Morphlines
  about / Kite Morphlines
  concepts / Concepts
  Record abstractions / Concepts
  commands / Morphline commands
Kite SDK
  URL / Kite Data
KVM
  reference link / Cloudera QuickStart VM
L
Lambda syntax
  URL / Python API
Libhdfs
  about / Libhdfs
LinkedIn groups
  about / LinkedIn groups
  URL / LinkedIn groups
Log4j
  about / Logging levels
log files
  accessing / Access to log files
logging levels
  about / Logging levels
M
Machine Learning (ML)
  about / MLlib
macros
  about / Macros
Mahout
  about / Mahout
  URL / Mahout
map optimization, cluster tuning
  considerations / Map and reduce optimizations
Mapper class, Java API to MapReduce
  about / The Mapper class
mapper execution, MapReduce job
  about / Mapper execution
mapper input, MapReduce job
  about / Mapper input
mapper output, MapReduce job
  about / Mapper output and reducer input
mappers, Mapper and Reducer implementations
  InverseMapper / Hadoop-provided mapper and reducer implementations
  TokenCounterMapper / Hadoop-provided mapper and reducer implementations
  IdentityMapper / Hadoop-provided mapper and reducer implementations
MapR
  URL / Distributions of Apache Hadoop, MapR
  about / MapR
MapReduce
  reference link / The background of Hadoop, MapReduce
  about / MapReduce
  Map phase / MapReduce
MapReduce API
  about / Components of Hadoop, Computation
MapReduce driver source code
  reference link / Morphline commands
MapReduce job
  about / Walking through a run of a MapReduce job
  startup / Startup
  input, splitting / Splitting the input
  task assignment / Task assignment
  task startup / Task startup
  JobTracker monitoring / Ongoing JobTracker monitoring
  mapper input / Mapper input
  mapper execution / Mapper execution
  mapper output / Mapper output and reducer input
  reducer input / Reducer input
  reducer execution / Reducer execution
  reducer output / Reducer output
  shutdown / Shutdown
  input/output / Input/Output
  InputFormat / InputFormat and RecordReader
  RecordReader / InputFormat and RecordReader
  Hadoop-provided InputFormat / Hadoop-provided InputFormat
  Hadoop-provided RecordReader / Hadoop-provided RecordReader
  OutputFormat / OutputFormat and RecordWriter
  RecordWriter / OutputFormat and RecordWriter
  Hadoop-provided OutputFormat / Hadoop-provided OutputFormat
  sequence files / Sequence files
MapReduce programs
  writing / Writing MapReduce programs, Getting started
  examples, running / Running the examples
  WordCount example / WordCount, the Hello World of MapReduce
  word co-occurrences / Word co-occurrences
  social network topics / Trending topics
  reference link, for HashTagCount example source code / Trending topics
  TopN pattern / The TopN pattern
  reference link, for TopTenHashTag source code / The TopN pattern
  hashtags / Sentiment of hashtags
  reference link, for HashTagSentiment source code / Sentiment of hashtags
  text cleanup, chain mapper used / Text cleanup using chain mapper
  reference link, for HashTagSentimentChain source code / Text cleanup using chain mapper
Massively Parallel Processing (MPP)
  about / The architecture of Impala
MemPipeline
  about / MemPipeline
Message Passing Interface (MPI) / Computation in Hadoop 2
MLlib
  about / MLlib
monitoring
  about / Monitoring
  Hadoop / Hadoop – where failures don't matter
  application-level metrics / Application-level metrics
monitoring tools
  about / Monitoring integration
MorphlineDriver source code
  reference link / Morphline commands
Morphline commands
  kite-morphlines-core-stdio / Morphline commands
  kite-morphlines-core-stdlib / Morphline commands
  kite-morphlines-avro / Morphline commands
  kite-morphlines-json / Morphline commands
  kite-morphlines-hadoop-parquet-avro / Morphline commands
  kite-morphlines-hadoop-sequencefile / Morphline commands
  kite-morphlines-hadoop-rcfile / Morphline commands
  reference link / Morphline commands
MR Execution Engine / An overview of Pig
Multipart Upload
  URL / Getting data into EMR
N
NameNode
  about / Storage in Hadoop 2, NameNode and DataNode
NameNode HA
  about / Storage in Hadoop 2
NameNode startup
  about / NameNode startup
NFS share / Keeping the HA NameNodes in sync
NodeManager
  about / ResourceManager, NodeManager, and ApplicationManager
NodeManager (NM)
  about / The components of YARN
O
Oozie
  about / Introducing Oozie
  URL / Introducing Oozie
  features / Introducing Oozie
  action nodes / Introducing Oozie
  HDFS file permissions / A note on HDFS file permissions
  development, making easier / Making development a little easier
  data, extracting / Extracting data and ingesting into Hive
  data, ingesting into Hive / Extracting data and ingesting into Hive
  workflow directory structure / A note on workflow directory structure
  HCatalog / Introducing HCatalog
  sharelib / The Oozie sharelib
  HCatalog and partitioned tables / HCatalog and partitioned tables
  using / Pulling it all together
Oozie triggers / Other Oozie triggers
Oozie workflow
  about / Introducing Oozie
operations, Hadoop 2
  about / Operations in the Hadoop 2 world
opinion lexicon
  URL / Sentiment of hashtags
Optimized Row Columnar file format (ORC)
  about / ORC
  reference link / ORC
ORC
  URL / Columnar stores
org.apache.zookeeper.ZooKeeper class
  about / Java API
OutputFormat, MapReduce job
  about / OutputFormat and RecordWriter
P
parallelDo operation
  about / Concepts
PARALLEL operator
  about / Aggregation
Parquet
  reference link / Parquet
  about / Parquet
  URL / Columnar stores
partitioning, Java API to MapReduce
  about / Partitioning
  optional partition function / The optional partition function
PCollection<T> interface, Crunch
  about / Concepts
physical cluster
  building / Building a physical cluster
physical cluster, considerations
  about / Physical layout
  rack awareness / Rack awareness
  service layout / Service layout
  service, upgrading / Upgrading a service
Pig
  overview / An overview of Pig
  use cases / An overview of Pig
  about / Getting started, Why SQL on Hadoop
  running / Running Pig
  reference link, for source code and binary distributions / Running Pig
  Grunt / Grunt – the Pig interactive shell
  Elastic MapReduce / Elastic MapReduce
  fundamentals / Fundamentals of Apache Pig
  reference link, for parallel feature / Fundamentals of Apache Pig
  reference link, for multi-query implementation / Fundamentals of Apache Pig
  programming / Programming Pig
  data types / Pig data types
  functions / Pig functions
  data, working with / Working with data
Piggybank
  about / Piggybank
Pig Latin / An overview of Pig
Pig UDFs
  extending / Extending Pig (UDFs)
  contributed UDFs / Contributed UDFs
pipelines implementation, Apache Crunch
  about / Pipelines implementation and execution
  SparkPipeline / SparkPipeline
  MemPipeline / MemPipeline
positive_words operator
  about / Join
prerequisites
  about / Prerequisites
Predictive Model Markup Language (PMML) / Cascading
processing models, YARN
  Cloudera Kitten / Thinking in layers
  Apache Twill / Thinking in layers
programmatic interfaces
  about / Programmatic interfaces
  JDBC / JDBC
  Thrift / Thrift
Project Rhino
  URL / The future of Hadoop security
PTable<Key, Value> interface, Crunch
  about / Concepts
Python
  used, for programmatic access / Programmatic access with Python
Python API
  about / Python API
Q
QJM mechanism
  about / Keeping the HA NameNodes in sync
queries, Hive / Queries
R
RDDs
  about / Cluster computing with working sets, Resilient Distributed Datasets (RDDs)
RDDs, operations
  map / Actions
  filter / Actions
  reduce / Actions
  collect / Actions
  foreach / Actions
  groupByKey / Actions
  sortByKey / Actions
Record abstractions
  implementing / Concepts
RecordReader, MapReduce job
  about / InputFormat and RecordReader
RecordWriter, MapReduce job
  about / OutputFormat and RecordWriter
Reduce function
  about / MapReduce
reduce optimization, cluster tuning
  considerations / Map and reduce optimizations
Reducer class, Java API to MapReduce
  about / The Reducer class
reducer execution, MapReduce job
  about / Reducer execution
reducer input, MapReduce job
  about / Reducer input
reducer output, MapReduce job
  about / Reducer output
reducers, Mapper and Reducer implementations
  IntSumReducer / Hadoop-provided mapper and reducer implementations
  LongSumReducer / Hadoop-provided mapper and reducer implementations
  IdentityReducer / Hadoop-provided mapper and reducer implementations
reference data, Java API to MapReduce
  sharing / Sharing reference data
REGISTER operator
  about / Extending Pig (UDFs)
required services, AWS
  Simple Storage Service (S3) / Signing up for the necessary services
  Elastic MapReduce / Signing up for the necessary services
  Elastic Compute Cloud (EC2) / Signing up for the necessary services
ResourceManager
  about / ResourceManager, NodeManager, and ApplicationManager
  applications / Applications
  Nodes view / Nodes
  Scheduler window / Scheduler
  MapReduce / MapReduce
  MapReduce v1 / MapReduce v1
  MapReduce v2 (YARN) / MapReduce v2 (YARN)
  JobHistoryServer / JobHistoryServer
resources
  sharing / Sharing resources
Role Based Access Control (RBAC) / Beyond basic authorization
Row Columnar File (RCFile)
  about / RCFile
  reference link / RCFile
S
S3
  Hive, using with / Hive and S3
s3distcp
  URL / Getting data into EMR
s3n / Hadoop filesystems
Samza
  about / Apache Samza
  URL / Apache Samza, Stream processing with Samza
  YARN-independent frameworks / YARN-independent frameworks
  used, for stream processing / Stream processing with Samza
  working / How Samza works
  architecture / Samza high-level architecture
  Apache Kafka / Samza's best friend – Apache Kafka
  integrating, with YARN / YARN integration
  independent model / An independent model
  Hello Samza / Hello Samza!
  tweet parsing job, building / Building a tweet parsing job
  configuration file / The configuration file
  URL, for configuration options / The configuration file
  Twitter data, getting into Apache Kafka / Getting Twitter data into Kafka
  HDFS / Samza and HDFS
  window function, adding / Windowing functions
  multijob workflows / Multijob workflows
  tweet sentiment analysis, performing / Tweet sentiment analysis
  tasks processing / Stateful tasks
  and Spark Streaming, comparing / Comparing Samza and Spark Streaming
Samza, layers
  streaming / Samza high-level architecture
  execution / Samza high-level architecture
  processing / Samza high-level architecture
Samza job
  executing / Running a Samza job
sbt
  URL / Getting started with Spark
Scala and Java source code, examples
  URL / Building and running the examples
Scala API
  about / Scala API
scalar data types
  int / Pig data types
  long / Pig data types
  float / Pig data types
  double / Pig data types
  chararray / Pig data types
  bytearray / Pig data types
  boolean / Pig data types
  datetime / Pig data types
  biginteger / Pig data types
  bigdecimal / Pig data types
Scala source code
  URL / Data processing on streams
SecondaryNameNode
  about / Secondary NameNode not to the rescue
  demerits / Secondary NameNode not to the rescue
secured cluster
  using, consequences / Consequences of using a secured cluster
security
  about / Security
sentiment analysis
  about / Sentiment of hashtags
SequenceFile
  about / General-purpose file formats
SequenceFile class, MapReduce job
  about / Sequence files
sequence files, MapReduce job
  about / Sequence files
  advantages / Sequence files
SerDe classes, Hive
  MetadataTypedColumnsetSerDe / File formats and storage
  ThriftSerDe / File formats and storage
  DynamicSerDe / File formats and storage
serialization
  about / Serialization and Containers
sharelib, Oozie
  about / The Oozie sharelib
SimpleDB
  about / SimpleDB and DynamoDB
Simple Storage Service (S3), AWS
  about / Simple Storage Service (S3)
  URL / Simple Storage Service (S3)
sources of information, Hadoop
  about / Sources of information
  source code / Source code
  mailing lists / Mailing lists and forums
  forums / Mailing lists and forums
  LinkedIn groups / LinkedIn groups
HUGs/HUGsconferences/Conferences
Sparkabout/ApacheSparkURL/ApacheSpark
SparkContextobject/ScalaAPISparkPipeline
about/SparkPipelineSparkSQL
about/SparkSQLdataanalysiswith/DataanalysiswithSparkSQL
SparkStreamingURL/SparkStreamingabout/SparkStreamingandSamza,comparing/ComparingSamzaandSparkStreaming
specialized join
  reference link / Join
speed of thought analysis / A different philosophy
SQL
  on data streams / SQL on data streams
  on data streams, URL / SQL on data streams
SQL-on-Hadoop
  need for / Why SQL on Hadoop
  solutions / Other SQL-on-Hadoop solutions
Sqoop
  about / Sqoop
  URL / Sqoop
Sqoop 1
  about / Sqoop
Sqoop 2
  about / Sqoop
standalone applications, Apache Spark
  writing / Writing and running standalone applications
  running / Writing and running standalone applications
statements
  about / Fundamentals of Apache Pig
Stinger initiative
  about / Stinger initiative
storage
  about / Storage
storage, Hadoop 2
  about / Storage in Hadoop 2
storage, Hive
  about / File formats and storage
  columnar stores / Columnar stores
Storm
  URL / How Samza works
  about / How Samza works
stream.py
  reference link / Programmatic access with Python
stream processing
  with Samza / Stream processing with Samza
streams
  data, processing on / Data processing on streams
systems management tools
  Cloudera Manager, integrating with / Cloudera Manager and other management tools
T

table partitioning
  about / Partitioning a table
  data, overwriting / Overwriting and updating data
  data, updating / Overwriting and updating data
  bucketing / Bucketing and sorting
  sorting / Bucketing and sorting
  data, sampling / Sampling data
Tajo
  URL / Drill, Tajo, and beyond
  about / Drill, Tajo, and beyond
tasks processing, Samza
  about / Stateful tasks
term frequency
  about / Calculate term frequency
  calculating, with TF-IDF / Calculate term frequency
text attribute, entity
  about / Tweet metadata
Text files
  about / General-purpose file formats
Tez
  about / Tez
  URL / Tez, Stinger initiative
  reference link, for canonical WordCount example / Tez
  Hive-on-tez / Hive-on-tez
  / An overview of Pig
TF-IDF
  about / Finding important words in text
  definition / Finding important words in text
  term frequency, calculating / Calculate term frequency
  document frequency, calculating / Calculate document frequency
  implementing / Putting it all together – TF-IDF
Thrift
  about / Thrift
TOBAG(expression) function / The tuple, bag, and map functions
TOMAP(expression) function / The tuple, bag, and map functions
tools, data lifecycle management
  orchestration services / Tools to help
  connectors / Tools to help
  file formats / Tools to help
TOP(n, column, relation) function / The tuple, bag, and map functions
TOTUPLE(expression) function / The tuple, bag, and map functions
troubleshooting
  about / Troubleshooting
tuples
  about / Fundamentals of Apache Pig
Tweet, structure
  reference link / Anatomy of a Tweet
tweet analysis capability
  building / Building a tweet analysis capability
  tweet data, obtaining / Getting the tweet data
  Oozie / Introducing Oozie
  derived data, producing / Producing derived data
tweet sentiment analysis
  performing / Tweet sentiment analysis
  bootstrap streams / Bootstrap streams
Twitter
  used, for generating dataset / Data processing with Hadoop
  URL / Data processing with Hadoop
  about / Why Twitter?
  signup page / Twitter credentials
  web form / Twitter credentials
Twitter data, properties
  unstructured / Why Twitter?
  structured / Why Twitter?
  graph / Why Twitter?
  geolocated / Why Twitter?
  real time / Why Twitter?
Twitter Search
  URL / Trending topics
Twitter stream
  analyzing / Analyzing the Twitter stream
  prerequisites / Prerequisites
  dataset exploration / Dataset exploration
  tweet metadata / Tweet metadata
  data preparation / Data preparation
  top n statistics / Top n statistics
  datetime manipulation / Datetime manipulation
  sessions / Sessions
  users' interaction, capturing / Capturing user interactions
  link analysis / Link analysis
  influential users, identifying / Influential users
U

union operation
  about / Concepts
updateFunc function / State management
User Defined Aggregate Functions (UDAFs) / Extending HiveQL
User Defined Functions (UDFs) / An overview of Pig, Extending HiveQL
  about / Fundamentals of Apache Pig
User Defined Table Functions (UDTF) / Extending HiveQL
V

versioning, Hadoop
  about / A note on versioning
VirtualBox
  reference link / Cloudera QuickStart VM
VMware
  reference link / Cloudera QuickStart VM
W

Whir
  about / Whir
  URL / Whir
Who to Follow service
  reference link / Influential users
window function
  adding / Windowing functions
WordCount
  in Java / WordCount in Java
WordCount example, MapReduce programs
  about / WordCount, the Hello World of MapReduce
  reference link, for source code / Word co-occurrences
workflow-app
  about / Introducing Oozie
workflow.xml file
  reference link / Extracting data and ingesting into Hive
workflows
  building, Oozie used / Pulling it all together
workloads
  Hive tables, structuring for / Structuring Hive tables for given workloads
wrapper classes
  about / Introducing the wrapper classes
WritableComparable interface
  about / The Comparable and WritableComparable interfaces
Writable interface
  about / The Writable interface
Y

YARN
  about / Computation in Hadoop 2, YARN, YARN in the real world – Computation beyond MapReduce
  architecture / YARN architecture
  components / The components of YARN
  processing frameworks / Thinking in layers
  processing models / Thinking in layers
  issues, with MapReduce / The problem with MapReduce
  Tez / Tez
  Apache Spark / Apache Spark
  Apache Samza / Apache Samza
  future / YARN today and beyond
  present situation / YARN today and beyond
  Samza, integrating / YARN integration
  Apache Spark on / Spark on YARN
  examples, running on / Running the examples on YARN
  URL / Running the examples on YARN
YARN API
  about / Thinking in layers
YARN application
  anatomy / Anatomy of a YARN application
  ApplicationMaster (AM) / Anatomy of a YARN application
  lifecycle / Lifecycle of a YARN application
  fault tolerance / Fault tolerance and monitoring
  monitoring / Fault tolerance and monitoring
  execution models / Execution models
Z

ZooKeeperFailoverController (ZKFC) / Automatic NameNode failover
ZooKeeper quorum / Automatic NameNode failover