Learning Hadoop 2
Table of Contents
Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Common building blocks
Storage
Computation
Better together
Hadoop 2 – what's the big deal?
Storage in Hadoop 2
Computation in Hadoop 2
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Simple Storage Service (S3)
Elastic MapReduce (EMR)
Getting started
Cloudera QuickStart VM
Amazon EMR
Creating an AWS account
Signing up for the necessary services
Using Elastic MapReduce
Getting Hadoop up and running
How to use EMR
AWS credentials
The AWS command-line interface
Running the examples
Data processing with Hadoop
Why Twitter?
Building our first dataset
One service, multiple APIs
Anatomy of a Tweet
Twitter credentials
Programmatic access with Python
Summary
2. Storage
The inner workings of HDFS
Cluster startup
NameNode startup
DataNode startup
Block replication
Command-line access to the HDFS filesystem
Exploring the HDFS filesystem
Protecting the filesystem metadata
Secondary NameNode not to the rescue
Hadoop 2 NameNode HA
Keeping the HA NameNodes in sync
Client configuration
How a failover works
Apache ZooKeeper – a different type of filesystem
Implementing a distributed lock with sequential ZNodes
Implementing group membership and leader election using ephemeral ZNodes
Java API
Building blocks
Further reading
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Hadoop interfaces
Java FileSystem API
Libhdfs
Thrift
Managing and serializing data
The Writable interface
Introducing the wrapper classes
Array wrapper classes
The Comparable and WritableComparable interfaces
Storing data
Serialization and Containers
Compression
General-purpose file formats
Column-oriented data formats
RCFile
ORC
Parquet
Avro
Using the Java API
Summary
3. Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
The Mapper class
The Reducer class
The Driver class
Combiner
Partitioning
The optional partition function
Hadoop-provided mapper and reducer implementations
Sharing reference data
Writing MapReduce programs
Getting started
Running the examples
Local cluster
Elastic MapReduce
WordCount, the Hello World of MapReduce
Word co-occurrences
Trending topics
The Top N pattern
Sentiment of hashtags
Text cleanup using chain mapper
Walking through a run of a MapReduce job
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reducer input
Reducer input
Reducer execution
Reducer output
Shutdown
Input/Output
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Sequence files
YARN
YARN architecture
The components of YARN
Anatomy of a YARN application
Lifecycle of a YARN application
Fault tolerance and monitoring
Thinking in layers
Execution models
YARN in the real world – Computation beyond MapReduce
The problem with MapReduce
Tez
Hive-on-tez
Apache Spark
Apache Samza
YARN-independent frameworks
YARN today and beyond
Summary
4. Real-time Computation with Samza
Stream processing with Samza
How Samza works
Samza high-level architecture
Samza's best friend – Apache Kafka
YARN integration
An independent model
Hello Samza!
Building a tweet parsing job
The configuration file
Getting Twitter data into Kafka
Running a Samza job
Samza and HDFS
Windowing functions
Multijob workflows
Tweet sentiment analysis
Bootstrap streams
Stateful tasks
Summary
5. Iterative Computation with Spark
Apache Spark
Cluster computing with working sets
Resilient Distributed Datasets (RDDs)
Actions
Deployment
Spark on YARN
Spark on EC2
Getting started with Spark
Writing and running standalone applications
Scala API
Java API
WordCount in Java
Python API
The Spark ecosystem
Spark Streaming
GraphX
MLlib
Spark SQL
Processing data with Apache Spark
Building and running the examples
Running the examples on YARN
Finding popular topics
Assigning a sentiment to topics
Data processing on streams
State management
Data analysis with Spark SQL
SQL on data streams
Comparing Samza and Spark Streaming
Summary
6. Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Grunt – the Pig interactive shell
Elastic MapReduce
Fundamentals of Apache Pig
Programming Pig
Pig data types
Pig functions
Load/store
Eval
The tuple, bag, and map functions
The math, string, and datetime functions
Dynamic invokers
Macros
Working with data
Filtering
Aggregation
Foreach
Join
Extending Pig (UDFs)
Contributed UDFs
Piggybank
Elephant Bird
Apache DataFu
Analyzing the Twitter stream
Prerequisites
Dataset exploration
Tweet metadata
Data preparation
Top n statistics
Datetime manipulation
Sessions
Capturing user interactions
Link analysis
Influential users
Summary
7. Hadoop and SQL
Why SQL on Hadoop
Other SQL-on-Hadoop solutions
Prerequisites
Overview of Hive
The nature of Hive tables
Hive architecture
Data types
DDL statements
File formats and storage
JSON
Avro
Columnar stores
Queries
Structuring Hive tables for given workloads
Partitioning a table
Overwriting and updating data
Bucketing and sorting
Sampling data
Writing scripts
Hive and Amazon Web Services
Hive and S3
Hive on Elastic MapReduce
Extending HiveQL
Programmatic interfaces
JDBC
Thrift
Stinger initiative
Impala
The architecture of Impala
Co-existing with Hive
A different philosophy
Drill, Tajo, and beyond
Summary
8. Data Lifecycle Management
What data lifecycle management is
Importance of data lifecycle management
Tools to help
Building a tweet analysis capability
Getting the tweet data
Introducing Oozie
A note on HDFS file permissions
Making development a little easier
Extracting data and ingesting into Hive
A note on workflow directory structure
Introducing HCatalog
Using HCatalog
The Oozie sharelib
HCatalog and partitioned tables
Producing derived data
Performing multiple actions in parallel
Calling a subworkflow
Adding global settings
Challenges of external data
Data validation
Validation actions
Handling format changes
Handling schema evolution with Avro
Final thoughts on using Avro schema evolution
Only make additive changes
Manage schema versions explicitly
Think about schema distribution
Collecting additional data
Scheduling workflows
Other Oozie triggers
Pulling it all together
Other tools to help
Summary
9. Making Development Easier
Choosing a framework
Hadoop streaming
Streaming word count in Python
Differences in jobs when using streaming
Finding important words in text
Calculate term frequency
Calculate document frequency
Putting it all together – TF-IDF
Kite Data
Data Core
Data HCatalog
Data Hive
Data MapReduce
Data Spark
Data Crunch
Apache Crunch
Getting started
Concepts
Data serialization
Data processing patterns
Aggregation and sorting
Joining data
Pipelines implementation and execution
SparkPipeline
MemPipeline
Crunch examples
Word co-occurrence
TF-IDF
Kite Morphlines
Concepts
Morphline commands
Summary
10. Running a Hadoop Cluster
I'm a developer – I don't care about operations!
Hadoop and DevOps practices
Cloudera Manager
To pay or not to pay
Cluster management using Cloudera Manager
Cloudera Manager and other management tools
Monitoring with Cloudera Manager
Finding configuration files
Cloudera Manager API
Cloudera Manager lock-in
Ambari – the open source alternative
Operations in the Hadoop 2 world
Sharing resources
Building a physical cluster
Physical layout
Rack awareness
Service layout
Upgrading a service
Building a cluster on EMR
Considerations about filesystems
Getting data into EMR
EC2 instances and tuning
Cluster tuning
JVM considerations
The small files problem
Map and reduce optimizations
Security
Evolution of the Hadoop security model
Beyond basic authorization
The future of Hadoop security
Consequences of using a secured cluster
Monitoring
Hadoop – where failures don't matter
Monitoring integration
Application-level metrics
Troubleshooting
Logging levels
Access to log files
ResourceManager, NodeManager, and ApplicationManager
Applications
Nodes
Scheduler
MapReduce
MapReduce v1
MapReduce v2 (YARN)
JobHistory Server
NameNode and DataNode
Summary
11. Where to Go Next
Alternative distributions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
And the rest…
Choosing a distribution
Other computational frameworks
Apache Storm
Apache Giraph
Apache HAMA
Other interesting projects
HBase
Sqoop
Whirr
Mahout
Hue
Other programming abstractions
Cascading
AWS resources
SimpleDB and DynamoDB
Kinesis
Data Pipeline
Sources of information
Source code
Mailing lists and forums
LinkedIn groups
HUGs
Conferences
Summary
Index
Learning Hadoop 2
Learning Hadoop 2

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2015

Production reference: 1060215

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78328-551-8

www.packtpub.com
Credits

Authors
Garry Turkington
Gabriele Modena

Reviewers
Atdhe Buja
Amit Gurdasani
Jakob Homan
James Lampton
Davide Setti
Valerie Parham-Thompson

Commissioning Editor
Edward Gordon

Acquisition Editor
Joanne Fitzpatrick

Content Development Editor
Vaibhav Pawar

Technical Editors
Indrajit A. Das
Menza Mathew

Copy Editors
Roshni Banerjee
Sarang Chari
Pranjali Chury

Project Coordinator
Kranti Berde

Proofreaders
Simran Bhogal
Martin Diver
Lawrence A. Herman
Paul Hindle

Indexer
Hemangini Bari

Graphics
Abhinash Sahu

Production Coordinator
Nitesh Thakur

Cover Work
Nitesh Thakur
About the Authors

Garry Turkington has over 15 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current role as the CTO at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams, building systems that process the Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and the USA.

He has BSc and PhD degrees in Computer Science from Queen's University Belfast in Northern Ireland, and a Master's degree in Engineering in Systems Engineering from Stevens Institute of Technology in the USA. He is the author of Hadoop Beginner's Guide, published by Packt Publishing in 2013, and is a committer on the Apache Samza project.

I would like to thank my wife Lea and mother Sarah for their support and patience through the writing of another book, and my daughter Maya for frequently cheering me up and asking me hard questions. I would also like to thank Gabriele for being such an amazing co-author on this project.

Gabriele Modena is a data scientist at Improve Digital. In his current position, he uses Hadoop to manage, process, and analyze behavioral and machine-generated data. Gabriele enjoys using statistical and computational methods to look for patterns in large amounts of data. Prior to his current job in ad tech, he held a number of positions in academia and industry where he did research in machine learning and artificial intelligence.

He holds a BSc degree in Computer Science from the University of Trento, Italy, and a Research MSc degree in Artificial Intelligence: Learning Systems from the University of Amsterdam in the Netherlands.

First and foremost, I want to thank Laura for her support, constant encouragement, and endless patience putting up with far too many "can't do, I'm working on the Hadoop book". She is my rock and I dedicate this book to her.

A special thank you goes to Amit, Atdhe, Davide, Jakob, James, and Valerie, whose invaluable feedback and commentary made this work possible.

Finally, I'd like to thank my co-author, Garry, for bringing me on board with this project; it has been a pleasure working together.
About the Reviewers

Atdhe Buja is a certified ethical hacker, DBA (MCITP, OCA 11g), and developer with good management skills. He is a DBA at the Agency for Information Society/Ministry of Public Administration, where he also manages some projects of e-governance, and has more than 10 years' experience working on SQL Server.

Atdhe is a regular columnist for UBT News. He holds an MSc degree in computer science and engineering and a bachelor's degree in management and information. He specializes in and is certified in many technologies, such as SQL Server (all versions), Oracle 11g, CEH, Windows Server, MS Project, SCOM 2012 R2, BizTalk, and integration business processes.

He was the reviewer of the book Microsoft SQL Server 2012 with Hadoop, published by Packt Publishing. His capabilities go beyond the aforementioned knowledge!

I thank Donika and my family for all the encouragement and support.

Amit Gurdasani is a software engineer at Amazon. He architects distributed systems to process product catalogue data. Prior to building high-throughput systems at Amazon, he worked on the entire software stack, both as a systems-level developer at Ericsson and IBM and as an application developer at Manhattan Associates. He maintains a strong interest in bulk data processing, data streaming, and service-oriented software architectures.

Jakob Homan has been involved with big data and the Apache Hadoop ecosystem for more than 5 years. He is a Hadoop committer as well as a committer for the Apache Giraph, Spark, Kafka, and Tajo projects, and is a PMC member. He has worked on bringing all these systems to scale at Yahoo! and LinkedIn.

James Lampton is a seasoned practitioner of all things data (big or small) with 10 years of hands-on experience in building and using large-scale data storage and processing platforms. He is a believer in holistic approaches to solving problems using the right tool for the right job. His favorite tools include Python, Java, Hadoop, Pig, Storm, and SQL (which sometimes I like and sometimes I don't). He has recently completed his PhD from the University of Maryland with the release of Pig Squeal: a mechanism for running Pig scripts on Storm.

I would like to thank my spouse, Andrea, and my son, Henry, for giving me time to read work-related things at home. I would also like to thank Garry, Gabriele, and the folks at Packt Publishing for the opportunity to review this manuscript, and for their patience and understanding, as my free time was consumed when writing my dissertation.
Davide Setti, after graduating in physics from the University of Trento, joined the SoNet research unit at the Fondazione Bruno Kessler in Trento, where he applied large-scale data analysis techniques to understand people's behaviors in social networks and in large collaborative projects such as Wikipedia.
In 2010, Davide moved to Fondazione, where he led the development of data analytic tools to support research on civic media, citizen journalism, and digital media.
In 2013, Davide became the CTO of SpazioDati, where he leads the development of tools to perform semantic analysis of massive amounts of data in the business information sector.

When not solving hard problems, Davide enjoys taking care of his family vineyard and playing with his two children.
www.PacktPub.com
Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?

- Fully searchable across every book published by Packt
- Copy and paste, print, and bookmark content
- On demand and accessible via a web browser
Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface

This book will take you on a hands-on exploration of the wonderful world that is Hadoop 2 and its rapidly growing ecosystem. Building on the solid foundation from the earlier versions of the platform, Hadoop 2 allows multiple data processing frameworks to be executed on a single Hadoop cluster.

To give an understanding of this significant evolution, we will explore both how these new models work and also show their applications in processing large data volumes with batch, iterative, and near-real-time algorithms.
What this book covers

Chapter 1, Introduction, gives the background to Hadoop and the Big Data problems it looks to solve. We also highlight the areas in which Hadoop 1 had room for improvement.

Chapter 2, Storage, delves into the Hadoop Distributed File System, where most data processed by Hadoop is stored. We examine the particular characteristics of HDFS, show how to use it, and discuss how it has improved in Hadoop 2. We also introduce ZooKeeper, another storage system within Hadoop, upon which many of its high-availability features rely.

Chapter 3, Processing – MapReduce and Beyond, first discusses the traditional Hadoop processing model and how it is used. We then discuss how Hadoop 2 has generalized the platform to use multiple computational models, of which MapReduce is merely one.

Chapter 4, Real-time Computation with Samza, takes a deeper look at one of these alternative processing models enabled by Hadoop 2. In particular, we look at how to process real-time streaming data with Apache Samza.

Chapter 5, Iterative Computation with Spark, delves into a very different alternative processing model. In this chapter, we look at how Apache Spark provides the means to do iterative processing.

Chapter 6, Data Analysis with Pig, demonstrates how Apache Pig makes the traditional computational model of MapReduce easier to use by providing a language to describe data flows.

Chapter 7, Hadoop and SQL, looks at how the familiar SQL language has been implemented atop data stored in Hadoop. Through the use of Apache Hive, and by describing alternatives such as Cloudera Impala, we show how Big Data processing can be made possible using existing skills and tools.

Chapter 8, Data Lifecycle Management, takes a look at the bigger picture of just how to manage all that data that is to be processed in Hadoop. Using Apache Oozie, we show how to build up workflows to ingest, process, and manage data.

Chapter 9, Making Development Easier, focuses on a selection of tools aimed at helping a developer get results quickly. Through the use of Hadoop streaming, Apache Crunch, and Kite, we show how the use of the right tool can speed up the development loop or provide new APIs with richer semantics and less boilerplate.

Chapter 10, Running a Hadoop Cluster, takes a look at the operational side of Hadoop. By focusing on the areas of interest to developers, such as cluster management, monitoring, and security, this chapter should help you to work better with your operations staff.

Chapter 11, Where to Go Next, takes you on a whirlwind tour through a number of other projects and tools that we feel are useful, but could not cover in detail in the book due to space constraints. We also give some pointers on where to find additional sources of information and how to engage with the various open source communities.
What you need for this book

Because most people don't have a large number of spare machines sitting around, we use the Cloudera QuickStart virtual machine for most of the examples in this book. This is a single machine image with all the components of a full Hadoop cluster pre-installed. It can be run on any host machine supporting either the VMware or the VirtualBox virtualization technology.

We also explore Amazon Web Services and how some of the Hadoop technologies can be run on the AWS Elastic MapReduce service. The AWS services can be managed through a web browser or a Linux command-line interface.
Who this book is for

This book is primarily aimed at application and system developers interested in learning how to solve practical problems using the Hadoop framework and related components. Although we show examples in a few programming languages, a strong foundation in Java is the main prerequisite.

Data engineers and architects might also find the material concerning data lifecycle, file formats, and computational models useful.
Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce .jar file to our environment before accessing individual fields."
A block of code is set as follows:

topic_edges_grouped = FOREACH topic_edges_grouped {
  GENERATE
    group.topic_id as topic,
    group.source_id as source,
    topic_edges.(destination_id, w) as edges;
}
Any command-line input or output is written as follows:

$ hdfs dfs -put target/elephant-bird-pig-4.5.jar hdfs:///jar/
$ hdfs dfs -put target/elephant-bird-hadoop-compat-4.5.jar hdfs:///jar/
$ hdfs dfs -put elephant-bird-core-4.5.jar hdfs:///jar/
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes, appear in the text like this: "Once the form is filled in, we need to review and accept the terms of service and click on the Create Application button in the bottom-left corner of the page."

Note
Warnings or important notes appear in a box like this.

Tip
Tips and tricks appear like this.
Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code

The source code for this book can be found on GitHub at https://github.com/learninghadoop2/book-examples. The authors will be applying any errata to this code and keeping it up to date as the technologies evolve. In addition, you can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code) we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Introduction
This book will teach you how to build amazing systems using the latest release of Hadoop. Before you change the world though, we need to do some groundwork, which is where this chapter comes in.
In this introductory chapter, we will cover the following topics:
A brief refresher on the background to Hadoop
A walk-through of Hadoop's evolution
The key elements in Hadoop 2
The Hadoop distributions we'll use in this book
The dataset we'll use for examples
A note on versioning
In Hadoop 1, the version history was somewhat convoluted, with multiple forked branches in the 0.2x range, leading to odd situations where a 1.x version could, in some situations, have fewer features than a 0.23 release. In the version 2 codebase, this is fortunately much more straightforward, but it's important to clarify exactly which version we will use in this book.
Hadoop 2.0 was released in alpha and beta versions, and along the way, several incompatible changes were introduced. There was, in particular, a major API stabilization effort between the beta and final release stages.
Hadoop 2.2.0 was the first general availability (GA) release of the Hadoop 2 codebase, and its interfaces are now declared stable and forward compatible. We will therefore use the 2.2 product and interfaces in this book. Though the principles will be usable on a 2.0 beta, there will, in particular, be API incompatibilities in the beta. This is particularly important as MapReduce v2 was back-ported to Hadoop 1 by several distribution vendors, but these products were based on the beta and not the GA APIs. If you are using such a product, then you will encounter these incompatible changes. It is recommended that a release based upon Hadoop 2.2 or later is used for both the development and the production deployments of any Hadoop 2 workloads.
The background of Hadoop
We're assuming that most readers will have a little familiarity with Hadoop, or at the very least, with big data-processing systems. Consequently, we won't give a detailed background in this book as to why Hadoop is successful or the types of problem it helps to solve. However, particularly because of some aspects of Hadoop 2 and the other products we will use in later chapters, it is useful to sketch how we see Hadoop fitting into the technology landscape and the particular problem areas where we believe it gives the most benefit.
In ancient times, before the term "big data" came into the picture (which equates to maybe a decade ago), there were few options to process datasets of sizes in terabytes and beyond. Some commercial databases could, with very specific and expensive hardware setups, be scaled to this level, but the expertise and capital expenditure required made it an option for only the largest organizations. Alternatively, one could build a custom system aimed at the specific problem at hand. This suffered from some of the same problems (expertise and cost) and added the risk inherent in any cutting-edge system. On the other hand, if a system was successfully constructed, it was likely a very good fit to the need.
Few small- to mid-size companies even worried about this space, not only because the solutions were out of their reach, but also because they generally didn't have anything close to the data volumes that required such solutions. As the ability to generate very large datasets became more common, so did the need to process that data.
Even though large data became more democratized and was no longer the domain of the privileged few, major architectural changes were required if data-processing systems were to be made affordable to smaller companies. The first big change was to reduce the required upfront capital expenditure on the system; that means no high-end hardware or expensive software licenses. Previously, high-end hardware would have been utilized most commonly in a relatively small number of very large servers and storage systems, each of which had multiple approaches to avoid hardware failures. Though very impressive, such systems are hugely expensive, and moving to a larger number of lower-end servers would be the quickest way to dramatically reduce the hardware cost of a new system. Moving more toward commodity hardware instead of the traditional enterprise-grade equipment would also mean a reduction in capabilities in the area of resilience and fault tolerance. Those responsibilities would need to be taken up by the software layer: smarter software, dumber hardware.
Google started the change that would eventually be known as Hadoop when, in 2003 and 2004, it released two academic papers describing the Google File System (GFS) (http://research.google.com/archive/gfs.html) and MapReduce (http://research.google.com/archive/mapreduce.html). The two together provided a platform for very large-scale data processing in a highly efficient manner. Google had taken the build-it-yourself approach, but instead of constructing something aimed at one specific problem or dataset, it created a platform on which multiple processing applications could be implemented. In particular, it utilized large numbers of commodity servers and built GFS and MapReduce in a way that assumed hardware failures would be commonplace and were simply something that the software needed to deal with.
At the same time, Doug Cutting was working on the Nutch open source web crawler. He was working on elements within the system that resonated strongly once the Google GFS and MapReduce papers were published. Doug started work on open source implementations of these Google ideas, and Hadoop was soon born, firstly as a subproject of Lucene, and then as its own top-level project within the Apache Software Foundation.
Yahoo! hired Doug Cutting in 2006 and quickly became one of the most prominent supporters of the Hadoop project. In addition to often publicizing some of the largest Hadoop deployments in the world, Yahoo! allowed Doug and other engineers to contribute to Hadoop while employed by the company, not to mention contributing back some of its own internally developed Hadoop improvements and extensions.
Components of Hadoop
The broad Hadoop umbrella project has many component subprojects, and we'll discuss several of them in this book. At its core, Hadoop provides two services: storage and computation. A typical Hadoop workflow consists of loading data into the Hadoop Distributed File System (HDFS) and processing it using the MapReduce API or one of the several tools that rely on MapReduce as an execution framework.
Hadoop 1: HDFS and MapReduce
Both layers are direct implementations of Google's own GFS and MapReduce technologies.
Common building blocks
Both HDFS and MapReduce exhibit several of the architectural principles described in the previous section. In particular, the common principles are as follows:
Both are designed to run on clusters of commodity (that is, low-to-medium specification) servers
Both scale their capacity by adding more servers (scale-out), as opposed to the previous models of using larger hardware (scale-up)
Both have mechanisms to identify and work around failures
Both provide most of their services transparently, allowing the user to concentrate on the problem at hand
Both have an architecture where a software cluster sits on the physical servers and manages aspects such as application load balancing and fault tolerance, without relying on high-end hardware to deliver these capabilities
Storage
HDFS is a filesystem, though not a POSIX-compliant one. This basically means that it does not display the same characteristics as a regular filesystem. In particular, the characteristics are as follows:
HDFS stores files in blocks that are typically at least 64 MB or (more commonly now) 128 MB in size, much larger than the 4-32 KB seen in most filesystems
HDFS is optimized for throughput over latency; it is very efficient at streaming reads of large files but poor when seeking for many small ones
HDFS is optimized for workloads that are generally write-once and read-many
Instead of handling disk failures by having physical redundancies in disk arrays or similar strategies, HDFS uses replication. Each of the blocks comprising a file is stored on multiple nodes within the cluster, and a service called the NameNode constantly monitors to ensure that failures have not dropped any block below the desired replication factor. If this does happen, then it schedules the making of another copy within the cluster.
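The block and replication model above is easy to quantify. The following plain-Python sketch (our own illustration, not any HDFS API) shows how a file is divided into fixed-size blocks and how replication multiplies the raw storage consumed across the cluster:

```python
# Conceptual sketch (not the HDFS API): how a large file is divided
# into fixed-size blocks, and how replication multiplies storage.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS block size
REPLICATION = 3                  # the HDFS default replication factor

def plan_blocks(file_size_bytes):
    """Return the number of HDFS blocks a file of this size occupies."""
    # Integer ceiling division: the final block may be partially filled.
    return (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE

one_gb = 1024 ** 3
blocks = plan_blocks(one_gb)     # a 1 GB file -> 8 blocks of 128 MB
stored = one_gb * REPLICATION    # raw bytes written across the cluster
print(blocks, stored // one_gb)  # 8 3
```

Note that a block only occupies the space its data actually needs; unlike a traditional filesystem, a file smaller than the block size does not consume a full block on disk.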
Computation
MapReduce is an API, an execution engine, and a processing paradigm; it provides a series of transformations from a source into a result dataset. In the simplest case, the input data is fed through a map function and the resultant temporary data is then fed through a reduce function.
MapReduce works best on semistructured or unstructured data. Instead of data conforming to rigid schemas, the requirement is instead that the data can be provided to the map function as a series of key-value pairs. The output of the map function is a set of other key-value pairs, and the reduce function performs aggregation to collect the final set of results.
Hadoop provides a standard specification (that is, interface) for the map and reduce phases, and the implementations of these are often referred to as mappers and reducers. A typical MapReduce application will comprise a number of mappers and reducers, and it's not unusual for several of these to be extremely simple. The developer focuses on expressing the transformation between the source and the resultant data, and the Hadoop framework manages all aspects of job execution and coordination.
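The map, shuffle, and reduce phases described above can be mimicked in a few lines of plain Python. This is a conceptual sketch of the paradigm (the classic word count), not the Hadoop mapper/reducer API itself; the function names are our own:

```python
# A minimal word count in plain Python, mimicking the MapReduce flow:
# map emits (key, value) pairs, a shuffle groups them by key, and
# reduce aggregates each group. Illustrative only, not Hadoop code.
from collections import defaultdict

def mapper(line):
    # Emit one ("word", 1) pair per word in the input line
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    # Aggregate all values seen for one key
    return key, sum(values)

def run_job(lines):
    # Shuffle phase: group intermediate values by key
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key
    return dict(reducer(k, v) for k, v in groups.items())

counts = run_job(["Hadoop stores data", "Hadoop processes data"])
print(counts["hadoop"], counts["data"])  # 2 2
```

In real Hadoop, the framework performs the shuffle and runs many mapper and reducer instances in parallel across the cluster; the developer supplies only the two functions.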
Better together
It is possible to appreciate the individual merits of HDFS and MapReduce, but they are even more powerful when combined. They can be used individually, but when they are together, they bring out the best in each other, and this close interworking was a major factor in the success and acceptance of Hadoop 1.
When a MapReduce job is being planned, Hadoop needs to decide on which host to execute the code in order to process the dataset most efficiently. If the MapReduce cluster hosts are all pulling their data from a single storage host or array, then this largely doesn't matter as the storage system is a shared resource that will cause contention. If the storage system were more transparent and allowed MapReduce to manipulate its data more directly, then there would be an opportunity to perform the processing closer to the data, building on the principle of it being less expensive to move processing than data.
The most common deployment model for Hadoop sees the HDFS and MapReduce clusters deployed on the same set of servers. Each host that contains data and the HDFS component to manage the data also hosts a MapReduce component that can schedule and execute data processing. When a job is submitted to Hadoop, it can use the locality optimization to schedule data processing on the hosts where the data resides as much as possible, thus minimizing network traffic and maximizing performance.
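The locality optimization amounts to a simple preference rule: run each task on a host that already holds a replica of its data, and fall back to a remote host (paying the network cost) only when that is impossible. The function and data structures below are hypothetical illustrations of the idea, not Hadoop's actual scheduler:

```python
# Sketch of the locality optimization: prefer a host that stores the
# block locally; fall back to any free host otherwise. The structures
# here are our own illustration, not Hadoop's scheduler internals.
def schedule(block_locations, free_hosts):
    """Map each block to a host, preferring hosts holding a replica."""
    assignments = {}
    for block, replicas in block_locations.items():
        # Free hosts that already store a replica of this block
        local = [h for h in replicas if h in free_hosts]
        # Local execution if possible, otherwise a (network-bound) remote host
        assignments[block] = local[0] if local else next(iter(free_hosts))
    return assignments

blocks = {"b1": ["host1", "host2"], "b2": ["host3"]}
print(schedule(blocks, {"host1", "host3"}))
```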
Hadoop 2 – what's the big deal?
If we look at the two main components of the core Hadoop distribution, storage and computation, we see that Hadoop 2 has a very different impact on each of them. Whereas the HDFS found in Hadoop 2 is mostly a much more feature-rich and resilient product than the HDFS in Hadoop 1, for MapReduce the changes are much more profound and have, in fact, altered how Hadoop is perceived as a processing platform in general. Let's look at HDFS in Hadoop 2 first.
Storage in Hadoop 2
We'll discuss the HDFS architecture in more detail in Chapter 2, Storage, but for now it's sufficient to think of a master-slave model. The slave nodes (called DataNodes) hold the actual filesystem data. In particular, each host running a DataNode will typically have one or more disks onto which files containing the data for each HDFS block are written. The DataNode itself has no understanding of the overall filesystem; its role is to store, serve, and ensure the integrity of the data for which it is responsible.
The master node (called the NameNode) is responsible for knowing which of the DataNodes holds which block and how these blocks are structured to form the filesystem. When a client looks at the filesystem and wishes to retrieve a file, it's via a request to the NameNode that the list of required blocks is retrieved.
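The division of responsibilities between the NameNode and the DataNodes can be modeled in miniature. Everything below (class and attribute names included) is an illustrative toy of the metadata model, not HDFS code:

```python
# Toy model of the HDFS master-slave split: the NameNode holds only
# file -> block metadata; the DataNodes hold the block bytes.
# Illustrative sketch, not the real HDFS implementation.
class NameNode:
    def __init__(self):
        self.files = {}            # filename -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of DataNode names

    def add_file(self, name, block_ids, datanodes):
        self.files[name] = list(block_ids)
        for b in block_ids:
            self.block_locations[b] = set(datanodes)

    def lookup(self, name):
        # A client asks the NameNode only for metadata; it then reads
        # the block bytes directly from the DataNodes returned here.
        return [(b, self.block_locations[b]) for b in self.files[name]]

nn = NameNode()
nn.add_file("/logs/day1", ["blk_1", "blk_2"], ["dn1", "dn2", "dn3"])
print(nn.lookup("/logs/day1")[0][0])  # blk_1
```

The sketch makes the resiliency risk discussed below concrete: lose the `files` and `block_locations` dictionaries and the block bytes on the DataNodes become unreadable, because nothing else records which blocks form which files.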
This model works well and has been scaled to clusters with tens of thousands of nodes at companies such as Yahoo!. So, though it is scalable, there is a resiliency risk; if the NameNode becomes unavailable, then the entire cluster is rendered effectively useless. No HDFS operations can be performed, and since the vast majority of installations use HDFS as the storage layer for services such as MapReduce, these also become unavailable even if they are still running without problems.
More catastrophically, the NameNode stores the filesystem metadata in a persistent file on its local filesystem. If the NameNode host crashes in a way that leaves this data unrecoverable, then all data on the cluster is effectively lost forever. The data will still exist on the various DataNodes, but the mapping of which blocks comprise which files is lost. This is why, in Hadoop 1, the best practice was to have the NameNode synchronously write its filesystem metadata to both local disks and at least one remote network volume (typically via NFS).
Several NameNode high-availability (HA) solutions have been made available by third-party suppliers, but the core Hadoop product did not offer such resilience in version 1. Given this architectural single point of failure and the risk of data loss, it won't be a surprise to hear that NameNode HA is one of the major features of HDFS in Hadoop 2 and is something we'll discuss in detail in later chapters. The feature not only provides a standby NameNode that can be automatically promoted to service all requests should the active NameNode fail, but also builds additional resilience for the critical filesystem metadata atop this mechanism.
HDFS in Hadoop 2 is still a non-POSIX filesystem; it still has a very large block size and it still trades latency for throughput. However, it does now have a few capabilities that can make it look a little more like a traditional filesystem. In particular, the core HDFS in Hadoop 2 can now be remotely mounted as an NFS volume. This is another feature that was previously offered as a proprietary capability by third-party suppliers but is now in the main Apache codebase.
Overall, the HDFS in Hadoop 2 is more resilient and can be more easily integrated into existing workflows and processes. It's a strong evolution of the product found in Hadoop 1.
Computation in Hadoop 2
The work on HDFS 2 was started before a direction for MapReduce crystallized. This was likely due to the fact that features such as NameNode HA were such an obvious path that the community knew the most critical areas to address. However, MapReduce didn't really have a similar list of areas for improvement, and that's why, when the MRv2 initiative started, it wasn't completely clear where it would lead.
Perhaps the most frequent criticism of MapReduce in Hadoop 1 was how its batch-processing model was ill-suited to problem domains where faster response times were required. Hive, for example, which we'll discuss in Chapter 7, Hadoop and SQL, provides a SQL-like interface onto HDFS data, but behind the scenes the statements are converted into MapReduce jobs that are then executed like any other. A number of other products and tools took a similar approach, providing a specific user-facing interface that hid a MapReduce translation layer.
Though this approach has been very successful, and some amazing products have been built, the fact remains that in many cases there is a mismatch: all of these interfaces, some of which expect a certain type of responsiveness, are, behind the scenes, being executed on a batch-processing platform. When looking to enhance MapReduce, improvements could be made to make it a better fit for these use cases, but the fundamental mismatch would remain. This situation led to a significant change of focus for the MRv2 initiative; perhaps MapReduce itself didn't need to change, and the real need was instead to enable different processing models on the Hadoop platform. Thus was born Yet Another Resource Negotiator (YARN).
Looking at MapReduce in Hadoop 1, the product actually did two quite different things: it provided the processing framework to execute MapReduce computations, but it also managed the allocation of this computation across the cluster. Not only did it direct data to and between the specific map and reduce tasks, but it also determined where each task would run and managed the full job lifecycle, monitoring the health of each task and node, rescheduling if any failed, and so on.
This is not a trivial task, and the automated parallelization of workloads has always been one of the main benefits of Hadoop. If we look at MapReduce in Hadoop 1, we see that after the user defines the key criteria for the job, everything else is the responsibility of the system. Critically, from a scale perspective, the same MapReduce job can be applied to datasets of any volume hosted on clusters of any size. If the data is 1 GB in size and on a single host, then Hadoop will schedule the processing accordingly. If the data is instead 1 PB in size and hosted across 1,000 machines, then it does likewise. From the user's perspective, the actual scale of the data and cluster is transparent, and aside from affecting the time taken to process the job, it does not change the interface with which they interact with the system.
In Hadoop 2, this role of job scheduling and resource management is separated from that of executing the actual application, and is implemented by YARN.
YARN is responsible for managing the cluster resources, and so MapReduce exists as an application that runs atop the YARN framework. The MapReduce interface in Hadoop 2 is completely compatible with that in Hadoop 1, both semantically and practically. However, under the covers, MapReduce has become a hosted application on the YARN framework.
The significance of this split is that other applications can be written that provide processing models more focused on the actual problem domain and can offload all the resource management and scheduling responsibilities to YARN. The latest versions of many different execution engines have been ported onto YARN, either in a production-ready or experimental state, and this has shown that the approach can allow a single Hadoop cluster to run everything from batch-oriented MapReduce jobs through fast-response SQL queries to continuous data streaming, and even to implement models such as graph processing and the Message Passing Interface (MPI) from the High Performance Computing (HPC) world. The following diagram shows the architecture of Hadoop 2:
Hadoop 2
This is why much of the attention and excitement around Hadoop 2 has been focused on YARN and the frameworks that sit on top of it, such as Apache Tez and Apache Spark. With YARN, the Hadoop cluster is no longer just a batch-processing engine; it is the single platform on which a vast array of processing techniques can be applied to the enormous data volumes stored in HDFS. Moreover, applications can build on these computation paradigms and execution models.
The analogy that is achieving some traction is to think of YARN as the processing kernel upon which other domain-specific applications can be built. We'll discuss YARN in more detail in this book, particularly in Chapter 3, Processing – MapReduce and Beyond, Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark.
Distributions of Apache Hadoop
In the very early days of Hadoop, the burden of installing (often building from source) and managing each component and its dependencies fell on the user. As the system became more popular and the ecosystem of third-party tools and libraries started to grow, the complexity of installing and managing a Hadoop deployment increased dramatically, to the point where providing a coherent offering of software packages, documentation, and training built around the core Apache Hadoop has become a business model. Enter the world of distributions for Apache Hadoop.
Hadoop distributions are conceptually similar to how Linux distributions provide a set of integrated software around a common core. They take on the burden of bundling and packaging software themselves and provide the user with an easy way to install, manage, and deploy Apache Hadoop and a selected number of third-party libraries. In particular, the distribution releases deliver a series of product versions that are certified to be mutually compatible. Historically, putting together a Hadoop-based platform was often greatly complicated by the various version interdependencies.
Cloudera (http://www.cloudera.com), Hortonworks (http://www.hortonworks.com), and MapR (http://www.mapr.com) are amongst the first to have reached the market, each characterized by different approaches and selling points. Hortonworks positions itself as the open source player; Cloudera is also committed to open source but adds proprietary bits for configuring and managing Hadoop; MapR provides a hybrid open source/proprietary Hadoop distribution characterized by a proprietary NFS layer instead of HDFS and a focus on providing services.
Another strong player in the distributions ecosystem is Amazon, which offers a version of Hadoop called Elastic MapReduce (EMR) on top of the Amazon Web Services (AWS) infrastructure.
With the advent of Hadoop 2, the number of available distributions for Hadoop has increased dramatically, far in excess of the four we mentioned. A possibly incomplete list of software offerings that include Apache Hadoop can be found at http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support.
A dual approach
In this book, we will discuss both the building and the management of local Hadoop clusters, in addition to showing how to push the processing into the cloud via EMR.
The reason for this is twofold: firstly, though EMR makes Hadoop much more accessible, there are aspects of the technology that only become apparent when manually administering the cluster. Although it is also possible to use EMR in a more manual mode, we'll generally use a local cluster for such explorations. Secondly, though it isn't necessarily an either/or decision, many organizations use a mixture of in-house and cloud-hosted capacities, sometimes due to a concern about over-reliance on a single external provider; but practically speaking, it's often convenient to do development and small-scale tests on local capacity and then deploy at production scale into the cloud.
In a few of the later chapters, where we discuss additional products that integrate with Hadoop, we'll mostly give examples on local clusters, as there is no difference in how the products work regardless of where they are deployed.
AWS – infrastructure on demand from Amazon
AWS is a set of cloud-computing services offered by Amazon. We will use several of these services in this book.
Simple Storage Service (S3)
Amazon's Simple Storage Service (S3), found at http://aws.amazon.com/s3/, is a storage service that provides a simple key-value storage model. Using web, command-line, or programmatic interfaces to create objects, which can be anything from text files to images to MP3s, you can store and retrieve your data based on a hierarchical model. In this model, you create buckets that contain objects. Each bucket has a unique identifier, and within each bucket, every object is uniquely named. This simple strategy enables an extremely powerful service for which Amazon takes complete responsibility (for service scaling, in addition to reliability and availability of data).
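The bucket/object model just described can be captured in a few lines. The following is a plain-Python toy (not the AWS SDK; all class and method names here are our own) showing uniquely identified buckets containing uniquely named objects:

```python
# Toy sketch of the S3 data model: unique buckets, each holding
# uniquely keyed objects. Illustration only, not the AWS SDK.
class ToyS3:
    def __init__(self):
        self.buckets = {}

    def create_bucket(self, bucket):
        # Bucket identifiers must be unique
        if bucket in self.buckets:
            raise ValueError("bucket names must be unique")
        self.buckets[bucket] = {}

    def put_object(self, bucket, key, data):
        # Within a bucket, each object is uniquely named by its key
        self.buckets[bucket][key] = data

    def get_object(self, bucket, key):
        return self.buckets[bucket][key]

s3 = ToyS3()
s3.create_bucket("my-data")
s3.put_object("my-data", "input/file1.txt", b"hello")
print(s3.get_object("my-data", "input/file1.txt"))  # b'hello'
```

Keys such as "input/file1.txt" merely look hierarchical; in the real service, as in this sketch, a key is just a flat string, and the path-like naming is a convention layered on top.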
Elastic MapReduce (EMR)
Amazon's Elastic MapReduce, found at http://aws.amazon.com/elasticmapreduce/, is basically Hadoop in the cloud. Using any of the multiple interfaces (web console, CLI, or API), a Hadoop workflow is defined with attributes such as the number of Hadoop hosts required and the location of the source data. The Hadoop code implementing the MapReduce jobs is provided, and the virtual Go button is pressed.
In its most impressive mode, EMR can pull source data from S3, process it on a Hadoop cluster it creates on Amazon's virtual host on-demand service, EC2, push the results back into S3, and terminate the Hadoop cluster and the EC2 virtual machines hosting it. Naturally, each of these services has a cost (usually on a per-GB-stored and server-time-usage basis), but the ability to access such powerful data-processing capabilities with no need for dedicated hardware is a powerful one.
Getting started
We will now describe the two environments we will use throughout the book: Cloudera's QuickStart virtual machine will be our reference system on which we will show all examples, but we will additionally demonstrate some examples on Amazon's EMR when there is some particularly valuable aspect to running the example in the on-demand service.
Although the examples and code provided are aimed at being as general-purpose and portable as possible, our reference setup, when talking about a local cluster, will be Cloudera running atop CentOS Linux.
For the most part, we will show examples that make use of, or are executed from, a terminal prompt. Although Hadoop's graphical interfaces have improved significantly over the years (for example, the excellent HUE and Cloudera Manager), when it comes to development, automation, and programmatic access to the system, the command line is still the most powerful tool for the job.
All examples and source code presented in this book can be downloaded from https://github.com/learninghadoop2/book-examples. In addition, we have a homepage for the book where we will publish updates and related material at http://learninghadoop2.com.
Cloudera QuickStart VM
One of the advantages of Hadoop distributions is that they give access to easy-to-install, packaged software. Cloudera takes this one step further and provides a freely downloadable Virtual Machine instance of its latest distribution, known as the CDH QuickStart VM, deployed on top of CentOS Linux.
In the remaining parts of this book, we will use the CDH 5.0.0 VM as the reference and baseline system to run examples and source code. Images of the VM are available for the VMware (http://www.vmware.com/nl/products/player/), KVM (http://www.linux-kvm.org/page/Main_Page), and VirtualBox (https://www.virtualbox.org/) virtualization systems.
AmazonEMRBeforeusingElasticMapReduce,weneedtosetupanAWSaccountandregisteritwiththenecessaryservices.
CreatinganAWSaccountAmazonhasintegrateditsgeneralaccountswithAWS,whichmeansthat,ifyoualreadyhaveanaccountforanyoftheAmazonretailwebsites,thisistheonlyaccountyouwillneedtouseAWSservices.
Note

Note that AWS services have a cost; you will need an active credit card associated with the account to which charges can be made.
If you require a new Amazon account, go to http://aws.amazon.com, select Create a new AWS account, and follow the prompts. Amazon has added a free tier for some services, so you might find that, in the early days of testing and exploration, you are keeping many of your activities within the noncharged tier. The scope of the free tier has been expanding, so make sure you know what you will and won't be charged for.
Signing up for the necessary services

Once you have an Amazon account, you will need to register it for use with the required AWS services, that is, Simple Storage Service (S3), Elastic Compute Cloud (EC2), and Elastic MapReduce. There is no cost to simply sign up to any AWS service; the process just makes the service available to your account.
Go to the S3, EC2, and EMR pages linked from http://aws.amazon.com, click on the Sign up button on each page, and then follow the prompts.
Using Elastic MapReduce

Having created an account with AWS and registered all the required services, we can proceed to configure programmatic access to EMR.
Getting Hadoop up and running

Note

Caution! This costs real money!
Before going any further, it is critical to understand that use of AWS services will incur charges that will appear on the credit card associated with your Amazon account. Most of the charges are quite small and increase with the amount of infrastructure consumed; storing 10 GB of data in S3 costs 10 times more than 1 GB, and running 20 EC2 instances costs 20 times as much as a single one. There are tiered cost models, so the actual costs tend to have smaller marginal increases at higher levels. But you should read carefully through the pricing sections for each service before using any of them. Note also that currently data transfer out of AWS services, such as EC2 and S3, is chargeable, but data transfer between services is not. This means it is often most cost-effective to carefully design your use of AWS to keep data within AWS through as much of the data processing as possible. For information regarding AWS and EMR, consult http://aws.amazon.com/elasticmapreduce/#pricing.
How to use EMR

Amazon provides both web and command-line interfaces to EMR. Both interfaces are just a frontend to the very same system; a cluster created with the command-line interface can be inspected and managed with the web tools, and vice versa.
For the most part, we will be using the command-line tools to create and manage clusters programmatically, and will fall back on the web interface in cases where it makes sense to do so.
AWS credentials

Before using either programmatic or command-line tools, we need to look at how an account holder authenticates to AWS to make such requests.
Each AWS account has several identifiers, such as the following, that are used when accessing the various services:
Account ID: each AWS account has a numeric ID.
Access key: the associated access key is used to identify the account making the request.
Secret access key: the partner to the access key is the secret access key. The access key is not a secret and could be exposed in service requests, but the secret access key is what you use to validate yourself as the account owner. Treat it like your credit card.
Key pairs: these are the key pairs used to log in to EC2 hosts. It is possible to either generate public/private key pairs within EC2 or to import externally generated keys into the system.
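The access-key/secret-key split can be illustrated with a small sketch: the access key travels with the request, while the secret key is only ever used locally to compute a signature that the service verifies. This is a conceptual toy, not the real AWS Signature Version 4 algorithm; the credential values and header names are made up:

```python
import hmac
import hashlib

# Hypothetical credentials -- in practice these come from IAM.
ACCESS_KEY = "AKIAEXAMPLE"
SECRET_KEY = "wJalrXUtnFEMI/EXAMPLEKEY"

def sign_request(secret_key, payload):
    """Compute an HMAC-SHA256 signature over the request payload.

    Only the signature is sent; the secret key never leaves the client.
    """
    return hmac.new(secret_key.encode("utf-8"),
                    payload.encode("utf-8"),
                    hashlib.sha256).hexdigest()

# The request carries the (public) access key and the signature;
# the server looks up the secret for that access key and verifies it.
payload = "GET /learninghadoop2?list-type=2"
signature = sign_request(SECRET_KEY, payload)
request_headers = {
    "X-Access-Key": ACCESS_KEY,   # identifies the account
    "X-Signature": signature,     # proves possession of the secret
}
```

This is why the secret access key must be treated like a credit card: anyone holding it can produce valid signatures for your account.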
User credentials and permissions are managed via a web service called Identity and Access Management (IAM), which you need to sign up to in order to obtain access and secret keys.
If this sounds confusing, it's because it is, at least at first. When using a tool to access an AWS service, there's usually the single, upfront step of adding the right credentials to a configuration file, and then everything just works. However, if you do decide to explore programmatic or command-line tools, it will be worth investing a little time to read the documentation for each service to understand how its security works. More information on creating an AWS account and obtaining access credentials can be found at http://docs.aws.amazon.com/iam.
The AWS command-line interface

Each AWS service historically had its own set of command-line tools. Recently though, Amazon has created a single, unified command-line tool that allows access to most services. The Amazon CLI can be found at http://aws.amazon.com/cli.
It can be installed from a tarball or via the pip or easy_install package managers.
On the CDH QuickStart VM, we can install awscli using the following command:
$ pip install awscli
In order to access the API, we need to configure the software to authenticate to AWS using our access and secret keys.
This is also a good moment to set up an EC2 key pair by following the instructions provided at https://console.aws.amazon.com/ec2/home?region=us-east-1#c=EC2&s=KeyPairs.
Although a key pair is not strictly necessary to run an EMR cluster, it will give us the capability to remotely log in to the master node and gain low-level access to the cluster.
The following command will guide you through a series of configuration steps and store the resulting configuration in the ~/.aws/credentials file:
$ aws configure
Once the CLI is configured, we can query AWS with aws <service> <arguments>. To create and query an S3 bucket, use something like the following commands. Note that S3 bucket names need to be globally unique across all AWS accounts, so most common names, such as s3://mybucket, will not be available:
$ aws s3 mb s3://learninghadoop2
$ aws s3 ls
We can provision an EMR cluster with five m1.xlarge nodes using the following command:
$ aws emr create-cluster --name "EMR cluster" \
    --ami-version 3.2.0 \
    --instance-type m1.xlarge \
    --instance-count 5 \
    --log-uri s3://learninghadoop2/emr-logs
Here, --ami-version is the ID of an Amazon Machine Image template (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html), and --log-uri instructs EMR to collect logs and store them in the learninghadoop2 S3 bucket.
Note

If you did not specify a default region when setting up the AWS CLI, then you will also have to add one to most EMR commands using the --region argument; for example, --region eu-west-1 to use the EU (Ireland) region. You can find details of all available AWS regions at http://docs.aws.amazon.com/general/latest/gr/rande.html.
We can submit workflows by adding steps to a running cluster using the following command:
$ aws emr add-steps --cluster-id <cluster> --steps <steps>
To terminate the cluster, use the following command line:
$ aws emr terminate-clusters --cluster-id <cluster>
In later chapters, we will show you how to add steps to execute MapReduce jobs and Pig scripts.
More information on using the AWS CLI can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage.html.
Running the examples

The source code of all examples is available at https://github.com/learninghadoop2/book-examples.
Gradle (http://www.gradle.org/) scripts and configurations are provided to compile most of the Java code. The gradlew script included with the examples will bootstrap Gradle and use it to fetch dependencies and compile code.
JAR files can be created by invoking the jar task via the gradlew script, as follows:
./gradlew jar
Jobs are usually executed by submitting a JAR file using the hadoop jar command, as follows:
$ hadoop jar example.jar <MainClass> [-libjars $LIBJARS] arg1 arg2 … argN
The optional -libjars parameter specifies runtime third-party dependencies to ship to remote nodes.
Note

Some of the frameworks we will work with, such as Apache Spark, come with their own build and package management tools. Additional information and resources will be provided for these particular cases.
The copyJar Gradle task can be used to download third-party dependencies into build/libjars/<example>/lib, as follows:
./gradlew copyJar
For convenience, we provide a fatJar Gradle task that bundles the example classes and their dependencies into a single JAR file. Although this approach is discouraged in favor of using -libjars, it might come in handy when dealing with dependency issues.
The following command will generate build/libs/<example>-all.jar:
$ ./gradlew fatJar
Data processing with Hadoop

In the remaining chapters of this book, we will introduce the core components of the Hadoop ecosystem as well as a number of third-party tools and libraries that will make writing robust, distributed code an accessible and hopefully enjoyable task. While reading this book, you will learn how to collect, process, store, and extract information from large amounts of structured and unstructured data.
We will use a dataset generated from Twitter's (http://www.twitter.com) real-time firehose. This approach will allow us to experiment with relatively small datasets locally and, once ready, scale the examples up to production-level data sizes.
Why Twitter?

Thanks to its programmatic APIs, Twitter provides an easy way to generate datasets of arbitrary size and inject them into our local- or cloud-based Hadoop clusters. Other than the sheer size, the dataset that we will use has a number of properties that fit several interesting data modeling and processing use cases.
Twitter data possesses the following properties:
Unstructured: each status update is a text message that can contain references to media content such as URLs and images
Structured: tweets are timestamped, sequential records
Graph: relationships such as replies and mentions can be modeled as a network of interactions
Geolocated: the location where a tweet was posted or where a user resides
Real time: all data generated on Twitter is available via a real-time firehose
These properties will be reflected in the types of application that we can build with Hadoop. These include examples of sentiment analysis, social network analysis, and trend analysis.
Building our first dataset

Twitter's terms of service prohibit redistribution of user-generated data in any form; for this reason, we cannot make available a common dataset. Instead, we will use a Python script to programmatically access the platform and create a dump of user tweets collected from a live stream.
One service, multiple APIs

Twitter users share more than 200 million tweets, also known as status updates, a day. The platform offers access to this corpus of data via four types of APIs, each of which represents a facet of Twitter and aims at satisfying specific use cases, such as linking and interacting with Twitter content from third-party sources (Twitter for Products), programmatic access to specific users' or sites' content (REST), search capabilities across users' or sites' timelines (Search), and access to all content created on the Twitter network in real time (Streaming).
The Streaming API allows direct access to the Twitter stream, tracking keywords, retrieving geotagged tweets from a certain region, and much more. In this book, we will make use of this API as a data source to illustrate both the batch and real-time capabilities of Hadoop. We will not, however, interact with the API itself; rather, we will make use of third-party libraries to offload chores such as authentication and connection management.
Anatomy of a Tweet

Each tweet object returned by a call to the real-time APIs is represented as a serialized JSON string that contains a set of attributes and metadata in addition to a textual message. This additional content includes a numerical ID that uniquely identifies the tweet, the location where the tweet was shared, the user who shared it (user object), whether it was republished by other users (retweeted) and how many times (retweet count), the machine-detected language of its text, whether the tweet was posted in reply to someone and, if so, the user and tweet IDs it replied to, and so on.
The structure of a Tweet, and any other object exposed by the API, is constantly evolving. An up-to-date reference can be found at https://dev.twitter.com/docs/platform-objects/tweets.
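To make this structure concrete, the snippet below parses a small, hand-crafted, tweet-like JSON document. The field names follow the attributes described above, but the object is heavily trimmed and the values are invented; real tweets carry many more fields:

```python
import json

# A hypothetical, heavily trimmed tweet object for illustration only.
raw = '''{
  "id": 123456789012345678,
  "text": "Learning Hadoop 2!",
  "lang": "en",
  "retweet_count": 2,
  "in_reply_to_status_id": null,
  "user": {"id": 42, "screen_name": "example_user"},
  "coordinates": null
}'''

tweet = json.loads(raw)
print(tweet['id'])                     # numerical ID identifying the tweet
print(tweet['text'])                   # the unstructured message body
print(tweet['user']['screen_name'])    # nested user object
print(tweet['in_reply_to_status_id'])  # None when the tweet is not a reply
```

This parse-then-extract pattern is exactly what stream.py does, later in this chapter, for each message arriving on the stream.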
Twitter credentials

Twitter makes use of the OAuth protocol to authenticate and authorize access from third-party software to its platform.
The application obtains, through an external channel such as a web form, the following pair of credentials:
Consumer key
Consumer secret
The consumer secret is never directly transmitted to the third party, as it is used to sign each request.
The user authorizes the application to access the service via a three-way process that, once completed, grants the application a token consisting of the following:
Access token
Access secret
Similarly to the consumer secret, the access secret is never directly transmitted to the third party, and it is used to sign each request.
In order to use the Streaming API, we will first need to register an application and grant it programmatic access to the system. If you require a new Twitter account, proceed to the signup page at https://twitter.com/signup, and fill in the required information. Once this step is completed, we need to create a sample application that will access the API on our behalf and grant it the proper authorization rights. We will do so using the web form found at https://dev.twitter.com/apps.
When creating a new app, we are asked to give it a name, a description, and a URL. The following screenshot shows the settings of a sample application named Learning Hadoop 2 Book Dataset. For the purpose of this book, we do not need to specify a valid URL, so we used a placeholder instead.
Once the form is filled in, we need to review and accept the terms of service and click on the Create Application button in the bottom-left corner of the page.
We are now presented with a page that summarizes our application details, as seen in the following screenshot; the authentication and authorization credentials can be found under the OAuth Tool tab.
We are finally ready to generate our very first Twitter dataset.
Programmatic access with Python

In this section, we will use Python and the tweepy library, found at https://github.com/tweepy/tweepy, to collect Twitter's data. The stream.py file found in the ch1 directory of the book code archive instantiates a listener to the real-time firehose, grabs a data sample, and echoes each tweet's text to standard output.
The tweepy library can be installed using either the easy_install or pip package managers, or by cloning the repository at https://github.com/tweepy/tweepy.
On the CDH QuickStart VM, we can install tweepy using the following command line:
$ pip install tweepy
When invoked with the -j parameter, the script will output a JSON tweet to standard output; -t extracts and prints the text field. We specify how many tweets to print with -n <num tweets>. When -n is not specified, the script will run indefinitely. Execution can be terminated by pressing Ctrl + C.
The script expects OAuth credentials to be stored as shell environment variables; the following credentials will have to be set in the terminal session from where stream.py will be executed:
$ export TWITTER_CONSUMER_KEY="your_consumer_key"
$ export TWITTER_CONSUMER_SECRET="your_consumer_secret"
$ export TWITTER_ACCESS_KEY="your_access_key"
$ export TWITTER_ACCESS_SECRET="your_access_secret"
Once the required dependency has been installed and the OAuth data in the shell environment has been set, we can run the program as follows:
$ python stream.py -t -n 1000 > tweets.txt
We are relying on Linux's shell I/O to redirect the output of stream.py with the > operator to a file called tweets.txt. If everything was executed correctly, you should see a wall of text, where each line is a tweet.
Notice that in this example, we did not make use of Hadoop at all. In the next chapters, we will show how to import a dataset generated from the Streaming API into Hadoop and analyze its contents on the local cluster and Amazon EMR.
Fornow,let’stakealookatthesourcecodeofstream.py,whichcanbefoundathttps://github.com/learninghadoop2/book-examples/blob/master/ch1/stream.py:
import tweepy
import os
import json
import argparse

consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_key = os.environ['TWITTER_ACCESS_KEY']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

class EchoStreamListener(tweepy.StreamListener):
    def __init__(self, api, dump_json=False, numtweets=0):
        self.api = api
        self.dump_json = dump_json
        self.count = 0
        self.limit = int(numtweets)
        super(tweepy.StreamListener, self).__init__()

    def on_data(self, tweet):
        tweet_data = json.loads(tweet)
        if 'text' in tweet_data:
            if self.dump_json:
                print tweet.rstrip()
            else:
                print tweet_data['text'].encode("utf-8").rstrip()
            self.count = self.count + 1
        return False if self.count == self.limit else True

    def on_error(self, status_code):
        return True

    def on_timeout(self):
        return True

…

if __name__ == '__main__':
    parser = get_parser()
    args = parser.parse_args()

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    sapi = tweepy.streaming.Stream(
        auth, EchoStreamListener(
            api=api,
            dump_json=args.json,
            numtweets=args.numtweets))
    sapi.sample()
First, we import our dependencies: tweepy, plus the os, json, and argparse modules from the Python standard library.
We then define a class, EchoStreamListener, that inherits and extends StreamListener from tweepy. As the name suggests, StreamListener listens for events and tweets being published on the real-time stream and performs actions accordingly.
Whenever a new event is detected, it triggers a call to on_data(). In this method, we extract the text field from a tweet object and print it to standard output with UTF-8 encoding. Alternatively, if the script is invoked with -j, we print the whole JSON tweet.

When the script is executed, we instantiate a tweepy.OAuthHandler object with the OAuth credentials that identify our Twitter account, and then we use this object to authenticate with the application access and secret key. We then use the auth object to create an instance of the tweepy.API class (api).
Upon successful authentication, we tell Python to listen for events on the real-time stream using EchoStreamListener.
An HTTP GET request to the statuses/sample endpoint is performed by sample(). The request returns a random sample of all public statuses.
Note

Beware! By default, sample() will run indefinitely. Remember to explicitly terminate the method call by pressing Ctrl + C.
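The get_parser() helper is elided from the listing above. A minimal version, written here as a guess matching the -j, -t, and -n flags described earlier (the real helper in the book repository may differ), might look like this:

```python
import argparse

def get_parser():
    """A hypothetical reconstruction of the elided get_parser() helper,
    exposing the -j, -t, and -n options described in the text."""
    parser = argparse.ArgumentParser(
        description='Dump tweets from the Twitter stream')
    parser.add_argument('-j', '--json', action='store_true',
                        help='print the whole JSON tweet')
    parser.add_argument('-t', '--text', action='store_true',
                        help='print only the text field')
    parser.add_argument('-n', '--numtweets', default=0,
                        help='number of tweets to fetch (0 = run forever)')
    return parser

# Mirrors the invocation: python stream.py -t -n 1000
args = get_parser().parse_args(['-t', '-n', '1000'])
```

Note that numtweets is kept as a string here; EchoStreamListener converts it with int(numtweets) in its constructor.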
Summary

This chapter gave a whirlwind tour of where Hadoop came from, its evolution, and why the version 2 release is such a major milestone. We also described the emerging market in Hadoop distributions and how we will use a combination of local and cloud distributions in the book.
Finally, we described how to set up the needed software, accounts, and environments required in subsequent chapters and demonstrated how to pull data from the Twitter stream that we will use for examples.
With this background out of the way, we will now move on to a detailed examination of the storage layer within Hadoop.
Chapter 2. Storage

After the overview of Hadoop in the previous chapter, we will now start looking at its various component parts in more detail. We will start at the conceptual bottom of the stack in this chapter: the means and mechanisms for storing data within Hadoop. In particular, we will discuss the following topics:
Describe the architecture of the Hadoop Distributed File System (HDFS)
Show what enhancements to HDFS have been made in Hadoop 2
Explore how to access HDFS using command-line tools and the Java API
Give a brief description of ZooKeeper, another (sort of) filesystem within Hadoop
Survey considerations for storing data in Hadoop and the available file formats
In Chapter 3, Processing – MapReduce and Beyond, we will describe how Hadoop provides the framework to allow data to be processed.
The inner workings of HDFS

In Chapter 1, Introduction, we gave a very high-level overview of HDFS; we will now explore it in a little more detail. As mentioned in that chapter, HDFS can be viewed as a filesystem, though one with very specific performance characteristics and semantics. It's implemented with two main server processes: the NameNode and the DataNodes, configured in a master/slave setup. If you view the NameNode as holding all the filesystem metadata and the DataNodes as holding the actual filesystem data (blocks), then this is a good starting point. Every file placed onto HDFS will be split into multiple blocks that might reside on numerous DataNodes, and it's the NameNode that understands how these blocks can be combined to construct the files.
ClusterstartupLet’sexplorethevariousresponsibilitiesofthesenodesandthecommunicationbetweenthembyassumingwehaveanHDFSclusterthatwaspreviouslyshutdownandthenexaminingthestartupbehavior.
NameNodestartupWe’llfirstlyconsiderthestartupoftheNameNode(thoughthereisnoactualorderingrequirementforthisandwearedoingitfornarrativereasonsalone).TheNameNodeactuallystorestwotypesofdataaboutthefilesystem:
The structure of the filesystem, that is, directory names, filenames, locations, and attributes
The blocks that comprise each file on the filesystem
This data is stored in files that the NameNode reads at startup. Note that the NameNode does not persistently store the mapping of the blocks that are stored on particular DataNodes; we'll see how that information is communicated shortly.
Because the NameNode relies on this in-memory representation of the filesystem, it tends to have quite different hardware requirements compared to the DataNodes. We'll explore hardware selection in more detail in Chapter 10, Running a Hadoop Cluster; for now, just remember that the NameNode tends to be quite memory hungry. This is particularly true on very large clusters with many (millions or more) files, particularly if these files have very long names. This scaling limitation on the NameNode has also led to an additional Hadoop 2 feature that we will not explore in much detail: NameNode federation, whereby multiple NameNodes (or NameNode HA pairs) work collaboratively to provide the overall metadata for the full filesystem.
The main file written by the NameNode is called fsimage; this is the single most important piece of data in the entire cluster, as without it, the knowledge of how to reconstruct all the data blocks into the usable filesystem is lost. This file is read into memory and all future modifications to the filesystem are applied to this in-memory representation of the filesystem. The NameNode does not write out new versions of fsimage as changes are applied while it is running; instead, it writes another file called edits, which is a list of the changes that have been made since the last version of fsimage was written.
The NameNode startup process is to first read the fsimage file, then to read the edits file and apply all the changes stored in the edits file to the in-memory copy of fsimage. It then writes to disk a new up-to-date version of the fsimage file and is ready to receive client requests.
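The fsimage-plus-edits startup sequence is a classic checkpointing pattern. The sketch below models it conceptually, with the metadata as a plain dictionary and the edit log as a list of operations; the real on-disk formats are binary and far richer:

```python
# Conceptual model of NameNode startup: load the last checkpoint (fsimage),
# replay the edit log, then write a fresh checkpoint.

fsimage = {'/user': 'dir', '/user/a.txt': 'file'}   # last saved snapshot
edits = [                                            # changes made since then
    ('mkdir', '/tmp'),
    ('create', '/tmp/b.txt'),
    ('delete', '/user/a.txt'),
]

def replay(image, log):
    """Apply each logged operation to an in-memory copy of the image."""
    image = dict(image)
    for op, path in log:
        if op == 'mkdir':
            image[path] = 'dir'
        elif op == 'create':
            image[path] = 'file'
        elif op == 'delete':
            image.pop(path, None)
    return image

fsimage = replay(fsimage, edits)   # up-to-date image, ready to be written out
edits = []                         # a new, empty edit log starts accumulating
```

The same replay step is what can make a restart slow on a busy cluster, which is the motivation for the periodic checkpointing discussed later in this chapter.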
DataNode startup

When the DataNodes start up, they first catalog the blocks for which they hold copies. Typically, these blocks will be written simply as files on the local DataNode filesystem.
The DataNode will perform some block consistency checking and then report to the NameNode the list of blocks for which it has valid copies. This is how the NameNode constructs the final mapping it requires: by learning which blocks are stored on which DataNodes. Once the DataNode has registered itself with the NameNode, an ongoing series of heartbeat requests will be sent between the nodes to allow the NameNode to detect DataNodes that have shut down, become unreachable, or have newly entered the cluster.
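The block-report mechanism amounts to inverting per-node lists into a block-to-locations map. A minimal sketch, with made-up node and block names:

```python
from collections import defaultdict

# Hypothetical block reports: each DataNode lists the block IDs it holds.
block_reports = {
    'datanode1': ['blk_1', 'blk_2'],
    'datanode2': ['blk_2', 'blk_3'],
    'datanode3': ['blk_1', 'blk_3'],
}

# The NameNode inverts the reports into the mapping it never persists:
# block ID -> set of DataNodes holding a valid copy.
block_map = defaultdict(set)
for node, blocks in block_reports.items():
    for blk in blocks:
        block_map[blk].add(node)

print(sorted(block_map['blk_2']))   # ['datanode1', 'datanode2']
```

Because this map is rebuilt from reports on every startup, losing a DataNode never corrupts the NameNode's metadata; the map simply converges to whatever the surviving nodes report.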
Block replication

HDFS replicates each block onto multiple DataNodes; the default replication factor is 3, but this is configurable on a per-file level. HDFS can also be configured to be able to determine whether given DataNodes are in the same physical hardware rack or not. Given smart block placement and this knowledge of the cluster topology, HDFS will attempt to place the second replica on a different host but in the same equipment rack as the first, and the third on a host outside the rack. In this way, the system can survive the failure of as much as a full rack of equipment and still have at least one live replica for each block. As we'll see in Chapter 3, Processing – MapReduce and Beyond, knowledge of block placement also allows Hadoop to schedule processing as near as possible to a replica of each block, which can greatly improve performance.
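The placement policy described above can be sketched in a few lines. This is a conceptual illustration with an invented topology, not HDFS's actual placement code, which weighs many more factors such as node load and available space:

```python
# Hypothetical cluster topology: host -> rack.
topology = {
    'host1': 'rack1', 'host2': 'rack1',
    'host3': 'rack2', 'host4': 'rack2',
}

def place_replicas(writer, topology):
    """Pick three hosts following the placement described in the text:
    first replica on the writer's host, second on a different host in
    the same rack, third on a host outside that rack."""
    local_rack = topology[writer]
    same_rack = [h for h, r in topology.items()
                 if r == local_rack and h != writer]
    other_rack = [h for h, r in topology.items() if r != local_rack]
    return [writer, same_rack[0], other_rack[0]]

replicas = place_replicas('host1', topology)
```

With this layout, losing all of rack1 still leaves the copy on the rack2 host, which is exactly the failure mode the policy is designed to survive.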
Remember that replication is a strategy for resilience, but it is not a backup mechanism; if you have data mastered in HDFS that is critical, then you need to consider backup or other approaches that give protection against errors, such as accidentally deleted files, against which replication will not defend.
When the NameNode starts up and is receiving the block reports from the DataNodes, it will remain in safe mode until a configurable threshold of blocks (the default is 99.9 percent) have been reported as live. While in safe mode, clients cannot make any modifications to the filesystem.
Command-line access to the HDFS filesystem

Within the Hadoop distribution, there is a command-line utility called hdfs, which is the primary way to interact with the filesystem from the command line. Run this without any arguments to see the various subcommands available. There are many, though; several are used to do things like starting or stopping various HDFS components. The general form of the hdfs command is:
hdfs <sub-command> <command> [arguments]
The two main subcommands we will use in this book are:
dfs: This is used for general filesystem access and manipulation, including reading/writing and accessing files and directories.
dfsadmin: This is used for administration and maintenance of the filesystem. We will not cover this command in detail, though. Have a look at the -report command, which gives a listing of the state of the filesystem and all DataNodes:
$ hdfs dfsadmin -report
Note

Note that the dfs and dfsadmin commands can also be used with the main Hadoop command-line utility, for example, hadoop fs -ls /. This was the approach in earlier versions of Hadoop but is now deprecated in favor of the hdfs command.
Exploring the HDFS filesystem

Run the following to get a list of the available commands provided by the dfs subcommand:
$ hdfs dfs
As will be seen from the output of the preceding command, many of these look similar to standard Unix filesystem commands and, not surprisingly, they work as would be expected. In our test VM, we have a user account called cloudera. Using this user, we can list the root of the filesystem as follows:
$ hdfs dfs -ls /
Found 7 items
drwxr-xr-x   - hbase hbase               0 2014-04-04 15:18 /hbase
drwxr-xr-x   - hdfs  supergroup          0 2014-10-21 13:16 /jar
drwxr-xr-x   - hdfs  supergroup          0 2014-10-15 15:26 /schema
drwxr-xr-x   - solr  solr                0 2014-04-04 15:16 /solr
drwxrwxrwt   - hdfs  supergroup          0 2014-11-12 11:29 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2014-07-13 09:05 /user
drwxr-xr-x   - hdfs  supergroup          0 2014-04-04 15:15 /var
The output is very similar to the Unix ls command. The file attributes work the same as the user/group/world attributes on a Unix filesystem (including the t sticky bit, as can be seen) plus details of the owner, group, and modification time of the directories. The column between the group name and the modified date is the size; this is 0 for directories but will have a value for files, as we'll see in the code following the next information box.
Note

If relative paths are used, they are taken from the home directory of the user. If there is no home directory, we can create it using the following commands:

$ sudo -u hdfs hdfs dfs -mkdir /user/cloudera
$ sudo -u hdfs hdfs dfs -chown cloudera:cloudera /user/cloudera

The mkdir and chown steps require superuser privileges (sudo -u hdfs).
$ hdfs dfs -mkdir testdir
$ hdfs dfs -ls
Found 1 items
drwxr-xr-x   - cloudera cloudera          0 2014-11-13 11:21 testdir
Then, we can create a file, copy it to HDFS, and read its contents directly from its location on HDFS, as follows:
$ echo "Hello world" > testfile.txt
$ hdfs dfs -put testfile.txt testdir
Note that there is an older command called -copyFromLocal, which works in the same way as -put; you might see it in older documentation online. Now, run the following command and check the output:
$ hdfs dfs -ls testdir
Found 1 items
-rw-r--r--   3 cloudera cloudera         12 2014-11-13 11:21 testdir/testfile.txt
Note the new column between the file attributes and the owner; this is the replication factor of the file. Now, finally, run the following command:
$ hdfs dfs -tail testdir/testfile.txt
Hello world
Most of the remaining dfs subcommands are pretty intuitive; play around. We'll explore snapshots and programmatic access to HDFS later in this chapter.
Protecting the filesystem metadata

Because the fsimage file is so critical to the filesystem, its loss is a catastrophic failure. In Hadoop 1, where the NameNode was a single point of failure, the best practice was to configure the NameNode to synchronously write the fsimage and edits files to both local storage plus at least one other location on a remote filesystem (often NFS). In the event of NameNode failure, a replacement NameNode could be started using this up-to-date copy of the filesystem metadata. The process would require non-trivial manual intervention, however, and would result in a period of complete cluster unavailability.
Secondary NameNode not to the rescue

The most unfortunately named component in all of Hadoop 1 was the SecondaryNameNode, which, not unreasonably, many people expect to be some sort of backup or standby NameNode. It is not; instead, the SecondaryNameNode was responsible only for periodically reading the latest version of the fsimage and edits files and creating a new up-to-date fsimage with the outstanding edits applied. On a busy cluster, this checkpoint could significantly speed up the restart of the NameNode by reducing the number of edits it had to apply before being able to service clients.

In Hadoop 2, the naming is clearer; there are Checkpoint nodes, which perform the role previously played by the SecondaryNameNode, plus Backup NameNodes, which keep a local up-to-date copy of the filesystem metadata, even though the process to promote a Backup node to be the primary NameNode is still a multistage manual process.
Hadoop 2 NameNode HA

In most production Hadoop 2 clusters, however, it makes more sense to use the full High Availability (HA) solution instead of relying on Checkpoint and Backup nodes. It is actually an error to try to combine NameNode HA with the Checkpoint and Backup node mechanisms.

The core idea is for a pair (currently no more than two are supported) of NameNodes configured in an active/passive cluster. One NameNode acts as the live master that services all client requests, and the second remains ready to take over should the primary fail. In particular, Hadoop 2 HDFS enables this HA through two mechanisms:

Providing a means for both NameNodes to have consistent views of the filesystem
Providing a means for clients to always connect to the master NameNode

Keeping the HA NameNodes in sync

There are actually two mechanisms by which the active and standby NameNodes keep their views of the filesystem consistent: the use of an NFS share, or the Quorum Journal Manager (QJM).

In the NFS case, there is an obvious requirement on an external remote NFS file share; note that, as the use of NFS was best practice in Hadoop 1 for a second copy of filesystem metadata, many clusters already have one. If high availability is a concern, though, it should be borne in mind that making NFS highly available often requires high-end and expensive hardware. In Hadoop 2's NFS-based HA, the NFS location becomes the primary location for the filesystem metadata. As the active NameNode writes all filesystem changes to the NFS share, the standby node detects these changes and updates its copy of the filesystem metadata accordingly.
The QJM mechanism uses an external service (the Journal Managers) instead of a filesystem. The Journal Manager cluster is an odd number of services (3, 5, and 7 are the most common) running on that number of hosts. All changes to the filesystem are submitted to the QJM service, and a change is treated as committed only when a majority of the QJM nodes have committed the change. The standby NameNode receives change updates from the QJM service and uses this information to keep its copy of the filesystem metadata up to date.
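The majority-commit rule at the heart of the QJM can be sketched in a few lines of Java. This is a hypothetical illustration of the rule itself, not Hadoop code (the class and method names are our own): a change counts as committed only once a strict majority of journal nodes have acknowledged it.

```java
import java.util.List;

// Hypothetical sketch of the QJM majority-commit rule; not Hadoop code.
public class QuorumCommit {

    // A change is committed only when a strict majority of the
    // journal nodes have acknowledged (committed) it.
    public static boolean isCommitted(List<Boolean> acks) {
        long acked = acks.stream().filter(a -> a).count();
        return acked > acks.size() / 2;
    }

    public static void main(String[] args) {
        // Three journal nodes: two acks out of three is a majority.
        System.out.println(isCommitted(List.of(true, true, false)));
        // One ack out of three is not.
        System.out.println(isCommitted(List.of(true, false, false)));
    }
}
```

An odd ensemble size is preferred because it maximizes the number of failures tolerated per node: with 3 journal nodes one failure is survivable, with 5 two failures, and so on.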
The QJM mechanism does not require additional hardware, as the journal nodes are lightweight and can be co-located with other services. There is also no single point of failure in the model. Consequently, QJM-based HA is usually the preferred option.
In either case, both in NFS-based HA and QJM-based HA, the DataNodes send block status reports to both NameNodes to ensure that both have up-to-date information on the mapping of blocks to DataNodes. Remember that this block assignment information is not held in the fsimage/edits data.
Client configuration

The clients to the HDFS cluster remain mostly unaware of the fact that NameNode HA is being used. The configuration files need to include the details of both NameNodes, but the mechanisms for determining which is the active NameNode, and when to switch to the standby, are fully encapsulated in the client libraries. The fundamental concept, though, is that instead of referring to an explicit NameNode host as in Hadoop 1, HDFS in Hadoop 2 identifies a nameservice ID for the NameNode, within which multiple individual NameNodes (each with its own NameNode ID) are defined for HA. Note that the concept of the nameservice ID is also used by NameNode federation, which we briefly mentioned earlier.
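As an illustration, here is a minimal hdfs-site.xml fragment declaring an HA nameservice. The property names are real Hadoop 2 configuration keys, but the nameservice ID (mycluster), NameNode IDs (nn1, nn2), and hostnames are example values of our choosing:

```xml
<!-- Example values only: "mycluster", "nn1"/"nn2", and the hostnames
     are illustrative; the property names are real Hadoop 2 keys. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
```

Clients then address the filesystem as hdfs://mycluster, and the client library resolves whichever NameNode is currently active.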
How a failover works

Failover can be either manual or automatic. A manual failover requires an administrator to trigger the switch that promotes the standby to the currently active NameNode. Though automatic failover has the greatest impact on maintaining system availability, there might be conditions in which it is not always desirable. Triggering a manual failover requires running only a few commands and, therefore, even in this mode, the failover is significantly easier than in the case of Hadoop 1 or with Hadoop 2 Backup nodes, where the transition to a new NameNode requires substantial manual effort.

Regardless of whether the failover is triggered manually or automatically, it has two main phases: confirmation that the previous master is no longer serving requests, and the promotion of the standby to be the master.

The greatest risk in a failover is to have a period in which both NameNodes are servicing requests. In such a situation, it is possible that conflicting changes might be made to the filesystem on the two NameNodes or that they might become out of sync. Even though this should not be possible if the QJM is being used (it only ever accepts connections from a single client), out-of-date information might be served to clients, who might then try to make incorrect decisions based on this stale metadata. This is, of course, particularly likely if the previous master NameNode is behaving incorrectly in some way, which is why the need for the failover was identified in the first place.

To ensure only one NameNode is active at any time, a fencing mechanism is used to validate that the existing NameNode master has been shut down. The simplest included mechanism will try to ssh into the NameNode host and actively kill the process, though a custom script can also be executed, so the mechanism is flexible. The failover will not continue until the fencing is successful and the system has confirmed that the previous master NameNode is now dead and has released any required resources.

Once fencing succeeds, the standby NameNode becomes the master and will start writing to the NFS-mounted fsimage and edits logs if NFS is being used for HA, or will become the single client to the QJM if that is the HA mechanism.

Before discussing automatic failover, we need a slight segue to introduce another Apache project that is used to enable this feature.
Apache ZooKeeper – a different type of filesystem

Within Hadoop, we will mostly talk about HDFS when discussing filesystems and data storage. But, inside almost all Hadoop 2 installations, there is another service that looks somewhat like a filesystem, but which provides significant capability crucial to the proper functioning of distributed systems. This service is Apache ZooKeeper (http://zookeeper.apache.org) and, as it is a key part of the implementation of HDFS HA, we will introduce it in this chapter. It is, however, also used by multiple other Hadoop components and related projects, so we will touch on it several more times throughout the book.

ZooKeeper started out as a subcomponent of HBase and was used to enable several operational capabilities of the service. When any complex distributed system is built, there are a series of activities that are almost always required and which are always difficult to get right. These activities include things such as handling shared locks, detecting component failure, and supporting leader election within a group of collaborating services. ZooKeeper was created as the coordination service that would provide a series of primitive operations upon which HBase could implement these types of operationally critical features. Note that ZooKeeper also takes inspiration from the Google Chubby system described at http://research.google.com/archive/chubby-osdi06.pdf.

ZooKeeper runs as a cluster of instances referred to as an ensemble. The ensemble provides a data structure, which is somewhat analogous to a filesystem. Each location in the structure is called a ZNode and can have children as if it were a directory but can also have content as if it were a file. Note that ZooKeeper is not a suitable place to store very large amounts of data, and by default, the maximum amount of data in a ZNode is 1 MB. At any point in time, one server in the ensemble is the master and makes all decisions about client requests. There are very well-defined rules around the responsibilities of the master, including that it has to ensure that a request is only committed when a majority of the ensemble have committed the change, and that once committed any conflicting change is rejected.

You should have ZooKeeper installed within your Cloudera Virtual Machine. If not, use Cloudera Manager to install it as a single node on the host. In production systems, ZooKeeper has very specific semantics around absolute majority voting, so some of the logic only makes sense in a larger ensemble (3, 5, or 7 nodes are the most common sizes).

There is a command-line client to ZooKeeper called zookeeper-client in the Cloudera VM; note that in the vanilla ZooKeeper distribution it is called zkCli.sh. If you run it with no arguments, it will connect to the ZooKeeper server running on the local machine. From here, you can type help to get a list of commands.
The most immediately interesting commands will be create, ls, and get. As the names suggest, these create a ZNode, list the ZNodes at a particular point in the filesystem, and get the data stored at a particular ZNode. Here are some examples of usage.
Create a ZNode with no data:

$ create /zk-test ''

Create a child of the first ZNode and store some text in it:

$ create /zk-test/child1 'sample data'

Retrieve the data associated with a particular ZNode:

$ get /zk-test/child1

The client can also register a watcher on a given ZNode; this will raise an alert if the ZNode in question changes, with either its data or children being modified.

This might not sound very useful, but ZNodes can additionally be created as both sequential and ephemeral nodes, and this is where the magic starts.
Implementing a distributed lock with sequential ZNodes

If a ZNode is created within the CLI with the -s option, it will be created as a sequential node. ZooKeeper will suffix the supplied name with a 10-digit integer guaranteed to be unique and greater than any other sequential children of the same ZNode. We can use this mechanism to create a distributed lock. ZooKeeper itself is not holding the actual lock; the clients need to understand what particular states in ZooKeeper mean in terms of their mapping to the application locks in question.

If we create a (non-sequential) ZNode at /zk-lock, then any client wishing to hold the lock will create a sequential child node. For example, the create -s /zk-lock/locknode command might create the node /zk-lock/locknode-0000000001 in the first case, with increasing integer suffixes for subsequent calls. When a client creates a ZNode under the lock, it will then check if its sequential node has the lowest integer suffix. If it does, then it is treated as having the lock. If not, then it will need to wait until the node holding the lock is deleted. The client will usually put a watch on the node with the next lowest suffix and then be alerted when that node is deleted, indicating that it now holds the lock.
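The decision each client makes (do I hold the lock, and if not, which node should I watch?) can be modelled without a ZooKeeper server at all. The following sketch is our own illustration: the node names mimic what create -s would produce, and an in-memory sorted set stands in for the ensemble.

```java
import java.util.Optional;
import java.util.TreeSet;

// Sketch of the sequential-ZNode lock protocol, using an in-memory
// sorted set in place of a real ZooKeeper ensemble.
public class SequentialLock {

    // Lexicographic order of the zero-padded suffixes matches
    // numeric order, as with real sequential ZNodes.
    private final TreeSet<String> children = new TreeSet<>();

    public String createSequential(int seq) {
        String name = String.format("locknode-%010d", seq);
        children.add(name);
        return name;
    }

    // A client holds the lock iff its node has the lowest suffix.
    public boolean holdsLock(String myNode) {
        return myNode.equals(children.first());
    }

    // Otherwise it watches the node immediately below its own.
    public Optional<String> nodeToWatch(String myNode) {
        return Optional.ofNullable(children.lower(myNode));
    }

    // Deleting a node is how a client releases the lock.
    public void delete(String node) {
        children.remove(node);
    }
}
```

When the holder deletes its node (releasing the lock), the watch on the next-lowest node fires, and that client re-checks and finds it now holds the lock. Watching only the immediately preceding node, rather than the lock root, avoids a "herd effect" where every waiter wakes on every release.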
Implementing group membership and leader election using ephemeral ZNodes

Any ZooKeeper client will send heartbeats to the server throughout the session, showing that it is alive. For the ZNodes we have discussed until now, we can say that they are persistent and will survive across sessions. We can, however, create a ZNode as ephemeral, meaning it will disappear once the client that created it either disconnects or is detected as being dead by the ZooKeeper server. Within the CLI, an ephemeral ZNode is created by adding the -e flag to the create command.

Ephemeral ZNodes are a good mechanism to implement group membership discovery within a distributed system. For any system where nodes can fail, join, and leave without notice, knowing which nodes are alive at any point in time is often a difficult task. Within ZooKeeper, we can provide the basis for such discovery by having each node create an ephemeral ZNode at a certain location in the ZooKeeper filesystem. The ZNodes can hold data about the service nodes, such as hostname, IP address, port number, and so on. To get a list of live nodes, we can simply list the child nodes of the parent group ZNode. Because of the nature of ephemeral nodes, we can have confidence that the list of live nodes retrieved at any time is up to date.

If we have each service node create ZNode children that are not just ephemeral but also sequential, then we can also build a mechanism for leader election for services that need to have a single master node at any one time. The mechanism is the same as for locks; the client service node creates the sequential and ephemeral ZNode and then checks if it has the lowest sequence number. If so, then it is the master. If not, then it will register a watcher on the next lowest sequence node to be alerted when it might become the master.
Java API

The org.apache.zookeeper.ZooKeeper class is the main programmatic client to access a ZooKeeper ensemble. Refer to the javadocs for the full details, but the basic interface is relatively straightforward, with an obvious one-to-one correspondence to commands in the CLI. For example:

create: is equivalent to CLI create
getChildren: is equivalent to CLI ls
getData: is equivalent to CLI get
Building blocks

As can be seen, ZooKeeper provides a small number of well-defined operations with very strong semantic guarantees that can be built into higher-level services, such as the locks, group membership, and leader election we discussed earlier. It's best to think of ZooKeeper as a toolkit of well-engineered and reliable functions critical to distributed systems that can be built upon without having to worry about the intricacies of their implementation. The provided ZooKeeper interface is quite low-level, though, and there are a few higher-level interfaces emerging that provide more of a mapping of the low-level primitives onto application-level logic. The Curator project (http://curator.apache.org/) is a good example of this.

ZooKeeper was used sparingly within Hadoop 1, but it's now quite ubiquitous. It's used by both MapReduce and HDFS for the high availability of their JobTracker and NameNode components. Hive and Impala, which we will explore later, use it to place locks on data tables that are being accessed by multiple concurrent jobs. Kafka, which we'll discuss in the context of Samza, uses ZooKeeper for node (broker, in Kafka terminology) membership, leader election, and state management.
Further reading

We have not described ZooKeeper in much detail and have completely omitted aspects such as its ability to apply quotas and access control lists to ZNodes within the filesystem and the mechanisms to build callbacks. Our purpose here was to give enough of the details so that you would have some idea of how it is being used within the Hadoop services we explore in this book. For more information, consult the project home page.
Automatic NameNode failover

Now that we have introduced ZooKeeper, we can show how it is used to enable automatic NameNode failover.

Automatic NameNode failover introduces two new components to the system: a ZooKeeper quorum, and the ZooKeeperFailoverController (ZKFC), which runs on each NameNode host. The ZKFC creates an ephemeral ZNode in ZooKeeper and holds this ZNode for as long as it detects the local NameNode to be alive and functioning correctly. It determines this by continuously sending simple health-check requests to the NameNode, and if the NameNode fails to respond correctly over a short period of time, the ZKFC will assume the NameNode has failed. If a NameNode machine crashes or otherwise fails, the ZKFC session in ZooKeeper will be closed and the ephemeral ZNode will also be automatically removed.

The ZKFC processes are also monitoring the ZNodes of the other NameNodes in the cluster. If the ZKFC on the standby NameNode host sees the existing master ZNode disappear, it will assume the master has failed and will attempt a failover. It does this by trying to acquire the lock for the NameNode (through the protocol described in the ZooKeeper section) and, if successful, will initiate a failover through the same fencing/promotion mechanism described earlier.
HDFS snapshots

We mentioned earlier that HDFS replication alone is not a suitable backup strategy. In the Hadoop 2 filesystem, snapshots have been added, which bring another level of data protection to HDFS.

Filesystem snapshots have been used for some time across a variety of technologies. The basic idea is that it becomes possible to view the exact state of the filesystem at particular points in time. This is achieved by taking a copy of the filesystem metadata at the point the snapshot is made and making this available to be viewed in the future.

As changes to the filesystem are made, any change that would affect the snapshot is treated specially. For example, if a file that exists in the snapshot is deleted then, even though it will be removed from the current state of the filesystem, its metadata will remain in the snapshot, and the blocks associated with its data will remain on the filesystem, though not accessible through any view of the system other than the snapshot.

An example might illustrate this point. Say you have a filesystem containing the following files:
/data1 (5 blocks)
/data2 (10 blocks)

You take a snapshot and then delete the file /data2. If you view the current state of the filesystem, then only /data1 will be visible. If you examine the snapshot, you will see both files. Behind the scenes, all 15 blocks still exist, but only those associated with the un-deleted file /data1 are part of the current filesystem. The blocks for the file /data2 will be released only when the snapshot is itself removed; snapshots are read-only views.
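The block accounting in this example can be expressed as simple set arithmetic: the blocks physically retained are those referenced by the live filesystem plus those referenced by any snapshot. The following is our own illustrative model of that rule, not HDFS internals:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative model of snapshot block retention; not HDFS internals.
// Each map goes from file path to the number of blocks that file holds.
public class SnapshotBlocks {

    public static int blocksOnDisk(Map<String, Integer> liveFiles,
                                   Map<String, Integer> snapshotFiles) {
        // Union of all file paths referenced by either view.
        Set<String> referenced = new HashSet<>(liveFiles.keySet());
        referenced.addAll(snapshotFiles.keySet());

        int total = 0;
        for (String file : referenced) {
            // A file's blocks are retained while any view references it.
            total += liveFiles.containsKey(file)
                    ? liveFiles.get(file)
                    : snapshotFiles.get(file);
        }
        return total;
    }
}
```

With live = {/data1: 5} and snapshot = {/data1: 5, /data2: 10}, 15 blocks remain on disk; once the snapshot is deleted, the retained count drops to 5.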
Snapshots in Hadoop 2 can be applied at either the full filesystem level or only on particular paths. A path needs to be set as snapshottable, and note that you cannot have a path snapshottable if any of its children or parent paths are themselves snapshottable.

Let's take a simple example based on the directory we created earlier to illustrate the use of snapshots. The commands we are going to illustrate need to be executed with superuser privileges, which can be obtained with sudo -u hdfs.

First, use the dfsadmin subcommand of the hdfs CLI utility to enable snapshots of a directory, as follows:

$ sudo -u hdfs hdfs dfsadmin -allowSnapshot \
/user/cloudera/testdir
Allowing snapshot on testdir succeeded

Now, we create the snapshot and examine it; snapshots are available through the .snapshot subdirectory of the snapshottable directory. Note that the .snapshot directory will not be visible in a normal listing of the directory. Here's how we create a snapshot and examine it:
$ sudo -u hdfs hdfs dfs -createSnapshot \
/user/cloudera/testdir sn1
Created snapshot /user/cloudera/testdir/.snapshot/sn1
$ sudo -u hdfs hdfs dfs -ls \
/user/cloudera/testdir/.snapshot/sn1
Found 1 items
-rw-r--r--   1 cloudera cloudera         12 2014-11-13 11:21
/user/cloudera/testdir/.snapshot/sn1/testfile.txt
Now, we remove the test file from the main directory and verify that it is now empty:

$ sudo -u hdfs hdfs dfs -rm \
/user/cloudera/testdir/testfile.txt
14/11/13 13:13:51 INFO fs.TrashPolicyDefault: Namenode trash configuration:
Deletion interval = 1440 minutes, Emptier interval = 0 minutes. Moved:
'hdfs://localhost.localdomain:8020/user/cloudera/testdir/testfile.txt' to
trash at: hdfs://localhost.localdomain:8020/user/hdfs/.Trash/Current
$ hdfs dfs -ls /user/cloudera/testdir
$

Note the mention of trash directories; by default, HDFS will copy any deleted files into a .Trash directory in the user's home directory, which helps to defend against slipping fingers. These files can be removed through hdfs dfs -expunge or will be automatically purged in 7 days by default.
Now, we examine the snapshot, where the now-deleted file is still available:

$ hdfs dfs -ls testdir/.snapshot/sn1
Found 1 items
drwxr-xr-x   - cloudera cloudera          0 2014-11-13 13:12
testdir/.snapshot/sn1
$ hdfs dfs -tail testdir/.snapshot/sn1/testfile.txt
Hello world

Then, we can delete the snapshot, freeing up any blocks held by it, as follows:

$ sudo -u hdfs hdfs dfs -deleteSnapshot \
/user/cloudera/testdir sn1
$ hdfs dfs -ls testdir/.snapshot
$

As can be seen, the files within a snapshot are fully available to be read and copied, providing access to the historical state of the filesystem at the point when the snapshot was made. Each directory can have up to 65,535 snapshots, and HDFS manages snapshots in such a way that they are quite efficient in terms of impact on normal filesystem operations. They are a great mechanism to use prior to any activity that might have adverse effects, such as trying a new version of an application that accesses the filesystem. If the new software corrupts files, the old state of the directory can be restored. If, after a period of validation, the software is accepted, then the snapshot can instead be deleted.
Hadoop filesystems

Until now, we referred to HDFS as the Hadoop filesystem. In reality, Hadoop has a rather abstract notion of filesystem. HDFS is only one of several implementations of the org.apache.hadoop.fs.FileSystem Java abstract class. A list of available filesystems can be found at https://hadoop.apache.org/docs/r2.5.0/api/org/apache/hadoop/fs/FileSystem.html. The following table summarizes some of these filesystems, along with the corresponding URI scheme and Java implementation class.

Filesystem        URI scheme  Java implementation
Local             file        org.apache.hadoop.fs.LocalFileSystem
HDFS              hdfs        org.apache.hadoop.hdfs.DistributedFileSystem
S3 (native)       s3n         org.apache.hadoop.fs.s3native.NativeS3FileSystem
S3 (block-based)  s3          org.apache.hadoop.fs.s3.S3FileSystem

There exist two implementations of the S3 filesystem. The native one, s3n, is used to read and write regular files. Data stored using s3n can be accessed by any tool and, conversely, it can be used to read data generated by other S3 tools. s3n cannot handle files larger than 5 TB, nor rename operations.

Much like HDFS, the block-based S3 filesystem stores files in blocks and requires an S3 bucket to be dedicated to the filesystem. Files stored in an S3 filesystem can be larger than 5 TB, but they will not be interoperable with other S3 tools. Additionally, block-based S3 supports rename operations.
Hadoop interfaces

Hadoop is written in Java and, not surprisingly, all interaction with the system happens via the Java API. The command-line interface we used through the hdfs command in previous examples is a Java application that uses the FileSystem class to carry out input/output operations on the available filesystems.

Java FileSystem API

The Java API, provided by the org.apache.hadoop.fs package, exposes Apache Hadoop filesystems.

org.apache.hadoop.fs.FileSystem is the abstract class each filesystem implements and provides a general interface to interact with data in Hadoop. All code that uses HDFS should be written with the capability of handling a FileSystem object.

Libhdfs

Libhdfs is a C library that, despite its name, can be used to access any Hadoop filesystem and not just HDFS. It is written using the Java Native Interface (JNI) and mimics the Java FileSystem class.

Thrift

Apache Thrift (http://thrift.apache.org) is a framework for building cross-language software through data serialization and remote method invocation mechanisms. The Hadoop Thrift API, available in contrib, exposes Hadoop filesystems as a Thrift service. This interface makes it easy for non-Java code to access data stored in a Hadoop filesystem.

Other than the aforementioned interfaces, there exist other interfaces that allow access to Hadoop filesystems via HTTP and FTP (these for HDFS only), as well as WebDAV.
Managing and serializing data

Having a filesystem is all well and good, but we also need mechanisms to represent data and store it on the filesystems. We will explore some of these mechanisms now.
The Writable interface

It is useful to us as developers if we can manipulate higher-level data types and have Hadoop look after the processes required to serialize them into bytes to write to a filesystem and reconstruct them from a stream of bytes when they are read from the filesystem.

The org.apache.hadoop.io package contains the Writable interface, which provides this mechanism and is specified as follows:

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

The main purpose of this interface is to provide mechanisms for the serialization and deserialization of data as it is passed across the network or read from and written to the disk.

When we explore processing frameworks on Hadoop in later chapters, we will often see instances where the requirement is for a data argument to be of the type Writable. If we use data structures that provide a suitable implementation of this interface, then the Hadoop machinery can automatically manage the serialization and deserialization of the data type without knowing anything about what it represents or how it is used.
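To see what implementing this contract involves, here is a sketch of a hypothetical pair type following the same write/readFields pattern. It uses only java.io, so it compiles without Hadoop on the classpath; real code would implement org.apache.hadoop.io.Writable itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical type following the Writable write/readFields contract,
// using only java.io so it runs without Hadoop on the classpath.
public class TextIntPair {
    private String text = "";
    private int value;

    public TextIntPair() {}              // Writables need a no-arg constructor
    public TextIntPair(String t, int v) { text = t; value = v; }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(text);
        out.writeInt(value);
    }

    public void readFields(DataInput in) throws IOException {
        text = in.readUTF();             // fields read in exact write order
        value = in.readInt();
    }

    public String getText() { return text; }
    public int getValue() { return value; }

    // Round-trip demonstration: serialize to bytes, deserialize a copy.
    public static TextIntPair roundTrip(TextIntPair p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));
        TextIntPair copy = new TextIntPair();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }
}
```

Note that readFields must read the fields in exactly the order write emitted them, and the no-argument constructor exists so a framework can instantiate an empty object before populating it from a stream.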
Introducing the wrapper classes

Fortunately, you don't have to start from scratch and build Writable variants of all the data types you will use. Hadoop provides classes that wrap the Java primitive types and implement the Writable interface. They are provided in the org.apache.hadoop.io package.

These classes are conceptually similar to the primitive wrapper classes, such as Integer and Long, found in java.lang. They hold a single primitive value that can be set either at construction or via a setter method. They are as follows:
BooleanWritable
ByteWritable
DoubleWritable
FloatWritable
IntWritable
LongWritable
VIntWritable: a variable-length integer type
VLongWritable: a variable-length long type

There is also Text, which wraps java.lang.String.
Array wrapper classes

Hadoop also provides some collection-based wrapper classes. These classes provide Writable wrappers for arrays of other Writable objects. For example, an instance could either hold an array of IntWritable or DoubleWritable, but not arrays of the raw int or float types. A specific subclass for the required Writable class will be required. They are as follows:
ArrayWritable
TwoDArrayWritable
The Comparable and WritableComparable interfaces

We were slightly inaccurate when we said that the wrapper classes implement Writable; they actually implement a composite interface called WritableComparable in the org.apache.hadoop.io package that combines Writable with the standard java.lang.Comparable interface:

public interface WritableComparable extends Writable, Comparable {
}

The need for Comparable will only become apparent when we explore MapReduce in the next chapter, but for now, just remember that the wrapper classes provide mechanisms for them to be both serialized and sorted by Hadoop or any of its frameworks.
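Why Comparable matters is easiest to see in miniature: any type that implements it can be sorted by a framework that knows nothing about what the data means, which is how Hadoop sorts WritableComparable keys. A plain-Java sketch (no Hadoop types; the class is our own invention):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A minimal Comparable type; a framework can sort such keys
// without knowing what they represent, exactly as Hadoop does
// with WritableComparable keys.
public class WordCount implements Comparable<WordCount> {
    final String word;
    final int count;

    public WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public int compareTo(WordCount other) {
        return Integer.compare(this.count, other.count);
    }

    public static List<WordCount> sorted(List<WordCount> in) {
        List<WordCount> out = new ArrayList<>(in);
        Collections.sort(out);   // uses compareTo, nothing else
        return out;
    }
}
```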
Storing data

Until now, we introduced the architecture of HDFS and how to store and retrieve data using the command-line tools and the Java API. In the examples seen until now, we have implicitly assumed that our data was stored as a text file. In reality, some applications and datasets will require ad hoc data structures to hold the file's contents. Over the years, file formats have been created both to address the requirements of MapReduce processing (for instance, we want data to be splittable) and to satisfy the need to model both structured and unstructured data. Currently, a lot of focus has been dedicated to better capturing the use cases of relational data storage and modeling. In the remainder of this chapter, we will introduce some of the popular file format choices available within the Hadoop ecosystem.
Serialization and Containers

When talking about file formats, we are assuming two types of scenarios, which are as follows:

Serialization: we want to encode data structures generated and manipulated at processing time into a format we can store to a file, transmit, and, at a later stage, retrieve and translate back for further manipulation
Containers: once data is serialized to files, containers provide means to group multiple files together and add additional metadata
Compression

When working with data, file compression can often lead to significant savings, both in terms of the space necessary to store files and in the data I/O across the network and to and from local disks.

In broad terms, when using a processing framework, compression can occur at three points in the processing pipeline:

- Input files to be processed
- Output files that result after processing is completed
- Intermediate/temporary files produced internally within the pipeline
When we add compression at any of these stages, we have an opportunity to dramatically reduce the amount of data to be read from or written to disk or across the network. This is particularly useful with frameworks such as MapReduce that can, for example, produce volumes of temporary data that are larger than either the input or output datasets.

Apache Hadoop comes with a number of compression codecs: gzip, bzip2, LZO, and Snappy, each with its own trade-offs. Picking a codec is an educated choice that should consider both the kind of data being processed and the nature of the processing framework itself.

Other than the general space/time trade-off, where the largest space savings come at the expense of compression and decompression speed (and vice versa), we need to take into account that data stored in HDFS will be accessed by parallel, distributed software; some of this software will also add its own particular requirements on file formats. MapReduce, for example, is most efficient on files that can be split into valid subfiles.

This can complicate decisions, such as whether to compress and, if so, which codec to use, as most compression codecs (such as gzip) do not support splittable files, whereas a few (such as LZO) do.
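The space/time trade-off is easy to see outside Hadoop. The following self-contained sketch (plain Java with no Hadoop dependencies; the class and method names are ours) round-trips some repetitive, log-like text through gzip, one of the codecs mentioned above:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Toy demonstration of the space saving a codec such as gzip can give on
// repetitive (log-like) data. Hadoop wraps the same family of codecs; this
// is only an illustration of the general trade-off, not Hadoop API usage.
public class GzipDemo {
    public static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static byte[] decompress(byte[] data) throws IOException {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(data))) {
            return gz.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("2014-01-01 INFO job started on node ")
              .append(i % 10).append('\n');
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        // Repetitive text shrinks dramatically; the round trip is lossless.
        System.out.println(raw.length + " -> " + packed.length);
        System.out.println(new String(decompress(packed),
                StandardCharsets.UTF_8).equals(sb.toString()));
    }
}
```

On repetitive input the compressed form is much smaller; on random or already-compressed data, gzip can even grow the payload, which is one reason codec choice should consider the data itself.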
General-purpose file formats

The first class of file formats comprises general-purpose ones that can be applied to any application domain and make no assumptions about data structure or access patterns.

- Text: the simplest approach to storing data on HDFS is to use flat files. Text files can hold both unstructured data (a web page or a tweet) and structured data (a CSV file that is a few million rows long). Text files are splittable, though one needs to consider how to handle boundaries between multiple elements (for example, lines) in the file.
- SequenceFile: a SequenceFile is a flat data structure consisting of binary key/value pairs, introduced to address specific requirements of MapReduce-based processing. It is still extensively used in MapReduce as an input/output format. As we will see in Chapter 3, Processing – MapReduce and Beyond, internally the temporary outputs of maps are stored using SequenceFile.

SequenceFile provides Writer, Reader, and Sorter classes to write, read, and sort data, respectively.

Depending on the compression mechanism in use, three variations of SequenceFile can be distinguished:

- Uncompressed key/value records
- Record-compressed key/value records, where only the values are compressed
- Block-compressed key/value records, where keys and values are collected in blocks of arbitrary size and compressed separately

In each case, however, the SequenceFile remains splittable, which is one of its biggest strengths.
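The gap between record and block compression can also be illustrated without Hadoop. This sketch (our own class names, plain java.util.zip rather than the SequenceFile API) compresses 500 similar values one at a time and then as a single block:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Illustrates, outside Hadoop, why SequenceFile's block compression tends to
// beat record compression: per-record codec overhead plus lost cross-record
// redundancy. This mirrors the idea only; it is not the SequenceFile API.
public class BlockVsRecord {
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String[] records = new String[500];
        for (int i = 0; i < records.length; i++) {
            records[i] = "user=" + (i % 7) + " action=page_view status=200";
        }
        // Record compression: each value compressed on its own.
        int recordTotal = 0;
        StringBuilder block = new StringBuilder();
        for (String r : records) {
            recordTotal += gzip(r.getBytes(StandardCharsets.UTF_8)).length;
            block.append(r).append('\n');
        }
        // Block compression: many records compressed together.
        int blockTotal =
            gzip(block.toString().getBytes(StandardCharsets.UTF_8)).length;
        System.out.println(blockTotal < recordTotal);
    }
}
```

Each gzip stream carries fixed header overhead and cannot exploit redundancy across records, so the block-compressed total comes out far smaller; SequenceFile's block variant wins for the same reason.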
Column-oriented data formats

In the relational database world, column-oriented data stores organize and store tables based on columns; generally speaking, the data for each column will be stored together. This is a significantly different approach compared to most relational DBMSes, which organize data per row. Column-oriented storage has significant performance advantages; for example, if a query needs to read only two columns from a very wide table containing hundreds of columns, then only the required column data files are accessed. A traditional row-oriented database would have to read all columns for each row for which data was required. This has the greatest impact on workloads where aggregate functions are computed over large numbers of similar items, such as the OLAP workloads typical of data warehouse systems.

In Chapter 7, Hadoop and SQL, we will see how Hadoop is becoming a SQL backend for the data warehouse world thanks to projects such as Apache Hive and Cloudera Impala. As part of the expansion into this domain, a number of file formats have been developed to account for both relational modeling and data warehousing needs.

RCFile, ORC, and Parquet are three state-of-the-art column-oriented file formats developed with these use cases in mind.
RCFile

Row Columnar File (RCFile) was originally developed by Facebook as the backend storage for their Hive data warehouse system, which was the first mainstream SQL-on-Hadoop system available as open source.

RCFile aims to provide the following:

- Fast data loading
- Fast query processing
- Efficient storage utilization
- Adaptability to dynamic workloads

More information on RCFile can be found at http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/abs11-4.html.
ORC

The Optimized Row Columnar (ORC) file format aims to combine the performance of RCFile with the flexibility of Avro. It is primarily intended to work with Apache Hive and was initially developed by Hortonworks to overcome the perceived limitations of other available file formats.

More details can be found at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html.
Parquet

Parquet, found at http://parquet.incubator.apache.org, was originally a joint effort of
Cloudera, Twitter, and Criteo, and has now been donated to the Apache Software Foundation. The goals of Parquet are to provide a modern, performant, columnar file format to be used with Cloudera Impala. As with Impala, Parquet has been inspired by the Dremel paper (http://research.google.com/pubs/pub36632.html). It allows complex, nested data structures and efficient encoding on a per-column level.
Avro

Apache Avro (http://avro.apache.org) is a schema-oriented binary data serialization format and file container. Avro will be our preferred binary data format throughout this book. It is both splittable and compressible, making it an efficient format for data processing with frameworks such as MapReduce.

Numerous other projects also have built-in Avro support and integration, so it is very widely applicable. When data is stored in an Avro file, its schema, defined as a JSON object, is stored with it. A file can later be processed by a third party with no a priori notion of how the data is encoded. This makes the data self-describing and facilitates use with dynamic and scripting languages. The schema-on-read model also helps Avro records be efficient to store, as there is no need for the individual fields to be tagged.

In later chapters, you will see how these properties can make data lifecycle management easier and allow non-trivial operations such as schema migrations.
Using the Java API

We'll now demonstrate the use of the Java API to parse Avro schemas, read and write Avro files, and use Avro's code generation facilities. Note that the format is intrinsically language independent; there are APIs for most languages, and files created in Java will seamlessly be read from any other language.

Avro schemas are described as JSON documents and represented by the org.apache.avro.Schema class. To demonstrate the API for manipulating Avro documents, we'll look ahead to an Avro specification we use for a Hive table in Chapter 7, Hadoop and SQL. The following code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch2/src/main/java/com/learninghadoop2/avro/AvroParse.java.

In the following code, we will use the Avro Java API to create an Avro file containing a tweet record and then re-read the file, using the schema in the file to extract the details of the stored records:
public static void testGenericRecord() {
    try {
        Schema schema = new Schema.Parser()
            .parse(new File("tweets_avro.avsc"));
        GenericRecord tweet =
            new GenericData.Record(schema);
        tweet.put("text", "The generic tweet text");
        File file = new File("tweets.avro");
        DatumWriter<GenericRecord> datumWriter =
            new GenericDatumWriter<>(schema);
        DataFileWriter<GenericRecord> fileWriter =
            new DataFileWriter<>(datumWriter);
        fileWriter.create(schema, file);
        fileWriter.append(tweet);
        fileWriter.close();

        DatumReader<GenericRecord> datumReader =
            new GenericDatumReader<>(schema);
        DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(file, datumReader);
        GenericRecord genericTweet = null;
        while (fileReader.hasNext()) {
            genericTweet = fileReader.next(genericTweet);
            for (Schema.Field field :
                    genericTweet.getSchema().getFields()) {
                Object val = genericTweet.get(field.name());
                if (val != null) {
                    System.out.println(val);
                }
            }
        }
    } catch (IOException ie) {
        System.out.println("Error parsing or writing file.");
    }
}
The tweets_avro.avsc schema, found at https://github.com/learninghadoop2/book-examples/blob/master/ch2/tweets_avro.avsc, describes a tweet with multiple fields. To create an Avro object of this type, we first parse the schema file. We then use Avro's concept of a GenericRecord to build an Avro document that complies with this schema. In this case, we only set a single attribute: the tweet text itself.

To write this Avro file, containing a single object, we then use Avro's I/O capabilities. To read the file, we do not need to start with the schema, as we can extract it from the GenericRecord we read from the file. We then walk through the schema structure and dynamically process the document based on the discovered fields. This is particularly powerful, as it is the key enabler of clients remaining independent of the Avro schema and how it evolves over time.

If we have the schema file in advance, however, we can use Avro code generation to create a customized class that makes manipulating Avro records much easier. To generate the code, we will use the compile class in avro-tools.jar, passing it the name of the schema file and the desired output directory:
$ java -jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/avro/avro-tools.jar compile schema tweets_avro.avsc src/main/java
The class will be placed in a directory structure based on any namespace defined in the schema. Since we created this schema in the com.learninghadoop2.avrotables namespace, we see the following:

$ ls src/main/java/com/learninghadoop2/avrotables/tweets_avro.java

With this class, let's revisit the creation, reading, and writing of Avro objects, as follows:
public static void testGeneratedCode() {
    tweets_avro tweet = new tweets_avro();
    tweet.setText("The code generated tweet text");

    try {
        File file = new File("tweets.avro");
        DatumWriter<tweets_avro> datumWriter =
            new SpecificDatumWriter<>(tweets_avro.class);
        DataFileWriter<tweets_avro> fileWriter =
            new DataFileWriter<>(datumWriter);
        fileWriter.create(tweet.getSchema(), file);
        fileWriter.append(tweet);
        fileWriter.close();

        DatumReader<tweets_avro> datumReader =
            new SpecificDatumReader<>(tweets_avro.class);
        DataFileReader<tweets_avro> fileReader =
            new DataFileReader<>(file, datumReader);
        while (fileReader.hasNext()) {
            tweet = fileReader.next(tweet);
            System.out.println(tweet.getText());
        }
    } catch (IOException ie) {
        System.out.println("Error in parsing or writing files.");
    }
}
Because we used code generation, we now use the Avro SpecificRecord mechanism alongside the generated class that represents the object in our domain model. Consequently, we can directly instantiate the object and access its attributes through familiar get/set methods.

Writing the file is similar to the action performed before, except that we use the specific classes and also retrieve the schema directly from the tweet object when needed. Reading is similarly eased through the ability to create instances of a specific class and use get/set methods.
Summary

This chapter has given a whistle-stop tour through storage on a Hadoop cluster. In particular, we covered:

- The high-level architecture of HDFS, the main filesystem used in Hadoop
- How HDFS works under the covers and, in particular, its approach to reliability
- How Hadoop 2 has added significantly to HDFS, particularly in the form of NameNode HA and filesystem snapshots
- What ZooKeeper is and how it is used by Hadoop to enable features such as NameNode automatic failover
- An overview of the command-line tools used to access HDFS
- The API for filesystems in Hadoop and how, at a code level, HDFS is just one implementation of a more flexible filesystem abstraction
- How data can be serialized onto a Hadoop filesystem and some of the support provided in the core classes
- The various file formats in which data is most frequently stored in Hadoop and some of their particular use cases
In the next chapter, we will look in detail at how Hadoop provides processing frameworks that can be used to process the data stored within it.
Chapter 3. Processing – MapReduce and Beyond

In Hadoop 1, the platform had two clear components: HDFS for data storage and MapReduce for data processing. The previous chapter described the evolution of HDFS in Hadoop 2, and in this chapter we'll discuss data processing.

The picture with processing in Hadoop 2 has changed more significantly than that of storage, and Hadoop now supports multiple processing models as first-class citizens. In this chapter, we'll explore both MapReduce and other computational models in Hadoop 2. In particular, we'll cover:

- What MapReduce is and the Java API required to write applications for it
- How MapReduce is implemented in practice
- How Hadoop reads data into and out of its processing jobs
- YARN, the Hadoop 2 component that allows processing beyond MapReduce on the platform
- An introduction to several computational models implemented on YARN
MapReduce

MapReduce is the primary processing model supported in Hadoop 1. It follows a divide-and-conquer model for processing data made popular by a 2004 paper by Google (http://research.google.com/archive/mapreduce.html) and has foundations both in functional programming and database research. The name itself refers to two distinct steps applied to all input data: a map function and a reduce function.
Every MapReduce application is a sequence of jobs that build atop this very simple model. Sometimes the overall application may require multiple jobs, where the output of the reduce stage from one is the input to the map stage of another, and sometimes there might be multiple map or reduce functions, but the core concepts remain the same.

We will introduce the MapReduce model by looking at the nature of the map and reduce functions and then describe the Java API required to build implementations of these functions. After showing some examples, we will walk through a MapReduce execution to give more insight into how the actual MapReduce framework executes code at runtime.

Learning the MapReduce model can be a little counter-intuitive; it's often difficult to appreciate how very simple functions can, when combined, provide very rich processing of enormous datasets. But it does work, trust us!

As we explore the nature of the map and reduce functions, think of them as being applied to a stream of records retrieved from the source dataset. We'll describe how that happens later; for now, think of the source data as being sliced into smaller chunks, each of which gets fed to a dedicated instance of the map function. Each record has the map function applied to it, producing a set of intermediary data. Records are retrieved from this temporary dataset, and all associated records are fed together through the reduce function. The final output of the reduce function for all the sets of records is the overall result for the complete job.
From a functional perspective, MapReduce transforms data structures from one list of (key, value) pairs into another. During the map phase, data is loaded from HDFS, a function is applied in parallel to every input (key, value) pair, and a new list of (key, value) pairs is the output:

map(k1, v1) -> list(k2, v2)

The framework then collects all pairs with the same key from all lists and groups them together, creating one group for each key. A reduce function is applied in parallel to each group, which in turn produces a list of values:

reduce(k2, list(v2)) -> (k3, list(v3))

The output is then written back to HDFS, as shown in the following figure:
[Figure: Map and Reduce phases]
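Before looking at the Hadoop API, it can help to see the whole model in miniature. The following toy word count (plain, single-process Java; the class and method names are ours, not Hadoop's) implements map, the grouping step, and reduce exactly as in the formulas above:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A toy, single-process word count mirroring (k1,v1) -> list(k2,v2) and
// (k2, list(v2)) -> result. Hadoop's job is to run the same two functions
// distributed across a cluster; the logic itself is this small.
public class ToyWordCount {
    // map: (lineNo, line) -> list of (word, 1)
    static List<SimpleEntry<String, Integer>> map(long lineNo, String line) {
        List<SimpleEntry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    // reduce: (word, list of counts) -> total count
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static Map<String, Integer> run(List<String> lines) {
        // Stand-in for the shuffle: group intermediate values by key.
        Map<String, List<Integer>> groups = new TreeMap<>();
        long lineNo = 0;
        for (String line : lines) {
            for (SimpleEntry<String, Integer> kv : map(lineNo++, line)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        groups.forEach((k, v) -> result.put(k, reduce(k, v)));
        return result;
    }
}
```

Hadoop's contribution is not these few lines of logic but running them in parallel over terabytes, with the grouping (the "shuffle") performed across machines.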
Java API to MapReduce

The Java API to MapReduce is exposed by the org.apache.hadoop.mapreduce package. Writing a MapReduce program, at its core, is a matter of subclassing the Hadoop-provided Mapper and Reducer base classes and overriding the map() and reduce() methods with our own implementations.
The Mapper class

For our own Mapper implementations, we will subclass the Mapper base class and override the map() method, as follows:

class Mapper<K1, V1, K2, V2>
{
    void map(K1 key, V1 value, Mapper.Context context)
        throws IOException, InterruptedException
    ...
}
The class is defined in terms of the key/value input and output types, and then the map method takes an input key/value pair as its parameters. The other parameter is an instance of the Context class, which provides various mechanisms to communicate with the Hadoop framework, one of which is to output the results of a map or reduce method.

Notice that the map method only refers to a single instance of K1 and V1 key/value pairs. This is a critical aspect of the MapReduce paradigm: you write classes that process single records, and the framework is responsible for all the work required to turn an enormous dataset into a stream of key/value pairs. You will never have to write map or reduce classes that try to deal with the full dataset. Hadoop also provides mechanisms through its InputFormat and OutputFormat classes that provide implementations of common file formats and likewise remove the need to write file parsers for anything but custom file types.

There are three additional methods that may sometimes need to be overridden:
protected void setup(Mapper.Context context)
    throws IOException, InterruptedException

This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing.

protected void cleanup(Mapper.Context context)
    throws IOException, InterruptedException

This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing.

protected void run(Mapper.Context context)
    throws IOException, InterruptedException

This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once before repeatedly calling the map method for each key/value pair in the split, and then finally calls the cleanup method.
The Reducer class

The Reducer base class works very similarly to the Mapper class and usually requires only that subclasses override a single reduce() method. Here is the cut-down class definition:

public class Reducer<K2, V2, K3, V3>
{
    void reduce(K2 key, Iterable<V2> values,
                Reducer.Context context)
        throws IOException, InterruptedException
    ...
}
Again, notice the class definition in terms of the broader data flow (the reduce method accepts K2/V2 as input and provides K3/V3 as output), whereas the actual reduce method takes only a single key and its associated list of values. The Context object is again the mechanism used to output the result of the method.

This class also has setup, run, and cleanup methods with similar default implementations to the Mapper class, which can optionally be overridden:
protected void setup(Reducer.Context context)
    throws IOException, InterruptedException

The setup() method is called once before any key/list-of-values pairs are presented to the reduce method. The default implementation does nothing.

protected void cleanup(Reducer.Context context)
    throws IOException, InterruptedException

The cleanup() method is called once after all key/list-of-values pairs have been presented to the reduce method. The default implementation does nothing.

protected void run(Reducer.Context context)
    throws IOException, InterruptedException

The run() method controls the overall flow of processing the task within the JVM. The default implementation calls the setup method before repeatedly, and potentially concurrently, calling the reduce method for as many key/value pairs as are provided to the Reducer class, and then finally calls the cleanup method.
The Driver class

The Driver class communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it.

The driver logic usually exists in the main method of the class written to encapsulate a MapReduce job. There is no default parent Driver class to subclass:
public class ExampleDriver extends Configured implements Tool
{
    ...
    public int run(String[] args) throws Exception
    {
        // Create a Configuration object that is used to set other options
        Configuration conf = getConf();
        // Get command-line arguments
        args = new GenericOptionsParser(conf, args)
            .getRemainingArgs();
        // Create the object representing the job
        Job job = new Job(conf, "ExampleJob");
        // Set the name of the main class in the job jar file
        job.setJarByClass(ExampleDriver.class);
        // Set the mapper class
        job.setMapperClass(ExampleMapper.class);
        // Set the reducer class
        job.setReducerClass(ExampleReducer.class);
        // Set the types for the final output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set input and output file paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Execute the job and wait for it to complete
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        int exitCode = ToolRunner.run(new ExampleDriver(), args);
        System.exit(exitCode);
    }
}
In the preceding lines of code, org.apache.hadoop.util.Tool is an interface for handling command-line options. The actual handling is delegated to ToolRunner.run, which runs the Tool with the given Configuration, used to get and set a job's configuration options. By subclassing org.apache.hadoop.conf.Configured, we can set the Configuration object directly from command-line options via GenericOptionsParser.
Given our previous talk of jobs, it's not surprising that much of the setup involves operations on a Job object. This includes setting the job name and specifying which classes are to be used for the mapper and reducer implementations.

Certain input/output configurations are set and, finally, the arguments passed to the main method are used to specify the input and output locations for the job. This is a very common model that you will see often.

There are a number of default values for configuration options, and we are implicitly using some of them in the preceding class. Most notably, we don't say anything about the format of the input files or how the output files are to be written. These are defined through the InputFormat and OutputFormat classes mentioned earlier; we will explore them in detail later. The default input and output formats are text files, which suit our examples. There are multiple ways of expressing the format within text files, in addition to particularly optimized binary formats.

A common model for less complex MapReduce jobs is to have the Mapper and Reducer classes as inner classes within the driver. This allows everything to be kept in a single file, which simplifies code distribution.
Combiner

Hadoop allows the use of a combiner class to perform some early aggregation of the output from the map method before it's retrieved by the reducer.

Much of Hadoop's design is predicated on reducing the expensive parts of a job, which usually equate to disk and network I/O. The output of the mapper is often large; it's not infrequent to see it many times the size of the original input. Hadoop does allow configuration options to help reduce the impact of the reducers transferring such large chunks of data across the network. The combiner takes a different approach: where possible, it performs early aggregation so that less data needs to be transferred in the first place.

The combiner does not have its own interface; a combiner must have the same signature as the reducer, and hence also subclasses the Reducer class from the org.apache.hadoop.mapreduce package. The effect of this is basically to perform a mini-reduce on the mapper's output destined for each reducer.

Hadoop does not guarantee whether the combiner will be executed. Sometimes it may not be executed at all, while at other times it may be used once, twice, or more times, depending on the size and number of output files generated by the mapper for each reducer.
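To see why a sum-style combiner is both safe and useful, consider this standalone sketch (plain Java; the class and method names are ours): a single mapper's stream of (word, 1) pairs is aggregated locally, as a combiner would do before any shuffle:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of what a combiner buys us: summing one mapper's (word, 1) output
// locally before the shuffle. The final counts are identical either way,
// which is why Hadoop is free to run a sum combiner zero or more times.
public class CombinerEffect {
    // Local "mini-reduce" over one mapper's output keys.
    static Map<String, Integer> combine(List<String> mapperOutputKeys) {
        Map<String, Integer> local = new HashMap<>();
        for (String word : mapperOutputKeys) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        List<String> mapperOutput = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            mapperOutput.add(i % 2 == 0 ? "hadoop" : "yarn");
        }
        Map<String, Integer> combined = combine(mapperOutput);
        // 10,000 intermediate records shrink to 2 before crossing the network.
        System.out.println(mapperOutput.size() + " -> " + combined.size());
    }
}
```

Because addition is associative and commutative, applying this mini-reduce zero, one, or several times leaves the final per-key totals unchanged; that is exactly the property a combiner must have.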
Partitioning

One of the implicit guarantees of the Reduce interface is that a single reducer will be given all the values associated with a given key. With multiple reduce tasks running across a cluster, each mapper output must therefore be partitioned into separate outputs destined for each reducer. These partitioned files are stored on the local node filesystem.

The number of reduce tasks across the cluster is not as dynamic as that of mappers, and indeed we can specify the value as part of our job submission. Hadoop therefore knows how many reducers will be needed to complete the job and, from this, into how many partitions the mapper output should be split.
The optional partition function

Within the org.apache.hadoop.mapreduce package is the Partitioner class, an abstract class with the following signature:

public abstract class Partitioner<Key, Value>
{
    public abstract int getPartition(Key key, Value value,
        int numPartitions);
}
By default, Hadoop will use a strategy that hashes the output key to perform the partitioning. This functionality is provided by the HashPartitioner class within the org.apache.hadoop.mapreduce.lib.partition package, but in some cases it's necessary to provide a custom subclass of Partitioner with application-specific partitioning logic. Notice that the getPartition function takes the key, value, and number of partitions as parameters, any of which can be used by the custom partitioning logic.

A custom partitioning strategy would be particularly necessary if, for example, the data gave a very uneven distribution when the standard hash function was applied. Uneven partitioning can result in some tasks having to perform significantly more work than others, leading to a much longer overall job execution time.
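The default hash strategy boils down to one line of arithmetic. This standalone sketch (our own class, mirroring the hash-then-modulus approach described above rather than reproducing the Hadoop source) shows the two properties any partitioner must provide: determinism and a result in the range [0, numPartitions):

```java
// The masking with Integer.MAX_VALUE clears the sign bit so that the
// modulus can never be negative, even for keys whose hashCode() is.
// A custom Partitioner subclass replaces exactly this decision.
public class HashPartitioning {
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        for (String key : new String[] {"hadoop", "hdfs", "yarn"}) {
            // The same key always lands on the same reducer.
            System.out.println(key + " -> partition "
                + getPartition(key, numReducers));
        }
    }
}
```

A skewed key distribution (say, one key carrying half the records) defeats this scheme regardless of the hash function, which is when an application-specific getPartition becomes worthwhile.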
Hadoop-provided mapper and reducer implementations

We don't always have to write our own Mapper and Reducer classes from scratch. Hadoop provides several common Mapper and Reducer implementations that can be used in our jobs. If we don't override any of the methods in the Mapper and Reducer classes, the default implementations are the identity Mapper and Reducer classes, which simply output the input unchanged.

The mappers are found in the org.apache.hadoop.mapreduce.lib.map package and include the following:

- InverseMapper: returns (value, key) as output; that is, the input key is output as the value and the input value is output as the key
- TokenCounterMapper: counts the number of discrete tokens in each line of input
- IdentityMapper: implements the identity function, mapping inputs directly to outputs

The reducers are found in the org.apache.hadoop.mapreduce.lib.reduce package and currently include the following:

- IntSumReducer: outputs the sum of the list of integer values per key
- LongSumReducer: outputs the sum of the list of long values per key
- IdentityReducer: implements the identity function, mapping inputs directly to outputs
Sharing reference data

Occasionally, we might want to share data across tasks. For instance, if we need to perform a lookup operation on an ID-to-string translation table, we might want such a data source to be accessible by the mapper or reducer. A straightforward approach is to store the data we want to access on HDFS and use the FileSystem API to query it as part of the map or reduce steps.

Hadoop gives us an alternative mechanism to achieve the goal of sharing reference data across all tasks in the job: the DistributedCache, defined by the org.apache.hadoop.mapreduce.filecache.DistributedCache class. This can be used to efficiently make common read-only files that are used by the map or reduce tasks available to all nodes.

The files can be text data, as in this case, but could also be additional JARs, binary data, or archives; anything is possible. The files to be distributed are placed on HDFS and added to the DistributedCache within the job driver. Hadoop copies the files onto the local filesystem of each node prior to job execution, meaning every task has local access to the files.

An alternative is to bundle needed files into the job JAR submitted to Hadoop. This ties the data to the job JAR, making it more difficult to share across jobs, and requires the JAR to be rebuilt if the data changes.
Writing MapReduce programs

In this chapter, we will be focusing on batch workloads; given a set of historical data, we will look at properties of that dataset. In Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark, we will show how a similar type of analysis can be performed over a stream of text collected in real time.
Getting started

In the following examples, we will assume a dataset generated by collecting 1,000 tweets using the stream.py script, as shown in Chapter 1, Introduction:

$ python stream.py -t -n 1000 > tweets.txt
We can then copy the dataset into HDFS with:

$ hdfs dfs -put tweets.txt <destination>

Tip
Note that until now we have been working only with the text of tweets. In the remainder of this book, we'll extend stream.py to output additional tweet metadata in JSON format. Keep this in mind before dumping terabytes of messages with stream.py.

Our first MapReduce program will be the canonical WordCount example. A variation of this program will be used to determine trending topics. We will then analyze the text associated with topics to determine whether it expresses a "positive" or "negative" sentiment. Finally, we will make use of a MapReduce pattern, ChainMapper, to pull things together and present a data pipeline to clean and prepare the textual data we'll feed to the trending topic and sentiment analysis models.
Running the examples

The full source code of the examples described in this section can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch3.

Before we run our job in Hadoop, we must compile our code and collect the required class files into a single JAR file that we will submit to the system. Using Gradle, you can build the needed JAR file with:

$ ./gradlew jar

Local cluster

Jobs are executed on Hadoop using the jar option to the hadoop command-line utility. To use this, we specify the name of the JAR file, the main class within it, and any arguments that will be passed to the main class, as shown in the following command:

$ hadoop jar <job jar file> <main class> <argument1> … <argumentN>
Elastic MapReduce

Recall from Chapter 1, Introduction, that Elastic MapReduce expects the job JAR file and its input data to be located in an S3 bucket and conversely will dump its output back into S3.

Note

Be careful: this will cost money! For this example, we will use the smallest possible cluster configuration available for EMR: a single-node cluster.

First of all, we will copy the tweet dataset and the job JAR to S3 using the aws command-line utility:

$ aws s3 cp tweets.txt s3://<bucket>/input
$ aws s3 cp job.jar s3://<bucket>

We can then execute a job using the EMR command-line tool by adding a CUSTOM_JAR step with the aws CLI:

$ aws emr add-steps --cluster-id <cluster-id> --steps \
Type=CUSTOM_JAR,\
Name=CustomJAR,\
Jar=s3://<bucket>/job.jar,\
MainClass=<classname>,\
Args=arg1,arg2,…argN

Here, <cluster-id> is the ID of a running EMR cluster, <classname> is the fully qualified name of the main class, and arg1,arg2,…,argN are the job arguments.
WordCount, the Hello World of MapReduce

WordCount counts word occurrences in a dataset. The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/WordCount.java. Consider the following block of code, for example:

public class WordCount extends Configured implements Tool
{
    public static class WordCountMapper
            extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String str : words)
            {
                word.set(str);
                context.write(word, one);
            }
        }
    }

    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context
                ) throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable val : values) {
                total++;
            }
            context.write(key, new IntWritable(total));
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args)
                .getRemainingArgs();

        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }
}
This is our first complete MapReduce job. Look at the structure, and you should recognize the elements we have previously discussed: the overall Job class with the driver configuration in its main method and the Mapper and Reducer implementations defined as static nested classes.
We'll do a more detailed walkthrough of the mechanics of MapReduce in the next section, but for now, let's look at the preceding code and think of how it realizes the key/value transformations we discussed earlier.
The input to the Mapper class is arguably the hardest to understand, as the key is not actually used. The job specifies TextInputFormat as the format of the input data and, by default, this delivers to the mapper data where the key is the byte offset in the file and the value is the text of that line. In reality, you may never actually see a mapper that uses that byte offset key, but it's provided.
The mapper is executed once for each line of text in the input source, and every time it takes the line and breaks it into words. It then uses the Context object to output (more commonly known as emitting) each new key/value of the form (word, 1). These are our K2/V2 values.
We said before that the input to the reducer is a key and a corresponding list of values, and there is some magic that happens between the map and reduce methods to collect the values for each key that facilitates this. It's called the shuffle stage, which we won't describe right now. Hadoop executes the reducer once for each key, and the preceding reducer implementation simply counts the numbers in the Iterable object and gives output for each word in the form of (word, count). These are our K3/V3 values.
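The group-by-key behavior of the shuffle can be simulated in plain Java, with no Hadoop dependency; this sketch (class and method names are hypothetical, for illustration only) mirrors the (word, 1) to (word, count) flow described above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSim {
    // "Map" phase: emit a (word, 1) pair for every token in every line. These are the K2/V2 pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                emitted.add(Map.entry(word, 1));
            }
        }
        return emitted;
    }

    // "Shuffle" plus "reduce": group the emitted values by key, then sum per key. These are K3/V3.
    static Map<String, Integer> shuffleAndReduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();  // sorted by key, like MapReduce output
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(shuffleAndReduce(map(List.of("a b a", "b a"))));  // {a=3, b=2}
    }
}
```

The real shuffle does this grouping across machines, but the contract is the same: every value emitted under one key reaches a single reduce call.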
Take a look at the signatures of our mapper and reducer classes: the WordCountMapper class accepts Object and Text as input and provides Text and IntWritable as output, while the WordCountReducer class accepts Text and IntWritable as both input and output. This is quite a common pattern, where the map method discards the incoming key and emits a series of data pairs derived from the value, on which the reducer performs aggregation.
The driver is more meaningful here, as we have real values for the parameters. We use arguments passed to the class to specify the input and output locations.
Run the job with:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.WordCount \
twitter.txt output
Examine the output with a command such as the following; the actual filename might be different, so just look inside the directory called output in your home directory on HDFS:

$ hdfs dfs -cat output/part-r-00000
Word co-occurrences

Words occurring together are likely to be phrases, and common (frequently occurring) phrases are likely to be important. In Natural Language Processing, a sequence of co-occurring terms is called an N-Gram. N-Grams are the foundation of several statistical methods for text analytics. We will give an example of the special case of an N-Gram composed of two terms (a bigram), a metric often encountered in analytics applications.

A naïve implementation in MapReduce would be an extension of WordCount that emits a multi-field key composed of two tab-separated words:
public class BiGramCount extends Configured implements Tool
{
    public static class BiGramMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            Text bigram = new Text();
            String prev = null;

            for (String s : words) {
                if (prev != null) {
                    bigram.set(prev + "\t" + s);
                    context.write(bigram, one);
                }
                prev = s;
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();

        Job job = Job.getInstance(conf);
        job.setJarByClass(BiGramCount.class);
        job.setMapperClass(BiGramMapper.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new BiGramCount(), args);
        System.exit(exitCode);
    }
}
In this job, we replace WordCountReducer with org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer, which implements the same logic. The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/BiGramCount.java.
Trending topics

The # symbol, called a hashtag, is used to mark keywords or topics in a tweet. It was created organically by Twitter users as a way to categorize messages. Twitter Search (found at https://twitter.com/search-home) popularized the use of hashtags as a method to connect and find content related to specific topics, as well as the people talking about such topics. By counting the frequency with which a hashtag is mentioned over a given time period, we can determine which topics are trending in the social network.
public class HashTagCount extends Configured implements Tool
{
    public static class HashTagCountMapper
            extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private String hashtagRegExp =
                "(?:\\s|\\A|^)[#]+([A-Za-z0-9-_]+)";

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String str : words)
            {
                if (str.matches(hashtagRegExp)) {
                    word.set(str);
                    context.write(word, one);
                }
            }
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args)
                .getRemainingArgs();

        Job job = Job.getInstance(conf);
        job.setJarByClass(HashTagCount.class);
        job.setMapperClass(HashTagCountMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new HashTagCount(), args);
        System.exit(exitCode);
    }
}
As in the WordCount example, we tokenize text in the Mapper. We use a regular expression, hashtagRegExp, to detect the presence of a hashtag in the tweet's text, and we emit the hashtag and the number 1 when a hashtag is found. In the Reducer step, we then count the total number of emitted hashtag occurrences using IntSumReducer.
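The filtering behavior of such a pattern is easy to verify in plain Java; this sketch uses a simplified version of the hashtag expression (an assumption for illustration, not the exact one compiled into HashTagCount) to show how matches() keeps only hashtag tokens:

```java
import java.util.ArrayList;
import java.util.List;

public class HashTagFilter {
    // Simplified hashtag pattern: one or more '#' followed by word characters.
    static final String HASHTAG = "[#]+([A-Za-z0-9-_]+)";

    // Keep only tokens that are hashtags, as the mapper does with matches().
    static List<String> hashtags(String text) {
        List<String> found = new ArrayList<>();
        for (String token : text.split(" ")) {
            if (token.matches(HASHTAG)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(hashtags("big win today #win #women yay"));  // [#win, #women]
    }
}
```

Note that String.matches() requires the whole token to match, which is why tokenizing first and then matching each token works here.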
The full source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagCount.java.
This compiled class will be in the JAR file we built with Gradle earlier, so now we execute HashTagCount with the following command:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.HashTagCount twitter.txt output

Let's examine the output as before:

$ hdfs dfs -cat output/part-r-00000
You should see output similar to the following:

#whey	1
#willpower	1
#win	2
#winterblues	1
#winterstorm	1
#wipolitics	1
#women	6
#woodgrain	1

Each line is composed of a hashtag and the number of times it appears in the tweets dataset. As you can see, the MapReduce job orders results by key. If we want to find the most mentioned topics, we need to order the result set. The naïve approach would be to perform a total ordering of the aggregated values and select the top 10.
If the output dataset is small, we can pipe it to standard output and sort it using the sort utility:

$ hdfs dfs -cat output/part-r-00000 | sort -k2 -n -r | head -n 10

Another solution would be to write another MapReduce job to traverse the whole result set and sort by value. When data becomes large, this type of global sorting can become quite expensive. In the following section, we will illustrate an efficient design pattern to sort aggregated data.
The Top N pattern
In the Top N pattern, we keep data sorted in a local data structure. Each mapper calculates a list of the top N records in its split and sends its list to the reducer. A single reducer task then finds the top N global records.
We will apply this design pattern to implement a TopTenHashTag job that finds the top ten topics in our dataset. The job takes as input the output data generated by HashTagCount and returns a list of the ten most frequently mentioned hashtags.
In TopTenMapper, we use a TreeMap to keep a sorted list, in ascending order, of hashtags. The key of this map is the number of occurrences; the value is a tab-separated string of hashtags and their frequency. In map(), for each value, we update the topN map. When topN has more than ten items, we remove the smallest:
public static class TopTenMapper extends Mapper<Object, Text,
        NullWritable, Text> {
    private TreeMap<Integer, Text> topN = new TreeMap<Integer, Text>();
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws
            IOException, InterruptedException {
        String[] words = value.toString().split("\t");
        if (words.length < 2) {
            return;
        }
        topN.put(Integer.parseInt(words[1]), new Text(value));
        if (topN.size() > 10) {
            topN.remove(topN.firstKey());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        for (Text t : topN.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}
Wedon’temitanykey/valueinthemapfunction.Weimplementacleanup()methodthat,oncethemapperhasconsumedallitsinput,emitsthe(hashtag,count)valuesintopN.WeuseaNullWritablekeybecausewewantallvaluestobeassociatedwiththesamekeysothatwecanperformaglobalorderoverallmappers’topnlists.Thisimpliesthatourjobwillexecuteonlyonereducer.
Thereducerimplementslogicsimilartowhatwehaveinmap().WeinstantiateTreeMapanduseittokeepanorderedlistofthetop10values:
public static class TopTenReducer extends
        Reducer<NullWritable, Text, NullWritable, Text> {
    private TreeMap<Integer, Text> topN = new TreeMap<Integer, Text>();

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context
            context) throws IOException, InterruptedException {
        for (Text value : values) {
            String[] words = value.toString().split("\t");
            topN.put(Integer.parseInt(words[1]),
                    new Text(value));
            if (topN.size() > 10) {
                topN.remove(topN.firstKey());
            }
        }

        for (Text word : topN.descendingMap().values()) {
            context.write(NullWritable.get(), word);
        }
    }
}
Finally, we traverse topN in descending order to generate the list of trending topics.

Note

Note that in this implementation, we override hashtags that have a frequency value already present in the TreeMap when calling topN.put(). Depending on the use case, it's advisable to use a different data structure, such as the ones offered by the Guava library (https://code.google.com/p/guava-libraries/), or to adjust the updating strategy.
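The collision issue mentioned in the note is easy to demonstrate in plain Java, and one possible workaround (an illustrative assumption, not the book's implementation) is to key the TreeMap on a count-plus-tag composite so that equal counts no longer overwrite each other:

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

public class TopNDemo {
    // Keying on the bare count: of two tags with the same frequency, only one survives.
    static TreeMap<Integer, String> byCountOnly(Map<String, Integer> freqs) {
        TreeMap<Integer, String> topN = new TreeMap<>();
        freqs.forEach((tag, count) -> topN.put(count, tag));
        return topN;
    }

    // Composite "count\ttag" key, ordered by count then tag: ties are preserved.
    static TreeMap<String, Integer> byCompositeKey(Map<String, Integer> freqs) {
        Comparator<String> byCountThenTag = Comparator
                .comparingInt((String k) -> Integer.parseInt(k.split("\t")[0]))
                .thenComparing(k -> k.split("\t")[1]);
        TreeMap<String, Integer> topN = new TreeMap<>(byCountThenTag);
        freqs.forEach((tag, count) -> topN.put(count + "\t" + tag, count));
        return topN;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = Map.of("#win", 2, "#women", 2, "#whey", 1);
        System.out.println(byCountOnly(freqs).size());      // 2 -- one of the tied tags was lost
        System.out.println(byCompositeKey(freqs).size());   // 3 -- all entries kept
    }
}
```

The same eviction logic (remove firstKey() once the map exceeds ten entries) works unchanged with the composite key, since the smallest composite key still belongs to the lowest count.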
In the driver, we enforce a single reducer by setting job.setNumReduceTasks(1). Run the job with:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.TopTenHashTag \
output/part-r-00000 \
top-ten

We can inspect the top ten to list trending topics:

$ hdfs dfs -cat top-ten/part-r-00000
#Stalker48	150
#gameinsight	55
#12M	52
#KCA	46
#LORDJASONJEROME	29
#Valencia	19
#LesAnges6	16
#VoteLuan	15
#hadoop2	12
#Gameinsight	11
The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/TopTenHashTag.java.
Sentiment of hashtags

The process of identifying subjective information in a data source is commonly referred to as sentiment analysis. In the previous example, we showed how to detect trending topics in a social network; we'll now analyze the text shared around those topics to determine whether it expresses a mostly positive or negative sentiment.

A list of positive and negative words for the English language, a so-called opinion lexicon, can be found at http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar.

Note

These resources, and many more, have been collected by Prof. Bing Liu's group at the University of Illinois at Chicago and have been used, among others, in Bing Liu, Minqing Hu, and Junsheng Cheng, "Opinion Observer: Analyzing and Comparing Opinions on the Web," Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
Inthisexample,we’llpresentabag-of-wordsmethodthat,althoughsimplisticinnature,canbeusedasabaselinetomineopinionintext.Foreachtweetandeachhashtag,wewillcountthenumberoftimesapositiveoranegativewordappearsandnormalizethiscountbythetextlength.
NoteThebag-of-wordsmodelisanapproachusedinNaturalLanguageProcessingandInformationRetrievaltorepresenttextualdocuments.Inthismodel,textisrepresentedasthesetorbag—withmultiplicity—ofitswords,disregardinggrammarandmorphologicalpropertiesandevenwordorder.
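The per-tweet score just described (positive minus negative word count, normalized by length) reduces to a few lines of plain Java; in this sketch, the tiny word sets are stand-ins for the opinion lexicon, and the class name is hypothetical:

```java
import java.util.Set;

public class BagOfWordsSentiment {
    // Miniature stand-ins for the positive/negative opinion lexicon files.
    static final Set<String> POSITIVE = Set.of("good", "great", "win");
    static final Set<String> NEGATIVE = Set.of("bad", "sad", "lose");

    // (positive count - negative count) / total words, as in the mapper/reducer pair below.
    static double score(String tweet) {
        String[] words = tweet.toLowerCase().split(" ");
        int diff = 0;
        for (String w : words) {
            if (POSITIVE.contains(w)) diff++;
            else if (NEGATIVE.contains(w)) diff--;
        }
        return (double) diff / words.length;
    }

    public static void main(String[] args) {
        System.out.println(score("great win today"));    // positive score
        System.out.println(score("bad sad day today"));  // -0.5
    }
}
```

Normalizing by length keeps a long rambling tweet from dominating a short, strongly worded one when scores are later aggregated per hashtag.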
Uncompress the archive and place the word lists into HDFS with the following command lines:

$ hdfs dfs -put positive-words.txt <destination>
$ hdfs dfs -put negative-words.txt <destination>
In the Mapper class, we define two objects that will hold the word lists, positiveWords and negativeWords, as Set<String>:

private Set<String> positiveWords = null;
private Set<String> negativeWords = null;

We override the default setup() method of the Mapper so that a list of positive and negative words, specified by two configuration properties, job.positivewords.path and job.negativewords.path, is read from HDFS using the filesystem API we discussed in the previous chapter. We could have also used the DistributedCache to share this data across the cluster. The helper method, parseWordsList, reads a list of words, strips out comments, and loads the words into a HashSet<String>:
private HashSet<String> parseWordsList(FileSystem fs, Path wordsListPath)
{
    HashSet<String> words = new HashSet<String>();
    try {
        if (fs.exists(wordsListPath)) {
            FSDataInputStream fi = fs.open(wordsListPath);
            BufferedReader br =
                    new BufferedReader(new InputStreamReader(fi));
            String line = null;
            while ((line = br.readLine()) != null) {
                if (line.length() > 0 && !line.startsWith(BEGIN_COMMENT)) {
                    words.add(line);
                }
            }
            fi.close();
        }
    }
    catch (IOException e) {
        e.printStackTrace();
    }
    return words;
}
In the Mapper step, for each hashtag in the tweet, we emit the overall sentiment of the tweet (simply the positive word count minus the negative word count) and the length of the tweet.
We'll use these in the reducer to calculate an overall sentiment ratio, weighted by the length of the tweets, to estimate the sentiment expressed by a tweet on a hashtag, as follows:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] words = value.toString().split(" ");
    Integer positiveCount = new Integer(0);
    Integer negativeCount = new Integer(0);
    Integer wordsCount = new Integer(0);

    for (String str : words)
    {
        if (str.matches(HASHTAG_PATTERN)) {
            hashtags.add(str);
        }
        if (positiveWords.contains(str)) {
            positiveCount += 1;
        } else if (negativeWords.contains(str)) {
            negativeCount += 1;
        }
        wordsCount += 1;
    }

    Integer sentimentDifference = 0;
    if (wordsCount > 0) {
        sentimentDifference = positiveCount - negativeCount;
    }

    String stats;
    for (String hashtag : hashtags) {
        word.set(hashtag);
        stats = String.format("%d %d", sentimentDifference,
                wordsCount);
        context.write(word, new Text(stats));
    }
}
}
In the Reducer step, we add together the sentiment scores given to each instance of the hashtag and divide by the total size of all the tweets in which it occurred:
public static class HashTagSentimentReducer
        extends Reducer<Text, Text, Text, DoubleWritable> {
    public void reduce(Text key, Iterable<Text> values,
            Context context
            ) throws IOException, InterruptedException {
        double totalDifference = 0;
        double totalWords = 0;
        for (Text val : values) {
            String[] parts = val.toString().split(" ");
            totalDifference += Double.parseDouble(parts[0]);
            totalWords += Double.parseDouble(parts[1]);
        }
        context.write(key,
                new DoubleWritable(totalDifference / totalWords));
    }
}
The full source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagSentiment.java.
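The reducer's weighted average can be sanity-checked in plain Java; given per-tweet (sentimentDifference, wordsCount) pairs for one hashtag, the polarity is the sum of differences over the sum of lengths (the class name here is hypothetical):

```java
public class PolarityCheck {
    // Aggregate per-tweet (difference, length) pairs exactly as the reducer does.
    static double polarity(int[][] pairs) {
        double totalDifference = 0;
        double totalWords = 0;
        for (int[] p : pairs) {
            totalDifference += p[0];  // positive minus negative words in one tweet
            totalWords += p[1];       // length of that tweet
        }
        return totalDifference / totalWords;
    }

    public static void main(String[] args) {
        // Two tweets mentioning the same hashtag: +2 over 10 words, then -1 over 10 words.
        System.out.println(polarity(new int[][]{{2, 10}, {-1, 10}}));  // 0.05
    }
}
```

Because the division happens once over the summed totals, longer tweets contribute proportionally more weight than short ones, which is the intended weighting.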
Having built the JAR file as before, execute HashTagSentiment with the following command:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.HashTagSentiment twitter.txt output-sentiment \
<positive words> <negative words>
You can examine the output with the following command:

$ hdfs dfs -cat output-sentiment/part-r-00000

You should see an output similar to the following:
#1068	0.011861271213042056
#10YearsOfLove	0.012285135487494233
#11	0.011941109121333999
#12	0.011938693593171155
#12F	0.012339242266249566
#12M	0.011864286953783268
#12MCalleEnPazYaTeVasNicolas
In the preceding output, each line is composed of a hashtag and the sentiment polarity associated with it. This number is a heuristic that tells us whether a hashtag is associated mostly with positive (polarity > 0) or negative (polarity < 0) sentiment, and the magnitude of such a sentiment: the higher or lower the number, the stronger the sentiment.
Text cleanup using chain mapper

In the examples presented until now, we ignored a key step of essentially every application built around text processing, which is the normalization and cleanup of the input data. Three common components of this normalization step are:

- Changing the letter case to either lower or upper
- Removal of stop words
- Stemming
In this section, we will show how the ChainMapper class, found at org.apache.hadoop.mapreduce.lib.chain.ChainMapper, allows us to sequentially combine a series of Mappers as the first step of a data cleanup pipeline. Mappers are added to the configured job using the following:

ChainMapper.addMapper(
    Job job,
    Class<? extends Mapper> klass,
    Class<?> inputKeyClass,
    Class<?> inputValueClass,
    Class<?> outputKeyClass,
    Class<?> outputValueClass,
    Configuration mapperConf)

The static method, addMapper, requires the following arguments to be passed:

- job: the Job to add the Mapper class to
- klass: the Mapper class to add
- inputKeyClass: the mapper input key class
- inputValueClass: the mapper input value class
- outputKeyClass: the mapper output key class
- outputValueClass: the mapper output value class
- mapperConf: a Configuration with the configuration for the Mapper class
In this example, we will take care of the first item listed above: before computing the sentiment of each tweet, we will convert to lowercase each word present in its text. This will allow us to more accurately ascertain the sentiment of hashtags by ignoring differences in capitalization across tweets.
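Conceptually, ChainMapper is function composition over key/value streams. This plain-Java sketch (hypothetical names, no Hadoop types) chains a lowercasing step into a tokenizing step, the way LowerCaseMapper will feed HashTagSentimentMapper below:

```java
import java.util.List;
import java.util.function.Function;

public class ChainSketch {
    // Stage 1: normalize case, as LowerCaseMapper does.
    static final Function<String, String> LOWERCASE = s -> s.toLowerCase();
    // Stage 2: tokenize, as a downstream mapper might.
    static final Function<String, List<String>> TOKENIZE = s -> List.of(s.split(" "));

    // The chain: the output of the first stage becomes the input of the second.
    static final Function<String, List<String>> PIPELINE = LOWERCASE.andThen(TOKENIZE);

    public static void main(String[] args) {
        System.out.println(PIPELINE.apply("Great WIN Today"));  // [great, win, today]
    }
}
```

As with ChainMapper, the output type of each stage must match the input type of the next; the compiler enforces it here, while in Hadoop a mismatch only surfaces at runtime.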
First of all, we define a new Mapper, LowerCaseMapper, whose map() function calls Java String's toLowerCase() method on its input value and emits the lowercased text:

public class LowerCaseMapper extends Mapper<LongWritable, Text,
        IntWritable, Text> {
    private Text lowercased = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        lowercased.set(value.toString().toLowerCase());
        context.write(new IntWritable(1), lowercased);
    }
}
In the HashTagSentimentChain driver, we configure the Job object so that both Mappers will be chained together and executed:
public class HashTagSentimentChain
        extends Configured implements Tool
{
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();

        // location (on hdfs) of the positive words list
        conf.set("job.positivewords.path", args[2]);
        conf.set("job.negativewords.path", args[3]);

        Job job = Job.getInstance(conf);
        job.setJarByClass(HashTagSentimentChain.class);

        Configuration lowerCaseMapperConf = new Configuration(false);
        ChainMapper.addMapper(job,
                LowerCaseMapper.class,
                LongWritable.class, Text.class,
                IntWritable.class, Text.class,
                lowerCaseMapperConf);

        Configuration hashTagSentimentConf = new Configuration(false);
        ChainMapper.addMapper(job,
                HashTagSentiment.HashTagSentimentMapper.class,
                IntWritable.class, Text.class,
                Text.class, Text.class,
                hashTagSentimentConf);

        job.setReducerClass(HashTagSentiment.HashTagSentimentReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(
                new HashTagSentimentChain(), args);
        System.exit(exitCode);
    }
}
The LowerCaseMapper and HashTagSentimentMapper classes are invoked in a pipeline, where the output of the first becomes the input of the second. The output of the last Mapper will be written to the task's output. An immediate benefit of this design is a reduction of disk I/O operations. Mappers do not need to be aware that they are chained.
It’sthereforepossibletoreusespecializedMappersthatcanbecombinedwithinasingletask.NotethatthispatternassumesthatallMappers—andtheReduce—usematchingoutputandinput(key,value)pairs.NocastingorconversionisdonebyChainMapperitself.
Finally,noticethattheaddMappercallforthelastmapperinthechainspecifiestheoutputkey/valueclassesapplicabletothewholemapperpipelinewhenusedasacomposite.
Thefullsourcecodeofthisexamplecanbefoundathttps://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagSentimentChain.java
Execute HashTagSentimentChain with the command:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.HashTagSentimentChain twitter.txt output \
<positive words> <negative words>
You should see an output similar to the previous example. Notice that this time, the hashtag in each line is lowercased.
Walking through a run of a MapReduce job

To explore the relationship between mapper and reducer in more detail, and to expose some of Hadoop's inner workings, we'll now go through how a MapReduce job is executed. This applies to MapReduce in both Hadoop 1 and Hadoop 2, even though the latter is implemented very differently using YARN, which we'll discuss later in this chapter. Additional information on the services described in this section, as well as suggestions for troubleshooting MapReduce applications, can be found in Chapter 10, Running a Hadoop Cluster.
Startup

The driver is the only piece of code that runs on our local machine, and the call to Job.waitForCompletion() starts the communication with the JobTracker, which is the master node in the MapReduce system. The JobTracker is responsible for all aspects of job scheduling and execution, so it becomes our primary interface when performing any task related to job management.
To share resources on the cluster, the JobTracker can use one of several scheduling approaches to handle incoming jobs. The general model is to have a number of queues to which jobs can be submitted, along with policies to assign resources across the queues. The most commonly used implementations of these policies are the Capacity and Fair Schedulers.
The JobTracker communicates with the NameNode on our behalf and manages all interactions relating to the data stored on HDFS.
Splitting the input

The first of these interactions happens when the JobTracker looks at the input data and determines how to assign it to map tasks. Recall that HDFS files are usually split into blocks of at least 64 MB, and the JobTracker will assign each block to one map task. Our WordCount example, of course, used a trivial amount of data that was well within a single block. Picture a much larger input file measured in terabytes, and the split model makes more sense. Each segment of the file, or split, in MapReduce terminology, is processed uniquely by one map task. Once it has computed the splits, the JobTracker places them, and the JAR file containing the Mapper and Reducer classes, into a job-specific directory on HDFS, whose path will be passed to each task as it starts.
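The number of map tasks therefore follows directly from file size and block size. As a back-of-the-envelope sketch (assuming one split per block and a 64 MB block size):

```java
public class SplitCount {
    // One map task per block: the ceiling of fileSize / blockSize.
    static long splits(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;          // 64 MB block
        long fileSize = 1024L * 1024 * 1024 * 1024;  // a 1 TB input file
        System.out.println(splits(fileSize, blockSize));  // 16384 map tasks
    }
}
```

A terabyte-scale input thus fans out into thousands of map tasks, which is why the scheduling and locality decisions described next matter so much.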
Task assignment

The TaskTracker service is responsible for allocating resources and for executing and tracking the status of the map and reduce tasks running on a node. Once the JobTracker has determined how many map tasks will be needed, it looks at the number of hosts in the cluster, how many TaskTrackers are working, and how many map tasks each can concurrently execute (a user-definable configuration variable). The JobTracker also looks to see where the various input data blocks are located across the cluster and attempts to define an execution plan that maximizes the cases when a TaskTracker processes a split/block located on the same physical host or, failing that, processes at least one in the same hardware rack. This data locality optimization is a huge reason behind Hadoop's ability to efficiently process such large datasets. Recall also that, by default, each block is replicated across three different hosts, so the likelihood of producing a task/host plan that sees most blocks processed locally is higher than it might seem at first.
Task startup

Each TaskTracker then starts up a separate Java virtual machine to execute the tasks. This does add a startup time penalty, but it isolates the TaskTracker from problems caused by misbehaving map or reduce tasks, and the JVM can be configured to be shared between subsequently executed tasks.
If the cluster has enough capacity to execute all the map tasks at once, they will all be started and given a reference to the split they are to process and the job JAR file. If there are more tasks than the cluster capacity, the JobTracker will keep a queue of pending tasks and assign them to nodes as they complete their initially assigned map tasks.
We are now ready to see the execution of the map tasks. If all this sounds like a lot of work, it is; it explains why, when running any MapReduce job, there is always a non-trivial amount of time taken as the system gets started and performs all these steps.
Ongoing JobTracker monitoring

The JobTracker doesn't just stop work now and wait for the TaskTrackers to execute all the mappers and reducers. It's constantly exchanging heartbeat and status messages with the TaskTrackers, looking for evidence of progress or problems. It also collects metrics from the tasks throughout the job execution, some provided by Hadoop and others specified by the developer of the map and reduce tasks, although we don't use any in this example.
Mapper input

The driver class specifies the format and structure of the input file using TextInputFormat, and from this, Hadoop knows to treat this as text with the byte offset as the key and line contents as the value. Assume that our dataset contains the following text:
This is a test
Yes it is
The two invocations of the mapper will therefore be given the following input, with the byte offset of each line as the key:

(0, This is a test)
(15, Yes it is)
Mapper execution

The key/value pairs received by the mapper are the offset in the file of the line and the line contents, respectively, because of how the job is configured. Our implementation of the map method in WordCountMapper discards the key, as we do not care where each line occurred in the file, and splits the provided value into words using the split method on the standard Java String class. Note that better tokenization could be provided by use of regular expressions or the StringTokenizer class, but for our purposes this simple approach will suffice. For each individual word, the mapper then emits a key comprised of the actual word itself, and a value of 1.
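The per-line behavior just described can be mimicked in plain Java. This is an illustrative simulation only; the real WordCountMapper extends Hadoop's Mapper class and writes to a Context rather than returning a list:

```java
import java.util.*;

public class MapStep {
    // Emit a (word, 1) pair for each whitespace-separated token in a line,
    // discarding the byte-offset key just as the mapper described above does.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0, "This is a test")); // [This=1, is=1, a=1, test=1]
    }
}
```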
Mapper output and reducer input

The output of the mapper is a series of pairs of the form (word, 1); in our example, these will be:

(This, 1), (is, 1), (a, 1), (test, 1), (Yes, 1), (it, 1), (is, 1)
These output pairs from the mapper are not passed directly to the reducer. Between mapping and reducing is the shuffle stage, where much of the magic of MapReduce occurs.
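The grouping that the shuffle performs can be simulated in plain Java. This is a sketch of the observable behavior only, not Hadoop's implementation, which also sorts, partitions, and moves data across the network:

```java
import java.util.*;

public class ShuffleStep {
    // Group (word, 1) pairs by key, as the shuffle does before handing each
    // key and its list of values to a single reduce invocation.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // shuffle also sorts keys
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : new String[]{"This", "is", "a", "test", "Yes", "it", "is"}) {
            pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        System.out.println(shuffle(pairs).get("is")); // [1, 1]
    }
}
```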
Reducer input

The reducer TaskTracker receives updates from the JobTracker that tell it which nodes in the cluster hold map output partitions that need to be processed by its local reduce task. It then retrieves these from the various nodes and merges them into a single file that will be fed to the reduce task.
Reducer execution

Our WordCountReducer class is very simple; for each word, it simply counts the number of elements in the list of values and emits the final (word, count) output for each word. For our invocation of WordCount on our sample input, all but one word has only one value in its list of values; is has two.
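The counting itself is trivial, as this plain-Java sketch shows. The real WordCountReducer sums IntWritable values through a Context; here, because every emitted value is 1, counting the grouped values gives the same total:

```java
import java.util.*;

public class ReduceStep {
    // For WordCount with a value of 1 per occurrence, counting the values
    // grouped under a word gives that word's total.
    static int reduce(String word, List<Integer> values) {
        return values.size();
    }

    public static void main(String[] args) {
        System.out.println(reduce("is", Arrays.asList(1, 1))); // 2
        System.out.println(reduce("test", Arrays.asList(1)));  // 1
    }
}
```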
Reducer output

The final set of reducer output for our example is therefore:

(This, 1), (is, 2), (a, 1), (test, 1), (Yes, 1), (it, 1)
This data will be output to partition files within the output directory specified in the driver, formatted using the specified OutputFormat implementation. Each reduce task writes to a single file with the filename part-r-nnnnn, where nnnnn starts at 00000 and is incremented.
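The naming convention is simply a zero-padded five-digit reducer index, easily reproduced with a format string:

```java
public class PartFileName {
    // Reducer output files are named part-r-00000, part-r-00001, and so on.
    static String partFileName(int reducerIndex) {
        return String.format("part-r-%05d", reducerIndex);
    }

    public static void main(String[] args) {
        System.out.println(partFileName(0));  // part-r-00000
        System.out.println(partFileName(12)); // part-r-00012
    }
}
```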
Shutdown

Once all tasks have completed successfully, the JobTracker outputs the final state of the job to the client, along with the final aggregates of some of the more important counters that it has been aggregating along the way. The full job and task history is available in the log directory on each node or, more accessibly, via the JobTracker web UI; point your browser to port 50030 on the JobTracker node.
Input/Output

We have talked about files being broken into splits as part of the job startup and the data in a split being sent to the mapper implementation. However, this overlooks two aspects: how the data is stored in the file and how the individual keys and values are passed to the mapper.
InputFormat and RecordReader

Hadoop has the concept of InputFormat for the first of these responsibilities. The InputFormat abstract class in the org.apache.hadoop.mapreduce package provides two methods, as shown in the following code:

public abstract class InputFormat<K, V>
{
    public abstract List<InputSplit> getSplits(JobContext context);
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
        TaskAttemptContext context);
}
These methods display the two responsibilities of the InputFormat class:

- To provide details on how to divide an input file into the splits required for map processing
- To create a RecordReader that will generate the series of key/value pairs from a split
The RecordReader class is also an abstract class within the org.apache.hadoop.mapreduce package:

public abstract class RecordReader<Key, Value> implements Closeable
{
    public abstract void initialize(InputSplit split,
        TaskAttemptContext context);
    public abstract boolean nextKeyValue()
        throws IOException, InterruptedException;
    public abstract Key getCurrentKey()
        throws IOException, InterruptedException;
    public abstract Value getCurrentValue()
        throws IOException, InterruptedException;
    public abstract float getProgress()
        throws IOException, InterruptedException;
    public abstract void close() throws IOException;
}
A RecordReader instance is created for each split. The framework calls nextKeyValue, which returns a Boolean indicating whether another key/value pair is available, and, if so, the getCurrentKey and getCurrentValue methods are used to access the key and value respectively.
The combination of the InputFormat and RecordReader classes is therefore all that is required to bridge between any kind of input data and the key/value pairs required by MapReduce.
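As a plain-Java analog of this contract (not Hadoop's LineRecordReader, which reads from HDFS streams and handles records spanning split boundaries), a reader over an in-memory split might look like this:

```java
public class SimpleLineReader {
    private final String[] lines;
    private int index = -1;
    private long offset = 0;
    private long currentKey;
    private String currentValue;

    SimpleLineReader(String split) {
        this.lines = split.split("\n");
    }

    // Mirrors RecordReader.nextKeyValue(): advance and report availability.
    boolean nextKeyValue() {
        index++;
        if (index >= lines.length) {
            return false;
        }
        currentKey = offset;                  // byte offset of the line start
        currentValue = lines[index];
        offset += lines[index].length() + 1;  // +1 for the newline character
        return true;
    }

    long getCurrentKey()     { return currentKey; }
    String getCurrentValue() { return currentValue; }
}
```

Walking our two-line sample with this reader yields exactly the (0, This is a test) and (15, Yes it is) pairs described in the Mapper input section.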
Hadoop-provided InputFormat

There are some Hadoop-provided InputFormat implementations within the org.apache.hadoop.mapreduce.lib.input package:

- FileInputFormat: an abstract base class that can be the parent of any file-based input
- SequenceFileInputFormat: reads the efficient binary SequenceFile format, which will be discussed in an upcoming section
- TextInputFormat: used for plain text files
- KeyValueTextInputFormat: also used for plain text files, but each line is divided into key and value parts by a separator byte
Note that input formats are not restricted to reading from files; FileInputFormat is itself a subclass of InputFormat. It's possible to have Hadoop use data that is not based on files as the input to MapReduce jobs; common sources are relational databases or column-oriented databases, such as Amazon DynamoDB or HBase.
Hadoop-provided RecordReader

Hadoop provides a few common RecordReader implementations, which are also present within the org.apache.hadoop.mapreduce.lib.input package:

- LineRecordReader: the default RecordReader class for text files; it presents the byte offset in the file as the key and the line contents as the value
- SequenceFileRecordReader: reads the key/value pairs from the binary SequenceFile container
OutputFormat and RecordWriter

There is a similar pattern for writing the output of a job, coordinated by subclasses of OutputFormat and RecordWriter from the org.apache.hadoop.mapreduce package. We won't explore these in any detail here, but the general approach is similar, although OutputFormat does have a more involved API, as it has methods for tasks such as validation of the output specification.

It's this validation step that causes a job to fail if a specified output directory already exists. If you wanted different behavior, it would require a subclass of OutputFormat that overrides this method.
Hadoop-provided OutputFormat

The following output formats are provided in the org.apache.hadoop.mapreduce.lib.output package:

- FileOutputFormat: the base class for all file-based OutputFormats
- NullOutputFormat: a dummy implementation that discards the output and writes nothing
- SequenceFileOutputFormat: writes to the binary SequenceFile format
- TextOutputFormat: writes a plain text file
Note that these classes define their required RecordWriter implementations as static nested classes, so there are no separately provided RecordWriter implementations.
Sequence files

The SequenceFile class within the org.apache.hadoop.io package provides an efficient binary file format that is often useful as an output from a MapReduce job. This is especially true if the output from the job is processed as the input of another job. Sequence files have several advantages, as follows:

- As binary files, they are intrinsically more compact than text files
- They additionally support optional compression, which can be applied at different levels, that is, to each record or to an entire block
- They can be split and processed in parallel
This last characteristic is important, as most binary formats (particularly those that are compressed or encrypted) cannot be split and must be read as a single linear stream of data. Using such files as input to a MapReduce job means that a single mapper will be used to process the entire file, causing a potentially large performance hit. In such a situation, it's preferable to use a splittable format, such as SequenceFile, or, if you cannot avoid receiving the file in another format, do a preprocessing step that converts it into a splittable format. This will be a tradeoff, as the conversion will take time, but in many cases (especially with complex map tasks) this will be outweighed by the time saved through increased parallelism.
YARN

YARN started out as part of the MapReduce v2 (MRv2) initiative but is now an independent sub-project within Hadoop (that is, it's at the same level as MapReduce). It grew out of a realization that MapReduce in Hadoop 1 conflated two related but distinct responsibilities: resource management and application execution.
Although it has enabled previously unimagined processing on enormous datasets, the MapReduce model at a conceptual level has an impact on performance and scalability. Implicit in the MapReduce model is that any application can only be composed of a series of largely linear MapReduce jobs, each of which follows a model of one or more maps followed by one or more reduces. This model is a great fit for some applications, but not all. In particular, it's a poor fit for workloads requiring very low-latency response times; the MapReduce startup times and sometimes lengthy job chains often greatly exceed the tolerance for a user-facing process. The model has also been found to be very inefficient for jobs that would more naturally be represented as a directed acyclic graph (DAG) of tasks, where the nodes on the graph are processing steps and the edges are data flows. If analyzed and executed as a DAG, the application may be performed in one step with high parallelism across the processing steps, but when viewed through the MapReduce lens, the result is usually an inefficient series of interdependent MapReduce jobs.
Numerous projects have built different types of processing atop MapReduce and, although many are wildly successful (Apache Hive and Pig are two standout examples), the close coupling of MapReduce as a processing paradigm with the job scheduling mechanism in Hadoop 1 made it very difficult for any new project to tailor either of these areas to its specific needs.
The result is Yet Another Resource Negotiator (YARN), which provides a highly capable job scheduling mechanism within Hadoop and well-defined interfaces for different processing models to be implemented within it.
YARN architecture

To understand how YARN works, it's important to stop thinking about MapReduce and how it processes data. YARN itself says nothing about the nature of the applications that run atop it; rather, it's focused on providing the machinery for the scheduling and execution of these jobs. As we'll see, YARN is just as capable of hosting long-running stream processing or low-latency, user-facing workloads as it is of hosting batch-processing workloads, such as MapReduce.
The components of YARN

YARN is comprised of two main components: the ResourceManager (RM), which manages resources across the cluster, and the NodeManager (NM), which runs on each host and manages the resources on the individual machine. The ResourceManager and NodeManagers deal with the scheduling and management of containers, an abstract notion of the memory, CPU, and I/O that will be dedicated to running a particular piece of application code. Using MapReduce as an example, when running atop YARN, the JobTracker and each TaskTracker all run in their own dedicated containers. Note, though, that in YARN, each MapReduce job has its own dedicated JobTracker; there is no single instance that manages all jobs, as in Hadoop 1.
YARN itself is responsible only for the scheduling of tasks across the cluster; all notions of application-level progress, monitoring, and fault tolerance are handled in the application code. This is a very explicit design decision; by making YARN as independent as possible, it has a very clear set of responsibilities and does not artificially constrain the types of application that can be implemented on YARN.
As the arbiter of all cluster resources, YARN has the ability to efficiently manage the cluster as a whole and not focus on application-level resource requirements. It has a pluggable scheduling policy, with the provided implementations similar to the existing Hadoop Capacity and Fair Schedulers. YARN also treats all application code as inherently untrusted, and all application management and control tasks are performed in user space.
Anatomy of a YARN application

A submitted YARN application has two components: the ApplicationMaster (AM), which coordinates the overall application flow, and the specification of the code that will run on the worker nodes. For MapReduce atop YARN, the JobTracker implements the ApplicationMaster functionality and the TaskTrackers are the application custom code deployed on the worker nodes.
As mentioned in the previous section, the responsibilities of application management, progress monitoring, and fault tolerance are pushed to the application level in YARN. It's the ApplicationMaster that performs these tasks; YARN itself says nothing about the mechanisms for communication between the ApplicationMaster and the code running in the worker containers, for example.
This genericity allows YARN applications to not be tied to Java classes. The ApplicationMaster can instead request a NodeManager to execute shell scripts, native applications, or any other type of processing that is made available on each node.
Lifecycle of a YARN application

As with MapReduce jobs in Hadoop 1, YARN applications are submitted to the cluster by a client. When a YARN application is started, the client first calls the ResourceManager (more specifically, the ApplicationsManager portion of the ResourceManager) and requests the initial container within which to execute the ApplicationMaster. In most cases, the ApplicationMaster will run from a hosted container in the cluster, just as will the rest of the application code. The ApplicationsManager communicates with the other main component of the ResourceManager, the scheduler itself, which has the ultimate responsibility of managing all resources across the cluster.
The ApplicationMaster starts up in the provided container, registers itself with the ResourceManager, and begins the process of negotiating its required resources. The ApplicationMaster communicates with the ResourceManager and requests the containers it requires. The specification of the requested containers can also include additional information, such as desired placement within the cluster and concrete resource requirements, such as a particular amount of memory or CPU.
The ResourceManager provides the ApplicationMaster with the details of the containers it has been allocated, and the ApplicationMaster then communicates with the NodeManagers to start the application-specific task for each container. This is done by providing the NodeManager with the specification of the application to be executed which, as mentioned, may be a JAR file, a script, a path to a local executable, or anything else that the NodeManager can invoke. Each NodeManager instantiates the container for the application code and starts the application based on the provided specification.
Fault tolerance and monitoring

From this point onward, the behavior is largely application specific. YARN will not manage application progress, but it does perform a number of ongoing tasks. The AMLivelinessMonitor within the ResourceManager receives heartbeats from all ApplicationMasters, and if it determines that an ApplicationMaster has failed or stopped working, it will de-register the failed ApplicationMaster and release all its allocated containers. The ResourceManager will then reschedule the application a configurable number of times.
Alongside this process, the NMLivelinessMonitor within the ResourceManager receives heartbeats from the NodeManagers and keeps track of the health of each NodeManager in the cluster. Similar to the monitoring of ApplicationMaster health, a NodeManager will be marked as dead if no heartbeats are received from it for a default time of 10 minutes, after which all its allocated containers are marked as dead and the node is excluded from future resource allocation.
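The 10-minute expiry is a configurable default; assuming the Hadoop 2 property naming from yarn-default.xml, it can be overridden in yarn-site.xml (value in milliseconds):

```xml
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <!-- 600000 ms = 10 minutes, the default NodeManager expiry -->
  <value>600000</value>
</property>
```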
At the same time, the NodeManager will actively monitor the resource utilization of each allocated container and, for those resources not constrained by hard limits, will kill containers that exceed their resource allocation.
At a higher level, the YARN scheduler will always be looking to maximize cluster utilization within the constraints of the sharing policy being employed. As with Hadoop 1, this will allow low-priority applications to use more cluster resources if contention is low, but the scheduler will then preempt these additional containers (that is, request them to be terminated) if higher-priority applications are submitted.
The rest of the responsibility for application-level fault tolerance and progress monitoring must be implemented within the application code. For MapReduce on YARN, for example, all the management of task scheduling and retries is provided at the application level and is not in any way delivered by YARN.
Thinking in layers

These last statements may suggest that writing applications to run on YARN is a lot of work, and this is true. The YARN API is quite low-level and likely intimidating for most developers who just want to run some processing tasks on their data. If all we had was YARN, and every new Hadoop application had to have its own ApplicationMaster implemented, then YARN would not look quite as interesting as it does.
What makes the picture better is that, in general, the requirement isn't to implement each and every application on YARN, but instead to use it for a smaller number of processing frameworks that provide much friendlier interfaces. The first of these was MapReduce; with it hosted on YARN, the developer writes to the usual map and reduce interfaces and is largely unaware of the YARN mechanics.
But on the same cluster, another developer may be running a job that uses a different framework with significantly different processing characteristics, and YARN will manage both at the same time.
We'll give some more detail on several YARN processing models currently available, but they run the gamut from batch processing through low-latency queries to stream and graph processing and beyond.
As the YARN experience grows, however, there are a number of initiatives to make the development of these processing frameworks easier. On the one hand, there are higher-level interfaces, such as Cloudera Kitten (https://github.com/cloudera/kitten) or Apache Twill (http://twill.incubator.apache.org/), that give friendlier abstractions above the YARN APIs. Perhaps a more significant development model, though, is the emergence of frameworks that provide richer tools to more easily construct applications with a common general class of performance characteristics.
Execution models

We have mentioned different YARN applications having distinct processing characteristics, but an emerging pattern has seen their execution models in general being a source of differentiation. By this, we refer to how the YARN application lifecycle is managed, and we identify three main types: per-job, per-session, and always-on.
Batch processing, such as MapReduce on YARN, sees the lifecycle of the MapReduce framework tied to that of the submitted application. If we submit a MapReduce job, then the JobTracker and TaskTrackers that execute it are created specifically for the job and are terminated when the job completes. This works well for batch, but if we wish to provide a more interactive model, then the startup overhead of establishing the YARN application and all its resource allocations will severely impact the user experience if every command issued suffers this penalty. A more interactive, or session-based, lifecycle will see the YARN application start and then be available to service a number of submitted requests or commands. The YARN application terminates only when the session is exited.
Finally, we have the concept of long-running applications that process continuous data streams independent of any interactive input. For these, it makes most sense for the YARN application to start and continuously process data that is retrieved through some external mechanism. The application will only exit when explicitly shut down or if an abnormal situation occurs.
YARN in the real world – Computation beyond MapReduce

The previous discussions have been a little abstract, so in this section, we will explore a few existing YARN applications to see just how they use the framework and how they provide a breadth of processing capability. Of particular interest is how the YARN frameworks take very different approaches to resource management, I/O pipelining, and fault tolerance.
The problem with MapReduce

Until now, we have looked at MapReduce in terms of its API. MapReduce in Hadoop is more than that; up until Hadoop 2, it was the default execution engine for a number of tools, among which were Hive and Pig, which we will discuss in more detail later in this book. We have seen how MapReduce applications are, in fact, chains of jobs. This very aspect is one of the biggest pain points and constraining factors of the framework. MapReduce checkpoints data to HDFS for inter-job communication:
A chain of MapReduce jobs
At the end of each reduce phase, output is written to disk so that it can then be loaded by the mappers of the next job and used as its input. This I/O overhead introduces latency, especially when we have applications that require multiple passes on a dataset (hence multiple writes). Unfortunately, this type of iterative computation is at the core of many analytics applications.
Apache Tez and Apache Spark are two frameworks that address this problem by generalizing the MapReduce paradigm. We will briefly discuss them in the remainder of this section, alongside Apache Samza, a framework that takes an entirely different approach to real-time processing.
Tez

Tez (http://tez.apache.org) is a low-level API and execution engine focused on providing low-latency processing, and is being used as the basis of the latest evolution of Hive, Pig, and several other frameworks that implement standard join, filter, merge, and group operations. Tez is an implementation and evolution of a programming model presented by Microsoft in the 2009 Dryad paper (http://research.microsoft.com/en-us/projects/dryad/). Tez is a generalization of MapReduce as dataflow that strives to achieve fast, interactive computing by pipelining I/O operations over a queue for inter-process communication. This avoids the expensive writes to disk that affect MapReduce. The API provides primitives expressing dependencies between jobs as a DAG. The full DAG is then submitted to a planner that can optimize the execution flow. The same application depicted in the preceding diagram would be executed in Tez as a single job, with I/O pipelined from reducers to reducers without HDFS writes and subsequent reads by mappers. An example can be seen in the following diagram:
A Tez DAG is a generalization of MapReduce
The canonical WordCount example can be found at https://github.com/apache/incubator-tez/blob/master/tez-mapreduce-examples/src/main/java/org/apache/tez/mapreduce/examples/WordCount.java.
DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
   .addVertex(summerVertex)
   .addEdge(new Edge(tokenizerVertex, summerVertex,
       edgeConf.createDefaultEdgeProperty()));
Even though the graph topology (the dag object) can be expressed with a few lines of code, the boilerplate required to execute the job is considerable. This code handles many of the low-level scheduling and execution responsibilities, including fault tolerance. When Tez detects a failed task, it walks back up the processing graph to find the point from which to re-execute the failed tasks.
Hive-on-Tez

Hive 0.13 is the first high-profile project to use Tez as its execution engine. We'll discuss Hive in a lot more detail in Chapter 7, Hadoop and SQL, but for now we will just touch on how it's implemented on YARN.
Hive (http://hive.apache.org) is an engine for querying data stored on HDFS through standard SQL syntax. It has been enormously successful, as this type of capability greatly reduces the barriers to starting analytic exploration of data in Hadoop.
In Hadoop 1, Hive had no choice but to implement its SQL statements as a series of MapReduce jobs. When SQL is submitted to Hive, it generates the required MapReduce jobs behind the scenes and executes them on the cluster. This approach has two main drawbacks: there is a non-trivial startup penalty each time, and the constrained MapReduce model means that seemingly simple SQL statements are often translated into a lengthy series of multiple dependent MapReduce jobs. This is an example of the type of processing more naturally conceptualized as a DAG of tasks, as described earlier in this chapter.
Although some benefits are achieved when Hive executes its MapReduce jobs within YARN, the major benefits come in Hive 0.13, when the project is fully re-implemented using Tez. By exploiting the Tez APIs, which are focused on providing low-latency processing, Hive gains even more performance while making its code base simpler.
Since Tez treats its workloads as DAGs, which provide a much better fit for translated SQL queries, Hive on Tez can perform any SQL statement as a single job with maximized parallelism.
Tez helps Hive support interactive queries by providing an always-running service instead of requiring the application to be instantiated from scratch for each SQL submission. This is important because, even though queries that process huge data volumes will simply take some time, the goal is for Hive to become less of a batch tool and instead to be as much of an interactive tool as possible.
Apache Spark

Spark (http://spark.apache.org) is a processing framework that excels at iterative and near real-time processing. Created at UC Berkeley, it has been donated as an Apache project. Spark provides an abstraction that allows data in Hadoop to be viewed as a distributed data structure upon which a series of operations can be performed. The framework is based on the same concepts Tez draws inspiration from (Dryad), but excels with jobs that allow data to be held and processed in memory, and it can very efficiently schedule processing on the in-memory dataset across the cluster. Spark automatically controls replication of data across the cluster, ensuring that each element of the distributed dataset is held in memory on at least two machines, and provides replication-based fault tolerance somewhat akin to HDFS.

Spark started as a standalone system, but was ported to also run on YARN as of its 0.8 release. Spark is particularly interesting because, although its classic processing model is batch-oriented, it provides an interactive frontend through the Spark shell and near real-time processing of data streams through the Spark Streaming sub-project. Spark is different things to different people; it’s both a high-level API and an execution engine. At the time of writing, ports of Hive and Pig to Spark are in progress.
Apache Samza

Samza (http://samza.apache.org) is a stream-processing framework developed at LinkedIn and donated to the Apache Software Foundation. Samza processes conceptually infinite streams of data, which are seen by the application as a series of messages.

Samza currently integrates most tightly with Apache Kafka (http://kafka.apache.org), although it does have a pluggable architecture. Kafka itself is a messaging system that excels at large data volumes and provides a topic-based abstraction similar to most other messaging platforms, such as RabbitMQ. Publishers send messages to topics, and interested clients consume messages from the topics as they arrive. Kafka has multiple aspects that set it apart from other messaging platforms, but for this discussion, the most interesting one is that Kafka stores messages for a period of time, which allows messages in topics to be replayed. Topics are partitioned across multiple hosts, and partitions can be replicated across hosts to protect from node failure.

Samza builds its processing flow on its concept of streams, which, when using Kafka, map directly to Kafka partitions. A typical Samza job may listen to one topic for incoming messages, perform some transformations, and then write the output to a different topic. Multiple Samza jobs can then be composed to provide more complex processing structures.

As a YARN application, the Samza ApplicationMaster monitors the health of all running Samza tasks. If a task fails, then a replacement task is instantiated in a new container. Samza achieves fault tolerance by having each task write its progress to a new stream (again modeled as a Kafka topic), so any replacement task just needs to read the latest task state from this checkpoint topic and then replay the main message topic from the last processed position. Samza additionally offers support for local task state, which can be very useful for join and aggregation type workloads. This local state is again built atop the stream abstraction and hence is intrinsically resilient to host failure.
YARN-independent frameworks

An interesting point to note is that two of the preceding projects (Samza and Spark) run atop YARN but are not specific to YARN. Spark started out as a standalone service and has implementations for other schedulers, such as Apache Mesos, as well as the ability to run on Amazon EC2. Though Samza runs only on YARN today, its architecture is explicitly not YARN-specific, and there are discussions about providing realizations on other platforms.

If the YARN model of pushing as much as possible into the application has its downsides in implementation complexity, then this decoupling is one of its major benefits. An application written to use YARN need not be tied to it; by definition, all the functionality for the actual application logic and management is encapsulated within the application code and is independent of YARN or any other framework. This is, of course, not to say that designing a scheduler-independent application is a trivial task, but it’s now a tractable task; this was absolutely not the case pre-YARN.
YARN today and beyond

Though YARN has been used in production (at Yahoo! in particular) for some time, the final GA version was not released until late 2013. The interfaces to YARN were also somewhat fluid until quite late in the development cycle. Consequently, the fully forward-compatible YARN as of Hadoop 2.2 is still relatively new.
YARN is fully functional today, and the future direction will see extensions to its current capabilities. Perhaps most notable among these will be the ability to specify and control container resources on more dimensions. Currently, only location, memory, and CPU specifications are possible, and this will be expanded into areas such as storage and network I/O.

In addition, the ApplicationMaster currently has little control over the management of how containers are co-located or not. Finer-grained control here will allow the ApplicationMaster to specify policies around when containers may or may not be scheduled on the same node. In addition, the current resource allocation model is quite static, and it will be useful to allow an application to dynamically change the resources allocated to a running container.
Summary

This chapter explored how to process the large volumes of data that we discussed so much in the previous chapter. In particular, we covered:

- How MapReduce was the only processing model available in Hadoop 1, and its conceptual model
- The Java API to MapReduce, and how to use this to build some examples, from a word count to sentiment analysis of Twitter hashtags
- The details of how MapReduce is implemented in practice, and a walkthrough of the execution of a MapReduce job
- How Hadoop stores data, and the classes involved to represent input and output formats and record readers and writers
- The limitations of MapReduce that led to the development of YARN, opening the door to multiple computational models on the Hadoop platform
- The YARN architecture and how applications are built atop it
In the next two chapters, we will move away from strictly batch processing and delve into the world of near real-time and iterative processing, using two of the YARN-hosted frameworks we introduced in this chapter, namely Samza and Spark.
Chapter 4. Real-time Computation with Samza

The previous chapter discussed YARN, and frequently mentioned the breadth of computational models and processing frameworks outside of traditional batch-based MapReduce that it enables on the Hadoop platform. In this chapter and the next, we will explore two such projects in some depth, namely Apache Samza and Apache Spark. We chose these frameworks as they demonstrate the usage of stream and iterative processing and also provide interesting mechanisms to combine processing paradigms. In this chapter we will explore Samza and cover the following topics:

- What Samza is and how it integrates with YARN and other projects such as Apache Kafka
- How Samza provides a simple callback-based interface for stream processing
- How Samza composes multiple stream processing jobs into more complex workflows
- How Samza supports persistent local state within tasks, and how this greatly enriches what it can enable
Stream processing with Samza

To explore a pure stream-processing platform, we will use Samza, which is available at https://samza.apache.org. The code shown here was tested with the current 0.8 release, and we’ll keep the GitHub repository updated as the project continues to evolve.

Samza was built at LinkedIn and donated to the Apache Software Foundation in September 2013. Over the years, LinkedIn has built a model that conceptualizes much of their data as streams, and from this they saw the need for a framework that can provide a developer-friendly mechanism to process these ubiquitous data streams.
The team at LinkedIn realized that when it came to data processing, much of the attention went to the extreme ends of the spectrum: RPC workloads, which are usually implemented as synchronous systems with very low latency requirements, and batch systems, where the periodicity of jobs is often measured in hours. The ground in between has been relatively poorly supported, and this is the area that Samza is targeted at; most of its jobs expect response times ranging from milliseconds to minutes. They also assume that data arrives in a theoretically infinite stream of continuous messages.
How Samza works

There are numerous stream-processing systems in the open source world, such as Storm (http://storm.apache.org), and many other (mostly commercial) tools, such as complex event processing (CEP) systems, that also target processing on continuous message streams. These systems have many similarities but also some major differences.

For Samza, perhaps the most significant difference is its assumptions about message delivery. Many systems work very hard to reduce the latency of each message, sometimes with an assumption that the goal is to get the message into and out of the system as fast as possible. Samza assumes almost the opposite; its streams are persistent and resilient, and any message written to a stream can be re-read for a period of time after its first arrival. As we will see, this gives significant capability around fault tolerance. Samza also builds on this model to allow each of its tasks to hold resilient local state.

Samza is mostly implemented in Scala, even though its public APIs are written in Java. We’ll show Java examples in this chapter, but any JVM language can be used to implement Samza applications. We’ll discuss Scala when we explore Spark in the next chapter.
Samza high-level architecture

Samza views the world as having three main layers or components: the streaming, execution, and processing layers.

Samza architecture

The streaming layer provides access to the data streams, both for consumption and publication. The execution layer provides the means by which Samza applications can be run, have resources such as CPU and memory allocated, and have their lifecycles managed. The processing layer is the actual Samza framework itself, and its interfaces allow per-message functionality.

Samza provides pluggable interfaces to support the first two layers, though the current main implementations use Kafka for streaming and YARN for execution. We’ll discuss these further in the following sections.
Samza’s best friend – Apache Kafka

Samza itself does not implement the actual message stream. Instead, it provides an interface for a message system with which it then integrates. The default stream implementation is built upon Apache Kafka (http://kafka.apache.org), a messaging system also built at LinkedIn but now a successful and widely adopted open source project.

Kafka can be viewed as a message broker akin to something like RabbitMQ or ActiveMQ, but as mentioned earlier, it writes all messages to disk and scales out across multiple hosts as a core part of its design. Kafka uses the concept of a publish/subscribe model through named topics, to which producers write messages and from which consumers read messages. These work much like topics in any other messaging system.

Because Kafka writes all messages to disk, it might not have the same ultra-low-latency message throughput as other messaging systems, which focus on getting the message processed as fast as possible and don’t aim to store the message long term. Kafka can, however, scale exceptionally well, and its ability to replay a message stream can be extremely useful. For example, if a consuming client fails, then it can re-read messages from a known good point in time, or if a downstream algorithm changes, then traffic can be replayed to utilize the new functionality.

When scaling across hosts, Kafka partitions topics and supports partition replication for fault tolerance. Each Kafka message has a key associated with the message, and this is used to decide to which partition a given message is sent. This allows semantically useful partitioning; for example, if the key is a user ID in the system, then all messages for a given user will be sent to the same partition. Kafka guarantees ordered delivery within each partition, so that any client reading a partition can know that they are receiving all messages for each key in that partition in the order in which they are written by the producer.
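The key-to-partition mapping and the per-partition ordering guarantee can be sketched in a few lines (a toy model in Python; Kafka’s real partitioner is pluggable, and this is not its implementation):

```python
# Toy model of key-based topic partitioning; not Kafka's actual code.
import zlib

class PartitionedTopic:
    def __init__(self, num_partitions):
        # Each partition is an append-only, ordered log of messages.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # A stable hash of the key chooses the partition, so every
        # message with the same key lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, message):
        p = self.partition_for(key)
        self.partitions[p].append(message)
        return p

topic = PartitionedTopic(num_partitions=4)
for i in range(3):
    topic.send("user-42", "event-%d" % i)

# All of user-42's messages are in a single partition, in send order.
print(topic.partitions[topic.partition_for("user-42")])
```

Because ordering is only guaranteed within a partition, a consumer that needs all events for a key in order simply reads the partition that key hashes to.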
Samza periodically writes out checkpoints of the position up to which it has read in all the streams it is consuming. These checkpoint messages are themselves written to a Kafka topic. Thus, when a Samza job starts up, each task can re-read its checkpoint stream to know from which position in the stream to start processing messages. This means that, in effect, Kafka also acts as a buffer; if a Samza job crashes or is taken down for upgrade, no messages will be lost. Instead, the job will just restart from the last checkpointed position when it restarts. This buffer functionality is also important, as it makes it easier for multiple Samza jobs to run as part of a complex workflow. When Kafka topics are the points of coordination between the jobs, one job might consume a topic being written to by another; in such cases, Kafka can help smooth out issues caused by any given job running slower than others. Traditionally, the backpressure caused by a slow-running job can be a real issue in a system composed of multiple job stages, but Kafka as the resilient buffer allows each job to read and write at its own rate. Note that this is analogous to how multiple coordinating MapReduce jobs will use HDFS for similar purposes.
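The checkpoint mechanism amounts to resumable consumption of a replayable log, which a short sketch can make concrete (a toy Python model, not Samza’s implementation; the dictionary stands in for the checkpoint topic):

```python
# Toy model of checkpointed consumption from a persistent log; not Samza code.
log = ["msg-%d" % i for i in range(10)]  # the topic: a replayable, ordered log
checkpoints = {}                         # stands in for the checkpoint topic

def run_task(task_name, log, stop_after=None):
    """Resume from the last checkpointed offset and process messages,
    checkpointing progress as we go."""
    start = checkpoints.get(task_name, 0)
    end = len(log) if stop_after is None else min(stop_after, len(log))
    processed = []
    for offset in range(start, end):
        processed.append(log[offset])         # process the message
        checkpoints[task_name] = offset + 1   # record progress
    return processed

first = run_task("parser", log, stop_after=4)  # job "crashes" after 4 messages
second = run_task("parser", log)               # restart resumes at offset 4
print(first)   # → ['msg-0', 'msg-1', 'msg-2', 'msg-3']
print(second)  # → ['msg-4', 'msg-5', 'msg-6', 'msg-7', 'msg-8', 'msg-9']
```

Nothing is lost across the restart, and nothing before the checkpoint is re-read; the log absorbs the downtime, which is exactly the buffering role described above.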
Kafka provides at-least-once message delivery semantics; that is to say, any message written to Kafka is guaranteed to be available to a client of the particular partition. If a job fails between checkpoints, however, the messages processed since the last checkpoint will be reprocessed on restart, so it is possible for duplicate messages to be received by the client. There are application-specific mechanisms to mitigate this, and both Kafka and Samza have exactly-once semantics on their roadmaps, but for now it is something you should take into consideration when designing jobs.
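A common application-specific mitigation is to make consumption idempotent, for example by tracking which message offsets have already been applied. A minimal sketch (illustrative Python, not a Kafka or Samza facility):

```python
# Illustrative sketch: deduplicating an at-least-once stream by offset.
def process_once(messages, seen_offsets, apply_fn):
    """Apply apply_fn once per unique offset, skipping redeliveries."""
    for offset, payload in messages:
        if offset in seen_offsets:
            continue  # a redelivery, e.g. after a crash between checkpoints
        apply_fn(payload)
        seen_offsets.add(offset)

results, seen = [], set()
# Offsets 2 and 3 arrive twice, as can happen after a restart.
stream = [(1, "a"), (2, "b"), (3, "c"), (2, "b"), (3, "c"), (4, "d")]
process_once(stream, seen, results.append)
print(results)  # → ['a', 'b', 'c', 'd']
```

The trade-off is that the seen-offset state must itself be stored somewhere resilient, which is exactly the kind of local state Samza’s stream-backed storage is designed to hold.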
We won’t explain Kafka further beyond what we need to demonstrate Samza. If you are interested, check out its website and wiki; there is a lot of good information, including some excellent papers and presentations.
YARN integration

As mentioned earlier, just as Samza utilizes Kafka for its streaming layer implementation, it uses YARN for the execution layer. Just like any YARN application described in Chapter 3, Processing – MapReduce and Beyond, Samza provides both an implementation of an ApplicationMaster, which controls the lifecycle of the overall job, and implementations of Samza-specific functionality (called tasks) that are executed in each container. Just as Kafka partitions its topics, tasks are the mechanism by which Samza partitions its processing. Each Kafka partition will be read by a single Samza task. If a Samza job consumes multiple streams, then a given task will be the only consumer within the job for every stream partition assigned to it.

The Samza framework is told by each job configuration about the Kafka streams that are of interest to the job, and Samza continuously polls these streams to determine if any new messages have arrived. When a new message is available, the Samza task invokes a user-defined callback to process the message, a model that shouldn’t look too alien to MapReduce developers. This method is defined in an interface called StreamTask and has the following signature:
public void process(IncomingMessageEnvelope envelope,
                    MessageCollector collector,
                    TaskCoordinator coordinator)
This is the core of each Samza task and defines the functionality to be applied to received messages. The received message that is to be processed is wrapped in the IncomingMessageEnvelope; output messages can be written to the MessageCollector, and task management (such as shutdown) can be performed via the TaskCoordinator.

As mentioned, Samza creates one task instance for each partition in the underlying Kafka topic. Each YARN container will manage one or more of these tasks. The overall model, then, is of the Samza ApplicationMaster coordinating multiple containers, each of which is responsible for one or more StreamTask instances.
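The resulting fan-out of partitions to tasks to containers can be sketched as a simple assignment (illustrative Python with made-up numbers; Samza’s actual task grouping is more configurable than this round-robin):

```python
# Illustrative sketch: one task per Kafka partition, tasks spread
# round-robin across a fixed number of YARN containers.
def assign_tasks(num_partitions, num_containers):
    containers = {c: [] for c in range(num_containers)}
    for partition in range(num_partitions):
        task = "task-%d" % partition            # task <-> partition, 1:1
        containers[partition % num_containers].append(task)
    return containers

# Eight partitions spread over three containers.
print(assign_tasks(num_partitions=8, num_containers=3))
```

Because the task count is fixed by the partition count, adding containers redistributes tasks across more processes but never changes how the stream itself is divided.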
An independent model

Though we will talk exclusively of Kafka and YARN as the providers of Samza’s streaming and execution layers in this chapter, it is important to remember that the core Samza system uses well-defined interfaces for both the stream and execution systems. There are implementations of multiple stream sources (we’ll see one in the next section), and alongside the YARN support, Samza ships with a LocalJobRunner class. This alternative method of running tasks can execute StreamTask instances in-process on the JVM instead of requiring a full YARN cluster, which can sometimes be a useful testing and debugging tool. There is also discussion of Samza implementations on top of other cluster managers or virtualization frameworks.
Hello Samza!

Since not everyone already has ZooKeeper, Kafka, and YARN clusters ready to be used, the Samza team has created a wonderful way to get started with the product. Instead of just having a Hello world! program, there is a repository called Hello Samza, which is available by cloning git://git.apache.org/samza-hello-samza.git.

This will download and install dedicated instances of ZooKeeper, Kafka, and YARN (the three major prerequisites for Samza), creating a full stack upon which you can submit Samza jobs.

There are also a number of example Samza jobs that process data from Wikipedia edit notifications. Take a look at the page at http://samza.apache.org/startup/hello-samza/0.8/ and follow the instructions given there. (At the time of writing, Samza is still a relatively young project, and we’d rather not include direct information about the examples, which might be subject to change.)

For the remainder of the Samza examples in this chapter, we’ll assume you are either using the Hello Samza package to provide the necessary components (ZooKeeper/Kafka/YARN) or you have integrated with other instances of each.

This example has three different Samza jobs that build upon each other. The first reads the Wikipedia edits, the second parses these records, and the third produces statistics based on the processed records. We’ll build our own multistream workflow shortly.

One interesting point is the WikipediaFeed example here; it uses Wikipedia as its message source instead of Kafka. Specifically, it provides another implementation of the Samza SystemConsumer interface to allow Samza to read messages from an external system. As mentioned earlier, Samza is not tied to Kafka, and, as this example shows, building a new stream implementation does not have to be against a generic infrastructure component; it can be quite job-specific, as the work required is not huge.
Tip

Note that the default configuration for both ZooKeeper and Kafka will write system data to directories under /tmp, which will be what you have set if you use Hello Samza. Be careful if you are using a Linux distribution that purges the contents of this directory on a reboot. If you plan to carry out any significant testing, then it’s best to reconfigure these components to use less ephemeral locations. Change the relevant config files for each service; they are located in the service directory under the hello-samza/deploy directory.
Building a tweet parsing job

Let’s build our own simple job implementation to show the full code required. We’ll use parsing of the Twitter stream as the example in this chapter, and will later set up a pipe from our client consuming messages from the Twitter API into a Kafka topic. So, we need a Samza task that will read the stream of JSON messages, extract the actual tweet text, and write these to a topic of tweets.

Here is the main code from TwitterParseStreamTask.java, available at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterParseStreamTask.java:
package com.learninghadoop2.samza.tasks;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

public class TwitterParseStreamTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String msg = (String) envelope.getMessage();
        try {
            JSONParser parser = new JSONParser();
            Object obj = parser.parse(msg);
            JSONObject jsonObj = (JSONObject) obj;
            String text = (String) jsonObj.get("text");
            collector.send(new OutgoingMessageEnvelope(
                    new SystemStream("kafka", "tweets-parsed"), text));
        } catch (ParseException pe) {
            // Drop messages that are not valid JSON
        }
    }
}
The code is largely self-explanatory, but there are a few points of interest. We use JSON Simple (http://code.google.com/p/json-simple/) for our relatively straightforward JSON parsing requirements; we’ll also use it later in this book.

The IncomingMessageEnvelope and its corresponding OutgoingMessageEnvelope are the main structures concerned with the actual message data. Along with the message payload, the envelope will also have data concerning the system, topic name, and (optionally) partition number, in addition to other metadata. For our purposes, we just extract the message body from the incoming message and send the tweet text we extract from it, via a new OutgoingMessageEnvelope, to a topic called tweets-parsed within a system called kafka. Note the lowercase name; we’ll explain this in a moment.

The type of message in the IncomingMessageEnvelope is java.lang.Object. Samza does not currently enforce a data model and hence does not have strongly-typed message bodies. Therefore, when extracting the message contents, an explicit cast is usually required. Since each task needs to know the expected message format of the streams it processes, this is not the oddity that it may appear to be.
The configuration file

There was nothing in the preceding code that said where the messages came from; the framework just presents them to the StreamTask implementation, but obviously Samza needs to know from where to fetch messages. There is a configuration file for each job that defines this and more. The following can be found as twitter-parser.properties at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/twitter-parser.properties:
# Job
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=twitter-parser

# YARN
yarn.package.path=file:///home/gturkington/samza/build/distributions/learninghadoop2-0.1.tar.gz

# Task
task.class=com.learninghadoop2.samza.tasks.TwitterParseStreamTask
task.inputs=kafka.tweets
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
# Normally, this would be 3, but we have only one broker.
task.checkpoint.replication.factor=1

# Serializers
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory

# Systems
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.streams.tweets.samza.msg.serde=string
systems.kafka.streams.tweets-parsed.samza.msg.serde=string
systems.kafka.consumer.zookeeper.connect=localhost:2181/
systems.kafka.consumer.auto.offset.reset=largest
systems.kafka.producer.metadata.broker.list=localhost:9092
systems.kafka.producer.producer.type=sync
systems.kafka.producer.batch.num.messages=1
This may look like a lot, but for now we’ll just consider the high-level structure and some key settings. The job section sets YARN as the execution framework (as opposed to the local job runner class) and gives the job a name. If we were to run multiple copies of this same job, we would also give each copy a unique ID. The task section specifies the implementation class of our task and also the name of the streams for which it should receive messages. Serializers tell Samza how to read and write messages to and from the stream, and the systems section defines systems by name and associates implementation classes with them.
In our case, we define only one system called kafka, and we refer to this system when sending our message in the preceding task. Note that this name is arbitrary, and we could call it whatever we want. Obviously, for clarity it makes sense to call the Kafka system by that name, but this is only a convention. In particular, sometimes you will need to give different names when dealing with multiple systems that are similar to each other, or sometimes even when treating the same system differently in different parts of a configuration file.
In this section, we will also specify the SerDe to be associated with the streams used by the task. Recall that Kafka messages have a body and an optional key that is used to determine to which partition the message is sent. Samza needs to know how to treat the contents of the keys and messages for these streams. Samza has support to treat these as raw bytes or specific types such as string, integer, and JSON, as mentioned earlier.
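A serde is simply a paired serializer and deserializer applied to raw message bytes on the way into and out of a stream. Conceptually (a Python sketch of the idea, not Samza’s StringSerdeFactory or its JSON serde):

```python
# Conceptual sketch of serdes; not Samza's implementation.
import json

class StringSerde:
    def to_bytes(self, obj):
        return obj.encode("utf-8")
    def from_bytes(self, data):
        return data.decode("utf-8")

class JsonSerde:
    def to_bytes(self, obj):
        return json.dumps(obj).encode("utf-8")
    def from_bytes(self, data):
        return json.loads(data.decode("utf-8"))

# The framework looks up the serde configured for a stream and applies it
# to each message body (and, separately, to each key if one is configured).
serde = JsonSerde()
wire_bytes = serde.to_bytes({"text": "hello samza"})
print(serde.from_bytes(wire_bytes)["text"])  # → hello samza
```

Declaring the serde per stream in the configuration is what lets the task code receive a ready-to-cast object in process() rather than raw bytes.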
The rest of the configuration will be mostly unchanged from job to job, as it includes things such as the location of the ZooKeeper ensemble and Kafka clusters, and specifies how streams are to be checkpointed. Samza allows a wide variety of customizations, and the full configuration options are detailed at http://samza.apache.org/learn/documentation/0.8/jobs/configuration-table.html.
Getting Twitter data into Kafka

Before we run the job, we do need to get some tweets into Kafka. Let’s create a new Kafka topic called tweets to which we’ll write the tweets.

To perform this and other Kafka-related operations, we’ll use command-line tools located within the bin directory of the Kafka distribution. If you are running a job from within the stack created as part of the Hello Samza application, this will be deploy/kafka/bin.

kafka-topics.sh is a general-purpose tool that can be used to create, update, and describe topics. Most of its usages require arguments to specify the location of the local ZooKeeper cluster (where Kafka brokers store their details) and the name of the topic to be operated upon. To create a new topic, run the following command:
$ kafka-topics.sh --zookeeper localhost:2181 --create --topic tweets --partitions 1 --replication-factor 1
This creates a topic called tweets and explicitly sets its number of partitions and replication factor to 1. This is suitable if you are running Kafka within a local test VM, but clearly production deployments will have more partitions to scale out the load across multiple brokers, and a replication factor of at least 2 to provide fault tolerance.

Use the list option of the kafka-topics.sh tool to simply show the topics in the system, or use describe to get more detailed information on specific topics:
$ kafka-topics.sh --zookeeper localhost:2181 --describe --topic tweets
Topic:tweets    PartitionCount:1    ReplicationFactor:1    Configs:
    Topic: tweets    Partition: 0    Leader: 0    Replicas: 0    Isr: 0
The multiple 0s are possibly confusing, as these are labels and not counts. Each broker in the system has an ID that usually starts from 0, as do the partitions within each topic. The preceding output is telling us that the topic called tweets has a single partition with ID 0, the broker acting as the leader for that partition is broker 0, and the set of in-sync replicas (ISR) for this partition is again only broker 0. This last value is particularly important when dealing with replication.

We’ll use our Python utility from previous chapters to pull JSON tweets from the Twitter feed, and then use a Kafka CLI message producer to write the messages to a Kafka topic. This isn’t a terribly efficient way of doing things, but it is suitable for illustration purposes. Assuming our Python script is in our home directory, run the following command from within the Kafka bin directory:
$ python ~/stream.py -j | ./kafka-console-producer.sh --broker-list localhost:9092 --topic tweets
This will run indefinitely, so be careful not to leave it running overnight on a test VM with little disk space; not that the authors have ever done such a thing.
Running a Samza job

To run a Samza job, we need our code to be packaged, along with the Samza components required to execute it, into a .tar.gz archive that will be read by the YARN NodeManager. This is the file referred to by the yarn.package.path property in the Samza task configuration file.
When using the single-node Hello Samza, we can just use an absolute path on the filesystem, as seen in the previous configuration example. For jobs on larger YARN grids, the easiest way is to put the package onto HDFS and refer to it by an hdfs:// URI, or on a web server (Samza provides a mechanism to allow YARN to read the file via HTTP).

Because Samza has multiple subcomponents, and each subcomponent has its own dependencies, the full YARN package can end up containing a lot of JAR files (over 100!). In addition, you need to include your custom code for the Samza task as well as some scripts from within the Samza distribution. It’s not something to be done by hand. In the sample code for this chapter, found at https://github.com/learninghadoop2/book-examples/tree/master/ch4, we have set up a sample structure to hold the code and config files, and provided some automation via Gradle to build the necessary task archive and start the tasks.

When in the root of the Samza example code directory for this book, perform the following command to build a single file archive containing all the classes of this chapter compiled together and bundled with all the other required files:
$ ./gradlew targz
This Gradle task will not only create the necessary .tar.gz archive in the build/distributions directory, but will also store an expanded version of the archive under build/samza-package. This will be useful, as we will use Samza scripts stored in the bin directory of the archive to actually submit the task to YARN.
So now, let’s run our job. We need to have file paths for two things: the Samza run-job.sh script to submit a job to YARN, and the configuration file for our job. Since our created job package has all the compiled tasks bundled together, we tell Samza which task to run by using a configuration file that names the specific task implementation class in the task.class property. To actually run the task, we can run the following command from within the exploded project archive under build/samza-package:
$ bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=config/twitter-parser.properties
For convenience, we added a Gradle task to run this job:

$ ./gradlew runTwitterParser

To see the output of the job, we’ll use the Kafka CLI client to consume messages:
$ ./kafka-console-consumer.sh --zookeeper localhost:2181 --topic tweets-parsed
Youshouldseeacontinuousstreamoftweetsappearingontheclient.
Note: We did not explicitly create the topic called tweets-parsed. Kafka allows topics to be created dynamically when either a producer or consumer tries to use the topic. In many situations, though, the default partitioning and replication values may not be suitable, and explicit topic creation will be required to ensure these critical topic attributes are correctly defined.
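Where the defaults are not appropriate, the topic can be created explicitly before any producer or consumer touches it. A minimal sketch using the standard Kafka CLI against a local ZooKeeper (the partition and replication values shown are illustrative, not recommendations):

```shell
# Create the topic up front so partitioning and replication are under
# our control rather than the broker defaults (values are examples only).
$ ./kafka-topics.sh --create --zookeeper localhost:2181 \
    --partitions 4 --replication-factor 2 --topic tweets-parsed
```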
Samza and HDFS
You may have noticed that we just mentioned HDFS for the first time in our discussion of Samza. Though Samza integrates tightly with YARN, it has no direct integration with HDFS. At a logical level, Samza's stream-implementing systems (such as Kafka) provide the storage layer that is usually provided by HDFS for traditional Hadoop workloads. In the terminology of Samza's architecture, as described earlier, YARN is the execution layer in both models; but whereas Samza uses a streaming layer for its source and destination data, frameworks such as MapReduce use HDFS. This is a good example of how YARN enables alternative computational models that not only process data very differently from batch-oriented MapReduce, but can also use entirely different storage systems for their source data.
Windowing functions
It's frequently useful to generate some data based on the messages received on a stream over a certain time window. An example of this may be to record the top n attribute values measured every minute. Samza supports this through the WindowableTask interface, which has the following single method to be implemented:
public void window(MessageCollector collector, TaskCoordinator coordinator);
This should look similar to the process method in the StreamTask interface. However, because the method is called on a time schedule, its invocation is not associated with a received message. The MessageCollector and TaskCoordinator parameters are still there, however, as most windowable tasks will produce output messages and may also wish to perform some task management actions.
Let's take our previous task and add a window function that will output the number of tweets received in each windowed time period. This is the main class implementation of TwitterStatisticsStreamTask.java, found at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterStatisticsStreamTask.java:
public class TwitterStatisticsStreamTask implements StreamTask, WindowableTask {
    private int tweets = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        tweets++;
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-stats"), "" + tweets));
        // Reset counts after windowing.
        tweets = 0;
    }
}
The TwitterStatisticsStreamTask class has a private member variable called tweets that is initialized to 0 and incremented in every call to the process method. We therefore know that this variable will be incremented for each message passed to the task from the underlying stream implementation. Each Samza container has a single thread running in a loop that executes the process and window methods on all the tasks within the container. This means that we do not need to guard instance variables against concurrent modification; only one method on each task within a container will be executing at any time.
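The single-threaded execution model can be sketched in plain Java (this is an illustrative toy, not Samza's actual container code): one loop delivers a batch of messages and then fires the window callback, so the task's instance variables are never touched by two threads at once:

```java
import java.util.List;

// Toy sketch of the container's single-threaded loop: process() and
// window() are always invoked from one thread, so task instance
// variables need no synchronization.
class CountingTask {
    private int tweets = 0;

    void process(String message) {
        tweets++;
    }

    String window() {
        String out = "Number of tweets: " + tweets;
        tweets = 0; // reset counts after windowing
        return out;
    }
}

class ContainerLoop {
    // Deliver a batch of messages, then fire the window callback,
    // mimicking task.window.ms expiring.
    static String runOnce(CountingTask task, List<String> batch) {
        for (String msg : batch) {
            task.process(msg);
        }
        return task.window();
    }
}
```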
In our window method, we send a message to a new topic we call tweet-stats and then reset the tweets variable. This is pretty straightforward, and the only missing piece is how Samza will know when to call the window method. We specify this in the configuration file:
task.window.ms=5000
This tells Samza to call the window method on each task instance every 5 seconds. To run the window task, there is a Gradle task:
$ ./gradlew runTwitterStatistics
If we use kafka-console-consumer.sh to listen on the tweet-stats stream now, we will see output like the following:
Number of tweets: 5012
Number of tweets: 5398
Note: The term window in this context refers to Samza conceptually slicing the stream of messages into time ranges and providing a mechanism to perform processing at each range boundary. Samza does not directly provide an implementation of the other use of the term with regard to sliding windows, where a series of values is held and processed over time. However, the WindowableTask interface does provide the plumbing to implement such sliding windows.
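As a sketch of the plumbing the note mentions, a task could keep the last few per-window counts itself and derive a sliding-window aggregate at each window boundary; the class below is a hypothetical helper, not part of the Samza API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: building a sliding average over the last N
// per-window counts on top of the window() callback, which Samza
// itself does not provide out of the box.
class SlidingWindowAverage {
    private final Deque<Integer> lastCounts = new ArrayDeque<>();
    private final int windowSlots;

    SlidingWindowAverage(int windowSlots) {
        this.windowSlots = windowSlots;
    }

    // Called once per window() invocation with that window's count;
    // returns the average over the retained slots.
    double onWindow(int countThisWindow) {
        lastCounts.addLast(countThisWindow);
        if (lastCounts.size() > windowSlots) {
            lastCounts.removeFirst(); // slide: drop the oldest slot
        }
        int sum = 0;
        for (int c : lastCounts) {
            sum += c;
        }
        return (double) sum / lastCounts.size();
    }
}
```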
Multijob workflows
As we saw with the Hello Samza examples, some of the real power of Samza comes from the composition of multiple jobs, and we'll use a text cleanup job to start demonstrating this capability.
In the following section, we'll perform tweet sentiment analysis by comparing tweets with a set of English positive and negative words. Simply applying this to the raw Twitter feed will give very patchy results, however, given how richly multilingual the Twitter stream is. We also need to consider things such as text cleanup, capitalization, frequent contractions, and so on. As anyone who has worked with a non-trivial dataset knows, the act of making the data fit for processing is usually where a large amount of the effort (often the majority!) goes.
So before we try to detect tweet sentiment, let's do some simple text cleanup; in particular, we'll select only English-language tweets, and we will force their text to lowercase before sending them to a new output stream.
Language detection is a difficult problem, and for this we'll use a feature of the Apache Tika library (http://tika.apache.org). Tika provides a wide array of functionality to extract text from various sources and then to extract further information from that text. If you are using our Gradle scripts, the Tika dependency is already specified and will automatically be included in the generated job package. If building through another mechanism, you will need to download the Tika JAR file from the homepage and add it to your YARN job package. The following code can be found as TextCleanupStreamTask.java at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TextCleanupStreamTask.java:
public class TextCleanupStreamTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        String rawtext = (String) envelope.getMessage();
        if ("en".equals(detectLanguage(rawtext))) {
            collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "english-tweets"), rawtext.toLowerCase()));
        }
    }

    private String detectLanguage(String text) {
        LanguageIdentifier li = new LanguageIdentifier(text);
        return li.getLanguage();
    }
}
This task is quite straightforward, thanks to the heavy lifting performed by Tika. We create a utility method that wraps the creation and use of a Tika LanguageIdentifier, and then we call this method on the message body of each incoming message in the process method. We only write to the output stream if the result of applying this utility method is "en", that is, the two-letter code for English.
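The shape of this logic can be exercised in isolation by abstracting the detector behind a function, with a stub standing in for Tika's LanguageIdentifier (the class below is an illustrative sketch, not the book's task code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of the cleanup logic with the language detector abstracted
// out; in the real task, Tika's LanguageIdentifier plays this role.
class TextCleanup {
    private final Function<String, String> detectLanguage;
    final List<String> outputStream = new ArrayList<>();

    TextCleanup(Function<String, String> detectLanguage) {
        this.detectLanguage = detectLanguage;
    }

    void process(String rawText) {
        // Forward only English text, lowercased, to the output stream.
        if ("en".equals(detectLanguage.apply(rawText))) {
            outputStream.add(rawText.toLowerCase());
        }
    }
}
```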
The configuration file for this task is similar to that of our previous task, with specific values for the task name and implementing class. It is in the repository as textcleanup.properties at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/textcleanup.properties. We also need to specify the input stream:
task.inputs=kafka.tweets-parsed
This is important because we need this task to consume the tweet text that was extracted by the earlier task, avoiding duplication of the JSON parsing logic, which is best encapsulated in one place. We can run this task with the following command:
$ ./gradlew runTextCleanup
Now, we can run all three tasks together; TwitterParseStreamTask and TwitterStatisticsStreamTask will consume the raw tweet stream, while TextCleanupStreamTask will consume the output of TwitterParseStreamTask.
Data processing on streams
Tweet sentiment analysis
We'll now implement a task to perform tweet sentiment analysis, similar to what we did using MapReduce in the previous chapter. This will also show us a useful mechanism offered by Samza: bootstrap streams.
Bootstrap streams
Generally speaking, most stream-processing jobs (in Samza or another framework) start processing messages that arrive after they start up and ignore historical messages. Because of its concept of replayable streams, Samza doesn't have this limitation.
In our sentiment analysis job, we have two sets of reference terms: positive and negative words. Though we've not shown it so far, Samza can consume messages from multiple streams; the underlying machinery will poll all named streams and provide their messages, one at a time, to the process method. We can therefore create streams for the positive and negative words and push the datasets onto those streams. At first glance, we could plan to rewind these two streams to their earliest point and read tweets as they arrive. The problem is that Samza doesn't guarantee the ordering of messages across multiple streams, and even though there is a mechanism to give streams higher priority, we can't assume that all negative and positive words will be processed before the first tweet arrives.
For such scenarios, Samza has the concept of bootstrap streams. If a task has any bootstrap streams defined, it will read these streams from the earliest offset until they are fully processed (technically, it will read the streams until they are caught up, so any new words sent to either stream will be treated without priority and will arrive interleaved between tweets).
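The bootstrap contract can be illustrated with a toy sketch (plain Java, not Samza internals): streams flagged as bootstrap are drained first, and only then do messages from the remaining streams flow to the task:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the bootstrap-stream contract: streams flagged
// as bootstrap are drained from their earliest offset before any message
// from the remaining streams is delivered to process().
class BootstrapReader {
    static List<String> deliveryOrder(Map<String, List<String>> streams,
                                      List<String> bootstrapStreams) {
        List<String> delivered = new ArrayList<>();
        // First, fully catch up on every bootstrap stream.
        for (String name : bootstrapStreams) {
            delivered.addAll(streams.get(name));
        }
        // Only then deliver messages from the other streams.
        for (Map.Entry<String, List<String>> e : streams.entrySet()) {
            if (!bootstrapStreams.contains(e.getKey())) {
                delivered.addAll(e.getValue());
            }
        }
        return delivered;
    }
}
```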
We’llnowcreateanewjobcalledTweetSentimentStreamTaskthatreadstwobootstrapstreams,collectstheircontentsintoHashMaps,gathersrunningcountsforsentimenttrends,andusesawindowfunctiontooutputthisdataatintervals.Thiscodecanbefoundathttps://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterSentimentStreamTask.java
public class TwitterSentimentStreamTask implements StreamTask, WindowableTask {
    private Set<String> positiveWords = new HashSet<String>();
    private Set<String> negativeWords = new HashSet<String>();
    private int tweets = 0;
    private int positiveTweets = 0;
    private int negativeTweets = 0;
    private int maxPositive = 0;
    private int maxNegative = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        if ("positive-words".equals(envelope.getSystemStreamPartition().getStream())) {
            positiveWords.add((String) envelope.getMessage());
        } else if ("negative-words".equals(envelope.getSystemStreamPartition().getStream())) {
            negativeWords.add((String) envelope.getMessage());
        } else if ("english-tweets".equals(envelope.getSystemStreamPartition().getStream())) {
            tweets++;
            int positive = 0;
            int negative = 0;
            String words = (String) envelope.getMessage();
            for (String word : words.split(" ")) {
                if (positiveWords.contains(word)) {
                    positive++;
                } else if (negativeWords.contains(word)) {
                    negative++;
                }
            }
            if (positive > negative) {
                positiveTweets++;
            }
            if (negative > positive) {
                negativeTweets++;
            }
            if (positive > maxPositive) {
                maxPositive = positive;
            }
            if (negative > maxNegative) {
                maxNegative = negative;
            }
        }
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        String msg = String.format("Tweets: %d Positive: %d Negative: %d MaxPositive: %d MaxNegative: %d",
                tweets, positiveTweets, negativeTweets, maxPositive, maxNegative);
        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-sentiment-stats"), msg));
        // Reset counts after windowing.
        tweets = 0;
        positiveTweets = 0;
        negativeTweets = 0;
        maxPositive = 0;
        maxNegative = 0;
    }
}
In this task, we add a number of private member variables that we use to keep a running count of the overall number of tweets, how many were positive and negative, and the maximum positive and negative counts seen in a single tweet.
This task consumes from three Kafka topics. Even though we will configure two of them to be used as bootstrap streams, they are all still exactly the same type of Kafka topic from which messages are received; the only difference with bootstrap streams is that we tell Samza to use Kafka's rewinding capabilities to fully re-read each message in the stream. For the other stream of tweets, we just start reading new messages as they arrive.
As hinted earlier, if a task subscribes to multiple streams, the same process method will receive messages from each stream. That is why we use envelope.getSystemStreamPartition().getStream() to extract the stream name of each given message and then act accordingly. If a message is from either of the bootstrapped streams, we add its contents to the appropriate set. For a tweet message, we break it into its constituent words, test each word for positive or negative sentiment, and then update the counts accordingly. As you can see, this task doesn't output the received tweets to another topic.
Since we don't perform any direct processing on the tweets, there is no point in doing so; any other task that wishes to consume the messages can just subscribe directly to the incoming tweet stream. However, a possible modification could be to write positive and negative sentiment tweets to dedicated streams.
The window method outputs a series of counts and then resets the variables (as it did before). Note that Samza does have support to directly expose metrics through JMX, which could be a better fit for such simple windowing examples. However, we won't have space to cover that aspect of the project in this book.
To run this job, we need to modify the configuration file by setting the job and task names as usual, but we now also need to specify multiple input streams:
task.inputs=kafka.english-tweets,kafka.positive-words,kafka.negative-words
Then, we need to specify that two of our streams are bootstrap streams that should be read from the earliest offset. Specifically, we set three properties for each of these streams: we mark them as bootstrap streams, that is, to be fully read before other streams, and we specify that the offset on each stream is to be reset to the oldest (first) position:
systems.kafka.streams.positive-words.samza.bootstrap=true
systems.kafka.streams.positive-words.samza.reset.offset=true
systems.kafka.streams.positive-words.samza.offset.default=oldest
systems.kafka.streams.negative-words.samza.bootstrap=true
systems.kafka.streams.negative-words.samza.reset.offset=true
systems.kafka.streams.negative-words.samza.offset.default=oldest
We can run this job with the following command:
$ ./gradlew runTwitterSentiment
After starting the job, look at the output of the messages on the tweet-sentiment-stats topic.
The sentiment detection job will bootstrap the positive and negative word streams before reading any of our newly detected lowercase English tweets.
With the sentiment detection job in place, we can now visualize our four collaborating jobs as shown in the following diagram:
Bootstrap streams and collaborating tasks
Tip: To correctly run the jobs, it may seem necessary to start the JSON parser job, followed by the cleanup job, before finally starting the sentiment job, but this is not the case. Any unread messages remain buffered in Kafka, so it doesn't matter in which order the jobs of a multi-job workflow are started. Of course, the sentiment job will output counts of 0 tweets until it starts receiving data, but nothing will break if a stream job starts before those it depends on.
Stateful tasks
The final aspect of Samza that we will explore is how it allows the tasks processing stream partitions to have persistent local state. In the previous example, we used private variables to keep track of running totals, but sometimes it is useful for a task to have richer local state. An example could be performing a logical join on two streams, where it is useful to build up a state model from one stream and compare it with the other.
Note: Samza can utilize its concept of partitioned streams to greatly optimize the act of joining streams. If each stream to be joined uses the same partition key (for example, a user ID), then each task consuming these streams will receive all messages associated with each ID across all the streams.
Samza has another abstraction, similar to its notion of the framework that manages its jobs and that which implements its tasks: it defines an abstract key-value store that can have multiple concrete implementations. Samza uses existing open source projects for the on-disk implementations, using LevelDB as of v0.7 and adding RocksDB as of v0.8. There is also an in-memory store that does not persist the key-value data, but that may be useful in testing or potentially in very specific production workloads.
Each task can write to this key-value store, and Samza manages its persistence to the local implementation. To support persistent state, the store is also modeled as a stream, and all writes to the store are pushed into a backing stream. If a task fails, then on restart it can recover the state of its local key-value store by replaying the messages in the backing topic. An obvious concern here is the number of messages that need to be replayed; however, when using Kafka, for example, log compaction retains only the latest update for each key in the topic.
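The changelog mechanism can be illustrated with a toy sketch (not Samza's actual store implementation): every put is mirrored to a backing map that stands in for a compacted Kafka topic, and recovery is simply a replay of that map:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the changelog idea: every put is mirrored to a log
// topic, and a restarted task rebuilds its local store by replaying
// that log. Kafka's log compaction keeps only the latest value per
// key, which is modeled here by using a map as the "compacted" log.
class ChangelogBackedStore {
    private final Map<String, Integer> local = new HashMap<>();
    private final Map<String, Integer> changelog; // compacted backing topic

    ChangelogBackedStore(Map<String, Integer> changelog) {
        this.changelog = changelog;
    }

    void put(String key, Integer value) {
        local.put(key, value);
        changelog.put(key, value); // mirror the write to the changelog
    }

    Integer get(String key) {
        return local.get(key);
    }

    // Simulate recovery after a task failure: rebuild local state
    // from the compacted changelog.
    static ChangelogBackedStore restoreFrom(Map<String, Integer> changelog) {
        ChangelogBackedStore store = new ChangelogBackedStore(changelog);
        store.local.putAll(changelog);
        return store;
    }
}
```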
We’llmodifyourprevioustweetsentimentexampletoaddalifetimecountofthemaximumpositiveandnegativesentimentseeninanytweet.ThefollowingcodecanbefoundasTwitterStatefulSentimentStateTask.javaathttps://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterStatefulSentimentStreamTask.javaNotethattheprocessmethodisthesameasTwitterSentimentStateTask.java,sowehaveomittedithereforspacereasons:
public class TwitterStatefulSentimentStreamTask implements StreamTask, WindowableTask, InitableTask {
    private Set<String> positiveWords = new HashSet<String>();
    private Set<String> negativeWords = new HashSet<String>();
    private int tweets = 0;
    private int positiveTweets = 0;
    private int negativeTweets = 0;
    private int maxPositive = 0;
    private int maxNegative = 0;
    private KeyValueStore<String, Integer> store;

    @SuppressWarnings("unchecked")
    @Override
    public void init(Config config, TaskContext context) {
        this.store = (KeyValueStore<String, Integer>) context.getStore("tweet-store");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        ...
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        Integer lifetimeMaxPositive = store.get("lifetimeMaxPositive");
        Integer lifetimeMaxNegative = store.get("lifetimeMaxNegative");
        if ((lifetimeMaxPositive == null) || (maxPositive > lifetimeMaxPositive)) {
            lifetimeMaxPositive = maxPositive;
            store.put("lifetimeMaxPositive", lifetimeMaxPositive);
        }
        if ((lifetimeMaxNegative == null) || (maxNegative > lifetimeMaxNegative)) {
            lifetimeMaxNegative = maxNegative;
            store.put("lifetimeMaxNegative", lifetimeMaxNegative);
        }
        String msg = String.format(
                "Tweets: %d Positive: %d Negative: %d MaxPositive: %d MaxNegative: %d LifetimeMaxPositive: %d LifetimeMaxNegative: %d",
                tweets, positiveTweets, negativeTweets, maxPositive,
                maxNegative, lifetimeMaxPositive, lifetimeMaxNegative);
        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-stateful-sentiment-stats"), msg));
        // Reset counts after windowing.
        tweets = 0;
        positiveTweets = 0;
        negativeTweets = 0;
        maxPositive = 0;
        maxNegative = 0;
    }
}
This class implements a new interface called InitableTask. It has a single method called init, which is used when a task needs to configure aspects of its state before it begins execution. We use the init() method here to retrieve an instance of the KeyValueStore class and store it in a private member variable.
KeyValueStore, as the name suggests, provides a familiar put/get interface. In this case, we specify that the keys are of type String and the values are Integers. In our window method, we retrieve any previously stored values for the maximum positive and negative sentiment and, if the count in the current window is higher, update the store accordingly. Then, we just output the results of the window method as before.
As you can see, the user does not need to deal with the details of either the local or remote persistence of the KeyValueStore instance; this is all handled by Samza. The efficiency of the mechanism also makes it tractable for tasks to hold sizeable amounts of local state, which can be particularly valuable in cases such as long-running aggregations or stream joins.
The configuration file for the job can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/twitter-stateful-sentiment.properties. It needs to have a few entries added, which are as follows:
stores.tweet-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
stores.tweet-store.changelog=kafka.twitter-stats-state
stores.tweet-store.key.serde=string
stores.tweet-store.msg.serde=integer
The first line specifies the implementation class for the store, the second line specifies the Kafka topic to be used for persistent state, and the last two lines specify the types of the store key and value.
To run this job, use the following command:
$ ./gradlew runTwitterStatefulSentiment
For convenience, the following command will start up four jobs: the JSON parser, the text cleanup, the statistics job, and the stateful sentiment job:
$ ./gradlew runTasks
Samza is a pure stream-processing system that provides pluggable implementations of its storage and execution layers. The most commonly used plugins are Kafka and YARN, which demonstrate how Samza can integrate tightly with Hadoop YARN while using a completely different storage layer. Samza is still a relatively new project, and its current features are only a subset of what is envisaged. It is recommended to consult its web page for the latest information on its current status.
Summary
This chapter focused much more on what can be done on Hadoop 2, and in particular on YARN, than on the details of Hadoop internals. This is almost certainly a good thing, as it demonstrates that Hadoop is realizing its goal of becoming a much more flexible and generic data processing platform that is no longer tied to batch processing. In particular, we highlighted how Samza shows that the processing frameworks implemented on YARN can innovate and enable functionality vastly different from that available in Hadoop 1.
We saw how Samza sits at the opposite end of the latency spectrum from batch processing and enables processing of individual messages as they arrive.
We also saw how Samza provides a callback mechanism that MapReduce developers will find familiar, but uses it for a very different processing model, and we discussed the ways in which Samza utilizes YARN as its main execution framework, implementing the model described in Chapter 3, Processing – MapReduce and Beyond.
In the next chapter, we will switch gears and explore Apache Spark. Though it has a very different data model than Samza, we'll see that it also has an extension that supports processing of real-time data streams, including the option of Kafka integration. However, the two projects are so different that they are complementary rather than in competition.
Chapter 5. Iterative Computation with Spark
In the previous chapter, we saw how Samza can enable near real-time stream data processing within Hadoop. This is quite a step away from the traditional batch processing model of MapReduce, but it still keeps with the model of providing a well-defined interface against which business logic tasks can be implemented. In this chapter, we will explore Apache Spark, which can be viewed both as a framework on which applications can be built and as a processing framework in its own right. Not only are applications being built on Spark, but entire components within the Hadoop ecosystem are also being reimplemented to use Spark as their underlying processing framework. In particular, we will cover the following topics:
What Spark is and how its core system can run on YARN
The data model provided by Spark that enables hugely scalable and highly efficient data processing
The breadth of additional Spark components and related projects
It's important to note up front that although Spark has its own mechanism to process streaming data, this is but one part of what Spark has to offer. It's best to think of it as a much broader initiative.
Apache Spark
Apache Spark (https://spark.apache.org/) is a data processing framework based on a generalization of MapReduce. It was originally developed by the AMPLab at UC Berkeley (https://amplab.cs.berkeley.edu/). Like Tez, Spark acts as an execution engine that models data transformations as DAGs and strives to eliminate the I/O overhead of MapReduce in order to perform iterative computation at scale. While Tez's main goal was to provide a faster execution engine for MapReduce on Hadoop, Spark has been designed both as a standalone framework and as an API for application development. The system is designed to perform general-purpose in-memory data processing and stream workflows, as well as interactive and iterative computation.
Spark is implemented in Scala, a statically typed programming language for the Java VM, and exposes native programming interfaces for Java and Python in addition to Scala itself. Note that though Java code can call the Scala interface directly, there are some aspects of the type system that make such code pretty unwieldy, and hence we use the native Java API.
Scala ships with an interactive shell similar to those of Ruby and Python; this allows users to run Spark interactively from the interpreter to query any dataset.
The Scala interpreter operates by compiling a class for each line typed by the user, loading it into the JVM, and invoking a function on it. This class includes a singleton object that contains the variables or functions on that line and runs the line's code in an initialize method. In addition to its rich programming interfaces, Spark is becoming established as an execution engine, with popular tools of the Hadoop ecosystem (such as Pig and Hive) being ported to the framework.
Cluster computing with working sets
Spark's architecture is centered around the concept of Resilient Distributed Datasets (RDDs): read-only collections of Scala objects that are partitioned across a set of machines and can persist in memory. This abstraction was proposed in a 2012 research paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, which can be found at https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf.
A Spark application consists of a driver program that executes parallel operations on a cluster of workers, which are long-lived processes that can store data partitions in memory; the driver dispatches functions to the workers, where they run as parallel tasks, as shown in the following diagram:
Spark cluster architecture
Processes are coordinated via a SparkContext instance. SparkContext connects to a resource manager (such as YARN), requests executors on worker nodes, and sends tasks to be executed. Executors are responsible for running tasks and managing memory locally.
Spark allows you to share variables between tasks, or between tasks and the driver, using an abstraction known as shared variables. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are additive variables such as counters and sums.
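The accumulator semantics can be sketched in a few lines of plain Java (an illustration of the concept, not Spark's API): tasks may only add to the variable, and the driver reads the final value:

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy sketch of the accumulator idea (not the Spark API): tasks can
// only add to the variable; the driver reads the aggregated result.
class Accumulator {
    private final AtomicLong value = new AtomicLong();

    void add(long amount) {   // called from tasks, possibly concurrently
        value.addAndGet(amount);
    }

    long value() {            // read on the driver after the job completes
        return value.get();
    }
}
```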
Resilient Distributed Datasets (RDDs)
An RDD is stored in memory, shared across machines, and used in MapReduce-like parallel operations. Fault tolerance is achieved through the notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition. An RDD can be built in four ways:
By reading data from a file stored in HDFS
By dividing (parallelizing) a Scala collection into a number of partitions that are sent to workers
By transforming an existing RDD using parallel operators
By changing the persistence of an existing RDD
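The lineage idea behind RDD fault tolerance can be illustrated with a toy sketch in plain Java (not Spark code): a derived partition records its parent data and the transformation used to produce it, so a lost partition can be rebuilt on demand:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy illustration of lineage-based recovery (not Spark code): a
// derived partition remembers its parent data and the transformation
// used to compute it, so losing the materialized result is never fatal.
class LineagePartition<A, B> {
    private final List<A> parent;            // parent partition's data
    private final Function<A, B> transform;  // how this partition is derived
    private List<B> materialized;            // may be lost (e.g. node failure)

    LineagePartition(List<A> parent, Function<A, B> transform) {
        this.parent = parent;
        this.transform = transform;
    }

    void lose() {
        materialized = null; // simulate losing the in-memory partition
    }

    List<B> get() {
        if (materialized == null) {
            // Rebuild just this partition from its lineage.
            materialized = new ArrayList<>();
            for (A a : parent) {
                materialized.add(transform.apply(a));
            }
        }
        return materialized;
    }
}
```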
Spark shines when RDDs can fit in memory and be cached across operations. The API exposes methods to persist RDDs and allows for several persistence strategies and storage levels, allowing for spill to disk as well as space-efficient binary serialization.
Actions
Operations are invoked by passing functions to Spark. The system deals with variables and side effects according to the functional programming paradigm. Closures can refer to variables in the scope where they are created. Examples of actions are count (which returns the number of elements in the dataset) and save (which outputs the dataset to storage). Other parallel operations on RDDs include the following:
map: applies a function to each element of the dataset
filter: selects elements from a dataset based on user-provided criteria
reduce: combines dataset elements using an associative function
collect: sends all elements of the dataset to the driver program
foreach: passes each element through a user-provided function
groupByKey: groups items together by a provided key
sortByKey: sorts items by key
Deployment
Spark can run either in local mode, similar to a Hadoop single-node setup, or on top of a resource manager. Currently supported resource managers include the following:
Spark standalone cluster mode
YARN
Apache Mesos
Spark on YARN
An ad hoc consolidated JAR needs to be built in order to deploy Spark on YARN. Spark launches an instance of the standalone deploy cluster within the ResourceManager. Cloudera and MapR both ship with Spark on YARN as part of their software distributions. At the time of writing, Spark is available for Hortonworks' HDP as a technology preview (http://hortonworks.com/hadoop/spark/).
Spark on EC2
Spark comes with a deployment script, spark-ec2, located in the ec2 directory. This script automatically sets up Spark and HDFS on a cluster of EC2 instances. In order to launch a Spark cluster on the Amazon cloud, go to the ec2 directory and run the following command:
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
Here, <keypair> is the name of your EC2 key pair, <key-file> is the private key file for the key pair, <num-slaves> is the number of slave nodes to be launched, and <cluster-name> is the name to be given to your cluster. See Chapter 1, Introduction, for more details regarding the setup of key pairs. Verify that the cluster scheduler is up and sees all the slaves by going to its web UI, the address of which will be printed once the script completes.
You can specify a path in S3 as the input through a URI of the form s3n://<bucket>/path. You will also need to set your Amazon security credentials, either by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before your program is executed, or through SparkContext.hadoopConfiguration.
Getting started with Spark

Spark binaries and source code are available on the project website at http://spark.apache.org/. The examples in the following sections have been tested using Spark 1.1.0 built from source on the Cloudera CDH 5.0 QuickStart VM.

Download and uncompress the gzip archive with the following commands:

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
$ tar xvzf spark-1.1.0.tgz
$ cd spark-1.1.0
Spark is built on Scala 2.10 and uses sbt (https://github.com/sbt/sbt) to build the source core and related examples:

$ ./sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

With the -Dhadoop.version=2.2.0 and -Pyarn options, we instruct sbt to build against Hadoop version 2.2.0 or higher and to enable YARN support.
Start Spark in standalone mode with the following command:

$ ./sbin/start-all.sh

This command will launch a local master instance at spark://localhost:7077 as well as a worker node.

A web interface to the master node can be accessed at http://localhost:8080/ and can be seen in the following screenshot:

Master node web interface

Spark can run interactively through spark-shell, which is a modified version of the Scala shell. As a first example, we will implement a word count of the Twitter dataset we used in Chapter 3, Processing - MapReduce and Beyond, using the Scala API.

Start an interactive spark-shell session by running the following command:

$ ./bin/spark-shell

The shell instantiates a SparkContext object, sc, that is responsible for handling driver connections to workers. We will describe its semantics later in this chapter.
To make things a bit easier, let's create a sample textual dataset that contains one status update per line:

$ stream.py -t -n 1000 > sample.txt

Then, copy it to HDFS:

$ hdfs dfs -put sample.txt /tmp

Within spark-shell, we first create an RDD, file, from the sample data:

val file = sc.textFile("/tmp/sample.txt")

Then, we apply a series of transformations to count the word occurrences in the file. Note that the output of the transformation chain, counts, is still an RDD:

val counts = file.flatMap(line => line.split(" "))
             .map(word => (word, 1))
             .reduceByKey((m, n) => m + n)

This chain of transformations corresponds to the map and reduce phases that we are familiar with. In the map phase, we load each line of the dataset, tokenize each tweet into a sequence of words (flatMap), count the occurrence of each word (map), and emit (key, value) pairs. In the reduce phase, we group by key (word) and sum values (m, n) together to obtain word counts.

Finally, we print the first ten elements, counts.take(10), to the console:

counts.take(10).foreach(println)
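The same flatMap, map, and reduce-by-key chain can be traced on a plain Scala collection; in this Spark-free sketch, groupBy plus a sum stands in for Spark's distributed reduceByKey (the sample lines are made up):

```scala
// Word count over a local collection, mirroring the Spark transformation chain.
val lines = List("to be or not to be", "to do is to be")

val counts = lines
  .flatMap(line => line.split(" "))   // tokenize each line into words
  .map(word => (word, 1))             // emit (word, 1) pairs
  .groupBy(_._1)                      // group the pairs by word...
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // ...and sum the 1s

println(counts)
```

The result is a map from each word to its frequency, exactly the (key, value) pairs the RDD version writes out.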
Writing and running standalone applications

Spark allows standalone applications to be written using three APIs: Scala, Java, and Python.
Scala API

The first thing a Spark driver must do is create a SparkContext object, which tells Spark how to access a cluster. After importing classes and implicit conversions into a program, as in the following:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

the SparkContext object can be created with the following constructor:

new SparkContext(master, appName, [sparkHome])

It can also be created through SparkContext(conf), which takes a SparkConf object.

The master parameter is a string that specifies a cluster URI to connect to (such as spark://localhost:7077) or a local string to run in local mode. The appName parameter is the application name that will be shown in the cluster web UI.
It is not possible to override the default SparkContext class, nor is it possible to create a new one within a running Spark shell. It is, however, possible to specify which master the context connects to using the MASTER environment variable. For example, to run spark-shell on four cores, use the following:

$ MASTER=local[4] ./bin/spark-shell
Java API

The org.apache.spark.api.java package exposes all the Spark features available in the Scala version to Java. The Java API has a JavaSparkContext class that returns instances of org.apache.spark.api.java.JavaRDD and works with Java collections instead of Scala ones.

There are a few key differences between the Java and Scala APIs:

Java 7 does not support anonymous or first-class functions; therefore, functions must be implemented by extending the org.apache.spark.api.java.function.Function, Function2, and other classes. As of Spark version 1.0, the API has been refactored to support Java 8 lambda expressions; with Java 8, Function classes can be replaced with inline expressions that act as a shorthand for anonymous functions.
The RDD methods return Java collections.
Key-value pairs, which are simply written as (key, value) in Scala, are represented by the scala.Tuple2 class.
To maintain type safety, some RDD and function methods, such as those that handle key pairs and doubles, are implemented as specialized classes.
WordCount in Java

An example of WordCount in Java is included with the Spark source code distribution at examples/src/main/java/org/apache/spark/examples/JavaWordCount.java.

First of all, we create a context using the JavaSparkContext class:

JavaSparkContext sc = new JavaSparkContext(master, "JavaWordCount",
    System.getenv("SPARK_HOME"),
    JavaSparkContext.jarOfClass(JavaWordCount.class));

JavaRDD<String> data = sc.textFile(infile, 1);

JavaRDD<String> words = data.flatMap(
    new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
      }
    });

JavaPairRDD<String, Integer> ones = words.mapToPair(
    new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

JavaPairRDD<String, Integer> counts = ones.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });
We then build an RDD from the HDFS location infile. In the first step of the transformation chain, we tokenize each tweet in the dataset and return a list of words. We use an instance of JavaPairRDD<String, Integer> to count occurrences of each word. Finally, we reduce the RDD to a new JavaPairRDD<String, Integer> instance that contains a list of tuples, each representing a word and the number of times it was found in the dataset.
Python API

PySpark requires Python version 2.6 or higher. RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types. Lambda syntax (https://docs.python.org/2/reference/expressions.html) is used to pass functions to RDDs.

The word count in PySpark is relatively similar to its Scala counterpart:
tweets = sc.textFile("/tmp/sample.txt")
counts = tweets.flatMap(lambda tweet: tweet.split(' ')) \
               .map(lambda word: (word, 1)) \
               .reduceByKey(lambda m, n: m + n)
The lambda construct creates anonymous functions at runtime. lambda tweet: tweet.split(' ') creates a function that takes a string, tweet, as the input and outputs a list of strings split by whitespace. Spark's flatMap applies this function to each line of the tweets dataset. In the map phase, for each word token, lambda word: (word, 1) returns a (word, 1) tuple that indicates the occurrence of a word in the dataset. In reduceByKey, we group these tuples by key (word) and sum the values together to obtain the word count with lambda m, n: m + n.
The Spark ecosystem

Apache Spark powers a number of tools, both as a library and as an execution engine.
Spark Streaming

Spark Streaming (found at http://spark.apache.org/docs/latest/streaming-programming-guide.html) is an extension of the Scala API that allows data ingestion from streams such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets.

Spark Streaming receives live input data streams and divides the data into batches (arbitrarily sized time windows), which are then processed by the Spark core engine to generate the final stream of results in batches. This high-level abstraction is called a DStream (org.apache.spark.streaming.dstream.DStream) and is implemented as a sequence of RDDs. DStreams allow for two kinds of operations: transformations and output operations. Transformations work on one or more DStreams to create new DStreams. As part of a chain of transformations, data can be persisted either to a storage layer (HDFS) or to an output channel. Spark Streaming also allows for transformations over a sliding window of data. A window-based operation needs to specify two parameters: the window length, which is the duration of the window, and the slide interval, which is the interval at which the window-based operation is performed.
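Window length and slide interval are easy to illustrate without a cluster: Scala's sliding(size, step) on a plain collection produces exactly the batch groupings a windowed DStream operation would see (the batch values below are made up):

```scala
// Each element stands for one batch of a stream. A window three batches
// long, sliding by two batches, yields overlapping groups - the same
// shape as Spark Streaming's window(windowLength, slideInterval).
val batches = List(1, 2, 3, 4, 5, 6)

val windows = batches.sliding(3, 2).toList
// windows: List(List(1, 2, 3), List(3, 4, 5), List(5, 6))

val windowSums = windows.map(_.sum)   // a windowed reduction over each group
println(windowSums)
```

Note how batch 3 appears in two consecutive windows: with a slide interval shorter than the window length, consecutive windows overlap.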
GraphX

GraphX (found at https://spark.apache.org/docs/latest/graphx-programming-guide.html) is an API for graph computation that exposes a set of operators and algorithms for graph-oriented computation as well as an optimized variant of Pregel.
MLlib

MLlib (found at http://spark.apache.org/docs/latest/mllib-guide.html) provides common Machine Learning (ML) functionality, including tests and data generators. MLlib currently supports four types of algorithms: binary classification, regression, clustering, and collaborative filtering.
Spark SQL

Spark SQL is derived from Shark, an implementation of the Hive data warehousing system that uses Spark as an execution engine. We will discuss Hive in Chapter 7, Hadoop and SQL. With Spark SQL, it is possible to mix SQL-like queries with Scala or Python code. The result sets returned by a query are themselves RDDs, and as such, they can be manipulated by Spark core methods or by MLlib and GraphX.
Processing data with Apache Spark

In this section, we will implement the examples from Chapter 3, Processing - MapReduce and Beyond, using the Scala API. We will consider both the batch and real-time processing scenarios, and we will show you how Spark Streaming can be used to compute statistics on the live Twitter stream.
Building and running the examples

Scala source code for the examples can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch5. We will be using sbt to build, manage, and execute code.

The build.sbt file controls the codebase metadata and software dependencies; these include the version of the Scala interpreter that Spark links to, a link to the Akka package repository used to resolve implicit dependencies, and the dependencies on the Spark and Hadoop libraries.
The source code for all examples can be compiled with:

$ sbt compile

Or, it can be packaged into a JAR file with:

$ sbt package

A helper script to execute compiled classes can be generated with:

$ sbt add-start-script-tasks
$ sbt start-script

The helper can be invoked as follows:

$ target/start <classname> <master> <param1> … <paramn>

Here, <master> is the URI of the master node. An interactive Scala session can be invoked via sbt with the following command:

$ sbt console
This console is not the same as the Spark interactive shell; rather, it is an alternative way to execute code. In order to run Spark code in it, we will need to manually import and instantiate a SparkContext object. All examples presented in this section expect a twitter4j.properties file, containing the consumer key and secret and the access tokens, to be present in the same directory where sbt or spark-shell is being invoked:

oauth.consumerKey=
oauth.consumerSecret=
oauth.accessToken=
oauth.accessTokenSecret=
Running the examples on YARN

To run the examples on a YARN grid, we first build a JAR file using:

$ sbt package

Then, we ship it to the ResourceManager using the spark-submit command:

./bin/spark-submit --class application.to.execute --master yarn-cluster \
    [options] target/scala-2.10/chapter-4_2.10-1.0.jar [<param1> … <paramn>]
Unlike in standalone mode, we don't need to specify a <master> URI; in YARN, the ResourceManager is selected from the cluster configuration. More information on launching Spark on YARN can be found at http://spark.apache.org/docs/latest/running-on-yarn.html.
Finding popular topics

Unlike the earlier examples with the Spark shell, we initialize a SparkContext as part of the program. We pass three arguments to the SparkContext constructor: the type of scheduler we want to use, a name for the application, and the directory where Spark is installed:

import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext

import scala.util.matching.Regex

object HashtagCount {
  def main(args: Array[String]) {
    […]
    val sc = new SparkContext(master,
      "HashtagCount",
      System.getenv("SPARK_HOME"))

    val file = sc.textFile(inputFile)
    val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")

    val counts = file.flatMap(line =>
        (pattern findAllIn line).toList)
      .map(word => (word, 1))
      .reduceByKey((m, n) => m + n)

    counts.saveAsTextFile(outputPath)
  }
}
We create an initial RDD from a dataset stored in HDFS, inputFile, and apply logic that is similar to the WordCount example.

For each tweet in the dataset, we extract a list of strings that match the hashtag pattern, (pattern findAllIn line).toList, and we count an occurrence of each string using the map operator. This generates a new RDD as a list of tuples of the form:

(word, 1), (word2, 1), (word, 1)

Finally, we combine the elements of this RDD using the reduceByKey() method and store the RDD generated by this last step back into HDFS with saveAsTextFile.
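The hashtag regex itself can be exercised without Spark. A quick plain-Scala check (the sample tweet text is made up; here we pull out the capture group for readability, whereas HashtagCount counts the full regex match):

```scala
import scala.util.matching.Regex

// Same idea as HashtagCount's pattern: a '#' preceded by whitespace or
// the start of the string, with the tag body in a capture group.
val pattern = new Regex("(?:\\s|\\A|^)[#]+([A-Za-z0-9-_]+)")

val tweet = "Loving #Spark and #Hadoop2 today"
val tags = pattern.findAllMatchIn(tweet).map(_.group(1)).toList
println(tags)   // List(Spark, Hadoop2)
```

Because the pattern anchors on the preceding whitespace, a '#' embedded mid-word (for example in a URL fragment) is not counted as a hashtag.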
The code for the standalone driver can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/HashTagCount.scala.
Assigning a sentiment to topics

The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/HashTagSentiment.scala, and the code is as follows:
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext

import scala.util.matching.Regex
import scala.io.Source

object HashtagSentiment {
  def main(args: Array[String]) {
    […]
    val sc = new SparkContext(master,
      "HashtagSentiment",
      System.getenv("SPARK_HOME"))

    val file = sc.textFile(inputFile)

    val positive = Source.fromFile(positiveWordsPath)
      .getLines
      .filterNot(_ startsWith ";")
      .toSet
    val negative = Source.fromFile(negativeWordsPath)
      .getLines
      .filterNot(_ startsWith ";")
      .toSet

    val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")

    val counts = file.flatMap(line => (pattern findAllIn line).map({
      word => (word, sentimentScore(line, positive, negative))
    })).reduceByKey({ (m, n) => (m._1 + n._1, m._2 + n._2) })

    val sentiment = counts.map({ hashtagScore =>
      val hashtag = hashtagScore._1
      val score = hashtagScore._2
      val normalizedScore = score._1 / score._2
      (hashtag, normalizedScore)
    })

    sentiment.saveAsTextFile(outputPath)
  }
}
First, we read the lists of positive and negative words into Scala Set objects and filter out comments (strings beginning with ;).

When a hashtag is found, we call a function, sentimentScore, to estimate the sentiment expressed by the given text. This function implements the same logic we used in Chapter 3, Processing - MapReduce and Beyond, to estimate the sentiment of a tweet. It takes as input parameters the tweet's text, str, and the lists of positive and negative words as Set[String] objects. The return value is a pair containing the difference between the positive and negative scores and the number of words in the tweet. In Spark, we represent this return value as a pair of Double and Integer objects:
def sentimentScore(str: String, positive: Set[String],
    negative: Set[String]): (Double, Int) = {
  var positiveScore = 0; var negativeScore = 0
  str.split("""\s+""").foreach { w =>
    if (positive.contains(w)) { positiveScore += 1 }
    if (negative.contains(w)) { negativeScore += 1 }
  }
  ((positiveScore - negativeScore).toDouble,
    str.split("""\s+""").length)
}
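A quick, Spark-free check of the function's behavior; the word lists and tweet text below are made up for illustration:

```scala
// sentimentScore as defined above, exercised on a toy example.
def sentimentScore(str: String, positive: Set[String],
    negative: Set[String]): (Double, Int) = {
  var positiveScore = 0; var negativeScore = 0
  str.split("""\s+""").foreach { w =>
    if (positive.contains(w)) positiveScore += 1
    if (negative.contains(w)) negativeScore += 1
  }
  ((positiveScore - negativeScore).toDouble, str.split("""\s+""").length)
}

val positive = Set("good", "great")
val negative = Set("bad", "awful")

val score = sentimentScore("a great day not a bad day", positive, negative)
println(score)   // (0.0, 7): one positive, one negative, seven words
```

One positive and one negative hit cancel out, so the tweet scores 0.0 over its 7 words; the word count is carried along so that scores can be normalized later.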
We reduce the map output by aggregating by key (the hashtag). In this phase, we emit a triple made of the hashtag, the sum of the differences between positive and negative scores, and the number of words per tweet. We use an additional map step to normalize the sentiment score and store the resulting list of hashtag and sentiment pairs in HDFS.
Data processing on streams

The previous example can be easily adjusted to work on a real-time stream of data. In this and the following section, we will use spark-streaming-twitter to perform some simple analytics tasks on the real-time firehose:

val window = 10
val ssc = new StreamingContext(master, "TwitterStreamEcho",
  Seconds(window), System.getenv("SPARK_HOME"))

val stream = TwitterUtils.createStream(ssc, auth)
val tweets = stream.map(tweet => tweet.getText())
tweets.print()

ssc.start()
ssc.awaitTermination()
}
The Scala source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/TwitterStreamEcho.scala.

The two key packages we need to import are:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter._

We initialize a new StreamingContext, ssc, on a local cluster using a 10-second window and use this context to create a DStream of tweets, whose text we print.

Upon successful execution, Twitter's real-time firehose will be echoed in the terminal in batches of 10 seconds' worth of data. Notice that the computation will continue indefinitely, but it can be interrupted at any moment by pressing Ctrl + C.

The TwitterUtils object is a wrapper for the Twitter4j library (http://twitter4j.org/en/index.html) that ships with spark-streaming-twitter. A successful call to TwitterUtils.createStream will return a DStream of Twitter4j objects (TwitterInputDStream). In the preceding example, we used the getText() method to extract the tweet text; however, notice that the twitter4j object exposes the full Twitter API. For instance, we can print a stream of users with the following call:

val users = stream.map(tweet => (tweet.getUser().getId(),
  tweet.getUser().getName()))
users.print()
State management

Spark Streaming provides an ad hoc DStream to keep the state of each key in an RDD, and the updateStateByKey method to mutate that state.

We can reuse the code of the batch example to assign and update sentiment scores on streams:
object StreamingHashTagSentiment {
  […]
  val counts = text.flatMap(line => (pattern findAllIn line)
      .toList
      .map(word => (word, sentimentScore(line, positive, negative))))
    .reduceByKey({ (m, n) => (m._1 + n._1, m._2 + n._2) })

  val sentiment = counts.map({ hashtagScore =>
    val hashtag = hashtagScore._1
    val score = hashtagScore._2
    val normalizedScore = score._1 / score._2
    (hashtag, normalizedScore)
  })

  val stateDstream = sentiment
    .updateStateByKey[Double](updateFunc)

  stateDstream.print

  ssc.checkpoint("/tmp/checkpoint")
  ssc.start()
}
A state DStream is created by calling sentiment.updateStateByKey.

The updateFunc function implements the state mutation logic, which is a cumulative sum of sentiment scores over a period of time:
val updateFunc = (values: Seq[Double], state: Option[Double]) => {
  val currentScore = values.sum
  val previousScore = state.getOrElse(0.0)
  Some((currentScore + previousScore) * decayFactor)
}
decayFactor is a constant value between 0 and 1 that we use to proportionally decrease the score over time. Intuitively, this will fade out hashtags that are no longer trending. Spark Streaming writes intermediate data for stateful operations to HDFS, so we need to checkpoint the streaming context with ssc.checkpoint.
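The decay behavior is easy to check in isolation. A plain-Scala sketch of updateFunc with an illustrative decayFactor of 0.5 (the batch scores are made up):

```scala
// updateFunc as above, with an illustrative decay factor.
val decayFactor = 0.5

val updateFunc = (values: Seq[Double], state: Option[Double]) => {
  val currentScore = values.sum
  val previousScore = state.getOrElse(0.0)
  Some((currentScore + previousScore) * decayFactor)
}

// First batch: two sentiment scores arrive; there is no previous state.
val s1 = updateFunc(Seq(1.0, 3.0), None)   // Some(2.0)
// Second batch: no new scores; the accumulated state simply decays.
val s2 = updateFunc(Seq.empty, s1)         // Some(1.0)
println(s"$s1 $s2")
```

With no new mentions, a hashtag's score halves on every batch, so stale topics decay geometrically towards zero.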
The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/StreamingHashTagSentiment.scala.
Data analysis with Spark SQL

Spark SQL can ease the task of representing and manipulating structured data. We will load a JSON file into a temporary table and calculate simple statistics by blending SQL statements and Scala code:

object SparkJson {
  […]
  val file = sc.textFile(inputFile)

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._

  val tweets = sqlContext.jsonFile(inFile)
  tweets.printSchema()

  // Register the SchemaRDD as a table
  tweets.registerTempTable("tweets")

  val text = sqlContext.sql("SELECT text, user.id FROM tweets")

  // Find the ten most popular hashtags
  val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")
  val counts = text.flatMap(sqlRow => (pattern findAllIn
      sqlRow(0).toString).toList)
    .map(word => (word, 1))
    .reduceByKey((m, n) => m + n)

  counts.registerTempTable("hashtag_frequency")
  counts.printSchema

  val top10 = sqlContext.sql("SELECT _1 AS hashtag, _2 AS frequency FROM
    hashtag_frequency ORDER BY frequency DESC LIMIT 10")

  top10.foreach(println)
}
As with previous examples, we instantiate a SparkContext, sc, and load the dataset of JSON tweets. We then create an instance of org.apache.spark.sql.SQLContext based on the existing sc. The import sqlContext._ statement gives access to all functions and implicit conversions for sqlContext.

We load the tweets' JSON dataset using sqlContext.jsonFile. The resulting tweets object is an instance of SchemaRDD, a new type of RDD introduced by Spark SQL. The SchemaRDD class is conceptually similar to a table in a relational database; it is composed of Row objects and a schema that describes the content of each Row. We can see the schema for a tweet by calling tweets.printSchema().

Before we're able to manipulate tweets with SQL statements, we need to register the SchemaRDD as a table in the SQLContext. We then extract the text field of a JSON tweet with an SQL query. Note that the output of sqlContext.sql is again an RDD; as such, we can manipulate it using Spark core methods. In our case, we reuse the logic of previous examples to extract hashtags and count their occurrences. Finally, we register the resulting RDD as a table, hashtag_frequency, and order hashtags by frequency with a SQL query.
The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/SparkJson.scala.
SQL on data streams

At the time of writing, a SQLContext cannot be directly instantiated from a StreamingContext object. It is, however, possible to query a DStream by registering a SchemaRDD for each RDD in a given stream:
object SqlOnStream {
  […]
  val ssc = new StreamingContext(sc, Seconds(window))

  val gson = new Gson()
  val dstream = TwitterUtils
    .createStream(ssc, auth)
    .map(gson.toJson(_))

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._

  dstream.foreachRDD(rdd => {
    rdd.foreach(println)
    val jsonRDD = sqlContext.jsonRDD(rdd)
    jsonRDD.registerTempTable("tweets")
    jsonRDD.printSchema
    sqlContext.sql(query)
  })

  ssc.checkpoint("/tmp/checkpoint")
  ssc.start()
  ssc.awaitTermination()
}
In order to get the two working together, we first create a SparkContext, sc, that we use to initialize both a StreamingContext, ssc, and a SQLContext. As in previous examples, we use TwitterUtils.createStream to create a DStream, dstream. In this example, we use Google's Gson JSON parser to serialize each twitter4j object to a JSON string. To execute Spark SQL queries on the stream, we register a SchemaRDD, jsonRDD, within a dstream.foreachRDD loop. We use the sqlContext.jsonRDD method to create an RDD from each batch of JSON tweets. At this point, we can query the SchemaRDD using the sqlContext.sql method.
The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/SqlOnStream.scala.
Comparing Samza and Spark Streaming

It is useful to compare Samza and Spark Streaming to help identify the areas in which each can best be applied. As has hopefully been made clear in this book, these technologies are very much complementary. Even though Spark Streaming might appear competitive with Samza, we feel both products offer compelling advantages in certain areas.

Samza shines when the input data is truly a stream of discrete events and you wish to build processing that operates on this type of input. Samza jobs running on Kafka can have latencies on the order of milliseconds. This provides a programming model focused on individual messages and is the better fit for true near-real-time processing applications. Though it lacks support for building topologies of collaborating jobs, its simple model allows similar constructs to be built and, perhaps more importantly, to be easily reasoned about. Its model of partitioning and scaling also focuses on simplicity, which again makes a Samza application very easy to understand and gives it a significant advantage when dealing with something as intrinsically complex as real-time data.

Spark is much more than a streaming product. Its support for building distributed data structures from existing datasets, and for manipulating these with powerful primitives, gives it the ability to process large datasets at a higher level of granularity. Other products in the Spark ecosystem build additional interfaces or abstractions upon this common batch-processing core. This is very much a different focus to the message stream model of Samza.

This batch model is also demonstrated in Spark Streaming; instead of a per-message processing model, it slices the message stream into a series of RDDs. With a fast execution engine, this means latencies as low as 1 second (http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf). For workloads that wish to analyze the stream in such a way, this will be a better fit than Samza's per-message model, which requires additional logic to provide such windowing.
Summary

This chapter explored Spark and showed how it provides a rich framework for iterative processing upon which applications can be built atop YARN. In particular, we highlighted:

The distributed data-structure-based processing model of Spark and how it allows very efficient in-memory data processing
The broader Spark ecosystem and how multiple additional projects are built atop it to specialize the computational model even further

In the next chapter, we will explore Apache Pig and its programming language, Pig Latin. We will see how this tool can greatly simplify software development for Hadoop by abstracting away some of the MapReduce and Spark complexity.
Chapter 6. Data Analysis with Apache Pig

In the previous chapters, we explored a number of APIs for data processing. MapReduce, Spark, Tez, and Samza are rather low-level, and writing non-trivial business logic with them often requires significant Java development. Moreover, different users will have different needs. It might be impractical for an analyst to write MapReduce code or build a DAG of inputs and outputs to answer some simple queries. At the same time, a software engineer or a researcher might want to prototype ideas and algorithms using high-level abstractions before jumping into low-level implementation details.

In this chapter and the following one, we will explore some tools that provide a way to process data on HDFS using higher-level abstractions. In this chapter, we will explore Apache Pig and, in particular, we will cover the following topics:

What Apache Pig is and the dataflow model it provides
Pig Latin's data types and functions
How Pig can be easily enhanced using custom user code
How we can use Pig to analyze the Twitter stream
An overview of Pig

Historically, the Pig toolkit consisted of a compiler that generated MapReduce programs, bundled their dependencies, and executed them on Hadoop. Pig jobs are written in a language called Pig Latin and can be executed in both interactive and batch fashions. Furthermore, Pig Latin can be extended using User Defined Functions (UDFs) written in Java, Python, Ruby, Groovy, or JavaScript.

Pig use cases include the following:

Data processing
Ad hoc analytical queries
Rapid prototyping of algorithms
Extract Transform Load pipelines
Following a trend we have seen in previous chapters, Pig is moving towards a general-purpose computing architecture. As of version 0.13, the ExecutionEngine interface (org.apache.pig.backend.executionengine) acts as a bridge between the frontend and the backend of Pig, allowing Pig Latin scripts to be compiled and executed on frameworks other than MapReduce. At the time of writing, version 0.13 ships with MRExecutionEngine (org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRExecutionEngine), and work on a low-latency backend based on Tez (org.apache.pig.backend.hadoop.executionengine.tez.*) is expected to be included in version 0.14 (see https://issues.apache.org/jira/browse/PIG-3446). Work on integrating Spark is currently in progress in the development branch (see https://issues.apache.org/jira/browse/PIG-4059).
Pig 0.13 comes with a number of performance enhancements for the MapReduce backend, in particular two features that reduce the latency of small jobs: direct HDFS access (https://issues.apache.org/jira/browse/PIG-3642) and auto local mode (https://issues.apache.org/jira/browse/PIG-3463). Direct HDFS access, controlled by the opt.fetch property, is turned on by default. When doing a DUMP in a simple (map-only) script that contains only LIMIT, FILTER, UNION, STREAM, or FOREACH operators, input data is fetched from HDFS and the query is executed directly in Pig, bypassing MapReduce. With auto local mode, controlled by the pig.auto.local.enabled property, Pig will run a query in Hadoop local mode when the input data size is smaller than pig.auto.local.input.maxbytes. Auto local mode is off by default.
Pig will launch MapReduce jobs if both modes are off or if the query is not eligible for either. If both modes are on, Pig will check whether the query is eligible for direct access and, if not, fall back to auto local mode. Failing that, it will execute the query on MapReduce.
Getting started

We will use the stream.py script to extract JSON data and retrieve a specific number of tweets; we can run it with a command such as the following:
$ python stream.py -j -n 10000 > tweets.json
The tweets.json file will contain one JSON string on each line, each representing a tweet.
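The layout can be sketched in Python as line-delimited JSON: one object per line, each parsed independently. The field subset shown here is an illustrative assumption; real tweets carry many more fields.

```python
import json

# Illustrative tweets with a subset of the fields used later in the chapter.
tweets = [
    {"created_at": "Mon Sep 01 12:00:00 +0000 2014", "id": 1,
     "id_str": "1", "text": "Learning #hadoop"},
    {"created_at": "Mon Sep 01 12:00:01 +0000 2014", "id": 2,
     "id_str": "2", "text": "Pig is fun"},
]

# Serialize: one JSON object per line, the format Pig's JsonLoader expects.
dump = "\n".join(json.dumps(t) for t in tweets)

# Each line parses back independently.
parsed = [json.loads(line) for line in dump.splitlines()]
print(parsed[1]["text"])  # Pig is fun
```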
Remember that the Twitter API credentials need to be made available as environment variables or hardcoded in the script itself.
Running Pig

Pig is a tool that translates statements written in Pig Latin and executes them either on a single machine, in standalone mode, or on a full Hadoop cluster, in distributed mode. Even in the latter case, Pig's role is to translate Pig Latin statements into MapReduce jobs, and it therefore doesn't require the installation of additional services or daemons; it is used as a command-line tool with its associated libraries.
Cloudera CDH ships with Apache Pig version 0.12. Alternatively, the Pig source code and binary distributions can be obtained at https://pig.apache.org/releases.html.
As can be expected, MapReduce mode requires access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default when running the pig command at the command-line prompt. Scripts can be executed with the following command:
$ pig -f <script>
Parameters can be passed via the command line using -param <param>=<val>, as follows:
$ pig -param input=tweets.txt
Parameters can also be specified in a parameter file that is passed to Pig using the -param_file <file> option. Multiple files can be specified. If a parameter is present multiple times in the files, the last value will be used and a warning will be displayed. A parameter file contains one parameter per line. Empty lines and comments (lines starting with #) are allowed. Within a Pig script, parameters are referenced in the form $<parameter>. A default value can be assigned using the %default statement, for example %default input 'tweets.json'. The %default command will not work within a Grunt session; we'll discuss Grunt in the next section.
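A hypothetical parameter file illustrating the rules above (the file and parameter names are assumptions for illustration):

```
# params.txt -- one parameter per line; empty lines and
# #-comments are allowed.

input=tweets.json
output=top_hashtags

# Invocation (script name is illustrative):
# $ pig -param_file params.txt -f top_hashtags.pig
```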
In local mode, all files are installed and run using the local host and filesystem. Specify local mode using the -x flag:
$ pig -x local
In both execution modes, Pig programs can be run either in an interactive shell or in batch mode.
Grunt – the Pig interactive shell

Pig can run in an interactive mode using the Grunt shell, which is invoked when we use the pig command at the terminal prompt. In the rest of this chapter, we will assume that examples are executed within a Grunt session. Other than executing Pig Latin statements, Grunt offers a number of utilities and access to shell commands:
fs: allows users to manipulate Hadoop filesystem objects and has the same semantics as the Hadoop CLI
sh: executes commands via the operating system shell
exec: launches a Pig script within an interactive Grunt session
kill: kills a MapReduce job
help: prints a list of all available commands
Elastic MapReduce

Pig scripts can be executed on EMR by creating a cluster with --applications Name=Pig,Args=--version,<version>, as follows:
$ aws emr create-cluster \
  --name "Pig cluster" \
  --ami-version <ami version> \
  --instance-type <EC2 instance> \
  --instance-count <number of nodes> \
  --applications Name=Pig,Args=--version,<version> \
  --log-uri <S3 bucket> \
  --steps Type=PIG,\
Name="Pig script",\
Args=[-f,s3://<script location>,\
-p,input=<input param>,\
-p,output=<output param>]
The preceding command will provision a new EMR cluster and execute s3://<script location>. Notice that the scripts to be executed and the input (-p input) and output (-p output) paths are expected to be located on S3.
As an alternative to creating a new EMR cluster, it is possible to add Pig steps to an already-instantiated EMR cluster using the following command:
$ aws emr add-steps \
  --cluster-id <cluster id> \
  --steps Type=PIG,\
Name="Other Pig script",\
Args=[-f,s3://<script location>,\
-p,input=<input param>,\
-p,output=<output param>]
In the preceding command, <cluster id> is the ID of the instantiated cluster.
It is also possible to ssh into the master node and run Pig Latin statements within a Grunt session with the following command:
$ aws emr ssh --cluster-id <cluster id> --key-pair-file <key pair>
Fundamentals of Apache Pig

The primary interface to program Apache Pig is Pig Latin, a procedural language that implements ideas of the dataflow paradigm.
Pig Latin programs are generally organized as follows:
A LOAD statement reads data from HDFS
A series of statements aggregates and manipulates data
A STORE statement writes output to the filesystem
Alternatively, a DUMP statement displays the output to the terminal
The following example shows a sequence of statements that outputs the top 10 hashtags, ordered by frequency, extracted from the dataset of tweets:
tweets = LOAD 'tweets.json'
    USING JsonLoader('created_at:chararray,
        id:long,
        id_str:chararray,
        text:chararray');
hashtags = FOREACH tweets {
    GENERATE FLATTEN(
        REGEX_EXTRACT(
            text,
            '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)', 1)
    ) AS tag;
}
hashtags_grpd = GROUP hashtags BY tag;
hashtags_count = FOREACH hashtags_grpd {
    GENERATE
        group,
        COUNT(hashtags) AS occurrencies;
}
hashtags_count_sorted = ORDER hashtags_count BY occurrencies DESC;
top_10_hashtags = LIMIT hashtags_count_sorted 10;
DUMP top_10_hashtags;
First, we load the tweets.json dataset from HDFS, de-serialize the JSON file, and map it to a four-column schema that contains a tweet's creation time, its ID in numerical and string form, and its text. For each tweet, we extract hashtags from its text using a regular expression. We aggregate on hashtag, count the number of occurrences, and order by frequency. Finally, we limit the ordered records to the 10 most frequent hashtags.
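The same dataflow can be sketched in Python to make each step concrete. The regex mirrors the one in the script above (with the ASCII # only); the tweet texts are made-up illustrative data.

```python
import re
from collections import Counter

# Illustrative tweet texts standing in for the 'text' column.
texts = [
    "Loving #hadoop and #pig",
    "More #hadoop today",
    "#pig #hadoop #bigdata",
    "no tags here",
]

# FOREACH ... GENERATE FLATTEN(REGEX_EXTRACT(...)): one row per hashtag.
hashtags = [tag for text in texts
            for tag in re.findall(r'(?:\s|\A|^)[#]+([A-Za-z0-9-_]+)', text)]

# GROUP ... / COUNT ... / ORDER ... DESC / LIMIT 10, all in one call.
top_10 = Counter(hashtags).most_common(10)
print(top_10)  # [('hadoop', 3), ('pig', 2), ('bigdata', 1)]
```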
A series of statements like the preceding one is picked up by the Pig compiler, transformed into MapReduce jobs, and executed on a Hadoop cluster. The planner and optimizer will resolve dependencies on input and output relations and parallelize the execution of statements wherever possible.
Statements are the building blocks of processing data with Pig. They take a relation as input and produce another relation as output. In Pig Latin terms, a relation can be defined as a bag of tuples, two data types we will use throughout the remainder of this chapter.
Users experienced with SQL and the relational data model might find Pig Latin's syntax somewhat familiar. While there are indeed similarities in the syntax itself, Pig Latin implements an entirely different computational model. Pig Latin is procedural: it specifies the actual data transformations to be performed, whereas SQL is declarative and describes the nature of the problem without specifying the actual runtime processing. In terms of organizing data, a relation can be thought of as a table in a relational database, where the tuples in a bag correspond to the rows in a table. Relations are unordered, and therefore easily parallelizable, and they are less constrained than relational tables: Pig relations can contain tuples with different numbers of fields, and tuples with the same field count can have fields of different types in corresponding positions.
A key difference between SQL and the dataflow model adopted by Pig Latin lies in how splits in a data pipeline are managed. In the relational world, a declarative language such as SQL implements and executes queries that generate a single result. The dataflow model instead sees data transformations as a graph where inputs and outputs are nodes connected by operators. For instance, intermediate steps of a query might require the input to be grouped by a number of keys, resulting in multiple outputs (GROUP BY). Pig has built-in mechanisms to manage multiple dataflows in such a graph, executing operators as soon as their inputs are available and potentially applying different operators to each flow. For instance, Pig's implementation of the GROUP BY operator uses the parallel feature (http://pig.apache.org/docs/r0.12.0/perf.html#parallel) to allow a user to increase the number of reduce tasks for the generated MapReduce jobs and hence increase concurrency. An additional side effect of this property is that when multiple operators can be executed in parallel in the same program, Pig does so (more details on Pig's multi-query implementation can be found at http://pig.apache.org/docs/r0.12.0/perf.html#multi-query-execution). Another consequence of Pig Latin's approach to computation is that it allows the persistence of data at any point in the pipeline, and it allows the developer to select specific operator implementations and execution plans when necessary, effectively overriding the optimizer.
Pig Latin allows and even encourages developers to insert their own code almost anywhere in a pipeline by means of User Defined Functions (UDFs) as well as by utilizing Hadoop streaming. UDFs allow users to specify custom business logic for how data is loaded, stored, and processed, whereas streaming allows users to launch executables at any point in the dataflow.
Programming Pig

Pig Latin comes with a number of built-in functions (the eval, load/store, math, string, bag, and tuple functions) and a number of scalar and complex data types. Additionally, Pig allows function and data-type extension by means of UDFs and dynamic invocation of Java methods.
Pig data types

Pig supports the following scalar data types:
int: a signed 32-bit integer
long: a signed 64-bit integer
float: a 32-bit floating point
double: a 64-bit floating point
chararray: a character array (string) in Unicode UTF-8 format
bytearray: a byte array (blob)
boolean: a boolean
datetime: a datetime
biginteger: a Java BigInteger
bigdecimal: a Java BigDecimal
Pig supports the following complex data types:
map: an associative array enclosed by [], with the key and value separated by # and items separated by commas
tuple: an ordered list of data, where elements can be of any scalar or complex type, enclosed by () with items separated by commas
bag: an unordered collection of tuples enclosed by {} and separated by commas
By default, Pig treats data as untyped. The user can declare the types of data at load time or cast it manually when necessary. If a data type is not declared but a script implicitly treats a value as a certain type, Pig will assume it is of that type and cast it accordingly. The fields of a bag or tuple can be referred to by name (tuple.field) or by position ($<index>). Pig counts from 0, so the first element is denoted as $0.
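A few illustrative literal forms may help; the values shown are made up for this sketch:

```
-- map:   [name#Pig, version#0.12]
-- tuple: (1415, 'hadoop', 2.0)
-- bag:   {(1415, 'hadoop'), (1416, 'pig')}

-- Given a tuple t with schema (id:long, tag:chararray),
-- t.tag and t.$1 refer to the same field; $0 is the first field.
```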
Pig functions

Built-in functions are implemented in Java, and they try to follow standard Java conventions. There are, however, a number of differences to keep in mind, which are as follows:
Function names are case sensitive and uppercase
If the result value is null, empty, or not a number (NaN), Pig returns null
If Pig is unable to process the expression, it returns an exception
A list of all built-in functions can be found at http://pig.apache.org/docs/r0.12.0/func.html.
Load/store

Load/store functions determine how data goes into and comes out of Pig. The PigStorage, TextLoader, and BinStorage functions can be used to read and write UTF-8 delimited text, unstructured text, and binary data, respectively. Support for compression is determined by the load/store function. The PigStorage and TextLoader functions support gzip and bzip2 compression for both read (load) and write (store). The BinStorage function does not support compression.
As of version 0.12, Pig includes built-in support for loading and storing Avro and JSON data via AvroStorage (load/store), JsonStorage (store), and JsonLoader (load). At the time of writing, JSON support is still somewhat limited. In particular, Pig expects a schema for the data to be provided as an argument to JsonLoader/JsonStorage, or it assumes that a .pig_schema file (produced by JsonStorage) is present in the directory containing the input data. In practice, this makes it difficult to work with JSON dumps not generated by Pig itself.
As seen in the following example, we can load the JSON dataset with JsonLoader:
tweets = LOAD 'tweets.json' USING JsonLoader(
    'created_at:chararray,
    id:long,
    id_str:chararray,
    text:chararray,
    source:chararray');
We provide a schema so that the first five elements of a JSON object (created_at, id, id_str, text, and source) are mapped. We can look at the schema of tweets by using DESCRIBE tweets, which returns the following:
tweets: {created_at: chararray, id: long, id_str: chararray, text: chararray, source: chararray}
Eval

Eval functions implement a set of operations to be applied on an expression that returns a bag or map data type. The expression result is evaluated within the function context.
AVG(expression): computes the average of the numeric values in a single-column bag
COUNT(expression): counts all elements with non-null values in the first position in a bag
COUNT_STAR(expression): counts all elements in a bag
IsEmpty(expression): checks whether a bag or map is empty
MAX(expression), MIN(expression), and SUM(expression): return the max, min, or sum of the elements in a bag
TOKENIZE(expression): splits a string and outputs a bag of words
The tuple, bag, and map functions

These functions allow conversion from and to the bag, tuple, and map types. They include the following:
TOTUPLE(expression), TOMAP(expression), and TOBAG(expression): coerce an expression to a tuple, map, or bag
TOP(n, column, relation): returns the top n tuples from a bag of tuples
The math, string, and datetime functions

Pig exposes a number of functions provided by the java.lang.Math, java.lang.String, and java.util.Date classes and by the Joda-Time DateTime class (found at http://www.joda.org/joda-time/).
Dynamic invokers

Dynamic invokers allow the execution of Java functions without having to wrap them in a UDF. They can be used for any static function that:
accepts no arguments, or accepts a combination of string, int, long, double, float, or arrays of these same types
returns a string, int, long, double, or float value
Only primitives can be used for numbers; Java boxed classes (such as Integer) cannot be used as arguments. Depending on the return type, a specific kind of invoker must be used: InvokeForString, InvokeForInt, InvokeForLong, InvokeForDouble, or InvokeForFloat. More details regarding dynamic invokers can be found at http://pig.apache.org/docs/r0.12.0/func.html#dynamic-invokers.
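As a sketch based on the example in the Pig documentation, a static Java method such as java.net.URLDecoder.decode(String, String) can be wrapped and called like any other function. The relation encoded and its url field are assumptions for illustration:

```
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
decoded = FOREACH encoded GENERATE UrlDecode(url, 'UTF-8');
```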
Macros

As of version 0.9, Pig Latin's preprocessor supports macro expansion. Macros are defined using the DEFINE statement:
DEFINE macro_name(param1, ..., paramN) RETURNS output_bag {
    pig_latin_statements
};
The macro is expanded inline, and its parameters are referenced in the Pig Latin block within {}.
The macro's output relation is given in the RETURNS statement (output_bag). RETURNS void is used for a macro with no output relation.
We can define a macro to count the number of rows in a relation, as follows:
DEFINE count_rows(X) RETURNS cnt {
    grpd = GROUP $X ALL;
    $cnt = FOREACH grpd GENERATE COUNT($X);
};
We can use it in a Pig script or Grunt session to count the number of tweets:
tweets_count = count_rows(tweets);
DUMP tweets_count;
Macros allow us to make scripts modular by housing code in separate files and importing them where needed. For example, we can save count_rows in a file called count_rows.macro and later import it with the command import 'count_rows.macro'.
Macros have a number of limitations; in particular, only Pig Latin statements are allowed inside a macro. It is not possible to use REGISTER statements or shell commands, UDFs are not allowed, and parameter substitution inside the macro is not supported.
Working with data

Pig Latin provides a number of relational operators to combine functions and apply transformations on data. Typical operations in a data pipeline consist of filtering relations (FILTER), aggregating inputs based on keys (GROUP), generating transformations based on columns of data (FOREACH), and joining relations (JOIN) based on shared keys.
In the following sections, we will illustrate these operators on a dataset of tweets generated by loading JSON data.
Filtering

The FILTER operator selects tuples from a relation based on an expression, as follows:
relation = FILTER relation BY expression;
We can use this operator to filter tweets whose text matches the hashtag regular expression, as follows:
tweets_with_tag = FILTER tweets BY (
    text MATCHES '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)'
);
Aggregation

The GROUP operator groups together data in one or more relations based on an expression or a key, as follows:
relation = GROUP relation BY expression;
We can group tweets by the source field into a new relation, grpd, as follows:
grpd = GROUP tweets BY source;
It is possible to group on multiple dimensions by specifying a tuple as the key, as follows:
grpd = GROUP tweets BY (created_at, source);
The result of a GROUP operation is a relation that includes one tuple per unique value of the group expression. This tuple contains two fields. The first field is named group and is of the same type as the group key. The second field takes the name of the original relation and is of type bag. The names of both fields are generated by the system.
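The shape of GROUP's output can be sketched in Python: one entry per distinct key, pairing the group value with a bag of the original tuples. The tweet tuples below are illustrative.

```python
from collections import defaultdict

# Illustrative (source, text) tuples standing in for the tweets relation.
tweets = [
    ("web", "tweet one"),
    ("android", "tweet two"),
    ("web", "tweet three"),
]

# Collect the original tuples under their group key.
grouped = defaultdict(list)
for source, text in tweets:
    grouped[source].append((source, text))

# Each entry mirrors Pig's (group, bag-named-after-the-relation) pair.
grpd = [(group, bag) for group, bag in grouped.items()]
print(sorted(grpd)[0])  # ('android', [('android', 'tweet two')])
```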
Using the ALL keyword, Pig will aggregate across the whole relation; GROUP tweets ALL will place all tuples in a single group.
As previously mentioned, Pig allows explicit handling of the concurrency level of the GROUP operator using the PARALLEL clause:
grpd = GROUP tweets BY (created_at, id) PARALLEL 10;
In the preceding example, the MapReduce job generated by the compiler will run 10 concurrent reduce tasks. Otherwise, Pig uses a heuristic to estimate how many reducers to use.
Another way of globally enforcing the number of reduce tasks is to use the set default_parallel <n> command.
Foreach

The FOREACH operator applies functions to columns, as follows:
relation = FOREACH relation GENERATE transformation;
The output of FOREACH depends on the transformation applied.
We can use the operator to project the text of all tweets that contain a hashtag, as follows:
t = FOREACH tweets_with_tag GENERATE text;
We can also apply a function to the projected columns. For instance, we can use the TOKENIZE function to split each tweet into words, as follows:
t = FOREACH tweets_with_tag GENERATE FLATTEN(TOKENIZE(text)) AS word;
The FLATTEN modifier further un-nests the bag generated by TOKENIZE, producing one row per word.
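The TOKENIZE-then-FLATTEN combination can be sketched in Python: tokenizing yields one bag of words per input row, and flattening un-nests those bags into individual rows. The texts are illustrative.

```python
# Illustrative input texts, one per row of the relation.
texts = ["pig is fun", "hadoop scales"]

# TOKENIZE: one bag of words per input row.
bags = [text.split() for text in texts]

# FLATTEN: un-nest each bag, producing one row per word.
words = [word for bag in bags for word in bag]
print(words)  # ['pig', 'is', 'fun', 'hadoop', 'scales']
```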
Join

The JOIN operator performs an inner join of two or more relations based on common field values. Its syntax is as follows:
relation = JOIN relation1 BY expression1, relation2 BY expression2;
We can use a join operation to detect tweets that contain positive words, as follows:
positive = LOAD 'positive-words.txt' USING PigStorage() AS (w:chararray);
We filter out the comments, as follows:
positive_words = FILTER positive BY NOT w MATCHES '^;.*';
positive_words is a bag of tuples, each containing a word. We then tokenize the tweets' text and create a new bag of (id_str, word) tuples, as follows:
id_words = FOREACH tweets {
    GENERATE
        id_str,
        FLATTEN(TOKENIZE(text)) AS word;
}
We join the two relations on the word field and obtain a relation of all tweets that contain one or more positive words, as follows:
positive_tweets = JOIN positive_words BY w, id_words BY word;
In this statement, we join positive_words and id_words on the condition that id_words.word is a positive word. The positive_tweets relation is a bag in the form {w: chararray, id_str: chararray, word: chararray} that contains all elements of positive_words and id_words that match the join condition.
We can combine the GROUP and FOREACH operators to calculate the number of positive words per tweet (with at least one positive word). First, we group the relation of positive tweets by the tweet ID, and then we count the number of occurrences of each ID in the relation, as follows:
grpd = GROUP positive_tweets BY id_str;
score = FOREACH grpd GENERATE FLATTEN(group), COUNT(positive_tweets);
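The whole tokenize-join-group-count pipeline can be sketched in Python. The positive-word list and tweets below are made-up illustrative data.

```python
from collections import Counter

# Illustrative positive-word list and (id_str, text) tweets.
positive_words = {"good", "great", "happy"}
tweets = [
    ("1", "what a good great day"),
    ("2", "nothing to report"),
    ("3", "happy now"),
]

# FOREACH ... FLATTEN(TOKENIZE(text)): (id_str, word) pairs.
id_words = [(id_str, word) for id_str, text in tweets
            for word in text.split()]

# JOIN ... BY word, then GROUP BY id_str and COUNT:
# number of positive words per tweet.
score = Counter(id_str for id_str, word in id_words
                if word in positive_words)
print(dict(score))  # {'1': 2, '3': 1}
```

Note that, like the Pig pipeline, the inner join drops tweets with no positive words (tweet 2 never appears in the result).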
The JOIN operator can make use of the PARALLEL feature as well, as follows:
positive_tweets = JOIN positive_words BY w, id_words BY word PARALLEL 10;
The preceding command will execute the join with 10 reduce tasks.
It is possible to specify the operator's behavior with the USING keyword followed by the ID of a specialized join. More details can be found at http://pig.apache.org/docs/r0.12.0/perf.html#specialized-joins.
Extending Pig (UDFs)

Functions can be a part of almost every operator in Pig. There are two main differences between UDFs and built-in functions. First, UDFs need to be registered using the REGISTER keyword in order to make them available to Pig. Second, they need to be qualified when used. Pig UDFs can currently be implemented in Java, Python, Ruby, JavaScript, and Groovy. The most extensive support is provided for Java functions, which allow you to customize all parts of the process, including data load/store, transformation, and aggregation. Additionally, Java functions are more efficient because they are implemented in the same language as Pig and because additional interfaces, such as the Algebraic and Accumulator interfaces, are supported. On the other hand, the Ruby and Python APIs allow more rapid prototyping.
The integration of UDFs with the Pig environment is mainly managed by two statements, REGISTER and DEFINE:
REGISTER registers a JAR file so that the UDFs in the file can be used, as follows:

REGISTER 'piggybank.jar';

DEFINE creates an alias to a function or a streaming command, as follows:

DEFINE MyFunction my.package.uri.MyFunction();
Version 0.12 of Pig introduced streaming UDFs, a mechanism for writing functions in languages with no JVM implementation.
Contributed UDFs

Pig's codebase hosts a UDF repository called Piggybank. Other popular contributed repositories are Twitter's Elephant Bird (found at https://github.com/kevinweil/elephant-bird/) and Apache DataFu (found at http://datafu.incubator.apache.org/).
Piggybank

Piggybank is a place for Pig users to share their functions. Shared code is located in the official Pig Subversion repository, found at http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/. The API documentation can be found at http://pig.apache.org/docs/r0.12.0/api/ under the contrib section. Piggybank UDFs can be obtained by checking out and compiling the sources from the Subversion repository or by using the JAR file that ships with binary releases of Pig. In Cloudera CDH, piggybank.jar is available at /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar.
Elephant Bird

Elephant Bird is an open source library of all things Hadoop used in production at Twitter. This library contains a number of serialization tools, custom input and output formats, writables, Pig load/store functions, and more miscellanea.
Elephant Bird ships with an extremely flexible JSON loader function which, at the time of writing, is the go-to resource for manipulating JSON data in Pig.
Apache DataFu

Apache DataFu Pig collects a number of analytical functions developed and contributed by LinkedIn. These include statistical and estimation functions, bag and set operations, sampling, hashing, and link analysis.
Analyzing the Twitter stream

In the following examples, we will use the implementation of JsonLoader provided by Elephant Bird to load and manipulate JSON data. We will use Pig to explore tweet metadata and analyze trends in the dataset. Finally, we will model the interaction between users as a graph and use Apache DataFu to analyze this social network.
Prerequisites

Download the elephant-bird-pig (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-pig/4.5/elephant-bird-pig-4.5.jar), elephant-bird-hadoop-compat (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-hadoop-compat/4.5/elephant-bird-hadoop-compat-4.5.jar), and elephant-bird-core (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-core/4.5/elephant-bird-core-4.5.jar) JAR files from the Maven central repository and copy them onto HDFS using the following commands:
$ hdfs dfs -put target/elephant-bird-pig-4.5.jar hdfs:///jar/
$ hdfs dfs -put target/elephant-bird-hadoop-compat-4.5.jar hdfs:///jar/
$ hdfs dfs -put elephant-bird-core-4.5.jar hdfs:///jar/
Dataset exploration

Before diving deeper into the dataset, we need to register the dependencies on Elephant Bird and DataFu, as follows:
REGISTER /opt/cloudera/parcels/CDH/lib/pig/datafu-1.1.0-cdh5.0.0.jar
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/json-simple-1.1.jar
REGISTER hdfs:///jar/elephant-bird-pig-4.5.jar
REGISTER hdfs:///jar/elephant-bird-hadoop-compat-4.5.jar
REGISTER hdfs:///jar/elephant-bird-core-4.5.jar
Then, load the JSON dataset of tweets using com.twitter.elephantbird.pig.load.JsonLoader, as follows:
tweets = LOAD 'tweets.json' USING
    com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
com.twitter.elephantbird.pig.load.JsonLoader decodes each line of the input file as JSON and passes the resulting map of values to Pig as a single-element tuple. This enables access to elements of the JSON object without having to specify a schema upfront. The '-nestedLoad' argument instructs the class to load nested data structures.
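What '-nestedLoad' exposes can be sketched in Python: each line becomes a nested map, and a chained lookup such as $0#'entities'#'hashtags' corresponds to chained dictionary access. The tweet below is illustrative.

```python
import json

# An illustrative tweet line with nested entities and user objects.
line = json.dumps({
    "text": "hello #pig",
    "entities": {"hashtags": [{"text": "pig", "indices": [6, 10]}]},
    "user": {"id": 42},
})

tweet = json.loads(line)

# $0#'entities'#'hashtags' in Pig ~ tweet['entities']['hashtags'] here.
tags = [h["text"] for h in tweet["entities"]["hashtags"]]

# $0#'user'#'id' ~ tweet['user']['id'].
user_id = tweet["user"]["id"]
print(tags, user_id)  # ['pig'] 42
```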
Tweet metadata

In the remainder of the chapter, we will use metadata from the JSON dataset to model the tweet stream. One example of metadata attached to a tweet is the Place object, which contains geographical information about the user's location. Place contains fields that describe its name, ID, country, country code, and more. A full description can be found at https://dev.twitter.com/docs/platform-objects/places.
place = FOREACH tweets GENERATE (chararray)$0#'place' AS place;
Entities provide structured data from tweets, such as URLs, hashtags, and mentions, without our having to extract them from the text. A description of entities can be found at https://dev.twitter.com/docs/entities. The hashtag entity is an array of tags extracted from a tweet. Each entity has the following two attributes:
Text: the hashtag text
Indices: the character position from which the hashtag was extracted
The following code uses entities:
hashtags_bag = FOREACH tweets {
    GENERATE
        FLATTEN($0#'entities'#'hashtags') AS tag;
}
We then flatten hashtags_bag to extract each hashtag's text:
hashtags = FOREACH hashtags_bag GENERATE tag#'text' AS topic;
Entities for user objects contain information that appears in the user profile and description fields. We can extract the tweet author's ID via the user field in the tweet map:
users = FOREACH tweets GENERATE $0#'user'#'id' AS id;
Data preparation

The SAMPLE built-in operator selects tuples out of the dataset, each with probability p, as follows:
sampled = SAMPLE tweets 0.01;
The preceding command will select approximately 1 percent of the dataset. Given that SAMPLE is probabilistic (http://en.wikipedia.org/wiki/Bernoulli_sampling), there is no guarantee that the sample size will be exact. Moreover, the function samples with replacement, which means that each item might appear more than once.
Apache DataFu implements a number of sampling methods for cases where an exact sample size and no replacement is desired (SimpleRandomSample), for sampling with replacement (SimpleRandomSampleWithReplacementVote and SimpleRandomSampleWithReplacementElect), for when we want to account for sample bias (WeightedRandomSampling), or to sample across multiple relations (SampleByKey).
We can create a sample of exactly 1 percent of the dataset, with each item having the same probability of being selected, using SimpleRandomSample.
Note

The actual guarantee is a sample of size ceil(p*n) with a probability of at least 99 percent.
First, we pass a sampling probability of 0.01 to the UDF constructor:
DEFINE SRS datafu.pig.sampling.SimpleRandomSample('0.01');
and then apply it to the bag to be sampled, created with (GROUP tweets ALL):
sampled = FOREACH (GROUP tweets ALL) GENERATE FLATTEN(SRS(tweets));
The SimpleRandomSample UDF selects without replacement, which means that each item will appear only once.
Note

Which sampling method to use depends on the data we are working with, on assumptions about how items are distributed, on the size of the dataset, and on what we practically want to achieve. In general, when we want to explore a dataset to formulate hypotheses, SimpleRandomSample can be a good choice. However, in several analytics applications, it is common to use methods that assume replacement (for example, bootstrapping).
Note that when working with very large datasets, sampling with replacement and sampling without replacement tend to behave similarly: the probability of an item being selected twice out of a population of billions of items will be low.
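The contrast between the two sampling styles can be sketched in Python on made-up data: SAMPLE-style probabilistic selection produces a sample whose size varies, while an exact-size simple random sample does not.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible
population = list(range(10000))
p = 0.01

# SAMPLE 0.01 style: each tuple kept independently with probability p,
# so the sample size is only approximately p * n.
bernoulli = [x for x in population if random.random() < p]

# SimpleRandomSample style: exactly 100 items, without replacement.
exact = random.sample(population, k=100)

print(len(bernoulli), len(exact))  # roughly 100, exactly 100
```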
Top n statistics

One of the first questions we might want to ask is how frequent certain things are. For instance, we might want to create a histogram of the top 10 topics by the number of mentions. Similarly, we might want to find the top 50 countries or the top 10 users. Before looking at the tweet data, we will define a macro so that we can apply the same selection logic to different collections of items:

DEFINE top_n(rel, col, n)
RETURNS top_n_items {
    grpd = GROUP $rel BY $col;
    cnt_items = FOREACH grpd
        GENERATE FLATTEN(group), COUNT($rel) AS cnt;
    cnt_items_sorted = ORDER cnt_items BY cnt DESC;
    $top_n_items = LIMIT cnt_items_sorted $n;
};
The top_n macro takes as parameters a relation rel, the column col we want to count, and the number of items to return, n. In the Pig Latin block, we first group rel by the items in col, count the number of occurrences of each item, sort them, and select the most frequent n.
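The same group-count-sort-limit logic can be sketched in Python using the standard library; this is a hypothetical analogue of the macro, not how Pig executes it.

```python
from collections import Counter

# Group the items, count occurrences, and keep the n most frequent:
# the same logic as the top_n Pig macro.
def top_n(items, n):
    return Counter(items).most_common(n)

hashtags = ["bigdata", "hadoop", "pig", "hadoop", "bigdata", "hadoop"]
print(top_n(hashtags, 2))  # [('hadoop', 3), ('bigdata', 2)]
```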
To find the top 10 English hashtags, we filter tweets by language and extract the hashtag text:
tweets_en = FILTER tweets BY $0#'lang' == 'en';

hashtags_bag = FOREACH tweets_en {
    GENERATE
        FLATTEN($0#'entities'#'hashtags') AS tag;
}

hashtags = FOREACH hashtags_bag GENERATE tag#'text' AS tag;

Then, we apply the top_n macro:

top_10_hashtags = top_n(hashtags, tag, 10);
In order to better characterize what is trending and make this information more relevant to users, we can drill down into the dataset and look at hashtags per geographic location.

First, we generate a bag of (place, hashtag) tuples, as follows:
hashtags_country_bag = FOREACH tweets {
    GENERATE
        $0#'place' AS place,
        FLATTEN($0#'entities'#'hashtags') AS tag;
}

Then, we extract the country code and hashtag text, as follows:

hashtags_country = FOREACH hashtags_country_bag {
    GENERATE
        place#'country_code' AS co,
        tag#'text' AS tag;
}
Then, we count how many times each country code and hashtag appear together, as follows:

hashtags_country_frequency = FOREACH (GROUP hashtags_country BY (co, tag)) {
    GENERATE
        FLATTEN(group) AS (co, tag),
        COUNT(hashtags_country) AS cnt;
}
Finally, we select the top 10 countries per hashtag with the TOP function, as follows:

hashtags_country_regrouped = GROUP hashtags_country_frequency BY tag;

top_results = FOREACH hashtags_country_regrouped {
    result = TOP(10, 2, hashtags_country_frequency);
    GENERATE FLATTEN(result);
}
TOP's parameters are the number of tuples to return, the index of the column to compare, and the relation containing said column:

top_results = FOREACH D {
    result = TOP(10, 1, C);
    GENERATE FLATTEN(result);
}
The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/topn.pig.
Datetime manipulation

The created_at field in the JSON tweets gives us time-stamped information about when the tweet was posted. Unfortunately, its format is not compatible with Pig's built-in datetime type.

Piggybank comes to the rescue with a number of time manipulation UDFs contained in org.apache.pig.piggybank.evaluation.datetime.convert. One of them is CustomFormatToISO, which converts an arbitrarily formatted timestamp into an ISO 8601 datetime string.

In order to access these UDFs, we first need to register the piggybank.jar file, as follows:

REGISTER /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar

To make our code less verbose, we create an alias for the CustomFormatToISO class's fully qualified Java name:

DEFINE CustomFormatToISO
    org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
Knowing how to manipulate timestamps, we can calculate statistics at different time intervals. For instance, we can look at how many tweets are created per hour. Pig has a built-in GetHour function that extracts the hour out of a datetime type. To use it, we first convert the timestamp string to ISO 8601 with CustomFormatToISO and then convert the resulting chararray to datetime using the built-in ToDate function, as follows:

hourly_tweets = FOREACH tweets {
    GENERATE
        GetHour(
            ToDate(
                CustomFormatToISO(
                    $0#'created_at', 'EEE MMMM d HH:mm:ss Z y')
            )
        ) AS hour;
}
Now, it is just a matter of grouping hourly_tweets by hour and then generating a count of tweets per group, as follows:

hourly_tweets_count = FOREACH (GROUP hourly_tweets BY hour) {
    GENERATE FLATTEN(group), COUNT(hourly_tweets);
}
Sessions

DataFu's Sessionize class can help us better capture user activity over time. A session represents the activity of a user within a given period of time. For instance, we can look at each user's tweet stream at intervals of 15 minutes and measure these sessions to determine both network volumes and user activity:

DEFINE Sessionize datafu.pig.sessions.Sessionize('15m');

users_activity = FOREACH tweets {
    GENERATE
        CustomFormatToISO($0#'created_at',
            'EEE MMMM d HH:mm:ss Z y') AS dt,
        (chararray)$0#'user'#'id' AS user_id;
}
users_activity_sessionized = FOREACH
    (GROUP users_activity BY user_id) {
    ordered = ORDER users_activity BY dt;
    GENERATE FLATTEN(Sessionize(ordered))
        AS (dt, user_id, session_id);
}
users_activity simply records the time dt at which a given user_id posted a status update.

Sessionize takes the session timeout and a bag as input. The first field of each tuple in the input bag must be an ISO 8601 timestamp, and the bag must be sorted by this timestamp. Events that are within 15 minutes of each other will belong to the same session.
It returns the input bag with a new field, session_id, that uniquely identifies a session. With this data, we can calculate each session's length and some other statistics. More examples of Sessionize usage can be found at http://datafu.incubator.apache.org/docs/datafu/guide/sessions.html.
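A minimal sketch of what Sessionize computes, not DataFu's actual code: walk one user's time-ordered events and start a new session ID whenever the gap from the previous event exceeds the timeout.

```python
from datetime import datetime, timedelta

# Assign a session id to each timestamp: a new session starts when the gap
# from the previous event is larger than the timeout.
def sessionize(timestamps, timeout=timedelta(minutes=15)):
    session_id, sessions, previous = 0, [], None
    for ts in sorted(timestamps):
        if previous is not None and ts - previous > timeout:
            session_id += 1
        sessions.append((ts, session_id))
        previous = ts
    return sessions

events = [datetime(2014, 8, 27, 13, 0),
          datetime(2014, 8, 27, 13, 10),  # within 15 minutes: same session
          datetime(2014, 8, 27, 14, 0)]   # 50 minutes later: new session
print([sid for _, sid in sessionize(events)])  # [0, 0, 1]
```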
Capturing user interactions

In the remainder of the chapter, we will look at how to capture patterns from user interactions. As a first step in this direction, we will create a dataset suitable to model a social network. This dataset will contain a timestamp, the ID of the tweet, the user who posted the tweet, the user and tweet being replied to, and the hashtags in the tweet.

Twitter considers as a reply (in_reply_to_status_id_str) any message beginning with the @ character. Such tweets are interpreted as a direct message to that person. Placing an @ character anywhere else in the tweet is interpreted as a mention ('entities'#'user_mentions') and not a reply. The difference is that mentions are immediately broadcast to a person's followers, whereas replies are not. Replies are, however, also counted as mentions.

When working with personally identifiable information, it is a good idea to anonymize, if not remove entirely, sensitive data such as IP addresses, names, and user IDs. A commonly used technique involves a hash function that takes as input the data we want to anonymize, concatenated with additional random data called a salt. The following code shows an example of such anonymization:
DEFINE SHA datafu.pig.hash.SHA();

from_to_bag = FOREACH tweets {
    dt = $0#'created_at';
    user_id = (chararray)$0#'user'#'id';
    tweet_id = (chararray)$0#'id_str';
    reply_to_tweet = (chararray)$0#'in_reply_to_status_id_str';
    reply_to = (chararray)$0#'in_reply_to_user_id_str';
    place = $0#'place';
    topics = $0#'entities'#'hashtags';
    GENERATE
        CustomFormatToISO(dt, 'EEE MMMM d HH:mm:ss Z y') AS dt,
        SHA((chararray)CONCAT('SALT', user_id)) AS source,
        SHA((chararray)CONCAT('SALT', tweet_id)) AS tweet_id,
        ((reply_to_tweet IS NULL)
            ? NULL
            : SHA((chararray)CONCAT('SALT', reply_to_tweet)))
            AS reply_to_tweet_id,
        ((reply_to IS NULL)
            ? NULL
            : SHA((chararray)CONCAT('SALT', reply_to)))
            AS destination,
        (chararray)place#'country_code' AS country,
        FLATTEN(topics) AS topic;
}
-- extract the hashtag text
from_to = FOREACH from_to_bag {
    GENERATE
        dt,
        tweet_id,
        reply_to_tweet_id,
        source,
        destination,
        country,
        (chararray)topic#'text' AS topic;
}
In this example, we use CONCAT to append a (not so random) salt string to personal data. We then generate a hash of the salted IDs with DataFu's SHA function. The SHA function requires its input parameters to be non-null, and we enforce this condition using if-then-else statements; in Pig Latin, this is expressed as &lt;condition&gt; ? &lt;true branch&gt; : &lt;false branch&gt;. If the string is null, we return NULL; if not, we return the salted hash. To make the code more readable, we use aliases for the tweet JSON fields and reference them in the GENERATE block.
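The same salt-then-hash idiom can be sketched in Python with hashlib (SHA-256 is used here as an assumption about DataFu's default). As in the Pig code, the hard-coded 'SALT' is a placeholder; in real use, the salt should be random and kept secret, or the scheme offers little protection against dictionary attacks.

```python
import hashlib

SALT = "SALT"  # placeholder salt, mirroring the Pig example above

# Hash a salted value, passing None through unchanged (the equivalent of the
# Pig null check before calling SHA).
def anonymize(value):
    if value is None:
        return None
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

print(anonymize(None))     # None
print(anonymize("12345"))  # deterministic, so anonymized IDs stay joinable
```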
Link analysis

We can redefine our approach to determining trending topics to include users' reactions. A first, naïve approach could be to consider a topic important if it caused a number of replies larger than a threshold value.

A problem with this approach is that tweets generate relatively few replies, so the volume of the resulting dataset will be low. Hence, it requires a very large amount of data to contain tweets being replied to and produce any result. In practice, we would likely want to combine this metric with other ones (for example, mentions) in order to perform more meaningful analyses.

To satisfy this query, we will create a new dataset that includes the hashtags extracted from both the tweet and the one a user is replying to:
tweet_hashtag = FOREACH from_to GENERATE tweet_id, topic;

from_to_self_joined = JOIN from_to BY reply_to_tweet_id LEFT,
    tweet_hashtag BY tweet_id;

twitter_graph = FOREACH from_to_self_joined {
    GENERATE
        from_to::dt AS dt,
        from_to::tweet_id AS tweet_id,
        from_to::reply_to_tweet_id AS reply_to_tweet_id,
        from_to::source AS source,
        from_to::destination AS destination,
        from_to::topic AS topic,
        from_to::country AS country,
        tweet_hashtag::topic AS topic_replied;
}
Note that Pig does not allow a relation to be joined with itself, hence we have to create tweet_hashtag for the right-hand side of the join. Here, we use the :: operator to disambiguate which relation and column we want to select records from.
Once again, we can look for the top 10 topics by number of replies using the top_n macro:

top_10_topics = top_n(twitter_graph, topic_replied, 10);

Counting things will only take us so far. We can compute more descriptive statistics on this dataset with DataFu. Using the Quantile function, we can calculate the median, the 90th, 95th, and 99th percentiles of the number of hashtag reactions, as follows:

DEFINE Quantile datafu.pig.stats.Quantile('0.5', '0.90', '0.95', '0.99');
Since the UDF expects an ordered bag of integer values as input, we first count the frequency of each topic_replied entry, as follows:

topics_with_replies_grpd = GROUP twitter_graph BY topic_replied;

topics_with_replies_cnt = FOREACH topics_with_replies_grpd {
    GENERATE
        COUNT(twitter_graph) AS cnt;
}
Then, we apply Quantile to the bag of frequencies, as follows:

quantiles = FOREACH (GROUP topics_with_replies_cnt ALL) {
    sorted = ORDER topics_with_replies_cnt BY cnt;
    GENERATE Quantile(sorted);
}
The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/graph.pig.
Influential users

We will use PageRank, an algorithm developed by Google to rank web pages (http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf), to identify influential users in the Twitter graph we generated in the previous section.

This type of analysis has a number of use cases, such as targeted and contextual advertisement, recommendation systems, spam detection, and obviously measuring the importance of web pages. A similar approach, used by Twitter to implement the Who to Follow feature, is described in the research paper WTF: The Who to Follow Service at Twitter, found at http://stanford.edu/~rezab/papers/wtf_overview.pdf.
Informally, PageRank determines the importance of a page based on the importance of the pages linking to it and assigns it a score between 0 and 1. A high PageRank score indicates that a lot of pages point to that page; intuitively, being linked to by pages with a high PageRank is a quality endorsement. In terms of the Twitter graph, we assume that users receiving a lot of replies are important or influential within the social network. In Twitter's case, we consider an extended definition of PageRank, where a link between two users is given by a direct reply and labeled by any hashtags present in the message. Heuristically, we want to identify influential users on a given topic.

In DataFu's implementation, each graph is represented as a bag of (source, edges) tuples. source is an integer ID representing the source node. edges is a bag of (destination, weight) tuples, where destination is an integer ID representing the destination node and weight is a double representing how much the edge should be weighted. The output of the UDF is a bag of (source, rank) pairs, where rank is the PageRank value for the source user in the graph. Notice that we have talked about nodes, edges, and graphs as abstract concepts. In Google's case, nodes are web pages, edges are links from one page to another, and graphs are groups of pages connected directly and indirectly.
In our case, nodes represent users, edges represent in_reply_to_user_id_str mentions, and edges are labeled by the hashtags in tweets. The output of PageRank should suggest which users are influential on a given topic, given their interaction patterns.
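To make the abstract description concrete, here is a compact power-iteration PageRank sketch. The input layout mirrors DataFu's: every source node maps to a bag of (destination, weight) edges. This is an illustrative sketch with a toy graph, not DataFu's implementation; rank mass reaching dangling nodes is simply dropped here for brevity, whereas DataFu can redistribute it.

```python
# edges: {source: [(destination, weight), ...]}
def pagerank(edges, damping=0.85, iterations=50):
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(dst for dst, _ in targets)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # every node gets the teleport share, then weighted contributions
        # from each of its in-neighbours
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in edges.items():
            total_weight = sum(w for _, w in targets)
            for dst, w in targets:
                new_rank[dst] += damping * rank[src] * w / total_weight
        rank = new_rank
    return rank

# Toy reply graph: users 1 and 2 both reply to user 3, so user 3 should
# come out as the most influential node.
ranks = pagerank({1: [(3, 1.0)], 2: [(3, 1.0)]})
print(ranks)
```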
In this section, we will write a pipeline to:

- Represent data as a graph where each node is a user and a hashtag labels the edge
- Map IDs and hashtags to integers so that they can be consumed by PageRank
- Apply PageRank
- Store the results into HDFS in an interoperable format (Avro)

We represent the graph as a bag of tuples in the form (source, destination, topic), where each tuple represents an interaction between nodes. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/pagerank.pig.

We will map users' and hashtags' text to numerical IDs. We use the Java String hashCode() method to perform this conversion step and wrap the logic in an Eval UDF.
Note: The size of an integer is effectively the upper bound for the number of nodes and edges in the graph. For production code, it is recommended that you use a more robust hash function.

The StringToInt class takes a string as input, calls the hashCode() method, and returns the method output to Pig. The UDF code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/udf/com/learninghadoop2/pig/udf/StringToInt.java.
package com.learninghadoop2.pig.udf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class StringToInt extends EvalFunc<Integer> {
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.hashCode();
        } catch (Exception e) {
            throw new IOException("Cannot convert String to Int", e);
        }
    }
}
We extend org.apache.pig.EvalFunc and override the exec method to return str.hashCode() on the function input. The EvalFunc&lt;Integer&gt; class is parameterized with the return type of the UDF (Integer).
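What the UDF computes can be sketched in Python: Java's String.hashCode() accumulates h = 31 * h + char left to right and wraps the result to a signed 32-bit integer, which is why the resulting IDs can collide and can be negative.

```python
# Reimplementation of Java's String.hashCode() for illustration.
def java_string_hashcode(s):
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # reinterpret the 32-bit result as signed, as Java does
    return h - 0x100000000 if h >= 0x80000000 else h

print(java_string_hashcode("abc"))  # 96354, the same value Java returns
```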
Next, we compile the class and archive it into a JAR, as follows:

$ javac -classpath /opt/cloudera/parcels/CDH/lib/pig/pig.jar:$(hadoop classpath) \
    com/learninghadoop2/pig/udf/StringToInt.java
$ jar cvf myudfs-pig.jar com/learninghadoop2/pig/udf/StringToInt.class

We can now register the UDF in Pig and create an alias to StringToInt, as follows:

REGISTER myudfs-pig.jar
DEFINE StringToInt com.learninghadoop2.pig.udf.StringToInt();
We filter out tweets with no destination or no topic, as follows:

tweets_graph_filtered = FILTER twitter_graph BY
    (destination IS NOT NULL) AND
    (topic IS NOT NULL);
Then, we convert the source, destination, and topic to integer IDs:

from_to = FOREACH tweets_graph_filtered {
    GENERATE
        StringToInt(source) AS source_id,
        StringToInt(destination) AS destination_id,
        StringToInt(topic) AS topic_id;
}
Once data is in the appropriate format, we can reuse the implementation of PageRank and the example code (found at https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/linkanalysis/PageRank.java) provided by DataFu, as shown in the following code:

DEFINE PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes', 'true');

We begin by grouping the (source_id, destination_id, topic_id) tuples, as follows:

reply_to = GROUP from_to BY (source_id, destination_id, topic_id);
We count the occurrences of each tuple, that is, how many times two people talked about a topic, as follows:

topic_edges = FOREACH reply_to {
    GENERATE FLATTEN(group), ((double)COUNT(from_to.topic_id)) AS w;
}
Remember that the topic labels the edges of our graph; we begin by creating an association between the source node and the topic, as follows:

topic_edges_grouped = GROUP topic_edges BY (topic_id, source_id);
Then, we regroup it with the purpose of adding the destination nodes and edge weights, as follows:

topic_edges_grouped = FOREACH topic_edges_grouped {
    GENERATE
        group.topic_id AS topic,
        group.source_id AS source,
        topic_edges.(destination_id, w) AS edges;
}
Once we have created the Twitter graph, we calculate the PageRank of all users (source_id):

topic_rank = FOREACH (GROUP topic_edges_grouped BY topic) {
    GENERATE
        group AS topic,
        FLATTEN(PageRank(topic_edges_grouped.(source, edges)))
            AS (source, rank);
}

topic_rank = FOREACH topic_rank GENERATE topic, source, rank;
We store the result in HDFS in Avro format. If the Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR file to our environment before accessing individual fields. Within Pig, for example, on the Cloudera CDH5 VM:

REGISTER /opt/cloudera/parcels/CDH/lib/avro/avro.jar
REGISTER /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar

STORE topic_rank INTO 'replies-pagerank' USING AvroStorage();
Note: In these last two sections, we made a number of implicit assumptions about what a Twitter graph might look like and what the concepts of topic and user interaction mean. Given the constraints that we posed, the resulting social network we analyzed is relatively small and not necessarily representative of the entire Twitter social network. Extrapolating results from this dataset is discouraged. In practice, there are many other factors that should be taken into account to generate a robust model of social interaction.
Summary

In this chapter, we introduced Apache Pig, a platform for large-scale data analysis on Hadoop. In particular, we covered the following topics:

- The goals of Pig as a way of providing a dataflow-like abstraction that does not require hands-on MapReduce development
- How Pig's approach to processing data compares to SQL, where Pig is procedural while SQL is declarative
- Getting started with Pig, which is an easy task, as it is a library that generates custom code and doesn't require additional services
- An overview of the data types, core functions, and extension mechanisms provided by Pig
- Examples of applying Pig to analyze the Twitter dataset in detail, which demonstrated its ability to express complex concepts in a very concise fashion
- How libraries such as Piggybank, Elephant Bird, and DataFu provide repositories of numerous useful prewritten Pig functions

In the next chapter, we will revisit the SQL comparison by exploring tools that expose a SQL-like abstraction over data stored in HDFS.
Chapter 7. Hadoop and SQL

MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. As discussed in earlier chapters, however, it does require a different mindset and some training and experience in the model of breaking analytics processing into a series of map and reduce steps. There are several products built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This chapter will explore the most common of the other abstractions implemented atop Hadoop: SQL.
In this chapter, we will cover the following topics:

- What the use cases for SQL on Hadoop are and why it is so popular
- HiveQL, the SQL dialect introduced by Apache Hive
- Using HiveQL to perform SQL-like analysis of the Twitter dataset
- How HiveQL can approximate common features of relational databases, such as joins and views
- How HiveQL allows the incorporation of user-defined functions into its queries
- How SQL on Hadoop complements Pig
- Other SQL-on-Hadoop products, such as Impala, and how they differ from Hive
Why SQL on Hadoop

So far, we have seen how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL.

Back in 2008, Facebook released Hive, the first widely used implementation of SQL on Hadoop.

Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user.

This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who is familiar with SQL can use Hive.
The combination of these attributes means that HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored in HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of user-defined functions, enabling the base SQL dialect to be customized with business-specific functionality.
Other SQL-on-Hadoop solutions

Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. Later in this chapter, we will also discuss Impala, released in 2013 and already a very popular tool, particularly for low-latency queries. There are others, but we will mostly discuss Hive and Impala, as they have been the most successful.

While introducing the core features and capabilities of SQL on Hadoop, however, we will give examples using Hive. Even though Hive and Impala share many SQL features, they also have numerous differences, and we don't want to constantly caveat each new feature with exactly how it is supported in Hive compared to Impala. We will generally look at aspects of the feature set that are common to both, but if you use both products, it is important to read the latest release notes to understand the differences.
Prerequisites

Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this chapter. We'll create a modified version of an earlier Pig script as the main functionality for this. The script in this chapter assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/blob/master/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows:

-- load JSON data
tweets = load '$inputDir' using
    com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- Tweets
tweets_tsv = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at',
            'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str',
        (chararray)$0#'text' as text,
        (chararray)$0#'in_reply_to',
        (boolean)$0#'retweeted' as is_retweeted,
        (chararray)$0#'user'#'id_str' as user_id,
        (chararray)$0#'place'#'id' as place_id;
}
store tweets_tsv into '$outputDir/tweets'
    using PigStorage('\u0001');

-- Places
needed_fields = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at',
            'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str' as id_str,
        $0#'place' as place;
}
place_fields = foreach needed_fields {
    generate
        (chararray)place#'id' as place_id,
        (chararray)place#'country_code' as co,
        (chararray)place#'country' as country,
        (chararray)place#'name' as place_name,
        (chararray)place#'full_name' as place_full_name,
        (chararray)place#'place_type' as place_type;
}
filtered_places = filter place_fields by co != '';
unique_places = distinct filtered_places;
store unique_places into '$outputDir/places'
    using PigStorage('\u0001');

-- Users
users = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at',
            'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str' as id_str,
        $0#'user' as user;
}
user_fields = foreach users {
    generate
        (chararray)CustomFormatToISO(user#'created_at',
            'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)user#'id_str' as user_id,
        (chararray)user#'location' as user_location,
        (chararray)user#'name' as user_name,
        (chararray)user#'description' as user_description,
        (int)user#'followers_count' as followers_count,
        (int)user#'friends_count' as friends_count,
        (int)user#'favourites_count' as favourites_count,
        (chararray)user#'screen_name' as screen_name,
        (int)user#'listed_count' as listed_count;
}
unique_users = distinct user_fields;
store unique_users into '$outputDir/users'
    using PigStorage('\u0001');
Run this script as follows:

$ pig -f extract_for_hive.pig -param inputDir=&lt;json input&gt; \
    -param outputDir=&lt;output path&gt;
The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to the Unicode value U+0001 (which can also be typed as Ctrl + A). This character is often used as a separator in Hive tables and will be particularly useful to us, as our tweet data could contain tabs in other fields.
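A quick sketch shows why the U+0001 separator matters when field values may themselves contain tabs (the record values below are hypothetical):

```python
# One record in the same layout the Pig script produces: fields joined by
# the U+0001 character.
record = ["2014-08-27T13:08:45.000Z", "42", "a tweet\twith a tab in it"]
line = "\u0001".join(record)

# Splitting on tab would mangle the text field; splitting on U+0001 recovers
# the original fields intact.
print(line.split("\u0001") == record)            # True
print(len(line.split("\t")) == len(record))      # False
```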
Overview of Hive

We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. In this example, and in the remainder of the chapter, we will assume that queries are typed into the shell that can be invoked by executing the hive command.

Recently, a client called Beeline also became available and will likely be the preferred CLI client in the near future.
When importing any new data into Hive, there is generally a three-stage process:

1. Create the specification of the table into which the data is to be imported.
2. Import the data into the created table.
3. Execute HiveQL queries against the table.

Most of the HiveQL statements are direct analogues of similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this chapter, but if you need a refresher, there are numerous good online learning resources.
Hive gives a structured query view of our data; to enable this, we must first define the specification of the table's columns and import the data into the table before we can execute any queries. A table specification is generated using a CREATE statement that specifies the table name, the names and types of its columns, and some metadata about how the table is stored:

CREATE table tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;
The statement creates a new table, tweets, defined by a list of names for the columns in the dataset and their datatypes. We specify that fields are delimited by the Unicode U+0001 character and that the format used to store data is TEXTFILE.

Data can be imported from a location in HDFS (tweets/) using the LOAD DATA statement:

LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets;

By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later.
Once data has been imported into Hive, we can run queries against it. For instance:

SELECT COUNT(*) FROM tweets;

The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase.

If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times.
ThenatureofHivetablesAlthoughHivecopiesthedatafileintoitsworkingdirectory,itdoesnotactuallyprocesstheinputdataintorowsatthatpoint.
BoththeCREATETABLEandLOADDATAstatementsdonottrulycreateconcretetabledataassuch;instead,theyproducethemetadatathatwillbeusedwhenHivegeneratesMapReducejobstoaccessthedataconceptuallystoredinthetablebutactuallyresidingonHDFS.EventhoughtheHiveQLstatementsrefertoaspecifictablestructure,itisHive’sresponsibilitytogeneratecodethatcorrectlymapsthistotheactualon-diskformatinwhichthedatafilesarestored.
This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises.
Hive architecture

Until version 2, Hadoop was primarily a batch system. As we saw in previous chapters, MapReduce jobs tend to have high latency and overhead derived from submission and scheduling. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later.
Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster, either to native MapReduce in Hadoop 1 or to the MapReduce ApplicationMaster running on YARN in Hadoop 2.
Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password.
HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2.
Note

HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2.
Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively.
In the examples we saw before and in the remainder of this chapter, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with its own client, Beeline. For compatibility and maturity reasons, Beeline being relatively new, both tools are available on Cloudera and most other major distributions. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded mode with the following command:
$ beeline -u jdbc:hive2://
Data types

HiveQL supports many of the common data types provided by standard database systems. These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. Hive data types fall into the following five broad categories:
Numeric: tinyint, smallint, int, bigint, float, double, and decimal
Date and time: timestamp and date
String: string, varchar, and char
Collections: array, map, struct, and uniontype
Misc: boolean, binary, and NULL
DDL statements

HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. SHOW [DATABASES, TABLES, VIEWS] displays the databases currently available within a data warehouse and which table and view metadata is present within the database currently in use:
CREATE DATABASE twitter;
SHOW databases;
USE twitter;
SHOW TABLES;
The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception.
Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports unicode characters in column names. Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally.
The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code:
CREATE EXTERNAL TABLE tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/tweets';
This table will be created in the metastore, but the data will not be copied into the /user/hive/warehouse directory.
Tip
Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse.
The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we can create a view to isolate retweets from other messages, as follows:
CREATE VIEW retweets
COMMENT 'Tweets that have been retweeted'
AS SELECT * FROM tweets WHERE retweeted = true;
Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views.
The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected.
Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment or to add and replace columns.
When adding columns, it is important to remember that only metadata will be changed and not the dataset itself. This means that if we were to add a column in the middle of the table which didn't exist in older files, then while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format. We will discuss data and schema migrations in Chapter 8, Data Lifecycle Management, when discussing Avro.
Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view.
File formats and storage

The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH.
Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause.
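As an illustration of what a delimited SerDe does, the following Python sketch (my own illustration, not Hive's implementation) splits one Ctrl-A-delimited text line into the columns of the tweets table used in this chapter:

```python
# Illustrative sketch only: mimic a delimited SerDe splitting one text
# line into column values. The '\u0001' (Ctrl-A) separator matches the
# FIELDS TERMINATED BY clause used for the tweets table in this chapter.
TWEET_COLUMNS = ["created_at", "tweet_id", "text", "in_reply_to",
                 "retweeted", "user_id", "place_id"]

def parse_delimited_row(line, columns=TWEET_COLUMNS, sep="\u0001"):
    """Split a delimited line and pair each value with its column name."""
    values = line.rstrip("\n").split(sep)
    return dict(zip(columns, values))

row = parse_delimited_row(
    "2014-01-01\u0001id1\u0001hello\u0001\u0001false\u0001u1\u0001p1")
```

Note that every field comes back as a string; it is the table's declared column types that tell Hive how to interpret each value.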
Hive currently uses the following FileFormat classes to read and write HDFS files:
TextInputFormat and HiveIgnoreKeyTextOutputFormat: read/write data in plain text file format
SequenceFileInputFormat and SequenceFileOutputFormat: read/write data in the Hadoop SequenceFile format
Additionally, the following SerDe classes can be used to serialize and deserialize data:
MetadataTypedColumnsetSerDe: reads/writes delimited records such as CSV or tab-separated records
ThriftSerDe and DynamicSerDe: read/write Thrift objects
JSON

As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules.
We can use either module to load JSON tweets without any need for preprocessing, just defining a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde.
As with any third-party module, we load the SerDe JAR into Hive with the following code:

ADD JAR json-serde-1.3-jar-with-dependencies.jar;
Then, we issue the usual CREATE statement, as follows:
CREATE EXTERNAL TABLE tweets (
    contributors string,
    coordinates struct<
        coordinates: array<float>,
        type: string>,
    created_at string,
    entities struct<
        hashtags: array<struct<
            indices: array<tinyint>,
            text: string>>,
    …
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'tweets';
With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'.
Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary.
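The core idea behind such schema generation can be sketched in a few lines of Python. This toy version (my own illustration, far simpler than hive-json) walks a single JSON document and emits a Hive-style type for each value:

```python
import json

# Toy sketch of schema inference in the spirit of hive-json: map each
# JSON value to a Hive-style type string. A real tool must merge many
# documents and handle nulls and mixed types; this shows only the idea.
def hive_type(value):
    if isinstance(value, bool):   # check bool first: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        return "array<%s>" % (hive_type(value[0]) if value else "string")
    if isinstance(value, dict):
        fields = ", ".join("%s: %s" % (k, hive_type(v))
                           for k, v in value.items())
        return "struct<%s>" % fields
    return "string"

doc = json.loads('{"created_at": "x", '
                 '"entities": {"hashtags": [{"text": "hi"}]}}')
schema = {name: hive_type(value) for name, value in doc.items()}
```

Running this over one tweet-like document produces nested struct and array types similar to those in the CREATE statement above.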
In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen_name and description fields of a user object with the following code:
SELECT user.screen_name, user.description FROM tweets_json LIMIT 10;
Avro

AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format. Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose.
As an example, let's load into Hive the PageRank dataset we generated in Chapter 6, Data Analysis with Apache Pig. This dataset was created using Pig's AvroStorage class, and has the following schema:
{
    "type": "record",
    "name": "record",
    "fields": [
        {"name": "topic", "type": ["null", "int"]},
        {"name": "source", "type": ["null", "int"]},
        {"name": "rank", "type": ["null", "float"]}
    ]
}
The table structure is captured in an Avro record, which contains header information (a name and an optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string.
For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and it is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use the following schema.
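A quick Python sketch (illustrative only, not the Avro library's API) shows how such a union accepts a value that matches either branch:

```python
# Illustrative sketch, not the Avro library API: a value satisfies a
# union such as ["null", "int"] if it matches any branch. This mirrors
# the nullable-column idiom in the PageRank schema above.
CHECKS = {
    "null": lambda v: v is None,
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "float": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}

def matches_union(value, union):
    """Return True if the value is valid for at least one union branch."""
    return any(CHECKS[branch](value) for branch in union)
```

So None and 42 both satisfy ["null", "int"], while a string satisfies neither branch and is rejected.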
With this definition, we can now create a Hive table that uses this schema for its table specification, as follows:
CREATE EXTERNAL TABLE tweets_pagerank
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{
    "type": "record",
    "name": "record",
    "fields": [
        {"name": "topic", "type": ["null", "int"]},
        {"name": "source", "type": ["null", "int"]},
        {"name": "rank", "type": ["null", "float"]}
    ]
}')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '${data}/ch5-pagerank';
Then, look at the following table definition from within Hive (note that HCatalog, which we'll introduce in Chapter 8, Data Lifecycle Management, also supports such definitions):
DESCRIBE tweets_pagerank;
OK
topic    int      from deserializer
source   int      from deserializer
rank     float    from deserializer
In the DDL, we told Hive that data is stored in Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro schema embedded in avro.schema.literal.
Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. Create the preceding schema in a file called pagerank.avsc (this is the standard file extension for Avro schemas). Then place it on HDFS; we prefer to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property: WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc').
If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. Within Hive, on the Cloudera CDH5 VM:

ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar;
We can also use this table like any other. For instance, we can query the data to select the source and topic pairs with a high PageRank:

SELECT source, topic FROM tweets_pagerank WHERE rank >= 0.9;
In Chapter 8, Data Lifecycle Management, we will see how Avro and avro.schema.url play an instrumental role in enabling schema migrations.
Columnar stores

Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats.
If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in a SequenceFile, each full row and all its columns will be read from disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest.
Traditional relational databases also store data on a row basis, and a type of database called columnar changed this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns will be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries.
Queries

Tables can be queried using the familiar SELECT…FROM statement. The WHERE clause allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records. For instance, the following code returns the top 10 most prolific users in the dataset:
SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt
DESC LIMIT 10;
The output is as follows:
2263949659    4
1332188053    4
959468857     3
1367752118    3
362562944     3
58646041      3
2375296688    3
1468188529    3
37114209      3
2385040940    3
We can improve the readability of the Hive output by setting the following:

SET hive.cli.print.header=true;

This will instruct hive, though not beeline, to print column names as part of the output.
Tip

You can add the command to the .hiverc file, usually found in the root of the executing user's home directory, to have it applied to all hive CLI sessions.
HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into Hive using external tables.
We first create a user table to store user data, as follows:
CREATE EXTERNAL TABLE user (
    created_at string,
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/users';
We then create a place table to store location data, as follows:
CREATE EXTERNAL TABLE place (
    place_id string,
    country_code string,
    country string,
    `name` string,
    full_name string,
    place_type string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/places';
We can use the JOIN operator to display the names of the 10 most prolific users, as follows:
SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt
FROM tweets
JOIN user ON user.user_id = tweets.user_id
GROUP BY tweets.user_id, user.user_id, user.name
ORDER BY cnt DESC LIMIT 10;
Tip

Only equality, outer, and left (semi) joins are supported in Hive.
Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets table.
We can rewrite the previous query as follows:
SELECT tweets.user_id, u.name, COUNT(*) AS cnt
FROM tweets
JOIN (SELECT user_id, name FROM user GROUP BY user_id, name) u
ON u.user_id = tweets.user_id
GROUP BY tweets.user_id, u.name
ORDER BY cnt DESC LIMIT 10;
Instead of directly joining the user table, we execute a subquery, as follows:

SELECT user_id, name FROM user GROUP BY user_id, name;
The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries, historically only permitting a subquery in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause also.
HiveQL is an ever-evolving, rich language, a full exposition of which is beyond the scope of this chapter. A description of its query and DDL capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.
Structuring Hive tables for given workloads

Often Hive isn't used in isolation; instead, tables are created with particular workloads in mind or are invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios.
Partitioning a table

With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning.
When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS.
It’simportanttounderstandthecardinalityofthepartitioncolumn.Withtoofewdistinctvalues,thebenefitsarereducedasthefilesarestillverylarge.Iftherearetoomanyvalues,thenqueriesmightneedalargenumberoffilestobescannedtoaccessalltherequireddata.Perhapsthemostcommonpartitionkeyisonebasedondate.Wecould,forexample,partitionourusertablefromearlierbasedonthecreated_atcolumn,thatis,thedatetheuserwasfirstregistered.Notethatsincepartitioningatablebydefinitionaffectsitsfilestructure,wecreatethistablenowasanon-externalone,asfollows:
CREATE TABLE partitioned_user (
    created_at string,
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
) PARTITIONED BY (created_at_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;
To load data into a partition, we can explicitly give a value for the partition into which to insert the data, as follows:
INSERT INTO TABLE partitioned_user
PARTITION (created_at_date = '2014-01-01')
SELECT
    created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count
FROM user;
This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=5000;
The first two statements enable all partitions (nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node.
We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row:
INSERT INTO TABLE partitioned_user
PARTITION (created_at_date)
SELECT
    created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count,
    to_date(created_at) as created_at_date
FROM user;
Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause.
Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table. In the preceding code, we use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string.
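A rough Python equivalent of that conversion is shown below; the input timestamp format here is an assumption for illustration, while Hive's to_date accepts its own timestamp formats:

```python
from datetime import datetime

# Rough stand-in for Hive's to_date(): keep only the date part of a
# 'yyyy-MM-dd HH:mm:ss' timestamp string. The input format is an
# assumption for illustration, not Hive's parsing logic.
def to_date(timestamp):
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date().isoformat()
```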
Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/<key>=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-01.
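This layout rule can be sketched as a small path-building helper (my own illustration of the convention, not Hive code):

```python
# Sketch of Hive's partition layout convention: one key=value
# subdirectory per partition column under the table directory.
def partition_path(warehouse, database, table, partition_spec):
    """partition_spec is an ordered list of (column, value) pairs."""
    parts = "/".join("%s=%s" % (k, v) for k, v in partition_spec)
    return "/".join([warehouse, database, table, parts])

path = partition_path("/user/hive/warehouse", "default", "partitioned_user",
                      [("created_at_date", "2014-04-01")])
```

With multiple partition columns, each additional key simply nests one directory level deeper.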
If data is added directly to the filesystem, for instance by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition:
ALTER TABLE <table_name> ADD PARTITION (<partition_spec>) LOCATION '<location>';
To add metadata for all partitions not currently present in the metastore, we can use the MSCK REPAIR TABLE <table_name>; statement. On EMR, this is equivalent to executing the following statement:

ALTER TABLE <table_name> RECOVER PARTITIONS;
Notice that both statements will also work with EXTERNAL tables. In the following chapter, we will see how this pattern can be exploited to create flexible and interoperable pipelines.
Overwriting and updating data

Partitioning is also useful when we need to update a portion of a table. Normally, a statement of the following form will replace all the data for the destination table:
INSERT OVERWRITE TABLE <table> …
If OVERWRITE is omitted, then each INSERT statement will add additional data to the table. Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched.
If we perform an INSERT OVERWRITE statement (or a LOAD OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement.
We can also add caveats to the SELECT statement. Say, for example, we only wanted to update data for a certain month:
INSERT INTO TABLE partitioned_user
PARTITION (created_at_date)
SELECT created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count,
    to_date(created_at) as created_at_date
FROM user
WHERE to_date(created_at) BETWEEN '2014-03-01' AND '2014-03-31';
Bucketing and sorting

Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure.
Let’screatebucketedversionsofourtweetsandusertables;notethefollowingadditionalCLUSTERBYandSORTBYstatementsintheCREATETABLEstatements:
CREATE TABLE bucketed_tweets (
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) PARTITIONED BY (created_at string)
CLUSTERED BY (user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;
CREATE TABLE bucketed_user (
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
) PARTITIONED BY (created_at string)
CLUSTERED BY (user_id) SORTED BY (name) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;
Note that we changed the tweets table to also be partitioned; you can only bucket a table that is partitioned.
Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table:
SET hive.enforce.bucketing=true;
Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT…SELECT… syntax to populate the bucketed table.
When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause.
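The allocation rule can be sketched as follows. This is an illustration of the principle only; the Java-String-style hash below is a deterministic stand-in, not Hive's exact hash function:

```python
# Sketch of bucket allocation: bucket = hash(clustered-by value) mod
# number of buckets. The Java-String-style hash is a stand-in for
# illustration; Hive's actual hash function differs.
def string_hash(s):
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0x7FFFFFFF  # keep the value non-negative
    return h

def bucket_for(user_id, num_buckets=64):
    return string_hash(user_id) % num_buckets
```

The important property is that the same key always lands in the same bucket, which is what makes the bucketed join optimization below possible.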
One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example. So, for example, any query of the following form would be vastly improved:
SET hive.optimize.bucketmapjoin=true;
SELECT …
FROM bucketed_user u JOIN bucketed_tweets t
ON u.user_id = t.user_id;
With the join being performed on the column used to bucket the table, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id values in both tables. While determining which rows to match, only those in the bucket need to be compared against, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second.
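This congruence is easy to check numerically. The sketch below (my own illustration, using plain modular arithmetic rather than Hive's hash function) lists which buckets of the larger table can hold keys from a given bucket of the smaller one:

```python
# With a modulo-based bucket function, a key in bucket b of a table with
# `small` buckets can only land in buckets {b, b + small, ...} of a table
# with `large` buckets, provided large is a multiple of small.
def buckets_to_probe(bucket, small, large):
    """Buckets of the large table that can contain keys from the given
    bucket of the small table."""
    return sorted(b for b in range(large) if b % small == bucket)
```

For bucket 3 of a 32-bucket table joined against a 64-bucket table, this yields exactly buckets 3 and 35, matching the example above.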
Sampling data

Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size.
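The scaling caveat amounts to simple arithmetic; a sketch with made-up numbers:

```python
# Illustrative arithmetic for the scaling caveat above: a count taken
# from a 1-of-64 bucket sample estimates the full-table count only
# after multiplying by the inverse of the sampled fraction.
def estimate_total(sample_count, buckets_sampled, total_buckets):
    return sample_count * total_buckets // buckets_sampled

estimate = estimate_total(150, 1, 64)  # 150 rows seen in 1 of 64 buckets
```

This yields only an estimate, of course; its accuracy depends on how evenly the sampled column distributes rows across buckets.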
For a non-bucketed table, you can sample in a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table:
SELECT max(friends_count)
FROM user TABLESAMPLE(BUCKET 2 OUT OF 64 ON name);
In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column. It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function.
Though successful, this is highly inefficient, as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure the number of buckets sampled is equal to or a multiple of the buckets in the table, then Hive will only read the buckets in question. For example:
SELECT MAX(friends_count)
FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);
In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition.
A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as either a percentage of the table, as an absolute data size, or as a number of rows (in each block). The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data, or 100 rows per split, respectively:
TABLESAMPLE(0.5 PERCENT)
TABLESAMPLE(1G)
TABLESAMPLE(100 ROWS)
If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input format and file formats that are supported.
Writing scripts

We can place Hive commands in a file and run them with the -f option in the hive CLI utility:
$ cat show_tables.hql
show tables;
$ hive -f show_tables.hql
We can parameterize HiveQL statements by means of the hiveconf mechanism. This allows us to specify an environment variable name at the point it is used rather than at the point of invocation. For example:
$catshow_tables2.hql
showtableslike'${hiveconf:TABLENAME}';
$hive-hiveconfTABLENAME=user-fshow_tables2.hql
The variable can also be set within the Hive script or an interactive session:
SET TABLENAME='user';
The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows:
$ cat show_tables3.hql
show tables like '${hivevar:TABLENAME}';
$ hive -hivevar TABLENAME=user -f show_tables3.hql
Or we can write the command interactively:
SET hivevar:TABLENAME='user';
Hive and Amazon Web Services
With Elastic MapReduce as the AWS Hadoop-on-demand service, it is of course possible to run Hive on an EMR cluster. But it is also possible to use Amazon storage services, particularly S3, from any Hadoop cluster, be it within EMR or your own local cluster.
Hive and S3
As mentioned in Chapter 2, Storage, it is possible to specify a default filesystem other than HDFS for Hadoop, and S3 is one option. But it doesn't have to be an all-or-nothing thing; it is possible to have specific tables stored in S3. The data for these tables will be retrieved into the cluster to be processed, and any resulting data can either be written to a different S3 location (the same table cannot be the source and destination of a single query) or onto HDFS.
We can take a file of our tweet data and place it onto a location in S3 with a command such as the following:
$ aws s3 cp tweets.tsv s3://<bucket-name>/tweets/
We firstly need to specify the access key and secret access key that can access the bucket. This can be done in three ways:
- Set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey to the appropriate values in the Hive CLI
- Set the same values in hive-site.xml, though note this limits use of S3 to a single set of credentials
- Specify the table location explicitly in the table URL, that is, s3n://<access key>:<secret access key>@<bucket>/<path>
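As an illustration of the second option, a hive-site.xml fragment might look like the following sketch; the key values are placeholders, and the exact file location depends on your installation:

```xml
<!-- hive-site.xml: one set of S3 credentials shared by all sessions.
     The values below are placeholders, not real credentials. -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

The per-session equivalent of the first option is to issue SET fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY; and the corresponding SET for the secret key in the Hive CLI.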
Then we can create a table referencing this data, as follows:
CREATE TABLE remote_tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) CLUSTERED BY (user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3n://<bucket-name>/tweets';
This can be an incredibly effective way of pulling S3 data into a local Hadoop cluster for processing.
Note
In order to use AWS credentials in the URI of an S3 location, regardless of how the parameters are passed, the secret and access keys must not contain /, +, =, or \ characters. If necessary, a new set of credentials can be generated from the IAM console at https://console.aws.amazon.com/iam/.
In theory, you can just leave the data in the external table and refer to it when needed, even though it often makes sense to pull the data into a local table and do future processing from there, avoiding WAN data transfer latencies (and costs). If the table is partitioned, then you might find yourself retrieving a new partition each day, for example.
Hive on Elastic MapReduce
On one level, using Hive within Amazon Elastic MapReduce is just the same as everything discussed in this chapter. You can create a persistent cluster, log into the master node, and use the Hive CLI to create tables and submit queries. Doing all this will use the local storage on the EC2 instances for the table data.
Not surprisingly, jobs on EMR clusters can also refer to tables whose data is stored on S3 (or DynamoDB). And also not surprisingly, Amazon has made extensions to its version of Hive to make all this very seamless. It is quite simple from within an EMR job to pull data from a table stored in S3, process it, write any intermediate data to the EMR local storage, and then write the output results into S3, DynamoDB, or one of a growing list of other AWS services.
The pattern mentioned earlier, where new data is added to a new partition directory for a table each day, has proved very effective in S3; it is often the storage location of choice for large and incrementally growing datasets. There is a syntax difference when using EMR; instead of the MSCK command mentioned earlier, the command to update a Hive table with new data added to a partition directory is as follows:
ALTER TABLE <table-name> RECOVER PARTITIONS;
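For reference, the MSCK command mentioned earlier, which performs the equivalent partition discovery on stock Apache Hive, takes the following form:

```sql
MSCK REPAIR TABLE <table-name>;
```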
Consult the EMR documentation for the latest enhancements at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html. Also, consult the broader EMR documentation; in particular, the integration points with other AWS services are an area of rapid growth.
Extending HiveQL
The HiveQL language can be extended by means of plugins and third-party functions. In Hive, there are three types of functions, characterized by the number of rows they take as input and produce as output:
- User Defined Functions (UDFs): simple functions that act on one row at a time.
- User Defined Aggregate Functions (UDAFs): take multiple rows as input and generate a single row as output. These are aggregate functions to be used in conjunction with a GROUP BY statement (similar to COUNT(), AVG(), MIN(), MAX(), and so on).
- User Defined Table Functions (UDTFs): take one row as input and generate a logical table comprised of multiple rows that can be used in join expressions.
Tip
These APIs are provided only in Java. For other languages, it is possible to stream data through a user-defined script using the TRANSFORM, MAP, and REDUCE clauses that act as a frontend to Hadoop's streaming capabilities.
Two APIs are available to write UDFs. A simple API, org.apache.hadoop.hive.ql.exec.UDF, can be used for functions that take and return basic writable types. A richer API, which provides support for data types other than writables, is available in the org.apache.hadoop.hive.ql.udf.generic.GenericUDF package. We'll now illustrate how org.apache.hadoop.hive.ql.exec.UDF can be used to implement a string-to-ID function similar to the one we used in Chapter 6, Data Analysis with Apache Pig, to map hashtags to integers. Building a UDF with this API only requires extending the UDF class and writing an evaluate() method, as follows:
package com.learninghadoop2.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class StringToInt extends UDF {
    public Integer evaluate(Text input) {
        if (input == null)
            return null;
        String str = input.toString();
        return str.hashCode();
    }
}
The function takes a Text object as input and maps it to an integer value with the hashCode() method. The source code of this function can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/udf/com/learninghadoop2/hive/udf/StringToInt.java.
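Stripped of its Hive dependencies, the core logic is easy to exercise on its own. The following sketch uses a hypothetical stringToInt helper that mirrors evaluate():

```java
// Sketch of the UDF's core logic without Hive dependencies.
// stringToInt is a hypothetical stand-in for evaluate(): same null
// handling, same reliance on String.hashCode().
public class StringToIntDemo {
    static Integer stringToInt(String input) {
        if (input == null) {
            return null;             // a NULL column value maps to NULL
        }
        return input.hashCode();     // deterministic, but not collision-free
    }

    public static void main(String[] args) {
        // The mapping is stable across calls, which is what makes it
        // usable as a lookup-table key; collisions remain possible.
        System.out.println(stringToInt("#hadoop").equals(stringToInt("#hadoop")));
        System.out.println(stringToInt(null) == null);
    }
}
```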
Tip
As noted in Chapter 6, Data Analysis with Apache Pig, a more robust hash function should be used in production.
We compile the class and archive it into a JAR file, as follows:
$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* \
    com/learninghadoop2/hive/udf/StringToInt.java
$ jar cvf myudfs-hive.jar com/learninghadoop2/hive/udf/StringToInt.class
Before being able to use it, a UDF must be registered in Hive with the following commands:
ADD JAR myudfs-hive.jar;
CREATE TEMPORARY FUNCTION string_to_int AS
'com.learninghadoop2.hive.udf.StringToInt';
The ADD JAR statement adds a JAR file to the distributed cache. The CREATE TEMPORARY FUNCTION <function> AS <class> statement registers a function in Hive that implements a given Java class. The function will be dropped once the Hive session is closed. As of Hive 0.13, it is possible to create permanent functions, whose definitions are kept in the metastore, using CREATE FUNCTION ….
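A sketch of the permanent form, assuming Hive 0.13 or later and that the JAR has been copied to a location accessible to the whole cluster (the HDFS path below is a placeholder):

```sql
CREATE FUNCTION string_to_int
AS 'com.learninghadoop2.hive.udf.StringToInt'
USING JAR 'hdfs:///path/to/myudfs-hive.jar';
```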
Once registered, string_to_int can be used in a query just like any other function. In the following example, we first extract a list of hashtags from the tweet's text by applying regexp_extract. Then, we use string_to_int to map each tag to a numerical ID:
SELECT unique_hashtags.hashtag,
       string_to_int(unique_hashtags.hashtag) AS tag_id
FROM (
    SELECT regexp_extract(text,
        '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag
    FROM tweets
    GROUP BY regexp_extract(text,
        '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags
GROUP BY unique_hashtags.hashtag,
         string_to_int(unique_hashtags.hashtag);
Just as we did in the previous chapter, we can use the preceding query to create a lookup table:
CREATE TABLE lookuptable (tag string, tag_id bigint);
INSERT OVERWRITE TABLE lookuptable
SELECT unique_hashtags.hashtag,
       string_to_int(unique_hashtags.hashtag) AS tag_id
FROM (
    SELECT regexp_extract(text,
        '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag
    FROM tweets
    GROUP BY regexp_extract(text,
        '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags
GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag);
Programmatic interfaces
In addition to the hive and beeline command-line tools, it is possible to submit HiveQL queries to the system via the JDBC and Thrift programmatic interfaces. Support for ODBC was bundled in older versions of Hive, but as of Hive 0.12, it needs to be built from scratch. More information on this process can be found at https://cwiki.apache.org/confluence/display/Hive/HiveODBC.
JDBC
A Hive client written using JDBC APIs looks exactly the same as a client program written for other database systems (for example, MySQL). The following is a sample Hive client program using JDBC APIs. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/clients/com/learninghadoop2/hive/client/HiveJdbcClient.java.
package com.learninghadoop2.hive.client;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";
    // Connection string
    public static String URL = "jdbc:hive2://localhost:10000";
    // Show all tables in the default database
    public static String QUERY = "show tables";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection(URL);
        Statement stmt = con.createStatement();
        ResultSet resultSet = stmt.executeQuery(QUERY);
        while (resultSet.next()) {
            System.out.println(resultSet.getString(1));
        }
    }
}
The URL part is the JDBC URI that describes the connection endpoint. The format for establishing a remote connection is jdbc:hive2://<host>:<port>/<database>. Connections in embedded mode can be established by not specifying a host or port, as in jdbc:hive2://.
hive and hive2 are the drivers to be used when connecting to HiveServer and HiveServer2, respectively. QUERY contains the HiveQL query to be executed.
TipHive’sJDBCinterfaceexposesonlythedefaultdatabase.Inordertoaccessotherdatabases,youneedtoreferencethemexplicitlyintheunderlyingqueriesusingthe<database>.<table>notation.
FirstweloadtheHiveServer2JDBCdriverorg.apache.hive.jdbc.HiveDriver.
Tip
Use org.apache.hadoop.hive.jdbc.HiveDriver to connect to HiveServer.
Then, as with any other JDBC program, we establish a connection to URL and use it to instantiate a Statement object. We execute QUERY, with no authentication, and store the output dataset in the ResultSet object. Finally, we scan resultSet and print its contents to the command line.
Compile and execute the example with the following commands:
$ javac HiveJdbcClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar: \
    com.learninghadoop2.hive.client.HiveJdbcClient
Thrift
Thrift provides lower-level access to Hive and has a number of advantages over the JDBC implementation of HiveServer. Primarily, it allows multiple connections from the same client, and it allows programming languages other than Java to be used with ease. With HiveServer2, it is a less commonly used option, but still worth mentioning for compatibility. A sample Thrift client implemented using the Java API can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/clients/com/learninghadoop2/hive/client/HiveThriftClient.java. This client can be used to connect to HiveServer, but due to protocol differences, the client won't work with HiveServer2.
In the example, we define a getClient() method that takes as input the host and port of a HiveServer service and returns an instance of org.apache.hadoop.hive.service.ThriftHive.Client.
A client is obtained by first instantiating a socket connection, org.apache.thrift.transport.TSocket, to the HiveServer service, and by specifying a protocol, org.apache.thrift.protocol.TBinaryProtocol, to serialize and transmit data, as follows:
TSocket transport = new TSocket(host, port);
transport.setTimeout(TIMEOUT);
transport.open();
TBinaryProtocol protocol = new TBinaryProtocol(transport);
client = new ThriftHive.Client(protocol);
We call getClient() from the main method and use the client to execute a query against an instance of HiveServer running on localhost on port 11111, as follows:
public static void main(String[] args) throws Exception {
    Client client = getClient("localhost", 11111);
    client.execute("show tables");
    List<String> results = client.fetchAll();
    for (String result : results) {
        System.out.println(result);
    }
}
Make sure that HiveServer is running on port 11111, and if not, start an instance with the following command:
$ sudo hive --service hiveserver -p 11111
Compile and execute the HiveThriftClient.java example with:
$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* \
    com/learninghadoop2/hive/client/HiveThriftClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*: \
    com.learninghadoop2.hive.client.HiveThriftClient
Stinger initiative
Hive has remained very successful and capable since its earliest releases, particularly in its ability to provide SQL-like processing on enormous datasets. But other technologies did not stand still, and Hive acquired a reputation of being relatively slow, particularly in regard to lengthy startup times on large jobs and its inability to give quick responses to conceptually simple queries.
These perceived limitations were less due to Hive itself and more a consequence of how the translation of SQL queries into the MapReduce model has much built-in inefficiency when compared to other ways of implementing a SQL query. Particularly in regard to very large datasets, MapReduce saw lots of I/O (and consequently time) spent writing out the results of one MapReduce job just to have them read by another. As discussed in Chapter 3, Processing – MapReduce and Beyond, this is a major driver in the design of Tez, which can schedule jobs on a Hadoop cluster as a graph of tasks that does not require inefficient writes and reads between them.
The following is a query we will use to compare execution on the MapReduce framework versus Tez:
SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;
The following figure contrasts the execution plan for the preceding query on the MapReduce framework versus Tez:
Hive on MapReduce versus Tez
In plain MapReduce, two jobs are created for the GROUP BY and JOIN clauses. The first job is composed of a set of MapReduce tasks that read data from the disk to carry out grouping. The reducers write intermediate results to the disk so that output can be synchronized. The mappers in the second job read the intermediate results from the disk, as well as data from table b. The combined dataset is then passed to the reducers, where shared keys are joined. Were we to execute an ORDER BY statement, this would have resulted in a third job and further MapReduce passes. The same query is executed on Tez as a single job by a single set of map tasks that read data from the disk; grouping and joining are pipelined across the reducers without intermediate I/O.
Alongside these architectural limitations, there were quite a few areas around SQL language support that could also provide better efficiency, and in early 2013, the Stinger initiative was launched with an explicit goal of making Hive over 100 times as fast and with much richer SQL support. Hive 0.13 has all the features of the three phases of Stinger, resulting in a much more complete SQL dialect. Also, Tez is offered as an execution framework in addition to a MapReduce-based implementation atop YARN, which is more efficient than previous implementations on Hadoop 1 MapReduce.
With Tez as the execution engine, Hive is no longer limited to a series of linear MapReduce jobs and can instead build a processing graph where any given step can, for example, stream results to multiple sub-steps.
To take advantage of the Tez framework, there is a new hive variable setting:
set hive.execution.engine=tez;
This setting relies on Tez being installed on the cluster; it is available in source form from http://tez.apache.org or in several distributions, though at the time of writing, not Cloudera's.
The alternative value is mr, which uses the classic MapReduce model (atop YARN), so it is possible in a single installation to compare the performance of Hive with and without Tez.
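A session sketch of such a comparison, reusing the join query from earlier (this assumes Tez is installed and that the place and tweets tables exist):

```sql
-- Run once on Tez...
SET hive.execution.engine=tez;
SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;

-- ...then again on classic MapReduce, and compare the reported times.
SET hive.execution.engine=mr;
SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;
```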
Impala
Hive is not the only product providing SQL-on-Hadoop capability. The second most widely used is likely Impala, announced in late 2012 and released in spring 2013. Though originally developed internally within Cloudera, its source code is periodically pushed to an open source Git repository (https://github.com/cloudera/impala).
Impala was created out of the same perception of Hive's weaknesses that led to the Stinger initiative.
Impala also took some inspiration from Google Dremel (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf), which was first openly described in a paper published in 2010. Dremel was built at Google to address the gap between the need for very fast queries on very large datasets and the high latency inherent in the existing MapReduce model underpinning Hive at the time. Dremel was a sophisticated approach to this problem that, rather than building mitigations atop MapReduce such as those implemented by Hive, instead created a new service that accessed the same data stored in HDFS. Dremel also benefited from significant work to optimize the storage format of the data in a way that made it more amenable to very fast analytic queries.
The architecture of Impala
The basic architecture has three main components: the Impala daemons, the statestore, and the clients. Recent versions have added additional components that improve the service, but we'll focus on the high-level architecture.
The Impala daemon (impalad) should be run on each host where a DataNode process is managing HDFS data. Note that impalad does not access the filesystem blocks through the full HDFS FileSystem API; instead, it uses a feature called short-circuit reads to make data access more efficient.
When a client submits a query, it can do so to any of the running impalad processes, and this one will become the coordinator for the execution of that query. The key aspect of Impala's performance is that for each query, it generates custom native code, which is then pushed to and executed by all the impalad processes on the system. This highly optimized code performs the query on the local data, and each impalad then returns its subset of the result set to the coordinator node, which performs the final data consolidation to produce the final result. This type of architecture should be familiar to anyone who has worked with any of the (usually commercial and expensive) Massively Parallel Processing (MPP) data warehouse solutions available today (MPP being the term used for this type of shared-nothing scale-out architecture). As the cluster runs, the statestore daemon ensures that each impalad process is aware of all the others and provides a view of the overall cluster health.
Co-existing with Hive
Impala, as a newer product, tends to have a more restricted set of SQL data types and supports a more constrained dialect of SQL than Hive. It is, however, expanding this support with each new release. Refer to the Impala documentation (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/impala.html) to get an overview of the current level of support.
Impala supports the metastore mechanism used by Hive to persistently store the metadata surrounding its table structure and storage. This means that on a cluster with an existing Hive setup, it should be immediately possible to use Impala, as it will access the same metastore and therefore provide access to the same tables available in Hive.
But be warned that the differences in SQL dialect and data types might cause unexpected results when working in a combined Hive and Impala environment. Some queries might work on one but not the other, they might show very different performance characteristics (more on this later), or they might actually give different results. This last point might become apparent when using data types such as float and double, which are simply treated differently in the underlying systems (Hive is implemented in Java while Impala is written in C++).
As of version 1.2, Impala supports UDFs written in both C++ and Java, although C++ is strongly recommended as a much faster solution. Keep this in mind if you are looking to share custom functions between Hive and Impala.
A different philosophy
When Impala was first released, its greatest benefit was in how it truly enabled what is often called speed-of-thought analysis. Queries could be returned sufficiently fast that an analyst could explore a thread of analysis in a completely interactive fashion, without having to wait for minutes at a time for each query to complete. It's fair to say that most adopters of Impala were at times stunned by its performance, especially when compared to the version of Hive shipping at the time.
The Impala focus has remained mostly on these shorter queries, and this does impose some limitations on the system. Impala tends to be quite memory-heavy, as it relies on in-memory processing to achieve much of its performance. If a query requires a dataset to be held in memory that is larger than the memory available on the executing node, then that query will simply fail in versions of Impala before 2.0.
Comparing the work on Stinger to Impala, it could be argued that Impala has a much stronger focus on excelling at the shorter (and arguably more common) queries that support interactive data analysis. Many business intelligence tools and services are now certified to run directly on Impala. The Stinger initiative has put less effort into making Hive just as fast in the area where Impala excels but has instead improved Hive (to varying degrees) for all workloads. Impala is still developing at a fast pace, and Stinger has put additional momentum into Hive, so it is most likely wise to consider both products and determine which best meets the performance and functionality requirements of your projects and workflows.
It should also be kept in mind that there are competitive commercial pressures shaping the direction of Impala and Hive. Impala was created and is still driven by Cloudera, the most popular vendor of Hadoop distributions. The Stinger initiative, though contributed to by many companies as diverse as Microsoft (yes, really!) and Intel, was led by Hortonworks, probably the second largest vendor of Hadoop distributions. The fact is that if you are using the Cloudera distribution of Hadoop, then some of the core features of Hive might be slower to arrive, whereas Impala will always be up-to-date. Conversely, if you use another distribution, you might get the latest Hive release, but that might either have an older Impala or, as is currently the case, you might have to download and install it yourself.
A similar situation has arisen with the Parquet and ORC file formats mentioned earlier. Parquet is preferred by Impala and developed by a group of companies led by Cloudera, while ORC is preferred by Hive and is championed by Hortonworks.
Unfortunately, the reality is that Parquet support is often very quick to arrive in the Cloudera distribution but less so in, say, the Hortonworks distribution, where the ORC file format is preferred.
These themes are a little concerning since, although competition in this space is a good thing, and arguably the announcement of Impala helped energize the Hive community, there is a greater risk that your choice of distribution might have a larger impact on the tools and file formats that will be fully supported, unlike in the past. Hopefully, the current situation is just an artifact of where we are in the development cycles of all these new and improved technologies, but do consider your choice of distribution carefully in relation to your SQL-on-Hadoop needs.
Drill, Tajo, and beyond
You should also consider that SQL on Hadoop no longer refers only to Hive or Impala. Apache Drill (http://drill.apache.org) is a fuller implementation of the Dremel model first described by Google. Although Impala implements the Dremel architecture across HDFS data, Drill looks to provide similar functionality across multiple data sources. It is still in its early stages, but if your needs are broader than what Hive or Impala provides, it might be worth considering.
Tajo (http://tajo.apache.org) is another Apache project that seeks to be a full data warehouse system on Hadoop data. With an architecture similar to that of Impala, it offers a much richer system, with components such as multiple optimizers and ETL tools that are commonplace in traditional data warehouses but less frequently bundled in the Hadoop world. It has a much smaller user base but has been used by certain companies very successfully for a significant length of time, and it might be worth considering if you need a fuller data warehousing solution.
Other products are also emerging in this space, and it's a good idea to do some research. Hive and Impala are awesome tools, but if you find that they don't meet your needs, then look around; something else might.
Summary
In its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world.
HiveQL is an implementation of SQL on Hadoop and was the primary focus of this chapter. In regard to HiveQL and its implementations, we covered the following topics:
- How HiveQL provides a logical model atop data stored in HDFS, in contrast to relational databases where the table structure is enforced in advance
- How HiveQL supports many standard SQL data types and commands, including joins and views
- The ETL-like features offered by HiveQL, including the ability to import data into tables and optimize the table structure through partitioning and similar mechanisms
- How HiveQL offers the ability to extend its core set of operators with user-defined code, and how this contrasts with the Pig UDF mechanism
- The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez
- The broader ecosystem around HiveQL that now includes products such as Impala, Tajo, and Drill, and how each of these focuses on specific areas in which to excel
With Pig and Hive, we've introduced alternative models to process MapReduce data, but so far we've not looked at another question: what approaches and tools are required to actually allow this massive dataset being collected in Hadoop to remain useful and manageable over time? In the next chapter, we'll take a slight step up the abstraction hierarchy and look at how to manage the lifecycle of this enormous data asset.
Chapter 8. Data Lifecycle Management
Our previous chapters were quite technology-focused, describing particular tools or techniques and how they can be used. In this and the next chapter, we are going to take a more top-down approach, whereby we will describe a problem space you are likely to encounter and then explore how to address it. In particular, we'll cover the following topics:
- What we mean by the term data lifecycle management
- Why data lifecycle management is something to think about
- The categories of tools that can be used to address the problem
- How to use these tools to build the first half of a Twitter sentiment analysis pipeline
What data lifecycle management is
Data doesn't exist only at a point in time. Particularly for long-running production workflows, you are likely to acquire a significant quantity of data in a Hadoop cluster. Requirements rarely stay static for long, so alongside new logic you might also see the format of that data change or require multiple data sources to be used to provide the dataset processed in your application. We use the term data lifecycle management to describe an approach to handling the collection, storage, and transformation of data that ensures data is where it needs to be, in the format it needs to be in, in a way that allows data and system evolution over time.
ImportanceofdatalifecyclemanagementIfyoubuilddataprocessingapplications,youarebydefinitionreliantonthedatathatisprocessed.Justasweconsiderthereliabilityofapplicationsandsystems,itbecomesnecessarytoensurethatthedataisalsoproduction-ready.
DataatsomepointneedstobeingestedintoHadoop.Itisonepartofanenterpriseandoftenhasmultiplepointsofintegrationwithexternalsystems.Iftheingestofdatacomingfromthosesystemsisnotreliable,thentheimpactonthejobsthatprocessthatdataisoftenasdisruptiveasamajorsystemfailure.Dataingestbecomesacriticalcomponentinitsownright.Andwhenwesaytheingestneedstobereliable,wedon’tjustmeanthatdataisarriving;italsohastobearrivinginaformatthatisusableandthroughamechanismthatcanhandleevolutionovertime.
The problem with many of these issues is that they do not arise in a significant fashion until the flows are large, the system is critical, and the business impact of any problems is non-trivial. Ad hoc approaches that worked for a less critical data flow often will simply not scale, and will be very painful to replace on a live system.
Tools to help

But don't panic! There are a number of categories of tools that can help with the data lifecycle management problem. We'll give examples of the following three broad categories in this chapter:
- Orchestration services: building an ingest pipeline usually has multiple discrete stages, and we will use an orchestration tool to allow these to be described, executed, and managed
- Connectors: given the importance of integration with external systems, we will look at how we can use connectors to simplify the abstractions provided by Hadoop storage
- File formats: how we store the data impacts how we manage format evolution over time, and several rich storage formats have ways of supporting this
Building a tweet analysis capability

In earlier chapters, we used various implementations of Twitter data analysis to describe several concepts. We will take this capability to a deeper level and approach it as a major case study.
In this chapter, we will build a data ingest pipeline, constructing a production-ready dataflow that is designed with reliability and future evolution in mind.
We'll build out the pipeline incrementally throughout the chapter. At each stage, we'll highlight what has changed, but we can't include full listings at each stage without trebling the size of the chapter. The source code for this chapter, however, has every iteration in its full glory.
Getting the tweet data

The first thing we need to do is get the actual tweet data. As in previous examples, we can pass the -j and -n arguments to stream.py to dump JSON tweets to stdout:
$ stream.py -j -n 10000 > tweets.json
Since we have this tool that can create a batch of sample tweets on demand, we could start our ingest pipeline by having this job run on a periodic basis. But how?
Introducing Oozie

We could, of course, bang rocks together and use something like cron for simple job scheduling, but recall that we want an ingest pipeline that is built with reliability in mind. So, we really want a scheduling tool that we can use to detect failures and otherwise respond to exceptional situations.
The tool we will use here is Oozie (http://oozie.apache.org), a workflow engine and scheduler built with a focus on the Hadoop ecosystem.
Oozie provides a means to define a workflow as a series of nodes with configurable parameters and controlled transition from one node to the next. It is installed as part of the Cloudera QuickStart VM, and the main command-line client is, not surprisingly, called oozie.
Note
We've tested the workflows in this chapter against version 5.0 of the Cloudera QuickStart VM, and at the time of writing Oozie in the latest version, 5.1, has some issues. There's nothing particularly version-specific in our workflows, however, so they should be compatible with any correctly working Oozie v4 implementation.
Though powerful and flexible, Oozie can take a little getting used to, so we'll give some examples and describe what we are doing along the way.
The most common node in an Oozie workflow is an action. It is within action nodes that the steps of the workflow are actually executed; the other node types handle management of the workflow in terms of decisions, parallelism, and failure detection. Oozie has multiple types of actions that it can perform. One of these is the shell action, which can be used to execute any command on the system, such as native binaries, shell scripts, or any other command-line utility. Let's create a script to generate a file of tweets and copy this to HDFS:
set -e
source twitter.keys
python stream.py -j -n 500 > /tmp/tweets.out
hdfs dfs -put /tmp/tweets.out /tmp/tweets/tweets.out
rm -f /tmp/tweets.out
Note that the first line will cause the entire script to fail should any of the included commands fail. We use an environment file to provide the Twitter keys to our script in twitter.keys, which is of the following form:
export TWITTER_CONSUMER_KEY=<value>
export TWITTER_CONSUMER_SECRET=<value>
export TWITTER_ACCESS_KEY=<value>
export TWITTER_ACCESS_SECRET=<value>
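The effect of that set -e line can be seen without any Hadoop machinery at all. The following Python sketch runs two tiny scripts through sh to contrast the default behavior with the fail-fast behavior:

```python
import subprocess

# Without set -e, the shell continues past the failing command.
without = subprocess.run(
    ["sh", "-c", "false; echo reached"],
    capture_output=True, text=True
)

# With set -e, the failing command aborts the script immediately,
# so 'reached' is never printed and the exit code is non-zero.
with_e = subprocess.run(
    ["sh", "-c", "set -e; false; echo reached"],
    capture_output=True, text=True
)

print(without.stdout.strip())   # -> reached
print(with_e.stdout.strip())    # -> (empty)
print(with_e.returncode != 0)   # -> True
```

This is why a single failing hdfs dfs -put in the script above aborts the whole step rather than silently deleting the local file anyway.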
Oozie uses XML to describe its workflows, usually stored in a file called workflow.xml. Let's walk through the definition for an Oozie workflow that calls a shell command.
The schema for an Oozie workflow is called workflow-app, and we can give the workflow a specific name. This is useful when viewing job history in the CLI or Oozie web UI. In the examples in this book, we'll use an increasing version number to allow us to more easily separate the iterations within the source repository. This is how we give the workflow-app a specific name:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="v1">
Oozie workflows are made up of a series of connected nodes, each of which represents a step in the process, and which are represented by XML nodes in the workflow definition. Oozie has a number of nodes that deal with the transition of the workflow from one step to the next. The first of these is the start node, which simply states the name of the first node to be executed as part of the workflow, as follows:
<start to="fs-node"/>
We then have the definition for the named start node. In this case, it is an action node, which is the generic node type for most Oozie nodes that actually perform some processing, as follows:
<action name="fs-node">
Action is a broad category of nodes, and we will typically specialize it with the particular processing for this given node. In this case, we are using the fs node type, which allows us to perform filesystem operations:
<fs>
We want to ensure that the directory on HDFS to which we wish to copy the file of tweet data exists, is empty, and has suitable permissions. We do this by trying to delete the directory if it exists, then creating it, and finally applying the required permissions, as follows:
<delete path="${nameNode}/tmp/tweets"/>
<mkdir path="${nameNode}/tmp/tweets"/>
<chmod path="${nameNode}/tmp/tweets" permissions="777"/>
</fs>
We’llseeanalternativewayofsettingupdirectorieslater.Afterperformingthefunctionalityofthenode,Oozieneedsknowhowtoproceedwiththeworkflow.Inmostcases,thiswillcomprisemovingtoanotheractionnodeifthisnodewassuccessfulandabortingtheworkflowotherwise.Thisisspecifiedbythenextelements.Theoknodegivesthenameofthenodetowhichtotransitioniftheexecutionwassuccessful;theerrornodenamesthedestinationnodeforfailurescenarios.Here’showtheokandfailnodesareused:
<okto="shell-node"/>
<errorto="fail"/>
</action>
<action name="shell-node">
The second action node is again specialized with its specific processing type; in this case, we have a shell node:
<shell xmlns="uri:oozie:shell-action:0.2">
The shell action then has the Hadoop JobTracker and NameNode locations specified. Note that the actual values are given by variables; we'll explain where they come from later. The JobTracker and NameNode are specified as follows:
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
As mentioned in Chapter 3, Processing – MapReduce and Beyond, MapReduce uses multiple queues to provide support for different approaches to resource scheduling. The next element specifies the MapReduce queue to which the workflow should be submitted:
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
Now that the shell node is fully configured, we can specify the command to invoke, again via a variable, as follows:
<exec>${EXEC}</exec>
The various steps of Oozie workflows are executed as MapReduce jobs; this shell action will therefore be executed as a specific task instance on a particular TaskTracker. We need to specify which files should be copied to the local working directory on the TaskTracker machine before the action can be performed. In this case, we need to copy the main shell script, the Python tweet generator, and the Twitter config file, as follows:
<file>${workflowRoot}/${EXEC}</file>
<file>${workflowRoot}/twitter.keys</file>
<file>${workflowRoot}/stream.py</file>
After closing the shell element, we again specify what to do depending on whether the action completed successfully or not. Because MapReduce is used for job execution, the majority of node types by definition have built-in retry and recovery logic, though this is not the case for shell nodes:
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
If the workflow fails, let's just kill it in this case. The kill node type does exactly that: it stops the workflow from proceeding to any further steps, usually logging error messages along the way. Here's how the kill node type is used:
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
The end node, on the other hand, simply halts the workflow and logs it as a successful completion within Oozie:
<end name="end"/>
</workflow-app>
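With the workflow assembled, a quick development-time sanity check can catch broken transitions before submitting anything. The following sketch is our own convenience, not part of Oozie: it parses a workflow document and reports any to="…" target that doesn't name a defined node. The sample below deliberately omits shell-node so that the check has something to find:

```python
import xml.etree.ElementTree as ET

def undefined_transitions(workflow_xml):
    """Return transition targets that don't name a defined node.

    An empty set means every to="..." attribute points at a node that
    exists, so the workflow graph is at least wired up correctly.
    """
    root = ET.fromstring(workflow_xml)
    # Node definitions are the named top-level children (action, kill, end).
    defined = {el.get("name") for el in root if el.get("name")}
    # Transitions are any to="..." attribute (on start, ok, error elements).
    targets = {el.get("to") for el in root.iter() if el.get("to")}
    return targets - defined

sample = """
<workflow-app xmlns="uri:oozie:workflow:0.4" name="v1">
  <start to="fs-node"/>
  <action name="fs-node">
    <fs><mkdir path="/tmp/tweets"/></fs>
    <ok to="shell-node"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>failed</message></kill>
  <end name="end"/>
</workflow-app>
"""
print(undefined_transitions(sample))   # -> {'shell-node'}
```

Oozie itself rejects such workflows at submission time; checking locally just shortens the feedback loop.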
The obvious question is what the preceding variables represent and from where they get their concrete values. They are examples of the Oozie Expression Language, often referred to as EL.
Alongside the workflow definition file (workflow.xml), which describes the steps in the flow, we also need to create a configuration file that gives the specific values for a given execution of the workflow. This separation of functionality and configuration allows us to write workflows that can be used on different clusters, on different file locations, or with different variable values without having to recreate the workflow itself. By convention, this file is usually named job.properties. For the preceding workflow, here's a sample job.properties file.
Firstly, we specify the location of the JobTracker, the NameNode, and the MapReduce queue to which to submit the workflow. The following should work on the Cloudera 5.0 QuickStart VM, though in v5.1 the hostname has been changed to quickstart.cloudera. The important thing is that the specified NameNode and JobTracker addresses need to be in the Oozie whitelist; the local services on the VM are added automatically:
jobTracker=localhost.localdomain:8032
nameNode=hdfs://localhost.localdomain:8020
queueName=default
Next, we set some values for where the workflow definitions and associated files can be found on the HDFS filesystem. Note the use of a variable representing the username running the job. This allows a single workflow to be applied to different paths depending on the submitting user, as follows:
tasksRoot=book
workflowRoot=${nameNode}/user/${user.name}/${tasksRoot}/v1
oozie.wf.application.path=${nameNode}/user/${user.name}/${tasksRoot}/v1
Next, we name the command to be executed in the workflow as ${EXEC}:
EXEC=gettweets.sh
More complex workflows will require additional entries in the job.properties file; the preceding workflow is as simple as it gets.
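To make the ${var} references concrete, here is a minimal Python model of how such property values can be resolved. This is a simplification for illustration only, not Oozie's actual EL implementation (which also supports functions such as wf:errorMessage()); values such as user.name are supplied by the server at runtime, modeled here as "builtins":

```python
import re

def resolve_properties(text, builtins=None):
    """Parse name=value lines and expand ${var} references against
    earlier definitions and supplied built-in values."""
    props = dict(builtins or {})
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Replace each ${ref} with an already-known value; unknown
        # references are left untouched.
        value = re.sub(
            r"\$\{([^}]+)\}",
            lambda m: props.get(m.group(1), m.group(0)),
            value,
        )
        props[name.strip()] = value
    return props

sample = """
nameNode=hdfs://localhost.localdomain:8020
tasksRoot=book
workflowRoot=${nameNode}/user/${user.name}/${tasksRoot}/v1
"""
props = resolve_properties(sample, builtins={"user.name": "cloudera"})
print(props["workflowRoot"])
# -> hdfs://localhost.localdomain:8020/user/cloudera/book/v1
```

Note how workflowRoot composes three earlier values; this layering is what lets one workflow.xml serve many clusters and users.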
The oozie command-line tool needs to know where the Oozie server is running. This can be added as an argument to every Oozie shell command, but that gets unwieldy very quickly. Instead, you can set the shell environment variable, as follows:
$ export OOZIE_URL='http://localhost:11000/oozie'
After all that work, we can now actually run an Oozie workflow. Create a directory on HDFS as specified by the values in the job.properties file. With the preceding configuration, we'd be creating this as book/v1 under our home directory on HDFS. Copy the stream.py, gettweets.sh, and twitter.properties files to that directory; these are the files required to perform the actual execution of the shell command. Then, add the workflow.xml file to the same directory.
To run the workflow, we then do the following:
$ oozie job -run -config <path-to-job.properties>
If submitted successfully, Oozie will print the job name to the screen. You can see the current status of this workflow with:
$ oozie job -info <job-id>
You can also check the logs for the job:
$ oozie job -log <job-id>
In addition, all current and recent jobs can be viewed with:
$ oozie jobs
A note on HDFS file permissions

There is a subtle aspect in the shell command that can catch the unwary. As an alternative to having the fs node, we could instead include a prepare element within the shell node to create the directory we need on the filesystem. It would look like the following:
<prepare>
<mkdir path="${nameNode}/tmp/tweets"/>
</prepare>
The prepare stage is executed by the user who submitted the workflow, but since the actual script execution is performed on YARN, it is usually executed as the yarn user. You might hit a problem where the script generates the tweets, the /tmp/tweets directory is created on HDFS, but the script then fails to have permission to write to that directory. You can either resolve this through assigning permissions more precisely or, as shown earlier, add a filesystem node to encapsulate the needed operations. We'll use a mixture of both techniques in this chapter; for non-shell nodes, we'll use prepare elements, particularly if the needed directory is manipulated only by that node. For cases where a shell node is involved or where the created directories will be used across multiple nodes, we'll be safe and use the more explicit fs node.
Making development a little easier

It can sometimes get awkward to manage the files and resources for an Oozie job during development. Some need to be on HDFS, while some need to be local, and changes to some files require changes to others. The easiest approach is often to develop or make changes in a complete clone of the workflow directory on the local filesystem and push changes from there to the similarly named directory in HDFS, not forgetting, of course, to ensure that all changes are under revision control! For operational execution of the workflow, the job.properties file is the only thing that needs to be on the local filesystem and, conversely, all the other files need to be on HDFS. Always remember this: it's all too easy to make changes to a local copy of a workflow, forget to push the changes to HDFS, and then be confused as to why the workflow isn't reflecting the changes.
Extracting data and ingesting into Hive

With our data on HDFS, we can now extract the separate datasets for tweets and users, and place data as in previous chapters. We can reuse extract_for_hive.pig to parse the raw tweet JSON into separate files, store them again on HDFS, and then follow up with a Hive step that ingests these new files into Hive tables for tweets, users, and places.
To do this within Oozie, we'll need to add two new nodes to our workflow: a Pig action for the first step and a Hive action for the second.
For our Hive action, we'll just create three external tables that point to the files generated by Pig. This would then allow us to follow our previously described model of ingesting into temporary or external tables and using HiveQL INSERT statements from there to insert into the operational, and often partitioned, tables. This create.hql script can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch8/v2/hive/create.hql but is simply of the following form:
CREATE DATABASE IF NOT EXISTS twttr;
USE twttr;
DROP TABLE IF EXISTS tweets;
CREATE EXTERNAL TABLE tweets(
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/tweets';
DROP TABLE IF EXISTS user;
CREATE EXTERNAL TABLE user(
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/users';
DROP TABLE IF EXISTS place;
CREATE EXTERNAL TABLE place(
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/places';
Note that the file separator on each table is also explicitly set to match what we are outputting from Pig. In addition to this, locations in both scripts are specified by variables for which we will provide concrete values in our job.properties file.
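Since agreement on the separator is what ties the Pig output to the Hive table definitions, it is worth seeing the round trip in miniature. This Python sketch (with made-up field values) writes and re-reads a record using the same \u0001 delimiter:

```python
# Hive's default field delimiter is \x01 (Ctrl-A); the tables above
# declare it explicitly so they match what Pig writes out.
DELIM = "\u0001"

record = ["1234567890", "some tweet text", "en"]
line = DELIM.join(record)        # one row as stored on HDFS

# Reading the row back means splitting on the same delimiter:
fields = line.split(DELIM)
print(fields)   # -> ['1234567890', 'some tweet text', 'en']

# Ctrl-A is a good choice because, unlike tabs or commas, it almost
# never appears in free-form text such as tweets, so no escaping is
# needed for the fields themselves.
assert DELIM not in "some tweet text"
```

If the two sides disagreed on the delimiter, Hive would silently parse every row into a single mangled column rather than failing loudly.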
With the preceding statements, we can create the Pig node for our workflow, found in the source code as v2 of the pipeline. Much of the node definition looks similar to the shell node used previously, as we set the same configuration elements; also notice our use of the prepare element to create the needed output directory:
<action name="pig-node">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/${outputDir}"/>
<mkdir path="${nameNode}/${outputDir}"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
As with the shell command, we need to tell the Pig action the location of the actual Pig script. This is specified in the following script element:
<script>${workflowRoot}/pig/extract_for_hive.pig</script>
We also need to modify the command line used to invoke the Pig script to add several parameters. The following elements do this; note the construction pattern wherein one element adds the actual parameter name and the next its value (we'll see an alternative mechanism for passing arguments in the next section):
<argument>-param</argument>
<argument>inputDir=${inputDir}</argument>
<argument>-param</argument>
<argument>outputDir=${outputDir}</argument>
</pig>
Because we want to move from this step to the Hive node, we need to set the following elements appropriately:
<ok to="hive-node"/>
<error to="fail"/>
</action>
The Hive action itself is a little different from the previous nodes; even though it starts in a similar fashion, it specifies the Hive action-specific namespace, as follows:
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
The Hive action needs many of the configuration elements used by Hive itself and, in most cases, we copy the hive-site.xml file into the workflow directory and specify its location, as shown in the following XML; note that this mechanism is not Hive-specific and can also be used for custom actions:
<job-xml>${workflowRoot}/hive-site.xml</job-xml>
In addition, we might need to override some MapReduce default configuration properties, as shown in the following XML, where we specify that intermediate compression should be used for our job:
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
After configuring the Hive environment, we now specify the location of the Hive script:
<script>${workflowRoot}/hive/create.hql</script>
We also have to provide the mechanism to pass arguments to the Hive script. But instead of building out the command line one component at a time, we'll add the param elements that map the name of a configuration element in the job.properties file to variables specified in the Hive script; this mechanism is also supported with Pig actions:
<param>dbName=${dbName}</param>
<param>ingestDir=${ingestDir}</param>
</hive>
The Hive node then closes in the same way as the others:
<ok to="end"/>
<error to="fail"/>
</action>
We now need to put all this together to run the multistage workflow in Oozie. The full workflow.xml file can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch8/v2, and the workflow is visualized in the following diagram:
Data ingestion workflow v2
This workflow performs all the steps discussed before; it generates tweet data, extracts subsets of data via Pig, and then ingests these into Hive.
A note on workflow directory structure

We now have quite a few files in our workflow directory, and it is best to adopt some structure and naming conventions. For the current workflow, our directory on HDFS looks like the following:
/hive/
/hive/create.hql
/lib/
/pig/
/pig/extract_for_hive.pig
/scripts/
/scripts/gettweets.sh
/scripts/stream-json-batch.py
/scripts/twitter-keys
/hive-site.xml
/job.properties
/workflow.xml
The model we follow is to keep configuration files in the top-level directory but to keep files related to a given action type in dedicated subdirectories. Note that it is useful to have a lib directory even if empty, as some node types look for it.
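A small local sanity check, run before pushing the directory to HDFS, can catch a missing file or the easily forgotten empty lib directory. This is our own convenience script, not part of the book's toolchain; the required names are taken from the listing above:

```python
import os
import tempfile

REQUIRED_FILES = ["workflow.xml", "job.properties", "hive-site.xml"]
REQUIRED_DIRS = ["hive", "pig", "scripts", "lib"]

def check_workflow_dir(path):
    """List anything missing from the conventional layout shown above."""
    missing = [f for f in REQUIRED_FILES
               if not os.path.isfile(os.path.join(path, f))]
    missing += [d for d in REQUIRED_DIRS
                if not os.path.isdir(os.path.join(path, d))]
    return missing

# Demo against an empty scratch directory: everything is reported missing.
scratch = tempfile.mkdtemp()
print(sorted(check_workflow_dir(scratch)))
```

Running this against the local clone before each push to HDFS takes a second and avoids a failed Oozie submission several minutes later.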
With the preceding structure, the job.properties file for our combined job is now the following:
jobTracker=localhost.localdomain:8032
nameNode=hdfs://localhost.localdomain:8020
queueName=default
tasksRoot=book
workflowRoot=${nameNode}/user/${user.name}/${tasksRoot}/v2
oozie.wf.application.path=${nameNode}/user/${user.name}/${tasksRoot}/v2
oozie.use.system.libpath=true
EXEC=gettweets.sh
inputDir=/tmp/tweets
outputDir=/tmp/tweetdata
ingestDir=/tmp/tweetdata
dbName=twttr
In the preceding code, we've fully updated the workflow.xml definition to include all the steps described so far, including an initial fs node to create the required directory without worrying about user permissions.
Introducing HCatalog

If we look at our current workflow, there is inefficiency in how we use HDFS as the interface between Pig and Hive. We need to output the result of our Pig script onto HDFS, where the Hive script can then use it as the location of some new tables. What this highlights is that it is often very useful to have data stored in Hive, but this is limited, as few tools (primarily Hive) can access the Hive metastore and hence read and write such data. If we think about it, Hive has two main layers: its tools for accessing and manipulating its data, plus the execution framework to run queries on that data.
The HCatalog subproject of Hive effectively provides an independent implementation of the first of these layers: the means to access and manipulate data in the Hive metastore. HCatalog provides mechanisms for other tools, such as Pig and MapReduce, to natively read and write table-structured data that is stored on HDFS.
Remember, of course, that the data is stored on HDFS in one format or another. The Hive metastore provides the models to abstract these files into the relational table structure familiar from Hive. So when we say we are storing data in HCatalog, what we really mean is that we are storing data on HDFS in such a way that it can then be exposed by table structures specified within the Hive metastore. Conversely, when we refer to Hive data, what we really mean is data whose metadata is stored in the Hive metastore, and which can be accessed by any metastore-aware tool, such as HCatalog.
Using HCatalog

The HCatalog command-line tool is called hcat and will be preinstalled on the Cloudera QuickStart VM; it is installed, in fact, with any version of Hive from 0.11 onward.
The hcat utility doesn't have an interactive mode, so generally you will use it with explicit command-line arguments or by pointing it at a file of commands, as follows:
$ hcat -e "use default; show tables"
$ hcat -f commands.hql
Though the hcat tool is useful and can be incorporated into scripts, the more interesting element of HCatalog for our purposes here is its integration with Pig. HCatalog defines a new Pig loader called HCatLoader and a storer called HCatStorer. As the names suggest, these allow Pig scripts to read from or write to Hive tables directly. We can use this mechanism to replace our previous Pig and Hive actions in our Oozie workflow with a single HCatalog-based Pig action that writes the output of the Pig job directly into our tables in Hive.
For clarity, we'll create new tables named tweets_hcat, places_hcat, and users_hcat into which we'll insert this data; note that these are no longer external tables:
CREATE TABLE tweets_hcat …
CREATE TABLE places_hcat …
CREATE TABLE users_hcat …
Note that if we had these commands in a script file, we could use the hcat CLI tool to execute them, as follows:
$ hcat -f create.hql
The hcat CLI tool does not, however, offer an interactive shell akin to the Hive CLI. We can now use our previous Pig script and need only change the store commands, replacing the use of PigStorage with HCatStorer. Our updated Pig script, extract_to_hcat.pig, therefore includes store commands such as the following:
store tweets_tsv into 'twttr.tweets_hcat' using
org.apache.hive.hcatalog.pig.HCatStorer();
Note that the package name for the HCatStorer class has the org.apache.hive.hcatalog prefix; when HCatalog was in the Apache incubator, it used org.apache.hcatalog for its package prefix. This older form is now deprecated, and the new form that explicitly shows HCatalog as a subproject of Hive should be used instead.
With this new Pig script, we can now replace our previous Pig and Hive actions with an updated Pig action using HCatalog. This also requires the first usage of the Oozie sharelib, which we'll discuss in the next section. In our workflow definition, the pig element of this action will be defined as shown in the following XML and can be found as v3 of the pipeline in the source bundle; in v3, we've also added a utility Hive node to run before the Pig node to ensure that all necessary tables exist before the Pig script that requires them is executed.
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>${workflowRoot}/hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>oozie.action.sharelib.for.pig</name>
<value>pig,hcatalog</value>
</property>
</configuration>
<script>${workflowRoot}/pig/extract_to_hcat.pig</script>
<argument>-param</argument>
<argument>inputDir=${inputDir}</argument>
</pig>
The two changes of note are the addition of the explicit reference to the hive-site.xml file, which is required by HCatalog, and the new configuration element that tells Oozie to include the required HCatalog JARs.
The Oozie sharelib

That last addition touched on an important aspect of Oozie we've not mentioned thus far: the Oozie sharelib. When Oozie runs its various action types, it requires multiple JARs to access Hadoop and to invoke various tools, such as Hive and Pig. As part of the Oozie installation, a large number of dependent JARs have been placed on HDFS to be used by Oozie and its various action types: this is the Oozie sharelib.
For most usages of Oozie, it's enough to know that the sharelib exists, usually under /user/oozie/share/lib on HDFS, and that sometimes, as in the previous example, some explicit configuration values need to be added. When using a Pig action, the Pig JARs will automatically get picked up, but when the Pig script uses something like HCatalog, this dependency will not be explicitly known to Oozie.
The Oozie CLI allows manipulation of the sharelib, though the scenarios where this will be required are outside the scope of this book. The following command can be useful, though, to see which components are included in the Oozie sharelib:
$ oozie admin -shareliblist
The following command is useful to see the individual JARs comprising a particular component within the sharelib, in this case HCatalog:
$ oozie admin -shareliblist hcat
These commands can be useful to verify that the required JARs are being included and to see which specific versions are being used.
HCatalog and partitioned tables

If you rerun the previous workflow a second time, it will fail; dig into the logs, and you will see HCatalog complaining that it cannot write to a table that already contains data. This is a current limitation of HCatalog; it views tables, and partitions within tables, as immutable by default. Hive, on the other hand, will add new data to a table or partition; its default view of a table is that it is mutable.
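The difference between the two defaults can be modeled in a few lines of Python. This is purely an illustration of the semantics just described, not the HCatalog API:

```python
class PartitionedTable:
    """Toy model of immutable-partition vs append semantics."""

    def __init__(self, immutable=True):
        self.immutable = immutable
        self.partitions = {}   # partition_key -> list of rows

    def write(self, partition_key, rows):
        if self.immutable and partition_key in self.partitions:
            # Mirrors the failure described above: a second write to a
            # populated partition is rejected outright.
            raise ValueError("partition %r already contains data" % partition_key)
        self.partitions.setdefault(partition_key, []).extend(rows)

# HCatalog-style: the second write to partition 1 fails.
hcat_style = PartitionedTable(immutable=True)
hcat_style.write(1, ["row-a"])
try:
    hcat_style.write(1, ["row-b"])
except ValueError as e:
    print("rejected:", e)

# Hive-style: the second write appends.
hive_style = PartitionedTable(immutable=False)
hive_style.write(1, ["row-a"])
hive_style.write(1, ["row-b"])
print(hive_style.partitions[1])   # -> ['row-a', 'row-b']
```

The workaround the chapter adopts next follows directly from this model: if populated partitions cannot be rewritten, give each run a partition of its own.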
Upcoming changes to Hive and HCatalog will see the support of a new table property that will control this behavior in either tool; for example, the following added to a table definition would allow table appends as supported in Hive today:
TBLPROPERTIES("immutable"="false")
This is currently not available in the shipping version of Hive and HCatalog, however. For us to have a workflow that adds more and more data into our tables, we therefore need to create a new partition for each new run of the workflow. We've made these changes in v4 of our pipeline, where we first recreate the tables with an integer partition key, as follows:
CREATE TABLE tweets_hcat(
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE;
CREATE TABLE `places_hcat`(
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE
TBLPROPERTIES("immutable"="false");
CREATE TABLE `users_hcat`(
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE
TBLPROPERTIES("immutable"="false");
The Pig HCatStorer takes an optional partition definition, and we modify the store statements in our Pig script accordingly; for example:
store tweets_tsv into 'twttr.tweets_hcat'
using org.apache.hive.hcatalog.pig.HCatStorer(
'partition_key=$partitionKey');
We then modify our Pig action in the workflow.xml file to include this additional parameter:
<script>${workflowRoot}/pig/extract_to_hcat.pig</script>
<param>inputDir=${inputDir}</param>
<param>partitionKey=${partitionKey}</param>
The question is then how we pass this partition key to the workflow. We could specify it in the job.properties file, but by doing so we would hit the same problem of trying to write to an existing partition on the next re-run.
Ingestion workflow v4
For now, we'll pass this as an explicit argument to the invocation of the Oozie CLI and explore better ways to do this later:
$ oozie job -run -config v4/job.properties -DpartitionKey=12345
Note
Note that a consequence of this behavior is that rerunning an HCat workflow with the same arguments will fail. Be aware of this when testing workflows or playing with the sample code from this book.
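One common way to avoid picking the key by hand is to derive it from the submission time, so that successive runs never collide. This is our own illustrative suggestion (the helper name make_partition_key is made up), not the mechanism the book settles on later:

```python
import time

def make_partition_key(now=None):
    """Derive an integer partition key such as 20140704123000 from the
    UTC time; runs more than a second apart get distinct partitions.
    'now' is seconds since the epoch, defaulting to the current time."""
    return int(time.strftime("%Y%m%d%H%M%S", time.gmtime(now)))

# The computed key would then be passed through to the Oozie CLI:
#   oozie job -run -config v4/job.properties -DpartitionKey=<key>
print(make_partition_key(0))   # -> 19700101000000 (the Unix epoch)
```

Keeping the key time-ordered also means newer partitions sort after older ones, which can simplify later housekeeping queries.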
Producing derived data

Now that we have our main data pipeline established, there is most likely a series of actions that we wish to take after we add each new additional dataset. As a simple example, note that with our previous mechanism of adding each set of user data to a separate partition, the users_hcat table will contain users multiple times. Let's create a new table for unique users and regenerate this each time we add new user data.
Note that, given the aforementioned limitations of HCatalog, we'll use a Hive action for this purpose, as we need to replace the data in a table.
First, we'll create a new table for unique user information, as follows:
CREATE TABLE IF NOT EXISTS `unique_users`(
`user_id` string,
`name` string,
`description` string,
`screen_name` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;
In this table, we'll only store the attributes of a user that either never change (ID) or change rarely (the screen name, and so on). We can then write a simple Hive statement to populate this table from the full users_hcat table:

USE twttr;
INSERT OVERWRITE TABLE unique_users
SELECT DISTINCT user_id, name, description, screen_name
FROM users_hcat;
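As an illustrative sketch of what the SELECT DISTINCT statement does, here is the same deduplication expressed in plain Python (the sample records are made up for the example; Hive performs this at scale across the cluster):

```python
def distinct_users(rows):
    # Keep one row per distinct combination of the selected columns,
    # mirroring SELECT DISTINCT user_id, name, description, screen_name.
    seen = set()
    out = []
    for row in rows:
        key = (row["user_id"], row["name"], row["description"], row["screen_name"])
        if key not in seen:
            seen.add(key)
            out.append(dict(row))
    return out

users = [
    {"user_id": "1", "name": "a", "description": "d", "screen_name": "s"},
    {"user_id": "1", "name": "a", "description": "d", "screen_name": "s"},
    {"user_id": "2", "name": "b", "description": "e", "screen_name": "t"},
]
print(len(distinct_users(users)))  # 2
```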
We can then add an additional Hive action node that comes after our previous Pig node in the workflow. When doing this, we discover that our pattern of simply giving nodes names such as hive-node is a really bad idea, as we now have two Hive-based nodes. In v5 of the workflow, we add this new node and also change our nodes to have more descriptive names:
Ingestion workflow v5

Performing multiple actions in parallel

Our workflow has two types of activity: initial setup with the nodes that initialize the filesystem and Hive tables, and the functional nodes that perform actual processing. If we look at the two setup nodes we have been using, it is obvious that they are quite distinct and not interdependent. We can therefore take advantage of an Oozie feature called fork and join nodes to execute these actions in parallel. The start of our workflow.xml file now becomes:
<start to="setup-fork-node"/>

The Oozie fork node contains a number of path elements, each of which specifies a starting node. Each of these will be launched in parallel:

<fork name="setup-fork-node">
    <path start="setup-filesystem-node"/>
    <path start="create-tables-node"/>
</fork>
Each of the specified action nodes is no different from any we have used previously. An action node can link to a series of other nodes; the only requirement is that each parallel series of actions must end with a transition to the join node associated with the fork node, as follows:

<action name="setup-filesystem-node">
…
    <ok to="setup-join-node"/>
    <error to="fail"/>
</action>
<action name="create-tables-node">
…
    <ok to="setup-join-node"/>
    <error to="fail"/>
</action>
The join node itself acts as the point of coordination; any path that has completed will wait until all the paths specified in the fork node reach this point. The workflow then continues at the node specified within the join node. Here's how the join node is used:

<join name="create-join-node" to="gettweets-node"/>

In the preceding code we omitted the action definitions for space purposes, but the full workflow definition is in v6:
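The fork/join semantics — launch every path in parallel, continue only once all of them have finished — can be sketched in plain Python with concurrent.futures. The node names mirror the workflow above, but the functions are stand-ins, not real setup logic:

```python
from concurrent.futures import ThreadPoolExecutor

def setup_filesystem_node():
    return "filesystem ready"

def create_tables_node():
    return "tables ready"

def run_fork_join(actions):
    # Launch all fork paths in parallel; collecting every result before
    # returning is the implicit barrier that a join node provides.
    with ThreadPoolExecutor(max_workers=len(actions)) as pool:
        futures = [pool.submit(action) for action in actions]
        return [f.result() for f in futures]

results = run_fork_join([setup_filesystem_node, create_tables_node])
print(results)
```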
Ingestion workflow v6

Calling a subworkflow

Though the fork/join mechanism makes the processing of parallel actions more efficient, it does still add significant verbosity if we include it in our main workflow.xml definition. Conceptually, we have a series of actions that are performing related tasks required by our workflow but not necessarily part of it. For this and similar cases, Oozie offers the ability to invoke a subworkflow. The parent workflow will execute the child and wait for it to complete, with the ability to pass configuration elements from one workflow to the other.

The child workflow will be a full workflow in its own right, usually stored in a directory on HDFS with all the usual structure we expect for a workflow: the main workflow.xml file and any required Hive, Pig, or similar files.

We can create a new directory on HDFS called setup-workflow, and in this create the files required only for our filesystem and Hive creation actions. The subworkflow configuration file will look like the following:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="create-workflow">
    <start to="setup-fork-node"/>
    <fork name="setup-fork-node">
        <path start="setup-filesystem-node"/>
        <path start="create-tables-node"/>
    </fork>
    <action name="setup-filesystem-node">
…
    </action>
    <action name="create-tables-node">
…
    </action>
    <join name="create-join-node" to="end"/>
    <kill name="fail">
        <message>Action failed, error message [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
With this subworkflow defined, we then modify the first nodes of our main workflow to use a subworkflow node, as in the following:

<start to="create-subworkflow-node"/>
<action name="create-subworkflow-node">
    <sub-workflow>
        <app-path>${subWorkflowRoot}</app-path>
        <propagate-configuration/>
    </sub-workflow>
    <ok to="gettweets-node"/>
    <error to="fail"/>
</action>
We will specify subWorkflowRoot in the job.properties of our parent workflow, and the propagate-configuration element will pass the configuration of the parent workflow to the child.
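Conceptually, the configuration the child sees is the parent's properties merged with its own. The following sketch illustrates that merge with plain dictionaries; the precedence rule (child values win over propagated parent values) is an assumption of this sketch, not a statement of Oozie's exact behavior:

```python
def effective_child_config(parent_conf, child_conf):
    # Start from the propagated parent configuration, then let the child's
    # own properties add to or override it (assumed override order).
    merged = dict(parent_conf)
    merged.update(child_conf)
    return merged

parent = {"nameNode": "hdfs://nn:8020", "dbName": "twttr"}
child = {"dbName": "twttr_setup"}
merged = effective_child_config(parent, child)
print(merged)
```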
Adding global settings

By extracting utility nodes into subworkflows, we can significantly reduce clutter and complexity in our main workflow definition. In v7 of our ingest pipeline, we'll make one additional simplification and add a global configuration section, as in the following:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="v7">
    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <job-xml>${workflowRoot}/hive-site.xml</job-xml>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
    </global>
    <start to="create-subworkflow-node"/>
By adding this global configuration section, we remove the need to specify any of these values in the Hive and Pig nodes in the remaining workflow (note that currently the shell node does not support the global configuration mechanism). This can dramatically simplify some of our nodes; for example, our Pig node is now as follows:

<action name="hcat-ingest-node">
    <pig>
        <configuration>
            <property>
                <name>oozie.action.sharelib.for.pig</name>
                <value>pig,hcatalog</value>
            </property>
        </configuration>
        <script>${workflowRoot}/pig/extract_to_hcat.pig</script>
        <param>inputDir=${inputDir}</param>
        <param>dbName=${dbName}</param>
        <param>partitionKey=${partitionKey}</param>
    </pig>
    <ok to="derived-data-node"/>
    <error to="fail"/>
</action>
As can be seen, we can add additional configuration elements, or indeed override those specified in the global section, resulting in a much cleaner action definition that focuses only on the information specific to the action in question. Workflow v7 has had both a global section and the subworkflow added, and this makes a significant improvement in the workflow's readability:

Ingestion workflow v7
Challenges of external data

When we rely on external data to drive our application, we are implicitly dependent on the quality and stability of that data. This is, of course, true for any data, but when the data is generated by an external source over which we do not have control, the risks are most likely higher. Regardless, when building what we expect to be reliable applications on top of such data feeds, and especially when our data volumes grow, we need to think about how to mitigate these risks.
Data validation

We use the general term data validation to refer to the act of ensuring that incoming data complies with our expectations, potentially applying normalization to modify it accordingly or even deleting malformed or corrupt input. What this actually involves will be very application-specific. In some cases, the important thing is ensuring the system only ingests data that conforms to a given definition of accurate or clean. For our tweet data, we don't care about every single record and could very easily adopt a policy such as dropping records that don't have values in particular fields we care about. For other applications, however, it is imperative to capture every input record, and this might drive the implementation of logic to reformat every record to make sure it complies with the requirements. In yet other cases, only correct records will be ingested, but the rest, instead of being discarded, might be stored elsewhere for later analysis.

The bottom line is that trying to define a generic approach to data validation is vastly beyond the scope of this chapter. However, we can offer some thoughts on where in the pipeline to incorporate various types of validation logic.
Validation actions

Logic to do any necessary validation or cleanup can be incorporated directly into other actions. A shell node running a script to gather data can have commands added to handle malformed records differently. Pig and Hive actions that load data into tables can either perform filtering on ingest (more easily done in Pig) or add caveats when copying data from an ingest table to the operational store.

There is an argument, though, for the addition of a validation node into the workflow, even if initially it performs no actual logic. This could, for instance, be a Pig action that reads the data, applies the validation, and writes the validated data to a new location to be read by follow-on nodes. The advantage here is that we can later update the validation logic without altering our other actions, which should reduce the risk of accidentally breaking the rest of the pipeline and also make nodes more cleanly defined in terms of responsibilities. The natural extension of this train of thought is that a new subworkflow for validation is most likely a good model as well, as it not only provides separation of responsibilities, but also makes the validation logic easier to test and update.

The obvious disadvantage of this approach is that it adds additional processing and another cycle of reading the data and writing it all again. This is, of course, directly working against one of the advantages we highlighted when considering the use of HCatalog from Pig.

In the end, it will come down to a trade-off of performance against workflow complexity and maintainability. When considering how to perform validation and just what that means for your workflow, take all these elements into account before deciding on an implementation.
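The record-splitting policy described above — pass records that meet our expectations onward, route the rest elsewhere for later analysis — can be sketched in a few lines of Python. The required fields here are assumptions for the example, not the book's actual validation rules:

```python
def validate_tweets(records, required_fields=("tweet_id_str", "text")):
    # Split incoming records into those that satisfy our expectations
    # and those set aside for later analysis rather than discarded.
    valid, rejected = [], []
    for rec in records:
        if all(rec.get(field) for field in required_fields):
            valid.append(rec)
        else:
            rejected.append(rec)
    return valid, rejected

records = [
    {"tweet_id_str": "1", "text": "hello"},
    {"tweet_id_str": "2", "text": ""},  # malformed: empty text field
]
valid, rejected = validate_tweets(records)
print(len(valid), len(rejected))  # 1 1
```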
Handling format changes

We can't declare victory just because we have data flowing into our system and are confident the data is sufficiently validated. Particularly when the data comes from an external source, we have to think about how the structure of the data might change over time.

Remember that systems such as Hive only apply the table schema when the data is being read. This is a huge benefit in enabling flexible data storage and ingest, but can lead to user-facing queries or workloads failing suddenly when the ingested data no longer matches the queries being executed against it. A relational database, which applies schemas on write, would not even allow such data to be ingested into the system.

The obvious approach to handling changes made to the data format would be to reprocess existing data into the new format. Though this is tractable on smaller datasets, it quickly becomes infeasible on the sort of volumes seen in large Hadoop clusters.
Handling schema evolution with Avro

Avro has some features with respect to its integration with Hive that help us with this problem. If we take our table for tweets data, we could represent the structure of a tweet record by the following Avro schema:

{
    "namespace": "com.learninghadoop2.avrotables",
    "type": "record",
    "name": "tweets_avro",
    "fields": [
        {"name": "created_at", "type": ["null", "string"]},
        {"name": "tweet_id_str", "type": ["null", "string"]},
        {"name": "text", "type": ["null", "string"]},
        {"name": "in_reply_to", "type": ["null", "string"]},
        {"name": "is_retweeted", "type": ["null", "string"]},
        {"name": "user_id", "type": ["null", "string"]},
        {"name": "place_id", "type": ["null", "string"]}
    ]
}
Create the preceding schema in a file called tweets_avro.avsc (the standard file extension for Avro schemas). Then, place it on HDFS; we like to have a common location for schema files such as /schema/avro.

With this definition, we can now create a Hive table that uses this schema for its table specification, as follows:

CREATE TABLE tweets_avro
PARTITIONED BY (`partition_key` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='hdfs://localhost.localdomain:8020/schema/avro/tweets_avro.avsc'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
Then, look at the table definition from within Hive (or HCatalog, which also supports such definitions):

describe tweets_avro;
OK
created_at      string  from deserializer
tweet_id_str    string  from deserializer
text            string  from deserializer
in_reply_to     string  from deserializer
is_retweeted    string  from deserializer
user_id         string  from deserializer
place_id        string  from deserializer
partition_key   int     None
We can also use this table like any other, for example, to copy the data from the non-Avro table into the Avro table, as follows:

SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE tweets_avro
PARTITION (partition_key)
SELECT * FROM tweets_hcat;

Note: just as in previous examples, if the Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before being able to select from the table.
We now have a new tweets table specified by an Avro schema; so far it just looks like other tables. But the real benefits for our purposes in this chapter are in how we can use the Avro mechanism to handle schema evolution. Let's add a new field to our table schema, as follows:

{
    "namespace": "com.learninghadoop2.avrotables",
    "type": "record",
    "name": "tweets_avro",
    "fields": [
        {"name": "created_at", "type": ["null", "string"]},
        {"name": "tweet_id_str", "type": ["null", "string"]},
        {"name": "text", "type": ["null", "string"]},
        {"name": "in_reply_to", "type": ["null", "string"]},
        {"name": "is_retweeted", "type": ["null", "string"]},
        {"name": "user_id", "type": ["null", "string"]},
        {"name": "place_id", "type": ["null", "string"]},
        {"name": "new_feature", "type": "string", "default": "wow!"}
    ]
}
With this new schema in place, we can validate that the table definition has also been updated, as follows:

describe tweets_avro;
OK
created_at      string  from deserializer
tweet_id_str    string  from deserializer
text            string  from deserializer
in_reply_to     string  from deserializer
is_retweeted    string  from deserializer
user_id         string  from deserializer
place_id        string  from deserializer
new_feature     string  from deserializer
partition_key   int     None
Without adding any new data, we can run queries on the new field that will return the default value for our existing data, as follows:

SELECT new_feature FROM tweets_avro LIMIT 5;
...
OK
wow!
wow!
wow!
wow!
wow!
Even more impressive is the fact that the new column doesn't need to be added at the end; it can be anywhere in the record. With this mechanism, we can now update our Avro schemas to represent the new data structure and see these changes automatically reflected in our Hive table definitions. Any queries that refer to the new column will retrieve the default value for all our existing data that does not have that field present.

Note that the default mechanism we are using here is core to Avro and is not specific to Hive. Avro is a very powerful and flexible format that has applications in many areas and is definitely worth deeper examination than we are giving it here.
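To make the default mechanism concrete, here is a plain-Python sketch of the resolution rule Avro applies when a reader's schema has a field the stored record lacks. This is not the Avro library itself; the simplified field dicts are stand-ins for real schema objects:

```python
def read_with_schema(record, fields):
    # Mimic one case of Avro schema resolution: a field missing from the
    # stored record takes the reader schema's default value, if it has one.
    out = {}
    for field in fields:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError("no value and no default for " + field["name"])
    return out

fields = [
    {"name": "tweet_id_str"},
    {"name": "new_feature", "default": "wow!"},
]
old_record = {"tweet_id_str": "12345"}  # written before new_feature existed
print(read_with_schema(old_record, fields))
```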
Technically, what this provides us with is forward compatibility. We can make changes to our table schema and have all our existing data remain automatically compliant with the new structure. We can't, however, continue to ingest data of the old format into the updated tables, since the mechanism does not provide backward compatibility:

INSERT INTO TABLE tweets_avro
PARTITION (partition_key)
SELECT * FROM tweets_hcat;
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into
target table because column number/types are different 'tweets_avro': Table
insclause-0 has 8 columns, but query has 7 columns.
Supporting schema evolution with Avro allows data changes to be handled as part of normal business instead of the firefighting emergency they all too often turn into. But plainly, it's not for free; there is still a need to make the changes in the pipeline and roll these into production. Having Hive tables that provide forward compatibility does, however, allow the process to be performed in more manageable steps; otherwise, you would need to synchronize changes across every stage of the pipeline. If the changes are made from ingest up to the point records are inserted into Avro-backed Hive tables, then all users of those tables can remain unchanged (as long as they don't do things like SELECT *, which is usually a terrible idea anyway) and continue to run existing queries against the new data. These applications can then be changed on a different timetable to the ingestion mechanism. In v8 of our ingest pipeline, we show how to fully use Avro tables for all of our existing functionality.

Note: Hive 0.14, unreleased at the time of writing, will likely include more built-in support for Avro that might simplify the process of schema evolution even further. If Hive 0.14 is available when you read this, then do check out the final implementation.
Final thoughts on using Avro schema evolution

With this discussion of Avro, we have touched on some aspects of much broader topics, in particular data management on a broader scale and policies around data versioning and retention. Much of this area becomes very specific to an organization, but here are a few parting thoughts that we feel are more broadly applicable.
Only make additive changes

We discussed adding columns in the preceding example. Sometimes, though more rarely, your source data drops columns or you discover you no longer need a column. Avro doesn't really provide tools to help with this, and we feel it is often undesirable. Instead of dropping old columns, we tend to maintain the old data and simply do not use the empty columns in all the new data. This is much easier to manage if you control the data format; if you are ingesting external sources, then to follow this approach you will either need to reprocess data to remove the old column or change the ingest mechanism to add a default value for all new data.
Manage schema versions explicitly

In the preceding examples, we had a single schema file to which we made changes directly. This is likely a very bad idea, as it removes our ability to track schema changes over time. In addition to treating schemas as artifacts to be kept under version control (your schemas are in Git too, aren't they?), it is often useful to tag each schema with an explicit version. This is particularly useful when the incoming data is also explicitly versioned. Then, instead of overwriting the existing schema file, you can add the new file and use an ALTER TABLE statement to point the Hive table definition at the new schema. We are, of course, assuming here that you don't have the option of using a different query for the old data with the different format. Though there is no automatic mechanism for Hive to select a schema, there might be cases where you can control this manually and sidestep the evolution question.
Think about schema distribution

When using a schema file, think about how it will be distributed to the clients. If, as in the previous example, the file is on HDFS, then it likely makes sense to give it a high replication factor. The file will be retrieved by each mapper in every MapReduce job that queries the table.

The Avro URL can also be specified as a local filesystem location (file://), which is useful for development, and also as a web resource (http://). Though the latter is very useful as a convenient mechanism to distribute the schema to non-Hadoop clients, remember that the load on the web server might be high. With modern hardware and efficient web servers, this is most likely not a huge concern, but if you have a cluster of thousands of machines running many parallel jobs where each mapper needs to hit the web server, then be careful.
Collecting additional data

Many data processing systems don't have a single data ingest source; often, one primary source is enriched by other secondary sources. We will now look at how to incorporate the retrieval of such reference data into our data warehouse.

At a high level, the problem isn't very different from our retrieval of the raw tweet data, as we wish to pull data from an external source, possibly do some processing on it, and store it somewhere it can be used later. But this does highlight an aspect we need to consider: do we really want to retrieve this data every time we ingest new tweets? The answer is certainly no. The reference data changes very rarely, and we could easily fetch it much less frequently than new tweet data. This raises a question we've skirted until now: just how do we schedule Oozie workflows?
Scheduling workflows

Until now, we've run all our Oozie workflows on demand from the CLI. Oozie also has a scheduler that allows jobs to be started either on a timed basis or when external criteria, such as data appearing in HDFS, are met. It would be a good fit for our workflows to have our main tweet pipeline run, say, every 10 minutes but the reference data only refreshed daily.

Tip: regardless of when data is retrieved, think carefully about how to handle datasets that perform a delete/replace operation. In particular, don't do the delete before retrieving and validating the new data; otherwise, any jobs that require the reference data will fail until the next run of the retrieval succeeds. It could be a good option to include the destructive operations in a subworkflow that is only triggered after successful completion of the retrieval steps.

Oozie actually defines two types of applications that it can run: workflows such as we've used so far, and coordinators, which schedule workflows to be executed based on various criteria. A coordinator job is conceptually similar to our other workflows; we push an XML configuration file onto HDFS and use a parameterized properties file to configure it at runtime. In addition, coordinator jobs have the facility to receive additional parameterization from the events that trigger their execution.

This is possibly best described by an example. Let's say we wish to do as previously mentioned and create a coordinator that executes v7 of our ingest workflow every 10 minutes. Here's the coordinator.xml file (the standard name for the coordinator XML definition):
<coordinator-app name="tweets-10min-coordinator" frequency="${freq}"
    start="${startTime}" end="${endTime}" timezone="UTC"
    xmlns="uri:oozie:coordinator:0.2">
The main action node in a coordinator is the workflow, for which we need to specify its root location on HDFS and all required properties, as follows:

<action>
    <workflow>
        <app-path>${workflowPath}</app-path>
        <configuration>
            <property>
                <name>workflowRoot</name>
                <value>${workflowRoot}</value>
            </property>
…

We also need to include any properties required by any action in the workflow or by any subworkflow it triggers; in effect, this means that any user-defined variables present in any of the workflows to be triggered need to be included here, as follows:
            <property>
                <name>dbName</name>
                <value>${dbName}</value>
            </property>
            <property>
                <name>partitionKey</name>
                <value>${coord:formatTime(coord:nominalTime(), 'yyyyMMddhhmm')}</value>
            </property>
            <property>
                <name>exec</name>
                <value>gettweets.sh</value>
            </property>
            <property>
                <name>inputDir</name>
                <value>/tmp/tweets</value>
            </property>
            <property>
                <name>subWorkflowRoot</name>
                <value>${subWorkflowRoot}</value>
            </property>
        </configuration>
    </workflow>
</action>
</coordinator-app>
We used a few coordinator-specific features in the preceding XML. Note the specification of the starting and ending time of the coordinator and also its frequency (in minutes). We are using the simplest form here; Oozie also has a set of functions that allow quite rich specifications of the frequency.

We use coordinator EL functions in our definition of the partitionKey variable. Earlier, when running workflows from the CLI, we specified these explicitly but mentioned there was a better way: this is it. The following expression generates a formatted output containing the year, month, day, hour, and minute:

${coord:formatTime(coord:nominalTime(), 'yyyyMMddhhmm')}

If we then use this as the value for our partition key, we can ensure that each invocation of the workflow correctly creates a unique partition in our HCatalog tables.
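The formatting the EL expression performs can be reproduced in Python to check what partition keys to expect (this sketch uses a 24-hour hour field and a made-up nominal time):

```python
from datetime import datetime, timezone

def partition_key_for(nominal_time):
    # Equivalent of coord:formatTime(coord:nominalTime(), ...): a string
    # built from the year, month, day, hour, and minute of the run.
    return nominal_time.strftime("%Y%m%d%H%M")

# Hypothetical nominal time for one coordinator materialization.
run = datetime(2014, 7, 1, 10, 30, tzinfo=timezone.utc)
print(partition_key_for(run))  # 201407011030
```

Because the nominal time differs for every materialization, each run yields a distinct key, which is what makes the partition unique.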
The corresponding job.properties for the coordinator job looks much like our previous config files, with the usual entries for the NameNode and similar variables as well as values for the application-specific variables, such as dbName. In addition, we need to specify the root of the coordinator location on HDFS, as follows:
oozie.coord.application.path=${nameNode}/user/${user.name}/${tasksRoot}/tweets_10min

Note the oozie.coord namespace prefix instead of the previously used oozie.wf. With the coordinator definition on HDFS, we can submit the file to Oozie just as with the previous jobs. But in this case, the job will only run for a given time period. Specifically, it will run at the configured frequency (every 10 minutes in our case, set via the freq variable) whenever the system clock is between startTime and endTime.
We've included the full configuration in the tweets_10min directory in the source code for this chapter.
Other Oozie triggers

The preceding coordinator has a very simple trigger; it starts periodically within a specified time range. Oozie has an additional capability called datasets, where it can be triggered by the availability of new data.

This isn't a great fit for how we've defined our pipeline until now, but imagine that, instead of our workflow collecting tweets as its first step, an external system was pushing new files of tweets onto HDFS on a continuous basis. Oozie can be configured either to look for the presence of new data based on a directory pattern or to specifically trigger when a ready file appears on HDFS. This latter configuration provides a very convenient mechanism with which to integrate the output of MapReduce jobs, which by default write a _SUCCESS file into their output directory.

Oozie datasets are arguably one of the most powerful parts of the whole system, and we cannot do them justice here for space reasons. But we do strongly recommend that you consult the Oozie homepage for more information.
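The ready-file convention is simple enough to sketch: a dataset instance is considered available once the _SUCCESS marker exists in its output directory. This sketch checks a local directory rather than HDFS, purely for illustration:

```python
import os
import tempfile

def ready_to_trigger(output_dir):
    # A dataset instance is ready when the job's _SUCCESS marker file
    # exists in its output directory (the default MapReduce behaviour).
    return os.path.exists(os.path.join(output_dir, "_SUCCESS"))

with tempfile.TemporaryDirectory() as d:
    print(ready_to_trigger(d))   # False: job output not yet complete
    open(os.path.join(d, "_SUCCESS"), "w").close()
    print(ready_to_trigger(d))   # True: marker written, safe to consume
```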
Pulling it all together

Let's review what we've discussed until now and how we can use Oozie to build a sophisticated series of workflows that implement an approach to data lifecycle management by putting together all the discussed techniques.

First, it's important to define clear responsibilities and implement parts of the system using good design and separation-of-concerns principles. By applying this, we end up with several different workflows:

- A subworkflow to ensure the environment (mainly HDFS and Hive metadata) is correctly configured
- A subworkflow to perform data validation
- The main workflow that triggers both the preceding subworkflows and then pulls new data through a multistep ingest pipeline
- A coordinator that executes the preceding workflows every 10 minutes
- A second coordinator that ingests reference data that will be useful to the application pipeline

We also define all our tables with Avro schemas and use them wherever possible to help manage schema evolution and changing data formats over time.

We present the full source code of these components in the final version of the workflow in the source code of this chapter.
Other tools to help

Though Oozie is a very powerful tool, it can sometimes be difficult to correctly write workflow definition files. As pipelines get sizeable, managing complexity becomes a challenge even with good functional partitioning into multiple workflows. At a simpler level, XML is just never fun for a human to write! There are a few tools that can help. Hue, the tool calling itself the Hadoop UI (http://gethue.com/), provides some graphical tools to help compose, execute, and manage Oozie workflows. Though powerful, Hue is not a beginner tool; we'll mention it a little more in Chapter 11, Where to Go Next.

A new Apache project called Falcon (http://falcon.incubator.apache.org) might also be of interest. Falcon uses Oozie to build a range of much higher-level dataflows and actions. For example, Falcon provides recipes to enable and ensure cross-site replication across multiple Hadoop clusters. The Falcon team is working on much better interfaces to build their workflows, so the project might well be worth watching.
Summary

Hopefully, this chapter presented the topic of data lifecycle management as something other than a dry abstract concept. We covered a lot, particularly:

- The definition of data lifecycle management and how it covers a number of issues and techniques that usually become important with large data volumes
- The concept of building a data ingest pipeline along good data lifecycle management principles that can then be utilized by higher-level analytic tools
- Oozie as a Hadoop-focused workflow manager and how we can use it to compose a series of actions into a unified workflow
- Various Oozie tools, such as subworkflows, parallel action execution, and global variables, that allow us to apply true design principles to our workflows
- HCatalog and how it provides the means for tools other than Hive to read and write table-structured data; we showed its great promise and integration with tools such as Pig but also highlighted some current weaknesses
- Avro as our tool of choice to handle schema evolution over time
- Using Oozie coordinators to build scheduled workflows based either on time intervals or data availability to drive the execution of multiple ingest pipelines
- Some other tools that can make these tasks easier, namely, Hue and Falcon

In the next chapter, we'll look at several of the higher-level analytic tools and frameworks that can build sophisticated application logic upon the data collected in an ingest pipeline.
Chapter 9. Making Development Easier

In this chapter, we will look at how, depending on use cases and end goals, application development in Hadoop can be simplified using a number of abstractions and frameworks built on top of the Java APIs. In particular, we will learn about the following topics:

- How the Streaming API allows us to write MapReduce jobs using dynamic languages such as Python and Ruby
- How frameworks such as Apache Crunch and Kite Morphlines allow us to express data transformation pipelines using higher-level abstractions
- How Kite Data, a promising framework developed by Cloudera, provides us with the ability to apply design patterns and boilerplate to ease integration and interoperability of different components within the Hadoop ecosystem
Choosing a framework

In the previous chapters, we looked at the MapReduce and Spark programming APIs to write distributed applications. Although very powerful and flexible, these APIs come with a certain level of complexity and possibly require significant development time.
In an effort to reduce verbosity, we introduced the Pig and Hive frameworks, which compile domain-specific languages, Pig Latin and HiveQL, into a number of MapReduce jobs or Spark DAGs, effectively abstracting the APIs away. Both languages can be extended with UDFs, which are a way of mapping complex logic to the Pig and Hive data models.
At times when we need a certain degree of flexibility and modularity, things can get tricky. Depending on the use case and developer needs, the Hadoop ecosystem presents a vast choice of APIs, frameworks, and libraries. In this chapter, we identify four categories of users and match them with the following relevant tools:
Developers that want to avoid Java in favor of scripting MapReduce jobs using dynamic languages, or use languages not implemented on the JVM. A typical use case would be upfront analysis and rapid prototyping: Hadoop streaming
Java developers that need to integrate components of the Hadoop ecosystem and could benefit from codified design patterns and boilerplate: Kite Data
Java developers who want to write modular data pipelines using a familiar API: Apache Crunch
Developers who would rather configure chains of data transformations. For instance, a data engineer that wants to embed existing code in an ETL pipeline: Kite Morphlines
Hadoop streaming

We have mentioned previously that MapReduce programs don't have to be written in Java. There are several reasons why you might want or need to write your map and reduce tasks in another language. Perhaps you have existing code to leverage or need to use third-party binaries; the reasons are varied and valid.
Hadoop provides a number of mechanisms to aid non-Java development, primary amongst which are Hadoop Pipes, which provides a native C++ interface, and Hadoop streaming, which allows any program that uses standard input and output to be used for map and reduce tasks. With the MapReduce Java API, both map and reduce tasks provide implementations for methods that contain the task functionality. These methods receive the input to the task as method arguments and then output results via the Context object. This is a clear and type-safe interface, but it is by definition Java-specific.
Hadoop streaming takes a different approach. With streaming, you write a map task that reads its input from standard input, one line at a time, and gives the output of its results to standard output. The reduce task then does the same, again using only standard input and output for its data flow.
Any program that reads and writes from standard input and output can be used in streaming, such as compiled binaries, Unix shell scripts, or programs written in a dynamic language such as Python or Ruby. The biggest advantage of streaming is that it allows you to try out ideas and iterate on them more quickly than with Java. Instead of a compile/JAR/submit cycle, you just write the scripts and pass them as arguments to the streaming JAR file. Especially when doing initial analysis on a new dataset or trying out new ideas, this can significantly speed up development.
The classic debate regarding dynamic versus static languages balances the benefits of swift development against runtime performance and type checking. These dynamic downsides also apply when using streaming. Consequently, we favor the use of streaming for upfront analysis and Java for the implementation of jobs that will be executed on the production cluster.
Streaming word count in Python

We'll demonstrate Hadoop streaming by re-implementing our familiar word count example using Python. First, we create a script that will be our mapper. It consumes UTF-8 encoded rows of text from standard input with a for loop, splits each into words, and uses the print function to write each word to standard output, as follows:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    # skip empty lines
    if line == '\n':
        continue
    # preserve utf-8 encoding
    try:
        line = line.encode('utf-8')
    except UnicodeDecodeError:
        continue
    # newline characters can appear within the text
    line = line.replace('\n', '')
    # lowercase and tokenize
    line = line.lower().split()
    for term in line:
        if not term:
            continue
        try:
            print(
                u"%s" % (
                    term.decode('utf-8')))
        except UnicodeEncodeError:
            continue
The reducer counts the number of occurrences of each word read from standard input and writes each word with its final count to standard output, as follows:
#!/usr/bin/env python
import sys

count = 1
current = None
for word in sys.stdin:
    word = word.strip()
    if word == current:
        count += 1
    else:
        if current:
            print "%s\t%s" % (current.decode('utf-8'), count)
        current = word
        count = 1
if current == word:
    print "%s\t%s" % (current.decode('utf-8'), count)
Note

In both cases, we are implicitly using the Hadoop input and output formats discussed in the earlier chapters. It is the TextInputFormat that processes the source file and provides each line, one at a time, to the map script. Conversely, the TextOutputFormat will ensure that the output of reduce tasks is also correctly written as text.
Copy map.py and reduce.py to HDFS, and execute the scripts as a streaming job using the sample data from the previous chapters, as follows:
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file map.py \
    -mapper "python map.py" \
    -file reduce.py \
    -reducer "python reduce.py" \
    -input sample.txt \
    -output output.txt
Note

Tweets are UTF-8 encoded. Make sure that PYTHONIOENCODING is set accordingly in order to pipe data in a UNIX terminal:

$ export PYTHONIOENCODING='UTF-8'
The same code can be executed from the command-line prompt; note that a sort between the two scripts emulates the shuffle phase that groups identical words together:

$ cat sample.txt | python map.py | sort | python reduce.py > out.txt
The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/wc/python/map.py.
Differences in jobs when using streaming

In Java, we know that our map() method will be invoked once for each input key/value pair and our reduce() method will be invoked once for each key and its set of values.
With streaming, we don't have the concept of the map or reduce methods anymore; instead, we have written scripts that process streams of received data. This changes how we need to write our reducer. In Java, the grouping of values for each key was performed by Hadoop; each invocation of the reduce method would receive a single key and all its values. In streaming, each instance of the reduce task is given the individual, ungathered values one at a time.
Hadoop streaming does sort the keys; for example, if a mapper emitted the following data:
First 1
Word 1
Word 1
A 1
First 1
The streaming reducer would receive it in the following order:
A 1
First 1
First 1
Word 1
Word 1
Hadoop still collects the values for each key and ensures that each key is passed only to a single reducer. In other words, a reducer gets all the values for a number of keys, and they are grouped together; however, they are not packaged into individual executions of the reducer, that is, one per key, as with the Java API. Since Hadoop streaming uses the stdin and stdout channels to exchange data between tasks, debug and error messages should not be printed to standard output. In the following example, we will use the Python logging (https://docs.python.org/2/library/logging.html) package to log warning statements to a file.
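A common idiom for streaming reducers is to rebuild the per-key grouping with itertools.groupby, relying on the sorted order illustrated above. The following is a sketch with names of our own choosing (reduce_stream and parse are not part of any Hadoop API), not code from the book's repository:

```python
import io
from itertools import groupby

def parse(stream):
    # each input line is "key<TAB>value", as emitted by a streaming mapper
    for line in stream:
        key, value = line.rstrip('\n').split('\t', 1)
        yield key, value

def reduce_stream(stream, out):
    # Hadoop sorts map output by key, so all values for a key arrive
    # consecutively; groupby rebuilds the per-key batches that a Java
    # reduce() invocation would receive in a single call
    for key, group in groupby(parse(stream), key=lambda kv: kv[0]):
        out.write("%s\t%d\n" % (key, sum(int(v) for _, v in group)))

# in a real job this would be reduce_stream(sys.stdin, sys.stdout);
# here we feed it the sorted sample shown above
sorted_input = io.StringIO("A\t1\nFirst\t1\nFirst\t1\nWord\t1\nWord\t1\n")
result = io.StringIO()
reduce_stream(sorted_input, result)
```

Because groupby only merges adjacent items, this idiom is correct only on sorted input, which is exactly the guarantee the streaming shuffle provides.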
Finding important words in text

We will now implement a metric, Term Frequency-Inverse Document Frequency (TF-IDF), that will help us determine the importance of words based on how frequently they appear across a set of documents (tweets, in our case).
Intuitively, if a word appears frequently in a document, it is important and should be given a high score. However, if a word appears in many documents, we should penalize it with a lower score, as it is a common word and its frequency is not unique to this document.
Therefore, common words such as "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single tweet will be scaled up. Uses of TF-IDF, often in combination with other metrics and techniques, include stop word removal and text classification. Note that this technique has shortcomings when dealing with short documents, such as tweets; in such cases, the term frequency component will tend to become one. Conversely, one could exploit this property to detect outliers.
The definition of TF-IDF we will use in our example is the following:
tf = # of times a term appears in a document (raw frequency)
idf = 1 + log(# of documents / # of documents with the term in it)
tf-idf = tf * idf
We will implement the algorithm in Python using three MapReduce jobs:

The first one calculates term frequency
The second one calculates document frequency (the denominator of IDF)
The third one calculates per-tweet TF-IDF
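Before distributing the computation, the formula can be sanity-checked locally on a toy corpus. The following single-process sketch is our own illustration, independent of the three MapReduce jobs; it computes the same tf * (1 + log(N / df)) score:

```python
import math
from collections import Counter

def tf_idf(docs):
    # score each (term, doc_id) pair with tf * (1 + log(N / df))
    n_docs = float(len(docs))
    df = Counter()  # document frequency: in how many docs each term occurs
    for doc in docs:
        df.update(set(doc))
    scores = {}
    for doc_id, doc in enumerate(docs):
        for term, tf in Counter(doc).items():
            scores[(term, doc_id)] = tf * (1 + math.log(n_docs / df[term]))
    return scores

docs = [["the", "cat", "sat"],
        ["the", "dog"],
        ["the", "the", "end"]]
scores = tf_idf(docs)
# "the" occurs in every document, so idf = 1 + log(3/3) = 1 and its score
# collapses to the raw term frequency
```

Running the same data through the three jobs below should reproduce these scores, which makes the local function a useful check while developing the pipeline.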
Calculate term frequency

The term frequency part is very similar to the word count example. The main difference is that we will be using a multi-field, tab-separated key to keep track of co-occurrences of terms and document IDs. For each tweet, in JSON format, the mapper extracts the id_str and text fields, tokenizes text, and emits a (term, doc_id) tuple:
import json
import logging
import sys

# log warnings to a local file; stdout is reserved for map output
logging.basicConfig(filename='map-tf.log', level=logging.WARNING)
logger = logging.getLogger(__name__)

for tweet in sys.stdin:
    # skip empty lines
    if tweet == '\n':
        continue
    try:
        tweet = json.loads(tweet)
    except:
        logger.warn("Invalid input %s" % tweet)
        continue
    # In our example one tweet corresponds to one document.
    doc_id = tweet['id_str']
    if not doc_id:
        continue
    # preserve utf-8 encoding
    text = tweet['text'].encode('utf-8')
    # newline characters can appear within the text
    text = text.replace('\n', '')
    # lowercase and tokenize
    text = text.lower().split()
    for term in text:
        try:
            print(
                u"%s\t%s" % (
                    term.decode('utf-8'), doc_id.decode('utf-8'))
            )
        except UnicodeEncodeError:
            logger.warn("Invalid term %s" % term)
In the reducer, we emit the frequency of each term in a document as a tab-separated string:
freq = 1
cur_term, cur_doc_id = sys.stdin.readline().split()
for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id = line.split('\t')
    except:
        logger.warn("Invalid record %s" % line)
        continue
    # the key is a (doc_id, term) pair
    if (doc_id == cur_doc_id) and (term == cur_term):
        freq += 1
    else:
        print(
            u"%s\t%s\t%s" % (
                cur_term.decode('utf-8'), cur_doc_id.decode('utf-8'),
                freq))
        cur_doc_id = doc_id
        cur_term = term
        freq = 1
print(
    u"%s\t%s\t%s" % (
        cur_term.decode('utf-8'), cur_doc_id.decode('utf-8'), freq))
For this implementation to work, it is crucial that the reducer input is sorted by term and document ID. We can test both scripts from the command line with the following pipe:
$ cat tweets.json | python map-tf.py | sort -k1,2 | \
    python reduce-tf.py
Whereas at the command line we use the sort utility, in MapReduce we will use org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator. This comparator implements a subset of the features provided by the sort command. In particular, ordering by field can be specified with the -k<position> option. To sort by term, the first field of our key, we would set -D mapreduce.text.key.comparator.options=-k1:
$ /usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D map.output.key.field.separator=\t \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator \
    -D mapreduce.text.key.comparator.options=-k1,2 \
    -input tweets.json \
    -output /tmp/tf-out.tsv \
    -file map-tf.py \
    -mapper "python map-tf.py" \
    -file reduce-tf.py \
    -reducer "python reduce-tf.py"
Note

We specify which fields belong to the key (for shuffling) in the comparator options.
The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/map-tf.py.
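To see why the comparator options matter, the effect of sort -k1,2 (and of KeyFieldBasedComparator with -k1,2) can be emulated in a few lines of Python: sorting on the first two tab-separated fields makes identical (term, doc_id) pairs adjacent, which is what the reducer's grouping logic relies on. The records below are made-up illustrations:

```python
# made-up map output: "term<TAB>doc_id" records in arbitrary (unsorted) order
map_output = [
    "hadoop\tdoc2",
    "big\tdoc1",
    "hadoop\tdoc1",
    "hadoop\tdoc1",
]

# order by the first two tab-separated fields, as `sort -k1,2` does;
# identical (term, doc_id) pairs become adjacent, so a streaming reducer
# can count them with a simple "compare to previous record" loop
shuffled = sorted(map_output, key=lambda rec: rec.split('\t')[:2])
```

Without this ordering, occurrences of the same (term, doc_id) pair could be interleaved with other records and the single-pass reducer would undercount them.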
Calculate document frequency

The main logic to calculate document frequency is in the reducer, while the mapper is just an identity function that loads and pipes the (ordered by term) output of the TF job. In the reducer, for each term, we count how many times it occurs across all documents. For each term, we keep a buffer, key_cache, of (term, doc_id, tf) tuples; when a new term is found, we flush the buffer to standard output, together with the accumulated document frequency df:
# Cache the (term, doc_id, tf) tuples.
key_cache = []

line = sys.stdin.readline().strip()
cur_term, cur_doc_id, cur_tf = line.split('\t')
cur_tf = int(cur_tf)
cur_df = 1

for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id, tf = line.strip().split('\t')
        tf = int(tf)
    except:
        logger.warn("Invalid record: %s" % line)
        continue
    # term is the only key for this input
    if (term == cur_term):
        # increment document frequency
        cur_df += 1
        key_cache.append(
            u"%s\t%s\t%s" % (term.decode('utf-8'), doc_id.decode('utf-8'),
                             tf))
    else:
        for key in key_cache:
            print("%s\t%s" % (key, cur_df))
        print(
            u"%s\t%s\t%s\t%s" % (
                cur_term.decode('utf-8'),
                cur_doc_id.decode('utf-8'),
                cur_tf, cur_df)
        )
        # flush the cache
        key_cache = []
        cur_doc_id = doc_id
        cur_term = term
        cur_tf = tf
        cur_df = 1

for key in key_cache:
    print(u"%s\t%s" % (key.decode('utf-8'), cur_df))
print(
    u"%s\t%s\t%s\t%s" % (
        cur_term.decode('utf-8'),
        cur_doc_id.decode('utf-8'),
        cur_tf, cur_df))
We can test the scripts from the command line with:

$ cat /tmp/tf-out.tsv | python map-df.py | python reduce-df.py > /tmp/df-out.tsv
And we can test the scripts on Hadoop streaming with:
$ /usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D map.output.key.field.separator=\t \
    -D stream.num.map.output.key.fields=3 \
    -D mapreduce.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator \
    -D mapreduce.text.key.comparator.options=-k1 \
    -input /tmp/tf-out.tsv/part-00000 \
    -output /tmp/df-out.tsv \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -file reduce-df.py \
    -reducer "python reduce-df.py"
On Hadoop, we use org.apache.hadoop.mapred.lib.IdentityMapper, which provides the same logic as the map-df.py script.
The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/map-df.py.
Putting it all together – TF-IDF

To calculate TF-IDF, we only need a mapper that consumes the output of the previous step:
num_doc = sys.argv[1]
for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id, tf, df = line.split('\t')
        tf = float(tf)
        df = float(df)
        num_doc = float(num_doc)
    except:
        logger.warn("Invalid record %s" % line)
        continue
    # idf = num_doc / df
    tf_idf = tf * (1 + math.log(num_doc / df))
    print("%s\t%s\t%s" % (term, doc_id, tf_idf))
The number of documents in the collection is passed as a parameter to tf-idf.py:
$ /usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.reduce.tasks=0 \
    -input /tmp/df-out.tsv/part-00000 \
    -output /tmp/tf-idf.out \
    -file tf-idf.py \
    -mapper "python tf-idf.py 15578"
To calculate the total number of tweets, we can use the cat and wc Unix utilities in combination with Hadoop streaming:
$ /usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input tweets.json \
    -output tweets.cnt \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
The mapper source code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/tf-idf.py.
Kite Data

The Kite SDK (http://www.kitesdk.org) is a collection of classes, command-line tools, and examples that aims at easing the process of building applications on top of Hadoop.
In this section, we will look at how Kite Data, a subproject of Kite, can ease integration with several components of a Hadoop data warehouse. Kite examples can be found at https://github.com/kite-sdk/kite-examples.
On Cloudera's QuickStart VM, Kite JARs can be found at /opt/cloudera/parcels/CDH/lib/kite/.
Kite Data is organized into a number of subprojects, some of which we'll describe in the following sections.
Data Core

As the name suggests, the core is the building block for all capabilities provided in the Data module. Its principal abstractions are datasets and repositories.

The org.kitesdk.data.Dataset interface is used to represent an immutable set of data:
@Immutable
public interface Dataset<E> extends RefinableView<E> {
  String getName();
  DatasetDescriptor getDescriptor();
  Dataset<E> getPartition(PartitionKey key, boolean autoCreate);
  void dropPartition(PartitionKey key);
  Iterable<Dataset<E>> getPartitions();
  URI getUri();
}
Each dataset is identified by a name and an instance of the org.kitesdk.data.DatasetDescriptor interface, which is the structural description of a dataset and provides its schema (org.apache.avro.Schema) and partitioning strategy.
Implementations of the DatasetReader<E> interface are used to read data from an underlying storage system and produce deserialized entities of type E. The newReader() method can be used to get an appropriate implementation for a given dataset:
public interface DatasetReader<E> extends Iterator<E>, Iterable<E>,
    Closeable {
  void open();
  boolean hasNext();
  E next();
  void remove();
  void close();
  boolean isOpen();
}
An instance of DatasetReader will provide methods to read and iterate over streams of data. Similarly, org.kitesdk.data.DatasetWriter provides an interface to write streams of data to Dataset objects:
public interface DatasetWriter<E> extends Flushable, Closeable {
  void open();
  void write(E entity);
  void flush();
  void close();
  boolean isOpen();
}
Like readers, writers are use-once objects. They serialize instances of entities of type E and write them to the underlying storage system. Writers are usually not instantiated directly; rather, an appropriate implementation can be created by the newWriter() factory method. Implementations of DatasetWriter will hold resources until close() is called and expect the caller to invoke close() in a finally block when the writer is no longer in use. Finally, note that implementations of DatasetWriter are typically not thread-safe; the behavior of a writer being accessed from multiple threads is undefined.
A particular case of a dataset is the View interface, which is as follows:
public interface View<E> {
  Dataset<E> getDataset();
  DatasetReader<E> newReader();
  DatasetWriter<E> newWriter();
  boolean includes(E entity);
  public boolean deleteAll();
}
Views carry subsets of the keys and partitions of an existing dataset; they are conceptually similar to the notion of a view in the relational model.
A View interface can be created from ranges of data, ranges of keys, or as a union of other views.
Data HCatalog

Data HCatalog is a module that enables access to HCatalog repositories. The core abstractions of this module are org.kitesdk.data.hcatalog.HCatalogAbstractDatasetRepository and its concrete implementation, org.kitesdk.data.hcatalog.HCatalogDatasetRepository.

They describe a DatasetRepository that uses HCatalog to manage metadata and HDFS for storage, as follows:
public class HCatalogDatasetRepository extends
    HCatalogAbstractDatasetRepository {

  HCatalogDatasetRepository(Configuration conf) {
    super(conf, new HCatalogManagedMetadataProvider(conf));
  }

  HCatalogDatasetRepository(Configuration conf, MetadataProvider provider) {
    super(conf, provider);
  }

  public <E> Dataset<E> create(String name, DatasetDescriptor descriptor) {
    getMetadataProvider().create(name, descriptor);
    return load(name);
  }

  public boolean delete(String name) {
    return getMetadataProvider().delete(name);
  }

  public static class Builder {
    …
  }
}
Note

As of Kite 0.17, Data HCatalog is deprecated in favor of the new Data Hive module.
The location of the data directory is either chosen by Hive/HCatalog (so-called managed tables) or specified when creating an instance of this class by providing a filesystem and a root directory in the constructor (external tables).
Data Hive

The kite-data-hive module exposes Hive schemas via the Dataset interface. As of Kite 0.17, this package supersedes Data HCatalog.
Data MapReduce

The org.kitesdk.data.mapreduce package provides interfaces to read and write data to and from a Dataset with MapReduce.
Data Spark

The org.kitesdk.data.spark package provides interfaces for reading and writing data to and from a Dataset with Apache Spark.
Data Crunch

org.kitesdk.data.crunch.CrunchDatasets is a helper class that exposes datasets and views as Crunch ReadableSource or Target classes:
public class CrunchDatasets {

  public static <E> ReadableSource<E> asSource(View<E> view, Class<E> type) {
    return new DatasetSourceTarget<E>(view, type);
  }

  public static <E> ReadableSource<E> asSource(URI uri, Class<E> type) {
    return new DatasetSourceTarget<E>(uri, type);
  }

  public static <E> ReadableSource<E> asSource(String uri, Class<E> type) {
    return asSource(URI.create(uri), type);
  }

  public static <E> Target asTarget(View<E> view) {
    return new DatasetTarget<E>(view);
  }

  public static Target asTarget(String uri) {
    return asTarget(URI.create(uri));
  }

  public static Target asTarget(URI uri) {
    return new DatasetTarget<Object>(uri);
  }
}
Apache Crunch

Apache Crunch (http://crunch.apache.org) is a Java and Scala library for creating pipelines of MapReduce jobs. It is based on Google's FlumeJava paper (http://dl.acm.org/citation.cfm?id=1806638) and library. The project goal is to make the task of writing MapReduce jobs as straightforward as possible for anybody familiar with the Java programming language, by exposing a number of patterns that implement operations such as aggregating, joining, filtering, and sorting records.
Similar to tools such as Pig, Crunch pipelines are created by composing immutable, distributed data structures and running all processing operations on such structures; these operations are expressed and implemented as user-defined functions. Pipelines are compiled into a DAG of MapReduce jobs, whose execution is managed by the library's planner. Crunch allows us to write iterative code and abstracts away the complexity of thinking in terms of map and reduce operations, while at the same time avoiding the need for an ad hoc programming language such as Pig Latin. In addition, Crunch offers a highly customizable type system that allows us to work with, and mix, Hadoop Writables, HBase, and Avro serialized objects.
FlumeJava's main assumption is that MapReduce is the wrong level of abstraction for several classes of problems, where computations are often made up of multiple, chained jobs. Frequently, we need to compose logically independent operations (for example, filtering, projecting, grouping, and other transformations) into a single physical MapReduce job for performance reasons. This aspect also has implications for code testability. Although we won't cover this aspect in this chapter, the reader is encouraged to look further into it by consulting Crunch's documentation.
Getting started

Crunch JARs are already installed on the QuickStart VM. By default, the JARs are found in /opt/cloudera/parcels/CDH/lib/crunch.

Alternatively, recent Crunch libraries can be downloaded from https://crunch.apache.org/download.html, from Maven Central, or from Cloudera-specific repositories.
Concepts

Crunch pipelines are created by composing two abstractions: PCollection and PTable.

The PCollection<T> interface is a distributed, immutable collection of objects of type T. The PTable<Key, Value> interface is a distributed, immutable hash table (a sub-interface of PCollection) of keys of the Key type and values of the Value type that exposes methods to work with key-value pairs.
These two abstractions support the following four primitive operations:

parallelDo: applies a user-defined function, DoFn, to a given PCollection and returns a new PCollection
union: merges two or more PCollections into a single virtual PCollection
groupByKey: sorts and groups the elements of a PTable by their keys
combineValues: aggregates the values from a groupByKey operation
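Although Crunch is a Java API, the semantics of the four primitives are easy to sketch in single-process Python. The toy functions below are our own illustration, not part of Crunch:

```python
from itertools import chain, groupby

# toy, in-memory analogues of Crunch's four primitives (illustrative only)
def parallel_do(pcollection, do_fn):
    # do_fn may emit zero or more outputs per element, like a DoFn
    return list(chain.from_iterable(do_fn(x) for x in pcollection))

def union(*pcollections):
    return list(chain(*pcollections))

def group_by_key(ptable):
    # sort so identical keys are adjacent, then gather their values
    ordered = sorted(ptable, key=lambda kv: kv[0])
    return [(k, [v for _, v in grp])
            for k, grp in groupby(ordered, key=lambda kv: kv[0])]

def combine_values(grouped, combine_fn):
    return [(k, combine_fn(vs)) for k, vs in grouped]

# word count expressed in terms of the four primitives
lines = ["big data", "big hadoop"]
pairs = parallel_do(lines, lambda line: [(w, 1) for w in line.split()])
counts = combine_values(group_by_key(pairs), sum)
```

In Crunch, of course, each primitive runs distributed across the cluster, and groupByKey maps onto the MapReduce shuffle rather than an in-memory sort.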
The example at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/HashtagCount.java implements a Crunch MapReduce pipeline that counts hashtag occurrences:
Pipeline pipeline = new MRPipeline(HashtagCount.class, getConf());
pipeline.enableDebug();

PCollection<String> lines = pipeline.readTextFile(args[0]);

PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
  public void process(String line, Emitter<String> emitter) {
    for (String word : line.split("\\s+")) {
      if (word.matches("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")) {
        emitter.emit(word);
      }
    }
  }
}, Writables.strings());

PTable<String, Long> counts = words.count();
pipeline.writeTextFile(counts, args[1]);

// Execute the pipeline as a MapReduce.
pipeline.done();
In this example, we first create an MRPipeline and use it to read the content of sample.txt, created with stream.py -t, into a collection of strings, where each element of the collection represents a tweet. We tokenize each tweet into words with line.split("\\s+"), and we emit each word that matches the hashtag regular expression, serialized as a Writable. Note that the tokenizing and filtering operations are executed in parallel by MapReduce jobs created by the parallelDo call. We then create a PTable that associates each hashtag, represented as a string, with the number of times it occurred in the dataset. Finally, we write the PTable counts to HDFS as a text file. The pipeline is executed with pipeline.done().
To compile and execute the pipeline, we can use Gradle to manage the needed dependencies, as follows:

$ ./gradlew jar
$ ./gradlew copyJars
Add the Crunch and Avro dependencies downloaded with copyJars to the LIBJARS environment variable:

$ export CRUNCH_DEPS=build/libjars/crunch-example/lib
$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/crunch-core-0.9.0-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-1.7.5-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-mapred-1.7.5-cdh5.0.3-hadoop2.jar
Then, run the example on Hadoop:

$ hadoop jar build/libs/crunch-example.jar \
    com.learninghadoop2.crunch.HashtagCount \
    tweets.json count-out \
    -libjars $LIBJARS
Data serialization

One of the framework's goals is to make it easy to process complex records containing nested and repeated data structures, such as protocol buffers and Thrift records.
The org.apache.crunch.types.PType interface defines the mapping between a data type used in a Crunch pipeline and a serialization and storage format used to read/write data from/to HDFS. Every PCollection has an associated PType that tells Crunch how to read/write data.
The org.apache.crunch.types.PTypeFamily interface provides an abstract factory to implement instances of PType that share the same serialization format. Currently, Crunch supports two type families: one based on the Writable interface and the other on Apache Avro.
Note

Although Crunch permits mixing and matching PCollection interfaces that use different instances of PType in the same pipeline, each PCollection's PType must belong to a single family. For instance, it is not possible to have a PTable with a key serialized as a Writable and its value serialized using Avro.
Both type families support a common set of primitive types (strings, longs, integers, floats, doubles, booleans, and bytes) as well as more complex PType interfaces that can be constructed out of other PTypes. These include tuples and collections of other PTypes. A particularly important complex PType is tableOf, which determines whether the return type of parallelDo will be a PCollection or a PTable.
New PTypes can be created by inheriting from and extending the built-ins of the Avro and Writable families. This requires implementing inputMapFn<S, T> and outputMapFn<T, S> classes, where S is the original type and T is the new type.
Derived PTypes can be found in the PTypes class. These include serialization support for protocol buffers, Thrift records, Java Enums, BigInteger, and UUIDs. The Elephant Bird library we discussed in Chapter 6, Data Analysis with Apache Pig, contains additional examples.
Data processing patterns

org.apache.crunch.lib implements a number of design patterns for common data manipulation operations.

Aggregation and sorting

Most of the data processing patterns provided by org.apache.crunch.lib rely on the PTable's groupByKey method. The method has three different overloaded forms:
groupByKey():letstheplannerdeterminethenumberofpartitionsgroupByKey(intnumPartitions):isusedtosetthenumberofpartitionsspecifiedbythedevelopergroupByKey(GroupingOptionsoptions):allowsustospecifycustompartitionsandcomparatorsforshuffling
Theorg.apache.crunch.GroupingOptionsclasstakesinstancesofHadoop’sPartitionerandRawComparatorclassestoimplementcustompartitioningandsortingoperations.
ThegroupByKeymethodreturnsaninstanceofPGroupedTable,Crunch’srepresentationofagroupedtable.ItcorrespondstotheoutputoftheshufflephaseofaMapReducejobandallowsvaluestobecombinedwiththecombineValuemethod.
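Conceptually, groupByKey followed by combineValues behaves like the shuffle and combine steps of MapReduce. A stdlib-only sketch of that behavior (our own illustration, not Crunch code) might look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of groupByKey + combineValues: group (key, value) pairs by key
// (the "PGroupedTable"), then fold each group's values with a combiner.
public class GroupBySketch {
    public static Map<String, Integer> groupAndSum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // groupByKey step
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        Map<String, Integer> combined = new TreeMap<>();      // combineValues step
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            combined.put(e.getKey(), sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        System.out.println(groupAndSum(List.of(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3)))); // {a=4, b=2}
    }
}
```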
The org.apache.crunch.lib.Aggregate package exposes methods to perform simple aggregations (count, max, top, and length) on PCollection instances.

Sort provides an API to sort PCollection and PTable instances whose contents implement the Comparable interface.

By default, Crunch sorts data using one reducer. This behavior can be modified by passing the required number of partitions to the sort method. The Sort.Order parameter signals the order in which a sort should be done.
The following shows how different sort options can be specified for collections:

public static <T> PCollection<T> sort(PCollection<T> collection)
public static <T> PCollection<T> sort(PCollection<T> collection, Sort.Order order)
public static <T> PCollection<T> sort(PCollection<T> collection, int numReducers, Sort.Order order)
The following shows how different sort options can be specified for tables:

public static <K, V> PTable<K, V> sort(PTable<K, V> table)
public static <K, V> PTable<K, V> sort(PTable<K, V> table, Sort.Order key)
public static <K, V> PTable<K, V> sort(PTable<K, V> table, int numReducers, Sort.Order key)

Finally, sortPairs sorts a PCollection of pairs using the column order specified in Sort.ColumnOrder:
sortPairs(PCollection<Pair<U, V>> collection, Sort.ColumnOrder... columnOrders)
Joining data

The org.apache.crunch.lib.Join package is an API to join PTables based on a common key. The following four join operations are supported:

fullJoin
join (defaults to innerJoin)
leftJoin
rightJoin

The methods have a common return type and signature. For reference, we will describe the commonly used join method, which implements an inner join:
public static <K, U, V> PTable<K, Pair<U, V>> join(PTable<K, U> left, PTable<K, V> right)
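An inner join keeps only the keys present in both tables and pairs their values. A stdlib sketch of those semantics (ours, not the Crunch implementation):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of inner-join semantics: for each key present in both the left
// and the right table, emit the key with the pair of matching values.
public class InnerJoinSketch {
    public static Map<String, List<String>> innerJoin(Map<String, String> left,
                                                      Map<String, String> right) {
        Map<String, List<String>> joined = new TreeMap<>();
        for (Map.Entry<String, String> l : left.entrySet()) {
            String r = right.get(l.getKey());
            if (r != null) { // inner join: drop keys missing from either side
                joined.put(l.getKey(), List.of(l.getValue(), r));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> tf = Map.of("hadoop", "3", "crunch", "1");
        Map<String, String> df = Map.of("hadoop", "2", "pig", "5");
        System.out.println(innerJoin(tf, df)); // {hadoop=[3, 2]}
    }
}
```

A fullJoin would additionally emit unmatched keys from both sides, a leftJoin those from the left only, and a rightJoin those from the right only.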
The JoinStrategy interface in org.apache.crunch.lib provides a way to define custom join strategies. Crunch's default strategy is to join data reduce-side.
Pipelines implementation and execution

Crunch comes with three implementations of the Pipeline interface. The oldest one, implicitly used in this chapter, is org.apache.crunch.impl.mr.MRPipeline, which uses Hadoop's MapReduce as its execution engine. org.apache.crunch.impl.mem.MemPipeline allows all operations to be performed in memory, with no serialization to disk. Crunch 0.10 introduced org.apache.crunch.impl.spark.SparkPipeline, which compiles and runs a DAG of PCollections on Apache Spark.

SparkPipeline

With SparkPipeline, Crunch delegates much of the execution to Spark and performs relatively few of the planning tasks itself, with the following exceptions:

Multiple inputs
Multiple outputs
Data serialization
Checkpointing

At the time of writing, SparkPipeline is still under heavy development and might not handle all of the use cases of a standard MRPipeline. The Crunch community is actively working to ensure complete compatibility between the two implementations.

MemPipeline

MemPipeline executes in-memory on a client. Unlike MRPipeline, a MemPipeline is not explicitly created but is obtained by calling the static method MemPipeline.getInstance(). All operations are performed in memory, and the use of PTypes is minimal.
Crunch examples

We will now use Apache Crunch to reimplement some of the MapReduce code written so far in a more modular fashion.

Word co-occurrence

In Chapter 3, Processing – MapReduce and Beyond, we showed a MapReduce job, BiGramCount, to count co-occurrences of words in tweets. That same logic can be implemented as a DoFn. Instead of emitting a multi-field key and having to parse it at a later stage, with Crunch we can use the complex type Pair<String, String>, as follows:
class BiGram extends DoFn<String, Pair<String, String>> {
    @Override
    public void process(String tweet,
            Emitter<Pair<String, String>> emitter) {
        String[] words = tweet.split(" ");
        String prev = null;
        for (String s : words) {
            if (prev != null) {
                emitter.emit(Pair.of(prev, s));
            }
            prev = s;
        }
    }
}
Notice how, compared to MapReduce, the BiGram Crunch implementation is a standalone class, easily reusable in any other codebase. The code for this example is included at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/DataPreparationPipeline.java
TF-IDF

We can implement the TF-IDF chain of jobs with an MRPipeline, as follows:
public class CrunchTermFrequencyInvertedDocumentFrequency
        extends Configured implements Tool, Serializable {

    private Long numDocs;

    @SuppressWarnings("deprecation")
    public static class TF {
        String term;
        String docId;
        int frequency;

        public TF() {}
        public TF(String term,
                String docId, Integer frequency) {
            this.term = term;
            this.docId = docId;
            this.frequency = (int) frequency;
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println();
            System.err.println("Usage: " + this.getClass().getName() +
                    " [generic options] input output");
            return 1;
        }

        // Create an object to coordinate pipeline creation and execution.
        Pipeline pipeline =
                new MRPipeline(TermFrequencyInvertedDocumentFrequency.class, getConf());

        // Enable debug options.
        pipeline.enableDebug();

        // Reference a given text file as a collection of Strings.
        PCollection<String> tweets = pipeline.readTextFile(args[0]);
        numDocs = tweets.length().getValue();

        // We use Avro reflection to map the TF POJO to an avsc schema.
        PTable<String, TF> tf = tweets.parallelDo(new TermFrequencyAvro(),
                Avros.tableOf(Avros.strings(), Avros.reflects(TF.class)));

        // Calculate DF.
        PTable<String, Long> df = Aggregate.count(tf.parallelDo(new
                DocumentFrequencyString(), Avros.strings()));

        // Finally, we calculate TF-IDF.
        PTable<String, Pair<TF, Long>> tfDf = Join.join(tf, df);
        PCollection<Tuple3<String, String, Double>> tfIdf =
                tfDf.parallelDo(new TermFrequencyInvertedDocumentFrequency(),
                        Avros.triples(
                                Avros.strings(),
                                Avros.strings(),
                                Avros.doubles()));

        // Serialize as Avro.
        tfIdf.write(To.avroFile(args[1]));

        // Execute the pipeline as a MapReduce job.
        PipelineResult result = pipeline.done();
        return result.succeeded() ? 0 : 1;
    }
    …
}
The approach that we follow here has a number of advantages compared to Streaming. First of all, we don't need to manually chain MapReduce jobs using a separate script; this task is Crunch's main purpose. Secondly, we can express each component of the metric as a distinct class, making it easier to reuse in future applications.

To implement term frequency, we create a DoFn class that takes a tweet as input and emits Pair<String, TF>. The first element is a term, and the second is an instance of the POJO class that will be serialized using Avro. The TF part contains three variables: term, docId, and frequency. In the reference implementation, we expect input data to be a JSON string that we deserialize and parse. We also include tokenizing as a subtask of the process method.

Depending on the use case, we could abstract both operations into separate DoFns, as follows:
class TermFrequencyAvro extends DoFn<String, Pair<String, TF>> {
    public void process(String jsonTweet,
            Emitter<Pair<String, TF>> emitter) {
        Map<String, Integer> termCount = new HashMap<>();
        String tweet;
        String docId;

        JSONParser parser = new JSONParser();
        try {
            Object obj = parser.parse(jsonTweet);
            JSONObject jsonObject = (JSONObject) obj;

            tweet = (String) jsonObject.get("text");
            docId = (String) jsonObject.get("id_str");

            for (String term : tweet.split("\\s+")) {
                String lower = term.toLowerCase();
                if (termCount.containsKey(lower)) {
                    termCount.put(lower, termCount.get(lower) + 1);
                } else {
                    termCount.put(lower, 1);
                }
            }

            for (Entry<String, Integer> entry : termCount.entrySet()) {
                emitter.emit(Pair.of(entry.getKey(), new TF(entry.getKey(),
                        docId, entry.getValue())));
            }
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}
Document frequency is straightforward. For each Pair<String, TF> generated in the term frequency step, we emit the term (the first element of the pair). We aggregate and count the resulting PCollection of terms to obtain document frequency, as follows:
class DocumentFrequencyString extends DoFn<Pair<String, TF>, String> {
    @Override
    public void process(Pair<String, TF> tfAvro,
            Emitter<String> emitter) {
        emitter.emit(tfAvro.first());
    }
}
We finally join the TF PTable with the DF PTable on the shared key (the term) and feed the resulting Pair<String, Pair<TF, Long>> object to TermFrequencyInvertedDocumentFrequency. For each term and document, we calculate TF-IDF and return a (term, docId, tfIdf) triple:
class TermFrequencyInvertedDocumentFrequency extends MapFn<Pair<String,
        Pair<TF, Long>>, Tuple3<String, String, Double>> {
    @Override
    public Tuple3<String, String, Double> map(
            Pair<String, Pair<TF, Long>> input) {
        Pair<TF, Long> tfDf = input.second();
        Long df = tfDf.second();
        TF tf = tfDf.first();

        // Cast to double so the ratio is not truncated by long division.
        double idf = 1.0 + Math.log((double) numDocs / df);
        double tfIdf = idf * tf.frequency;
        return Tuple3.of(tf.term, tf.docId, tfIdf);
    }
}
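The arithmetic inside the map method can be checked in isolation. The following stdlib sketch (ours, with illustrative sample numbers) computes the same quantity, casting to double so the numDocs/df ratio is not truncated by integer division:

```java
public class TfIdfSketch {
    // Mirrors the MapFn arithmetic: idf = 1 + ln(numDocs / df),
    // tfIdf = idf * term frequency.
    public static double tfIdf(int frequency, long numDocs, long df) {
        double idf = 1.0 + Math.log((double) numDocs / df);
        return idf * frequency;
    }

    public static void main(String[] args) {
        // A term appearing 3 times in a document and in 10 of 100 documents:
        System.out.println(tfIdf(3, 100, 10)); // 3 * (1 + ln 10), roughly 9.91
    }
}
```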
We use MapFn because we are going to output one record for each input. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/CrunchTermFrequencyInvertedDocumentFrequency.java

The example can be compiled and executed with the following commands:
$ ./gradlew jar
$ ./gradlew copyJars
If not already done, add the Crunch and Avro dependencies downloaded with copyJars to the LIBJARS environment variable, as follows:
$ export CRUNCH_DEPS=build/libjars/crunch-example/lib
$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/crunch-core-0.9.0-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-1.7.5-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-mapred-1.7.5-cdh5.0.3-hadoop2.jar
Furthermore, add the json-simple JAR to LIBJARS:

$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/json-simple-1.1.1.jar
Finally, run CrunchTermFrequencyInvertedDocumentFrequency as a MapReduce job, as follows:

$ hadoop jar build/libs/crunch-example.jar \
com.learninghadoop2.crunch.CrunchTermFrequencyInvertedDocumentFrequency \
-libjars ${LIBJARS} \
tweets.json tweets.avro-out
Kite Morphlines

Kite Morphlines is a data transformation library, inspired by Unix pipes, originally developed as part of Cloudera Search. A morphline is an in-memory chain of transformation commands that relies on a plugin structure to tap heterogeneous data sources. It uses declarative commands to carry out ETL operations on records. Commands are defined in a configuration file, which is later fed to a driver class.

The goal is to make embedding ETL logic into any Java codebase a trivial task by providing a library that allows developers to replace programming with a series of configuration settings.
Concepts

Morphlines are built around two abstractions: Command and Record.

Records are instances of the org.kitesdk.morphline.api.Record class:
public final class Record {
    private ArrayListMultimap<String, Object> fields;
    …
    private Record(ArrayListMultimap<String, Object> fields) { … }

    public ListMultimap<String, Object> getFields() { … }
    public List get(String key) { … }
    public void put(String key, Object value) { … }
    …
}
A record is a set of named fields, where each field has a list of one or more values. Record is implemented on top of Google Guava's ListMultimap and ArrayListMultimap classes. Note that a value can be any Java object, fields can be multivalued, and two records don't need to use common field names. A record can contain an _attachment_body field that can be a java.io.InputStream or a byte array.
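The multivalued-field behavior can be approximated with a map of lists from the standard library. This sketch is ours and simply stands in for Guava's ArrayListMultimap:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a multivalued record: each named field maps to a list of
// values, so put() appends a value instead of overwriting the field.
public class RecordSketch {
    private final Map<String, List<Object>> fields = new HashMap<>();

    public void put(String key, Object value) {
        fields.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    public List<Object> get(String key) {
        return fields.getOrDefault(key, List.of());
    }

    public static void main(String[] args) {
        RecordSketch r = new RecordSketch();
        r.put("tag", "hadoop");
        r.put("tag", "crunch"); // second value for the same field
        System.out.println(r.get("tag")); // [hadoop, crunch]
    }
}
```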
Commands implement the org.kitesdk.morphline.api.Command interface:
public interface Command {
    void notify(Record notification);
    boolean process(Record record);
    Command getParent();
}
A command transforms a record into zero or more records. Commands can call the methods on the Record instance provided for read and write operations, as well as to add or remove fields.

Commands are chained together, and at each step of a morphline the parent command sends records to its child, which in turn processes them. Information between parents and children is exchanged using two communication channels (planes): notifications are sent via a control plane, and records are sent over a data plane. Records are processed by the process() method, which returns a Boolean value to indicate whether the morphline should proceed or not.
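The parent-to-child data plane can be sketched with a minimal command interface of our own. The names below are illustrative, not the Kite API, and a plain String stands in for a Record:

```java
// Sketch of a morphline-style chain: each command processes a record
// (a plain String here) and forwards the result to its child, with the
// boolean return signalling whether the chain should continue.
public class CommandChainSketch {
    interface Cmd {
        boolean process(String record);
    }

    static final StringBuilder sink = new StringBuilder(); // terminal command output

    static final Cmd collect = record -> {
        sink.append(record);
        return true;
    };

    // A transforming command wired to its child, acting as a parent.
    static Cmd uppercase(Cmd child) {
        return record -> child.process(record.toUpperCase());
    }

    public static void main(String[] args) {
        Cmd chain = uppercase(collect); // parent -> child wiring
        boolean proceed = chain.process("hello");
        System.out.println(proceed + " " + sink); // true HELLO
    }
}
```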
Commands are not instantiated directly, but via an implementation of the org.kitesdk.morphline.api.CommandBuilder interface:
public interface CommandBuilder {
    Collection<String> getNames();
    Command build(Config config,
                  Command parent,
                  Command child,
                  MorphlineContext context);
}
The getNames method returns the names with which the command can be invoked. Multiple names are supported to allow backwards-compatible name changes. The build() method creates and returns a command rooted at the given morphline configuration.

The org.kitesdk.morphline.api.MorphlineContext class allows additional parameters to be passed to all morphline commands.

The data model of morphlines is structured following a source-pipe-sink pattern, where data is captured from a source, piped through a number of processing steps, and its output is then delivered into a sink.
Morphline commands

Kite Morphlines comes with a number of default commands that implement data transformations on common serialization formats (plain text, Avro, JSON). Currently available commands are organized as subprojects of morphlines and include:

kite-morphlines-core-stdio: reads data from binary large objects (BLOBs) and text
kite-morphlines-core-stdlib: wraps Java data types for data manipulation and representation
kite-morphlines-avro: serializes data into and deserializes data from the Avro format
kite-morphlines-json: serializes and deserializes data in JSON format
kite-morphlines-hadoop-core: accesses HDFS
kite-morphlines-hadoop-parquet-avro: serializes and deserializes data in the Parquet format
kite-morphlines-hadoop-sequencefile: serializes and deserializes data in the SequenceFile format
kite-morphlines-hadoop-rcfile: serializes and deserializes data in the RCFile format
A list of all available commands can be found at http://kitesdk.org/docs/0.17.0/kite-morphlines/morphlinesReferenceGuide.html.

Commands are defined by declaring a chain of transformations in a configuration file, morphline.conf, which is then compiled and executed by a driver program. For instance, we can specify a read_tweets morphline that will load tweets stored as JSON data, serialize and deserialize them using Jackson, and print the first 10, by combining the default readJson and head commands contained in the org.kitesdk.morphline package, as follows:
morphlines : [{
  id : read_tweets
  importCommands : ["org.kitesdk.morphline.**"]
  commands : [
    {
      readJson {
        outputClass : com.fasterxml.jackson.databind.JsonNode
      }
    }
    {
      head {
        limit : 10
      }
    }
  ]
}]
We will now show how this morphline can be executed both from a standalone Java program and from MapReduce.

MorphlineDriver.java shows how to use the library embedded in a host system. The first step that we carry out in the main method is to load the morphline configuration, build a MorphlineContext object, and compile it into an instance of Command that acts as the starting node of the morphline. Note that Compiler.compile() takes a finalChild parameter; in this case, it is RecordEmitter. We use RecordEmitter to act as a sink for the morphline, by either printing a record to stdout or storing it in HDFS. In the MorphlineDriver example, we use org.kitesdk.morphline.base.Notifications to manage and monitor the morphline lifecycle in a transactional fashion.

A call to Notifications.notifyStartSession(morphline) starts the transformation chain within a transaction defined by calling Notifications.notifyBeginTransaction. Upon success, we terminate the pipeline with Notifications.notifyShutdown(morphline). In the event of failure, we roll back the transaction with Notifications.notifyRollbackTransaction(morphline) and pass an exception handler from the morphline context to the calling Java code:
public class MorphlineDriver {
    private static final class RecordEmitter implements Command {
        private final Text line = new Text();

        @Override
        public Command getParent() {
            return null;
        }

        @Override
        public void notify(Record record) {
        }

        @Override
        public boolean process(Record record) {
            line.set(record.get("_attachment_body").toString());
            System.out.println(line);
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        /* Load a morphline conf and set it up. */
        File morphlineFile = new File(args[0]);
        String morphlineId = args[1];
        MorphlineContext morphlineContext = new
                MorphlineContext.Builder().build();
        Command morphline = new Compiler().compile(morphlineFile,
                morphlineId, morphlineContext, new RecordEmitter());

        /* Prepare the morphline for execution.
         *
         * Notifications are sent through the communication channel.
         */
        Notifications.notifyBeginTransaction(morphline);

        /* Note that we are using the local filesystem, not HDFS. */
        InputStream in = new BufferedInputStream(new
                FileInputStream(args[2]));

        /* Fill in a record and pass it over. */
        Record record = new Record();
        record.put(Fields.ATTACHMENT_BODY, in);
        try {
            Notifications.notifyStartSession(morphline);
            boolean success = morphline.process(record);
            if (!success) {
                System.out.println("Morphline failed to process record: " +
                        record);
            }
            /* Commit the morphline. */
        } catch (RuntimeException e) {
            Notifications.notifyRollbackTransaction(morphline);
            morphlineContext.getExceptionHandler().handleException(e, null);
        } finally {
            in.close();
        }

        /* Shut it down. */
        Notifications.notifyShutdown(morphline);
    }
}
In this example, we load data in JSON format from the local filesystem into an InputStream object and use it to initialize a new Record instance. The RecordEmitter class contains the last processed record instance of the chain, from which we extract _attachment_body and print it to standard output. The source code for MorphlineDriver can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/kite/src/main/java/com/learninghadoop2/kite/morphlines/MorphlineDriver.java
Using the same morphline from a MapReduce job is straightforward. During the setup phase of the Mapper, we build a context that contains the instantiation logic, while the map method sets up the Record object and fires off the processing logic, as follows:
public static class ReadTweets
        extends Mapper<Object, Text, Text, NullWritable> {
    private final Record record = new Record();
    private Command morphline;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        File morphlineConf = new File(context.getConfiguration()
                .get(MORPHLINE_CONF));
        String morphlineId = context.getConfiguration()
                .get(MORPHLINE_ID);
        MorphlineContext morphlineContext =
                new MorphlineContext.Builder()
                        .build();
        morphline = new org.kitesdk.morphline.base.Compiler()
                .compile(morphlineConf,
                        morphlineId,
                        morphlineContext,
                        new RecordEmitter(context));
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        record.put(Fields.ATTACHMENT_BODY,
                new ByteArrayInputStream(
                        value.toString().getBytes("UTF8")));
        if (!morphline.process(record)) {
            System.out.println(
                    "Morphline failed to process record: " + record);
        }
        record.removeAll(Fields.ATTACHMENT_BODY);
    }
}
In the MapReduce code, we modify RecordEmitter to extract the field payload from post-processed records and store it in the Context. This allows us to write data into HDFS by specifying a FileOutputFormat in the MapReduce configuration boilerplate:
private static final class RecordEmitter implements Command {
    private final Text line = new Text();
    private final Mapper.Context context;

    private RecordEmitter(Mapper.Context context) {
        this.context = context;
    }

    @Override
    public void notify(Record notification) {
    }

    @Override
    public Command getParent() {
        return null;
    }

    @Override
    public boolean process(Record record) {
        line.set(record.get(Fields.ATTACHMENT_BODY).toString());
        try {
            context.write(line, null);
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }
        return true;
    }
}
Notice that we can now change the processing pipeline's behavior and add further data transformations by modifying morphline.conf, without needing to alter the instantiation and processing logic. The MapReduce driver source code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/kite/src/main/java/com/learninghadoop2/kite/morphlines/MorphlineDriverMapReduce.java

Both examples can be compiled from ch9/kite/ with the following commands:
$ ./gradlew jar
$ ./gradlew copyJars
We add the runtime dependencies to LIBJARS, as follows:

$ export KITE_DEPS=build/libjars/kite-example/lib
$ export LIBJARS=${LIBJARS},${KITE_DEPS}/kite-morphlines-core-0.17.0.jar,${KITE_DEPS}/kite-morphlines-json-0.17.0.jar,${KITE_DEPS}/metrics-core-3.0.2.jar,${KITE_DEPS}/metrics-healthchecks-3.0.2.jar,${KITE_DEPS}/config-1.0.2.jar,${KITE_DEPS}/jackson-databind-2.3.1.jar,${KITE_DEPS}/jackson-core-2.3.1.jar,${KITE_DEPS}/jackson-annotations-2.3.0.jar
We can run the MapReduce driver with the following:

$ hadoop jar build/libs/kite-example.jar \
com.learninghadoop2.kite.morphlines.MorphlineDriverMapReduce \
-libjars ${LIBJARS} \
morphline.conf \
read_tweets \
tweets.json \
morphlines-out
The Java standalone driver can be executed with the following command:

$ export CLASSPATH=${CLASSPATH}:${KITE_DEPS}/kite-morphlines-core-0.17.0.jar:${KITE_DEPS}/kite-morphlines-json-0.17.0.jar:${KITE_DEPS}/metrics-core-3.0.2.jar:${KITE_DEPS}/metrics-healthchecks-3.0.2.jar:${KITE_DEPS}/config-1.0.2.jar:${KITE_DEPS}/jackson-databind-2.3.1.jar:${KITE_DEPS}/jackson-core-2.3.1.jar:${KITE_DEPS}/jackson-annotations-2.3.0.jar:${KITE_DEPS}/slf4j-api-1.7.5.jar:${KITE_DEPS}/guava-11.0.2.jar:${KITE_DEPS}/hadoop-common-2.3.0-cdh5.0.3.jar
$ java -cp $CLASSPATH:./build/libs/kite-example.jar \
com.learninghadoop2.kite.morphlines.MorphlineDriver \
morphline.conf \
read_tweets tweets.json \
morphlines-out
Summary

In this chapter, we introduced four tools to ease development on Hadoop. In particular, we covered:

How Hadoop Streaming allows the writing of MapReduce jobs using dynamic languages
How Kite Data simplifies interfacing with heterogeneous data sources
How Apache Crunch provides a high-level abstraction to write pipelines of Spark and MapReduce jobs that implement common design patterns
How Morphlines allows us to declare chains of commands and data transformations that can then be embedded in any Java codebase

In Chapter 10, Running a Hadoop Cluster, we will shift our focus from the domain of software development to system administration. We will discuss how to set up, manage, and scale a Hadoop cluster, while taking aspects such as monitoring and security into consideration.
Chapter 10. Running a Hadoop Cluster

In this chapter, we will change our focus a little and look at some of the considerations you will face when running an operational Hadoop cluster. In particular, we will cover the following topics:

Why a developer should care about operations and why Hadoop operations are different
More detail on Cloudera Manager and its capabilities and limitations
Designing a cluster for use on both physical hardware and EMR
Securing a Hadoop cluster
Hadoop monitoring
Troubleshooting problems with an application running on Hadoop
I'm a developer – I don't care about operations!

Before going any further, we need to explain why we are putting a chapter about systems operations in a book squarely aimed at developers. For anyone who has developed for more traditional platforms (for example, web apps, database programming, and so on), the norm might well have been a very clear delineation between development and operations: the first group builds the code and packages it up, and the second group controls and operates the environment in which it runs.

In recent years, the DevOps movement has gained momentum, with a belief that it is best for everyone if these silos are removed and the teams work more closely together. When it comes to running systems and services based on Hadoop, we believe this is absolutely essential.
Hadoop and DevOps practices

Even though a developer can conceptually build an application ready to be dropped into YARN and forgotten about, the reality is often more nuanced. How many resources are allocated to the application at runtime is most likely something the developer wishes to influence. Once the application is running, the operations staff will likely want some insight into the application when they are trying to optimize the cluster. There really isn't the same clear-cut split of responsibilities seen in traditional enterprise IT. And that's likely a really good thing.

In other words, developers need to be more aware of the operations aspects, and the operations staff need to be more aware of what the developers are doing. So consider this chapter our contribution to helping you have those discussions with your operations staff. We don't intend to make you an expert Hadoop administrator by the end of this chapter; that really is emerging as a dedicated role and skill set in itself. Instead, we will give a whistle-stop tour of issues you do need some awareness of and that will make your life easier once your applications are running on live clusters.

By the nature of this coverage, we will be touching on a lot of topics and covering them only lightly; where any are of deeper interest, we provide links for further investigation. Just make sure you keep your operations staff involved!
Cloudera Manager

In this book, we used the Cloudera Hadoop Distribution (CDH) as the most common platform, with its convenient QuickStart virtual machine and the powerful Cloudera Manager application. With a Cloudera-based cluster, Cloudera Manager will become (at least initially) your primary interface into the system to manage and monitor the cluster, so let's explore it a little.

Note that Cloudera Manager has extensive and high-quality online documentation. We won't duplicate this documentation here; instead, we'll attempt to highlight where Cloudera Manager fits into your development and operational workflows and how it might or might not be something you want to embrace. Documentation for the latest and previous versions of Cloudera Manager can be accessed via the main Cloudera documentation page at http://www.cloudera.com/content/support/en/documentation.html.
To pay or not to pay

Before getting all excited about Cloudera Manager, it's important to consult the current documentation concerning which features are available in the free version and which ones require subscription to a paid-for Cloudera offering. If you absolutely want some of the features offered only in the paid-for version but either can't or don't wish to pay for subscription services, then Cloudera Manager, and possibly the entire Cloudera distribution, might not be a good fit for you. We'll return to this topic in Chapter 11, Where to Go Next.
ClustermanagementusingClouderaManagerUsingtheQuickStartVM,itwon’tbeobvious,butClouderaManageristheprimarytooltobeusedformanagementofallservicesinthecluster.Ifyouwanttoenableanewservice,you’lluseClouderaManager.Tochangeaconfiguration,youwillneedClouderaManager.Toupgradetothelatestrelease,youwillagainrequireClouderaManager.
Eveniftheprimarymanagementoftheclusterishandledbyoperationalstaff,asadeveloperyou’lllikelystillwanttobecomefamiliarwiththeClouderaManagerinterfacejusttolooktoseeexactlyhowtheclusterisconfigured.Ifyourjobsarerunningslowly,thenlookingintoClouderaManagertoseejusthowthingsarecurrentlyconfiguredwilllikelybeyourfirststart.ThedefaultportfortheClouderaManagerwebinterfaceis7180,sothehomepagewillusuallybeconnectedtoviaaURLsuchashttp://<hostname>:7180/cmf/home,andcanbeseeninthefollowingscreenshot:
ClouderaManagerhomepage
It’sworthpokingaroundtheinterface;however,ifyouareconnectingwithauseraccountwithadminprivileges,becareful!
Click on the Clusters link, and this will expand to give a list of the clusters currently managed by this instance of Cloudera Manager. This should tell you that a single Cloudera Manager instance can manage multiple clusters, which is very useful, especially if you have many clusters spread across development and production.
For each expanded cluster, there will be a list of the services currently running on the cluster. Click on a service, and you will see a list of additional choices. Select Configuration, and you can start browsing the detailed configuration of that particular service. Click on Actions, and you will get some service-specific options; this will usually include stopping, starting, restarting, and otherwise managing the service.
Click on the Hosts option instead of Clusters, and you can start drilling down into the servers managed by Cloudera Manager and, from there, see which service components are deployed on each.
Cloudera Manager and other management tools

That last comment might raise a question: how does Cloudera Manager integrate with other systems management tools? Given our earlier comments regarding the importance of DevOps philosophies, how well does it integrate with the tools favored in DevOps environments?
The honest answer: not always very well. Though the main Cloudera Manager server can itself be managed by automation tools, such as Puppet or Chef, there is an explicit assumption that Cloudera Manager will control the installation and configuration of all the software it needs on all the hosts that will be included in its clusters. To some administrators, this makes the hardware behind Cloudera Manager look like a big black box; they might control the installation of the base operating system, but the management of the configuration baseline going forward is entirely handled by Cloudera Manager. There's nothing much to be done here; it is what it is: to get the benefits of Cloudera Manager, it will add itself as a new management system in your infrastructure, and how well that fits in with your broader environment will be determined on a case-by-case basis.
Monitoring with Cloudera Manager

A similar point can be made regarding systems monitoring, as Cloudera Manager is also conceptually a point of duplication here. But start clicking around the interface, and it will become apparent very quickly that Cloudera Manager provides an exceptionally rich set of tools to assess the health and performance of managed clusters.
From graphing the relative performance of Impala queries, through showing the job status for YARN applications, to giving low-level data on the blocks stored on HDFS, it is all there in a single interface. We'll discuss later in this chapter how troubleshooting on Hadoop can be challenging, but the single point of visibility provided by Cloudera Manager is a great tool when looking to assess cluster health or performance. We'll discuss monitoring in a little more detail later in this chapter.
Finding configuration files

One of the first points of confusion when running a cluster managed by Cloudera Manager is trying to find the configuration files used by the cluster. In the vanilla Apache releases of products such as core Hadoop, these files would typically be stored in /etc/hadoop; similarly, /etc/hive for Hive, /etc/oozie for Oozie, and so on.
In a Cloudera Manager managed cluster, however, the config files are regenerated each time a service is restarted and, instead of sitting in the /etc locations on the filesystem, will be found at /var/run/cloudera-scm-agent/process/<pid>-<taskname>/, where the last directory might have a name such as 7007-yarn-NODEMANAGER. This might seem odd to anyone used to working on earlier Hadoop clusters or other distributions that don't do such a thing. But in a Cloudera Manager-controlled cluster, it might often be easier to use the web interface to browse the configuration instead of looking for the underlying config files. Which approach is best? This is a little philosophical, and each team needs to decide what works best for them.
Cloudera Manager API

We've only given the highest-level overview of Cloudera Manager and, in doing so, have completely ignored one area that might be very useful for some organizations: Cloudera Manager offers an API that allows integration of its capabilities into other systems and tools. Consult the documentation if this might be of interest to you.
Cloudera Manager lock-in

This brings us to the point that is implicit in the whole discussion around Cloudera Manager: it does cause a degree of lock-in to Cloudera and their distribution. That lock-in might only exist in certain ways; code, for example, should be portable across clusters, modulo the usual caveats about different underlying versions, but the cluster itself might not easily be reconfigured to use a different distribution. Assume that switching distributions would be a complete remove/reformat/reinstall activity.
Wearen’tsayingdon’tuseit,ratherthatyouneedtobeawareofthelock-inthatcomeswiththeuseofClouderaManager.Forsmallteamswithlittlededicatedoperationssupportorexistinginfrastructure,theimpactofsuchalock-inislikelyoutweighedbythesignificantcapabilitiesthatClouderaManagergivesyou.
For larger teams, or ones working in an environment where integration with existing tools and processes carries more weight, the decision might be less clear. Look at Cloudera Manager, discuss with your operations people, and determine what is right for you.
Note that it is possible to manually download and install the various components of the Cloudera distribution without using Cloudera Manager to manage the cluster and its hosts. This might be an attractive middle ground for some users, as the Cloudera software can be used, but deployment and management can be built into the existing deployment and management tools. This is also potentially a way of avoiding the additional expense of the paid-for levels of Cloudera support mentioned earlier.
Ambari – the open source alternative

Ambari is an Apache project (http://ambari.apache.org) that, in theory, provides an open source alternative to Cloudera Manager. It is the administration console for the Hortonworks distribution. At the time of writing, Hortonworks employees also make up the vast majority of the project's contributors.
Ambari, as one would expect given its open source nature, relies on other open source products, such as Puppet and Nagios, to provide the management and monitoring of its managed clusters. It also has high-level functionality similar to Cloudera Manager, that is, the installation, configuration, management, and monitoring of a Hadoop cluster and the component services within it.
It is good to be aware of the Ambari project, as the choice is not just between full lock-in to Cloudera and Cloudera Manager or a manually managed cluster. Ambari provides a graphical tool that might be worth consideration, or indeed involvement, as it matures. On an HDP cluster, the Ambari UI equivalent to the Cloudera Manager home page shown earlier can be reached at http://<hostname>:8080/#/main/dashboard and looks like the following screenshot:
Ambari
Operations in the Hadoop 2 world

As mentioned in Chapter 2, Storage, some of the most significant changes made to HDFS in Hadoop 2 involve its fault tolerance and better integration with external systems. This is not just a curiosity; the NameNode High Availability features, in particular, have made a massive difference to the management of clusters since Hadoop 1. In the bad old days of 2012 or so, a significant part of the operational preparedness of a Hadoop cluster was built around mitigations for, and restoration processes around, failure of the NameNode. If the NameNode died in Hadoop 1 and you didn't have a backup of the HDFS fsimage metadata file, then you basically lost access to all your data. If the metadata was permanently lost, then so was the data.
Hadoop 2 has added the built-in NameNode HA and the machinery to make it work. In addition, there are components such as the NFS gateway into HDFS, which make it a much more flexible system. But this additional capability does come at the expense of more moving parts. To enable NameNode HA, there are additional components in the JournalNodes and the FailoverController, and the NFS gateway requires Hadoop-specific implementations of the portmap and nfsd services.
Hadoop 2 also now has extensive other integration points with external services, as well as a much broader selection of applications and services that run atop it. Consequently, it might be useful to view Hadoop 2, in operational terms, as having traded the simplicity of Hadoop 1 for additional complexity that delivers a substantially more capable platform.
Sharing resources

In Hadoop 1, the only time one had to consider resource sharing was in deciding which scheduler to use for the MapReduce JobTracker. Since all jobs were eventually translated into MapReduce code, having a policy for resource sharing at the MapReduce level was usually sufficient to manage cluster workloads in the large.
Hadoop 2 and YARN changed this picture. As well as running many MapReduce jobs, a cluster might also be running many other applications atop other YARN ApplicationMasters. Tez and Spark are frameworks in their own right that run additional applications atop their provided interfaces.
If everything runs on YARN, then it provides ways of configuring the maximum resource allocation (in terms of CPU, memory, I/O, and so on) consumed by each container allocated to an application. The primary goal here is to ensure that enough resources are allocated to keep the hardware fully utilized, without either having unused capacity or overloading it.
Things get somewhat more interesting when non-YARN applications, such as Impala, are running on the cluster and want to grab allocated slices of capacity (particularly memory in the case of Impala). This could also happen if, say, you were running Spark on the same hosts in its non-YARN mode, or indeed any other distributed application that might benefit from co-location on the Hadoop machines.
Basically, in Hadoop 2, you need to think of the cluster as much more of a multi-tenancy environment, one that requires more attention to be given to the allocation of resources to the various tenants.
There really is no silver-bullet recommendation here; the right configuration will be entirely dependent on the services co-located and the workloads they are running. This is another example where you want to work closely with your operations team to do a series of load tests with thresholds to determine just what the resource requirements of the various clients are and which approach will give the maximum utilization and performance. The following blog post from Cloudera engineers gives a good overview of how they approach this very issue in having Impala and MapReduce coexist effectively: http://blog.cloudera.com/blog/2013/06/configuring-impala-and-mapreduce-for-multi-tenant-performance/.
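As a toy illustration of the kind of static carve-up such load tests might converge on, consider splitting a worker node's memory between the OS and daemons, a non-YARN tenant such as Impala, and YARN. All numbers here are invented for illustration, not a recommendation:

```python
def carve_up_node_memory(total_gb, os_and_daemons_gb, impala_gb):
    """Toy static partition of a worker node's memory between the OS,
    a non-YARN tenant such as Impala, and YARN's NodeManager.
    Whatever remains after the fixed reservations goes to YARN."""
    yarn_gb = total_gb - os_and_daemons_gb - impala_gb
    if yarn_gb <= 0:
        raise ValueError("node is oversubscribed")
    return {"os": os_and_daemons_gb, "impala": impala_gb, "yarn": yarn_gb}

# A hypothetical 64 GB worker co-hosting Impala:
print(carve_up_node_memory(64, 8, 16))   # {'os': 8, 'impala': 16, 'yarn': 40}
```

In practice, the right split only emerges from load testing the actual workloads, as the text stresses; a static carve-up like this is merely the starting point you then tune.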
Building a physical cluster

There is one minor prerequisite before thinking about the allocation of hardware resources: defining and selecting the hardware used for your cluster. In this section, we'll discuss a physical cluster and move on to Amazon EMR in the next.
Any specific hardware advice will be out of date the moment it is written. We advise perusing the websites of the various Hadoop distribution vendors, as they regularly write new articles on the currently recommended configurations.
Instead of telling you how many cores or GB of memory you need, we'll look at hardware selection at a slightly higher level. The first thing to realize is that the hosts running your Hadoop cluster will most likely look very different from the rest of your enterprise. Hadoop is optimized for low(er)-cost hardware, so instead of seeing a small number of very large servers, expect to see a larger number of machines with fewer enterprise reliability features. But don't think that Hadoop will run great on any junk you have lying around. It might, but recently the profile of typical Hadoop servers has been moving away from the bottom end of the market; instead, the sweet spot would seem to be mid-range servers where the maximum cores/disks/memory can be achieved at a given price point.
You should also expect to have different resource requirements for the hosts running services such as the HDFS NameNode or the YARN ResourceManager, as opposed to the worker nodes storing data and executing the application logic. For the former, there is usually much less requirement for lots of storage but, frequently, a need for more memory and possibly faster disks.
For Hadoop worker nodes, the ratio between the three main hardware categories of cores, memory, and I/O is often the most important thing to get right, and this will directly inform the decisions you make regarding workload and resource allocation.
For example, many workloads tend to become I/O bound, and having many times as many containers allocated on a host as there are physical disks might actually cause an overall slowdown due to contention for the spinning disks. At the time of writing, current recommendations here are for the number of YARN containers to be no more than 1.8 times the number of disks. If you have workloads that are I/O bound, then you will most likely get much better performance by adding more hosts to the cluster instead of trying to get more containers running, or faster processors or more memory, on the current hosts.
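The 1.8x rule of thumb above is easy to mechanize as a quick sizing check. A minimal sketch (the disk counts below are invented examples, not recommendations):

```python
def max_yarn_containers(num_disks, ratio=1.8):
    """Rule-of-thumb cap on concurrent YARN containers per host for
    I/O-bound workloads: no more than ~1.8x the number of physical disks."""
    return int(num_disks * ratio)

# A hypothetical worker node with 12 spinning disks:
print(max_yarn_containers(12))   # 21
# And one with 10 disks:
print(max_yarn_containers(10))   # 18
```

If a host is already at this cap and jobs are still I/O bound, the text's advice applies: add hosts rather than containers.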
Conversely, if you expect to run lots of concurrent Impala, Spark, and other memory-hungry jobs, then memory might quickly become the resource most under pressure. This is why, even though you can get current hardware recommendations for general-purpose clusters from the distribution vendors, you still need to validate against your expected workloads and tailor accordingly. There is really no substitute for benchmarking on a small test cluster, or indeed on EMR, which can be a great platform to explore the resource requirements of multiple applications and thereby inform hardware acquisition decisions. Perhaps EMR might be your main environment; if so, we'll discuss that in a later section.
Physical layout

If you do use a physical cluster, there are a few things you will need to consider that are largely transparent on EMR.
Rack awareness

The first of these aspects, for clusters large enough to consume more than one rack of data center space, is building rack awareness. As mentioned in Chapter 2, Storage, when HDFS places replicas of new files, it attempts to place the second replica on a different host than the first, and the third in a different rack of equipment in a multi-rack system. This heuristic is aimed at maximizing resilience; there will be at least one replica available even if an entire rack of equipment fails. MapReduce uses similar logic to attempt to get a better-balanced task spread.
If you do nothing, then each host will be specified as being in the single default rack. But if the cluster grows beyond a single rack, you will need to update the rack names.
Under the covers, Hadoop discovers a node's rack by executing a user-supplied script that maps node hostnames to rack names. Cloudera Manager allows rack names to be set on a given host, and these are then returned when its rack awareness scripts are called by Hadoop. To set the rack for a host, click on Hosts -> <hostname> -> Assign Rack from the Cloudera Manager home page, and then assign the rack.
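Outside of Cloudera Manager, that user-supplied script is wired in via Hadoop's topology-script configuration: it is invoked with one or more hostnames or IPs as arguments and must print one rack path per line. The sketch below is a minimal example; the hostnames and rack names are invented, and a real script would be generated from your host inventory:

```python
#!/usr/bin/env python
# Minimal rack-topology script sketch: Hadoop invokes it with one or
# more hostnames/IPs as arguments and reads one rack path per line.
import sys

# Hypothetical host-to-rack mapping (invented names for illustration).
RACKS = {
    "worker01.example.com": "/dc1/rack1",
    "worker02.example.com": "/dc1/rack2",
}

def rack_for(host):
    # Hosts not listed fall back to Hadoop's default rack.
    return RACKS.get(host, "/default-rack")

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```

Hosts that the script cannot identify must still produce an answer, hence the /default-rack fallback; a script that prints nothing for a host will cause errors.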
Service layout

As mentioned earlier, you are likely to have two types of hardware in your cluster: the machines running the workers and those running the servers. When deploying a physical cluster, you will need to decide which services, and which subcomponents of those services, run on which physical machines.
For the workers, this is usually pretty straightforward; most, though not all, services have a model of a worker agent on all worker hosts. But for the master/server components, it requires a little thought. If you have three master nodes, then how do you spread your primary and backup NameNodes, the YARN ResourceManager, maybe Hue, a few Hive servers, and an Oozie manager? Some of these services are highly available, while others are not. As you add more and more services to your cluster, you'll also see this list of master services grow substantially.
In an ideal world, you might have a host per service master, but that is only tractable for very large clusters; in smaller installations, it is prohibitively expensive, and it might always be a little wasteful. There are no hard-and-fast rules here either, but do look at your available hardware and try to spread the services across the nodes as much as possible. Don't, for example, have two nodes for the two NameNodes and then put everything else on a third. Think about the impact of a single host failure and manage the layout to minimize it. As the cluster grows across multiple racks of equipment, the considerations will also need to include how to survive single-rack failures. Hadoop itself helps with this, since HDFS will attempt to ensure each block of data has replicas across at least two racks. But this type of resilience is undermined if, for example, all the master nodes reside in a single rack.
Upgrading a service

Upgrading Hadoop has historically been a time-consuming and somewhat risky task. This remains the case on a manually deployed cluster, that is, one not managed by a tool such as Cloudera Manager.
If you are using Cloudera Manager, then it takes the time-consuming part out of the activity, but not necessarily the risk. Any upgrade should always be viewed as an activity with a high chance of unexpected issues, and you should arrange enough cluster downtime to account for this surprise excitement. There's really no substitute for doing a test upgrade on a test cluster, which underlines the importance of thinking about Hadoop as a component of your environment that needs to be treated with a deployment lifecycle like any other.
Sometimes an upgrade requires modification to the HDFS metadata or might otherwise affect the filesystem. This is, of course, where the real risks lie. In addition to running a test upgrade, be aware of the ability to put HDFS into upgrade mode, which effectively takes a snapshot of the filesystem state prior to the upgrade; this is retained until the upgrade is finalized. This can be really helpful, as even an upgrade that goes badly wrong and corrupts data can potentially be fully rolled back.
Building a cluster on EMR

Elastic MapReduce is a flexible solution that, depending on requirements and workloads, can sit next to, or replace, a physical Hadoop cluster. As we've seen so far, EMR provides clusters preloaded and configured with Hive, Streaming, and Pig, as well as custom JAR clusters that allow the execution of MapReduce applications.
A second distinction to make is between transient and long-running lifecycles. A transient EMR cluster is generated on demand; data is loaded into S3 or HDFS, some processing workflow is executed, output results are stored, and the cluster is automatically shut down. A long-running cluster is kept alive once the workflow terminates, and the cluster remains available for new data to be copied over and new workflows to be executed. Long-running clusters are typically well suited to data warehousing, or to working with datasets large enough that repeatedly loading and processing the data on a transient instance would be inefficient.
In a must-read whitepaper for prospective users (found at https://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf), Amazon gives a heuristic to estimate which cluster type is a better fit, as follows:
If number of jobs per day * (time to set up cluster, including Amazon S3 data load time if using Amazon S3, + data processing time) < 24 hours, consider transient Amazon EMR clusters or physical instances.

Long-running instances are instantiated by passing the --alive argument to the elastic-mapreduce command, which enables the KeepAlive option and disables auto-termination.
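Amazon's heuristic is easy to mechanize as a planning aid. The sketch below encodes it directly; the job counts and durations are invented inputs for illustration, not figures from the whitepaper:

```python
def prefer_transient(jobs_per_day, setup_hours, processing_hours):
    """Amazon's rule of thumb: if the total daily cluster time (setup,
    including any S3 data load, plus processing) is under 24 hours,
    a transient cluster is likely the better fit."""
    return jobs_per_day * (setup_hours + processing_hours) < 24

# 4 jobs a day, ~15 minutes setup and 2 hours processing each (9 hours total):
print(prefer_transient(4, 0.25, 2.0))    # True - transient makes sense
# 10 jobs a day at ~3 hours each keeps the cluster busy past 24 hours:
print(prefer_transient(10, 0.5, 2.5))    # False - consider long-running
```

The crossover point is simply whether the workload would keep a cluster busy around the clock; past that, paying to keep the cluster alive costs no more than repeatedly provisioning it.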
Note that transient and long-running clusters share the same properties and limitations; in particular, data on HDFS is not persisted once the cluster is shut down.
Considerations about filesystems

In our examples so far, we assumed data to be available in S3. In this case, a bucket is mounted in EMR as an s3n filesystem, and it is used as an input source as well as a temporary filesystem to store intermediate data in computations. With S3, we introduce potential I/O overhead; operations such as reads and writes fire off GET and PUT HTTP requests.
Note

Note that EMR does not support S3 block storage. The s3 URI maps to s3n.
Another option would be to load data into the cluster's HDFS and run processing from there. In this case, we do have faster I/O and data locality, but we lose persistence: when the cluster is shut down, our data disappears. As a rule of thumb, if you are running a transient cluster, it makes sense to use S3 as a backend. In practice, one should monitor and take decisions based on the workflow characteristics. Iterative, multi-pass MapReduce jobs would greatly benefit from HDFS; one could argue that, for those types of workflows, an execution engine like Tez or Spark would be more appropriate.
![Page 415: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,](https://reader034.vdocuments.mx/reader034/viewer/2022042601/5f6d81b6e74f844b7d70c95b/html5/thumbnails/415.jpg)
Getting data into EMR

When copying data from HDFS to S3, it is recommended to use s3distcp (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html) instead of Apache distcp or Hadoop distcp. This approach is also suitable for transferring data within EMR and from S3 to HDFS. To move very large amounts of data from local disk into S3, Amazon recommends parallelizing the workload using Jets3t or GNU Parallel. In general, it's important to be aware that PUT requests to S3 are capped at 5 GB per file. To upload larger files, one needs to rely on Multipart Upload (https://aws.amazon.com/about-aws/whats-new/2010/11/10/Amazon-S3-Introducing-Multipart-Upload/), an API that allows splitting large files into smaller parts and reassembling them once uploaded. Files can also be copied with tools such as the AWS CLI or the popular S3CMD utility, but these do not have the parallelism advantages of s3distcp.
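The 5 GB single-PUT cap makes it easy to check ahead of time whether Multipart Upload is needed, and how many parts a chosen part size implies. A back-of-the-envelope sketch (the file and part sizes are invented examples):

```python
import math

SINGLE_PUT_CAP = 5 * 1024**3   # S3 caps a single PUT at 5 GB

def multipart_plan(file_size_bytes, part_size_bytes):
    """Return (needs_multipart, number_of_parts) for an S3 upload:
    multipart is required past the single-PUT cap, and the part count
    is simply the file size divided by the part size, rounded up."""
    needs_multipart = file_size_bytes > SINGLE_PUT_CAP
    parts = math.ceil(file_size_bytes / part_size_bytes)
    return needs_multipart, parts

# A hypothetical 12 GB file uploaded in 100 MB parts:
print(multipart_plan(12 * 1024**3, 100 * 1024**2))   # (True, 123)
```

Tools such as s3distcp and the AWS CLI handle this splitting for you; the arithmetic is only worth doing by hand when scripting uploads directly against the API.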
EC2 instances and tuning

The size of an EMR cluster depends on the dataset size, the number of files and blocks (which determines the number of splits), and the type of workload (try to avoid spilling to disk when a task runs out of memory). As a rule of thumb, a good size is one that maximizes parallelism. The number of mappers and reducers per instance, as well as the heap size per JVM daemon, is generally configured by EMR when the cluster is provisioned and tuned in the event of changes in the available resources.
Cluster tuning

In addition to the previous comments specific to a cluster run on EMR, there are some general points to keep in mind when running workloads on any type of cluster. These will, of course, be more explicit when running outside of EMR, as EMR often abstracts away some of the details.
JVM considerations

You should be running the 64-bit version of a JVM and using server mode. This can take longer to produce optimized code, but it also uses more aggressive strategies and will re-optimize code over time. This makes it a much better fit for long-running services, such as Hadoop processes.
Ensure that you allocate enough memory to the JVM to prevent overly frequent garbage collection (GC) pauses. The concurrent mark-and-sweep collector is currently the most tested and recommended for Hadoop. The Garbage First (G1) collector has become the GC option of choice in numerous other workloads since its introduction with JDK 7, so it's worth monitoring recommended best practice as it evolves. These options can be configured as custom Java arguments within each service's configuration section of Cloudera Manager.
The small files problem

Heap allocation to Java processes on worker nodes will be something you consider when thinking about service co-location. But there is a particular situation regarding the NameNode that you should be aware of: the small files problem.
Hadoop is optimized for very large files with large block sizes. But sometimes, particular workloads or data sources push many small files onto HDFS. This is most likely suboptimal, as it means each task processing a block at a time will read only a small amount of data before completing, causing inefficiency.
Having many small files also consumes more NameNode memory; it holds in memory the mapping from files to blocks and consequently holds metadata for each file and block. If the number of files, and hence blocks, increases quickly, then so will the NameNode memory usage. This is likely to hit only a subset of systems as, at the time of writing, 1 GB of memory can support 2 million files or blocks, but with a default heap size of 2 or 4 GB, this limit can easily be reached. If the NameNode needs to start running garbage collection very aggressively, or eventually runs out of memory, then your cluster will be very unhealthy. The short-term mitigation is to assign more heap to the JVM; the longer-term approach is to combine many small files into a smaller number of larger ones, ideally compressed with a splittable compression codec.
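The 1 GB per 2 million files-or-blocks figure quoted above can be turned into a rough capacity check. This is strictly a back-of-the-envelope sketch using that single ratio; real NameNode usage also varies with path lengths and replication, and the file counts below are invented:

```python
# Rough NameNode sizing from the ~2 million files-or-blocks per 1 GB
# of heap figure quoted in the text.
OBJECTS_PER_GB = 2000000

def namenode_heap_gb(num_files, num_blocks):
    """Estimated NameNode heap (in GB) needed for a given object count."""
    return (num_files + num_blocks) / OBJECTS_PER_GB

# 10 million small files, each small enough to occupy a single block:
print(namenode_heap_gb(10000000, 10000000))   # 10.0
```

Ten gigabytes of heap for ten million small files is well past a 2 or 4 GB default, which is exactly why consolidating small files matters more than it first appears.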
Map and reduce optimizations

Mappers and reducers both provide areas for optimizing performance; here are a few pointers to consider:
- The number of mappers depends on the number of splits. When files are smaller than the default block size or are compressed using a non-splittable format, the number of mappers will equal the number of files. Otherwise, the number of mappers is given by the total size of each file divided by the block size.
- Compress the mappers' output to reduce writes to disk and improve I/O. LZO is a good format for this task.
- Avoid spilling to disk: the mappers should have enough memory to retain as much data as possible.
- Number of reducers: it is recommended that you use fewer reducers than the total reducer capacity (this avoids execution waits).
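The first pointer above can be expressed as a quick estimate. A minimal sketch, assuming a 128 MB HDFS block size; the file sizes below are invented examples:

```python
import math

BLOCK_SIZE = 128 * 1024**2   # assume a 128 MB HDFS block size

def estimate_mappers(file_sizes, splittable=True):
    """Approximate mapper count: one per file for non-splittable inputs,
    otherwise one per block of each file (small files still get one)."""
    if not splittable:
        return len(file_sizes)
    return sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)

# One 1 GB splittable file -> 8 blocks -> 8 mappers:
print(estimate_mappers([1024**3]))   # 8
# The same 1 GB as 1000 tiny non-splittable (e.g. gzip) files -> 1000 mappers:
print(estimate_mappers([1024**2] * 1000, splittable=False))   # 1000
```

The second case illustrates both the small files problem and the splittability point at once: the same volume of data spawns over a hundred times as many tasks.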
Security

Once you built a cluster, the first thing you thought about was how to secure it, right? Don't worry; most people don't. But as Hadoop has moved on from running in-house analysis in the research department to directly driving critical systems, security is not something to ignore for too long.
Securing Hadoop is not something to be done on a whim or without significant testing. We cannot give detailed advice on this topic, and we cannot stress strongly enough the need to take it seriously and do it properly. It might consume time, it might cost money, but weigh this against the cost of having your cluster compromised.
Security is also a much bigger topic than just the Hadoop cluster. We'll explore some of the security features available in Hadoop, but you do need a coherent security strategy into which these discrete components fit.
Evolution of the Hadoop security model

In Hadoop 1, there was effectively no security protection, as the provided security model had obvious attack vectors. The Unix user ID with which you connected to the cluster was assumed to be valid, and you had all the privileges of that user. Plainly, this meant that anyone with administrative access on a host that could access the cluster could effectively impersonate any other user.
This led to the development of the so-called "head node" access model, whereby the Hadoop cluster was firewalled off from every host except one, the head node, and all access to the cluster was mediated through this centrally controlled node. This was an effective mitigation for the lack of a real security model and can still be useful even in situations where richer security schemes are utilized.
Beyond basic authorization

Core Hadoop has had additional security features added which address the previous concerns. In particular, they address the following:
- A cluster can require a user to authenticate via Kerberos and prove they are who they say they are.
- In secure mode, the cluster can also use Kerberos for all node-to-node communications, ensuring that all communicating nodes are authenticated and preventing malicious nodes from attempting to join the cluster.
- To ease management, users can be collected into groups against which data-access privileges can be defined. This is called Role-Based Access Control (RBAC) and is a prerequisite for a secure cluster with more than a handful of users. The user-group mappings can be retrieved from corporate systems, such as LDAP or Active Directory.
- HDFS can apply ACLs to replace the current Unix-inspired owner/group/world model.
These capabilities give Hadoop a significantly stronger security posture than in the past, but the community is moving fast, and additional dedicated Apache projects have emerged to address specific areas of security.
Apache Sentry (https://sentry.incubator.apache.org) is a system to provide much finer-grained authorization to Hadoop data and services. Other services build Sentry mappings, and this allows, for example, specific restrictions to be placed not only on particular HDFS directories, but also on entities such as Hive tables.
Whereas Sentry focuses on providing much richer tools for the internal, fine-grained aspects of Hadoop security, Apache Knox (http://knox.apache.org) provides a secure gateway to Hadoop that integrates with external identity management systems and provides access control mechanisms to allow or disallow access to specific Hadoop services and operations. It does this by presenting a REST-only interface to Hadoop and securing all calls to this API.
The future of Hadoop security

There are many other developments happening in the Hadoop world. Core Hadoop 2.5 added extended file attributes to HDFS, which can be used as the basis of additional access control mechanisms. Future versions will incorporate capabilities for better support of encryption for data in transit as well as at rest, and the Project Rhino initiative led by Intel (https://github.com/intel-hadoop/project-rhino/) is building out richer support for filesystem cryptographic modules, a secure filesystem, and, at some point, a fuller key-management infrastructure.
The Hadoop distribution vendors are moving fast to add these capabilities to their releases, so if you care about security (you do, don't you?), consult the documentation for the latest release of your distribution. New security features are being added even in point updates rather than being delayed until major upgrades.
Consequences of using a secured cluster

After teasing you with all the security goodness that is now available and still to come, it's only fair to give some words of warning. Security is often hard to do correctly, and a false sense of security from a buggy deployment is often worse than knowing you have no security at all.
However, even if you do it right, there are consequences to running a secure cluster. It certainly makes things harder for administrators, and often for users, so there is definitely an overhead. Specific Hadoop tools and services will also work differently depending on what security is employed on a cluster.
Oozie, which we discussed in Chapter 8, Data Lifecycle Management, uses its own delegation tokens behind the scenes. This allows the oozie user to submit jobs that are then executed on behalf of the originally submitting user. In a cluster using only the basic authorization mechanism, this is very easily configured, but using Oozie in a secure cluster will require additional logic to be added to the workflow definitions and the general Oozie configuration. This isn't a problem with Hadoop or Oozie; just as with the additional complexity resulting from the much better HA features of HDFS in Hadoop 2, better security mechanisms simply have costs and consequences that you need to take into consideration.
Monitoring

Earlier in this chapter, we discussed Cloudera Manager as a visual monitoring tool and hinted that it could also be programmatically integrated with other monitoring systems. But before plugging Hadoop into any monitoring framework, it's worth considering just what it means to operationally monitor a Hadoop cluster.
Hadoop – where failures don't matter

Traditional systems monitoring tends to be quite a binary tool; generally speaking, either something is working or it isn't. A host is alive or dead, and a web server is responding or it isn't. But in the Hadoop world, things are a little different; what matters is service availability, and a service can still be treated as live even if particular pieces of hardware or software have failed. No Hadoop cluster should be in trouble if a single worker node fails. As of Hadoop 2, even the failure of the server processes, such as the NameNode, shouldn't really be a concern if HA is configured. So, any monitoring of Hadoop needs to take into account the health of services and not that of specific host machines, which should be unimportant. Operations people on a 24/7 pager are not going to be happy getting paged at 3 a.m. to discover that one worker node in a cluster of 10,000 has failed. Indeed, once the scale of the cluster increases beyond a certain point, the failure of individual pieces of hardware becomes an almost commonplace occurrence.
Monitoring integration

You won't be building your own monitoring tools; instead, you will most likely want to integrate with existing tools and frameworks. For popular open source monitoring tools, such as Nagios and Zabbix, there are multiple sample templates to integrate Hadoop's service-wide and node-specific metrics.
This can give the sort of separation hinted at previously; the failure of the YARN ResourceManager would be a high-criticality event that should most likely cause alerts to be sent to operations staff, but a high load on specific hosts should only be captured and should not cause alerts to be fired. This provides the duality of firing alerts when bad things happen in addition to capturing the information needed to delve into system data over time for trend analysis.
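The routing policy can be made concrete with a toy sketch: service-level failures page the on-call staff, while host-level events are only recorded for later trend analysis. All event names and the class below are invented for illustration; a real deployment would encode this in the monitoring tool's configuration, not in application code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the alert-versus-capture duality described above: only
// service-wide problems page anyone; host-level events are merely recorded.
public class EventRouter {

    enum Scope { SERVICE, HOST }

    static String route(Scope scope, String event) {
        // Only a service-wide failure justifies waking someone at 3 a.m.
        return scope == Scope.SERVICE ? "PAGE: " + event : "RECORD: " + event;
    }

    public static void main(String[] args) {
        List<String> actions = new ArrayList<>();
        actions.add(route(Scope.SERVICE, "ResourceManager down"));
        actions.add(route(Scope.HOST, "High load on one worker"));
        actions.forEach(System.out::println);
    }
}
```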
Cloudera Manager provides a REST interface, which is another point of integration; tools such as Nagios can pull the Cloudera Manager-defined service-level metrics instead of having to define their own.
For heavier-weight enterprise monitoring infrastructure built on frameworks such as IBM Tivoli or HP OpenView, Cloudera Manager can also deliver events via SNMP traps that will be collected by these systems.
Application-level metrics

At times, you might also want your applications to gather metrics that can be centrally captured within the system. The mechanisms for this differ from one computational model to another, but the most well known are the application counters available within MapReduce.
When a MapReduce job completes, it outputs a number of counters, gathered by the system throughout the job execution, that deal with metrics such as the number of map tasks, bytes written, failed tasks, and so on. You can also write application-specific metrics that will be available alongside the system counters and that are automatically aggregated across the map/reduce execution. First, define a Java enum and name your desired metrics within it, as follows:
public enum AppMetrics {
    MAX_SEEN,
    MIN_SEEN,
    BAD_RECORDS
};
Then, within the map, reduce, setup, and cleanup methods of your Map or Reduce implementations, you can do something like the following to increment a counter by one:
context.getCounter(AppMetrics.BAD_RECORDS).increment(1);
Refer to the JavaDoc of the org.apache.hadoop.mapreduce.Counter interface for more details of this mechanism.
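The aggregation semantics are simple: each task attempt increments its own local counters, and the framework sums them into job-wide totals. The following stdlib-only sketch (no Hadoop dependency; the per-task increments are invented sample data) illustrates that summation, which Hadoop performs for you automatically:

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Stdlib-only illustration of how per-task counter increments are summed
// into job-level totals; this is the aggregation Hadoop applies itself.
public class CounterAggregation {

    enum AppMetrics { MAX_SEEN, MIN_SEEN, BAD_RECORDS }

    // Sum one task's local counters into the job-wide totals.
    static void merge(Map<AppMetrics, Long> totals, Map<AppMetrics, Long> task) {
        task.forEach((metric, value) -> totals.merge(metric, value, Long::sum));
    }

    static Map<AppMetrics, Long> aggregate(List<Map<AppMetrics, Long>> tasks) {
        Map<AppMetrics, Long> totals = new EnumMap<>(AppMetrics.class);
        tasks.forEach(task -> merge(totals, task));
        return totals;
    }

    public static void main(String[] args) {
        // Two map tasks, each having counted some bad records locally.
        Map<AppMetrics, Long> task1 = new EnumMap<>(AppMetrics.class);
        task1.put(AppMetrics.BAD_RECORDS, 3L);
        Map<AppMetrics, Long> task2 = new EnumMap<>(AppMetrics.class);
        task2.put(AppMetrics.BAD_RECORDS, 2L);
        System.out.println(aggregate(List.of(task1, task2))); // {BAD_RECORDS=5}
    }
}
```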
Troubleshooting

Monitoring and logging counters or additional information is all well and good, but it can be daunting to actually find the information you need when troubleshooting a problem with an application. In this section, we will look at how Hadoop stores logs and system information. We can distinguish three types of logs, as follows:

- YARN applications, including MapReduce jobs
- Daemon logs (NameNode and ResourceManager)
- Services that log non-distributed workloads, for example, HiveServer2 logging to /var/log
Alongside these log types, Hadoop exposes a number of metrics at the filesystem level (storage availability, replication factor, and number of blocks) and at the system level. As mentioned, both Apache Ambari and Cloudera Manager do a nice job as frontends that centralize access to debug information. However, under the hood, each service logs to either HDFS or the single-node filesystem. Furthermore, YARN, MapReduce, and HDFS expose their log files and metrics via web interfaces and programmatic APIs.
Logging levels

Hadoop logs messages via Log4j by default. Log4j is configured via log4j.properties in the classpath. This file defines both what is logged and with which layout:
log4j.rootLogger=${root.logger}
root.logger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
The default root logger is INFO,console, which logs all messages at level INFO and above to the console's stderr. Individual applications deployed on Hadoop can ship their own log4j.properties and set the level and other properties of their emitted logs as required.
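For example, an application could ship a log4j.properties that raises only its own classes to DEBUG while keeping the root logger at INFO. The package name below is, of course, hypothetical:

```properties
# Root logger stays at INFO to avoid flooding the logs
log4j.rootLogger=INFO,console
# Hypothetical application package; only its messages are logged at DEBUG
log4j.logger.com.example.myjob=DEBUG
```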
Hadoop daemons have a web page to get and set the log level for any Log4j property. This interface is exposed by the /logLevel endpoint in each service web UI. To enable debug logging for the ResourceManager class, we visit http://resourcemanagerhost:8088/logLevel, as shown in the following screenshot:
Getting and setting the log level on ResourceManager
Alternatively, the hadoop daemonlog <host:port> command interfaces with the service's /logLevel endpoint. We can inspect the level associated with mapreduce.map.log.level for the ResourceManager using the -getlevel <property> parameter, as follows:
$ hadoop daemonlog -getlevel localhost.localdomain:8088 mapreduce.map.log.level
Connecting to http://localhost.localdomain:8088/logLevel?log=mapreduce.map.log.level
Submitted Log Name: mapreduce.map.log.level
Log Class: org.apache.commons.logging.impl.Log4JLogger
Effective level: INFO
The effective level can be modified using the -setlevel <property> <level> option:
$ hadoop daemonlog -setlevel localhost.localdomain:8088 mapreduce.map.log.level DEBUG
Connecting to http://localhost.localdomain:8088/logLevel?log=mapreduce.map.log.level&level=DEBUG
Submitted Log Name: mapreduce.map.log.level
Log Class: org.apache.commons.logging.impl.Log4JLogger
Submitted Level: DEBUG
Setting Level to DEBUG ...
Effective level: DEBUG
Note that this setting will affect all logs produced by the ResourceManager class. This includes system-generated entries as well as those generated by applications running on YARN.
Access to log files

Log file locations and naming conventions are likely to differ between distributions. Apache Ambari and Cloudera Manager centralize access to log files, both for services and individual applications. On Cloudera's QuickStart VM, an overview of the currently running processes, with links to their log files and the stderr and stdout channels, can be found at http://localhost.localdomain:7180/cmf/hardware/hosts/1/processes, as shown in the following screenshot:
Access to log resources in Cloudera Manager
Ambari provides a similar overview via the Services dashboard, found at http://127.0.0.1:8080/#/main/services on the HDP Sandbox, as shown in the following screenshot:
Access to log resources on Apache Ambari
Non-distributed logs are usually found under /var/log/<service> on each cluster node. The locations of YARN container and MRv2 logs also depend on the distribution. On CDH5, these resources are available in HDFS under /tmp/logs/<user>.
The standard way to access distributed logs is either via command-line tools or through the services' web UIs.
For instance, the command is as follows:
$ yarn application -list -appStates ALL
The preceding command will list all running and retired YARN applications. The URL in the Tracking-URL column points to a web interface that exposes the task log, as follows:
14/08/03 14:44:38 INFO client.RMProxy: Connecting to ResourceManager at localhost.localdomain/127.0.0.1:8032
Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]): 4
Application-Id                  Application-Name         Application-Type  User      Queue          State     Final-State  Progress  Tracking-URL
application_1405630696162_0002  PigLatin:DefaultJobName  MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0002
application_1405630696162_0004  PigLatin:DefaultJobName  MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0004
application_1405630696162_0003  PigLatin:DefaultJobName  MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0003
application_1405630696162_0005  PigLatin:DefaultJobName  MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0005
For instance, http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0002, a link to a task belonging to user cloudera, is a frontend to the content stored under hdfs:///tmp/logs/cloudera/logs/application_1405630696162_0002/.
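Based on that layout (aggregated logs under /tmp/logs/<user>/logs/<application-id>, the CDH5 default; other distributions may use a different prefix), the mapping from a MapReduce job ID to its aggregated-log directory can be sketched in a few lines:

```java
// Sketch of the job-ID to aggregated-log-path mapping described above.
// The /tmp/logs prefix is the CDH5 default and varies between distributions.
public class LogPaths {

    // A MapReduce job_<cluster-ts>_<seq> ID maps to application_<cluster-ts>_<seq>.
    static String applicationId(String jobId) {
        return jobId.replaceFirst("^job_", "application_");
    }

    static String aggregatedLogDir(String user, String jobId) {
        return "hdfs:///tmp/logs/" + user + "/logs/" + applicationId(jobId) + "/";
    }

    public static void main(String[] args) {
        System.out.println(aggregatedLogDir("cloudera", "job_1405630696162_0002"));
        // hdfs:///tmp/logs/cloudera/logs/application_1405630696162_0002/
    }
}
```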
In the following sections, we will give an overview of the available UIs for different services.
Note: Provisioning an EMR cluster with the --log-uri s3://<bucket> option will ensure that Hadoop logs are copied into the s3://<bucket> location.
ResourceManager, NodeManager, and ApplicationManager

On YARN, the ResourceManager web UI provides information and general job statistics of the Hadoop cluster, running/completed/failed jobs, and a job history log file. By default, the UI is exposed at http://<resourcemanagerhost>:8088/ and can be seen in the following screenshot:
ResourceManager
Applications

On the left-hand sidebar, it is possible to review applications by the status of interest: NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHING, FINISHED, FAILED, or KILLED. Depending on the application status, the following information is available:
- The application ID
- The submitting user
- The application name
- The scheduler queue in which the application is placed
- Start/finish times and state
- A link to the Tracking UI for application history
In addition, the Cluster Metrics view gives you information on the following:
- Overall application status
- Number of running containers
- Memory usage
- Node status
Nodes

The Nodes view is a frontend to the NodeManager service menu, which shows health and location information on each node's running applications, as follows:
Nodes status
Each individual node of the cluster exposes further information and statistics at the host level via its own UI. These include which version of Hadoop is running on the node, how much memory is available on the node, the node status, and a list of running applications and containers, as shown in the following screenshot:
Single node info
Scheduler

The following screenshot shows the Scheduler window:
Scheduler
MapReduce

Though the same information and logging details are available in MapReduce v1 and MapReduce v2, the access modality is slightly different.
MapReduce v1

The following screenshot shows the MapReduce JobTracker UI:
The JobTracker UI
The JobTracker UI, available by default at http://<jobtracker>:50030, exposes information on all currently running as well as retired MapReduce jobs, a summary of the cluster resources and health, as well as scheduling information and completion percentage, as shown in the following screenshot:
Job details
For each running and retired job, details are available, including its ID, owner, priority, task assignment, and task launch for the mapper. Clicking on a job ID link will lead to a job details page, the same URL exposed by the mapred job -list command. This resource gives details about both the map and reduce tasks as well as general counter statistics at the job, filesystem, and MapReduce levels; these include the memory used, the number of read/write operations, and the number of bytes read and written.
For each map and reduce operation, the JobTracker exposes the total, pending, running, completed, and failed tasks, as shown in the following screenshot:
Job tasks overview
Clicking on the links in the Job table will lead to a further overview at the task and task-attempt levels, as shown in the following screenshot:
Task attempts
From this last page, we can access the logs of each task attempt, both for successful and failed/killed tasks, on each individual TaskTracker host. This log contains the most granular information about the status of the MapReduce job, including the output of Log4j appenders as well as output piped to the stdout and stderr channels and syslog, as shown in the following screenshot:
TaskTracker logs
MapReduce v2 (YARN)

As we have seen in Chapter 3, Processing – MapReduce and Beyond, with YARN, MapReduce is only one of many processing frameworks that can be deployed. Recall from previous chapters that the JobTracker and TaskTracker services have been replaced by the ResourceManager and NodeManager, respectively. As such, both the service UIs and the log files from YARN are more generic than those of MapReduce v1.
The application_1405630696162_0002 name shown in the ResourceManager corresponds to a MapReduce job with the job_1405630696162_0002 ID. That application ID belongs to the task running inside the container, and clicking on it will reveal an overview of the MapReduce job and allow a drill-down to the individual tasks of either phase, until the single-task log is reached, as shown in the following screenshot:
A YARN application containing a MapReduce job
JobHistory Server

YARN ships with a JobHistory REST service that exposes details on finished applications. Currently, it only supports MapReduce and provides information on finished jobs. This includes the job's final status (SUCCEEDED or FAILED), who submitted the job, the total number of map and reduce tasks, and timing information.
A UI is available at http://<jobhistoryhost>:19888/jobhistory, as shown in the following screenshot:
JobHistory UI
Clicking on each job ID will lead to the MapReduce job UI shown in the YARN application screenshot.
NameNode and DataNode

The web interface for the Hadoop Distributed File System (HDFS) shows information about the NameNode itself as well as the filesystem in general.
By default, it is located at http://<namenodehost>:50070/, as shown in the following screenshot:
NameNode UI
The Overview menu exposes NameNode information about DFS capacity and usage and the block pool status, and it gives a summary of the status of DataNode health and availability. The information contained in this page is for the most part equivalent to what is shown at the command-line prompt:
$ hdfs dfsadmin -report
The DataNodes menu gives more detailed information about the status of each node and offers a drill-down at the single-host level, both for available and decommissioned nodes, as shown in the following screenshot:
DataNode UI
Summary

This has been quite a whistle-stop tour around the considerations of running an operational Hadoop cluster. We didn't try to turn developers into administrators, but hopefully, the broader perspective will help you to help your operations staff. In particular, we covered the following topics:
- How Hadoop is a natural fit for DevOps approaches, as its multilayered complexity means it's neither possible nor desirable to have substantial knowledge gaps between development and operations staff
- Cloudera Manager, and how it can be a great management and monitoring tool; it might cause integration problems, though, if you have other enterprise tools, and it comes with a vendor lock-in risk
- Ambari, the Apache open source alternative to Cloudera Manager, and how it is used in the Hortonworks distribution
- How to think about selecting hardware for a physical Hadoop cluster, and how this naturally fits into the considerations of how the multiple workloads possible in the world of Hadoop 2 can peacefully coexist on shared resources
- The different considerations for firing up and using EMR clusters, and how this can be both an adjunct to, as well as an alternative to, a physical cluster
- The Hadoop security ecosystem, how it is a very fast-moving area, and how the features available today are vastly better than some years ago, with still more around the corner
- Monitoring of a Hadoop cluster, considering what events are important in the Hadoop model of embracing failure, and how these alerts and metrics can be integrated into other enterprise-monitoring frameworks
- How to troubleshoot issues with a Hadoop cluster, both in terms of what might have happened and how to find the information to inform your analysis
- A quick tour of the various web UIs provided by Hadoop, which can give very good overviews of happenings within various components in the system
This concludes our treatment of Hadoop in depth. In the final chapter, we will express some thoughts on the broader Hadoop ecosystem, give some pointers to useful and interesting tools and products that we didn't have a chance to cover in the book, and suggest how to get involved with the community.
Chapter 11. Where to Go Next

In the previous chapters, we have examined many parts of Hadoop 2 and the ecosystem around it. However, we have necessarily been limited by page count; some areas we didn't cover in as much depth as we could have, and other areas we referred to only in passing or did not mention at all.
The Hadoop ecosystem, with its distributions and Apache and non-Apache projects, is an incredibly vibrant and healthy place to be right now. In this chapter, we hope to complement the more detailed material discussed previously with a travel guide, if you will, to other interesting destinations. We will discuss the following topics:
- Hadoop distributions
- Other significant Apache and non-Apache projects
- Sources of information and help
Of course, note that any overview of the ecosystem is both skewed by our interests and preferences, and outdated the moment it is written. In other words, don't for a moment think this is all that's available; consider it instead a whetting of the appetite.
Alternative distributions

We've generally used the Cloudera distribution of Hadoop in this book, but have attempted to keep the coverage as distribution-independent as possible. We've also mentioned the Hortonworks Data Platform (HDP) throughout this book, but these are certainly not the only distribution choices available to you.
Before taking a look around, let's consider whether you need a distribution at all. It is entirely possible to go to the Apache website, download the source tarballs of the projects in which you are interested, and then work to build them all together. However, given version dependencies, this is likely to consume more time than you would expect; potentially, vastly more so. In addition, the end product will likely lack some polish in terms of tools or scripts for operational deployment and management. For most users, these areas are why employing an existing Hadoop distribution is the natural choice.
A note on free and commercial extensions: since Hadoop is an open source project with quite a liberal license, distribution creators are also free to enhance it with proprietary extensions that are made available either as free open source or as commercial products.
This can be a controversial issue, as some open source advocates dislike any commercialization of successful open source projects; to them, it appears that the commercial entity is freeloading by taking the fruits of the open source community's work without having to build it for themselves. Others see this as a healthy aspect of the flexible Apache license; the base product will always be free, and individuals and companies can choose whether or not to go with commercial extensions. We don't pass judgment either way, but be aware that this is another of the controversies you will almost certainly encounter.
So you need to decide whether you need a distribution and, if so, for what reasons: which specific aspects will benefit you most over rolling your own? Do you want a fully open source product, or are you willing to pay for commercial extensions? With these questions in mind, let's look at a few of the main distributions.
Cloudera Distribution for Hadoop

You will be familiar with the Cloudera distribution (http://www.cloudera.com), as it has been used throughout this book. CDH was the first widely available alternative distribution, and its breadth of available software, proven level of quality, and free cost have made it a very popular choice.
Recently, Cloudera has been actively extending the products it adds to its distribution beyond the core Hadoop projects. In addition to Cloudera Manager and Impala (both Cloudera-developed products), it has also added other tools, such as Cloudera Search (based on Apache Solr) and Cloudera Navigator (a data governance solution). While CDH versions prior to 5 were focused more on the integration benefits of a distribution, version 5 (and presumably beyond) is adding more and more capability atop the base Apache Hadoop projects.
Cloudera also offers commercial support for its products, in addition to training and consultancy services. Details can be found on the company web page.
Hortonworks Data Platform

In 2011, the Yahoo! division responsible for so much of the development of Hadoop was spun off into a new company called Hortonworks. They have also produced their own pre-integrated Hadoop distribution, called the Hortonworks Data Platform (HDP), available at http://hortonworks.com/products/hortonworksdataplatform/.
HDP is conceptually similar to CDH, but the two products differ in their focus. Hortonworks makes much of the fact that HDP is fully open source, including the management tool Ambari, which we discussed briefly in Chapter 10, Running a Hadoop Cluster. They have also positioned HDP as a key integration platform through its support for tools such as Talend Open Studio. Hortonworks does not offer proprietary software; its business model focuses instead on offering professional services and support for the platform.
Both Cloudera and Hortonworks are venture-backed companies with significant engineering expertise; both employ many of the most prolific contributors to Hadoop. The underlying technology is, however, built from the same Apache projects; the distinguishing factors are how they are packaged, the versions employed, and the additional value-added offerings provided by the companies.
MapR

A different type of distribution is offered by MapR Technologies, although the company and distribution are usually referred to simply as MapR. The distribution, available from http://www.mapr.com, is based on Hadoop but has a number of changes and enhancements.
The focus of the MapR distribution is on performance and availability. For example, it was the first distribution to offer a high-availability solution for the Hadoop NameNode and JobTracker, which, as you will remember from Chapter 2, Storage, was a significant weakness in core Hadoop 1. It also offered native integration with NFS filesystems long before Hadoop 2, which makes processing of existing data much easier. To achieve these features, MapR replaced HDFS with a fully POSIX-compliant filesystem that has no NameNode, resulting in a truly distributed system with no master, and a claim of much better hardware utilization than Apache HDFS.
MapR provides both a community and an enterprise edition of its distribution; not all the extensions are available in the free product. The company also offers support services as part of the enterprise product subscription, in addition to training and consultancy.
And the rest…

Hadoop distributions are not just the territory of young start-ups, nor are they a static marketplace. Intel had its own distribution until early 2014, when it decided to fold its changes into CDH instead. IBM has its own distribution, called IBM InfoSphere BigInsights, available in both free and commercial editions. There are also various stories of numerous large enterprises rolling their own distributions, some of which are made openly available while others are not. You will have no shortage of options, with so many high-quality distributions available.
Choosing a distribution

This raises the question: how do you choose a distribution? As can be seen, the available distributions (and we didn't cover them all) range from convenient packaging and integration of fully open source products through to entirely bespoke integration and analysis layers atop them. There is no overall best distribution; think carefully about your requirements and consider the alternatives. Since all of these offer a free download of at least a basic version, it's good to simply play with and experience the options for yourself.
Other computational frameworks

We've frequently discussed the myriad possibilities brought to the Hadoop platform by YARN. We went into the details of two new models, Samza and Spark. Additionally, other, more established frameworks, such as Pig, are also being ported to YARN.
Togiveaviewofthemuchbiggerpictureinthissection,wewillillustratethebreadthofprocessingpossibleusingYARNbypresentingasetofcomputationalmodelsthatarecurrentlybeingportedtoHadoopontopofYARN.
Apache Storm
Storm (http://storm.apache.org) is a distributed computation framework written (mainly) in the Clojure programming language. It uses custom-created spouts and bolts to define information sources and manipulations, allowing distributed processing of streaming data. A Storm application is designed as a topology of interfaces that creates a stream of transformations. It provides functionality similar to a MapReduce job, with the exception that the topology will, in theory, run indefinitely until it is manually terminated.

Though initially built distinct from Hadoop, a YARN port is being developed by Yahoo! and can be found at https://github.com/yahoo/storm-yarn.
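The spout/bolt idea can be illustrated with a small, self-contained sketch. This is plain Python, not the actual Storm API; the `SentenceSpout`, `SplitBolt`, and `CountBolt` names are hypothetical, and a real topology would run continuously over an unbounded stream rather than a fixed list:

```python
# Minimal illustration of Storm's spout/bolt data flow in plain Python.
# This is NOT the Storm API; it only mirrors the shape of a topology.

class SentenceSpout:
    """A spout is a source of tuples; here, a fixed list of sentences."""
    def __init__(self, sentences):
        self.sentences = sentences

    def emit(self):
        for sentence in self.sentences:
            yield sentence

class SplitBolt:
    """A bolt transforms the incoming stream; here, splitting into words."""
    def process(self, stream):
        for sentence in stream:
            for word in sentence.split():
                yield word

class CountBolt:
    """A terminal bolt that accumulates word counts."""
    def __init__(self):
        self.counts = {}

    def process(self, stream):
        for word in stream:
            self.counts[word] = self.counts.get(word, 0) + 1
        return self.counts

# Wire the "topology": spout -> split bolt -> count bolt.
spout = SentenceSpout(["hello hadoop", "hello storm"])
counts = CountBolt().process(SplitBolt().process(spout.emit()))
print(counts)  # {'hello': 2, 'hadoop': 1, 'storm': 1}
```

In real Storm the framework distributes each spout and bolt across worker processes and handles the tuple routing between them; the chaining of generators above stands in for that plumbing.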
Apache Giraph
Giraph originated as the open source implementation of Google's Pregel paper (which can be found at http://kowshik.github.io/JPregel/pregel_paper.pdf). Both Giraph and Pregel are inspired by the Bulk Synchronous Parallel (BSP) model of distributed computation introduced by Valiant in 1990. Giraph adds several features, including master computation, sharded aggregators, edge-oriented input, and out-of-core computation. The YARN port can be found at https://issues.apache.org/jira/browse/GIRAPH-13.
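A rough sketch of the BSP model underlying Pregel and Giraph (plain Python, not the Giraph API): computation proceeds in supersteps separated by a barrier, messages sent in one superstep are only delivered in the next, and a vertex that has nothing new to say votes to halt. The example below is the classic maximum-value propagation from the Pregel paper:

```python
# Sketch of a Bulk Synchronous Parallel (BSP) computation, Pregel-style.
# Messages sent in superstep S are delivered at the barrier before S+1.

def max_value_bsp(values, edges):
    """values: {vertex: number}; edges: {vertex: [neighbour, ...]}.
    Converges with every vertex holding the maximum value reachable
    from it in the (undirected) graph."""
    values = dict(values)
    # Superstep 0: every vertex broadcasts its own value.
    messages = {}
    for v in values:
        for n in edges.get(v, []):
            messages.setdefault(n, []).append(values[v])
    # Subsequent supersteps: update from the inbox; re-broadcast only
    # on change, otherwise the vertex votes to halt (sends nothing).
    while messages:                      # barrier between supersteps
        next_messages = {}
        for v, inbox in messages.items():
            best = max(inbox)
            if best > values[v]:         # value improved: stay active
                values[v] = best
                for n in edges.get(v, []):
                    next_messages.setdefault(n, []).append(best)
        messages = next_messages         # computation ends when no messages
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
vals = max_value_bsp({"a": 3, "b": 6, "c": 2}, graph)
print(vals)  # {'a': 6, 'b': 6, 'c': 6}
```

Giraph distributes the vertices across workers and synchronizes the barrier across the cluster; the single-process loop above only illustrates the superstep/message/halt mechanics.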
Apache HAMA
Hama is a top-level Apache project that aims, like other approaches we've encountered so far, to address the weakness of MapReduce with regard to iterative programming. Similar to the aforementioned Giraph, Hama implements BSP techniques and has been heavily inspired by the Pregel paper. The YARN port can be found at https://issues.apache.org/jira/browse/HAMA-431.
Other interesting projects
Whether you use a bundled distribution or stick with the base Apache Hadoop download, you will encounter many references to other related projects. We've covered several of these, such as Hive, Samza, and Crunch, in this book; we'll now highlight some of the others.

Note that this coverage seeks to point out the highlights (from the authors' perspective) as well as give a taste of the breadth of the types of projects available. As mentioned earlier, keep looking out, as new ones are launching all the time.
HBase
Perhaps the most popular Apache Hadoop-related project that we didn't cover in this book is HBase (http://hbase.apache.org). Based on the BigTable model of data storage publicized by Google in an academic paper (sound familiar?), HBase is a non-relational data store sitting atop HDFS.

While both MapReduce and Hive focus on batch-like data access patterns, HBase instead seeks to provide very low-latency access to data. Consequently, HBase can, unlike the aforementioned technologies, directly support user-facing services.

The HBase data model is not the relational approach used in Hive and other RDBMSs, nor does it offer the full ACID guarantees that are taken for granted with relational stores. Instead, it is a schema-less key-value solution that takes a column-oriented view of data; columns can be added at runtime and depend on the values inserted into HBase. Each lookup operation is then very fast, as it is effectively a key-value mapping from the row key to the desired column. HBase also treats timestamps as another dimension on the data, so one can directly retrieve data from a point in time.

The data model is very powerful but does not suit all use cases, just as the relational model isn't universally applicable. But if you have a requirement for structured low-latency views on large-scale data stored in Hadoop, then HBase is absolutely something you should look at.
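The data model described above can be pictured as a nested map. The toy class below (plain Python, not the HBase API; the table and column names are made up) shows the essential shape: row key, then column, then timestamped versions of a value, with reads either taking the latest version or a version as of a point in time:

```python
# Toy model of HBase's logical storage view (not the HBase API):
# a schema-less map of row key -> column -> {timestamp: value} versions.

class ToyTable:
    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, ts):
        # Columns need no schema; they spring into existence on write.
        cell = self.rows.setdefault(row, {}).setdefault(column, {})
        cell[ts] = value

    def get(self, row, column, ts=None):
        """Latest version at or before ts (latest overall if ts is None)."""
        cell = self.rows.get(row, {}).get(column, {})
        stamps = sorted(t for t in cell if ts is None or t <= ts)
        return cell[stamps[-1]] if stamps else None

t = ToyTable()
t.put("user1", "info:email", "old@example.com", ts=100)
t.put("user1", "info:email", "new@example.com", ts=200)
print(t.get("user1", "info:email"))          # new@example.com
print(t.get("user1", "info:email", ts=150))  # old@example.com
```

A real HBase region server adds sorted on-disk storage, block caching, and distribution across the cluster, but the lookup is conceptually exactly this map traversal, which is why single-row reads are so fast.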
Sqoop
In Chapter 7, Hadoop and SQL, we looked at tools for presenting a relational-like interface to data stored on HDFS. Often, such data either needs to be retrieved from an existing relational database, or the output of its processing needs to be stored back into one.

Apache Sqoop (http://sqoop.apache.org) provides a mechanism for declaratively specifying data movement between relational databases and Hadoop. It takes a task definition and from this generates MapReduce jobs to execute the required data retrieval or storage. It will also generate code to help manipulate relational records with custom Java classes. In addition, it can integrate with HBase and HCatalog/Hive, providing a very rich set of integration possibilities.

At the time of writing, Sqoop is slightly in flux. Its original version, Sqoop 1, was a pure client-side application. Much like the original Hive command-line tool, Sqoop 1 has no server and generates all code on the client. This unfortunately means that each client needs to know a lot of details about the physical data sources, including exact hostnames as well as authentication credentials.

Sqoop 2 provides a centralized Sqoop server that encapsulates all these details and offers the various configured data sources to connecting clients. It is a superior model, but at the time of writing the general community recommendation is to stick with Sqoop 1 until the new version evolves further. Check the current status if you are interested in this type of tool.
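To make the "declarative task definition" idea concrete, here is a small helper that assembles a typical Sqoop 1 import invocation from such a definition. The JDBC URL, table, username, and target directory below are invented for illustration; the flags themselves (`--connect`, `--table`, `--username`, `--target-dir`, `--num-mappers`) are standard Sqoop 1 import options:

```python
# Assemble a Sqoop 1 "import" command line from a declarative task
# definition. All connection details below are illustrative only.

def sqoop_import_command(task):
    args = ["sqoop", "import",
            "--connect", task["jdbc_url"],
            "--table", task["table"],
            "--username", task["username"],
            "--target-dir", task["target_dir"]]
    if "num_mappers" in task:
        # Controls the parallelism of the generated MapReduce job.
        args += ["--num-mappers", str(task["num_mappers"])]
    return " ".join(args)

cmd = sqoop_import_command({
    "jdbc_url": "jdbc:mysql://dbhost/sales",
    "table": "orders",
    "username": "etl",
    "target_dir": "/data/sales/orders",
    "num_mappers": 4,
})
print(cmd)
```

Note how the task definition names *what* to move (table, destination) rather than *how*; Sqoop turns this into a MapReduce job whose mappers each pull a slice of the table over JDBC. The need for the client to hold the hostname and credentials here is exactly the Sqoop 1 limitation the text describes.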
Whirr
When looking to use cloud services such as Amazon AWS for Hadoop deployments, it is usually a lot easier to use a higher-level service such as Elastic MapReduce than to set up your own cluster on EC2. Though there are scripts to help, the fact is that the overhead of Hadoop-based deployments on cloud infrastructures can be involved. That's where Apache Whirr (https://whirr.apache.org/) comes in.

Whirr isn't focused on Hadoop; it's about supplier-independent instantiation of cloud services, of which Hadoop is a single example. Whirr aims to provide a programmatic way of specifying and creating Hadoop-based deployments on cloud infrastructures in a way that handles all the underlying service aspects for you. It does this in a provider-independent fashion, so that once you've launched on, say, EC2, you can use the same code to create an identical setup on another provider such as Rightscale or Eucalyptus. This makes vendor lock-in, often a concern with cloud deployments, less of an issue.

Whirr isn't quite there yet. Today, it is limited in the services it can create and the providers it supports; however, if you are interested in cloud deployment with less pain, then it's worth watching its progress.

Note
If you are building out your full infrastructure on Amazon Web Services, then you might find CloudFormation gives much of the same ability to define application requirements, though obviously in an AWS-specific fashion.
Mahout
Apache Mahout (http://mahout.apache.org/) is a collection of distributed algorithms, Java classes, and tools for performing advanced analytics on top of Hadoop. Similar to Spark's MLlib, briefly mentioned in Chapter 5, Iterative Computation with Spark, Mahout ships with a number of algorithms for common use cases: recommendation, clustering, regression, and feature engineering. Although the system is focused on natural language processing and text-mining tasks, its building blocks (linear algebra operations) are applicable to a number of domains. As of version 0.9, the project is being decoupled from the MapReduce framework in favor of richer programming models such as Spark. The community's end goal is a platform-independent library based on a Scala DSL.
Hue
Initially developed by Cloudera and marketed as the "User Interface for Hadoop", Hue (http://gethue.com/) is a collection of applications, bundled together under a common web interface, that act as clients for core services and a number of components of the Hadoop ecosystem:

(Figure: The Hue Query Editor for Hive)

Hue leverages many of the tools we discussed in previous chapters and provides an integrated interface for analyzing and visualizing data. Two components are particularly interesting. On one hand, there is a query editor that allows the user to create and save Hive (or Impala) queries, export the result set in CSV or Microsoft Excel format, and plot it in the browser. The editor supports sharing both HiveQL and result sets, thus facilitating collaboration within an organization. On the other hand, there is an Oozie workflow and coordinator editor that allows a user to create and deploy Oozie jobs manually, automating the generation of XML configuration and boilerplate.

Both the Cloudera and Hortonworks distributions ship with Hue, which typically includes the following:

A file manager for HDFS
A Job Browser for YARN (MapReduce)
An Apache HBase browser
A Hive metastore explorer
Query editors for Hive and Impala
A script editor for Pig
A job editor for MapReduce and Spark
An editor for Sqoop 2 jobs
An Oozie workflow editor and dashboard
An Apache ZooKeeper browser

On top of this, Hue is a framework with an SDK that contains a number of web assets, APIs, and patterns for developing third-party applications that interact with Hadoop.
Other programming abstractions
Hadoop isn't just extended by additional functionality; there are also tools that provide entirely different paradigms for writing the code used to process your data within Hadoop.
Cascading
Developed by Concurrent, and open sourced under an Apache license, Cascading (http://www.cascading.org/) is a popular framework that abstracts away the complexity of MapReduce and allows the creation of complex workflows on top of Hadoop. Cascading jobs can compile to, and be executed on, MapReduce, Tez, and Spark. Conceptually, the framework is similar to Apache Crunch, covered in Chapter 9, Making Development Easier, though in practice there are differences in terms of data abstractions and end goals. Cascading adopts a tuple data model (similar to Pig) rather than arbitrary objects, and encourages the user to rely on a higher-level DSL, powerful built-in types, and tools to manipulate data.

Put in simple terms, Cascading is to Pig Latin and HiveQL what Crunch is to a user-defined function.

Like Morphlines, which we also saw in Chapter 9, Making Development Easier, the Cascading data model follows a source-pipe-sink approach, where data is captured from a source, piped through a number of processing steps, and its output is then delivered into a sink, ready to be picked up by another application.
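The source-pipe-sink model with a tuple stream can be sketched in a few lines. This is plain Python, not the Cascading API; each "pipe" below is just a generator stage, and the field names are invented:

```python
# Sketch of the source-pipe-sink model over a tuple stream (plain
# Python, not the Cascading API). Each pipe is a lazy generator stage.

def source(records):
    """A source captures data; here, from an in-memory list of tuples."""
    for r in records:
        yield r

def filter_pipe(stream, predicate):
    """A pipe that keeps only tuples matching the predicate."""
    return (t for t in stream if predicate(t))

def map_pipe(stream, fn):
    """A pipe that transforms each tuple."""
    return (fn(t) for t in stream)

def sink(stream):
    """A sink materializes the output; a real one would write to HDFS."""
    return list(stream)

# Tuples of (user, score); keep scores over 50 and project the user.
raw = [("alice", 72), ("bob", 31), ("carol", 95)]
out = sink(map_pipe(filter_pipe(source(raw), lambda t: t[1] > 50),
                    lambda t: t[0]))
print(out)  # ['alice', 'carol']
```

The point of the model is that the pipe assembly is declared independently of where the source and sink actually live; Cascading then plans the assembly into MapReduce (or Tez, or Spark) jobs.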
Cascading encourages developers to write code in a number of JVM languages. Ports of the framework exist for Python (PyCascading), JRuby (Cascading.jruby), Clojure (Cascalog), and Scala (Scalding). Cascalog and Scalding in particular have gained a lot of traction and spawned their very own ecosystems.

An area where Cascading excels is documentation. The project provides comprehensive javadocs of the API, extensive tutorials (http://www.cascading.org/documentation/tutorials/), and an interactive, exercise-based learning environment (https://github.com/Cascading/Impatient).

Another strong selling point of Cascading is its integration with third-party environments. Amazon EMR supports Cascading as a first-class processing framework and allows us to launch Cascading clusters with both the command-line and web interfaces (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/CreateCascading.html). Plugins for the SDK exist for both the IntelliJ IDEA and Eclipse integrated development environments. One of the framework's top projects, Cascading Patterns, a collection of machine-learning algorithms, features a utility for translating Predictive Model Markup Language (PMML) documents into applications on Apache Hadoop, thus facilitating interoperability with popular statistical environments and scientific tools such as R (http://cran.r-project.org/web/packages/pmml/index.html).
AWS resources
Many Hadoop technologies can be deployed on AWS as part of a self-managed cluster. However, just as Amazon offers Elastic MapReduce, which provides Hadoop as a managed service, there are a few other services that are worth mentioning.
SimpleDB and DynamoDB
For some time, AWS has offered SimpleDB as a hosted service providing an HBase-like data model.

It has, however, largely been superseded by a more recent service from AWS, DynamoDB, located at http://aws.amazon.com/dynamodb. Though its data model is very similar to that of SimpleDB and HBase, it is aimed at a very different type of application. Where SimpleDB has quite a rich search API but is very limited in terms of size, DynamoDB provides a more constrained, though constantly evolving, API, but with a service guarantee of near-unlimited scalability.

The DynamoDB pricing model is particularly interesting; instead of paying for a certain number of servers hosting the service, you allocate a certain capacity for read and write operations, and DynamoDB manages the resources required to meet this provisioned capacity. This is an interesting development, as it is a purer service model, where the mechanism of delivering the desired performance is kept completely opaque to the service user. Have a look at DynamoDB if you need a much larger scale of data store than SimpleDB can offer; however, do consider the pricing model carefully, as provisioning too much capacity can become very expensive very quickly. Amazon provides some good best practices for DynamoDB at the following URL, which illustrate that minimizing the service costs can result in additional application-layer complexity: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BestPractices.html.
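The provisioned-capacity billing described above is simple arithmetic, which is also why over-provisioning gets expensive so quickly. The sketch below uses placeholder per-unit hourly rates (they are assumptions for illustration, not real AWS prices, which vary by region and change over time):

```python
# Back-of-envelope cost of provisioned DynamoDB-style throughput.
# The hourly rates below are PLACEHOLDERS, not actual AWS pricing.

READ_RATE_PER_UNIT_HOUR = 0.00013   # assumed $ per read-capacity-unit/hour
WRITE_RATE_PER_UNIT_HOUR = 0.00065  # assumed $ per write-capacity-unit/hour

def monthly_cost(read_units, write_units, hours=730):
    """Cost of keeping the given capacity provisioned for a month.
    You pay for the allocation whether or not you use it."""
    return hours * (read_units * READ_RATE_PER_UNIT_HOUR
                    + write_units * WRITE_RATE_PER_UNIT_HOUR)

# Provisioning 1000 read units and 200 write units of capacity:
cost = monthly_cost(read_units=1000, write_units=200)
print(round(cost, 2))  # 189.8
```

Because the charge accrues on the allocation rather than on actual traffic, the best-practices advice the text links to is largely about shaping the application so that a lower provisioned capacity suffices.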
Note
Of course, the discussion of DynamoDB and SimpleDB assumes a non-relational data model; for a relational database in the cloud, there is the Amazon Relational Database Service (Amazon RDS).
Kinesis
Just as EMR is hosted Hadoop and DynamoDB has similarities to a hosted HBase, it wasn't surprising to see AWS announce Kinesis, a hosted streaming data service, in 2013. It can be found at http://aws.amazon.com/kinesis and has very similar conceptual building blocks to the stack of Samza atop Kafka. Kinesis provides a partitioned view of messages as a stream of data, plus an API for callbacks that execute when messages arrive. As with most AWS services, there is tight integration with other services, making it easy to get data into and out of locations such as S3.
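The two building blocks named above, partitioning by key and per-record callbacks, can be sketched in miniature. This is plain Python in the spirit of Kinesis/Kafka, not the AWS API; `ToyStream` and its shard routing are invented for illustration:

```python
# Sketch of a partitioned stream with per-record callbacks (not the
# AWS Kinesis API). Records are routed to a shard by hashing their
# partition key; consumers register callbacks fired on arrival.

class ToyStream:
    def __init__(self, num_shards):
        self.shards = [[] for _ in range(num_shards)]
        self.callbacks = []

    def put_record(self, partition_key, data):
        # Same key always lands in the same shard, preserving per-key order.
        shard = hash(partition_key) % len(self.shards)
        self.shards[shard].append((partition_key, data))
        for cb in self.callbacks:
            cb(shard, partition_key, data)

    def on_record(self, callback):
        self.callbacks.append(callback)

stream = ToyStream(num_shards=2)
seen = []
stream.on_record(lambda shard, key, data: seen.append((key, data)))
stream.put_record("sensor-1", "t=21.5")
stream.put_record("sensor-2", "t=19.0")
print(seen)  # [('sensor-1', 't=21.5'), ('sensor-2', 't=19.0')]
```

The real service durably stores records in each shard for a retention window and lets consumers poll from a position in the shard; the synchronous callback here just mirrors the programming model of reacting to records as they arrive.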
Data Pipeline
The final AWS service that we'll mention is Data Pipeline, which can be found at http://aws.amazon.com/datapipeline. As the name suggests, it is a framework for building up data-processing jobs that involve multiple steps, data movements, and transformations. It has quite a conceptual overlap with Oozie, but with a few twists. Firstly, Data Pipeline has the expected deep integration with many other AWS services, enabling easy definition of data workflows that incorporate diverse repositories such as RDS, S3, and DynamoDB. In addition, however, Data Pipeline has the ability to integrate agents installed on local infrastructure, providing an interesting avenue for building workflows that span the AWS and on-premises environments.
Sources of information
You don't just need new technologies and tools, even if they are cool. Sometimes, a little help from a more experienced source can pull you out of a hole. In this regard, you are well covered, as the Hadoop community is extremely strong in many areas.
Source code
It's sometimes easy to overlook, but Hadoop and all the other Apache projects are, after all, fully open source. The actual source code is the ultimate source (pardon the pun) of information about how the system works. Becoming familiar with the source and tracing through some of the functionality can be hugely informative, not to mention helpful when you are hitting unexpected behavior.
Mailing lists and forums
Almost all the projects and services listed in this chapter have their own mailing lists and/or forums; check out the homepages for the specific links. Most distributions also have their own forums and other mechanisms to share knowledge and get (non-commercial) help from the community. Additionally, if using AWS, make sure to check out the AWS developer forums at https://forums.aws.amazon.com.

Always remember to read posting guidelines carefully and understand the expected etiquette. These are tremendous sources of information; the lists and forums are often frequented by the developers of the particular project. Expect to see the core Hadoop developers on the Hadoop lists, Hive developers on the Hive list, EMR developers on the EMR forums, and so on.
LinkedIn groups
There are a number of Hadoop and related groups on the professional social network LinkedIn. Do a search for your particular areas of interest, but a good starting point might be the general Hadoop users' group at http://www.linkedin.com/groups/Hadoop-Users-988957.
HUGs
If you want more face-to-face interaction, then look for a Hadoop User Group (HUG) in your area; most of them are listed at http://wiki.apache.org/hadoop/HadoopUserGroups. These tend to arrange semi-regular get-togethers that combine quality presentations, the ability to discuss technology with like-minded individuals, and often pizza and drinks.

No HUG near where you live? Consider starting one.
Conferences
Though some industries take decades to build up a conference circuit, Hadoop already has significant conference activity involving the open source, academic, and commercial worlds. Events such as Hadoop Summit and Strata are pretty big; these and some others are linked from http://wiki.apache.org/hadoop/Conferences.
Summary
In this chapter, we took a quick gallop around the broader Hadoop ecosystem, looking at the following topics:

Why alternative Hadoop distributions exist, and some of the more popular ones
Other projects that provide capabilities, extensions, or Hadoop supporting tools
Alternative ways of writing or creating Hadoop jobs
Sources of information and how to connect with other enthusiasts

Now, go have fun and build something amazing!
Index

A

additional data, collecting
    about / Collecting additional data
    workflows, scheduling / Scheduling workflows
    Oozie triggers / Other Oozie triggers
addMapper method, arguments
    job / Text cleanup using chain mapper
    class / Text cleanup using chain mapper
    inputKeyClass / Text cleanup using chain mapper
    inputValueClass / Text cleanup using chain mapper
    outputKeyClass / Text cleanup using chain mapper
    outputValueClass / Text cleanup using chain mapper
    mapperConf / Text cleanup using chain mapper
alternative distributions
    about / Alternative distributions
    Cloudera Distribution / Cloudera Distribution for Hadoop
    Hortonworks Data Platform (HDP) / Hortonworks Data Platform
    MapR / MapR
    selecting / Choosing a distribution
Amazon account
    reference link / Creating an AWS account
Amazon CLI
    reference link / The AWS command-line interface
Amazon EMR
    about / Amazon EMR
    AWS account, creating / Creating an AWS account
    required services, signing up / Signing up for the necessary services
Amazon Relational Database Service (Amazon RDS) / SimpleDB and DynamoDB
Amazon Web Services
    Hive, working with / Hive and Amazon Web Services
Ambari
    about / Ambari – the open source alternative
    URL / Ambari – the open source alternative
AMPLab at UC Berkeley, URL / Apache Spark
Apache Avro
    about / Avro
    URL / Avro
Apache Crunch
    about / Apache Crunch
    URL / Apache Crunch
    JARs / Getting started
    libraries / Getting started
    concepts / Concepts
    PCollection<T> interface / Concepts
    PTable<Key, Value> interface / Concepts
    data serialization / Data serialization
    data processing patterns / Data processing patterns
    Pipelines implementation / Pipelines implementation and execution
    execution / Pipelines implementation and execution
    examples / Crunch examples
    Kite Morphlines / Kite Morphlines
Apache DataFu
    reference link / Contributed UDFs, Apache DataFu
    about / Apache DataFu
Apache Giraph
    about / Apache Giraph
    URL / Apache Giraph
Apache HAMA
    about / Apache HAMA
Apache Kafka
    URL / Apache Samza, Samza's best friend – Apache Kafka
    about / Samza's best friend – Apache Kafka
    Twitter data, getting into / Getting Twitter data into Kafka
Apache Knox
    about / Beyond basic authorization
    URL / Beyond basic authorization
Apache Sentry
    URL / Beyond basic authorization
Apache Spark
    about / Apache Spark, Getting started with Spark
    URL / Apache Spark, Getting started with Spark
    cluster computing, with working sets / Cluster computing with working sets
    Resilient Distributed Datasets (RDDs) / Resilient Distributed Datasets (RDDs)
    actions / Actions
    deployment / Deployment
    on YARN / Spark on YARN
    on EC2 / Spark on EC2
    standalone applications, writing / Writing and running standalone applications
    Scala API / Scala API
    Java API / Java API
    WordCount, in Java / WordCount in Java
    Python API / Python API
    data, processing / Processing data with Apache Spark
Apache Spark, ecosystem
    about / The Spark ecosystem
    Spark Streaming / Spark Streaming
    GraphX / GraphX
    MLlib / MLlib
    Spark SQL / Spark SQL
Apache Storm
    about / Apache Storm
    URL / Apache Storm
Apache Thrift
    about / Thrift
    URL / Thrift
Apache Tika
    about / Multijob workflows
    URL / Multijob workflows
Apache Twill
    URL / Thinking in layers
Apache ZooKeeper
    about / Apache ZooKeeper – a different type of filesystem
    URL / Apache ZooKeeper – a different type of filesystem
    distributed lock, implementing with sequential ZNodes / Implementing a distributed lock with sequential ZNodes
    group membership, implementing / Implementing group membership and leader election using ephemeral ZNodes
    leader election, implementing with ephemeral ZNodes / Implementing group membership and leader election using ephemeral ZNodes
    Java API / Java API
    blocks, building / Building blocks
    used, for enabling automatic NameNode failover / Automatic NameNode failover
application development
    framework, selecting / Choosing a framework
ApplicationManager
    about / ResourceManager, NodeManager, and ApplicationManager
ApplicationMaster (AM)
    about / Anatomy of a YARN application
architectural principles, HDFS and MapReduce / Common building blocks
Array wrapper classes
    about / Array wrapper classes
automatic NameNode failover
    enabling / Automatic NameNode failover
Avro
    about / Avro
Avro schema evolution, using
    thoughts / Final thoughts on using Avro schema evolution
    additive changes, making / Only make additive changes
    schema versions, managing explicitly / Manage schema versions explicitly
    schema distribution / Think about schema distribution
Avro schemas
    about / Using the Java API
AvroSerde
    URL / Avro
    about / Avro
AWS
    about / Distributions of Apache Hadoop, AWS – infrastructure on demand from Amazon
    Simple Storage Service (S3) / Simple Storage Service (S3)
    Elastic MapReduce (EMR) / Elastic MapReduce (EMR)
AWS command-line interface
    about / The AWS command-line interface
    reference link / The AWS command-line interface
AWS credentials
    about / AWS credentials
    account ID / AWS credentials
    access key / AWS credentials
    secret access key / AWS credentials
    key pairs / AWS credentials
    reference link / AWS credentials
AWS developer forums
    URL / Mailing lists and forums
AWS resources
    about / AWS resources
    SimpleDB / SimpleDB and DynamoDB
    DynamoDB / SimpleDB and DynamoDB
    Data Pipeline / Data Pipeline
B

block replication
    about / Block replication
Bulk Synchronous Parallel (BSP) model
    about / Apache Giraph
C

Cascading
    about / Cascading
    URL / Cascading
    reference links / Cascading
Cloudera
    URL / Distributions of Apache Hadoop
    URL, for documentation / Cloudera Manager
    URL, for blog post / Sharing resources
Cloudera distribution
    about / Cloudera Distribution for Hadoop
    URL / Cloudera Distribution for Hadoop
Cloudera Hadoop Distribution (CDH)
    about / Cloudera Manager
Cloudera Kitten
    URL / Thinking in layers
Cloudera Manager
    about / Cloudera Manager
    payment, for subscription services / To pay or not to pay
    cluster management, performing / Cluster management using Cloudera Manager
    integrating, with systems management tools / Cloudera Manager and other management tools
    monitoring with / Monitoring with Cloudera Manager
    log files, finding / Finding configuration files
Cloudera Manager API
    about / Cloudera Manager API
Cloudera Manager lock-in
    about / Cloudera Manager lock-in
Cloudera QuickStart VM
    about / Cloudera QuickStart VM
    advantages / Cloudera QuickStart VM
cluster
    building, on EMR / Building a cluster on EMR
cluster, Apache Spark
    computing, with working sets / Cluster computing with working sets
cluster, on EMR
    filesystem, considerations / Considerations about filesystems
    data, obtaining into EMR / Getting data into EMR
    EC2 instances / EC2 instances and tuning
    EC2 tuning / EC2 instances and tuning
cluster management
    performing, Cloudera Manager used / Cluster management using Cloudera Manager
cluster startup, HDFS
    about / Cluster startup
    NameNode startup / NameNode startup
    DataNode startup / DataNode startup
cluster tuning
    about / Cluster tuning
    JVM considerations / JVM considerations
    map optimization / Map and reduce optimizations
    reduce optimization / Map and reduce optimizations
column-oriented data formats
    about / Column-oriented data formats
    RCFile / RCFile
    ORC / ORC
    Parquet / Parquet
    Avro / Avro
    Java API, using / Using the Java API
columnar
    about / Columnar stores
columnar stores / Columnar stores
combiner class, Java API to MapReduce
    about / Combiner
combineValues operation
    about / Concepts
command-line access, HDFS filesystem
    about / Command-line access to the HDFS filesystem
    hdfs command / Command-line access to the HDFS filesystem
    dfs command / Command-line access to the HDFS filesystem
    dfsadmin command / Command-line access to the HDFS filesystem
Comparable interface
    about / The Comparable and WritableComparable interfaces
complex data types
    map / Pig data types
    tuple / Pig data types
    bag / Pig data types
complex event processing (CEP)
    about / How Samza works
components, Hadoop
    about / Components of Hadoop
    common building blocks / Common building blocks
    storage / Storage
    computation / Computation
components, YARN
    about / The components of YARN
    ResourceManager (RM) / The components of YARN
    NodeManager (NM) / The components of YARN
computation
    about / Computation
computation, Hadoop 2
    about / Computation in Hadoop 2
computational frameworks
    about / Other computational frameworks
    Apache Storm / Apache Storm
    Apache Giraph / Apache Giraph, Apache HAMA
conferences
    about / Conferences
    reference link / Conferences
configuration file, Samza
    about / The configuration file
containers
    about / Serialization and Containers
contributed UDFs
    about / Contributed UDFs
    Piggybank / Piggybank
    Elephant Bird / Elephant Bird
    Apache DataFu / Apache DataFu
create.hql script
    reference link / Extracting data and ingesting into Hive
Crunch examples
    about / Crunch examples
    word co-occurrence / Word co-occurrence
    TF-IDF / TF-IDF
Curator project
    reference link / Building blocks
D
data, managing
  about / Managing and serializing data
  Writable interface / The Writable interface
  wrapper classes / Introducing the wrapper classes
  Array wrapper classes / Array wrapper classes
  Comparable interface / The Comparable and WritableComparable interfaces
  WritableComparable interface / The Comparable and WritableComparable interfaces
data, Pig
  working with / Working with data
  FILTER operator / Filtering
  aggregation / Aggregation
  FOREACH operator / Foreach
  JOIN operator / Join
data, storing
  about / Storing data
  serialization file format / Serialization and Containers
  containers file format / Serialization and Containers
  file compression / Compression
  general-purpose file formats / General-purpose file formats
  column-oriented data formats / Column-oriented data formats
Data Core
  about / Data Core
Data Crunch
  about / Data Crunch
Data HCatalog
  about / Data HCatalog
Data Hive
  about / Data Hive
data lifecycle management
  about / What data lifecycle management is
  importance / Importance of data lifecycle management
  tools / Tools to help
Data MapReduce
  about / Data MapReduce
DataNode / NameNode and DataNode
DataNodes
  about / Storage in Hadoop 2
DataNode startup
  about / DataNode startup
DataPipeline
  about / DataPipeline
  reference link / DataPipeline
data processing
  about / Data processing with Hadoop
  dataset, generating from Twitter / Why Twitter?
  dataset, building / Building our first dataset
  programmatic access, with Python / Programmatic access with Python
data processing, Apache Spark
  about / Processing data with Apache Spark
  examples, running / Building and running the examples
  examples, building / Building and running the examples
  examples, running on YARN / Running the examples on YARN
  popular topics, finding / Finding popular topics
  sentiment, assigning to topics / Assigning a sentiment to topics
  on streams / Data processing on streams
  state management / State management
  data analysis, with Spark SQL / Data analysis with Spark SQL
  SQL, on data streams / SQL on data streams
data processing patterns, Crunch
  about / Data processing patterns
  aggregation and sorting / Aggregation and sorting
  joining data / Joining data
data serialization, Crunch
  about / Data serialization
dataset, building with Twitter
  about / Building our first dataset
  multiple APIs, using / One service, multiple APIs
  anatomy, of Tweet / Anatomy of a Tweet
  Twitter credentials / Twitter credentials
Data Spark
  about / Data Spark
data types, Hive
  numeric / Data types
  date and time / Data types
  string / Data types
  collections / Data types
  misc / Data types
data types, Pig
  scalar data types / Pig data types
  complex data types / Pig data types
DDL statements, Hive / DDL statements
decayFactor function / State management
DEFINE operator
  about / Extending Pig (UDFs)
derived data, producing
  about / Producing derived data
  multiple actions, performing in parallel / Performing multiple actions in parallel
  subworkflow, calling / Calling a subworkflow
  global settings, adding / Adding global settings
DevOps practices / Hadoop and DevOps practices
directed acyclic graph (DAG)
  about / YARN
document frequency
  about / Calculate document frequency
  calculating, TF-IDF used / Calculate document frequency
Drill
  URL / Drill, Tajo, and beyond
  about / Drill, Tajo, and beyond
Driver class, Java API to MapReduce
  about / The Driver class
dynamic invokers
  about / Dynamic invokers
  reference link / Dynamic invokers
DynamoDB
  URL / SimpleDB and DynamoDB
  about / SimpleDB and DynamoDB
E
EC2
  Apache Spark on / Spark on EC2
EC2 key-value pair
  reference link / The AWS command-line interface
Elastic MapReduce
  Hive, using with / Hive on Elastic MapReduce
Elastic MapReduce (EMR)
  about / Distributions of Apache Hadoop, Elastic MapReduce (EMR)
  URL / Elastic MapReduce (EMR)
  using / Using Elastic MapReduce
Elephant Bird
  reference link / Contributed UDFs, Elephant Bird
EMR
  cluster, building on / Building a cluster on EMR
  URL, for best practices / Building a cluster on EMR
EMR documentation
  URL / Hive on Elastic MapReduce
entities
  about / Tweet metadata
ephemeral ZNodes
  about / Implementing group membership and leader election using ephemeral ZNodes
eval functions, Pig
  AVG(expression) / Eval
  COUNT(expression) / Eval
  COUNT_STAR(expression) / Eval
  IsEmpty(expression) / Eval
  MAX(expression) / Eval
  MIN(expression) / Eval
  SUM(expression) / Eval
  TOKENIZE(expression) / Eval
examples
  running / Running the examples
examples, MapReduce programs
  reference link / Running the examples
  local cluster / Local cluster
  Elastic MapReduce / Elastic MapReduce
examples and source code
  download link / Getting started
ExecutionEngine interface / An overview of Pig
external data, challenges
  about / Challenges of external data
  data validation / Data validation
  validation actions / Validation actions
  format changes, handling / Handling format changes
  schema evolution, handling with Avro / Handling schema evolution with Avro
EXTERNAL keyword / DDL statements
Extract-Transform-Load (ETL) / DDL statements
extract_for_hive.pig
  URL, for source code / Prerequisites
F
Falcon
  URL / Other tools to help
  about / Other tools to help
file format, Hive
  about / File formats and storage
  JSON / JSON
FileFormat classes, Hive
  TextInputFormat / File formats and storage
  HiveIgnoreKeyTextOutputFormat / File formats and storage
  SequenceFileInputFormat / File formats and storage
  SequenceFileOutputFormat / File formats and storage
filesystem metadata, HDFS
  protecting / Protecting the filesystem metadata
  Secondary NameNode, demerits / Secondary NameNode not to the rescue
  Hadoop 2 NameNode HA / Hadoop 2 NameNode HA
  client configuration / Client configuration
  failover, working / How a failover works
FILTER operator
  about / Filtering
FlumeJava
  reference link / Apache Crunch
FOREACH operator
  about / Foreach
fork node
  about / Performing multiple actions in parallel
functions, Pig
  about / Pig functions
  built-in functions / Pig functions
  reference link, for built-in functions / Pig functions
  load/store functions / Load/store
  eval / Eval
  tuple / The tuple, bag, and map functions
  bag / The tuple, bag, and map functions
  map / The tuple, bag, and map functions
  string / The math, string, and datetime functions
  math / The math, string, and datetime functions
  datetime / The math, string, and datetime functions
  dynamic invokers / Dynamic invokers
  macros / Macros
G
Garbage Collection (GC) / JVM considerations
Garbage First (G1) collector / JVM considerations
general-purpose file formats
  about / General-purpose file formats
  Text files / General-purpose file formats
  SequenceFile / General-purpose file formats
general availability (GA) / A note on versioning
Google Chubby system
  reference link / Apache ZooKeeper – a different type of filesystem
Google File System (GFS)
  reference link / The background of Hadoop
Gradle
  URL / Running the examples
GraphX
  about / GraphX
  URL / GraphX
groupByKey() method / Aggregation and sorting
groupByKey(GroupingOptions options) method / Aggregation and sorting
groupByKey(int numPartitions) method / Aggregation and sorting
groupByKey operation
  about / Concepts
GROUP operator
  about / Aggregation
Grunt
  about / Grunt – the Pig interactive shell
  sh command / Grunt – the Pig interactive shell
  help command / Grunt – the Pig interactive shell
Guava library
  URL / The TopN pattern
H
Hadoop
  versioning / A note on versioning
  background / The background of Hadoop
  components / Components of Hadoop
  dual approach / A dual approach
  about / Getting started
  using / Getting Hadoop up and running
  EMR, using / How to use EMR
  AWS credentials / AWS credentials
  data processing / Data processing with Hadoop
  practices / Hadoop and DevOps practices
  alternative distributions / Alternative distributions
  computational frameworks / Other computational frameworks
  interesting projects / Other interesting projects
  programming abstractions / Other programming abstractions
  AWS resources / AWS resources
  sources of information / Sources of information
Hadoop-provided InputFormat, MapReduce job
  about / Hadoop-provided InputFormat
  FileInputFormat / Hadoop-provided InputFormat
  SequenceFileInputFormat / Hadoop-provided InputFormat
  TextInputFormat / Hadoop-provided InputFormat
  KeyValueTextInputFormat / Hadoop-provided InputFormat
Hadoop-provided Mapper and Reducer implementations, Java API to MapReduce
  about / Hadoop-provided mapper and reducer implementations
  mappers / Hadoop-provided mapper and reducer implementations
  reducers / Hadoop-provided mapper and reducer implementations
Hadoop-provided OutputFormat, MapReduce job
  about / Hadoop-provided OutputFormat
  FileOutputFormat / Hadoop-provided OutputFormat
  NullOutputFormat / Hadoop-provided OutputFormat
  SequenceFileOutputFormat / Hadoop-provided OutputFormat
  TextOutputFormat / Hadoop-provided OutputFormat
Hadoop-provided RecordReader, MapReduce job
  about / Hadoop-provided RecordReader
  LineRecordReader / Hadoop-provided RecordReader
  SequenceFileRecordReader / Hadoop-provided RecordReader
Hadoop 2
  about / Hadoop 2 – what's the big deal?
  storage / Storage in Hadoop 2
  computation / Computation in Hadoop 2
  diagrammatic representation, architecture / Computation in Hadoop 2
  reference link / Getting started
  operations / Operations in the Hadoop 2 world
Hadoop 2 NameNode HA
  about / Hadoop 2 NameNode HA
  enabling / Hadoop 2 NameNode HA
  keeping, in sync / Keeping the HA NameNodes in sync
Hadoop Distributed File System (HDFS) / NameNode and DataNode
Hadoop distributions
  about / Distributions of Apache Hadoop
  Hortonworks / Distributions of Apache Hadoop
  Cloudera / Distributions of Apache Hadoop
  MapR / Distributions of Apache Hadoop
  reference link / Distributions of Apache Hadoop
Hadoop filesystems
  about / Hadoop filesystems
  reference link / Hadoop filesystems
  Hadoop interfaces / Hadoop interfaces
Hadoop interfaces
  about / Hadoop interfaces
  Java FileSystem API / Java FileSystem API
  Libhdfs / Libhdfs
  Apache Thrift / Thrift
Hadoop operations
  about / I'm a developer – I don't care about operations!
Hadoop security
  future / The future of Hadoop security
Hadoop security model
  evolution / Evolution of the Hadoop security model
  additional security features / Beyond basic authorization
Hadoop streaming
  about / Hadoop streaming
  wordcount, streaming in Python / Streaming word count in Python
  differences in jobs / Differences in jobs when using streaming
  importance of words, determining / Finding important words in text
Hadoop UI
  URL / Other tools to help
  about / Other tools to help
Hadoop User Group (HUG) / HUGs
hashtagRegExp / Trending topics
hashtags
  about / Sentiment of hashtags
HBase
  about / HBase
  URL / HBase
HCatalog
  about / Introducing HCatalog
  using / Using HCatalog
HCat CLI tool
  about / Using HCatalog
hcat utility
  about / Using HCatalog
HDFS
  about / Components of Hadoop, Storage, Samza and HDFS
  characteristics / Storage
  architecture / The inner workings of HDFS
  NameNode / The inner workings of HDFS
  DataNodes / The inner workings of HDFS
  cluster startup / Cluster startup
  block replication / Block replication
HDFS and MapReduce
  merits / Better together
HDFS filesystem
  command-line access / Command-line access to the HDFS filesystem
  exploring / Exploring the HDFS filesystem
HDFS snapshots
  about / HDFS snapshots
Hello Samza
  about / Hello Samza!
  URL / Hello Samza!
high-availability (HA)
  about / Storage in Hadoop 2
High Performance Computing (HPC) / Computation in Hadoop 2
Hive
  about / Hive-on-tez
  URL / Hive-on-tez
  overview / Overview of Hive
  data types / Data types
  DDL statements / DDL statements
  file formats / File formats and storage
  storage / File formats and storage
  queries / Queries
  scripts, writing / Writing scripts
  working, with Amazon Web Services / Hive and Amazon Web Services
  using, with S3 / Hive and S3
  using, with Elastic MapReduce / Hive on Elastic MapReduce
  URL, for source code of JDBC client / JDBC
  URL, for source code of Thrift client / Thrift
Hive-JSON-Serde
  URL / JSON
hive-json module
  URL / JSON
  about / JSON
Hive-on-tez
  about / Hive-on-tez
Hive 0.13
  about / Hive-on-tez
Hive architecture
  about / Hive architecture
HiveQL
  about / Why SQL on Hadoop, Queries
  extending / Extending HiveQL
HiveServer2
  about / Hive architecture
  URL / Hive architecture
Hive tables
  about / The nature of Hive tables
  structuring, from workloads / Structuring Hive tables for given workloads
Hortonworks' HDP
  URL / Spark on YARN
Hortonworks
  URL / Distributions of Apache Hadoop
Hortonworks Data Platform (HDP)
  about / Alternative distributions, Hortonworks Data Platform
  URL / Hortonworks Data Platform
Hue
  about / Hue
  URL / Hue
HUGs
  about / HUGs
  reference link / HUGs
I
IAM console
  URL / Hive and S3
IBM InfoSphere BigInsights
  about / And the rest…
Identity and Access Management (IAM) / AWS credentials
Impala
  about / Impala
  references / Impala, Co-existing with Hive
  architecture / The architecture of Impala
  co-existing, with Hive / Co-existing with Hive
in-sync replicas (ISR)
  about / Getting Twitter data into Kafka
indices attribute, entity
  about / Tweet metadata
input/output, MapReduce job
  about / Input/Output
InputFormat, MapReduce job
  about / InputFormat and RecordReader
J
Java
  WordCount / WordCount in Java
Java API
  about / Java API
  and Scala API, differences / Java API
Java API to MapReduce
  about / Java API to MapReduce
  Mapper class / The Mapper class
  Reducer class / The Reducer class
  Driver class / The Driver class
  combiner class / Combiner
  partitioning / Partitioning
  Hadoop-provided Mapper and Reducer implementations / Hadoop-provided mapper and reducer implementations
  reference data, sharing / Sharing reference data
Java FileSystem API
  about / Java FileSystem API
JDBC
  about / JDBC
JobTracker monitoring, MapReduce job
  about / Ongoing JobTracker monitoring
join node
  about / Performing multiple actions in parallel
JOIN operator
  about / Join, Queries
JSON
  about / JSON
JSON Simple
  URL / Building a tweet parsing job
JVM considerations, cluster tuning
  about / JVM considerations
  small files problem / The small files problem
K
kite-morphlines-avro command / Morphline commands
kite-morphlines-core-stdio command / Morphline commands
kite-morphlines-core-stdlib command / Morphline commands
kite-morphlines-hadoop-core command / Morphline commands
kite-morphlines-hadoop-parquet-avro command / Morphline commands
kite-morphlines-hadoop-rcfile command / Morphline commands
kite-morphlines-hadoop-sequencefile command / Morphline commands
kite-morphlines-json command / Morphline commands
Kite Data
  about / Kite Data
  Data Core / Data Core
  Data HCatalog / Data HCatalog
  Data Hive / Data Hive
  Data MapReduce / Data MapReduce
  Data Spark / Data Spark
  Data Crunch / Data Crunch
Kite examples
  reference link / Kite Data
Kite JARs
  reference link / Kite Data
Kite Morphlines
  about / Kite Morphlines
  concepts / Concepts
  Record abstractions / Concepts
  commands / Morphline commands
Kite SDK
  URL / Kite Data
KVM
  reference link / Cloudera QuickStart VM
L
Lambda syntax
  URL / Python API
Libhdfs
  about / Libhdfs
LinkedIn groups
  about / LinkedIn groups
  URL / LinkedIn groups
Log4j
  about / Logging levels
log files
  accessing / Access to log files
logging levels
  about / Logging levels
M
Machine Learning (ML)
  about / MLlib
macros
  about / Macros
Mahout
  about / Mahout
  URL / Mahout
map optimization, cluster tuning
  considerations / Map and reduce optimizations
Mapper class, Java API to MapReduce
  about / The Mapper class
mapper execution, MapReduce job
  about / Mapper execution
mapper input, MapReduce job
  about / Mapper input
mapper output, MapReduce job
  about / Mapper output and reducer input
mappers, Mapper and Reducer implementations
  InverseMapper / Hadoop-provided mapper and reducer implementations
  TokenCounterMapper / Hadoop-provided mapper and reducer implementations
  IdentityMapper / Hadoop-provided mapper and reducer implementations
MapR
  URL / Distributions of Apache Hadoop, MapR
  about / MapR
MapReduce
  reference link / The background of Hadoop, MapReduce
  about / MapReduce
  Map phase / MapReduce
MapReduce API
  about / Components of Hadoop, Computation
MapReduce driver source code
  reference link / Morphline commands
MapReduce job
  about / Walking through a run of a MapReduce job
  startup / Startup
  input, splitting / Splitting the input
  task assignment / Task assignment
  task startup / Task startup
  JobTracker monitoring / Ongoing JobTracker monitoring
  mapper input / Mapper input
  mapper execution / Mapper execution
  mapper output / Mapper output and reducer input
  reducer input / Reducer input
  reducer execution / Reducer execution
  reducer output / Reducer output
  shutdown / Shutdown
  input/output / Input/Output
  InputFormat / InputFormat and RecordReader
  RecordReader / InputFormat and RecordReader
  Hadoop-provided InputFormat / Hadoop-provided InputFormat
  Hadoop-provided RecordReader / Hadoop-provided RecordReader
  OutputFormat / OutputFormat and RecordWriter
  RecordWriter / OutputFormat and RecordWriter
  Hadoop-provided OutputFormat / Hadoop-provided OutputFormat
  sequence files / Sequence files
MapReduce programs
  writing / Writing MapReduce programs, Getting started
  examples, running / Running the examples
  WordCount example / WordCount, the Hello World of MapReduce
  word co-occurrences / Word co-occurrences
  social network topics / Trending topics
  reference link, for HashTagCount example source code / Trending topics
  TopN pattern / The TopN pattern
  reference link, for TopTenHashTag source code / The TopN pattern
  hashtags / Sentiment of hashtags
  reference link, for HashTagSentiment source code / Sentiment of hashtags
  text cleanup, chain mapper used / Text cleanup using chain mapper
  reference link, for HashTagSentimentChain source code / Text cleanup using chain mapper
Massively Parallel Processing (MPP)
  about / The architecture of Impala
MemPipeline
  about / MemPipeline
Message Passing Interface (MPI) / Computation in Hadoop 2
MLlib
  about / MLlib
monitoring
  about / Monitoring
  Hadoop / Hadoop – where failures don't matter
  application-level metrics / Application-level metrics
monitoring tools
  about / Monitoring integration
MorphlineDriver source code
  reference link / Morphline commands
Morphline commands
  kite-morphlines-core-stdio / Morphline commands
  kite-morphlines-core-stdlib / Morphline commands
  kite-morphlines-avro / Morphline commands
  kite-morphlines-json / Morphline commands
  kite-morphlines-hadoop-parquet-avro / Morphline commands
  kite-morphlines-hadoop-sequencefile / Morphline commands
  kite-morphlines-hadoop-rcfile / Morphline commands
  reference link / Morphline commands
MR Execution Engine / An overview of Pig
Multipart Upload
  URL / Getting data into EMR
N
NameNode
  about / Storage in Hadoop 2, NameNode and DataNode
NameNode HA
  about / Storage in Hadoop 2
NameNode startup
  about / NameNode startup
NFS share / Keeping the HA NameNodes in sync
NodeManager
  about / ResourceManager, NodeManager, and ApplicationManager
NodeManager (NM)
  about / The components of YARN
O
Oozie
  about / Introducing Oozie
  URL / Introducing Oozie
  features / Introducing Oozie
  action nodes / Introducing Oozie
  HDFS file permissions / A note on HDFS file permissions
  development, making easier / Making development a little easier
  data, extracting / Extracting data and ingesting into Hive
  data, ingesting into Hive / Extracting data and ingesting into Hive
  workflow directory structure / A note on workflow directory structure
  HCatalog / Introducing HCatalog
  sharelib / The Oozie sharelib
  HCatalog and partitioned tables / HCatalog and partitioned tables
  using / Pulling it all together
Oozie triggers / Other Oozie triggers
Oozie workflow
  about / Introducing Oozie
operations, Hadoop 2
  about / Operations in the Hadoop 2 world
opinion lexicon
  URL / Sentiment of hashtags
Optimized Row Columnar file format (ORC)
  about / ORC
  reference link / ORC
ORC
  URL / Columnar stores
org.apache.zookeeper.ZooKeeper class
  about / Java API
OutputFormat, MapReduce job
  about / OutputFormat and RecordWriter
P
parallelDo operation
  about / Concepts
PARALLEL operator
  about / Aggregation
Parquet
  reference link / Parquet
  about / Parquet
  URL / Columnar stores
partitioning, Java API to MapReduce
  about / Partitioning
  optional partition function / The optional partition function
PCollection<T> interface, Crunch
  about / Concepts
physical cluster
  building / Building a physical cluster
physical cluster, considerations
  about / Physical layout
  rack awareness / Rack awareness
  service layout / Service layout
  service, upgrading / Upgrading a service
Pig
  overview / An overview of Pig
  use cases / An overview of Pig
  about / Getting started, Why SQL on Hadoop
  running / Running Pig
  reference link, for source code and binary distributions / Running Pig
  Grunt / Grunt – the Pig interactive shell
  Elastic MapReduce / Elastic MapReduce
  fundamentals / Fundamentals of Apache Pig
  reference link, for parallel feature / Fundamentals of Apache Pig
  reference link, for multi-query implementation / Fundamentals of Apache Pig
  programming / Programming Pig
  data types / Pig data types
  functions / Pig functions
  data, working with / Working with data
Piggybank
  about / Piggybank
Pig Latin / An overview of Pig
Pig UDFs
  extending / Extending Pig (UDFs)
  contributed UDFs / Contributed UDFs
pipelines implementation, Apache Crunch
  about / Pipelines implementation and execution
  SparkPipeline / SparkPipeline
  MemPipeline / MemPipeline
positive_words operator
  about / Join
prerequisites
  about / Prerequisites
Predictive Model Markup Language (PMML) / Cascading
processing models, YARN
  Cloudera Kitten / Thinking in layers
  Apache Twill / Thinking in layers
programmatic interfaces
  about / Programmatic interfaces
  JDBC / JDBC
  Thrift / Thrift
Project Rhino
  URL / The future of Hadoop security
PTable<Key, Value> interface, Crunch
  about / Concepts
Python
  used, for programmatic access / Programmatic access with Python
Python API
  about / Python API
Q
QJM mechanism
  about / Keeping the HA NameNodes in sync
queries, Hive / Queries
R
RDDs
  about / Cluster computing with working sets, Resilient Distributed Datasets (RDDs)
RDDs, operations
  map / Actions
  filter / Actions
  reduce / Actions
  collect / Actions
  foreach / Actions
  groupByKey / Actions
  sortByKey / Actions
Record abstractions
  implementing / Concepts
RecordReader, MapReduce job
  about / InputFormat and RecordReader
RecordWriter, MapReduce job
  about / OutputFormat and RecordWriter
Reduce function
  about / MapReduce
reduce optimization, cluster tuning
  considerations / Map and reduce optimizations
Reducer class, Java API to MapReduce
  about / The Reducer class
reducer execution, MapReduce job
  about / Reducer execution
reducer input, MapReduce job
  about / Reducer input
reducer output, MapReduce job
  about / Reducer output
reducers, Mapper and Reducer implementations
  IntSumReducer / Hadoop-provided mapper and reducer implementations
  LongSumReducer / Hadoop-provided mapper and reducer implementations
  IdentityReducer / Hadoop-provided mapper and reducer implementations
reference data, Java API to MapReduce
  sharing / Sharing reference data
REGISTER operator
  about / Extending Pig (UDFs)
required services, AWS
  Simple Storage Service (S3) / Signing up for the necessary services
  Elastic MapReduce / Signing up for the necessary services
  Elastic Compute Cloud (EC2) / Signing up for the necessary services
ResourceManager
  about / ResourceManager, NodeManager, and ApplicationManager
  applications / Applications
  Nodes view / Nodes
  Scheduler window / Scheduler
  MapReduce / MapReduce
  MapReduce v1 / MapReduce v1
  MapReduce v2 (YARN) / MapReduce v2 (YARN)
  JobHistoryServer / JobHistoryServer
resources
  sharing / Sharing resources
Role Based Access Control (RBAC) / Beyond basic authorization
Row Columnar File (RCFile)
  about / RCFile
  reference link / RCFile
S
S3
  Hive, using with / Hive and S3
s3distcp
  URL / Getting data into EMR
s3n / Hadoop filesystems
Samza
  about / Apache Samza
  URL / Apache Samza, Stream processing with Samza
  YARN-independent frameworks / YARN-independent frameworks
  used, for stream processing / Stream processing with Samza
  working / How Samza works
  architecture / Samza high-level architecture
  Apache Kafka / Samza's best friend – Apache Kafka
  integrating, with YARN / YARN integration
  independent model / An independent model
  Hello Samza / Hello Samza!
  tweet parsing job, building / Building a tweet parsing job
  configuration file / The configuration file
  URL, for configuration options / The configuration file
  Twitter data, getting into Apache Kafka / Getting Twitter data into Kafka
  HDFS / Samza and HDFS
  window function, adding / Windowing functions
  multijob workflows / Multijob workflows
  tweet sentiment analysis, performing / Tweet sentiment analysis
  tasks processing / Stateful tasks
  and Spark Streaming, comparing / Comparing Samza and Spark Streaming
Samza, layers
  streaming / Samza high-level architecture
  execution / Samza high-level architecture
  processing / Samza high-level architecture
Samza job
  executing / Running a Samza job
sbt
  URL / Getting started with Spark
Scala and Java source code, examples
  URL / Building and running the examples
Scala API
  about / Scala API
scalar data types
  int / Pig data types
  long / Pig data types
  float / Pig data types
  double / Pig data types
  chararray / Pig data types
  bytearray / Pig data types
  boolean / Pig data types
  datetime / Pig data types
  biginteger / Pig data types
  bigdecimal / Pig data types
Scala source code
  URL / Data processing on streams
SecondaryNameNode
  about / Secondary NameNode not to the rescue
  demerits / Secondary NameNode not to the rescue
secured cluster
  using, consequences / Consequences of using a secured cluster
security
  about / Security
sentiment analysis
  about / Sentiment of hashtags
SequenceFile
  about / General-purpose file formats
SequenceFile class, MapReduce job
  about / Sequence files
sequence files, MapReduce job
  about / Sequence files
  advantages / Sequence files
SerDe classes, Hive
  MetadataTypedColumnsetSerDe / File formats and storage
  ThriftSerDe / File formats and storage
  DynamicSerDe / File formats and storage
serialization
  about / Serialization and Containers
sharelib, Oozie
  about / The Oozie sharelib
SimpleDB
  about / SimpleDB and DynamoDB
Simple Storage Service (S3), AWS
  about / Simple Storage Service (S3)
  URL / Simple Storage Service (S3)
sources of information, Hadoop
  about / Sources of information
  source code / Source code
  mailing lists / Mailing lists and forums
  forums / Mailing lists and forums
  LinkedIn groups / LinkedIn groups
HUGs/HUGsconferences/Conferences
Sparkabout/ApacheSparkURL/ApacheSpark
SparkContextobject/ScalaAPISparkPipeline
about/SparkPipelineSparkSQL
about/SparkSQLdataanalysiswith/DataanalysiswithSparkSQL
SparkStreamingURL/SparkStreamingabout/SparkStreamingandSamza,comparing/ComparingSamzaandSparkStreaming
specialized join
  reference link / Join
speed of thought analysis / A different philosophy
SQL
  on data streams / SQL on data streams
  on data streams, URL / SQL on data streams
SQL-on-Hadoop
  need for / Why SQL on Hadoop
  solutions / Other SQL-on-Hadoop solutions
Sqoop
  about / Sqoop
  URL / Sqoop
Sqoop 1
  about / Sqoop
Sqoop 2
  about / Sqoop
standalone applications, Apache Spark
  writing / Writing and running standalone applications
  running / Writing and running standalone applications
statements
  about / Fundamentals of Apache Pig
Stinger initiative
  about / Stinger initiative
storage
  about / Storage
storage, Hadoop 2
  about / Storage in Hadoop 2
storage, Hive
  about / File formats and storage
  columnar stores / Columnar stores
Storm
  URL / How Samza works
  about / How Samza works
stream.py
  reference link / Programmatic access with Python
stream processing
  with Samza / Stream processing with Samza
streams
  data, processing on / Data processing on streams
systems management tools
  Cloudera Manager, integrating with / Cloudera Manager and other management tools
T

table partitioning
  about / Partitioning a table
  data, overwriting / Overwriting and updating data
  data, updating / Overwriting and updating data
  bucketing / Bucketing and sorting
  sorting / Bucketing and sorting
  data, sampling / Sampling data
Tajo
  URL / Drill, Tajo, and beyond
  about / Drill, Tajo, and beyond
tasks processing, Samza
  about / Stateful tasks
term frequency
  about / Calculate term frequency
  calculating, with TF-IDF / Calculate term frequency
text attribute, entity
  about / Tweet metadata
Text files
  about / General-purpose file formats
Tez
  about / Tez
  URL / Tez, Stinger initiative
  reference link, for canonical WordCount example / Tez
  Hive-on-tez / Hive-on-tez
  / An overview of Pig
TF-IDF
  about / Finding important words in text
  definition / Finding important words in text
  term frequency, calculating / Calculate term frequency
  document frequency, calculating / Calculate document frequency
  implementing / Putting it all together – TF-IDF
Thrift
  about / Thrift
TOBAG(expression) function / The tuple, bag, and map functions
TOMAP(expression) function / The tuple, bag, and map functions
tools, data lifecycle management
  orchestration services / Tools to help
  connectors / Tools to help
  file formats / Tools to help
TOP(n, column, relation) function / The tuple, bag, and map functions
TOTUPLE(expression) function / The tuple, bag, and map functions
troubleshooting
  about / Troubleshooting
tuples
  about / Fundamentals of Apache Pig
Tweet, structure
  reference link / Anatomy of a Tweet
tweet analysis capability
  building / Building a tweet analysis capability
  tweet data, obtaining / Getting the tweet data
  Oozie / Introducing Oozie
  derived data, producing / Producing derived data
tweet sentiment analysis
  performing / Tweet sentiment analysis
  bootstrap streams / Bootstrap streams
Twitter
  used, for generating dataset / Data processing with Hadoop
  URL / Data processing with Hadoop
  about / Why Twitter?
  signup page / Twitter credentials
  web form / Twitter credentials
Twitter data, properties
  unstructured / Why Twitter?
  structured / Why Twitter?
  graph / Why Twitter?
  geolocated / Why Twitter?
  real time / Why Twitter?
Twitter Search
  URL / Trending topics
Twitter stream
  analyzing / Analyzing the Twitter stream
  prerequisites / Prerequisites
  dataset exploration / Dataset exploration
  tweet metadata / Tweet metadata
  data preparation / Data preparation
  top n statistics / Top n statistics
  datetime manipulation / Datetime manipulation
  sessions / Sessions
  users' interaction, capturing / Capturing user interactions
  link analysis / Link analysis
  influential users, identifying / Influential users
U

union operation
  about / Concepts
updateFunc function / State management
User Defined Aggregate Functions (UDAFs) / Extending HiveQL
User Defined Functions (UDFs) / An overview of Pig, Extending HiveQL
  about / Fundamentals of Apache Pig
User Defined Table Functions (UDTF) / Extending HiveQL
V

versioning, Hadoop
  about / A note on versioning
VirtualBox
  reference link / Cloudera QuickStart VM
VMware
  reference link / Cloudera QuickStart VM
W

Whir
  about / Whir
  URL / Whir
Who to Follow service
  reference link / Influential users
window function
  adding / Windowing functions
WordCount
  in Java / WordCount in Java
WordCount example, MapReduce programs
  about / WordCount, the Hello World of MapReduce
  reference link, for source code / Word co-occurrences
workflow-app
  about / Introducing Oozie
workflow.xml file
  reference link / Extracting data and ingesting into Hive
workflows
  building, Oozie used / Pulling it all together
workloads
  Hive tables, structuring for / Structuring Hive tables for given workloads
wrapper classes
  about / Introducing the wrapper classes
WritableComparable interface
  about / The Comparable and WritableComparable interfaces
Writable interface
  about / The Writable interface
Y

YARN
  about / Computation in Hadoop 2, YARN, YARN in the real world – Computation beyond MapReduce
  architecture / YARN architecture
  components / The components of YARN
  processing frameworks / Thinking in layers
  processing models / Thinking in layers
  issues, with MapReduce / The problem with MapReduce
  Tez / Tez
  Apache Spark / Apache Spark
  Apache Samza / Apache Samza
  future / YARN today and beyond
  present situation / YARN today and beyond
  Samza, integrating / YARN integration
  Apache Spark on / Spark on YARN
  examples, running on / Running the examples on YARN
  URL / Running the examples on YARN
YARN API
  about / Thinking in layers
YARN application
  anatomy / Anatomy of a YARN application
  ApplicationMaster (AM) / Anatomy of a YARN application
  lifecycle / Lifecycle of a YARN application
  fault tolerance / Fault tolerance and monitoring
  monitoring / Fault tolerance and monitoring
  execution models / Execution models
Z

ZooKeeperFailoverController (ZKFC) / Automatic NameNode failover
ZooKeeper quorum / Automatic NameNode failover