An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

Post on 01-Jul-2015

DESCRIPTION

An introduction to the Hadoop ecosystem and Cloudera. Presented to the Nashville Cloudera User Group on October 23, 2014.

TRANSCRIPT

1. An Introduction to Hadoop and Cloudera
  Nashville Cloudera User Group, 10/23/14
  Ian Wrigley, Director, Educational Curriculum
  ian@cloudera.com
  @iwrigley
  201405
  Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

2. Presentation Topics: An Introduction to Hadoop and Cloudera
  • The Motivation for Hadoop
  • Core Hadoop: HDFS and MapReduce
  • CDH and the Hadoop Ecosystem
  • Data Storage: HBase
  • Data Integration: Flume and Sqoop
  • Data Processing: Spark
  • Data Analysis: Hive, Pig, and Impala
  • Data Exploration: Cloudera Search
  • Managing Everything: Cloudera Manager
  • Conclusion

3. Traditional Large-Scale Computation
  • Traditionally, computation has been processor-bound
    ◦ Relatively small amounts of data
    ◦ Lots of complex processing
  • The early solution: bigger computers
    ◦ Faster processor, more memory
    ◦ But even this couldn't keep up

4. Distributed Systems
  • The better solution: more computers
    ◦ Distributed systems use multiple machines for a single job
  "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, we didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." (Grace Hopper)
  [Diagram labels: Database, Hadoop Cluster]

5. Distributed Systems: Challenges
  • Challenges with distributed systems
    ◦ Programming complexity
    ◦ Keeping data and processes in sync
    ◦ Finite bandwidth
    ◦ Partial failures

6. Distributed Systems: The Data Bottleneck (1)
  • Traditionally, data is stored in a central location
  • Data is copied to processors at runtime
  • Fine for limited amounts of data

7. Distributed Systems: The Data Bottleneck (2)
  • Modern systems have much more data
    ◦ terabytes+ a day
    ◦ petabytes+ total
  • We need a new approach

8. Hadoop
  • A radical new approach to distributed computing
    ◦ Distribute data when the data is stored
    ◦ Run computation where the data is stored
9. Hadoop: Very High-Level Overview
  • Data is split into blocks when loaded
  • Each task typically works on a single block
    ◦ Many run in parallel
  • A master program manages tasks
  [Diagram labels: Slave Nodes, Master]

10. Core Hadoop Concepts
  • Applications are written in high-level code
  • Nodes talk to each other as little as possible
  • Data is distributed in advance
    ◦ Bring the computation to the data
  • Data is replicated for increased availability and reliability
  • Hadoop is scalable and fault-tolerant

11. Scalability
  • Adding nodes adds capacity proportionally
  • Increasing load results in a graceful decline in performance
    ◦ Not failure of the system
  [Chart: Capacity against Number of Nodes]

12. Fault Tolerance
  • Node failure is inevitable
  • What happens?
    ◦ System continues to function
    ◦ Master re-assigns tasks to a different node
    ◦ Data replication = no loss of data
    ◦ Nodes which recover rejoin the cluster automatically
  "Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure." (Ken Arnold, CORBA designer)

13. Presentation Topics
  [Agenda slide, same list as slide 2]

14. HDFS Basic Concepts
  • The Hadoop Distributed File System (HDFS) is a filesystem written in Java
    ◦ Sits on top of a native filesystem
  • Provides storage for massive amounts of data
    ◦ Scalable
    ◦ Fault tolerant
    ◦ Supports efficient processing with MapReduce, Spark, and other tools
  [Diagram labels: Hadoop Cluster, HDFS]

15. How Files are Stored (1)
  • Data files are split into blocks and distributed to data nodes
  [Diagram: Very Large Data File split into Block 1, Block 2, Block 3]

16. How Files are Stored (2)
  • Data files are split into blocks and distributed to data nodes
  [Diagram: Very Large Data File split into Block 1, Block 2, Block 3; Block 1 shown three more times]

17. How Files are Stored (3)
  • Data files are split into blocks and distributed to data nodes
  • Each block is replicated on multiple nodes (default 3x)
  [Diagram: Very Large Data File split into Block 1, Block 2, Block 3; each block shown three more times]
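The splitting and replication described on the "How Files are Stored" slides can be illustrated with a small sketch. This is not how HDFS is implemented: the function names are hypothetical, the 128 MB block size is an assumption (the slides do not state a block size), and the round-robin placement is a deliberate simplification rather than HDFS's actual placement policy. Only the default replication factor of 3 comes from the slides.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Split a file into numbered blocks.

    Returns a list of (block_id, size_mb) tuples. The 128 MB default
    is an assumption for illustration, not a value from the slides.
    """
    blocks = []
    offset = 0
    block_id = 1
    while offset < file_size_mb:
        size = min(block_size_mb, file_size_mb - offset)
        blocks.append((block_id, size))
        offset += size
        block_id += 1
    return blocks


def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Uses simple round-robin placement (a simplification, not the
    real HDFS policy). Returns {block_id: [node, ...]}.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, (block_id, _size) in enumerate(blocks):
        placement[block_id] = [
            nodes[(i + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Usage: a 300 MB file on a five-node cluster.
blocks = split_into_blocks(300)
placement = place_replicas(blocks, ["A", "B", "C", "D", "E"])
for block_id, holders in placement.items():
    print(f"Block {block_id}: {', '.join(holders)}")
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held, which is the property the Fault Tolerance slide relies on.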
18. How Files are Stored (4)
  • Data files are split into blocks and distributed to data nodes
  • Each block is replicated on multiple nodes (default 3x)
  • NameNode stores metadata
  [Diagram: Very Large Data File split into Block 1, Block 2, Block 3, each replicated; NameNode holding Metadata: information about files and blocks]

19. Example: Storing and Retrieving Files (1)
  NameNode metadata:
    /logs/031512.log: B1, B2, B3
    /logs/041213.log: B4, B5
    B1: A, B, D
    B2: B, D, E
    B3: A, B, C
    B4: A, B, E
    B5: C, E, D
  Client lookup: /logs/041213.log? Answer: B4, B5
  [Diagram: Client, NameNode, and Nodes A to E holding blocks 1 to 5]

20. Example: Storing and Retrieving Files (2)
  [Same metadata and diagram as slide 19]

21. Important Notes About HDFS
  • HDFS performs best with a modest number of large files
    ◦ Millions, rather than billions, of files
    ◦ Each file typically 100MB or more
  • Files in HDFS are "write once"
    ◦ Files can be replaced but not changed

22. MapReduce
  • The Mapper
    ◦ Each Map task (typically) operates on a single HDFS block
    ◦ Map tasks (usually) run on the node where the block is stored
  • Shuffle and Sort
    ◦ Sorts and consolidates intermediate data from all mappers
    ◦ Happens after all Map tasks are complete and before Reduce tasks start
  • The Reducer
    ◦ Operates on shuffled/sorted intermediate data (Map task output)
    ◦ Produces final output
  [Diagram labels: Map, Shuffle and Sort, Reduce]
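The three MapReduce phases listed above can be sketched as a single-process word count. This is only an illustration of the data flow, under the assumption that each input string stands in for one HDFS block. It does not use the Hadoop API, runs nothing in parallel, and all function names are hypothetical.

```python
from collections import defaultdict


def mapper(block):
    """Map phase: emit a (word, 1) pair for every word in one block."""
    for word in block.split():
        yield (word.lower(), 1)


def shuffle_and_sort(pairs):
    """Group all intermediate values by key and sort by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reducer(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return (key, sum(values))


def run_job(blocks):
    """Run map on every block, then shuffle/sort, then reduce."""
    intermediate = []
    for block in blocks:  # in Hadoop, one Map task per block
        intermediate.extend(mapper(block))
    return [reducer(key, values)
            for key, values in shuffle_and_sort(intermediate)]


# Usage: two "blocks" of text.
for word, count in run_job(["the cat sat", "the dog sat"]):
    print(word, count)
```

In a real cluster the Map tasks would run on the nodes that hold each block, and only the intermediate pairs would cross the network during the shuffle.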
23. Presentation Topics
  [Agenda slide, same list as slide 2]

24. The Hadoop Ecosystem (1)
  [Diagram labels: CDH; Hadoop Ecosystem: Sqoop, Impala, Hive, Pig, HBase, Flume, Oozie; Hadoop Core Components: MapReduce, Hadoop Distributed File System]

25. The Hadoop Ecosystem (2)
  • CDH includes many Hadoop Ecosystem components
  • Following are more details on some of the key components
  [Diagram labels: Hadoop Ecosystem: Sqoop, Impala, Hive, Pig, HBase, Flume, Oozie]

26. CDH
  • CDH (Cloudera's Distribution, including Apache Hadoop)
    ◦ 100% open source, enterprise-ready distribution of Hadoop and related projects
    ◦ The most complete, tested, and widely-deployed distribution of Hadoop
    ◦ Integrates all key Hadoop ecosystem projects

27. Presentation Topics
  [Agenda slide, same list as slide 2]
28. HBase: The Hadoop Database
  • HBase: database layered on top of HDFS
    ◦ Provides interactive access to data
  • Stores massive amounts of data
    ◦ Petabytes+
  • High throughput
    ◦ Thousands of writes per second (per node)
  • Handles sparse data well
    ◦ No wasted space for a row with empty columns
  • Limited access model
    ◦ Optimized for lookup of a row by key rather than full queries
    ◦ No transactions: single row operations only
  [Diagram label: HDFS]

29. HBase vs RDBMS
                    RDBMS    HBase
  Transactions      Yes      Single row only
  Query language    SQL      get/p
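The access model described on the HBase slide (lookup of a row by key, sparse rows, single-row operations) can be illustrated with a toy in-memory table. This is not the HBase client API: the class and its `put` and `get` methods are hypothetical names chosen for this sketch, and the "family:qualifier" column names are only an example.

```python
class SparseTable:
    """Toy key-value table: each row stores only the columns it has."""

    def __init__(self):
        self._rows = {}  # row_key -> {column: value}

    def put(self, row_key, column, value):
        """Write one cell; touches a single row only."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Look up one row by key; returns a copy of its cells."""
        return dict(self._rows.get(row_key, {}))


# Usage: two rows with different columns; empty columns take no space.
table = SparseTable()
table.put("user1", "info:name", "Ada")
table.put("user2", "info:email", "grace@example.com")
print(table.get("user1"))
print(table.get("user2"))
```

Note what the sketch leaves out on purpose: there is no way to query by column value or to update several rows atomically, which mirrors the "limited access model" bullet.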
