introduction to impala

Download Introduction to Impala

Post on 26-Jan-2015

107 views

Category:

Engineering

3 download

Embed Size (px)

DESCRIPTION

Impala presentation by Mark Grover at GlueCon in Broomfield, CO on May 22, 2014

TRANSCRIPT

  • 1. Impala: A Modern, Open-Source SQL Engine for Hadoop MarkGrover So+wareEngineer,Cloudera May22nd,2014 Twi

2. WhatisHadoop? WhatisImpala? Use-casesforImpala ArchitectureofImpala Impalacomparisonsandperformance Demo(FmepermiRng) Agenda 3. IntrotoHadoop 4. WhatisApacheHadoop? Has the Flexibility to Store and Mine Any Type of Data Ask questions across structured and unstructured data that were previously impossible to ask or solve Not bound by a single schema Excels at Processing Complex Data Scale-out architecture divides workloads across multiple nodes Flexible file system eliminates ETL bottlenecks Scales Economically Can be deployed on commodity hardware Open source platform guards against vendor lock Hadoop Distributed File System (HDFS) Self-Healing, High Bandwidth Clustered Storage MapReduce Distributed Computing Framework Apache Hadoop is an open source platform for data storage and processing that is Distributed Fault tolerant Scalable CORE HADOOP SYSTEM COMPONENTS 5. MapReduce-thegoodandthebad TheGood VersaFle Flexible Scalable TheBad Highlatency Batchoriented Notallparadigmstvery well Onlyfordevelopers 6. MRishardandonlyfordevelopers Higherlevelpla]ormsforconverFngdeclaraFve syntaxtoMapReduce SQLHive workowlanguagePig BuildontopofMapReduce(althoughtheyarebeing mademorepluggablenow) ButjobsaresFllasslowasMapReduce WhatareHiveandPig? 7. Impala 8. General-purposeSQLengine Real-FmequeriesinApacheHadoop BetaversionreleasedsinceOctober2012 Generalavailability(v1.0)releaseoutsinceApril2013 OpensourceunderApachelicense Latestrelease(v1.3.1)releasedonMay1st,2014 WhatisImpala? 9. ImpalaOverview:Goals General-purposeSQLqueryengine: WorksforbothforanalyFcalandtransacFonal/single-row workloads Supportsqueriesthattakefrommillisecondstohours RunsdirectlywithinHadoop: readswidelyusedHadoopleformats talkstowidelyusedHadoopstoragemanagers runsonsamenodesthatrunHadoopprocesses Highperformance: C++insteadofJava runFmecodegeneraFon completelynewexecuFonengineNoMapReduce 10. UserViewofImpala:Overview Runsasadistributedserviceincluster:oneImpaladaemonon eachnodewithdata Highlyavailable:nosinglepointoffailure 11. UserViewofImpala:Overview ThereisnoImpalaformat! Supportedleformats: uncompressed/lzo-compressedtextles sequencelesandRCFilewithsnappy/gzipcompression Avrodatales Parquetcolumnarformat(moreonthatlater) HBase 12. UserViewofImpala:SQL SQLsupport: essenFallySQL-92,minuscorrelatedsubqueries onlyequi-joins;nonon-equijoins,nocrossproducts OrderByrequiresLimit (Limited)DDLsupport SQL-styleauthorizaFonviaApacheSentry(incubaFng) UDFsandUDAFsaresupported 13. UserViewofImpala:SQL FuncFonallimitaFons: Noleformats,SerDes nobeyondSQL(buckets,samples,transforms,arrays, structs,maps,xpath,json) BroadcastjoinsandparFFonedhashjoinssupported SmallertablehastotinaggregatememoryofallexecuFng nodes 14. UseCasesofImpala 15. ImpalaUseCases Interactive BI/analytics on more data Asking new questions exploration, ML Data processing with tight SLAs Query-able archive w/full fidelity Cost-effective, ad hoc query environment that offloads/replaces the data warehouse for: 16. GlobalFinancialServicesCompany Saved 90% on incremental EDW spend & improved performance by 5x Offload data warehouse for query-able archive Store decades of data cost-effectively Process & analyze on the same system Improved capabilities through interactive query on more data 17. DigitalMediaCompany 20x performance improvement for exploration & data discovery Easily identify new data sets for modeling Interact with raw data directly to test hypotheses Avoid expensive DW schema changes Accelerate time to answer 18. ArchitectureofImpala 19. ImpalaArchitecture Threebinaries:impalad,statestored,catalogd Impaladaemon(impalad)Ninstances handlesclientrequestsandallinternalrequestsrelatedto queryexecuFon Statestoredaemon(statestored)1instance ProvidesnameserviceandmetadatadistribuFon Catalogdaemon(catalogd)1instance Relaysmetadatachangestoallimpalads 20. ImpalaArchitecture:QueryExecuFon Requestarrivesviaodbc/jdbc QueryPlanner QueryExecutor HDFSDN HBase SQLApp ODBC QueryPlanner QueryCoordinator QueryExecutor HDFSDN HBase QueryPlanner QueryExecutor HDFSDN HBase SQL request QueryCoordinator QueryCoordinator HiveMeta store HDFSNN Statestore + Catalogd 21. ImpalaArchitecture:QueryExecuFon PlannerturnsrequestintocollecFonsofplanfragments CoordinatoriniFatesexecuFononremoteimpalad's QueryPlanner QueryCoordinator QueryExecutor HDFSDN HBase SQLApp ODBC QueryPlanner QueryCoordinator QueryExecutor HDFSDN HBase QueryPlanner QueryCoordinator QueryExecutor HDFSDN HBase HiveMeta store HDFSNN Statestore + Catalogd 22. ImpalaArchitecture:QueryExecuFon Intermediateresultsarestreamedbetweenimpalad'sQuery resultsarestreamedbacktoclient QueryPlanner QueryCoordinator QueryExecutor HDFSDN HBase SQLApp ODBC QueryPlanner QueryCoordinator QueryExecutor HDFSDN HBase QueryPlanner QueryCoordinator QueryExecutor HDFSDN HBase query results HiveMeta store HDFSNN Statestore + Catalogd 23. QueryPlanning:Overview 2-phaseplanningprocess: single-nodeplan:le+-deeptreeofplanoperators planparFFoning:parFFonsingle-nodeplantomaximizescanlocality, minimizedatamovement ParallelizaFonofoperators: Allqueryoperatorsarefullydistributed 24. Query Planning:Single-NodePlan Planoperators:Scan,HashJoin,HashAggregaFon,Union, TopN,Exchange 25. Single-NodePlan:ExampleQuery SELECTt1.cusFd, SUM(t2.revenue)ASrevenue FROMLargeHdfsTablet1 JOINLargeHdfsTablet2ON(t1.id1=t2.id) JOINSmallHbaseTablet3ON(t1.id2=t3.id) WHEREt3.category='Online' GROUPBYt1.cusFd ORDERBYrevenueDESCLIMIT10; 26. Query Planning:Single-NodePlan HashJoin Scan: t1 Scan: t3 Scan: t2 HashJoin TopN Agg Single-nodeplanforexample: 27. QueryPlanning:DistributedPlans HashJoinScan: t1 Scan: t3 Scan: t2 HashJoin TopN Pre-Agg MergeAgg TopN Broadcast Broadcast hash t2.idhash t1.id1 hash t1.custid at HDFS DN at HBase RS at coordinator 28. MetadataHandling Impalametadata: Hivesmetastore:logicalmetadata(tabledeniFons, columns,CREATETABLEparameters) HDFSNamenode:directorycontentsandblockreplica locaFons HDFSDataNode:blockreplicasvolumeIDs 29. ImpalaExecuFonEngine Wri