Combine Query Language and Data Flow Language for Data Science
![Page 1: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/1.jpg)
1 © Hortonworks Inc. 2011–2016. All Rights Reserved
Spark SQL + Pig-Latin: Combine Query Language and Data Flow Language for Data Science
Jeff Zhang ([email protected]) May 16, 2017
![Page 2: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/2.jpg)
Who am I
- ASF Member, working in the ASF for almost 8 years
- Committer of Apache Tez, Pig & Zeppelin
- Works at Hortonworks
![Page 3: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/3.jpg)
Data Science
Data Science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.
- Describe what happens
- Explain what happens
- Predict what will happen
![Page 4: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/4.jpg)
Data Science
[Diagram labels: Collect Data, Data Munging, Data Analysis, Insight, Product; online, offline]
![Page 5: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/5.jpg)
Data Munging
- Collect and Transform Server Log
  - User Agent Normalization
  - Robot Detection
  - Sessionize
- Move data from Database to HDFS
- Collect and Transform Social Media Data
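As an illustration of the "Sessionize" step listed above, here is a minimal Python sketch. The slides do not define how sessions are cut, so the 30-minute inactivity gap, the `(user, timestamp)` event shape, and the function name `sessionize` are all assumptions made for this example.

```python
from collections import defaultdict

# Assumed inactivity threshold; the slides do not specify one.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp) events into per-user sessions.

    A new session starts whenever two consecutive events of the
    same user are more than `gap` seconds apart.
    """
    # Collect the timestamps of each user.
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] > gap:
                # Gap too large: open a new session.
                user_sessions.append([ts])
            else:
                # Still within the current session.
                user_sessions[-1].append(ts)
        sessions[user] = user_sessions
    return sessions
```

Calling `sessionize([("a", 0), ("a", 600), ("a", 4000)])` would split user `a` into two sessions, because the last event arrives more than 30 minutes after the previous one.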
![Page 6: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/6.jpg)
Data Munging
Before Data Munging / After Data Munging
![Page 7: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/7.jpg)
Data Analysis
- Combine different sources of data and apply statistical methods and BI tools to get insight
  - Web Traffic Metrics
  - User Segmentation Analysis
  - A/B Test
![Page 8: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/8.jpg)
Data Munging vs Data Analysis

| | Data Munging | Data Analysis |
|---|---|---|
| Data Source | Messy, Structured/Unstructured, Unorganized | Clean, Structured, Organized |
| Stability | Regular, Stable | Ad-hoc |
| Tools | Python, Spark, Hadoop, etc. | R, Python, SQL, etc. |

Do you have to be a full-stack big data engineer to do data science?
What if you are a data analyst without much programming skill?
![Page 9: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/9.jpg)
Data Science Infrastructure
![Page 10: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/10.jpg)
What is Spark
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads.
![Page 11: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/11.jpg)
What is Apache Pig
- Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its jobs in MapReduce, Apache Tez, or Apache Spark.
  - Ease of programming
  - Optimization opportunities
  - Extensibility
![Page 12: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/12.jpg)
Word Count
Load → ForEach → Group → ForEach → Order → Store
Using SQL?
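The data flow on this slide can be mirrored step by step in plain Python. This is only a sketch of the same pipeline shape, not the Pig-Latin script from the talk (the transcript does not include it); the function name `word_count`, the whitespace tokenization, and the in-memory sample lines are assumptions.

```python
from collections import Counter


def word_count(lines):
    """Mirror the Load -> ForEach -> Group -> ForEach -> Order flow."""
    # ForEach: split every line into words and flatten the result.
    words = [word for line in lines for word in line.split()]
    # Group + ForEach: group identical words and count each group.
    counts = Counter(words)
    # Order: sort by descending count, then alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Load: an in-memory list stands in for an input file.
    sample = ["pig latin and spark sql", "spark sql and pig"]
    # Store: printing stands in for writing the result back out.
    for word, count in word_count(sample):
        print(word, count)
```

Each stage consumes the output of the previous one, which is the property that tends to keep a data flow script readable as it grows.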
![Page 13: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/13.jpg)
Pig-Latin vs SQL

| | SQL | Pig-Latin |
|---|---|---|
| Language Type | Query Language: de facto standard; unreadable for long scripts | Data Flow Language: more readable for long scripts |
| Data Source | Structured Data | Structured/Unstructured |
| Integration | Integrated with most BI tools | Very few BI tools integrated with Pig-Latin |

Conclusion
- Pig-Latin for Data Munging
- SQL for Data Analysis
![Page 14: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/14.jpg)
Pig-Latin + Spark SQL
[Diagram labels: Load, Data Munging, Store, Spark DataFrame / Table, Spark SQL, Data Analysis]
![Page 15: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/15.jpg)
Spark Table (bank)
Pig Latin
SQL
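The division of labor proposed here, a data flow step that cleans the data followed by a SQL step that queries the resulting table, can be sketched in plain Python. To keep the example self-contained, the standard-library `sqlite3` module stands in for Spark SQL, and a chain of generators stands in for Pig-Latin. Only the table name `bank` comes from the slide; the columns (`age`, `job`, `balance`) and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw records; the transcript does not show the demo data.
RAW_ROWS = [
    " 30, management ,1500",
    "45,technician,",            # missing balance: dropped during munging
    "30,management,500",
    "not-a-number,services,10",  # malformed age: dropped during munging
]


def munge(raw_rows):
    """Data flow step: parse, trim, and filter rows one stage at a time."""
    split = (row.split(",") for row in raw_rows)
    trimmed = ([field.strip() for field in fields] for fields in split)
    complete = (f for f in trimmed if len(f) == 3 and all(f))
    numeric = (
        f for f in complete
        if f[0].isdigit() and f[2].lstrip("-").isdigit()
    )
    return [(int(age), job, int(balance)) for age, job, balance in numeric]


def analyze(clean_rows):
    """Query step: store the cleaned rows as a table and query it with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bank (age INTEGER, job TEXT, balance INTEGER)")
    conn.executemany("INSERT INTO bank VALUES (?, ?, ?)", clean_rows)
    result = conn.execute(
        "SELECT job, COUNT(*), SUM(balance) FROM bank GROUP BY job ORDER BY job"
    ).fetchall()
    conn.close()
    return result
```

Running `analyze(munge(RAW_ROWS))` cleans the rows first and then aggregates them per job. In the setup described in the talk, the first function would correspond to a Pig-Latin script and the second to a Spark SQL query over the stored table.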
![Page 16: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/16.jpg)
Integrate Spark into Pig
[Diagram labels: Pig-Latin, Logical Plan, Physical Plan, Execution Plan, Execution Engine]
![Page 17: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/17.jpg)
Where to run Pig-Latin & Spark SQL (Zeppelin)
Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.
![Page 18: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/18.jpg)
Zeppelin Architecture
- JVM: Zeppelin Server
- JVM: Pig Interpreter Group (Pig-Latin, Spark SQL)
- JVM: Spark Interpreter Group (Scala, Python, R)
![Page 19: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/19.jpg)
Demo
![Page 20: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/20.jpg)
Data Science Infrastructure (Recap)
![Page 21: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/21.jpg)
Current Status & What's Next
- Status
  - PIG-5080 (Support store alias as Spark table)
  - ZEPPELIN-2232 (Support Spark SQL for Pig Interpreter)
- Next
  - Integrate Spark MLlib in Pig
  - Use DataFrame API instead of RDD API to integrate Spark with Pig
  - Integrate Pig with other Spark APIs, like R, Python
![Page 22: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/22.jpg)
Q&A
![Page 23: Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data](https://reader036.vdocuments.mx/reader036/viewer/2022070713/5ed2245f5e0ec842bd789929/html5/thumbnails/23.jpg)
Thank You