Spark
(CISL)
Spark
• UC Berkeley, AMPLab, Matei Zaharia
• 2010: BSD license
• 2014: Apache project
• 863
Spark
• Resilient Distributed Datasets (RDDs)
• RDD operations: map, filter, join, reduce, groupBy, …
• RDD lineage
Spark
val logs = spark.textFile("hdfs://…")
val errMsgs = logs.map(_.split(","))
  .filter(_(0) == "ERROR")
  .map(_(1))
errMsgs.cache()
errMsgs.filter(_.contains("foo")).count()
// Input file: header line LEVEL,MSG followed by rows such as INFO,msg1 and ERROR,msg2
[Diagram: the Driver sends tasks to two Executors and receives results; RDD transforms turn logs1 into msgs1 on one Executor and logs2 into msgs2 on the other.]
Spark
• Spark SQL: structured data processing
• Spark Streaming: stream processing
• MLlib: machine learning
• GraphX: graph processing
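Spark SQL
• DataFrames
val errMsgs = sqlCtx.read.format("csv")
  .option("header", "true")   // first line holds the column names LEVEL, MSG
  .load("hdfs://…")
  .where("LEVEL = 'ERROR'")
  .select("MSG")
errMsgs.cache()
errMsgs.where("MSG like '%foo%'").count()
// Input file: header line LEVEL,MSG followed by rows such as INFO,msg1 and ERROR,msg2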
• Cluster managers: Hadoop YARN, Mesos, standalone
• Storage: HDFS, Cassandra, Azure, S3
Spark
Spark
• Azure Data Lake
• Spark
• HDInsight Spark
[Diagram: Azure Data Lake. Labels: Store (unstructured, semi-structured, structured; WebHDFS), Analytics (analytics service, HDInsight clusters; YARN), C#.]
Spark
• Spark
• .NET
Bing
• ("FastSML")
• TB
• (Operational Intelligence)
• …
Bing: FastSML
[Diagram: RawEvents (Click, UI-Layout, …) → Kafka → Databus → EventMergePipeline (10-minute app-time window) → MergedEvent → Kafka]
FastSML + Spark?
Apache Storm (SCP.Net) + Kafka + Microsoft's
• Spark?
• FastSML?
• C#
Spark + .NET
.NET • C# Spark • .NET • Spark .NET
Spark + .NET = Mobius!!!
Mobius
• 2015-08
• CISL, ASG (Bing): 5
• 2015-11
• MIT license
• V1.5.2, V1.6.0
• 4: V1.6.1
Mobius @ GitHub
• Apache Spark Wiki
• http://github.com/Microsoft/Mobius
• 758, 2
• 131, 4K
Mobius
C# Spark
Word Count
Scala
val textFile = spark.textFile("hdfs://…")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://…")
C#
var textFile = sparkContext.TextFile(@"hdfs://…");
var counts = textFile.FlatMap(line => line.Split(' '))
  .Map(word => new KeyValuePair<string, int>(word, 1))
  .ReduceByKey((x, y) => x + y)
  .Map(wc => string.Format("{0},{1}", wc.Key, wc.Value));
counts.SaveAsTextFile(@"hdfs://…");
Mobius
• Spark
• JVM–CLR (.NET VM)
• Spark: JVM
• C#: CLR
• PySpark, SparkR
[Diagram: the Driver pairs a C# Driver (CLR) with a Java Driver (JVM); each of three Workers pairs a C# Worker (CLR) with a Spark Executor (JVM). Each pair exchanges method calls and results over IPC sockets.]
Driver-side Interop - DataFrame
[Diagram legend: Java/Scala component, C# component]
CSharpRunner (JVM), called by sparkclr-submit.cmd:
1. Launches CSharpBackend, a Netty server creating a proxy for JVM calls
2. Launches the C# sub-process: Driver (user code)
3. Driver: SqlContext init
4. Invokes JVM method to create context
5. SqlContext (Spark) is created; the C# SqlContext has a reference to the SC in the JVM
6. Driver: createDF
7. Invokes JVM method to create DF
8. Use jsc and create DF in JVM: DataFrame (Spark)
9. DataFrame: the C# DF has a reference to the DF in the JVM
12. Invokes method on DF
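The driver-side flow comes down to one side holding an opaque reference to an object that lives in the JVM and asking a backend to invoke methods on it by name. Below is a minimal sketch of that idea in plain Scala. It assumes a hypothetical in-process `ObjectRegistry` in place of CSharpBackend and its Netty transport; `ObjectRegistry`, `RemoteRef` and `Counter` are illustration names, not Mobius classes.

```scala
import scala.collection.mutable

// Toy class playing the role of an object that lives on the JVM side.
class Counter {
  private var sum = 0
  def add(n: Integer): Integer = { sum += n.intValue; sum }
  def total(): Integer = sum
}

// Hypothetical stand-in for the JVM-side backend: it keeps the real objects
// in a table and hands out string ids that a remote proxy can hold on to.
object ObjectRegistry {
  private val objects = mutable.Map.empty[String, AnyRef]
  private var nextId = 0

  def register(obj: AnyRef): String = synchronized {
    nextId += 1
    val id = s"obj-$nextId"
    objects(id) = obj
    id
  }

  // Look the target up by id and call a public method via reflection.
  // Methods are matched by name and argument count only (enough for a sketch).
  def invoke(id: String, method: String, args: AnyRef*): AnyRef = synchronized {
    val target = objects(id)
    val m = target.getClass.getMethods
      .find(c => c.getName == method && c.getParameterCount == args.length)
      .getOrElse(throw new NoSuchMethodException(method))
    m.invoke(target, args: _*)
  }
}

// Hypothetical proxy: what the non-JVM side would hold instead of the object.
final class RemoteRef(val id: String) {
  def call(method: String, args: AnyRef*): AnyRef =
    ObjectRegistry.invoke(id, method, args: _*)
}

object DriverInteropSketch {
  def main(args: Array[String]): Unit = {
    val ref = new RemoteRef(ObjectRegistry.register(new Counter))
    ref.call("add", Int.box(2))
    ref.call("add", Int.box(3))
    println(ref.call("total")) // prints 5
  }
}
```

In a real bridge the id, method name and arguments would be serialized and sent over a socket; the registry lookup and reflective call would stay on the JVM side.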
Executor-side Interop - RDD
[Diagram legend: Scala component, C# component]
• Spark Executor: Spark calls Compute()
• Launch the C# Worker executable as a sub-process
• Serialize data and the user-implemented C# lambda and send through socket
• C# Worker: serialize processed data and send through socket
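The executor-side exchange is a byte protocol over a socket: one side writes serialized records, the worker applies the user function and writes serialized results back. Here is a minimal sketch in Scala, assuming a hypothetical length-prefixed framing and using in-memory streams in place of the socket. The real Mobius wire format is not given in these slides and may differ. The user function is passed directly here rather than serialized.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.nio.charset.StandardCharsets.UTF_8

object ExecutorInteropSketch {
  // Assumed framing: a 4-byte record count, then each record as <4-byte length><bytes>.
  def writeRecords(out: DataOutputStream, records: Seq[Array[Byte]]): Unit = {
    out.writeInt(records.length)
    records.foreach { r => out.writeInt(r.length); out.write(r) }
    out.flush()
  }

  def readRecords(in: DataInputStream): Seq[Array[Byte]] = {
    val n = in.readInt()
    (0 until n).map { _ =>
      val buf = new Array[Byte](in.readInt())
      in.readFully(buf)
      buf
    }
  }

  // Stand-in for the worker process: deserialize, apply the user function, serialize.
  def worker(in: DataInputStream, out: DataOutputStream, f: String => String): Unit = {
    val results = readRecords(in).map(b => f(new String(b, UTF_8)).getBytes(UTF_8))
    writeRecords(out, results)
  }

  // Stand-in for the executor side: send one partition, read back the processed partition.
  def roundTrip(partition: Seq[String], f: String => String): Seq[String] = {
    val toWorker = new ByteArrayOutputStream()
    writeRecords(new DataOutputStream(toWorker), partition.map(_.getBytes(UTF_8)))

    val fromWorker = new ByteArrayOutputStream()
    worker(
      new DataInputStream(new ByteArrayInputStream(toWorker.toByteArray)),
      new DataOutputStream(fromWorker),
      f)

    readRecords(new DataInputStream(new ByteArrayInputStream(fromWorker.toByteArray)))
      .map(b => new String(b, UTF_8))
  }

  def main(args: Array[String]): Unit =
    println(roundTrip(Seq("INFO,msg1", "ERROR,msg2"), _.toUpperCase))
}
```

Every record is copied and encoded twice per round trip, which is the serialization cost the performance slides refer to.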
• Driver
• SparkR
• Netty server, JVM
• Worker
• PySpark
Mobius
• One CSharpWorker per Spark executor
• DataFrame operations without C# UDFs do not need CSharpWorker
• Same codegen in Spark Core
• C# / Java
[Diagram: inside an Executor, Transformation #1 and Transformation #2 each pass data between the Java executor and the C# Worker, with serialization/deserialization (SER/DE) at each crossing.]
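The SER/DE steps in the figure are the cost that pipelining targets: if consecutive transformations are fused before the data leaves the JVM, a partition crosses the boundary once instead of once per transformation. The rough Scala sketch below only counts boundary crossings. `crossBoundary` is a hypothetical helper standing in for "serialize, send, process, send back"; none of this is Mobius code.

```scala
object PipeliningSketch {
  // Number of simulated JVM <-> worker round trips so far.
  var crossings = 0

  // Hypothetical stand-in for one round trip across the process boundary.
  def crossBoundary[A, B](data: Seq[A])(f: Seq[A] => Seq[B]): Seq[B] = {
    crossings += 1
    f(data)
  }

  // Each transformation is shipped on its own: three round trips.
  def unpipelined(lines: Seq[String]): Seq[String] = {
    val split = crossBoundary(lines)(_.map(_.split(",")))
    val errors = crossBoundary(split)(_.filter(_(0) == "ERROR"))
    crossBoundary(errors)(_.map(_(1)))
  }

  // The same three transformations fused into one function: one round trip.
  def pipelined(lines: Seq[String]): Seq[String] = {
    val fused: Seq[String] => Seq[String] =
      _.map(_.split(",")).filter(_(0) == "ERROR").map(_(1))
    crossBoundary(lines)(fused)
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("INFO,msg1", "ERROR,msg2")
    crossings = 0
    unpipelined(lines)
    println(s"unpipelined crossings: $crossings")
    crossings = 0
    pipelined(lines)
    println(s"pipelined crossings: $crossings")
  }
}
```

Both versions return the same messages; only the number of crossings differs.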
Mobius
• Mobius • • •
• Mobius • •
Mobius
• Spark, C#
Mobius
• Roslyn
• .NET
Mobius
• Roslyn
• Roslyn C# Interpreter
[Diagram: Assembly Service and Mobius SparkCtx in front of a Spark cluster with one Master and three Workers.]
(Mobius as a service)
• Spark
• Pay-by-Job
[Diagram: clients (Jupyter, Zeppelin, Shell, …) talk over a REST message protocol to a hosted service (Mobius Service, Mobius Shell), which drives a Spark cluster with one Master and three Workers.]
Mobius
• API
[Diagram: a Routing Layer in front of several Mobius Service Endpoints; each endpoint exposes an Interpreter API, backed by the Mobius/Spark shell or a Hive shell.]
Mobius
• (Elasticity), (Fault Tolerance)
• Mobius Service endpoints
[Diagram: a Routing Layer in front of several Mobius Service Endpoints, each with an Interpreter API, sharing a Persistent Store.]
• Mobius
• Mobius, Spark
• @ github.com/Microsoft/Mobius
• Any contribution is welcome!
• Jupyter/Zeppelin/…
• Puppet/Chef/…
• …
(CISL)
• Mobius
• Yarn++
• Apache REEF
• Rayon
• Tiered Storage
• …
• CISL • [email protected]
Interactive Log Analysis
• Massive Cosmos log analysis: several PBs per day
• Rapid iterative drill-downs for DRIs to diagnose issues
[Diagram: alerting and triage workflow. Labels: Customer (or other DRIs); AutoPilot Watchdog Alerts; DRI (Designated Responsible Individual); Perf Counters (examine 1 hour, 2 hours … 14 days); Architects/Developers/Other DRIs; RDP to Cosmos machines (open individual connections for troubleshooting); Vendors/Secondary DRIs (eavesdrop and document incident); Scope Studio.]
Spark?
C# API for Spark
[Diagram: Spark apps in C# sit on the C# API, next to the Scala/Java API, SparkR and PySpark, all on top of Apache Spark.]
Mobius Service Differentiators
• Better fault tolerance: session persistence and replay
[Diagram: a Routing Layer in front of several Mobius Service Endpoints that share a Persistent Store.]
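One way to read "session persistence and replay": every command a session runs is appended to the persistent store, so a replacement endpoint can rebuild the session state by replaying the log. The sketch below illustrates that reading in Scala. The command format (`set x 3`, `add x 2`) and the `PersistentStore` and `Endpoint` classes are assumptions made for the example, not the Mobius service design.

```scala
import scala.collection.mutable

object SessionReplaySketch {
  // Hypothetical persistent store: an append-only command log per session id.
  final class PersistentStore {
    private val logs = mutable.Map.empty[String, Vector[String]]
    def append(session: String, cmd: String): Unit =
      logs(session) = logs.getOrElse(session, Vector.empty) :+ cmd
    def history(session: String): Vector[String] =
      logs.getOrElse(session, Vector.empty)
  }

  // Hypothetical endpoint serving one session; its interpreter state is a map of Int variables.
  final class Endpoint(store: PersistentStore) {
    private val vars = mutable.Map.empty[String, Int]

    private def eval(cmd: String): Unit = cmd.split(" ") match {
      case Array("set", name, v) => vars(name) = v.toInt
      case Array("add", name, v) => vars(name) = vars.getOrElse(name, 0) + v.toInt
      case _ => throw new IllegalArgumentException(s"unknown command: $cmd")
    }

    // Run a command, then persist it so another endpoint can replay it later.
    def execute(session: String, cmd: String): Unit = {
      eval(cmd)
      store.append(session, cmd)
    }

    // Rebuild state on a fresh endpoint by replaying the persisted log in order.
    def recover(session: String): Unit = store.history(session).foreach(eval)

    def value(name: String): Option[Int] = vars.get(name)
  }

  def main(args: Array[String]): Unit = {
    val store = new PersistentStore
    val first = new Endpoint(store)
    first.execute("s1", "set x 3")
    first.execute("s1", "add x 2")

    // Simulate losing `first`: a new endpoint picks the session up from the store.
    val second = new Endpoint(store)
    second.recover("s1")
    println(second.value("x")) // prints Some(5)
  }
}
```

Replay is only safe when commands are deterministic; commands with side effects would need extra care.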
Mobius Service Differentiators
• Elasticity and fault tolerance
• Auto-scale # of Mobius Service endpoints
Driver-side Interop - DataFrame
[Diagram legend: Java/Scala component, C# component]
CSharpRunner (JVM), called by sparkclr-submit.cmd:
1. Launches CSharpBackend, a Netty server creating a proxy for JVM calls
2. Launches the C# sub-process: Driver (user code)
3. Driver: SqlContext init
4. Invokes JVM method to create context
5. SqlContext (Spark) is created; the C# SqlContext has a reference to the SC in the JVM
6. Driver: createDF
7. Invokes JVM method to create DF
8. Use jsc and create DF in JVM: DataFrame (Spark)
9. DataFrame: the C# DF has a reference to the DF in the JVM
10. Driver: operation on the DataFrame
11. Invokes JVM method
12. Invokes method on DF
Executor-side Interop - RDD
[Diagram legend: Scala component, C# component]
• CSharpRDD: Spark calls Compute()
• Launch the C# Worker executable as a sub-process
• Serialize data and the user-implemented C# lambda and send through socket
• C# Worker: serialize processed data and send through socket
CSharpRDD is used only when customized C# code is used in a transformation
Mobius Performance Considerations
1. One CSharpWorker process for each JVM executor process
   • PySpark forks a Python process for each task thread in the JVM executor process
   • New C# option of one thread for each task thread in the JVM executor process
2. C# operations are pipelined when possible
   • Map & Filter RDD operations in C# need data to be passed from the JVM to C#, incurring the cost of serialization and deserialization
   • C# operations are pipelined when possible to minimize data passing
3. DataFrame operations without C# UDFs do not require CSharpWorker
   • Same execution plan optimization and code generation in Spark Core
   • Perform the same as Scala applications
Mobius Service Differentiators
• Elasticity and fault tolerance
• Use virtual actor model
• Auto-scale # of Mobius Service endpoints
[Diagram: a Routing Layer in front of several Mobius Service Endpoints that share a Persistent Store.]
Linux Support
• Mono and CoreCLR (ongoing) for Mobius on Linux
• GitHub project uses Travis for CI in Ubuntu 14.04.3 LTS
• Unit tests and samples (functional tests) are run
• [email protected]
CSharpRDD
• C#, CLR
• C# => JVM
• RDD<byte[]>
• C# worker
• Transformations are pipelined when possible