Parallel and Distributed Workflow Programming with Silk

1. Parallel and Distributed Workflow Programming with Silk. Taro L. Saito, University of Tokyo, Department of Computational Biology. 2013-11-28. xerial.org/silk

Upload: taro-l-saito

Post on 15-Jan-2015


DESCRIPTION

An introduction to Silk, a framework that makes parallel and distributed programming easy. https://github.com/xerial/silk

TRANSCRIPT

2. Silk: Smart Cluster Computing for Data Scientists. B = A.map(f). [Diagram: A -> f -> B.] Examples: SAM -> BAM; RNA-Seq -> FPKM.
3. [Diagram: A -> f -> B.]
4. Input: FASTQ file(s), 500GB (50x coverage, 200 million entries). f: an alignment program. Output: alignment results, 750GB (sequence + alignment data). Total storage space required: 1.2TB. Computational time required: 1 day (using hundreds of CPUs). [Diagram: Input -> f -> Output.] University of Tokyo Genome Browser (UTGB).
5. [Diagram: workflow over data sets A, B, C, D, E with functions f and g.]
6. WormTSS: http://wormtss.utgenome.org/. Gaussian model; TSS; motif; ChIP-Seq.
7. 1000; Excel; R, JFreeChart; review.
8. TSS.
9. Makefile; TSS; Makefile; 1000; Makefile.
10. [Diagram: workflow over data sets A, B, C, D, E, F, G with functions f and g.]
11. Silk. Scala; map, filter, reduce, join, sort; UNIX; Silk[A]; Scala (Twitter Inc.); JVM; Scala; Web; Silk.
12. CPU.
13. C++: MPI, OpenMP, threads (pthread, boost thread); mutex, condition variables; compare-and-swap (CAS). Java: java.util.concurrent. Scala: parallel collections; Actor; mailbox (buffer); message-passing model. OS; TCP/IP, sockets; UNIX; ssh; NFS, GlusterFS, GFS; I/O; index; Hadoop, HDFS, HBase, etc.; Paxos consensus protocol.
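Slide 13 lists the low-level tools (threads, java.util.concurrent, and so on) that a hand-written parallel program depends on, while slide 2 expresses the same computation as B = A.map(f). As a rough illustration of that gap, here is a minimal single-machine sketch of a hand-written parallel map built on java.util.concurrent. The names ManualParallelMap, parallelMap, and numThreads are hypothetical and are not part of Silk; the sketch says nothing about how Silk itself schedules work.

```scala
import java.util.concurrent.{Callable, Executors}

object ManualParallelMap {
  // Hypothetical helper (not a Silk API): applies f to every element of
  // `input` on a fixed-size thread pool and returns results in input order.
  def parallelMap[A, B](input: Vector[A], numThreads: Int)(f: A => B): Vector[B] = {
    require(numThreads > 0, "numThreads must be positive")
    val pool = Executors.newFixedThreadPool(numThreads)
    try {
      // Split the input into roughly numThreads contiguous chunks.
      val chunkSize = math.max(1, math.ceil(input.size.toDouble / numThreads).toInt)
      // Submit one task per chunk; each task maps its chunk sequentially.
      val futures = input.grouped(chunkSize).toVector.map { chunk =>
        pool.submit(new Callable[Vector[B]] {
          override def call(): Vector[B] = chunk.map(f)
        })
      }
      // Block on every task and concatenate the chunks in submission order.
      futures.flatMap(_.get())
    } finally {
      pool.shutdown()
    }
  }
}
```

Even this small version has to deal with chunking, task submission, result ordering, and pool shutdown by hand, which is the bookkeeping that a single A.map(f) call hides.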
14. map(f: A => B); flatMap(f: A => Seq[B]); filter(pred: A => Boolean); reduce(op: (A, A) => A); join(B, paramA, paramB); groupBy(k: A => Key). [Diagram: f and g applied to A and B and to partitions A0/B0, A1/B1, A2/B2, with results R.]
15. CPU.
16. Scala; sam; chr1.
17. Program v1. [Diagram: A -> f -> B.]
    val B = A.map(f)
18. Program v1, Program v2. [Diagram: A -> f -> B -> g -> C.]
    val B = A.map(f)
    val C = B.map(g)
    Program v1, Program v2: B;
    val C = B.map(g)
19. CPU; Makefile.
20. Program v1, Program v2. [Diagram: A -> f -> B -> g -> C.]
    val fileB = result/B.obj
    val B = if(!fileB.exists) { val tmp = A.map(f); tmp.saveTo(fileB); tmp } else load(fileB)
    val fileC = result/C.obj
    val C = if(!fileC.exists) { }
21. Program v1, Program v2. [Diagram: A -> f -> B -> g -> C.]
    val B = A.map(f)
    val C = B.map(g)
    B; Scala Macro.
22. Program v1, Program v2. [Diagram: A -> f -> B -> g -> C.]
    val B = A.map(f)
    val C = B.map(g)
    Silk:
    val B = MapOp(input:A, output:B, function:f)
    val C = MapOp(input:B, output:C, function:g)
    val C = MapOp(input:MapOp(input:A, output:B, function:f), output:C, function:g)
    Silk; C; B; A.map(f).
23. Silk. https://github.com/xerial/silk; $HOME/.silk/hosts; silk cluster start; ZooKeeper; SilkClient; SilkMaster.
24. Cluster.
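Slide 22 shows each program line recorded as an operation object (MapOp with an input, an output name, and a function), so that nested operations form a tree, and slides 18 to 21 show Program v2 reusing B from Program v1. The following is a minimal sketch of that idea in plain Scala. It is not the real Silk API: Op, Source, Evaluator, computedNames, and the explicit name argument are all assumptions made for illustration. Slide 21 mentions Scala macros, which presumably supply the variable name automatically; here it is passed by hand.

```scala
object OpTreeSketch {
  // Hypothetical, simplified stand-in for Silk[A]: a node in an operation tree.
  sealed trait Op[T] {
    def name: String                        // output variable name, e.g. "B"
    def compute(ev: Evaluator): Seq[T]      // how to produce this node's data
    // Building the tree does not run anything; it only records the operation.
    def map[U](name: String)(f: T => U): Op[U] = MapOp(name, this, f)
  }

  // A leaf holding input data.
  final case class Source[T](name: String, data: Seq[T]) extends Op[T] {
    def compute(ev: Evaluator): Seq[T] = data
  }

  // Mirrors the slide's MapOp(input, output, function).
  final case class MapOp[T, U](name: String, input: Op[T], f: T => U) extends Op[U] {
    def compute(ev: Evaluator): Seq[U] = ev.eval(input).map(f)
  }

  // Evaluates trees and caches each result under its variable name,
  // so a later program that refers to B does not recompute it.
  final class Evaluator {
    private val cache = scala.collection.mutable.Map.empty[String, Seq[Any]]
    private var computed = Vector.empty[String]
    def computedNames: Vector[String] = computed

    def eval[T](op: Op[T]): Seq[T] = cache.get(op.name) match {
      case Some(hit) => hit.asInstanceOf[Seq[T]]   // reuse the earlier result
      case None =>
        val result = op.compute(this)
        cache(op.name) = result
        computed :+= op.name
        result
    }
  }

  val A: Op[Int] = Source("A", Seq(1, 2, 3))
  val B: Op[Int] = A.map("B")(_ + 1)    // Program v1
  val C: Op[Int] = B.map("C")(_ * 10)   // added in Program v2
}
```

Evaluating B and then C with the same Evaluator computes A and B once; the second call only adds C, which is the reuse that the Program v1 / Program v2 slides illustrate. Caching by name in memory is a simplification chosen for this sketch; slide 20 shows the same pattern written by hand with files.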
25. silk-sbt plugin. silk-sbt plugin; SBT (simple build tool for Scala); silk; memory; cluster; qsub; Makefile.
26. silk-bootstrap. https://github.com/xerial/silk-bootstrap
27. silk eval (class name):(function name)
28. Distributed Sorting.
29. Distributed Sorting in Cluster: Sampling Sort.
30. Distributed Sorting: Shuffle, Reduce.
31. In-memory sort - OutOfMemory.
32-34. UNIX. Makefile; Silk; ref; network transfer -> Snappy decompression -> object deserialization -> reduce; reducer; Scheduler; spilling.
35. => session; session; silk.id; session; branch; git, mercurial; session; NFS, glusterfs, local disk; ZooKeeper (path); GB; -> scatter; local disk; glusterfs; HDFS (replication = 1).
36. Web UI. Silk WebUI.
37. Object-oriented Workflow Programming.
38. Data Parallel; Dataflow Processing; Pig; Dryad (Microsoft); SQL: Hive, Shark, DryadLINQ; Dremel (Google); summingbird (Twitter); Iterative Processing (loop); MapReduce (Google 2003); Spark; Differential/Incremental computing; Nova: Continuous Pig/Hadoop Workflow (C. Olston, SIGMOD 2011); Naiad (McSherry, Microsoft, 2013). Silk: Workflow + Programming; programming distributed workflows.
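Slides 29 and 30 name sampling sort followed by shuffle and reduce steps as the approach to distributed sorting, but the slide text itself did not survive extraction. As a reminder of how the general technique works, here is a minimal single-process sketch in which each inner Seq stands in for the data held by one node. The names SamplingSortSketch, samplingSort, numBuckets, sampleSize, and seed are hypothetical; this is the textbook sample sort, not necessarily the variant implemented in Silk.

```scala
import scala.util.Random

object SamplingSortSketch {
  // Returns numBuckets sorted buckets whose concatenation is globally sorted.
  def samplingSort(partitions: Seq[Seq[Int]], numBuckets: Int,
                   sampleSize: Int, seed: Long): Seq[Seq[Int]] = {
    require(numBuckets > 0 && sampleSize > 0)
    val rnd = new Random(seed)

    // 1. Sampling: draw up to sampleSize elements from every partition.
    val samples = partitions.flatMap(p => rnd.shuffle(p).take(sampleSize)).sorted

    // 2. Choose numBuckets - 1 splitters at evenly spaced sample positions.
    val splitters: Vector[Int] =
      if (samples.isEmpty) Vector.empty
      else (1 until numBuckets).map(i => samples(i * samples.size / numBuckets)).toVector

    // 3. Shuffle: send each element to the bucket that covers its key range.
    //    An element's bucket index is the number of splitters <= that element.
    def bucketOf(x: Int): Int = splitters.count(_ <= x)
    val shuffled: Map[Int, Seq[Int]] = partitions.flatten.groupBy(bucketOf)

    // 4. Reduce: sort each bucket independently (one reducer per bucket).
    (0 until numBuckets).map(i => shuffled.getOrElse(i, Seq.empty[Int]).sorted)
  }
}
```

Because every element in bucket i is smaller than every element in bucket i + 1, the buckets can be sorted on different machines with no further coordination. The quality of the sample only affects how evenly the buckets are sized, which relates to the out-of-memory concern raised on slide 31.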
39. Spark + Mesos. Spark http://www.spark-project.org/; Mesos http://incubator.apache.org/mesos/ (2009~); (2009); Scala; High-level API; CPU; Spark; Hadoop; Mesos; offer; Consensus problem; (group membership), (leader election) -> Paxos; 2-phase commit, 3-phase commit; ZooKeeper.
40. https://github.com/xerial/silk
41. Scala Cookbook. Scala; http://xerial.org/scala-cookbook/; 15.
42. Silk; TODO; CPU; version2; Apache Mesos.