Scala 20140715

Post on 26-Jan-2015


Slide 1: Intro to Apache Spark / Intro to Scala
- Intro to Apache Spark: fast cluster computing engine for Hadoop
- Intro to Scala: object-oriented and functional language for the Java Virtual Machine
- ACM SIGKDD, 7/9/2014
- Roger Huang, Lead System Architect
- rohuang@visa.com, rog4096@yahoo.com, @BigDataWrangler

Slide 2: About me: Roger Huang
- Visa: Digital & Mobile Products Architecture, Strategic Projects & Infrastructure
  - Search infrastructure, customer segmentation, logging framework, Splunk on Hadoop (Hunk), real-time monitoring, data
- PayPal: Java infrastructure

Slide 3: Different perspectives on an elephant: Scala

Slide 4: Outline
- Spark in the Hadoop ecosystem
- Scala background
- Why Scala?
  - For the computer scientist
  - For the Java / OO programmer
  - For the Spark developer
  - For the Big Data developer
  - For the Big Data scientist / mathematician
  - For the system architect

Slide 5: Spark in the Hadoop ecosystem

Slide 6: Spark ecosystem of software projects
- Spark [Ognen]: APIs for Scala, Python [Robert], Java
- SQL: Shark (Hive + Spark) [Roger], SparkSQL (alpha)
- Machine Learning Library (MLlib) [Omar]: clustering, classification (binary classification, linear regression), recommendations
- Spark Streaming [Chance]
- GraphX [Srini]

Slide 7: Resilient Distributed Dataset (RDD)
- A fault-tolerant collection of elements, partitioned across the nodes of the cluster, that can be operated on in parallel
- Data sources for RDDs:
  - Parallelized collections: from Scala collections
  - Hadoop datasets: from HDFS or any Hadoop-supported storage system (HBase, Amazon S3, ...); text files, SequenceFile, any Hadoop InputFormat
- Two types of operations:
  - Transformation: takes an existing dataset and creates a new one
  - Action: takes a dataset, runs a computation, and returns a value to the driver program
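The transformation/action split is easiest to see against Scala's own collection API, which Spark's RDD methods deliberately mirror. A minimal local sketch (plain Scala collections standing in for an RDD; a real RDD would come from something like `sc.parallelize`):

```scala
val data = List(1, 2, 3, 4, 5)

// "Transformations": each returns a new collection; the input is untouched.
// (On an RDD these would be lazy and distributed.)
val doubled = data.map(_ * 2)          // List(2, 4, 6, 8, 10)
val evens   = data.filter(_ % 2 == 0)  // List(2, 4)

// "Actions": each returns a plain value to the caller
// (on an RDD, back to the driver program).
val total = data.reduce(_ + _)         // 15
val n     = data.size                  // 5
```

On an actual RDD the transformations build a lineage graph and nothing executes until an action such as `reduce` or `collect` is called.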
Slide 8: (Some) RDD operations
- Transformations: map(func), filter(func), flatMap(func), mapPartitions(func), mapPartitionsWithIndex(func), sample(withReplacement, fraction, seed), union(otherDataset), distinct(), groupByKey(), reduceByKey(func), sortByKey(), join(otherDataset), cogroup(otherDataset), cartesian(otherDataset)
- Actions: reduce(func), collect(), count(), first(), take(n), takeSample(withReplacement, num, seed), saveAsTextFile(path), saveAsSequenceFile(path), countByKey(), foreach(func)

Slide 9: Scala background
- Scalable, object-oriented, functional language
- Version 2.11 (April 2014); runs on the Java Virtual Machine
- Martin Odersky (javac, Java generics)
- http://scala-lang.org/ (REPL), http://www.scala-lang.org/api/current, http://scala-ide.org/, http://www.scala-sbt.org/ (simple build tool)
- Who's using Scala? Twitter, LinkedIn, ... ("Powered by Scala"); Apache Spark, Apache Kafka, Akka, ...

Slide 10: Outline (repeat of slide 4, with "For the Hadoop/Spark developer")

Slide 11: Scala for the computer scientist: functional programming (FP)

Slide 12: Scala for the computer scientist: functional programming (FP)
- Math functions, e.g., f(x) = y
- A function has a single responsibility
- A function has no side effects
- A function is referentially transparent: it outputs the same value for the same inputs
- Functional programming expresses computation as the evaluation and composition of mathematical functions
- Avoid side effects and mutating state

Slide 13: Why functional programming?
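The properties on slide 12 can be made concrete with a small contrast between a pure function and one that depends on mutable state (names here are illustrative, not from the deck):

```scala
// Pure: same inputs always give the same output; no external state touched.
def area(w: Int, h: Int): Int = w * h

// Impure: the result depends on, and mutates, state outside the function.
var total = 0
def addToTotal(x: Int): Int = { total += x; total }

val a1 = area(3, 4)    // 12
val a2 = area(3, 4)    // still 12: referentially transparent
val t1 = addToTotal(5) // 5
val t2 = addToTotal(5) // 10: same input, different output
```

`area(3, 4)` can be replaced by `12` anywhere without changing program behavior; `addToTotal(5)` cannot, which is exactly what makes it unsafe to parallelize.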
- Multi-core processors and concurrency
- Computation as a series of independent data transformations
- Parallel data transformations without side effects
- Referential transparency

Slide 14: Scala for the computer scientist: functional programming
- Functions: lambdas, closures
- For-comprehensions
- Type inference
- Pattern matching
- Higher-order functions: map, flatMap, foldLeft, and more

Slide 15: FP: functions
- Anonymous function: a function without a name (a lambda). Example:
    scala> List(100, 200, 300) map { _ * 10 / 100 }
    res0: List[Int] = List(10, 20, 30)
- Closure (Wikipedia): a function together with a referencing environment, a table storing a reference to each of the non-local variables of that function. A closure allows a function to access those non-local variables even when invoked outside its immediate lexical scope.

Slide 16: FP: functions
applyPercentage is an example of a closure:
    scala> var percentage = 10
    percentage: Int = 10
    scala> val applyPercentage = (amount: Int) => amount * percentage / 100
    applyPercentage: Int => Int = <function1>
    scala> percentage = 20
    percentage: Int = 20
    scala> List(100, 200, 300) map applyPercentage
    res1: List[Int] = List(20, 40, 60)

Slide 17: FP: functions
- Anonymous function
- Closure

Slide 18: FP: higher-order functions
    scala> :load Person.scala
    Loading Person.scala...
    defined class Person
    scala> val jd = new Person("John", "Doe", 17)
    jd: Person = Person@372a6e85
    scala> val rh = new Person("Roger", "Huang", 34)
    rh: Person = Person@611c4041
    scala> val people = Array(jd, rh)
    people: Array[Person] = Array(Person@372a6e85, Person@611c4041)
    scala> val (minors, adults) = people partition (_.age < 18)
    minors: Array[Person] = Array(Person@372a6e85)
    adults: Array[Person] = Array(Person@611c4041)
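The Person.scala file loaded in that REPL session is not shown on the slide. A minimal, hypothetical reconstruction with only the members the session actually uses would be:

```scala
// Hypothetical sketch of Person.scala: just the fields the slide's
// REPL session touches (first name, last name, age).
class Person(val firstName: String, val lastName: String, val age: Int)

val jd = new Person("John", "Doe", 17)
val rh = new Person("Roger", "Huang", 34)
val people = Array(jd, rh)

// partition is a higher-order function: it takes the predicate
// (_.age < 18) and splits one collection into two.
val (minors, adults) = people partition (_.age < 18)
```

Because `Person` here is a plain class (not a case class), the REPL prints instances as `Person@372a6e85`-style object references, matching the slide's output.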
Slide 19: FP: higher-order functions
- A HOF takes a function as an argument and/or returns a function

Slide 20: FP: higher-order functions: map
- Creates a new collection from an existing collection by applying a function
- Anonymous function:
    scala> List(1, 2, 3) map { (x: Int) => x + 1 }
    res0: List[Int] = List(2, 3, 4)
- Function literal:
    scala> List(1, 2, 3) map { _ + 1 }
    res1: List[Int] = List(2, 3, 4)
- Passing an existing function:
    scala> def addOne(num: Int) = num + 1
    addOne: (num: Int)Int
    scala> List(1, 2, 3) map addOne
    res2: List[Int] = List(2, 3, 4)

Slide 21: FP: higher-order functions: map

Slide 22: FP: higher-order functions: flatMap

Slide 23: FP: for-comprehension
- Syntax: for (<enumerators>) [yield] <expression>
- Imperative form: does not return a value
    scala> val aList = List(1, 2, 3)
    aList: List[Int] = List(1, 2, 3)
    scala> val bList = List(4, 5, 6)
    bList: List[Int] = List(4, 5, 6)
    scala> for { a ...

Slide 26: FP: foldLeft
    scala> def add(a: Int, b: Int): Int = { a + b }
    add: (a: Int, b: Int)Int
    scala> numbers.foldLeft(0){ add }
    res0: Int = 55
    scala> numbers.foldLeft(0){ (acc, b) => acc + b }
    res1: Int = 55

Slide 27: FP: foldLeft

Slide 28: FP: find the last item in an array
    scala> val ns = Array(20, 40, 60)
    ns: Array[Int] = Array(20, 40, 60)
    scala> ns.foldLeft(ns.head) { (acc, b) => b }
    res0: Int = 60

Slide 29: FP: reverse an array with foldLeft
    scala> val ns = Array(20, 40, 60)
    ns: Array[Int] = Array(20, 40, 60)
    scala> ns.foldLeft(Array[Int]()) { (acc, b) => b +: acc }
    res1: Array[Int] = Array(60, 40, 20)

Slide 30: FP: reverse an array with foldLeft
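The flatMap slide carries no code, and the for-comprehension example is cut off in the transcript. A short sketch of both, and of how `for`/`yield` desugars into flatMap/map (the word-splitting example is illustrative, not from the deck):

```scala
// map keeps one output element per input; flatMap flattens nested results.
val words  = List("to be", "or not")
val mapped = words.map(_.split(" ").toList)     // List(List(to, be), List(or, not))
val flat   = words.flatMap(_.split(" ").toList) // List(to, be, or, not)

// A for-comprehension with yield is sugar for flatMap/map:
val aList = List(1, 2, 3)
val bList = List(4, 5, 6)
val pairs = for { a <- aList; b <- bList } yield (a, b)
// equivalent to: aList.flatMap(a => bList.map(b => (a, b)))
```

`pairs` contains all 9 combinations, starting with `(1, 4)`; without `yield`, the same comprehension would run its body only for side effects and return no value, which is the "imperative form" on slide 23.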
Slide 31: Outline (repeat of slide 4)

Slide 32: Scala for the Java / OO developer
- Interoperable with Java
- Case classes
- Mixins with traits

Slide 33: Scala for the Java / OO developer: case class
- Implements equals(), hashCode(), toString()
- Can be used in pattern matching

Slide 34: Scala for the Java / OO developer
- Java 8 streams: http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html
- map: <R> Stream<R> map(Function<? super T, ? extends R> mapper)
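Slide 33's bullet points can be sketched in a few lines; the `Point` class here is an illustrative example, not one from the deck:

```scala
// Case classes get equals(), hashCode(), toString(), and pattern
// matching support generated by the compiler.
case class Point(x: Int, y: Int)

val p = Point(1, 2)   // no 'new' needed: the companion apply() is generated
val q = Point(1, 2)

val structuralEq = (p == q)  // true: field-by-field equals(), not reference identity
val shown        = p.toString

// Pattern matching destructures the fields via the generated unapply()
val label = p match {
  case Point(0, 0) => "origin"
  case Point(x, _) => s"x is $x"
}
```

Compare this with slide 18, where the plain (non-case) `Person` class printed as `Person@372a6e85` and compared by reference.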