Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014

Intro to Apache Spark: Fast cluster computing engine for Hadoop

Intro to Scala: Object-oriented and functional language for the Java Virtual Machine

ACM SIGKDD, 7/9/2014

Roger Huang

Lead System Architect

[email protected]


@BigDataWrangler



http://spark.apache.org/



About me: Roger Huang

• Visa

– Digital & Mobile Products Architecture, Strategic Projects & infrastructure

– Search infrastructure

– Customer segmentation

– Logging Framework

– Splunk on Hadoop (Hunk)

– Real-time monitoring

– Data

• PayPal

– Java Infrastructure


Different perspectives on an elephant Scala


Outline

• Spark

– Hadoop ecosystem

• Scala

– Background

• Why Scala?

– For the computer scientist

– For the Java / OO programmer

– For the Spark developer

– For the Big Data developer

– For the Big Data scientist / mathematician

– For the system architect


Spark in the Hadoop ecosystem


Spark Ecosystem of Software Projects

• Spark [Ognen]

– APIs: Scala, Python [Robert], Java

• “SQL”

– Shark (Hive + Spark) [Roger]

– SparkSQL (alpha)

• Machine Learning Library (MLlib) [Omar]

– Clustering

– Classification and regression

• Binary classification

• Linear regression

– Recommendations

• Spark Streaming [Chance]

• GraphX [Srini]

• …


Resilient Distributed Dataset

• Fault tolerant collection of elements partitioned across the nodes of the cluster that can be operated on in parallel

• Data sources for RDDs

– Parallelized collections

• From Scala collections

– Hadoop datasets

• From HDFS or any Hadoop-supported storage system (HBase, Amazon S3, …)

• Text files, SequenceFile, any Hadoop InputFormat

• Two types of operations

– Transformation

• takes an existing dataset and creates a new one

– Action

• runs a computation on a dataset and returns a value to the driver program


(Some) RDD Operations

• Transformations

– map(func)

– filter(func)

– flatMap(func)

– mapPartitions(func)

– mapPartitionsWithIndex(func)

– sample(withReplacement, fraction, seed)

– union(otherDataset)

– distinct()

– groupByKey()

– reduceByKey(func)

– sortByKey()

– join(otherDataset)

– cogroup(otherDataset)

– cartesian(otherDataset)

• Actions

– reduce(func)

– collect()

– count()

– first()

– take(n)

– takeSample(withReplacement, num, seed)

– saveAsTextFile(path)

– saveAsSequenceFile(path)

– countByKey()

– foreach(func)

– …
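The split between transformations and actions can be sketched with a small word count. This is a minimal sketch, assuming spark-core is on the classpath; the application name, the local[*] master, and the sample sentences are made up for illustration. Transformations are lazy, so no work happens until the action at the end runs.

```scala
// Assumption: spark-core is on the classpath. local[*] runs Spark inside this JVM.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on older Spark 1.x releases

val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)

// Parallelized collection: an RDD built from a Scala collection.
val lines = sc.parallelize(Seq("to be or", "not to be"))

// Transformations: each returns a new RDD, and nothing is computed yet.
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Action: runs the job and returns the result to the driver program.
val result = counts.collect().toMap

sc.stop()
```

For a Hadoop dataset, sc.textFile(path) would take the place of sc.parallelize.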


Scala background

• Scalable, object-oriented, functional language

– Version 2.11 (4/2014)

• Runs on the Java Virtual Machine

• Created by Martin Odersky

– Author of the javac compiler

– Co-designer of Java generics

• http://scala-lang.org/, REPL

• http://www.scala-lang.org/api/current

• http://scala-ide.org/

• http://www.scala-sbt.org/, Simple build tool

• Who’s using Scala?

– Twitter, LinkedIn, …

• Powered by Scala

– Apache Spark, Apache Kafka, Akka,…



Outline

• Spark


• Scala

– Background

• Why Scala?



– For the Hadoop/Spark developer





Scala for the computer scientist: functional programming (FP)



• Math functions, e.g., f(x) = y

– A function has a single responsibility

– A function has no side effects

– A function is referentially transparent

• A function outputs the same value for the same inputs.

• Functional programming

– Expresses computation as the evaluation and composition of mathematical functions

– Avoids side effects and mutable state
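A minimal sketch of the difference between a function with side effects and a referentially transparent one. The names total, addToTotal, and add are hypothetical and chosen only for illustration.

```scala
// Impure: reads and mutates state outside the function, so repeated
// calls with the same argument return different values.
var total = 0
def addToTotal(x: Int): Int = { total += x; total }

// Pure: the result depends only on the arguments.
def add(a: Int, b: Int): Int = a + b

val first = addToTotal(5)              // 5
val second = addToTotal(5)             // 10: same input, different output
val pureTwice = (add(2, 3), add(2, 3)) // (5, 5): same input, same output
```

Any call to add(2, 3) can be replaced by 5 without changing the program. That substitution is not safe for addToTotal(5).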


Why functional programming?

• Multi-core processors

• Concurrency

– Computation as a series of independent data transformations

– Parallel data transformations without side effects

• Referential transparency
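One way to see why side-effect-free transformations parallelize easily is to run a pure function over a list with Futures. This is only a sketch using the Scala standard library; the function square and the 10-second timeout are arbitrary choices for the example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Pure function: safe to run on any thread, in any order.
def square(n: Int): Int = n * n

val inputs = List(1, 2, 3, 4)

// Future.traverse starts one task per element and keeps the input order.
val squaresF: Future[List[Int]] = Future.traverse(inputs)(n => Future(square(n)))

val squares = Await.result(squaresF, 10.seconds) // List(1, 4, 9, 16)
```

Because square touches no shared state, the parallel result equals inputs.map(square) regardless of scheduling.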


Scala for the computer scientist: functional programming

• Functions

– Lambda, closure

• For-comprehensions

• Type inference

• Pattern matching

• Higher order functions

– map, flatMap, foldLeft

• And more …


FP: functions

• Anonymous function

– Function without a name

– lambda function

• Example

– scala> List(100, 200, 300) map { _ * 10/100}

– res0: List[Int] = List(10, 20, 30)

• Closure (Wikipedia)

– Closure = A function, together with a referencing environment – a table storing a reference to each of the non-local variables of that function.

– A closure allows a function to access those non-local variables even when invoked outside its immediate lexical scope.


FP: functions

• applyPercentage is an example of a closure

– scala> var percentage = 10

– percentage: Int = 10

– scala> val applyPercentage = (amount: Int) => amount * percentage / 100

– applyPercentage: Int => Int = <function1>

– scala> percentage = 20

– percentage: Int = 20

– scala> List(100, 200, 300) map applyPercentage

– res1: List[Int] = List(20, 40, 60)

– scala>
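The session above closes over a var, so reassigning percentage changes what applyPercentage returns. A possible variation, sketched here with a hypothetical factory named percentageOf, captures an immutable value instead, so each closure keeps the rate it was created with.

```scala
// Returns a closure that captures `percentage` as an immutable parameter.
def percentageOf(percentage: Int): Int => Int =
  (amount: Int) => amount * percentage / 100

val tenPercent = percentageOf(10)
val twentyPercent = percentageOf(20)

val tens = List(100, 200, 300).map(tenPercent)        // List(10, 20, 30)
val twenties = List(100, 200, 300).map(twentyPercent) // List(20, 40, 60)
```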


FP: functions


• Closure


FP: Higher order functions

scala> :load Person.scala

Loading Person.scala...

defined class Person

scala> val jd = new Person("John", "Doe", 17)

jd: Person = Person@372a6e85

scala> val rh = new Person("Roger", "Huang", 34)

rh: Person = Person@611c4041

scala> val people = Array(jd, rh)

people: Array[Person] = Array(Person@372a6e85, Person@611c4041)

scala> val (minors, adults) = people partition (_.age < 18)

minors: Array[Person] = Array(Person@372a6e85)

adults: Array[Person] = Array(Person@611c4041)

scala>
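Person.scala is not shown on the slides. The sketch below assumes a plain class with firstName, lastName, and age fields. Only age is confirmed by the session, and the other field names are guesses.

```scala
// Hypothetical Person.scala: field names other than `age` are assumptions.
class Person(val firstName: String, val lastName: String, val age: Int)

val jd = new Person("John", "Doe", 17)
val rh = new Person("Roger", "Huang", 34)
val people = Array(jd, rh)

// partition is a higher order function: it takes a predicate and returns
// a pair (elements that match, elements that do not).
val (minors, adults) = people.partition(_.age < 18)

val minorNames = minors.map(_.firstName).toList // List(John)
val adultNames = adults.map(_.firstName).toList // List(Roger)
```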


FP: Higher order functions

• HOF

– Takes a function as an argument, or

– Returns a function as its result
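Both cases can be sketched in a few lines. The names applyTwice and adder are hypothetical, picked for this example.

```scala
// Takes a function as an argument: applies f to x twice.
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

// Returns a function: builds an adder for a fixed n.
def adder(n: Int): Int => Int = (x: Int) => x + n

val addThree = adder(3)
val seven = applyTwice(addThree, 1) // (1 + 3) + 3 = 7
```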


FP: Higher order functions: map

• Creates a new collection from an existing collection by applying a function


scala> List(1, 2, 3 ) map { (x: Int) => x + 1 }

res0: List[Int] = List(2, 3, 4)

• Function literal with placeholder syntax

scala> List(1, 2, 3) map { _ + 1 }


• Passing an existing function

scala> def addOne(num: Int) = num + 1

addOne: (num: Int)Int

scala> List(1, 2, 3) map addOne



FP: Higher order functions: map


FP: Higher order functions: flatMap
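A short sketch of how flatMap differs from map, using made-up sample strings: map keeps one result per element, while flatMap applies the same function and then concatenates the inner collections.

```scala
val lines = List("to be", "or not")

// map: one result per input element, so the result is nested.
val nested = lines.map(line => line.split(" ").toList)
// List(List(to, be), List(or, not))

// flatMap: same function, but the inner lists are concatenated.
val words = lines.flatMap(line => line.split(" ").toList)
// List(to, be, or, not)
```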


FP: for-comprehension

• Syntax

– for ( <generators and guards> ) [yield] <expression>

• Types

– Imperative form. Does not return a value.

scala> val aList = List(1, 2, 3)

aList: List[Int] = List(1, 2, 3)

scala> val bList = List(4, 5, 6)

bList: List[Int] = List(4, 5, 6)

scala> for { a <- aList; if (a < 2); b <- bList; if (b < 7) } println( a + b )

5

6

7



– Functional form (a.k.a. sequence comprehension). Yields a value.

scala> for { a <- aList; b <- bList} yield a + b

res0: List[Int] = List(5, 6, 7, 6, 7, 8, 7, 8, 9)

scala> res0.take(1)

res1: List[Int] = List(5)

scala> for { a <- aList; if (a < 2); b <- bList } yield a + b


scala>
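A for-comprehension is syntactic sugar over higher order functions. The sketch below shows roughly how the last expression above translates: generators become flatMap and map, and a guard becomes withFilter. The imperative form without yield translates to foreach in the same way.

```scala
val aList = List(1, 2, 3)
val bList = List(4, 5, 6)

// Functional form with a guard on a.
val sugared = for { a <- aList; if a < 2; b <- bList } yield a + b

// Approximately what the compiler generates for the line above.
val desugared =
  aList.withFilter(a => a < 2).flatMap(a => bList.map(b => a + b))

// Both evaluate to List(5, 6, 7).
```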


FP: foldLeft

• scala> val numbers = 1.to(10)

• numbers: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

• scala> def add( a:Int, b:Int ): Int = { a + b }

• add: (a: Int, b: Int)Int

• scala> numbers.foldLeft(0){ add }

• res0: Int = 55

• scala> numbers.foldLeft(0){ (acc, b) => acc + b }

• res1: Int = 55

• scala>


FP: foldLeft
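To see how foldLeft threads the accumulator from left to right, scanLeft can be used as a trace: it returns every intermediate accumulator instead of only the last one. A shorter range is used here to keep the trace readable.

```scala
val numbers = 1 to 4

// scanLeft records each intermediate accumulator, starting with the seed 0:
// 0, 0+1, 1+2, 3+3, 6+4
val steps = numbers.scanLeft(0)(_ + _).toList // List(0, 1, 3, 6, 10)

// foldLeft returns only the final accumulator.
val sum = numbers.foldLeft(0)(_ + _) // 10
```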


FP: find the last item in an array

• scala> val ns = Array(20, 40, 60)

• ns: Array[Int] = Array(20, 40, 60)

• scala> ns.foldLeft(ns.head) {(acc, b) => b}

• res0: Int = 60

• scala>
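Seeding the fold with ns.head throws an exception when the array is empty. A possible variation, sketched with a hypothetical helper named lastOf, seeds the fold with None so that an empty array yields None.

```scala
// Seed with None so an empty array produces None instead of throwing.
def lastOf(xs: Array[Int]): Option[Int] =
  xs.foldLeft(Option.empty[Int])((_, b) => Some(b))

val last = lastOf(Array(20, 40, 60)) // Some(60)
val none = lastOf(Array.empty[Int])  // None
```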


FP: reverse an array w/ foldLeft

• scala> val ns = Array(20, 40, 60)

• ns: Array[Int] = Array(20, 40, 60)

• scala> ns.foldLeft( Array[Int]() ) { (acc, b) => b +: acc}

• res1: Array[Int] = Array(60, 40, 20)

• scala>
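Prepending to an Array with +: copies the array at every step. The same fold on an immutable List, sketched below, prepends in constant time with the :: operator and gives the same result as the built-in reverse.

```scala
val ns = List(20, 40, 60)

// Each step prepends the current element to the accumulator:
// Nil -> List(20) -> List(40, 20) -> List(60, 40, 20)
val reversed = ns.foldLeft(List.empty[Int])((acc, b) => b :: acc)
```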


FP: reverse an array w/ foldLeft