Introduction to Spark: Or how I learned to love 'big data' after all


Upload: peadar-coyle

Post on 12-Apr-2017


TRANSCRIPT

Page 1: Introduction to Spark: Or how I learned to love 'big data' after all

Introduction to Spark

Peadar Coyle @springcoil

Luxembourg - Early 2016

Page 2:

Aims of this talk

Explain what Spark is. I'm more a data scientist than an engineer...

Page 3:

Who am I?
Math and data nerd
Interested in machine learning and data processing
Speaker at PyData / PyCon events throughout Europe

Page 4:

'Big data' so far

Page 5:

Why care?
Big data analytics in memory
Resilient Distributed Datasets (RDDs)
Flexible programming models
Complements Hadoop
Better performance than Hadoop
https://github.com/springcoil/scalable_ml

Page 6:

Who uses it?
Current tech, not future tech!

Page 7:

Supported Languages

Page 8:

Code

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

sc is the SparkContext. distData here is an RDD.
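The parallelized collection can then be transformed and reduced. A minimal sketch, assuming a live SparkContext named sc as on the slide:

```scala
// Assumes a running SparkContext `sc` (provided by the Spark shell or app).
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

// Transformations are lazy: this builds a new RDD but runs nothing yet.
val doubled = distData.map(_ * 2)

// Actions trigger the actual distributed computation.
val total = doubled.reduce(_ + _)   // 30
```

Nothing executes until reduce is called; Spark only then ships the work to the executors.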

Page 9:

https://github.com/springcoil/scalable_ml/

package scalable_ml

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.rdd.RDD
import breeze.linalg.{DenseVector => BDV}
import breeze.linalg.{DenseMatrix => BDM}

class LeastSquaresRegression {
  def fit(dataset: RDD[LabeledPoint]): DenseVector = {
    val features = dataset.map { _.features }

    val covarianceMatrix: BDM[Double] = features.map { v =>
      val x = BDM(v.toArray)
      x.t * x
    }.reduce(_ + _)

    val featuresTimesLabels: BDV[Double] = dataset.map { xy =>
      BDV(xy.features.toArray) * xy.label
    }.reduce(_ + _)

    val weight = covarianceMatrix \ featuresTimesLabels

    new DenseVector(weight.data)
  }
}
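A hypothetical usage sketch for the class above (the training data is invented for illustration, and a live SparkContext sc is assumed):

```scala
// Hypothetical example data: y = 1*x1 + 2*x2 fits these points exactly.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(2.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(3.0, Vectors.dense(1.0, 1.0))
))

val model = new LeastSquaresRegression
// Solves the normal equations (X^T X) w = X^T y on the cluster.
val weights = model.fit(training)   // DenseVector(1.0, 2.0) for this data
```

The two reduce calls aggregate X^T X and X^T y across partitions, so only a small dense system is solved on the driver.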

Page 10:

Resilient Distributed Datasets (RDDs)

Processed in parallel
Operations on RDDs = transformations and actions
Persistence: memory, disk, or memory and disk
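A sketch of the transformation/action split and the persistence levels, assuming a live SparkContext sc (the example strings are invented):

```scala
// Assumes a running SparkContext `sc`.
import org.apache.spark.storage.StorageLevel

val words = sc.parallelize(Seq("spark", "rdd", "spark sql"))

// Transformation: lazily describes a new RDD, no work happens yet.
val sparkWords = words.filter(_.contains("spark"))

// Persist so repeated actions reuse cached partitions; the slide's three
// options map to MEMORY_ONLY, DISK_ONLY, and MEMORY_AND_DISK.
sparkWords.persist(StorageLevel.MEMORY_AND_DISK)

// Actions: each of these triggers computation (the first also caches).
val n = sparkWords.count()       // 2
val head = sparkWords.first()    // "spark"
```

Without persist, the second action would recompute the filter from scratch.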

Page 11:

Spark Ecosystem

Spark Streaming
Spark SQL - really the creation of a data frame
More stuff will come soon... IBM and others are heavily investing in this.
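A Spark SQL sketch in the Spark 1.x style of the talk's era, assuming a live SparkContext sc (the table name and rows are invented for illustration):

```scala
// Assumes a running SparkContext `sc`; Spark 1.x API.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// A DataFrame is a distributed table with a named schema.
val people = sc.parallelize(Seq(("Alice", 34), ("Bob", 28)))
  .toDF("name", "age")

// Register the DataFrame so it can be queried with SQL.
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age > 30")
adults.show()   // one row: Alice
```

The same query could be written with the DataFrame DSL (people.filter("age > 30").select("name")); both compile to the same execution plan.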

Page 12:

Any questions?