MongoDB Days UK: MongoDB and Spark


Upload: mongodb

Post on 16-Apr-2017


Page 1: MongoDB Days UK: MongoDB and Spark

Spark in the Leaf


Ross Lawley

JVM Software engineer

On the drivers team

Twitter: @RossC0


Agenda

• The data challenge
• Spark
• Use Cases
• Connectors
• Demo


c. 18,000 BCE

First recorded example of humans saving data.

Tally sticks used to track trading activity and record inventory.


1663

First recorded statistical analysis of data.

John Graunt started the field of demographics in an attempt to predict the spread of the bubonic plague.


1928

First use of magnetic tape to store data.

Fritz Pfleumer's magnetic tape formed the basis of modern digital data storage.


1965

The start of Big Data?

The US Government plans the world’s first data center to store 742 million tax returns and 175 million sets of fingerprints.


1970

The start of accessible data

Relational database model developed by Edgar F. Codd.


1991

The birth of the World Wide Web.


1997

Google

Michael Lesk estimates the digital universe increasing tenfold in size every year.


2001

Big Data challenges defined

Doug Laney defined the three “Vs” of Big Data: volume, velocity and variety.


2005

Big Data taming by Elephants

Hadoop created!


2009

MongoDB released!


2010

Eric Schmidt

Every two days now we create as much information as we did from the dawn of civilization up until 2003


2014

Spark 1.0 released!


Big Data

Big Challenge


“Apache Spark is the Taylor Swift of big data software.”

Derrick Harris, Fortune


What is Spark?

Fast and general computing engine for clusters

• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• It’s fundamentally different to what’s come before


Why not just use Hadoop?

• Spark is FAST
  – Faster to write.
  – Faster to run.
• Up to 100x faster than Hadoop in memory
• 10x faster on disk.


A visual comparison

Hadoop vs. Spark


Spark Programming Model

Resilient Distributed Datasets

• An RDD is a collection of elements that is immutable, distributed and fault-tolerant.

• Transformations can be applied to an RDD, resulting in a new RDD.

• Actions can be applied to an RDD to obtain a value.

• RDDs are lazy: transformations are only evaluated when an action needs their result.
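As a minimal sketch of the points above, assuming Spark core is on the classpath and a local master is good enough, the following shows that a transformation only describes a new RDD while an action triggers the work. The object name `LazyRdd`, the sample lines and the accumulator are illustrative and do not come from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRdd {
  // Returns (lines inspected before the action, lines inspected after it, count result).
  def run(sc: SparkContext): (Long, Long, Long) = {
    val inspected = sc.longAccumulator("lines inspected")
    val lines = sc.parallelize(Seq("Search\tMongoDB", "Click\thome", "Search\tSpark"))

    // Transformation: describes a new RDD, but no data is processed yet.
    val searches = lines.filter { line =>
      inspected.add(1)
      line.startsWith("Search")
    }
    val before = inspected.sum // nothing has been evaluated so far

    // Action: forces evaluation and returns a value to the driver.
    val total = searches.count()
    val after = inspected.sum

    (before, after, total)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-rdd").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

The accumulator is only there to make the laziness observable; in a real job, counters updated inside transformations can be applied more than once if a task is retried.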


RDD Operations

Transformations    Actions
map                reduce
filter             collect
flatMap            count
mapPartitions      save
sample             lookupKey
union              take
join               foreach
groupByKey
reduceByKey


Example: Filtering text

val searches = spark.textFile("hdfs://...")
  .filter(line => line.contains("Search"))
  .map(s => s.split("\t")(2))
  .cache()

// Count searches mentioning MongoDB
searches.filter(_.contains("MongoDB")).count()

// Fetch the searches as an array of strings
searches.filter(_.contains("MongoDB")).collect()

[Diagram: the Driver sends tasks to three Workers and receives results; the Workers read Block 1, Block 2 and Block 3 and hold Cache 1, Cache 2 and Cache 3.]


Built in fault tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions

val searches = spark.textFile("hdfs://...")
  .filter(_.contains("Search"))
  .map(_.split("\t")(2))
  .cache()

searches.filter(_.contains("MongoDB")).count()

Lineage: HDFS RDD → Filtered RDD → Mapped RDD → Cached RDD → Filtered RDD → Count
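One way to look at the lineage Spark keeps is `RDD.toDebugString`. The sketch below assumes Spark core on the classpath and swaps the HDFS file for a small in-memory sample; the object name `Lineage` and the sample lines are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Lineage {
  // Builds the same kind of pipeline as the slide on an in-memory sample
  // and returns Spark's textual description of its lineage.
  def describe(sc: SparkContext): String = {
    val searches = sc.parallelize(Seq("Search\tuk\tMongoDB", "Click\tuk\thome"))
      .filter(_.contains("Search"))
      .map(_.split("\t")(2))
      .cache()

    // toDebugString lists this RDD and the parent RDDs it was derived from,
    // which is the information Spark replays to rebuild a lost partition.
    searches.filter(_.contains("MongoDB")).toDebugString
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lineage").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(describe(sc)) finally sc.stop()
  }
}
```

The printed tree reads from the final RDD back to the source, so it appears in the reverse order of the lineage shown on the slide.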


Spark higher level libraries

Spark

Spark SQL | Spark Streaming | MLlib | GraphX


Spark + MongoDB


MongoDB and Spark

Spark

Spark SQL | Spark Streaming | MLlib | GraphX


Spark + MongoDB top use cases:
– Business Intelligence
– Data Warehousing
– Recommendation
– Log processing
– User Facing Services
– Fraud detection


Data Management

OLTP: Applications, Fine-grained operations

Offline Processing: Analytics, Data Warehousing


Fraud Detection

I'm so in love!

Me, too<3

Now send me your CC number

?

Ok, XXXX-123-zzz

$$$


Sharing Workloads

Chat App

HDFS, HDFS, HDFS: Archiving, Data Crunching

Login, User Profile, Contacts, Messages, …

Fraud Detection, Segmentation, Recommendations

Spark


MongoDB + Spark Connectors


Choices, choices

• Hadoop Connector
• Stratio Connector


MongoDB Hadoop Connector

HDFS HDFS HDFS

MongoDB Hadoop Connector

MongoDB Shard

Spark


MongoDB Hadoop Connector

HDFS HDFS HDFS

MongoDB Hadoop Connector

MongoDB Shard

Spark

YARN


MongoDB Hadoop Connector

Positive | Not So Good
Battle tested | Not the fastest thing
Integrated with existing Hadoop components | Not dedicated to Spark
Supports Hive and Pig | Dependent on HDFS

http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/


Stratio Spark-MongoDB

http://spark-packages.org/?q=mongodb


Stratio Spark MongoDB

MongoDB Shard

Spark

Stratio Spark-MongoDB

https://github.com/Stratio/spark-mongodb


Feature | MongoDB Hadoop Connector | Stratio Spark-MongoDB Connector
Machine Learning | Yes | Yes
SQL | No | Yes
Data Frames | No | Yes
Streaming | No | No
Python | Yes | Yes (Spark SQL syntax)
Use MongoDB secondary indexes to filter input data | Yes | Yes
Compatibility with MongoDB replica sets and sharding | Yes | Yes
HDFS Support | Yes | Yes
Support for MongoDB BSON Files | Yes | Partial (write only)
Commercial Support | Yes (with MongoDB Enterprise Advanced) | Yes (provided by Stratio)


Spark Streaming


Spark Streaming

Twitter Feed → Spark


Spark Streaming

Twitter Feed

{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [],
        "hashtags": [
          { "text": "freebandnames", "indices": [20, 34] }
        ],
        "user_mentions": []
      }
    }
  ]
}


Spark Streaming

Twitter Feed → Spark

One tweet tagged "freebandnames", created at "Mon Sep 24 03:35:21 +0000 2012":

{ "time": "Mon Sep 24 03:35", "freebandnames": 1 }

Three more tweets with the same hashtag in the same minute:

{ "time": "Mon Sep 24 03:35", "freebandnames": 4 }
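The aggregation on this slide, hashtag counts per minute, can be sketched in plain Scala without Spark Streaming itself. The `Tweet` case class and helper names are hypothetical, and taking the first 16 characters of `created_at` as the minute is an assumption that only holds for the timestamp layout shown above.

```scala
object HashtagsPerMinute {
  // Hypothetical, trimmed-down tweet: only the fields the aggregation needs.
  final case class Tweet(createdAt: String, hashtags: Seq[String])

  // "Mon Sep 24 03:35:21 +0000 2012" becomes "Mon Sep 24 03:35".
  def minuteOf(createdAt: String): String = createdAt.take(16)

  // For every minute, count how often each hashtag occurred.
  def countPerMinute(tweets: Seq[Tweet]): Map[String, Map[String, Int]] =
    tweets
      .flatMap(t => t.hashtags.map(tag => (minuteOf(t.createdAt), tag)))
      .groupBy { case (minute, _) => minute }
      .map { case (minute, pairs) =>
        minute -> pairs.groupBy { case (_, tag) => tag }.map { case (tag, hits) => tag -> hits.size }
      }

  def main(args: Array[String]): Unit = {
    val tweet = Tweet("Mon Sep 24 03:35:21 +0000 2012", Seq("freebandnames"))
    println(countPerMinute(Seq.fill(4)(tweet)))
  }
}
```

In a streaming job the same grouping would run once per batch or window rather than over a complete list.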


MongoDB and Spark Streaming future

Capped Collection

{ "time": "Mon Sep 24 03:35", "freebandnames": 4 }
{ "time": "Mon Nov 5 09:40", "mongoDBLondon": 400 }
{ "time": "Mon Nov 5 11:50", "spark": 7556 }
{ "time": "Mon Nov 24 12:50", "itshappening": 100 }

Tailable Cursor


Spark SQL


Demo

Spark Stratio Spark MongoDB


Open High Low Close

Symbol, Timestamp, Day, Open, High, Low, Close, Volume
MSFT, 2009-08-24 09:30, 24, 24.41, 24.42, 24.31, 24.31, 683713
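Before rows like this can be stored or analysed they have to be parsed. A minimal sketch in plain Scala follows, assuming the column order shown above and a `yyyy-MM-dd HH:mm` timestamp; the `Tick` case class and `parse` helper are hypothetical names, not part of the demo.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object Ticks {
  // One row of the open/high/low/close data set.
  final case class Tick(symbol: String, timestamp: LocalDateTime, day: Int,
                        open: BigDecimal, high: BigDecimal, low: BigDecimal,
                        close: BigDecimal, volume: Long)

  private val timestampFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm")

  // Returns None for the header row or any line that does not fit the layout.
  def parse(line: String): Option[Tick] =
    line.split(",").map(_.trim) match {
      case Array(symbol, ts, day, open, high, low, close, volume) =>
        Try(Tick(symbol, LocalDateTime.parse(ts, timestampFormat), day.toInt,
          BigDecimal(open), BigDecimal(high), BigDecimal(low), BigDecimal(close),
          volume.toLong)).toOption
      case _ => None
    }

  def main(args: Array[String]): Unit =
    println(parse("MSFT, 2009-08-24 09:30, 24, 24.41, 24.42, 24.31, 24.31, 683713"))
}
```

`BigDecimal` is used for prices so that values such as 24.41 are kept exactly; `Double` would also work if small rounding errors are acceptable.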


MongoDB + Spark performance


Document Design matters

db.ticks.find()
{
  _id: 'MSFT_12',                      // Resource
  type: 'Open',                        // Type
  date: ISODate("2015-07-12 10:00"),   // When
  volume: 12.9                         // Data
}


Time Series

db.ticks.find()
{
  _id: 'MSFT_12',
  type: 'Open',
  date: ISODate("2015-07-12 10:00"),
  volume: 1699342,
  minutes: { "0": 12.9, "1": 14.4, ... "59": 15.8 }   // Series
}
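The time-series layout groups many readings into one document. The plain-Scala sketch below buckets flat readings into one record per symbol and hour with a `minutes` map, mirroring the shape of the document above. `Reading`, `HourBucket` and the `symbol_hour` id scheme are assumptions made for illustration; the slide does not say how its `_id` is built.

```scala
object TickBuckets {
  // Hypothetical flat reading: one price per symbol, hour and minute.
  final case class Reading(symbol: String, hour: Int, minute: Int, price: Double)

  // One bucket per symbol and hour, with prices keyed by minute as strings.
  final case class HourBucket(id: String, minutes: Map[String, Double])

  def bucket(readings: Seq[Reading]): Seq[HourBucket] =
    readings
      .groupBy(r => (r.symbol, r.hour))
      .toSeq
      .sortBy { case (key, _) => key }
      .map { case ((symbol, hour), rs) =>
        HourBucket(s"${symbol}_$hour", rs.map(r => r.minute.toString -> r.price).toMap)
      }

  def main(args: Array[String]): Unit = {
    val readings = Seq(Reading("MSFT", 12, 0, 12.9), Reading("MSFT", 12, 1, 14.4), Reading("MSFT", 12, 59, 15.8))
    bucket(readings).foreach(println)
  }
}
```

With this layout an hour of data is read as one document instead of up to sixty, which is the design point the slide is making.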


WiredTiger


Very High Speed


Spark I/O Matters

val searches = spark.fromMongoDB(mongoDBConfig)
  .filter(line => line.contains("Search"))
  .map(s => s.split("\t")(2))
  .cache()

// Count searches mentioning MongoDB
searches.filter(_.contains("MongoDB")).count()

// Fetch the searches as an array of strings
searches.filter(_.contains("MongoDB")).collect()

[Diagram: App, Spark Driver, Worker, Worker]


Spark and MongoDB

• An extremely powerful combination

• Many possible use cases

• Some operations are actually faster if performed using the Aggregation Framework

• Evolving all the time


Questions?

Ross Lawley

Senior [email protected]

@RossC0


References

• Resources
– https://www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb
– http://spark.apache.org/docs/latest/quick-start.html
– https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
– http://techanjs.org/

• Images
– https://commons.wikimedia.org/wiki/File:SAM_PC_1_-_Tally_sticks_1_-_Overview.jpg
– http://www.pieria.co.uk/articles/a_17th_century_spreadsheet_of_deaths_in_london
– http://www.snipview.com/q/Fritz_Pfleumer
– https://news.google.com/newspapers?id=ZGogAAAAIBAJ&sjid=3GYFAAAAIBAJ&dq=data-center&pg=933%2C5465131
– http://www.slideshare.net/renguzi/codd
– http://www.datasciencecentral.com/profiles/blogs/a-little-known-component-that-should-be-part-of-most-data-science
– https://medium.com/deepend-indepth/know-your-audience-better-than-asio-4802839c3fd3#.fj0dxq99w
– http://timschreiber.com/img/cardboard-tank.jpg
– http://olap.com/forget-big-data-lets-talk-about-all-data/
– http://www.engadget.com/2015/10/07/lexus-cardboard-electric-car/
– http://cdn.theatlantic.com/static/infocus/ngt051713/n10_00203194.jpg
– http://www.businessinsider.com/the-red-pill-reddit-2013-8
– https://www.flickr.com/photos/dogfaceboy/2572744331/