How Spark is Enabling the New Wave of Converged Applications


Page 1: How Spark is Enabling the New Wave of Converged Applications

Balaji Mohanam and Carol McDonald

September 2016

© 2016 MapR Technologies

Page 2: Today’s Presenters

Carol McDonald, Solutions Architect

Balaji Mohanam, Product Manager

Page 3: Agenda

• Market Trends

• What’s Needed for Converged Applications

• Customer Use Cases

• Demo of MapR Streams with Spark Streaming

Page 4: Analytics & ETL: Batch or Streaming?

[Chart: Value plotted against Time]

Page 5: Analytic Categories

Categories: Descriptive, Predictive, Streaming, Prescriptive

Data-At-Rest, Data-In-Motion, Future

• What happened

• Why did it happen

• Discovery in nature

• Batch analytics

• What will happen

• Combines historical data with rules and algorithms

• ML (Batch + Real Time)

• What + When + Why

• Suggestions to take advantage of future opportunity or mitigate risks

• Volume, velocity and variety

• Agility is key to success.

• Analyse data as it happens

• Triggers and Alarms.

• Anomaly detection

• Continuous ETL and analytics

Page 6: Decreasing Job Latencies

Hours -> Minutes -> Seconds -> Milliseconds

Data persistence on-disk

Data persistence in-memory

Page 7: Why Stream Processing?

Temperature readings:
6:01 P.M.: 72°
6:02 P.M.: 75°
6:03 P.M.: 77°
6:04 P.M.: 85°
6:05 P.M.: 90°
6:06 P.M.: 85°
6:07 P.M.: 77°
6:08 P.M.: 75°

Analyze: It was hot at 6:05 yesterday! (peak of 90°)

Batch processing may be too late for some events

Page 8: Why Stream Processing?

Stream topic: Temperature

6:05 P.M.: 90°

Turn on the air conditioning!

It’s becoming important to process events as they arrive
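The thermostat example on the two slides above can be sketched as event-at-a-time processing: each reading is checked the moment it arrives, rather than in a batch the next day. This is only a minimal illustration in plain Java. The class name, the method names, and the 88° threshold are assumptions made for the example, and the loop stands in for a stream consumer rather than using the MapR Streams API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and threshold are illustrative, not from the slides.
final class TemperatureAlertSketch {
    static final double THRESHOLD_F = 88.0; // assumed alert threshold

    // Called once per event, at the moment the reading arrives.
    // Returns an alert message, or null when no action is needed.
    static String onReading(String time, double degrees) {
        return degrees >= THRESHOLD_F ? time + ": turn on the air conditioning" : null;
    }

    // Replays readings in arrival order and collects the alerts raised along the way.
    static List<String> replay(String[] times, double[] degrees) {
        List<String> alerts = new ArrayList<>();
        for (int i = 0; i < times.length; i++) {
            String alert = onReading(times[i], degrees[i]);
            if (alert != null) {
                alerts.add(alert);
            }
        }
        return alerts;
    }
}
```

With the readings from the slide, only the 6:05 P.M. event crosses the assumed threshold, so the alert fires at 6:05 instead of being discovered in a later batch run.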

Page 9: What’s Needed for Converged Applications

Page 10: The Trinity of Real Time

Diagram labels:
• Real Time Producers
• Topic 1, Topic 2
• Global Messaging System
• Transformational Tier
• NoSQL Key Value Database
• Spark + MapR Streams Integration
• Spark + MapR-DB Integration
• Real Time Operational Analytics

Page 11: MapR Converged Data Platform

Diagram labels:
• Open Source Engines & Tools
• Commercial Engines & Applications
• Custom Apps
• Cloud and Managed Services
• Search and Others
• Data Processing
• Enterprise-Grade Platform Services
• Web-Scale Storage: MapR-FS
• Database: MapR-DB
• Event Streaming: MapR Streams
• APIs: HDFS API, POSIX, NFS, HBase API, JSON API, Kafka API
• Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace, High Availability
• Unified Management and Monitoring

Page 12: Use Case: Time Series Data in Oil Wells

Diagram flow: Sensor time-stamped data -> Stream (Topic) -> read by Spark Streaming -> Spark processing -> Data for real-time monitoring

Page 13: What Do We Need to Do?

Stages: Data Sources, Collect Data, Process Data, Store Data, Serve Data

(Each stage is marked "?" on this slide.)

Page 14: Scalable Messaging with MapR Streams

Topics are partitioned for throughput and scalability

Partition 1: Topic - Pressure

Partition 1: Topic - Temperature

Partition 1: Topic - Warning

Partition 2: Topic - Pressure

Partition 2: Topic - Temperature

Partition 2: Topic - Warning

Partition 3: Topic - Pressure

Partition 3: Topic - Temperature

Partition 3: Topic - Warning

Consumers
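How a keyed message ends up in one of the three partitions can be sketched as follows. This is a deliberately simplified illustration that assumes plain hash-modulo routing. The class and method names are hypothetical, and real Kafka-compatible clients may use a different hash function.

```java
// Hypothetical sketch of key-based partition routing (assumption: hash modulo).
final class PartitionSketch {
    // Maps a message key to a partition index in [0, numPartitions).
    // floorMod keeps the result non-negative even when hashCode() is negative.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Because the mapping is deterministic, all messages with the same key (for example, the same pump name) land in the same partition, while different keys spread across partitions so that several consumers can read in parallel.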

Page 15: Continuous Analytics: Structured Streaming with Spark 2.0

    // Early Structured Streaming syntax, as shown on the slide.
    // The released Spark 2.0 API uses readStream / writeStream instead.
    val records = sqlContext.read.format("json").stream("hdfs://input")
    val counts = records.groupBy("user").count()
    counts.write
      .trigger(ProcessingTime("5 sec"))
      .outputMode(UpdateInPlace("user"))
      .format("jdbc")
      .startStream("mysql://...")

Repeated Queries -> DB

User | Count
User 1 | 10
User 2 | 23
User 3 | 16
... | ...

• Store only the processed output instead of every single record.

• The query is executed repeatedly, as and when data arrives.

• Read the result from persistent storage instead of processing the entire data set, resulting in faster access.

Page 16: Spark 2.0: Structured Streaming with Spark SQL

Diagram rows, shown at processing times 1, 2, and 3:
• Input Table: data up to processing time 1, 2, 3
• Result Table: output for data at 1, 2, 3
• Program Output: complete output OR delta output

Output Modes

Delta: Writes the records of the query result that changed since the last firing of the trigger. These are physical deltas, not logical deltas. That is, they specify which rows were added and removed, but not the logical difference for a given row.

Append: A special case of the Delta mode that does not include removals.

Update (in place): Updates the result directly in place (e.g. updates a MySQL table). As with Delta, a primary key must be specified.

Complete: For each run of the query, creates a complete snapshot of the query result.
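The difference between a physical delta and a complete snapshot can be illustrated with two consecutive result tables. The sketch below is plain Java rather than Spark code. The class name, method names, and the "user,count" row encoding are assumptions made purely for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: each row of the result table is encoded as a "user,count" string.
final class DeltaOutputSketch {
    // Rows present in the current result but absent from the previous one.
    static Set<String> added(Set<String> previous, Set<String> current) {
        Set<String> out = new TreeSet<>(current);
        out.removeAll(previous);
        return out;
    }

    // Rows present in the previous result but absent from the current one.
    static Set<String> removed(Set<String> previous, Set<String> current) {
        Set<String> out = new TreeSet<>(previous);
        out.removeAll(current);
        return out;
    }
}
```

If User 1's count moves from 10 to 11 between two trigger firings, the physical delta reports the row "User 1,10" as removed and the row "User 1,11" as added, without stating that the count grew by one. Under these assumptions, Append would emit only the added rows, and Complete would emit the whole current set.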

Page 17: What Do We Need to Do?

Stages: Data Sources, Collect Data, Process Data, Store Data, Serve Data

Shown so far: Stream, Topic

Page 18: Spark Scheduling Bottleneck

Diagram flow: User 1, User 2, User 3, ... User n -> SparkContext (Query Compilation, Storage, Scheduling) -> Worker 1, Worker 2, Worker 3, Worker 4, ... Worker n

Page 19: Latency vs. Concurrency

Type | Latency | Concurrency
Batch/RTS Analytics | Very Low | Low
Interactive Applications | Very Low | High/Very High

Page 20: MapR-DB (HBase API) is Designed to Scale

Fast Reads and Writes by Key

Data is automatically partitioned by Key Range

The diagram shows three partitions, each covering one Key Range (xxxx to xxxx) and holding rows of this form:

Key | Col B | Col C
val | val | val
xxx | val | val
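Partitioning by key range means that a row key can be routed to its partition by finding the range whose start key is the largest one not exceeding the row key. The sketch below shows that lookup with a sorted map in plain Java. The class name, the partition names, and the range boundaries are all hypothetical, and this is not the MapR-DB client API.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of key-range routing; not the MapR-DB API.
final class KeyRangeSketch {
    // Maps the first key of each range to an illustrative partition name.
    private final TreeMap<String, String> rangeStarts = new TreeMap<>();

    void addRange(String startKey, String partition) {
        rangeStarts.put(startKey, partition);
    }

    // Returns the partition whose start key is the greatest key <= rowKey,
    // or null when rowKey sorts before every registered range.
    String partitionFor(String rowKey) {
        Map.Entry<String, String> entry = rangeStarts.floorEntry(rowKey);
        return entry == null ? null : entry.getValue();
    }
}
```

Because the ranges are kept sorted, a read or write by key touches exactly one partition, which is what makes keyed access fast as the table grows.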

Page 21: What Exactly Do We Need to Do?

Stages: Data Sources, Collect Data, Process Data, Store Data, Serve Data

Shown so far: Stream, Topic

Page 22: Customer Use Cases

Page 23: Customer 360 & Behavior Prediction

Website Click-Stream

Real Time/Offline ClickStream Analysis

Internal Data Sources

External Data Sources

• Prediction Modelling

• Attribution Modelling

• Cohort Analysis

• Customer Lifetime Value Analysis

• Attrition Modelling

• Response Modelling

• Churn Modelling

Eliminate latency due to data movement between clusters

Eliminate Redundant storage with MapR streams and lower the TCO

360 Degree Customer View

Customer Behavior Prediction

Better Conversion Rate and Lower Attrition ($$$)

Offline, Real Time

HA, DR, NFS, Snapshots, Data Protection

EDH/EDL

Topics

Sources shown: Support Tickets, DBMS, Email, CRM

Page 24: Prescriptive Analytics: IoT & Auto Manufacturing

Sources shown: GPS, Telematic Data, Telephone, Truck Fleet

Data generated by cars is stored locally

Data Modelling/Secondary ETL: Data is converted from a proprietary format to Parquet

• Identify emission patterns
• Route optimization
• Customer service requests
• How does throttling affect other factors such as fuel consumption, emissions, etc.
• Image and video analysis
• Time series analysis for threshold breach

Topics

Page 25: Interactive Analytics: Risk Analysis (Internal Users)

• 0-10 days old data cached in memory: 50-100 GB of data.
• Data older than 10 days accessed from disk.
• Analytic application submits queries of simple to medium analytic complexity.

Type of Users: Internal (User 1, User 2, User 3)

Concurrent requests: 3-10

Throughput: 1.5 requests per second

Latency: < 2 seconds

Representative Queries

• List of users who have spent more than $1000 in the last 3 days.

• Group users by country who spent more than $1000.

Page 26: Interactive Analytics: External Customer Facing Application

Modes: On-Demand, Pre-Computed

Sales Incentive Data

• 60 events/sec
• 10 MB/event
• Table-based topics

Fast Changing Data (Ex: Credit date)

Append only (50% of events)

Search Application

Stale data. Aggregates calculated using snapshots.

Level 1 Aggregates, Level 2 Aggregates, Level 3 Aggregates, Delta Aggregates

Advanced ML Analytics

Pre-compute analytics with Spark Streaming on data-in-motion

Diagram components: Topics, DB

Page 27: Demo

Page 28: What if BP had detected problems before the oil hit the water?

1M samples/sec

High performance at scale is necessary!

Page 29: Use Case: Time Series Data

Diagram flow: Sensor time-stamped data -> Stream (Topic) -> read by Spark Streaming -> Spark processing -> Data for real-time monitoring

Page 30: Use Case: Time Series Data

Sensor time-stamped data -> Stream (Topic)

COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94

COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66

COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79

Data: pump ID, date, time, then pressure and flow measurements

Page 31: Schema

• All events stored; CF data could be set to expire data
• Filtered alerts put in CF alerts
• Daily summaries put in CF stats

Row key | CF data: hz | … | CF data: psi | CF alerts: psi | … | CF stats: hz_avg | … | CF stats: psi_min
COHUTTA_3/10/14_1:01 | 10.37 | | 84 | 0 | | | |
COHUTTA_3/10/14 | | | | | | 10 | | 0

Row key contains the oil pump name, date, and a time stamp
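The row keys in the schema can be derived directly from the CSV fields of each sensor message. The following plain-Java sketch shows one way to build them. The class and method names are hypothetical. Only the underscore-joined key layout comes from the slide.

```java
// Hypothetical sketch: builds row keys in the layout shown on the schema slide.
final class RowKeySketch {
    // Per-event key: pump name, date, and time joined with underscores.
    static String eventKey(String pump, String date, String time) {
        return pump + "_" + date + "_" + time;
    }

    // Daily-summary key: pump name and date only.
    static String dailyKey(String pump, String date) {
        return pump + "_" + date;
    }

    // Builds the per-event key from one CSV line whose first three fields
    // are pump name, date, and time.
    static String eventKeyFromCsv(String line) {
        String[] fields = line.split(",");
        return eventKey(fields[0], fields[1], fields[2]);
    }
}
```

Since every key starts with the pump name, all rows for one pump share a common prefix, and the daily summary row for a date is a prefix of that day's event keys.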


Page 34: What Do We Need to Do?

Stages: Data Sources, Collect Data, Process Data, Store Data, Serve Data

Shown so far: Stream, Topic

Page 35: Use Case Example Code

Diagram flow: Sensor time-stamped data -> Stream (Topic) -> read by Spark Streaming -> Spark processing -> Data for real-time monitoring

Page 36: KafkaProducer

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    String topic = "/streams/pump:warning";
    public static KafkaProducer<String, String> kafkaProducer;

    // 1. Configure KafkaProducer properties
    Properties properties = new Properties();
    properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // 2. Create the KafkaProducer with the properties
    kafkaProducer = new KafkaProducer<String, String>(properties);

    String txt = "msg text";

    // 3. Create a producer record with the topic and message
    ProducerRecord<String, String> record = new ProducerRecord<String, String>(topic, txt);

    // 4. Use the Kafka producer to send the record
    kafkaProducer.send(record);


Page 38: Create a DStream

DStream: a sequence of RDDs representing a stream of data

    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // Create an input stream for a set of topics
    val dStream = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet)

Diagram: dStream is split into batches (batch time 0 to 1, batch time 1 to 2, batch time 2 to 3), each stored in memory as an RDD

Page 39: Message Data to Sensor Object

    case class Sensor(resid: String, date: String, time: String,
                      hz: Double, disp: Double, flo: Double,
                      sedPPM: Double, psi: Double, chlPPM: Double)

    // Parse CSV strings into Sensor objects
    def parseSensor(str: String): Sensor = {
      val p = str.split(",")
      Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
             p(6).toDouble, p(7).toDouble, p(8).toDouble)
    }

Page 40: Process DStream

    // Parse message values into Sensor objects
    val sensorDStream = dStream.map(_._2).map(parseSensor)

Diagram: for each batch (batch time 0 to 1, 1 to 2, 2 to 3), map transforms the dStream RDD into a sensorDStream RDD

New RDDs created for every batch

Page 41: DataFrame and SQL Operations

    // For each RDD in the DStream
    sensorDStream.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._  // needed for rdd.toDF()
      // Convert the RDD to a DataFrame and register it as a temporary table
      rdd.toDF().registerTempTable("sensor")
      // Get the max, min, and avg of the pump measurements
      val res = sqlContext.sql(
        """SELECT resid, date,
             max(hz) as maxhz, min(hz) as minhz, avg(hz) as avghz,
             max(disp) as maxdisp, min(disp) as mindisp, avg(disp) as avgdisp,
             max(flo) as maxflo, min(flo) as minflo, avg(flo) as avgflo,
             max(psi) as maxpsi, min(psi) as minpsi, avg(psi) as avgpsi
           FROM sensor GROUP BY resid, date""")
      res.show()
    }
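What the GROUP BY query computes can be checked by hand on the three sample rows from the data slide. The sketch below reproduces the max, min, and avg aggregation for the psi column only, in plain Java without Spark. The class and method names are hypothetical, and the column position (index 7) follows the field order of the Sensor case class.

```java
import java.util.DoubleSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a non-distributed stand-in for the SQL aggregation.
final class PumpStatsSketch {
    // Groups CSV readings by "resid,date" and summarizes the psi column (index 7).
    static Map<String, DoubleSummaryStatistics> psiByPumpAndDate(String[] lines) {
        Map<String, DoubleSummaryStatistics> stats = new TreeMap<>();
        for (String line : lines) {
            String[] p = line.split(",");
            String group = p[0] + "," + p[1];
            stats.computeIfAbsent(group, g -> new DoubleSummaryStatistics())
                 .accept(Double.parseDouble(p[7]));
        }
        return stats;
    }
}
```

For the sample rows, the psi readings are 85, 92, and 87, so the group "COHUTTA,3/10/14" has a minimum of 85, a maximum of 92, and an average of 88.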

Page 42: Streaming Application Output

Page 43: Save to HBase

    rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)

Diagram: for each batch (batch time 0 to 1, 1 to 2, 2 to 3), map transforms the linesRDD DStream into the sensorRDD DStream, and save writes the Put objects to HBase

Output operation: persist data to external storage

Page 44: Start Receiving Data

    sensorDStream.foreachRDD { rdd =>
      . . .
    }
    // Start the computation
    ssc.start()
    // Wait for the computation to terminate
    ssc.awaitTermination()

Page 45: Building a Complete Data Architecture

Diagram labels:
• Sources/Apps
• Stream Processing
• Bulk Processing
• MapR Converged Data Platform: MapR File System (MapR-FS), MapR Database (MapR-DB), MapR Streams

Page 46: Q & A

Engage with us!

1. Read the explanation and download the code:
– https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
– https://www.mapr.com/blog/spark-streaming-hbase

2. Get Started: MapR Converged Data Platform https://www.mapr.com/get-started-with-mapr

3. Get Answers: MapR Converge Community https://community.mapr.com/community/answers

4. Get Trained: MapR On-Demand Training https://learn.mapr.com