Spark Streaming and IoT

Michael J. Freedman, iobeam

Upload: spark-summit

Post on 21-Apr-2017

TRANSCRIPT

Page 1: Spark Streaming and IoT by Mike Freedman

Spark Streaming and IoT

Michael J. Freedman iobeam

Page 2: Spark Streaming and IoT by Mike Freedman

Technology confluence in IoT

UBIQUITOUS SENSORS

REAL-TIME SYSTEMS

MACHINE LEARNING

DATA ANALYSIS

INTERSECTION OF 3 MAJOR TRENDS

Page 3: Spark Streaming and IoT by Mike Freedman

Data analysis is the killer app

CASE STUDY: PREDICTIVE MAINTENANCE
Predicting motor failure through analysis of vibration data

CASE STUDY: HEALTH & FITNESS
Exercise identification based on 3D motion data analysis

CASE STUDY: SMART CITIES
Traffic and air quality monitoring via GPS and environmental sensors

CASE STUDY: SMART GRID
Demand-response optimizations on supply-side capacity, spot prices

Page 4: Spark Streaming and IoT by Mike Freedman

Challenges in applying Spark to IoT

REQUIREMENTS

1. One IoT app performs tasks at different time intervals
2. Devices send data at varying delays and rates
3. Within org, multiple IoT apps run concurrently

CHALLENGES

1. Supporting full spectrum of batch to real-time analysis
2. Handling delayed data transparently
3. Processing many low-volume, independent streams
4. Multi-tenancy with low-volume apps and high utilization

Potential economic impact of IoT is >$11 trillion per year, even while 99% of IoT data goes unused today.

— 2015 McKinsey study

Page 5: Spark Streaming and IoT by Mike Freedman

Required: Programming + data infra abstractions

Page 6: Spark Streaming and IoT by Mike Freedman

1. Supporting full spectrum of batch to real-time analysis

Page 7: Spark Streaming and IoT by Mike Freedman

IoT analysis spans many intervals

BATCH PROCESSING (HOURS, NIGHTLY)

STREAM PROCESSING (REAL-TIME)

Fire / Hazard Detection: Immediately

Bus Location Updates: 15 sec

Traffic Conditions: 1 min

Environmental Conditions: 15 min

Traffic Optimizations: Daily

Page 8: Spark Streaming and IoT by Mike Freedman

Spark simplifies programming across intervals

val readings = iobeamInterface.getInputStreamRecords()

// Trigger on temperatures that fall outside acceptable conditions
val bad_temps = readings.filter(t => t > highTempThreshold || t < lowTempThreshold)
val triggers = new TriggerEventStream(bad_temps.map(t => new TriggerEvent("bad_temperature", t)))

// Compute mean temperatures over 5 min (300 s) windows, sliding every 60 s
val windows = readings.groupByKeyAndWindow(Seconds(300), Seconds(60))
val mean_temps = new TimeSeriesStream("mean_temperature", windows.map(t => t.sum / t.length))

new OutputStreams(mean_temps, triggers)

[Slide callouts: 30, 1800]

Page 9: Spark Streaming and IoT by Mike Freedman

But programming != data abstractions

DATA STREAMS (KAFKA, FLUME, SOCKETS, ETC.)

DATA FILES (HDFS, ETC.)

BATCH PROCESSING (HOURS, NIGHTLY)

STREAM PROCESSING (REAL-TIME)

Page 10: Spark Streaming and IoT by Mike Freedman

Programming != data abstractions

Traffic Conditions: 30 sec

Traffic Conditions: 1 hour

1. Frequencies change as products evolve

Page 11: Spark Streaming and IoT by Mike Freedman

Programming != data abstractions

1. Frequencies change as products evolve

2. Joining real-time with historical data

5 min mean vs. trailing:
• hourly mean
• hourly mean from yesterday
• hourly mean from last week
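The comparison on this slide can be sketched on a plain in-memory series. This is a minimal illustration, not iobeam's API: the object name, the `mean` and `compare` helpers, and the sample data are all hypothetical, and "hourly mean from yesterday" is assumed to mean the same one-hour span shifted back by 24 hours.

```scala
object HistoricalComparison {
  // Mean of the values whose timestamp (seconds) falls in [from, until).
  def mean(series: Map[Long, Double], from: Long, until: Long): Option[Double] = {
    val vs = series.collect { case (ts, v) if ts >= from && ts < until => v }
    if (vs.isEmpty) None else Some(vs.sum / vs.size)
  }

  val Minute = 60L
  val Hour = 3600L
  val Day = 24 * Hour
  val Week = 7 * Day

  // The current 5 min mean next to its historical baselines, all at time `now`.
  // Assumption: "yesterday" and "last week" are the trailing hour shifted back.
  def compare(series: Map[Long, Double], now: Long): Map[String, Option[Double]] = Map(
    "5 min mean" -> mean(series, now - 5 * Minute, now),
    "trailing hourly mean" -> mean(series, now - Hour, now),
    "hourly mean from yesterday" -> mean(series, now - Day - Hour, now - Day),
    "hourly mean from last week" -> mean(series, now - Week - Hour, now - Week)
  )

  // Hypothetical sample: one recent point, one older point in the trailing
  // hour, and one point each from yesterday and last week.
  val now = 1000000L
  val sample = Map(
    (now - 60) -> 10.0,
    (now - 1800) -> 20.0,
    (now - Day - 60) -> 30.0,
    (now - Week - 60) -> 40.0
  )
  val result = compare(sample, now)
}
```

All four values come from the same series and differ only in the timestamp range queried, which is why the join needs recent and archived data behind one interface.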

Page 12: Spark Streaming and IoT by Mike Freedman

Programming != data abstractions

1. Frequencies change as products evolve

2. Joining real-time with historical data

3. Supporting backfill for delayed data

Page 13: Spark Streaming and IoT by Mike Freedman

Programming != data abstractions

1. Frequencies change as products evolve

2. Joining real-time with historical data

3. Supporting backfill for delayed data

Data Series Abstraction

Page 14: Spark Streaming and IoT by Mike Freedman

2. Handling delayed data transparently

Page 15: Spark Streaming and IoT by Mike Freedman

Windows in streaming DBs

Tumbling windows

[Figure: timeline marked t_i, t_j, t_k]

Page 16: Spark Streaming and IoT by Mike Freedman

Windows in streaming DBs

Sliding windows

• Defined over # of tuples

• Defined over time period

…using arrival_time of tuples

[Figure: timeline marked t_i, t_j, t_k]
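As a rough sketch of the two window types when they are defined over a time period: a tumbling window assigns each timestamp to exactly one window, while a sliding window assigns it to every overlapping window. The `Windows` object and its helpers are illustrative names, not part of Spark or any streaming DB; timestamps are assumed to be in seconds and `slide` is assumed positive.

```scala
object Windows {
  // A window covers the half-open interval [start, start + length).
  final case class Window(start: Long, length: Long) {
    def contains(t: Long): Boolean = t >= start && t < start + length
  }

  // Tumbling: windows do not overlap, so each timestamp maps to exactly one.
  def tumbling(t: Long, length: Long): Window =
    Window(java.lang.Math.floorDiv(t, length) * length, length)

  // Sliding: a new window starts every `slide` seconds. When slide < length
  // the windows overlap and a timestamp belongs to several of them.
  def sliding(t: Long, length: Long, slide: Long): Seq[Window] = {
    val lastStart = java.lang.Math.floorDiv(t, slide) * slide
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start + length > t)
      .map(Window(_, length))
      .toList
      .reverse
  }
}
```

For example, with a 300 s window sliding every 60 s, `Windows.sliding(1000L, 300L, 60L)` returns the five windows starting at 720, 780, 840, 900 and 960, while `Windows.tumbling(1000L, 300L)` returns the single window starting at 900.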

Page 17: Spark Streaming and IoT by Mike Freedman

But IoT data is often delayed

Seconds due to network congestion

Minutes due to duty cycling for energy savings

Minutes to hours due to intermittent connectivity

Windowing data by arrival time has no semantic meaning

[Figure: timeline marked t_i, t_j, t_k]

Page 18: Spark Streaming and IoT by Mike Freedman

Wanted: Data generation time, not arrival time

[Figure: timeline marked t_i, t_j, t_k]

filter by timestamp

Data semantics defined over timestamp

e.g., aggregation

JOIN ( , historical data)
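A small sketch of why the choice of clock matters. The `Reading` type, its field names, and the sample values are hypothetical; the point is only that one delayed reading lands in a different window depending on whether windows are keyed by arrival time or by generation time.

```scala
object EventTimeDemo {
  // generatedAt: when the device took the reading; arrivedAt: when the
  // server received it. Both in seconds; field names are hypothetical.
  final case class Reading(generatedAt: Long, arrivedAt: Long, value: Double)

  // Mean per tumbling window, keyed by window start, using the chosen clock.
  def windowMeans(rs: Seq[Reading], length: Long)(clock: Reading => Long): Map[Long, Double] =
    rs.groupBy(r => java.lang.Math.floorDiv(clock(r), length) * length)
      .map { case (start, group) => start -> group.map(_.value).sum / group.size }

  val readings = Seq(
    Reading(generatedAt = 10, arrivedAt = 11, value = 20.0),
    Reading(generatedAt = 40, arrivedAt = 95, value = 30.0), // delayed by 55 s
    Reading(generatedAt = 70, arrivedAt = 71, value = 40.0)
  )

  // 60 s windows: by arrival time the delayed reading is averaged with a
  // reading from the next minute; by generation time it stays where it belongs.
  val byArrival = windowMeans(readings, 60)(_.arrivedAt)
  val byGeneration = windowMeans(readings, 60)(_.generatedAt)
}
```

Here `byArrival` is `Map(0 -> 20.0, 60 -> 35.0)` and `byGeneration` is `Map(0 -> 25.0, 60 -> 40.0)`: the same three readings yield different per-minute means depending only on network delay.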

Page 19: Spark Streaming and IoT by Mike Freedman

Wanted: Backfill does not change semantics

[Figure: timeline marked t_i, t_j, t_k]

filter by timestamp

Data semantics defined over timestamp

e.g., aggregation

JOIN ( , historical data)

…from recent streaming data… …from historical archive…
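One way to picture backfill that leaves semantics unchanged is a series indexed by generation timestamp, where a late point is inserted exactly like a live one. The `DataSeries` class below is a hypothetical stand-in for a single device's series, not the deck's implementation; it assumes at most one value per timestamp.

```scala
import scala.collection.immutable.TreeMap

object DataSeriesSketch {
  // Hypothetical data series: points ordered by generation timestamp (seconds).
  final case class DataSeries(points: TreeMap[Long, Double] = TreeMap.empty) {
    // Inserting is the same operation for live and backfilled points.
    def insert(ts: Long, value: Double): DataSeries = copy(points + (ts -> value))

    // Mean of all points with from <= ts < until, or None if there are none.
    def mean(from: Long, until: Long): Option[Double] = {
      val vs = points.range(from, until).values
      if (vs.isEmpty) None else Some(vs.sum / vs.size)
    }
  }

  val inOrder = DataSeries().insert(10, 20.0).insert(40, 30.0).insert(70, 40.0)
  // Same points, but the reading generated at t=40 shows up last (backfill).
  val backfilled = DataSeries().insert(10, 20.0).insert(70, 40.0).insert(40, 30.0)
}
```

Both series answer `mean(0, 60)` with `Some(25.0)`: because the query is defined over timestamps, the order in which points arrived does not affect the result.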

Page 20: Spark Streaming and IoT by Mike Freedman

Wanted: Better data infra abstractions

[Figure: timeline marked t_i, t_j, t_k]

filter by timestamp

e.g., aggregation

JOIN ( , historical data)

Data Series Abstraction

…from recent streaming data… …from historical archive…

Page 21: Spark Streaming and IoT by Mike Freedman

3. Processing many low-volume, independent streams

IoT Device Streams

Page 22: Spark Streaming and IoT by Mike Freedman

Wanted: Maintain state across batches

Map to good/bad conditions

[Figure: timeline marked t_i, t_j, t_k]

Alert on condition transition

Page 23: Spark Streaming and IoT by Mike Freedman

Spark: Share state through RDDs

Click streams

Ad impressions

Market feeds

Shared state b/w stream partitions

‣ Transforms RDD, makes state available across cluster

‣ Many great uses, e.g., learning parameters in iterative ML

‣ But increases data lineage → increases checkpointing cost

Maintain shared state via updateStateByKey()

Page 24: Spark Streaming and IoT by Mike Freedman

IoT: Many independent streams

‣ Each worker handles 1+ streams, not multiple workers per stream

‣ Use language data structures (e.g., Java Map) to maintain state within worker

‣ No RDD transform → no lineage increase → no increased checkpointing cost

Independent state per stream

IoT Device Streams

Often only need to maintain state within individual streams
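A minimal sketch of the per-stream approach, assuming each worker owns its device streams outright: the last known condition of each device is kept in an ordinary mutable map, and an alert is emitted only on a good/bad transition. `Monitor`, `Alert`, and the thresholds are hypothetical names and values; nothing here touches Spark's RDD or state APIs.

```scala
import scala.collection.mutable

object TransitionAlerts {
  sealed trait Condition
  case object Good extends Condition
  case object Bad extends Condition

  final case class Alert(deviceId: String, from: Condition, to: Condition)

  // Hypothetical per-worker monitor: state lives in an ordinary mutable map,
  // one entry per device stream, and survives from one batch to the next.
  final class Monitor(low: Double, high: Double) {
    private val last = mutable.Map.empty[String, Condition]

    private def classify(v: Double): Condition =
      if (v < low || v > high) Bad else Good

    // Process one batch of (deviceId, value) pairs; emit an alert only when
    // a device's condition differs from the one seen previously.
    def processBatch(batch: Seq[(String, Double)]): Seq[Alert] =
      batch.flatMap { case (id, v) =>
        val now = classify(v)
        val previous = last.put(id, now) // returns the prior state, if any
        previous.filter(_ != now).map(Alert(id, _, now)).toList
      }
  }

  // Three consecutive batches for two hypothetical devices.
  val monitor = new Monitor(low = 0.0, high = 30.0)
  val batch1 = monitor.processBatch(Seq("a" -> 20.0, "b" -> 25.0)) // first sightings
  val batch2 = monitor.processBatch(Seq("a" -> 35.0, "b" -> 26.0)) // "a" goes bad
  val batch3 = monitor.processBatch(Seq("a" -> 22.0))              // "a" recovers
}
```

The map is private to the worker, so this only works under the slide's premise that a stream is never split across workers; the sketch also keeps no copy of the state elsewhere, so it would be lost if the worker failed.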

Page 25: Spark Streaming and IoT by Mike Freedman

4. Multi-tenancy with low-volume apps and high utilization

Page 26: Spark Streaming and IoT by Mike Freedman

Multi-tenancy for batch processing

Job Queue

Server Cores

Spark: 1 worker = 1 server core

Goal: Minimize time-to-completion

Page 28: Spark Streaming and IoT by Mike Freedman

Multi-tenancy for stream processing

Job Queue

Server Cores

Spark: 1 worker = 1 server core

Problem: Low utilization with low-rate apps

Page 29: Spark Streaming and IoT by Mike Freedman

Multi-tenancy for stream processing

Job Queue

Server Cores

Virtual Cores (e.g., resource-limited containers)

1 worker = 1 virtual core

N workers = 1 server core

Goal: Improve utilization with low-rate apps
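The utilization argument can be illustrated with a little arithmetic. The sketch below assumes each low-rate app needs a known fraction of one server core and packs apps onto cores with a simple first-fit rule; the packing policy, the `CorePacking` object, and the load values are assumptions for illustration, since the slide only states that N workers share one server core.

```scala
object CorePacking {
  // Each load is the fraction of one server core an app needs (0 < load <= 1).
  // First-fit: place each app on the first core with enough spare capacity.
  def firstFit(loads: Seq[Double]): Vector[Vector[Double]] =
    loads.foldLeft(Vector.empty[Vector[Double]]) { (cores, load) =>
      val i = cores.indexWhere(core => core.sum + load <= 1.0)
      if (i >= 0) cores.updated(i, cores(i) :+ load) else cores :+ Vector(load)
    }

  // Utilization = total load / number of server cores allocated.
  def utilization(loads: Seq[Double], cores: Int): Double = loads.sum / cores

  val loads = Seq(0.25, 0.5, 0.25, 0.5, 0.5)     // hypothetical low-rate apps
  val dedicated = utilization(loads, loads.size) // 1 worker = 1 server core
  val packed = firstFit(loads)                   // N workers = 1 server core
  val shared = utilization(loads, packed.size)
}
```

With these sample loads, dedicating a core to each of the five workers gives 40% utilization, while packing them as virtual cores fits all five onto two server cores.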

Page 31: Spark Streaming and IoT by Mike Freedman

Spark + Unified Data Infrastructure

Page 32: Spark Streaming and IoT by Mike Freedman

Required: Programming + data infra abstractions

Page 34: Spark Streaming and IoT by Mike Freedman

Device-Model-Infra (DMI) framework for IoT

Page 35: Spark Streaming and IoT by Mike Freedman

Questions?

Developers: docs.iobeam.com

Whitepaper: www.iobeam.com/docs/iobeam-DMI.pdf