TRANSCRIPT
Streaming Customer Insights With DSE Cassandra and Apache Kafka
At British Gas Connected Homes
Josep Casals | @jcasals | 2016 London
Data Sources
• Gas and electricity meter readings
• Thermostat temperature data
• Connected boiler data
• Real-time energy consumption data
• Introducing motion sensors, window and door sensors, etc.
Meter Data
• Millions of gas and electricity customers
• 2 Million smart meters
• Readings every 30 minutes from smart meters
Machine Learning applied to Meter Data
• Energy disaggregation
• Similar homes comparison
• Smart meters used in indirect algorithms for non-smart customers
Connected Thermostats
• 300k Connected Thermostats
• Temperature data time series
Boiler IQ
• Proactive maintenance
• Failure detection
In-Home Displays in a Mobile App
• Data every 10 seconds
• Still needs an access device connected to the router
• Allows real-time mobile alerts
Connected Homes’ Streaming Architecture
What real time looks like
• Temperature updates via WebSocket
• We plot them on a map using postal codes
• Updates for 25 out of 100 partitions
Use Case: High Consumption Alerts
• The red dot at the top is what we want to detect
• The green dots at the bottom are the baseline plus the fridge
High Consumption Alerts: Data Ingest
• Very high volume of messages (a message every 10 seconds)
• Kafka partitions help us cope with the volume
• We often miss reads, so the Samza job also does basic interpolation
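The interpolation step isn't shown in the slides; a minimal pure-Python sketch of the idea (the function name and the 10-second interval are assumptions based on the read frequency mentioned above):

```python
def interpolate_reads(reads, interval=10):
    """Fill gaps in (timestamp, value) readings with linear interpolation.

    `reads` is a list of (epoch_seconds, value) pairs sorted by time;
    any gap larger than `interval` is filled with evenly spaced estimates.
    """
    if not reads:
        return []
    filled = [reads[0]]
    for (t0, v0), (t1, v1) in zip(reads, reads[1:]):
        gap = t1 - t0
        if gap > interval:
            for i in range(1, gap // interval):
                t = t0 + i * interval
                # linear interpolation between the two known reads
                v = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
                filled.append((t, v))
        filled.append((t1, v1))
    return filled
```

For example, two reads 30 seconds apart are filled with two estimated points so downstream jobs still see one value per 10-second slot.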
High Consumption Alerts: Spark Streaming with Cassandra
• Real-time data comes from Kafka
• Cassandra stores historical usage information
• A Spark Streaming job combines both and applies a machine learning algorithm to generate high-usage alerts
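The alert logic itself isn't shown; a much-simplified sketch of combining a Cassandra-derived baseline with live Kafka reads (a fixed threshold factor stands in for the machine learning algorithm, and all names here are hypothetical):

```python
def high_usage_alerts(live_reads, baselines, factor=3.0):
    """Flag customers whose live consumption exceeds `factor` times
    their historical baseline.

    `live_reads` : dict customer_id -> latest watts (from the Kafka stream)
    `baselines`  : dict customer_id -> historical average watts (from Cassandra)
    Returns the list of customer_ids to alert.
    """
    alerts = []
    for customer, watts in live_reads.items():
        baseline = baselines.get(customer)
        # skip customers with no history; alert only on large deviations
        if baseline is not None and watts > factor * baseline:
            alerts.append(customer)
    return alerts
```

In the real job this comparison would run per micro-batch inside Spark Streaming, with the baselines read via the Cassandra connector.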
Josep Casals | @jcasals | Cassandra Summit 2015, Santa Clara, CA
Design Tips: Cassandra (+ Spark)
• Partition your customers using buckets
• Use consistent partitioning across Spark & Cassandra as much as possible
• Don’t make your C* nodes too big (< 1 TB per node), otherwise operations become painful
• Don’t put all your tables in a single keyspace (it’s good to have the flexibility to set replication factors per keyspace)
Cassandra Data Modeling with Buckets
• A hashing function that is uniform and deterministic lets us handle time-series data for any number of customers
• One of our preferred strategies is to use buckets
h(k) = ⌊m · frac(k · A)⌋
• Multiplicative hashing is our preferred simple partitioning algorithm
• m = number of partitions
• A ≈ (√5 − 1)/2 = 0.6180339887… (the golden ratio)
• Online example: jsfiddle.net/joscas/yfp72fq5
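The bucket function can be written directly from the formula above (a small sketch; the function signature is an assumption):

```python
import math

def bucket(k, m, A=(math.sqrt(5) - 1) / 2):
    """Multiplicative hashing: h(k) = floor(m * frac(k * A)).

    Deterministic and close to uniform, so sequential customer ids
    spread evenly across m buckets. A defaults to the golden-ratio
    constant, which gives a low-discrepancy spread.
    """
    frac = (k * A) % 1.0  # fractional part of k * A
    return int(m * frac)
```

For example, `bucket(1, 100)` is 61 and `bucket(2, 100)` is 23; the same key always maps to the same bucket, which is what makes the scheme usable as a partition key.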
Design Tips: Kafka + Spark Streaming
• Keep your own offsets (don’t rely on Spark checkpointing)
• Avro makes the learning curve steeper, but it’s worth the effort (convert to binary + schema as soon as possible)
• Kafka producers are expensive if created for each RDD in each Spark Streaming micro-batch
Design Tips: Kafka + Spark Streaming (continued)
• Beware the OffsetOutOfRangeException (“OOORE”) :-(
• Kafka Manager is very useful
• The Schema Registry is a weak spot (its storage topic needs log.cleanup.policy = compact)
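For context on that last tip: Confluent’s Schema Registry persists schemas in a Kafka topic (named `_schemas` by default), so that topic must use log compaction rather than time-based deletion, or registered schemas can silently disappear when retention kicks in:

```
# Topic-level setting for the Schema Registry's storage topic
# (default name: _schemas); without it, retention can delete schemas.
cleanup.policy=compact
```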
Kafka producer factory for Spark Streaming
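The factory pattern referred to here caches one producer per configuration, so each executor reuses a single connection instead of opening a new one per micro-batch. A minimal sketch (class and method names are hypothetical; a real job would pass a closure that builds, e.g., kafka-python's `KafkaProducer`):

```python
class ProducerFactory:
    """Cache one producer per config so each process reuses a single
    connection instead of creating one per micro-batch/partition."""
    _cache = {}

    @classmethod
    def get(cls, config, create=None):
        # configs are small dicts, so a sorted item tuple works as a key
        key = tuple(sorted(config.items()))
        if key not in cls._cache:
            # `create` would be a real client constructor in practice,
            # e.g. lambda cfg: KafkaProducer(**cfg)  (kafka-python)
            cls._cache[key] = (create or cls._default_create)(config)
        return cls._cache[key]

    @staticmethod
    def _default_create(config):
        raise NotImplementedError("plug in a real Kafka client here")
```

In Spark Streaming this is typically called inside `foreachPartition`, so the producer is created lazily on each executor the first time it is needed and then reused across batches.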