TRANSCRIPT
Streaming Customer Insights With DSE Cassandra and Apache Kafka
At British Gas Connected Homes
Josep Casals | @jcasals | 2016 London
Data Sources
• Gas and electricity meter readings
• Thermostat temperature data
• Connected boiler data
• Real-time energy consumption data
• Introducing motion sensors, window and door sensors, etc.
Meter Data
• Millions of gas and electricity customers
• 2 Million smart meters
• Readings every 30 minutes from smart meters
Machine Learning applied to Meter Data
• Energy disaggregation
• Similar homes comparison
• Smart meters used in indirect algorithms for non-smart customers
Connected Thermostats
• 300k Connected Thermostats
• Temperature data time series
Boiler IQ
• Proactive maintenance
• Failure detection
In-Home Displays in a Mobile App
• Data every 10 seconds
• Still needs an access device connected to the router
• Allows real-time mobile alerts
Connected Homes’ Streaming Architecture
What real time looks like
• Temperature updates via WebSocket
• We plot them on a map using postal codes
• Updates for 25 out of 100 partitions
Use Case: High Consumption Alerts
• The red dot at the top is what we want to detect
• The green dots at the bottom are the baseline plus the fridge
High Consumption Alerts: Data Ingest
• Very high volume of messages (a message every 10 seconds)
• Kafka partitions help us cope with the volume
• We often miss reads, so the Samza job also does basic interpolation
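The interpolation step isn't shown in the slides; a minimal pure-Python sketch of the idea (the function name and the 10-second interval are assumptions based on the read frequency mentioned above):

```python
def interpolate_reads(reads, interval=10):
    """Fill gaps in (timestamp, value) readings with linear interpolation.

    `reads` is a list of (epoch_seconds, value) pairs sorted by time;
    any gap larger than `interval` is filled with evenly spaced estimates.
    """
    if not reads:
        return []
    filled = [reads[0]]
    for (t0, v0), (t1, v1) in zip(reads, reads[1:]):
        gap = t1 - t0
        if gap > interval:
            for i in range(1, gap // interval):
                t = t0 + i * interval
                # linear interpolation between the two known reads
                v = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
                filled.append((t, v))
        filled.append((t1, v1))
    return filled
```

For example, two reads 30 seconds apart are filled with two estimated points so downstream jobs still see one value per 10-second slot.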
High Consumption Alerts: Spark Streaming with Cassandra
• Real-time data comes from Kafka
• Cassandra stores historical usage information
• A Spark Streaming job combines both and applies a machine learning algorithm to generate high-usage alerts
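The alert logic itself isn't shown; a much-simplified sketch of combining a Cassandra-derived baseline with live Kafka reads (a fixed threshold factor stands in for the machine learning algorithm, and all names here are hypothetical):

```python
def high_usage_alerts(live_reads, baselines, factor=3.0):
    """Flag customers whose live consumption exceeds `factor` times
    their historical baseline.

    `live_reads` : dict customer_id -> latest watts (from the Kafka stream)
    `baselines`  : dict customer_id -> historical average watts (from Cassandra)
    Returns the list of customer_ids to alert.
    """
    alerts = []
    for customer, watts in live_reads.items():
        baseline = baselines.get(customer)
        # skip customers with no history; alert only on large deviations
        if baseline is not None and watts > factor * baseline:
            alerts.append(customer)
    return alerts
```

In the real job this comparison would run per micro-batch inside Spark Streaming, with the baselines read via the Cassandra connector.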
Josep Casals | @jcasals | Cassandra Summit 2015, Santa Clara, CA
Design Tips: Cassandra (+ Spark)
• Partition your customers using buckets
• Use consistent partitioning across Spark & Cassandra as much as possible
• Don’t make your C* nodes too big (< 1 TB per node), otherwise operations become painful
• Don’t put all your tables in a single keyspace (it’s good to have the flexibility to set replication factors per keyspace)
Cassandra Data Modeling with Buckets
• A hashing function that is uniform and deterministic lets us handle time-series data for any number of customers
• One of our preferred strategies is to use buckets
h(k) = ⌊m · frac(k · A)⌋
• Multiplicative hashing is our preferred simple partitioning algorithm
• m = number of partitions
• A ≈ (√5 − 1)/2 = 0.6180339887… (the golden ratio)
• Online example: jsfiddle.net/joscas/yfp72fq5
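The bucket function can be written directly from the formula above (a small sketch; the function signature is an assumption):

```python
import math

def bucket(k, m, A=(math.sqrt(5) - 1) / 2):
    """Multiplicative hashing: h(k) = floor(m * frac(k * A)).

    Deterministic and close to uniform, so sequential customer ids
    spread evenly across m buckets. A defaults to the golden-ratio
    constant, which gives a low-discrepancy spread.
    """
    frac = (k * A) % 1.0  # fractional part of k * A
    return int(m * frac)
```

For example, `bucket(1, 100)` is 61 and `bucket(2, 100)` is 23; the same key always maps to the same bucket, which is what makes the scheme usable as a partition key.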
Design Tips: Kafka + Spark Streaming
• Keep your own offsets (don’t rely on Spark checkpointing)
• Avro makes the learning curve steeper, but it’s worth the effort (convert to binary + schema as soon as possible)
• Kafka producers are expensive if created for each RDD in each Spark Streaming micro-batch
Design Tips: Kafka + Spark Streaming (continued)
• Beware the OffsetOutOfRangeException (“OOORE”) :-(
• Kafka Manager is very useful
• The Schema Registry is a weak spot (its storage topic needs log.cleanup.policy = compact)
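For context on that last tip: Confluent’s Schema Registry persists schemas in a Kafka topic (named `_schemas` by default), so that topic must use log compaction rather than time-based deletion, or registered schemas can silently disappear when retention kicks in:

```
# Topic-level setting for the Schema Registry's storage topic
# (default name: _schemas); without it, retention can delete schemas.
cleanup.policy=compact
```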
Kafka producer factory for Spark Streaming
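The factory pattern referred to here caches one producer per configuration, so each executor reuses a single connection instead of opening a new one per micro-batch. A minimal sketch (class and method names are hypothetical; a real job would pass a closure that builds, e.g., kafka-python's `KafkaProducer`):

```python
class ProducerFactory:
    """Cache one producer per config so each process reuses a single
    connection instead of creating one per micro-batch/partition."""
    _cache = {}

    @classmethod
    def get(cls, config, create=None):
        # configs are small dicts, so a sorted item tuple works as a key
        key = tuple(sorted(config.items()))
        if key not in cls._cache:
            # `create` would be a real client constructor in practice,
            # e.g. lambda cfg: KafkaProducer(**cfg)  (kafka-python)
            cls._cache[key] = (create or cls._default_create)(config)
        return cls._cache[key]

    @staticmethod
    def _default_create(config):
        raise NotImplementedError("plug in a real Kafka client here")
```

In Spark Streaming this is typically called inside `foreachPartition`, so the producer is created lazily on each executor the first time it is needed and then reused across batches.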