Big Data Analytics for Connected Home
TRANSCRIPT
May 22, 2015
Héloïse Nonne, Senior Data Scientist - Manager
Data Science Consulting
Data analytics for disconnected homes

ARIMA models (AutoRegressive Integrated Moving Average):

y_t = μ + ε_t + φ_1 y_{t-1} + ⋯ + φ_n y_{t-n} − θ_1 ε_{t-1} − ⋯ − θ_n ε_{t-n}

where y_t is the electric load at time t and ε_t is the noise at time t.
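As an illustrative sketch (not from the talk), the autoregressive part of this equation can be fitted by ordinary least squares. The coefficients below (μ = 1.0, φ_1 = 0.5, φ_2 = 0.3) and the noise level are assumed for the simulation; a library such as statsmodels would handle the full ARIMA(p, d, q) case, moving-average terms included.

```python
import numpy as np

# Simulate an AR(2) load series y_t = mu + phi1*y_{t-1} + phi2*y_{t-2} + eps_t
# (the moving-average theta terms are omitted for brevity).
rng = np.random.default_rng(0)
mu, phi1, phi2 = 1.0, 0.5, 0.3          # assumed, illustrative coefficients
y = np.zeros(2000)
for t in range(2, len(y)):
    y[t] = mu + phi1 * y[t - 1] + phi2 * y[t - 2] + rng.normal(scale=0.1)

# Recover the coefficients: regress y_t on [1, y_{t-1}, y_{t-2}].
X = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])
mu_hat, phi1_hat, phi2_hat = np.linalg.lstsq(X, y[2:], rcond=None)[0]

# One-step-ahead load forecast from the fitted model
y_next = mu_hat + phi1_hat * y[-1] + phi2_hat * y[-2]
```

With enough history the least-squares estimates land close to the true coefficients, which is exactly what makes such models usable for short-term load prediction.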
• Very low frequency resolution for local (household) measurements (less than quarterly)
• Only aggregated data (sum of individual loads) for higher-frequency measurements (region, neighborhood)
• Data storage issues
• Limited computation power
• Limited knowledge at the local level
• Limited predictive power
• Complex, sophisticated models exist but are difficult to tune
Reducing electricity costs: a complete data ecosystem

Multiple sources of data for multiple models:
• Weather: sun, wind, cloud cover, humidity, temperature
• Energy production and energy price: historical data, actual measurements (real-time), forecasts
• Building data: appliances and their use, heating, electricity storage, elevators, doors / lights, network activity (-> current occupation), renewable energy, shutter orientation, anthropologic data, building structure (thermal mass)
• Target: electricity demand (the unknown to predict)
• Scales: regional / national vs. local / neighborhood
• Anthropologic data at regional / national scale: energy consumption patterns
• Anthropologic data at local scale: comfort temperature, children at school, activity of occupants, weekday / holiday, hour of day
• Volume
– vast amounts of data
– too large to store and analyse using traditional technology
• Velocity
– speed at which new data is generated
– speed at which data change
• Variety
– types of data (number, text, images, video)
– types of sources (real-time, static)
• Veracity
– accuracy of data (frequency, errors)
– quality of data (sampling errors, typos)
Technology choices depend on the use case. For each need, the software and hardware differ:

• Transaction-oriented (write/read, logs, transactions): HBase, Cassandra. Examples: banking / stock markets, web logs, in/out systems.
• Streaming-oriented (compute on the fly, reactivity, real-time decisions): Storm, Kafka, Spark. Examples: energy load management, industrial processes, aeronautics, customer web journeys.
• Computationally intensive (CPU/GPU bound, complex problems to solve): HPC. Examples: image recognition, research on DNA, …
• Storage-oriented (loads of data, analysis, algorithms): Hadoop, interactive SQL, Tez, Mahout, Spark. Examples: banking / insurance, customer management, records and archiving.
Data analytics on energy load combines several building blocks:
• Anomaly detection: moving average and thresholds, outlier detection
• Load prediction: ARIMA, neural networks, recurrent neural networks
• Identification of consumption patterns: clustering (K-means, DBSCAN), self-organizing maps
• Statistics for reporting on dashboards

Together, these enable recommendations to reschedule appliances and storage of energy (photovoltaic, geothermal, etc.).
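As a toy sketch of the pattern-identification step (all profiles and parameters here are invented for illustration), K-means can separate households with morning-peak and evening-peak daily load curves; in practice one would reach for scikit-learn's KMeans or DBSCAN rather than this plain-numpy version.

```python
import numpy as np

# Synthetic daily load profiles (24 hourly values): 50 morning-peak and
# 50 evening-peak households, with a little noise.
rng = np.random.default_rng(1)
hours = np.arange(24)
morning = np.exp(-(hours - 8) ** 2 / 8.0)
evening = np.exp(-(hours - 19) ** 2 / 8.0)
profiles = np.vstack([
    morning + 0.05 * rng.standard_normal((50, 24)),
    evening + 0.05 * rng.standard_normal((50, 24)),
])

def kmeans(X, k, iters=20):
    """Plain Lloyd's algorithm with deterministic farthest-point init."""
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each profile to its nearest center, then recompute centers.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

labels, centers = kmeans(profiles, k=2)   # two consumption patterns
```

The recovered centers are themselves interpretable "typical days", which is what makes clustering useful for dashboards and for rescheduling recommendations.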
Many use cases:
• Business: scoring and customer segmentation, prediction of energy demand, predictive maintenance (elevators, HVAC, photovoltaics, …), cost reduction
• Society: detecting precarity (underheating), detecting people in distress (illness, elderly, heat waves, …), improved safety (fire detection, security, …)
• Research / knowledge: building optimization (thermal mass, insulation, configuration, window orientation), consumption patterns, social behaviors
• Sustainability: optimizing the use and storage of energy (light management, appliance use, demand reduction, …), improving comfort in the neighborhood, reducing waste (energy, water, appliances)

But remain pragmatic and think about the whole picture: predictive maintenance on light bulbs??!
Predictive maintenance: elevator maintenance, predicting failure before breakage

Data:
• Shaft speed
• Vibrations (X, Y, Z)
• Sound measurements
• Rail vibrations
• Motor temperature
• Oil buffer
• …

Wear and failure modes:
• Bearing fault
• Door: shoe deformation
• Unbalance
• Misalignment
• Resonance
• …

Cost reduction and improvement of reliability through predictive maintenance.
A predictive maintenance management system

Requirements:
• Continuous adaptation of the diagnostic
• Build, increase and maintain knowledge
• Handle large quantities of data
• Handle uncertainty in the diagnostic
• Assess fault severity

Challenges:
• Symptoms are a mix of different causes
• Information is unclear
• Limited frequency resolution
• Missing data
• Noise

[Diagram: sensors feed a remote management system and a data center, building richer knowledge from multiple sources]
Bayesian networks
• Compact representation of entity states or events as random variables
• Contain knowledge about how states / events are related

Node legend:
• BF: bearing fault
• DF: door deformation
• WU: weight unbalance
• RN: resonance
• MA: misalignment
• AYX: vibration frequency peak on axis A at Y X (nodes X1X, X2X, Y1X, Y2X, Z1X, Z2X)
• TP: temperature > x °C
• SP: shaft speed frequency peaks
• SdB: sound > x dB

Advantages of Bayesian networks:
• Qualitative: the dependence relations; quantitative: the strengths of the relations
• Mix a priori knowledge with experimental (real-time) data
• Explanatory (human understanding of phenomena vs. black-box models)
• Uncertainty management (assessment of the probability of failure)
• Possibility to learn parameters and structures (events, entities, causes and effects)
• Output: decision rules for action

Absolute need of prior knowledge from professionals.
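A minimal two-node sketch of how such a network turns an observed symptom into a fault probability (the numbers below are invented for illustration; in a real system the tables come from the professionals' prior knowledge and from data):

```python
# Toy network fragment: WU (weight unbalance) -> SP (shaft-speed freq peaks).
p_wu = 0.05                  # prior P(WU = true), assumed
p_sp_given_wu = 0.90         # P(SP observed | WU), assumed
p_sp_given_not_wu = 0.10     # P(SP observed | no WU), assumed

# Bayes' rule: update the fault probability once the symptom SP is observed.
p_sp = p_sp_given_wu * p_wu + p_sp_given_not_wu * (1 - p_wu)
p_wu_given_sp = p_sp_given_wu * p_wu / p_sp   # ≈ 0.32
```

Even this tiny fragment shows the explanatory advantage: each number has a physical reading, unlike a black-box model's weights.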
Bayesian networks: updating probabilities with experience

A priori conditional probability table for WU, after 10 experiences:
• True (failure): 0.60
• False: 0.40

Update with a new experience (a new observed failure):

P_{n+1} = (P_n × nb_experiences + 1) / (nb_experiences + 1)

After 11 experiences:
• True (failure): 0.636
• False: 0.364

One can unlearn (forget past, outdated experiences) by using fading tables: add a fading factor in front of the oldest experiences.
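The update rule and the fading-table idea fit in a few lines. The fading-factor form below, which discounts the effective count of old experiences, is an illustrative assumption about how the fading table works:

```python
def update(p, n_experiences, observed=True):
    """The slide's rule: fold one new experience into P(failure)."""
    hit = 1.0 if observed else 0.0
    return (p * n_experiences + hit) / (n_experiences + 1)

def update_with_fading(p, n_experiences, fading=0.9, observed=True):
    """Assumed fading-table variant: discount the weight of past
    experiences before applying the same update."""
    n_eff = n_experiences * fading
    hit = 1.0 if observed else 0.0
    return (p * n_eff + hit) / (n_eff + 1)

p_new = update(0.60, 10)     # reproduces the table's 0.636...
```

With fading < 1, each new experience carries more weight, so the table tracks a drifting failure rate instead of averaging over its whole outdated history.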
The big (data) picture
• Many sources of data: weather, energy production, economic, social and behavioral data, appliance characteristics, current building occupation, activity, etc.
• Different scales: worldwide, regional, local, individual
• Different times: historical data, year, month, day, hour, real-time
• The system is not going to be perfect at once -> design it for constant improvement
• A single model is useless: each model has its use, and models feed each other with their knowledge and predictions
• Choose the right model and the right technology according to the use case, time cost, energy cost, pragmatism, and realism
• Build models with the professionals who know the problem -> build on existing knowledge

An efficient system implies close collaboration between business, researchers, manufacturers, maintainers, owners, users, developers, data scientists, data managers, optimization specialists, and end-users.
Quantmetry – Data science specialist

The data pyramid, from top to bottom: Act, Predict, Analyze, Store, Collect.
• More and more data available
• Store everything!
• Analyze to better understand strong and weak signals
• Predict what may happen, based on past trends
• Automate decision and action

Quantmetry supports its clients across every layer of the data pyramid, and thus contributes to their digital transformation through quantitative methods, with concrete results on their business performance.
• A "pure player" consulting firm in Big Data and Data science, whose commercial development started in 2013
• Advanced statistical methods, machine learning, and Big Data technologies
• 2014: €1.5 million in revenue, with strong growth ambitions in France and abroad
• About twenty data scientists / consultants

Quantmetry's activities
Business optimization through data; structuring a Data Lab.

Consulting:
• Detection and prioritization of data opportunities
• Construction of IT architecture schemas
• Feedback from experience and best practices
• Organization and governance schemas
• Choice of a technology architecture

Support (change management, project management):
• Scoping, industrialization projects
• Methodology (statistical models and algorithms)
• Big Data technologies
• Skills development
• Recruitment
• Governance

Delivery (pilot projects, industrialization):
• Data science proofs of concept
• Technology pilots
• Industrialization of pilots (APIs, …)
• Creation of a Big Data architecture and setup of data flows
Technology watch and experimentation
• Investigation themes:
– Online learning
– Deep learning and neural networks
– Industrialization
– Semantic analysis
– Energy (time series analysis)
– Smart cities
– Improvement of the user experience
• An actor in the Big Data ecosystem: participation in seminars, international conferences, hackathons, Kaggle competitions, vendor partnerships, … Collaborations with research laboratories and schools.
• Creation and development of specific products around Big Data technologies
• Research and development in Data science
[Chart: a baseline logistic regression achieves a lift of 2; gradient boosting on unstructured data with feature engineering achieves a lift of 6]
Some Data science references:
• Improvement of the lift for a bank acquiring insured customers
• Churn detection for a telecom operator (top features: cancellation-page URL, age, group, number of pages viewed, session duration)
• Setting up a Data Lab for an insurer
• Behavior analysis for a mutual insurance company
• Optimization of a pricing tool for a B2B distribution player
• Predictive models of energy consumption
Excellence, Altruism, Results … and Big Data
Visit our blog: quantmetry-blog.com
www.quantmetry.com