Big Data Analytics for Connected Home
TRANSCRIPT
May 22, 2015
Héloïse Nonne, Senior Data Scientist - Manager
Data Science Consulting
Data analytics for disconnected homes

ARIMA models (AutoRegressive Integrated Moving Average):

y_t = μ + ε_t + φ_1 y_{t-1} + ⋯ + φ_n y_{t-n} − θ_1 ε_{t-1} − ⋯ − θ_n ε_{t-n}

where y_t is the electric load at time t and ε_t is the noise at time t.
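As an illustrative sketch (not from the talk), the autoregressive part of this equation can be fitted by ordinary least squares. The coefficients below (μ = 1.0, φ_1 = 0.5, φ_2 = 0.3) and the noise level are assumed for the simulation; a library such as statsmodels would handle the full ARIMA(p, d, q) case, moving-average terms included.

```python
import numpy as np

# Simulate an AR(2) load series y_t = mu + phi1*y_{t-1} + phi2*y_{t-2} + eps_t
# (the moving-average theta terms are omitted for brevity).
rng = np.random.default_rng(0)
mu, phi1, phi2 = 1.0, 0.5, 0.3          # assumed, illustrative coefficients
y = np.zeros(2000)
for t in range(2, len(y)):
    y[t] = mu + phi1 * y[t - 1] + phi2 * y[t - 2] + rng.normal(scale=0.1)

# Recover the coefficients: regress y_t on [1, y_{t-1}, y_{t-2}].
X = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])
mu_hat, phi1_hat, phi2_hat = np.linalg.lstsq(X, y[2:], rcond=None)[0]

# One-step-ahead load forecast from the fitted model
y_next = mu_hat + phi1_hat * y[-1] + phi2_hat * y[-2]
```

With enough history the least-squares estimates land close to the true coefficients, which is exactly what makes such models usable for short-term load prediction.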
• Very low frequency resolution for local (household) measurements (less than quarterly)
• Only aggregated data (sum of individual loads) for higher-frequency measurements (region, neighborhood)
• Data storage issues
• Limited computation power
• Limited knowledge at the local level
• Limited predictive power
• Complex, sophisticated models exist but are difficult to tune
Reducing electricity costs: a complete data ecosystem

Multiple sources of data for multiple models:
• Weather: sun, wind, cloud cover, humidity, temperature
• Energy production and energy price: historical data, actual measurements (real-time), forecasts
• Building data: appliances and their use, heating, electricity storage, elevators, doors / lights, network activity (-> current occupation), renewable energy, shutter orientation, anthropologic data, building structure (thermal mass)
• Target: electricity demand (the unknown to predict)
• Scales: regional / national vs. local / neighborhood
• Anthropologic data at regional / national scale: energy consumption patterns
• Anthropologic data at local scale: comfort temperature, children at school, activity of occupants, weekday / holiday, hour of day
• Volume
– vast amounts of data
– too large to store and analyse using traditional technology
• Velocity
– speed at which new data is generated
– speed at which data change
• Variety
– types of data (number, text, images, video)
– types of sources (real-time, static)
• Veracity
– accuracy of data (frequency, errors)
– quality of data (sampling errors, typos)
Technology choices depend on the use case. For each need, the software and hardware differ:

• Transaction-oriented (write/read, logs, transactions): HBase, Cassandra. Examples: banking / stock markets, web logs, in/out systems.
• Streaming-oriented (compute on the fly, reactivity, real-time decisions): Storm, Kafka, Spark. Examples: energy load management, industrial processes, aeronautics, customer web journeys.
• Computationally intensive (CPU/GPU bound, complex problems to solve): HPC. Examples: image recognition, research on DNA, …
• Storage-oriented (loads of data, analysis, algorithms): Hadoop, interactive SQL, Tez, Mahout, Spark. Examples: banking / insurance, customer management, records and archiving.
Data analytics on energy load combines several building blocks:
• Anomaly detection: moving average and thresholds, outlier detection
• Load prediction: ARIMA, neural networks, recurrent neural networks
• Identification of consumption patterns: clustering (K-means, DBSCAN), self-organizing maps
• Statistics for reporting on dashboards

Together, these enable recommendations to reschedule appliances and storage of energy (photovoltaic, geothermal, etc.).
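As a toy sketch of the pattern-identification step (all profiles and parameters here are invented for illustration), K-means can separate households with morning-peak and evening-peak daily load curves; in practice one would reach for scikit-learn's KMeans or DBSCAN rather than this plain-numpy version.

```python
import numpy as np

# Synthetic daily load profiles (24 hourly values): 50 morning-peak and
# 50 evening-peak households, with a little noise.
rng = np.random.default_rng(1)
hours = np.arange(24)
morning = np.exp(-(hours - 8) ** 2 / 8.0)
evening = np.exp(-(hours - 19) ** 2 / 8.0)
profiles = np.vstack([
    morning + 0.05 * rng.standard_normal((50, 24)),
    evening + 0.05 * rng.standard_normal((50, 24)),
])

def kmeans(X, k, iters=20):
    """Plain Lloyd's algorithm with deterministic farthest-point init."""
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each profile to its nearest center, then recompute centers.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

labels, centers = kmeans(profiles, k=2)   # two consumption patterns
```

The recovered centers are themselves interpretable "typical days", which is what makes clustering useful for dashboards and for rescheduling recommendations.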
Many use cases:
• Business: scoring and customer segmentation, prediction of energy demand, predictive maintenance (elevators, HVAC, photovoltaics, …), cost reduction
• Society: detecting precarity (underheating), detecting people in distress (illness, elderly, heat waves, …), improved safety (fire detection, security, …)
• Research / knowledge: building optimization (thermal mass, insulation, configuration, window orientation), consumption patterns, social behaviors
• Sustainability: optimizing the use and storage of energy (light management, appliance use, demand reduction, …), improving comfort in the neighborhood, reducing waste (energy, water, appliances)

But remain pragmatic and think about the whole picture: predictive maintenance on light bulbs??!
Predictive maintenance: elevator maintenance, predicting failure before breakage

Data:
• Shaft speed
• Vibrations (X, Y, Z)
• Sound measurements
• Rail vibrations
• Motor temperature
• Oil buffer
• …

Wear and failure modes:
• Bearing fault
• Door: shoe deformation
• Unbalance
• Misalignment
• Resonance
• …

Cost reduction and improvement of reliability through predictive maintenance.
A predictive maintenance management system

Requirements:
• Continuous adaptation of the diagnostic
• Build, increase and maintain knowledge
• Handle large quantities of data
• Handle uncertainty in the diagnostic
• Assess fault severity

Challenges:
• Symptoms are a mix of different causes
• Information is unclear
• Limited frequency resolution
• Missing data
• Noise

[Diagram: sensors feed a remote management system and a data center, building richer knowledge from multiple sources]
Bayesian networks
• Compact representation of entity states or events as random variables
• Contain knowledge about how states / events are related

Node legend:
• BF: bearing fault
• DF: door deformation
• WU: weight unbalance
• RN: resonance
• MA: misalignment
• AYX: vibration frequency peak on axis A at Y X (nodes X1X, X2X, Y1X, Y2X, Z1X, Z2X)
• TP: temperature > x °C
• SP: shaft speed frequency peaks
• SdB: sound > x dB

Advantages of Bayesian networks:
• Qualitative: the dependence relations; quantitative: the strengths of the relations
• Mix a priori knowledge with experimental (real-time) data
• Explanatory (human understanding of phenomena vs. black-box models)
• Uncertainty management (assessment of the probability of failure)
• Possibility to learn parameters and structures (events, entities, causes and effects)
• Output: decision rules for action

Absolute need of prior knowledge from professionals.
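A minimal two-node sketch of how such a network turns an observed symptom into a fault probability (the numbers below are invented for illustration; in a real system the tables come from the professionals' prior knowledge and from data):

```python
# Toy network fragment: WU (weight unbalance) -> SP (shaft-speed freq peaks).
p_wu = 0.05                  # prior P(WU = true), assumed
p_sp_given_wu = 0.90         # P(SP observed | WU), assumed
p_sp_given_not_wu = 0.10     # P(SP observed | no WU), assumed

# Bayes' rule: update the fault probability once the symptom SP is observed.
p_sp = p_sp_given_wu * p_wu + p_sp_given_not_wu * (1 - p_wu)
p_wu_given_sp = p_sp_given_wu * p_wu / p_sp   # ≈ 0.32
```

Even this tiny fragment shows the explanatory advantage: each number has a physical reading, unlike a black-box model's weights.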
Bayesian networks: updating probabilities with experience

A priori conditional probability table for WU, after 10 experiences:
• True (failure): 0.60
• False: 0.40

Update with a new experience (a new observed failure):

P_{n+1} = (P_n × nb_experiences + 1) / (nb_experiences + 1)

After 11 experiences:
• True (failure): 0.636
• False: 0.364

One can unlearn (forget past, outdated experiences) by using fading tables: add a fading factor in front of the oldest experiences.
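The update rule and the fading-table idea fit in a few lines. The fading-factor form below, which discounts the effective count of old experiences, is an illustrative assumption about how the fading table works:

```python
def update(p, n_experiences, observed=True):
    """The slide's rule: fold one new experience into P(failure)."""
    hit = 1.0 if observed else 0.0
    return (p * n_experiences + hit) / (n_experiences + 1)

def update_with_fading(p, n_experiences, fading=0.9, observed=True):
    """Assumed fading-table variant: discount the weight of past
    experiences before applying the same update."""
    n_eff = n_experiences * fading
    hit = 1.0 if observed else 0.0
    return (p * n_eff + hit) / (n_eff + 1)

p_new = update(0.60, 10)     # reproduces the table's 0.636...
```

With fading < 1, each new experience carries more weight, so the table tracks a drifting failure rate instead of averaging over its whole outdated history.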
The big (data) picture
• Many sources of data: weather, energy production, economic, social and behavioral data, appliance characteristics, current building occupation, activity, etc.
• Different scales: worldwide, regional, local, individual
• Different times: historical data, year, month, day, hour, real-time
• The system is not going to be perfect at once -> design it for constant improvement
• A single model is useless: each model has its use, and models feed each other with their knowledge and predictions
• Choose the right model and the right technology according to the use case, time cost, energy cost, pragmatism, and realism
• Build models with the professionals who know the problem -> build on existing knowledge

An efficient system implies close collaboration between business, researchers, manufacturers, maintainers, owners, users, developers, data scientists, data managers, optimization specialists, and end-users.
Quantmetry – Data science specialist

The data pyramid, from top to bottom: Act, Predict, Analyze, Store, Collect.
• More and more data available
• Store everything!
• Analyze to better understand strong and weak signals
• Predict what may happen, based on past trends
• Automate decision and action

Quantmetry supports its clients across every layer of the data pyramid, and thus contributes to their digital transformation through quantitative methods, with concrete results on their business performance.
• A "pure player" consulting firm in Big Data and Data science, whose commercial development started in 2013
• Advanced statistical methods, machine learning, and Big Data technologies
• 2014: €1.5 million in revenue, with strong growth ambitions in France and abroad
• About twenty data scientists / consultants

Quantmetry's activities
Business optimization through data; structuring a Data Lab.

Consulting:
• Detection and prioritization of data opportunities
• Construction of IT architecture schemas
• Feedback from experience and best practices
• Organization and governance schemas
• Choice of a technology architecture

Support (change management, project management):
• Scoping, industrialization projects
• Methodology (statistical models and algorithms)
• Big Data technologies
• Skills development
• Recruitment
• Governance

Delivery (pilot projects, industrialization):
• Data science proofs of concept
• Technology pilots
• Industrialization of pilots (APIs, …)
• Creation of a Big Data architecture and setup of data flows
Technology watch and experimentation
• Investigation themes:
– Online learning
– Deep learning and neural networks
– Industrialization
– Semantic analysis
– Energy (time series analysis)
– Smart cities
– Improvement of the user experience
• An actor in the Big Data ecosystem: participation in seminars, international conferences, hackathons, Kaggle competitions, vendor partnerships, … Collaborations with research laboratories and schools.
• Creation and development of specific products around Big Data technologies
• Research and development in Data science
[Chart: a baseline logistic regression achieves a lift of 2; gradient boosting on unstructured data with feature engineering achieves a lift of 6]
Some Data science references:
• Improvement of the lift for a bank acquiring insured customers
• Churn detection for a telecom operator (top features: cancellation-page URL, age, group, number of pages viewed, session duration)
• Setting up a Data Lab for an insurer
• Behavior analysis for a mutual insurance company
• Optimization of a pricing tool for a B2B distribution player
• Predictive models of energy consumption
Excellence, Altruism, Results … and Big Data
Visit our blog: quantmetry-blog.com
www.quantmetry.com