Automated Anomaly Detection, Data Validation and Correction for Environmental Sensors using Statistical Machine Learning Techniques

Touraj Farahmand - Aquatic Informatics Inc.
Kevin Swersky - Aquatic Informatics Inc.
Nando de Freitas - Department of Computer Science (Machine Learning), University of British Columbia (UBC)

www.aquaticinformatics.com



Automated data validation and QA/QC is becoming increasingly important

• The number of real-time monitoring sites is growing, producing large volumes of high-sampling-rate data
• Quality-controlled, clean real-time data must be continuously available for:
  • Publishing services
  • Online data mining and analysis tools
  • Online warning and alert systems, to minimize false-positive alerts
  • Mission-critical modeling systems such as flood forecasting and event detection

[Diagram: where errors enter the data path from sensor to data management system]

Data path: Data Logger → Comm. Link → Data Acquisition and Decoding → Data Management System

Signals shown:
• Real parameter from the natural environment
• Sensor signal before comm. transmission (logger signal)
• Observed telemetry signal after comm. reception and decoding

Features annotated on the signals: sensor outliers, sensor drift, comm. outliers, comm. gap, real abnormal event

Other labels: site visits and logger data files, field measurements, calibration errors, fouling errors, logger data file, telemetry data

Environmental time series are generally complex and hard to model

Problems:
• Highly non-stationary
• Highly non-linear
• Many changes in dynamics
• Can contain outliers, anomalies, gaps, etc.

Our models need to be:
• General
• Flexible
• Robust
• Interpretable
• Fast and efficient enough for real-time applications
• Easy to set up and use
• Able to provide the uncertainty of their results

Data Modeling Approaches

The (traditional) frequentist approach

Examples:
• Linear regression
• Hypothesis testing
• Confidence intervals

In the frequentist paradigm, probability is defined in terms of the frequencies of random, repeatable events.

Here, we create a model with parameters Θ and fit the model to data X. This gives a probability distribution P(X|Θ), which is the likelihood of the data given the parameters.

We can create very flexible models by adding more parameters.

With enough parameters we can fit almost anything!
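As a minimal sketch of fitting Θ by maximizing the likelihood P(X|Θ), the Python below assumes a Gaussian model with Θ = (mean, variance). The function names and the water-level readings are made up for illustration and do not come from the presentation.

```python
import math

def gaussian_log_likelihood(data, mean, variance):
    """Return log P(X | mean, variance) for independent Gaussian readings."""
    n = len(data)
    squared_error = sum((x - mean) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * variance) - squared_error / (2 * variance)

def fit_gaussian_mle(data):
    """Maximum-likelihood estimates: the sample mean and the (biased) sample variance."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, variance

# Hypothetical stage readings in metres (illustrative values only).
readings = [2.31, 2.35, 2.29, 2.40, 2.33, 2.36]

mean_hat, var_hat = fit_gaussian_mle(readings)
best = gaussian_log_likelihood(readings, mean_hat, var_hat)

# Any other parameter setting gives a log-likelihood that is no higher.
shifted = gaussian_log_likelihood(readings, mean_hat + 0.05, var_hat)
print(f"mean={mean_hat:.3f} variance={var_hat:.5f}")
print(f"log-likelihood at MLE={best:.3f}, at shifted mean={shifted:.3f}")
```

The fit returns a single point estimate of Θ and says nothing about how uncertain that estimate is.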

“With enough parameters we can fit almost anything!”

This sounds nice, but adding too many parameters means we will overfit.

Overfitting means we can get very low error on the training data while the model generalizes poorly to new data and is useless in practice.

But a model that is too simple will also do a poor job.

We need a tradeoff between model complexity and model generalization.

This is difficult and tedious with frequentist methods.
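The tradeoff between complexity and generalization can be illustrated with a small experiment. The sketch below assumes a synthetic noisy sine signal and ordinary least-squares polynomial fits of degree 1, 3 and 9; the signal, noise level and degrees are arbitrary choices for the example and are not taken from the presentation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n noisy samples of a sine wave on [-1, 1]."""
    x = np.sort(rng.uniform(-1.0, 1.0, n))
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(12)   # small training set
x_test, y_test = make_data(200)    # held-out data from the same process

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    # Least-squares fit; a higher degree means more parameters.
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```

Training error can only go down as the degree grows, because each lower-degree polynomial is a special case of the higher-degree one. The held-out error typically stops improving and then gets worse, which is the overfitting described above.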

Bayesian methods solve these issues

In the Bayesian paradigm, probability quantifies uncertainty and allows that uncertainty to be revised precisely in light of new observations.

• Highly flexible, very general, interpretable and easy to work with
• Automatically finds the correct model complexity
• Bonus: naturally incorporates uncertainty and prior knowledge about the problem

Some applications of statistical machine learning:
• Financial prediction
• Fraud detection (e.g. credit cards)
• Spam detection
• Search and recommendation (e.g. Google, Amazon)
• Automatic speech recognition and speaker verification
• Face location and identification
• Troubleshooting and fault detection/correction
• Printed and handwritten text parsing
• Much more…

The Bayesian approach

Rather than assuming there is one true Θ that generates our data, we assume there is a distribution over possible Θ's.

Our goal is now to find P(Θ|X), and we use Bayes' rule:

P(Θ|X) = P(X|Θ) P(Θ) / P(X)

P(Θ) is called the prior; it is used to express prior knowledge about the parameters.

Although simple, this idea provides a powerful modeling framework and naturally guards against overfitting.

We can now use infinitely many parameters! P(Θ|X) will only be high when Θ appropriately models the data.

This gives us very flexible and very powerful models.
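A small worked case of Bayes' rule may help. The sketch below assumes Θ is the unknown mean of a sensor reading, the prior over Θ is Gaussian, and the measurement noise is Gaussian with known variance, so the posterior P(Θ|X) is also Gaussian and has a closed form. The function name, prior and readings are hypothetical and not from the presentation.

```python
def posterior_normal_mean(data, prior_mean, prior_var, noise_var):
    """Posterior over an unknown mean, for a Gaussian prior and known Gaussian noise.

    Precisions (inverse variances) add: one term from the prior and one
    term per observation.
    """
    n = len(data)
    post_precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + sum(data) / noise_var)
    return post_mean, post_var

# Hypothetical stage readings in metres (illustrative values only).
readings = [2.31, 2.35, 2.29, 2.40, 2.33, 2.36]

# Assumed prior belief: level near 2.0 m, fairly uncertain (std 0.5 m).
# Assumed sensor noise: std 0.1 m.
post_mean, post_var = posterior_normal_mean(
    readings, prior_mean=2.0, prior_var=0.25, noise_var=0.01
)
print(f"posterior mean={post_mean:.4f}, posterior std={post_var ** 0.5:.4f}")
```

The posterior mean lies between the prior mean and the sample mean, and the posterior variance is smaller than the prior variance. The result is a distribution over Θ rather than a single value, so the remaining uncertainty is available for later steps.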

R&D Status and Results

• A generic Bayesian inference framework has been developed and compiled into the AQUARIUS scripting toolbox for alpha testing
• A fast and efficient (real-time) linear and piecewise (switching) linear dynamical machine learning model has been developed and compiled into the AQUARIUS scripting toolbox:
  • Sensor fault/anomaly detection, e.g. outlier, stuck sensor, offset, …
  • Data correction and estimation, e.g. gap filling
  • Short-term prediction
  • Smoothing
  • Minimal user interaction, since it learns all parameters from data
• Nonlinear dynamical machine learning models are under research:
  • They are more accurate for modeling highly chaotic signals
  • The big challenge is the computational complexity and speed of training and inference
• The framework for suggested corrections/flagging and the audit trail has already been added to the Data Correction toolbox for automated processes
• No UI or front end is available for modeling yet. It is coming soon…
• We have started a pilot project with one of our clients
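The linear dynamical model itself is not reproduced in this transcript. As a rough illustration of the general idea only, the sketch below runs a scalar Kalman filter on a random-walk ("local level") model, flags observations whose normalized innovation is large, and carries the prediction through gaps. The function name, noise variances and threshold are assumptions chosen for the example; here they are fixed by hand, whereas the model described above is said to learn its parameters from data.

```python
import math

def local_level_filter(y, q=0.01, r=0.25, m0=0.0, p0=1e3, z_thresh=4.0):
    """Kalman filter for a random-walk level observed with noise.

    y        : readings; None or NaN marks a gap
    q, r     : assumed process and measurement noise variances
    m0, p0   : prior mean and (diffuse) prior variance of the level
    z_thresh : flag an observation when |innovation| / std exceeds this
    Returns (estimates, variances, flags).
    """
    m, p = m0, p0
    estimates, variances, flags = [], [], []
    for obs in y:
        # Predict: the level is expected to stay put; uncertainty grows by q.
        p = p + q
        if obs is None or math.isnan(obs):
            # Gap: no update, so the prediction is the estimate.
            flags.append("gap")
        else:
            s = p + r                       # innovation variance
            z = (obs - m) / math.sqrt(s)    # normalized innovation
            if abs(z) > z_thresh:
                # Outlier: reject the reading and keep the prediction.
                flags.append("outlier")
            else:
                k = p / s                   # Kalman gain
                m = m + k * (obs - m)
                p = (1.0 - k) * p
                flags.append("ok")
        estimates.append(m)
        variances.append(p)
    return estimates, variances, flags

# Illustrative series with one spike and one missing value.
series = [10.0, 10.1, 9.9, 10.0, 50.0, None, 10.1, 10.0]
est, var, flags = local_level_filter(series)
for obs, e, v, f in zip(series, est, var, flags):
    print(f"obs={obs} estimate={e:.3f} std={math.sqrt(v):.3f} flag={f}")
```

The variance returned with each estimate grows while readings are missing or rejected, so a filled gap carries a wider uncertainty than an observed point. A stuck sensor or a persistent offset would need additional logic, such as the switching between models mentioned above, which this sketch does not attempt.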

AQUARIUS Whiteboard for training and testing models

We can run this on the server as part of the data pre-processing workflow.

Model results (figures not reproduced in this transcript):

• Univariate Model Results: Gap Filling/Prediction
• Multivariate Model Results: Gap Filling/Prediction (two slides)
• Univariate Model Results: Sensor fault detection (with flags)
• Univariate Model Results: Anomaly detection
• Univariate Model Results: Spike detection
• Univariate Model Results: Offset detection (with flags)

Summary

• Automated anomaly detection, data validation and QA/QC is becoming increasingly important
• Bayesian techniques and probabilistic models give us a very flexible and powerful framework for modeling sequential data and time series
• They naturally incorporate uncertainty and prior knowledge, which other techniques do not support
• They naturally guard against overfitting, which is a serious problem with traditional methods
• They provide the distribution of model parameters given the observations
• In most use cases they learn the required parameters from data and metadata with minimal user interaction
• They can be used for:
  • Anomaly detection
  • Data correction (estimation)
  • Prediction
  • Smoothing
  • Sensor fault detection and diagnosis
  • Uncertainty propagation for derived data

Questions?