



• In many applications, wireless sensing systems are used for inference and prediction about environmental phenomena.

• Statistical models are widely used to represent these environmental phenomena.

• Models characterize how unknown quantities (phenomena) are related to known quantities (measurements).

• Choosing the models involves a great deal of uncertainty.

• Often a single model M is used. If M does not characterize a phenomenon correctly, the inferences and predictions will not be accurate.

• It is better to start with multiple plausible models and select the model by collecting measurements at informative locations.

Reducing Uncertainty in Sensor Calibration

Reducing Uncertainty in Hardware Functionality (Fault Detection/Diagnosis)


Reducing Uncertainty in Model Selection

Minimizing Data Uncertainty through System Design

Deployment: data quality indicators

• Bangladesh: 45%
• GDI: sensors reported 3-60% faulty data
• Ecuador Volcano: 82% false negative rate / 13% false positive rate
• Macroscope: 8 of 33 temperature sensors faulty

Laura Balzano, Nabil Hajj Chehade, Sheela Nair, Nithya Ramanathan, Abhishek Sharma, Deborah Estrin, Leana Golubchik, Ramesh Govindan, Mark Hansen, Eddie Kohler, Greg Pottie, Mani Srivastava

Integrity Group, Center for Embedded Networked Sensing

Introduction: There are Many Sources of Uncertainty in Interpreting Data

Environment Modeling Uncertainty / Sensor Calibration Uncertainty


UCLA – UCR – Caltech – USC – UC Merced

Center for Embedded Networked Sensing

Data uncertainty can be reduced through careful system design!

Hardware Uncertainty

• Wireless sensing systems utilize low-cost and unreliable hardware, so faults are common.

[Figure: examples of sensor faults]

• An accurate calibration function is required to translate raw readings from sensors into physical quantities.

• Calibration parameters for most sensors drift non-deterministically over time.

Problem Description: Online Fault Detection and Diagnosis

By detecting faults when they occur, instead of after the fact, users can take action in the field to validate questionable data and fix hardware faults.

Confidence

Assumptions: faults can be common; an initial fault-free training period is not always available; and environmental phenomena are hard to predict, so tight bounds on expected behavior are not possible.

Evaluated in Real-World Deployments

Confidence detects faults with low false positive and negative rates.

It is difficult to validate what is truly a fault without ground truth. In our San Joaquin deployment we validated the data by analyzing soil samples taken at each sensor.

Outlier Detection: Using a continually updated distribution, in place of statically defined thresholds, makes Confidence resilient to human configuration error and adaptable to dynamic environments

[Figure: readings in the feature space (gradient vs. standard deviation), with outlying clusters mapped to remediation actions such as Replace Sensor]

Readings are mapped into a multi-dimensional space defined by carefully chosen features: gradient, distance from LDR, distance from NLDR, standard deviation.

Points far from the origin are considered faulty. Assume a normal distribution of distances for good points: points more than 2 standard deviations from the mean distance are treated as outliers and rejected, while all other points are used to continually update the distribution parameters (a minimal sketch follows).
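The running update can be sketched in a few lines of Python (a minimal illustration, not the Confidence implementation; the seed statistics and feature handling are invented for the example):

```python
import numpy as np

class OnlineOutlierDetector:
    """Running Gaussian over feature-space distances; flags 2-sigma outliers."""

    def __init__(self, mean=1.0, var=1.0):
        self.mean, self.var, self.n = mean, var, 1   # rough initial guesses

    def is_outlier(self, features):
        # Distance from the origin of the feature space (gradient,
        # distance from LDR, distance from NLDR, standard deviation).
        d = float(np.linalg.norm(features))
        if abs(d - self.mean) > 2.0 * np.sqrt(self.var):
            return True   # rejected: do NOT let faults update the distribution
        # Accepted: fold into the running mean/variance (Welford's update).
        self.n += 1
        delta = d - self.mean
        self.mean += delta / self.n
        self.var += (delta * (d - self.mean) - self.var) / self.n
        return False
```

Because only accepted points update the distribution, a burst of faulty readings cannot drag the threshold toward the faulty regime.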

Points are clustered using an online K-means algorithm, and each cluster is associated with a previously successful remediating action such as Replace Sensor or Take Physical Sample (see the sketch after this paragraph).
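Diagnosis can likewise be sketched as nearest-centroid assignment with an online centroid update (again a toy version; the centroids and the cluster-to-action table are hypothetical):

```python
import numpy as np

class OnlineKMeansDiagnoser:
    """Assign each outlier to the nearest cluster and nudge that centroid toward it."""

    def __init__(self, centroids, actions, lr=0.05):
        self.centroids = np.asarray(centroids, dtype=float)
        self.actions = actions          # one remediation action per cluster
        self.lr = lr                    # online learning rate

    def diagnose(self, features):
        x = np.asarray(features, dtype=float)
        k = int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))
        self.centroids[k] += self.lr * (x - self.centroids[k])  # online k-means step
        return self.actions[k]

# Hypothetical clusters learned from past faults and the actions that fixed them.
diagnoser = OnlineKMeansDiagnoser(
    centroids=[[5.0, 0.1], [0.2, 4.0]],
    actions=["Replace Sensor", "Take Physical Sample"],
)
print(diagnoser.diagnose([4.6, 0.3]))   # nearest centroid -> "Replace Sensor"
```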

Bangladesh

Confidence detects 85% of the faulty data in a real-world data trace captured in Bangladesh, even though over one third of the data are faulty.

San Joaquin River

We ran Confidence in a deployment of 20 sensors on the San Joaquin River. It accurately detected all 4 faults that occurred and correctly diagnosed 3 of the 4, with no false positives or negatives.

Data-driven techniques for identifying faulty sensor readings

1) Rule/Heuristic-based methods (the SHORT and NOISE rules below)

2) Linear least-squares estimation (LLSE) based method

• Exploits correlation in the data measured at different sensors

• LLSE equation:
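The LLSE equation itself was an image and did not survive extraction. The standard linear least-squares estimate, consistent with the description above (our reconstruction; $\mu$ and $\Sigma$ denote sample means and covariances), predicts sensor $i$'s reading from the other sensors' readings $y$:

$$\hat{s}_i = \mu_{s_i} + \Sigma_{s_i y}\,\Sigma_{yy}^{-1}\,(y - \mu_y),$$

and a measurement is flagged as faulty when it deviates from $\hat{s}_i$ by more than a threshold.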

• HMM model (used in method 3 below): number of states; transition probabilities; conditional probability Pr[O | S].

• SHORT rule: compute the rate of change between two successive samples. If it is above a threshold, this is an instance of a SHORT fault.

• NOISE rule: compute the standard deviation of the samples within a time window W. If it is above a threshold, the samples are corrupted by a NOISE fault. (A sketch of both rules follows.)
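Both rules reduce to a few lines; a sketch (the thresholds are placeholders to be tuned per deployment):

```python
import numpy as np

def short_faults(samples, rate_threshold):
    """Indices where the change between successive samples exceeds the threshold."""
    rates = np.abs(np.diff(samples))
    return np.flatnonzero(rates > rate_threshold) + 1

def noise_faults(samples, window, std_threshold):
    """Start indices of windows whose standard deviation exceeds the threshold."""
    flagged = []
    for start in range(0, len(samples) - window + 1):
        if np.std(samples[start:start + window]) > std_threshold:
            flagged.append(start)
    return flagged

readings = np.array([20.1, 20.2, 35.9, 20.3, 20.2, 20.4, 20.1])
print(short_faults(readings, rate_threshold=5.0))  # [2 3]: the spike and the drop back
```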

Results

• Analyzed data sets from real-world deployments to characterize the prevalence of data faults using these 3 methods.

• NAMOS deployment: CONSTANT+NOISE faults; up to 30% of samples affected by data faults.

• Intel Lab, Berkeley deployment: CONSTANT+NOISE faults; up to 20% of samples affected by data faults.

• Great Duck Island deployment: SHORT+NOISE faults; 10-15% of samples affected by data faults.

• SensorScope deployment: SHORT faults; very few samples affected by data faults.

[Figure: example traces of a SHORT fault and a NOISE fault]

3) Learning data models: Hidden Markov Models

[Figure: per-sample HMM detection output (yes/no) on a trace with an injected CONSTANT fault]
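A two-state (normal/faulty) Gaussian-output HMM with Viterbi decoding illustrates the approach; every parameter below is invented for the example:

```python
import numpy as np
from scipy.stats import norm

# Two hidden states: 0 = normal, 1 = faulty (e.g., a stuck CONSTANT reading).
start = np.log([0.95, 0.05])
trans = np.log([[0.98, 0.02],    # transition probabilities Pr[S_t | S_{t-1}]
                [0.10, 0.90]])

def emission_logp(x):
    # Conditional probability Pr[O | S]: the normal state tracks the phenomenon
    # (mean 20, sd 1); the faulty state emits a constant near 0.
    return np.array([norm.logpdf(x, 20.0, 1.0), norm.logpdf(x, 0.0, 0.1)])

def viterbi(obs):
    """Most likely state sequence for a 1-D observation series."""
    v = start + emission_logp(obs[0])
    back = []
    for x in obs[1:]:
        scores = v[:, None] + trans          # scores[i, j]: come from i, go to j
        back.append(scores.argmax(axis=0))
        v = scores.max(axis=0) + emission_logp(x)
    states = [int(v.argmax())]
    for b in reversed(back):
        states.append(int(b[states[-1]]))
    return states[::-1]

trace = [19.8, 20.3, 20.1, 0.0, 0.0, 0.0, 19.9]
print(viterbi(trace))   # expect the constant run to be labeled state 1
```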

Signatures for modeling normal and faulty behavior

• Difficult to initialize sensor signature without learning period that is guaranteed to be fault-free.

– Can use a stricter threshold during learning period to decrease chance of incorporating faults into sensor signature

• The method depends on accurately representing fault models, which is difficult without labeled training data.

• Summarize sensor and fault behaviors using a signature: a multivariate probability density of features (Cahill, Lambert, Pinheiro, and Sun; 2000)

• Features chosen to exploit differences between faulty and normal behavior. Current features summarize temporal and spatial information:

– Temporal: actual reading, change between successive readings, voltage

– Spatial: difference from neighboring sensors.

• Calculate a score for each new reading using a log likelihood ratio (reconstructed below); higher scores are more suspicious.

• The use of sensor signatures allows for sensor-specific fault detection.
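The score formula was likewise lost in extraction; in the Cahill et al. framework it is the log ratio of the fault-signature density to the sensor-signature density evaluated at the new feature vector (notation ours):

$$\mathrm{score}(X_t) = \log\frac{f_F(X_t)}{f_{S_t}(X_t)},$$

so a reading that the fault signature explains better than the sensor's own signature scores high.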

Fault Detection Algorithm (adapted from Detecting Fraud in the Real World; Cahill, Lambert, Pinheiro, and Sun; 2000)

Tested on one week of Cold-Air Drainage data (4/06 to 4/12).

[Figure: detected stuck-at fault. Sensor 2 was malfunctioning at the start of the deployment, so its noisy readings were learned as "normal" sensor behavior.]


Signature update requires online density estimation:

• Sequentially update the density estimate with each new reading
• Unable to store historical data
• Must compactly represent the density
• No single parametric family is flexible enough to represent all distributions of features

We are developing a new method to do this using log-splines.

Detection loop: a new reading arrives; calculate features X_t; calculate a score against the sensor signature S_t and the fault signature F; is the score above the threshold? YES: update the fault signature. NO: update the sensor signature. (A simplified sketch follows.)
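The loop can be sketched with Gaussian stand-ins for the two signatures (the real system maintains nonparametric densities, per the note above; all thresholds and parameters here are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

THRESHOLD = 2.0   # placeholder; tuned in practice to a target false-alarm rate

# Toy signatures over 2-D features (change between readings, neighbor difference).
sensor_sig = multivariate_normal(mean=[0.0, 0.0], cov=0.5 * np.eye(2))
fault_sig = multivariate_normal(mean=[5.0, 5.0], cov=4.0 * np.eye(2))

def process(features):
    """Score one reading and route it to the fault or sensor signature update."""
    score = fault_sig.logpdf(features) - sensor_sig.logpdf(features)
    if score > THRESHOLD:
        return "fault", score   # full system would also update fault signature F
    return "ok", score          # full system would fold it into sensor signature S_t

print(process([0.1, -0.2]))     # strongly negative score -> "ok"
print(process([4.8, 5.3]))      # strongly positive score -> "fault"
```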

[Figure: traces for Sensor 1 and Sensor 2, annotated with unusually noisy readings and low voltage]

Problem Description: Blind Calibration

Blindly calibrate sensor responses using routine measurements collected from the sensor network. Manual calibration is not a scalable practice!

Consider a network with n sensors.

We denote the vector of true signals from the n sensors by x and the vector of measured signals by y. Assume the measured signals y are a linear function of x, and that the true signals x lie in a known r-dimensional subspace of R^n, characterized by P, the orthogonal projection matrix onto the orthogonal complement of that subspace. Then, under certain conditions on P, with no noise and exact knowledge of the subspace, we can perfectly recover the gain factors and partially recover the offset factors.
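Written out explicitly (the poster's equations were images; the per-sensor gain and offset notation below is our assumption, consistent with the prose):

$$x = (x_1, \dots, x_n)^T, \qquad y = (y_1, \dots, y_n)^T, \qquad y_i = \alpha_i x_i + \beta_i,$$

$$P\,x = 0 \;\Longrightarrow\; P\,\operatorname{diag}(\alpha)^{-1}(y - \beta) = 0,$$

which is linear in the inverse gains $1/\alpha_i$ and the scaled offsets, so each routine snapshot y adds constraints until the gains are pinned down (up to scale).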

Robust to noise. Error at 2% noise in the measured signal: gain < 0.01%, offset < 2.4%.

Robust to mismodeling. Error with 10% of the true signal outside the assumed subspace: gain < 1%, offset < 4%.

Evaluation:

In a deployment with all sensors in a styrofoam box, and thus a 1-d signal subspace, the algorithm recovers the gains and offsets almost exactly.

In a deployment with sensors spread across a valley at the James Reserve, using a 4-d signal subspace constructed from the calibrated data, the gain calibration was quite accurate. The offset calibration, as expected, captured some of the non-zero-mean signal; it was also sensitive to the model. (A toy numerical sketch of the gain-recovery step follows.)
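A toy numerical version of the gain-recovery step (offsets set to zero for simplicity; the subspace, gains, and dimensions are all invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, T = 8, 2, 50                        # sensors, subspace dimension, snapshots

# True signal subspace and the projector onto its orthogonal complement.
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
P = np.eye(n) - U @ U.T

alpha = rng.uniform(0.5, 2.0, n)          # unknown per-sensor gains
X = U @ rng.standard_normal((r, T))       # true signals: columns lie in span(U)
Y = alpha[:, None] * X                    # measured signals (no offset, no noise)

# P x = 0 and x = diag(1/alpha) y  =>  P diag(y_t) c = 0 for c = 1/alpha.
# Stack one block per snapshot and take the SVD null-space direction.
A = np.vstack([P @ np.diag(Y[:, t]) for t in range(T)])
_, _, Vt = np.linalg.svd(A)
c = Vt[-1]                                # null vector: proportional to 1/alpha

gains = 1.0 / c
gains *= alpha[0] / gains[0]              # gains are identifiable only up to scale
print(np.max(np.abs(gains - alpha)))      # near machine precision in the noiseless case
```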

Problem Description: Optimal Sensor Placement

Where should we collect measurements to optimally choose a model that represents the field?

Assumptions: two plausible models; Gaussian noise.

Idea: find the locations where the "difference" between the two models is the largest. Technically, the candidate models are

$M_1:\; t_i = \eta_1(x_i, \theta_1) + e_i, \quad i = 1, \dots, n$

$M_2:\; t_i = \eta_2(x_i, \theta_2) + e_i, \quad i = 1, \dots, n$

and a design $\xi$ (weights $p_i$ on locations $x_i$) is sought that maximizes the discrimination criterion

$$\Delta_2(\xi) = \sum_{i=1}^{n} p_i \,\{\eta_1(x_i, \theta_1) - \eta_2(x_i, \hat{\theta}_2)\}^2, \qquad \hat{\theta}_2 = \arg\min_{\theta_2} \sum_{i=1}^{n} p_i \,\{\eta_1(x_i, \theta_1) - \eta_2(x_i, \theta_2)\}^2.$$

Algorithm: T-Designs

A sequential algorithm is used to iteratively collect measurements that maximize the discrimination between the two models [1]:

1. Given a design $\xi_N$, where N is the number of observations, find
   $\hat{\theta}_{1N} = \arg\min_{\theta_1} \sum_{i=1}^{N} \{t_i - \eta_1(x_i, \theta_1)\}^2$
   $\hat{\theta}_{2N} = \arg\min_{\theta_2} \sum_{i=1}^{N} \{t_i - \eta_2(x_i, \theta_2)\}^2$

2. Add to the design a point $x_{N+1}$ such that
   $x_{N+1} = \arg\max_{x \in Z} \{\eta_1(x, \hat{\theta}_{1N}) - \eta_2(x, \hat{\theta}_{2N})\}^2$

3. Take the (N+1)th observation at $x_{N+1}$ and update the design:
   $\xi_{N+1} = (1 - \alpha)\,\xi_N + \alpha\,\xi(x_{N+1})$

4. Go back to step 1.
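A compact simulation of the loop for two nested regression models (the field, noise level, and candidate grid are all made up; numpy's lstsq plays the role of the arg-min fits, and appending each new point corresponds to the design update with a typical weight of 1/(N+1)):

```python
import numpy as np

rng = np.random.default_rng(1)

def true_field(x):
    return 1.0 + 0.5 * x + 0.8 * x ** 2            # unknown ground truth (quadratic)

def measure(x):
    return true_field(x) + rng.normal(0.0, 0.1)    # noisy observation

def design1(x):                                    # M1: linear in x
    return np.column_stack([np.ones_like(x), x])

def design2(x):                                    # M2: adds a quadratic term
    return np.column_stack([np.ones_like(x), x, x ** 2])

Z = np.linspace(0.0, 2.0, 101)                     # candidate measurement locations
xs = [0.0, 1.0, 2.0]                               # initial design
ts = [measure(x) for x in xs]

for _ in range(15):
    x_arr, t_arr = np.array(xs), np.array(ts)
    th1, *_ = np.linalg.lstsq(design1(x_arr), t_arr, rcond=None)   # fit M1
    th2, *_ = np.linalg.lstsq(design2(x_arr), t_arr, rcond=None)   # fit M2
    gap = (design1(Z) @ th1 - design2(Z) @ th2) ** 2   # model disagreement over Z
    x_next = float(Z[np.argmax(gap)])                  # most discriminating location
    xs.append(x_next)
    ts.append(measure(x_next))

# Refit on the full design and compare residual sums of squares.
x_arr, t_arr = np.array(xs), np.array(ts)
th1, *_ = np.linalg.lstsq(design1(x_arr), t_arr, rcond=None)
th2, *_ = np.linalg.lstsq(design2(x_arr), t_arr, rcond=None)
rss1 = float(np.sum((design1(x_arr) @ th1 - t_arr) ** 2))
rss2 = float(np.sum((design2(x_arr) @ th2 - t_arr) ** 2))
print(f"RSS M1 = {rss1:.3f}, RSS M2 = {rss2:.3f}")   # M2 tracks the quadratic field
```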

[Figure: temperature measurements over the (x, y) deployment region]

Evaluation on Real Data:

The field is modeled as $t_i = f(x_i) + e_i,\; i = 1, \dots, n$, with the two candidates

$M_1:\; t_i = \theta_{10} + \theta_{11} x_i + \theta_{12} y_i + e_i, \quad i = 1, \dots, n$

$M_2:\; t_i = \theta_{20} + \theta_{21} x_i + \theta_{22} y_i + \theta_{23} x_i^2 + \theta_{24} y_i^2 + e_i, \quad i = 1, \dots, n$

Likelihoods: $M_1$ = 0.1754, $M_2$ = 3.4368. $M_2$ fits better.

Generalization: in the case of more than two models, apply the same algorithm to the two best-fitting models at each iteration (the worst case).


[1] A. C. Atkinson and V. V. Fedorov. Optimal design: experiments for discriminating between several models. Biometrika 62(2):289-303, 1975.