Coordinated Statistical Modeling and Optimization for Ensuring Data Integrity and Attack- Resiliency in Networked-Embedded Systems Farinaz Koushanfar, ECE Dept. Rice University Statistics Colloquium Oct 9, 2006


Coordinated Statistical Modeling and Optimization for Ensuring Data Integrity and Attack-Resiliency in Networked-Embedded Systems

Farinaz Koushanfar, ECE Dept.

Rice University Statistics Colloquium

Oct 9, 2006


Outline

• Sensor Networks: Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack detection

Sensor Networks: Applications, Challenges


Sensor Networks

• Comprehensive monitoring and analysis of complex physical environments

• Imagine…

– Flood in Houston (http://www.bluishorange.com/flood/photos/10bridgecars.jpg)
– Vibration in Abercrombie (http://dacnet.rice.edu/maps/space/index.cfm?building=abc)
– Texas wine!! (http://www.alamosawinecellars.com/vineyard2.htm)
– Air pollution (http://www.ucsusa.org/clean_energy/coalvswind/c02c.html)


Sensor Networks, How?

• Networks of embedded sensing (actuating) and computing devices

Mica2Dot, CrossBow Tech.

Courtesy of Prof. Estrin, CENS, UCLA


Challenges in Sensor Networks

• System: sensors, actuators, hardware, software, communication network layers
• Limited: battery, bandwidth, cost
• Unique to sensor networks: Sensing
– Abstract the system state, complex properties, and model physical phenomena accurately, without biases
• Parametric models: a priori assumptions
• Often do not capture the complex relationships
– Optimization based on such models has limited effectiveness


Challenges in Sensing

• Massive datasets
– Structure response in USGS building: 72 channels of 24-bit data, 500 samples/sec.
• Energy consumption of the wireless nodes
– Motes take 36 mW in active mode; AA batteries with a storage capacity of 1850 mWh give about 50 h of active mode
• Diversity in applications
– Marine biology, seismic sensing, battlefield, contaminant transport, home sensors, laboratories, hospitals, etc.
• Harsh environmental conditions
– Battlefield, earthquakes, automatic detection, etc.
• Wireless channel data loss
• Sensor cost
• Sensitivity of applications
• Privacy and security
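The two quantitative bullets on this slide can be checked with a short calculation. The sketch below assumes the raw data rate is simply channels × bits × sampling rate and that the lifetime is battery capacity divided by active power; the variable names are illustrative and not from the talk.

```python
# Back-of-the-envelope check of the slide's numbers (names are illustrative).
channels = 72
bits_per_sample = 24
samples_per_sec = 500

# Raw (uncompressed) data rate of the structural-response deployment.
raw_bits_per_sec = channels * bits_per_sample * samples_per_sec
raw_kbytes_per_sec = raw_bits_per_sec / 8 / 1000

# Lifetime of a mote that stays in active mode the whole time.
active_power_mw = 36
battery_capacity_mwh = 1850
active_hours = battery_capacity_mwh / active_power_mw

print(f"raw data rate: {raw_bits_per_sec} bit/s ({raw_kbytes_per_sec:.0f} kB/s)")
print(f"active-mode lifetime: {active_hours:.1f} h")
```

The lifetime comes out slightly above the rounded 50 h quoted on the slide, which is why duty cycling (sleeping) shows up later as a source of missing data.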


Inconsistencies in the Measured Sensor Data

• Erroneous measurements
– Noisy readings: inevitable due to power and cost constraints and environmental impact
– Systematic errors: offset bias, calibration effect, etc.
– Partially corrupted, still useful
• Faulty (corrupted) measurements
– Remove faults to get a consistent picture
– Can be accidental (e.g. bad link), or malicious
• Missing data
– May be accidental, intentional (sleeping, subsampling, compression, filtering), or malicious


Outline

• Sensor Networks: Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack detection

Coordinated Modeling-Optimization Framework


Motivational Example

• Deployments show a gap b/w models and the reality
• Example: preliminary analysis of temperature sensor traces at UCLA BG
• 23 sensor nodes, sampling every 5 mins
• Question: does the locality assumption hold?

[Figure: layout of the numbered sensor nodes]


Motivational Example (Cont’d)

• No consistent relation b/w sensing and distance

• Discontinuities, exposure differences, global sources

• Also, some highly correlated close-by sensors

• Best previous effort: local basis functions

• Need new models for simultaneous abstraction of sensing and distance

• What about other properties?

[Figure: layout of the numbered sensor nodes]


Motivational Example (Cont’d)

• Separation of concerns
• Embedded sensing models:
– Define multiple graphs G1, G2, …, GM, that share vertices
• E.g., sensing, distance
– dij: distance b/w si, sj
– eij: sensing prediction error, for the model sj = fij(si)
• The distance and sensing are not jammed into one model, but are being simultaneously considered

[Figure: Sensing graph: adjacency matrix, rows and columns indexed by nodes 1 2 3 … 22 23]

[Figure: Distance graph: adjacency matrix, rows and columns indexed by nodes 1 2 3 … 22 23]
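A minimal sketch of the two graphs that share the same vertices: a distance matrix d and a sensing prediction-error matrix e. It assumes, purely for illustration, that each pairwise model fij is a least-squares line and that eij is the mean absolute prediction error; the talk does not specify either choice, and the positions and traces below are made-up toy data.

```python
import math

def distance_graph(positions):
    """d[i][j]: Euclidean distance between sensors i and j."""
    n = len(positions)
    return [[math.dist(positions[i], positions[j]) for j in range(n)]
            for i in range(n)]

def fit_line(x, y):
    """Least-squares line y ~ a*x + b (an assumed stand-in for f_ij)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx if sxx else 0.0
    return a, my - a * mx

def sensing_graph(traces):
    """e[i][j]: mean absolute error when predicting s_j from s_i."""
    n = len(traces)
    e = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                a, b = fit_line(traces[i], traces[j])
                e[i][j] = sum(abs(a * xi + b - yj)
                              for xi, yj in zip(traces[i], traces[j])) / len(traces[i])
    return e

# Toy data: sensors 0 and 1 track each other, sensor 2 does not.
positions = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
traces = [[20.0, 21.0, 22.0, 23.0],
          [20.5, 21.5, 22.5, 23.5],
          [25.0, 19.0, 24.0, 18.0]]
d = distance_graph(positions)
e = sensing_graph(traces)
```

Keeping d and e as separate matrices over the same node indices is what lets a later optimization step query both at once, for example "neighbors in d that are poorly predicted in e".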


Motivational Example-2

• Cross-domain optimization: Sensor deployment
– Objective: select up to S candidate points for adding an extra sensor
– For each si, a TL sensor is Delaunay neighbor but cannot be predicted within th error bound
– Denote the edges of TL sensors as candidates
– Find intelligent ways to select the best set of candidate points
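One simple way to read the candidate-selection step is sketched below. It is only an illustration under stated assumptions: the neighbor edges are taken as given (assumed to come from a Delaunay triangulation computed elsewhere), a candidate point is placed at the midpoint of each poorly predicted edge, and the "intelligent" selection is replaced by a plain greedy ranking on prediction error. The function name and all data are hypothetical.

```python
def candidate_points(positions, neighbor_edges, pred_error, threshold, S):
    """Return up to S midpoints of neighbor edges whose prediction error
    exceeds the threshold, worst-predicted edges first (greedy assumption)."""
    flagged = [(pred_error[i][j], i, j) for i, j in neighbor_edges
               if pred_error[i][j] > threshold]
    flagged.sort(reverse=True)
    picks = []
    for _err, i, j in flagged[:S]:
        (xi, yi), (xj, yj) = positions[i], positions[j]
        picks.append(((xi + xj) / 2, (yi + yj) / 2))
    return picks

positions = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
# Assumed to be the edges of a Delaunay triangulation of the four nodes.
neighbor_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
pred_error = [[0.0, 0.2, 1.5, 0.0],
              [0.2, 0.0, 0.9, 0.1],
              [1.5, 0.9, 0.0, 2.0],
              [0.0, 0.1, 2.0, 0.0]]
picks = candidate_points(positions, neighbor_edges, pred_error,
                         threshold=0.5, S=2)
```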


Motivational Example (Cont’d)

• Coordinated modeling-optimization
– Q1: How to do cross-domain optimization?
– Q2: Can the models be of higher dimensions?
– Q3: Can they help us to address the data-integrity problem?
– Q4: How effective are they?


Outline

• Sensor Networks, Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack detection

Inter-sensor models


Inter-sensor Models

• Intra-sensor models (autoregressive models)
• Have shown the effectiveness of adding shape constraints to univariate models
– Isotonicity
– Unimodality
– Number of level sets
– Convexity
– Bijection
– Transitivity
• Combinatorial isotonic regression (CIR) finds the optimal nonparametric shape-constrained univariate fit for an arbitrary error norm in average linear time
• Models are a precursor for subsequent optimization


Application of CIR on Temperature Sensors at Intel Berkeley*

• Prediction error over all node pairs

• Limiting the number of level sets

* Koushanfar, Taft (Intel), Potkonjak, Infocom’06


Multivariate CIR

• Recent result*:
– The first optimal, polynomial-time DP-based approach for multi-dimensional CIR:

(1) Build the relative importance matrix R

(2) Build the error matrix E

(3) Build a cumulative error matrix C by using a nested DP

(4) Starting at the minimum value in the last column of C, trace back a path to the first column that minimizes the cumulative error

* Thanks to Prof. D. Brillinger (UCB) and Prof. M. Potkonjak (UCLA) for the useful discussions
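The error-matrix, cumulative-matrix and trace-back steps can be illustrated on the simpler univariate case. The sketch below is not the multivariate CIR algorithm from the talk: it omits the relative importance matrix R and the nested DP, and only shows a nondecreasing (isotonic) fit over a finite alphabet for an arbitrary error norm. The function name, the default L1 loss and the toy data are assumptions.

```python
def isotonic_dp(xs, ys, alphabet, loss=lambda p, y: abs(p - y)):
    """Nondecreasing fit over a finite alphabet via dynamic programming.

    Returns (total_error, fitted level for each distinct x in sorted order).
    """
    levels = sorted(alphabet)
    x_vals = sorted(set(xs))
    # Error matrix E[a][c]: cost of predicting levels[a] at x_vals[c].
    E = [[sum(loss(lv, y) for x, y in zip(xs, ys) if x == xv) for xv in x_vals]
         for lv in levels]
    # Cumulative matrix C[a][c]: best cost of columns 0..c ending at level a.
    C = [[0.0] * len(x_vals) for _ in levels]
    back = [[0] * len(x_vals) for _ in levels]
    for c in range(len(x_vals)):
        best, best_a = float("inf"), 0
        for a in range(len(levels)):
            prev = C[a][c - 1] if c > 0 else 0.0
            if prev < best:  # running minimum over levels <= a
                best, best_a = prev, a
            C[a][c] = E[a][c] + best
            back[a][c] = best_a
    # Start at the minimum of the last column and trace back to the first.
    a = min(range(len(levels)), key=lambda k: C[k][-1])
    total = C[a][-1]
    fit = []
    for c in range(len(x_vals) - 1, -1, -1):
        fit.append(levels[a])
        a = back[a][c]
    return total, fit[::-1]

total, fit = isotonic_dp([1, 2, 3, 4], [1, 3, 2, 4], alphabet=[1, 2, 3, 4])
```

Because the running minimum is carried along each column, this univariate version costs on the order of (number of levels) × (number of distinct x values) after building E; swapping the `loss` argument changes the error norm without touching the DP.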


Example

[Figure: worked example in four panels, with axes x0 … x3, y0 … y3 and z0 … z3:
(2) Input: 4*4*4 error matrix E, A=4
(3) The steps of nested DP on E
(3) 3D view of DP on E
(4) Final bivariate regression]


Multivariate CIR - complexity

• T sensor values drawn from a finite alphabet A
• Complexity of univariate case is dominated by sorting, $O(T \log T)$
• $C_m(M)$: complexity of multivariate with M explanatory variables
• $C_m(M) = A^{M+1}\, C_m(M-1)$, pseudo-polynomial complexity


Open Questions

• How to speed up the Multivariate CIR?
– Pruning algorithms that exploit sparsity (?)
• Is it possible to make CIR locally adaptive?
– In principle, finding the min error is a global optimization that cannot be locally addressed
• Can one guarantee convergence and correctness of CIR among sensors?
• Is it possible to have continuous approximations to address the problem?
• How can one build efficient models in presence of missing and/or faulty data?


Outline

• Sensor Networks, Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack detection
– Evaluation and comparison to competing methods

Embedded sensing models


State-of-the-Art Sensing Models

• Parametric models
– Gaussian random fields, graphical models (GM), message passing, iterative message passing, belief propagation (BP)
• Nonparametric models
– Marginalized kernels (GM), alternating projections, distributed EM, nonparametric BP
• Common thread: capture dependence among sensor data, no edge means no dependence
• Need to capture the shape of field discontinuities and/or lack of correlations b/w adjacent nodes


Embedded Sensing Models

• Principle of separation of concerns (SoC)
• Example:

[Figure: Geometric graph (planar, 2D) with Delaunay edges (adjacency) on nodes 1 to 8, next to the sensing graph: higher dimensional embedded graph on the same nodes]

Idea: Map the sensing graph into lower dimensions. Exploit the discrepancy between the higher dimensional topology and the lower dimensional space to identify the obstacles.


Open Questions

• Efficient computation and handling of embedded sensing models in higher dimensions

• Joint compression of multiple entities
• How can we capture dynamic topologies, i.e. mobility, dynamic time series, sleeping?
• Efficient structures/data formats for representing the multi-dimensional topologies


Coordinated Modeling and Optimization

• Paramount importance of interface in system and software development

• Create statistical models suitable for optimization
– Paradigms: continuous, smooth, consistent
– Small number of level sets
– Convexity
– Bijection: x'i = G(F(xi)) = xi, where yi = F(xi) and xi = G(yi)
– Transitivity: zi = F(xi) = G(yi)
• Create optimization mechanisms resilient to statistical variability
– Paradigms: randomization
– Multiple validations
– Constructive probabilistic
– Reweighting of OF and constraints


Outline

• Sensor Networks, Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack detection
– Evaluation and comparison to competing methods

Optimization for data integrity


Data Integrity: Multiple Validations

• The data-integrity problems are complex due to the complex environments and uncertainties
– Proof of NP-completeness (PhD’05)
• Data integrity (noise reduction, calibration, fault detection, data recovery) exploits system redundancies
• Coordinated modeling-optimization
• Multiple validations (MV) optimization algorithms
– The solutions are validated using multiple input samples
– Similar in spirit to cross-validation (CV) in statistics
– MV is more comprehensive than CV, since it is a generic optimization paradigm based on resampling the input space and validating the output of a complex algorithm rather than a model


Example: Missing Data

• Between 40%-50% missing data at Intel Berkeley testbed

• Limited A2D: discrete level sets

[Figure: Temperature (C), missing/present vs. time. Axes: Time, Temperature (C). Legend: Missing data, Present data]

[Figure: Humidity (%), missing/present vs. time. Axes: Time, Humidity (%). Legend: Missing data, Present data]


Missing Data Recovery (MSD)

Problem formulation:

• Given:

– N sensors s1, …,sN,

– Sensor’s data at time t: (d1(t), d2(t),…,dN(t))

– Some sensor data missing in an arbitrary way, i.e. there is i, such that di(t)=NA

• Objective: recover the missing data in such a way that the consistency between the readings of different sensors is maximized (prediction error is minimized)

[Figure: example network of sensor nodes 1 to 9]


State-of-the-Art in MSD

• MSD is a prevalent problem in many fields
• Expectation maximization (EM), Dempster et al.’77
– Assuming multivariate density
– Local optimization, likely to be trapped in a local max of the likelihood function
• Multiple imputations (MI), Rubin 1987
– Missing data replaced by multiple simulated versions
– May distort variable association due to treating the completed dataset as the actual one
• Both MI/EM can be computationally intensive
• MV often combines lower dimensional models


MV for Missing Data Recovery

• Iteratively select a sub-sample of available nodes (the present set) and optimize for it
• Remaining nodes (holdout set) used for validating the solution, quantify its uncertainty
– 1) Randomly assign $\kappa: \{1,\dots,|V|\} \to \{1,\dots,K\}$;
– 2) for ($k=1$ to $k=K$)
• a: calculate $OF_{-k}(O)$;
• b: compute $MVC_{-k}(O)$;
– 3) $MVC(O) = G(MVC_{-k}(O))$, $k=1,\dots,K$;
– 4) $O_{best} = \arg\min_O MVC(O)$;
• Advantage: not only a solution, but an uncertainty bound for the solutions
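The loop above can be sketched generically. This is a minimal illustration, not the talk's implementation: it assumes each candidate O is fitted on the present set and scored on the holdout set, that the aggregation G is the median, and that the spread of the per-fold scores serves as the uncertainty indication. The callables `fit` and `validate`, the aggregation rules and the toy predictions are all hypothetical.

```python
import random
import statistics

def multiple_validations(candidates, nodes, K, fit, validate,
                         G=statistics.median, seed=0):
    """Score every candidate on K random present/holdout splits of the nodes."""
    rng = random.Random(seed)
    shuffled = rng.sample(nodes, len(nodes))
    fold_of = {v: idx % K for idx, v in enumerate(shuffled)}  # 1) random folds
    scores = {}
    for name in candidates:
        per_fold = []
        for k in range(K):                                    # 2) loop over folds
            present = [v for v in nodes if fold_of[v] != k]
            holdout = [v for v in nodes if fold_of[v] == k]
            solution = fit(name, present)                     # 2a) optimize on present set
            per_fold.append(validate(solution, holdout))      # 2b) validate on holdout
        scores[name] = (G(per_fold), per_fold)                # 3) aggregate with G
    best = min(scores, key=lambda name: scores[name][0])      # 4) argmin over candidates
    return best, scores

# Toy use: each available node offers a prediction of one missing reading;
# s6 is an outlier. Candidates are two ways of combining the predictions.
predictions = {"s1": 19.0, "s2": 19.1, "s3": 18.9,
               "s4": 19.2, "s5": 19.0, "s6": 25.0}
rules = {"mean": statistics.mean, "median": statistics.median}

def fit(rule, present):
    return rules[rule]([predictions[v] for v in present])

def validate(estimate, holdout):
    return statistics.mean(abs(estimate - predictions[v]) for v in holdout)

best, scores = multiple_validations(list(rules), list(predictions), K=3,
                                    fit=fit, validate=validate)
```

The per-fold list stored next to each aggregated score is what gives an uncertainty indication in addition to the chosen solution.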


Open Questions

• Theoretical proof of correctness of MV: which of the properties of CV hold for MV?
• Which MV criteria (MVC) are robust to outliers, e.g., order statistics?
• Which objective function (OF) to use?
– Ensemble-voting of weak classifiers by boosting (exponential loss function)
• Real-time implementation on sensor network testbeds
• Scaling properties of the MV algorithm


Outline

• Sensor Networks, Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack detection
– Evaluation and comparison to competing methods

Attack Resilient Location Discovery


Location Discovery*

• A number of nodes have location data (beacons)

• Other nodes estimate their distance to beacons to find their locations

• Many distance estimation methods (e.g., AoA, ToA)

• If more than three beacons, node can estimate location

• We focus on the atomic case (one unknown)

[Figure: example network with nodes s0 to s10]

* Joint work with N. Kiyavash, UIUC


Robust Location Discovery: Problem Formulation

• Instance:
– A node s0 with unknown coordinates (x0,y0),
– Set L of location tuples {(xn,yn,dn)} (n beacons),
– Consistency metric (sn,s0), consistency threshold t
• Problem:
– Find an estimate for (x0,y0) s.t. it is at least (sn,s0)-consistent with t points in set L


Attack Model

• The attackers can modify the distance measurement of any beacon without any limits

• The network is cryptographically protected against protocol attacks, e.g., wormhole, sybil
• The measurements from each beacon are only considered once
• Both independent and coalition (colluding) attacks
• In coalition attacks, the attacking beacons coordinate their efforts
• There is a minimum number of correct beacons, otherwise colluding beacons will mislead the target


Robust Random Sample Consensus

1. Initialize i;

2. While (i<imax)
a. Randomly draw a subset Si of size 3 from L;
b. Use Si to estimate ŝ0;
c. Calculate K, the number of consistent points w.r.t. ŝ0 in L\Si;
d. If (K>t)
i. {form a new ŝ0 from the K points; Terminate;}
e. Increment i;

3. Terminate and output the largest consistent estimate;
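The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: a beacon counts as consistent when the distance from the estimate to the beacon differs from its reported distance by at most `tol` (the actual consistency metric is not spelled out on the slides), and positions are estimated with a linearized least-squares solver chosen here for brevity. The function names, the beacon layout and the attacked distances are all made up.

```python
import math
import random

def solve_position(beacons):
    """Least-squares position from >= 3 (x, y, d) tuples.

    Subtracting the first circle equation from the others gives linear rows
    2(x - x0) X + 2(y - y0) Y = d0^2 - d^2 + x^2 - x0^2 + y^2 - y0^2.
    """
    x0, y0, d0 = beacons[0]
    rows = [(2 * (x - x0), 2 * (y - y0),
             d0**2 - d**2 + x**2 - x0**2 + y**2 - y0**2)
            for x, y, d in beacons[1:]]
    a11 = sum(a * a for a, _, _ in rows)
    a12 = sum(a * b for a, b, _ in rows)
    a22 = sum(b * b for _, b, _ in rows)
    b1 = sum(a * c for a, _, c in rows)
    b2 = sum(b * c for _, b, c in rows)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        return None  # (nearly) collinear beacons: no unique solution
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det)

def robust_locate(L, t, tol, i_max, seed=0):
    """Random-sample-consensus location estimate; returns (estimate, inliers)."""
    rng = random.Random(seed)
    best_est, best_inliers = None, []
    for _ in range(i_max):
        idx = rng.sample(range(len(L)), 3)            # a. draw a subset of size 3
        est = solve_position([L[k] for k in idx])     # b. estimate from the subset
        if est is None:
            continue
        rest = [b for k, b in enumerate(L) if k not in idx]
        inliers = [b for b in rest                    # c. consistent points in L \ S_i
                   if abs(math.hypot(est[0] - b[0], est[1] - b[1]) - b[2]) <= tol]
        if len(inliers) > len(best_inliers):
            best_est, best_inliers = est, inliers
        if len(inliers) > t and len(inliers) >= 3:    # d. enough support: refine, stop
            return (solve_position(inliers) or est), inliers
    return best_est, best_inliers                     # 3. largest consistent estimate

true_pos = (2.0, 3.0)
anchors = [(0, 0), (9, 1), (1, 8), (10, 10), (5, -2),
           (-3, 5), (4, 11), (12, 4), (7, 6), (-2, -3)]
L = [(x, y, math.hypot(true_pos[0] - x, true_pos[1] - y)) for x, y in anchors]
L[1] = (9, 1, 1.0)     # two attacked beacons report false distances
L[6] = (4, 11, 20.0)
estimate, inliers = robust_locate(L, t=4, tol=0.1, i_max=200)
```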


Selecting the parameters: imax,t

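• q - prob. of correctness of a randomly drawn point
• Expected number of trials, $E[i] = 1/q^3$
• $\epsilon$ - threshold for the prob. of missing a good subset: $(1-q^3)^{i_{max}} = \epsilon$, or $i_{max} = \ln(\epsilon) / \ln(1-q^3)$
• I – set of inliers; $\rho$ - percentage of inliers, $\rho = 1 - N_a/N$

$q = \binom{|I|}{3} \Big/ \binom{N}{3} = \prod_{j=0}^{2} \frac{|I|-j}{N-j}$

• For large datasets $q = \rho^3$, $E[i] = \rho^{-9}$
• The number of iterations is

$i_{max} = \frac{\ln \epsilon}{\ln(1-\rho^{9})}$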
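The iteration bound can be evaluated numerically. The sketch below writes `eps` for the threshold on the probability of missing a good subset and `rho` for the inlier fraction 1 - Na/N (both names are assumptions), and uses the large-dataset expressions E[i] = rho^-9 and imax = ln(eps) / ln(1 - rho^9) as they appear on the slide. Rounding up to a whole iteration is an added convention.

```python
import math

def expected_trials(rho):
    """E[i] = rho**-9, the large-dataset expression on the slide."""
    return rho ** -9

def max_iterations(eps, rho):
    """i_max = ln(eps) / ln(1 - rho**9), rounded up; requires 0 < rho < 1."""
    return math.ceil(math.log(eps) / math.log(1.0 - rho ** 9))

# How the bound grows with the attacked fraction Na/N, for eps = 0.01.
for attacked_fraction in (0.1, 0.3, 0.5):
    rho = 1.0 - attacked_fraction
    print(attacked_fraction, round(expected_trials(rho), 1),
          max_iterations(0.01, rho))
```

The bound grows quickly as the inlier fraction drops, which is the practical reason for capping the loop at imax.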


Evaluation – Random Sample consensus

Na/N    |ŝ0-s0|    FN %
10%     0.06       0
20%     0.07       1
30%     0.07       2.5
40%     0.11       3.5
50%     0.13       3.7


Comparison to Other Algorithms

[Figure: two comparison panels: Independent attackers; Colluding attackers]


Summary

• Sensor networks: importance of sensing, data integrity
– Missing data, faults, noise, systematic errors
• Coordinated modeling and optimization framework
– Nonparametric models, shape constraints
– Multivariate CIR, optimal algorithm, slow in multiple dimensions
– Embedded sensing models, separation of concerns
– Projection into lower dimensions
– Optimization algorithm: multiple validations (MV)
• Attack-resilient location discovery
– 25% more effective in presence of coalition attackers, 35+% more effective on independent attackers