Distributed Clustering for Robust Aggregation in Large Networks
Ittay Eyal, Idit Keidar, Raphi Rom
Technion, Israel
Aggregation in Sensor Networks – Applications
Temperature sensors thrown in the woods
Seismic sensors
Grid computing load
Aggregation in Sensor Networks – Applications
• Large networks, light nodes, low bandwidth
• Fault-prone sensors, network
• Multi-dimensional (location × temperature)
• Target is a function of all sensed data: average temperature, max location, majority…
What has been done?
Tree Aggregation
Hierarchical solution
Fast: O(height of tree)
Limited to static topology; no failure robustness
[Figure: aggregation tree over sensor values 1, 3, 9, 11]
Gossip
• D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In FOCS, 2003.
• S. Nath, P. B. Gibbons, S. Seshan, and Z. R. Anderson. Synopsis diffusion for robust aggregation in sensor networks. In SenSys, 2004.
Gossip:
Each node maintains a synopsis
Occasionally, each node contacts a neighbor and they improve their synopses
Indifferent to topology changes; crash robust
No data error robustness
Proven convergence
[Figure: nodes holding 1, 3, 9, 11 pair up, average their synopses (5, 5, 7, 7), and converge to 6]
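The pairwise exchange above can be sketched in a few lines (an illustrative simulation, not the deck's exact protocol; the name `gossip_average` is made up): each round, two random nodes replace both their synopses with their average.

```python
import random

def gossip_average(values, rounds=200):
    """Toy simulation of pairwise gossip averaging: each round a random
    pair of nodes averages their synopses. Every exchange preserves the
    total sum, and hence the global mean."""
    vals = [float(v) for v in values]
    for _ in range(rounds):
        i, j = random.sample(range(len(vals)), 2)
        # both nodes adopt the improved synopsis
        vals[i] = vals[j] = (vals[i] + vals[j]) / 2.0
    return vals

# four nodes sensing 1, 3, 9, 11: synopses drift toward the average, 6
print(gossip_average([1, 3, 9, 11]))
```

With enough rounds every synopsis approaches 6; the invariant that makes this "mass conserving" is that each exchange leaves the sum of all synopses unchanged.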
A closer look at the problem
The Implications of Irregular Data
A single erroneous sample can radically offset the data
Samples: 25°, 26°, 25°, 28°, 98°, 120°, 27°
The average (47°) doesn't tell the whole story
Sources of Irregular Data
• Sensor malfunction: short circuit in a seismic sensor
• Sensing error: an animal sitting on a temperature sensor
• Software bugs: in grid computing, a machine reports negative CPU usage
• Interesting info, DDoS: irregular load on some machines in a grid
• Interesting info, fire outbreak: extremely high temperature in a certain area of the woods
• Interesting info, intrusion: a truck driving by a seismic detector
It Would Help to Know the Data Distribution
Samples: 25°, 26°, 25°, 28°, 98°, 120°, 27°
The average is 47°
Bimodal distribution with peaks at 26.3° and 109°
Existing Distribution Estimation Solutions
Estimate a range of distributions [1,2] or data clustering according to values [3,4]
• Fast aggregation [1,2]
• Tolerate crash failures, dynamic networks [1,2]
• But: high bandwidth [3,4], multi-epoch [2,3,4], or one-dimensional data only [1,2]
• No data error robustness [1,2]
1. M. Haridasan and R. van Renesse. Gossip-based distribution estimation in peer-to-peer networks. In International Workshop on Peer-to-Peer Systems (IPTPS 08), February 2008.
2. J. Sacha, J. Napper, C. Stratan, and G. Pierre. Reliable distribution estimation in decentralised environments. Submitted for publication, 2009.
3. W. Kowalczyk and N. A. Vlassis. Newscast EM. In Neural Information Processing Systems, 2004.
4. N. A. Vlassis, Y. Sfakianakis, and W. Kowalczyk. Gossip-based greedy Gaussian mixture learning. In Panhellenic Conference on Informatics, 2005.
Our Solution
Outliers: samples deviating from the distribution of the bulk of the data
Estimate a range of distributions by data clustering according to values
• Fast aggregation
• Tolerate crash failures, dynamic networks
• Low bandwidth, single epoch
• Multi-dimensional data
• Data error robustness by outlier detection
Outlier Detection Challenge
Samples: 27°, 25°, 26°, 25°, 28°, 98°, 120°, 27°
A double bind:
• Regular data distribution: ~26°
• Outliers: {98°, 120°}
No one in the system has enough information
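For intuition, once a node has accumulated the bulk cluster, a simple variance test separates the outliers. A minimal sketch, assuming a 3-sigma threshold; the mean 26.2 and variance 1.36 are computed here from the five regular samples, not given in the deck:

```python
import math

def is_outlier(x, mean, var, threshold=3.0):
    """A sample is an outlier w.r.t. a Gaussian cluster if it lies more
    than `threshold` standard deviations from the cluster mean."""
    return abs(x - mean) > threshold * math.sqrt(var)

# bulk cluster fitted to the regular samples 25, 26, 25, 28, 27
mean, var = 26.2, 1.36
print([x for x in [25, 26, 25, 28, 98, 120, 27] if is_outlier(x, mean, var)])
# → [98, 120]
```

The catch, as the slide says, is that no single node initially has enough samples to fit that bulk cluster; it only emerges through aggregation.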
Aggregating Data Into Clusters
• Each cluster has its own mean and mass
• A bounded number (k) of clusters is maintained
Here k = 2
[Figure: samples a, b, c, d at values 1, 3, 5, 10 are merged step by step (first a and b, then a, b and c) until only k = 2 clusters remain: mass 3 at mean 3, and mass 1 at 10]
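A minimal sketch of this bounded-k clustering for 1-d samples, assuming clusters are (mass, mean) pairs and that the pair with the closest means is merged first (the greedy rule is an assumption here; the deck only fixes the bound k):

```python
def merge(c1, c2):
    """Merge two (mass, mean) clusters; the merged mean is mass-weighted."""
    m1, mu1 = c1
    m2, mu2 = c2
    m = m1 + m2
    return (m, (m1 * mu1 + m2 * mu2) / m)

def cluster(samples, k):
    """Greedy 1-d clustering: keep at most k (mass, mean) clusters,
    always merging the two with the closest means."""
    clusters = [(1.0, float(x)) for x in samples]
    while len(clusters) > k:
        clusters.sort(key=lambda c: c[1])
        # adjacent pair with the smallest gap between means
        i = min(range(len(clusters) - 1),
                key=lambda i: clusters[i + 1][1] - clusters[i][1])
        clusters[i:i + 2] = [merge(clusters[i], clusters[i + 1])]
    return clusters

# samples 1, 3, 5, 10 with k = 2
print(cluster([1, 3, 5, 10], 2))  # → [(3.0, 3.0), (1.0, 10.0)]
```

On the slide's example this reproduces the two final clusters: a, b and c collapse to mass 3 at mean 3, while d stays alone at 10.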
But What Does the Mean Mean?
[Figure: a new sample between two clusters, Gaussian A and Gaussian B, with means A and B]
The variance must be taken into account
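One standard way to take the variance into account (an assumption here; the deck does not name the metric) is a variance-normalized distance, i.e. the 1-d Mahalanobis distance:

```python
import math

def mahalanobis_1d(x, mean, var):
    """Variance-normalized distance of sample x from a 1-d Gaussian."""
    return abs(x - mean) / math.sqrt(var)

# Gaussian A: mean 0, small variance; Gaussian B: mean 10, large variance.
# A sample at x = 4 is closer to mean A in absolute terms ...
x = 4.0
a = mahalanobis_1d(x, 0.0, 1.0)    # 4 standard deviations from A
b = mahalanobis_1d(x, 10.0, 25.0)  # 1.2 standard deviations from B
# ... yet B is the better fit once variance is taken into account
print(a, b)
```

This is why a cluster summary of mean alone is not enough: assignment decisions flip once spread is considered.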
Gossip Aggregation of Gaussian Clusters
Distribution is described as k clusters
Each cluster is described by:
• Mass
• Mean
• Covariance matrix (variance for 1-d data)
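Merging two such clusters in 1-d combines masses, takes the mass-weighted mean, and folds the spread between the two means into the merged variance (the standard moment-combination identity; the function name is illustrative):

```python
def merge_gaussian(c1, c2):
    """Merge two 1-d Gaussian clusters given as (mass, mean, variance).
    The merged variance adds each cluster's own variance to the squared
    offset of its mean from the merged mean, mass-weighted."""
    m1, mu1, v1 = c1
    m2, mu2, v2 = c2
    m = m1 + m2
    mu = (m1 * mu1 + m2 * mu2) / m
    v = (m1 * (v1 + (mu1 - mu) ** 2) + m2 * (v2 + (mu2 - mu) ** 2)) / m
    return (m, mu, v)

# merging two unit-mass point samples at 1 and 3
print(merge_gaussian((1.0, 1.0, 0.0), (1.0, 3.0, 0.0)))  # → (2.0, 2.0, 1.0)
```

For multi-dimensional data the same identity applies with the covariance matrix in place of the variance and outer products in place of squares.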
Gossip Aggregation of Gaussian Clusters
[Figure: two nodes, a and b, exchange clusters and merge them]
Keep half, send half
Our solution:
• Aggregate a mixture of Gaussian clusters
• Merge when necessary (exceeding k)
• Recognize outliers
By the time we need to merge, we can estimate the distribution
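The "keep half, send half" exchange can be sketched as follows. The greedy closest-means merge rule and all names here are illustrative assumptions; the deck specifies only that each node ships half its mass and that mixtures are merged back down to k clusters.

```python
def merge_two(c1, c2):
    """Mass-weighted merge of two (mass, mean, variance) clusters."""
    m1, mu1, v1 = c1
    m2, mu2, v2 = c2
    m = m1 + m2
    mu = (m1 * mu1 + m2 * mu2) / m
    v = (m1 * (v1 + (mu1 - mu) ** 2) + m2 * (v2 + (mu2 - mu) ** 2)) / m
    return (m, mu, v)

def merge_down(clusters, k):
    """Repeatedly merge the two clusters with the closest means until at
    most k remain (the list stays sorted by mean)."""
    clusters = sorted(clusters, key=lambda c: c[1])
    while len(clusters) > k:
        i = min(range(len(clusters) - 1),
                key=lambda i: clusters[i + 1][1] - clusters[i][1])
        clusters[i:i + 2] = [merge_two(clusters[i], clusters[i + 1])]
    return clusters

def gossip_step(sender, receiver, k):
    """One 'keep half, send half' exchange: the sender halves every
    cluster's mass and ships the halves; the receiver folds them into
    its own mixture and merges back down to at most k clusters."""
    kept = [(m / 2, mu, v) for (m, mu, v) in sender]
    sent = [(m / 2, mu, v) for (m, mu, v) in sender]
    return kept, merge_down(receiver + sent, k)

# total mass is conserved across an exchange
new_a, new_b = gossip_step([(2.0, 5.0, 1.0)], [(2.0, 7.0, 1.0)], k=1)
print(new_a, new_b)
```

Because every step only splits and merges mass, the global mixture's total mass and weighted mean are invariants of the protocol sketch.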
Simulation Results:
1. Data error robustness
2. Crash robustness
3. Elaborate multidimensional data

It Works Where It Matters
[Figure: error with and without outlier detection; chart regions labeled "Not Interesting" and "Easy" flank the range where detection matters]
Protocol is Crash Robust
[Figure: average error vs. round for three runs: "no outlier detection, 5% crash probability", "no outlier detection, no crashes", and "outlier detection"]
Describe Elaborate Data
[Figure: temperature vs. distance, with clusters labeled "Fire" and "No Fire"]
Theoretical Results (In Progress)
The algorithm converges…
• Eventually all nodes have the same clusters forever
• Note: this holds even without atomic actions
• The invariant is preserved by both send and receive
… to the "right" output (where it matters)
• If outliers are "far enough" from other samples, then they are never mixed into non-outlier clusters
• They are discovered
• They do not bias the good samples' aggregate
Summary
Robust aggregation requires outlier detection
Samples: 27°, 27°, 27°, 27°, 27°, 98°, 120°
Summary – Our Protocol
We present outlier detection by Gaussian clustering:
• Merge
• Elaborate data
• Crash robustness
• Outlier detection (where it matters)