Distributed Clustering for Robust Aggregation in Large Networks
Ittay Eyal, Idit Keidar, Raphi Rom
Technion, Israel
Aggregation in Sensor Networks – Applications
Temperature sensors thrown in the woods
Seismic sensors
Grid computing load
Aggregation in Sensor Networks – Applications
• Large networks, light nodes, low bandwidth
• Fault-prone sensors, network
• Multi-dimensional (location × temperature)
• Target is a function of all sensed data: average temperature, max location, majority…
What has been done?
Tree Aggregation
Hierarchical solution
Fast: O(height of tree)
Limited to static topology; no failure robustness
[Figure: aggregation tree over sensor values 1, 3, 9, 11]
Gossip
• D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In FOCS, 2003.
• S. Nath, P. B. Gibbons, S. Seshan, and Z. R. Anderson. Synopsis diffusion for robust aggregation in sensor networks. In SenSys, 2004.
Gossip:
Each node maintains a synopsis
Occasionally, each node contacts a neighbor and they improve their synopses
Indifferent to topology changes; crash robust
No data error robustness
Proven convergence
[Figure: nodes holding 1, 3, 9, 11 pair up, average their synopses (5, 5, 7, 7), and converge to 6]
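The pairwise exchange above can be sketched in a few lines (an illustrative simulation, not the deck's exact protocol; the name `gossip_average` is made up): each round, two random nodes replace both their synopses with their average.

```python
import random

def gossip_average(values, rounds=200):
    """Toy simulation of pairwise gossip averaging: each round a random
    pair of nodes averages their synopses. Every exchange preserves the
    total sum, and hence the global mean."""
    vals = [float(v) for v in values]
    for _ in range(rounds):
        i, j = random.sample(range(len(vals)), 2)
        # both nodes adopt the improved synopsis
        vals[i] = vals[j] = (vals[i] + vals[j]) / 2.0
    return vals

# four nodes sensing 1, 3, 9, 11: synopses drift toward the average, 6
print(gossip_average([1, 3, 9, 11]))
```

With enough rounds every synopsis approaches 6; the invariant that makes this "mass conserving" is that each exchange leaves the sum of all synopses unchanged.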
A closer look at the problem
The Implications of Irregular Data
A single erroneous sample can radically offset the data
Samples: 25°, 26°, 25°, 28°, 98°, 120°, 27°
The average (47°) doesn't tell the whole story
Sources of Irregular Data
• Sensor malfunction: short circuit in a seismic sensor
• Sensing error: an animal sitting on a temperature sensor
• Software bugs: in grid computing, a machine reports negative CPU usage
• Interesting info, DDoS: irregular load on some machines in a grid
• Interesting info, fire outbreak: extremely high temperature in a certain area of the woods
• Interesting info, intrusion: a truck driving by a seismic detector
It Would Help to Know the Data Distribution
Samples: 25°, 26°, 25°, 28°, 98°, 120°, 27°
The average is 47°
Bimodal distribution with peaks at 26.3° and 109°
Existing Distribution Estimation Solutions
Estimate a range of distributions [1,2] or data clustering according to values [3,4]
• Fast aggregation [1,2]
• Tolerate crash failures, dynamic networks [1,2]
• But: high bandwidth [3,4], multi-epoch [2,3,4], or one-dimensional data only [1,2]
• No data error robustness [1,2]
1. M. Haridasan and R. van Renesse. Gossip-based distribution estimation in peer-to-peer networks. In International Workshop on Peer-to-Peer Systems (IPTPS 08), February 2008.
2. J. Sacha, J. Napper, C. Stratan, and G. Pierre. Reliable distribution estimation in decentralised environments. Submitted for publication, 2009.
3. W. Kowalczyk and N. A. Vlassis. Newscast EM. In Neural Information Processing Systems, 2004.
4. N. A. Vlassis, Y. Sfakianakis, and W. Kowalczyk. Gossip-based greedy Gaussian mixture learning. In Panhellenic Conference on Informatics, 2005.
Our Solution
Outliers: samples deviating from the distribution of the bulk of the data
Estimate a range of distributions by data clustering according to values
• Fast aggregation
• Tolerate crash failures, dynamic networks
• Low bandwidth, single epoch
• Multi-dimensional data
• Data error robustness by outlier detection
Outlier Detection Challenge
Samples: 27°, 25°, 26°, 25°, 28°, 98°, 120°, 27°
A double bind:
• Regular data distribution: ~26°
• Outliers: {98°, 120°}
No one in the system has enough information
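For intuition, once a node has accumulated the bulk cluster, a simple variance test separates the outliers. A minimal sketch, assuming a 3-sigma threshold; the mean 26.2 and variance 1.36 are computed here from the five regular samples, not given in the deck:

```python
import math

def is_outlier(x, mean, var, threshold=3.0):
    """A sample is an outlier w.r.t. a Gaussian cluster if it lies more
    than `threshold` standard deviations from the cluster mean."""
    return abs(x - mean) > threshold * math.sqrt(var)

# bulk cluster fitted to the regular samples 25, 26, 25, 28, 27
mean, var = 26.2, 1.36
print([x for x in [25, 26, 25, 28, 98, 120, 27] if is_outlier(x, mean, var)])
# → [98, 120]
```

The catch, as the slide says, is that no single node initially has enough samples to fit that bulk cluster; it only emerges through aggregation.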
Aggregating Data Into Clusters
• Each cluster has its own mean and mass
• A bounded number (k) of clusters is maintained
Here k = 2
[Figure: samples a, b, c, d at values 1, 3, 5, 10 are merged step by step (first a and b, then a, b and c) until only k = 2 clusters remain: mass 3 at mean 3, and mass 1 at 10]
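A minimal sketch of this bounded-k clustering for 1-d samples, assuming clusters are (mass, mean) pairs and that the pair with the closest means is merged first (the greedy rule is an assumption here; the deck only fixes the bound k):

```python
def merge(c1, c2):
    """Merge two (mass, mean) clusters; the merged mean is mass-weighted."""
    m1, mu1 = c1
    m2, mu2 = c2
    m = m1 + m2
    return (m, (m1 * mu1 + m2 * mu2) / m)

def cluster(samples, k):
    """Greedy 1-d clustering: keep at most k (mass, mean) clusters,
    always merging the two with the closest means."""
    clusters = [(1.0, float(x)) for x in samples]
    while len(clusters) > k:
        clusters.sort(key=lambda c: c[1])
        # adjacent pair with the smallest gap between means
        i = min(range(len(clusters) - 1),
                key=lambda i: clusters[i + 1][1] - clusters[i][1])
        clusters[i:i + 2] = [merge(clusters[i], clusters[i + 1])]
    return clusters

# samples 1, 3, 5, 10 with k = 2
print(cluster([1, 3, 5, 10], 2))  # → [(3.0, 3.0), (1.0, 10.0)]
```

On the slide's example this reproduces the two final clusters: a, b and c collapse to mass 3 at mean 3, while d stays alone at 10.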
But What Does the Mean Mean?
[Figure: a new sample between two clusters, Gaussian A and Gaussian B, with means A and B]
The variance must be taken into account
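One standard way to take the variance into account (an assumption here; the deck does not name the metric) is a variance-normalized distance, i.e. the 1-d Mahalanobis distance:

```python
import math

def mahalanobis_1d(x, mean, var):
    """Variance-normalized distance of sample x from a 1-d Gaussian."""
    return abs(x - mean) / math.sqrt(var)

# Gaussian A: mean 0, small variance; Gaussian B: mean 10, large variance.
# A sample at x = 4 is closer to mean A in absolute terms ...
x = 4.0
a = mahalanobis_1d(x, 0.0, 1.0)    # 4 standard deviations from A
b = mahalanobis_1d(x, 10.0, 25.0)  # 1.2 standard deviations from B
# ... yet B is the better fit once variance is taken into account
print(a, b)
```

This is why a cluster summary of mean alone is not enough: assignment decisions flip once spread is considered.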
Gossip Aggregation of Gaussian Clusters
Distribution is described as k clusters
Each cluster is described by:
• Mass
• Mean
• Covariance matrix (variance for 1-d data)
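Merging two such clusters in 1-d combines masses, takes the mass-weighted mean, and folds the spread between the two means into the merged variance (the standard moment-combination identity; the function name is illustrative):

```python
def merge_gaussian(c1, c2):
    """Merge two 1-d Gaussian clusters given as (mass, mean, variance).
    The merged variance adds each cluster's own variance to the squared
    offset of its mean from the merged mean, mass-weighted."""
    m1, mu1, v1 = c1
    m2, mu2, v2 = c2
    m = m1 + m2
    mu = (m1 * mu1 + m2 * mu2) / m
    v = (m1 * (v1 + (mu1 - mu) ** 2) + m2 * (v2 + (mu2 - mu) ** 2)) / m
    return (m, mu, v)

# merging two unit-mass point samples at 1 and 3
print(merge_gaussian((1.0, 1.0, 0.0), (1.0, 3.0, 0.0)))  # → (2.0, 2.0, 1.0)
```

For multi-dimensional data the same identity applies with the covariance matrix in place of the variance and outer products in place of squares.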
Gossip Aggregation of Gaussian Clusters
[Figure: two nodes, a and b, exchange clusters and merge them]
Keep half, send half
Our solution:
• Aggregate a mixture of Gaussian clusters
• Merge when necessary (exceeding k)
• Recognize outliers
By the time we need to merge, we can estimate the distribution
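The "keep half, send half" exchange can be sketched as follows. The greedy closest-means merge rule and all names here are illustrative assumptions; the deck specifies only that each node ships half its mass and that mixtures are merged back down to k clusters.

```python
def merge_two(c1, c2):
    """Mass-weighted merge of two (mass, mean, variance) clusters."""
    m1, mu1, v1 = c1
    m2, mu2, v2 = c2
    m = m1 + m2
    mu = (m1 * mu1 + m2 * mu2) / m
    v = (m1 * (v1 + (mu1 - mu) ** 2) + m2 * (v2 + (mu2 - mu) ** 2)) / m
    return (m, mu, v)

def merge_down(clusters, k):
    """Repeatedly merge the two clusters with the closest means until at
    most k remain (the list stays sorted by mean)."""
    clusters = sorted(clusters, key=lambda c: c[1])
    while len(clusters) > k:
        i = min(range(len(clusters) - 1),
                key=lambda i: clusters[i + 1][1] - clusters[i][1])
        clusters[i:i + 2] = [merge_two(clusters[i], clusters[i + 1])]
    return clusters

def gossip_step(sender, receiver, k):
    """One 'keep half, send half' exchange: the sender halves every
    cluster's mass and ships the halves; the receiver folds them into
    its own mixture and merges back down to at most k clusters."""
    kept = [(m / 2, mu, v) for (m, mu, v) in sender]
    sent = [(m / 2, mu, v) for (m, mu, v) in sender]
    return kept, merge_down(receiver + sent, k)

# total mass is conserved across an exchange
new_a, new_b = gossip_step([(2.0, 5.0, 1.0)], [(2.0, 7.0, 1.0)], k=1)
print(new_a, new_b)
```

Because every step only splits and merges mass, the global mixture's total mass and weighted mean are invariants of the protocol sketch.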
Simulation Results:
1. Data error robustness
2. Crash robustness
3. Elaborate multidimensional data

It Works Where It Matters
[Figure: error with and without outlier detection; chart regions labeled "Not Interesting" and "Easy" flank the range where detection matters]
Protocol is Crash Robust
[Figure: average error vs. round for three runs: "no outlier detection, 5% crash probability", "no outlier detection, no crashes", and "outlier detection"]
Describe Elaborate Data
[Figure: temperature vs. distance, with clusters labeled "Fire" and "No Fire"]
Theoretical Results (In Progress)
The algorithm converges…
• Eventually all nodes have the same clusters forever
• Note: this holds even without atomic actions
• The invariant is preserved by both send and receive
… to the "right" output (where it matters)
• If outliers are "far enough" from other samples, then they are never mixed into non-outlier clusters
• They are discovered
• They do not bias the good samples' aggregate
Summary
Robust aggregation requires outlier detection
Samples: 27°, 27°, 27°, 27°, 27°, 98°, 120°
Summary – Our Protocol
We present outlier detection by Gaussian clustering:
• Merge
• Elaborate data
• Crash robustness
• Outlier detection (where it matters)