Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories
David Muelas, Miguel Gordo, Jose Luis García-Dorado, Jorge E. Lopez de Vergara
Email: {dav.muelas, jl.garcia, jorge.lopez vergara}@uam.es,[email protected]
Universidad Autonoma de Madrid
CNSM 2015 – November 2015
Network Health Check
Network managers must monitor network vital signs to ensure that the network is healthy:
[Figure: (a) ECG; (b) Dictyogram (normalized version): shares from 0 to 1 for categories Cat1 to Cat10 over time of day, 00:00:00 to 23:20:00.]
But... what exactly is a Dictyogram?
Dictyogram (from δίκτυο, "network" in Greek): a method to graphically trace network flow behavior over time. Its graphical output resembles an electrocardiogram for the network, showing its vital signs.
IntroductionMethod definition
Experimental resultsConclusions
Outline
1 Introduction
    Context
    Our Goals
2 Method definition
    Probability integral transform
    Modeling CDFs
3 Experimental results
    Model evaluation
    Dictyogram visualization
4 Conclusions
D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 4
Context
Network flow-based monitoring has proven useful for detecting network intrusions, malfunctions, and other types of anomalies.
Unfortunately, network managers have to deal with huge volumes of measurement data, and interpreting them has become a challenge.
Data summaries: it is difficult to reach a good trade-off between detail and simplification, and insufficient data can lead to restricted or even erroneous conclusions.
Measurements are not the only thing that matters for network management: applying suitable techniques improves the quality and depth of the knowledge that can be extracted from them.
Our Goals
Our proposal is intended to ease network managers' work with a novel approach to studying the behavior of network flow characteristics. Our main goal is to define comprehensive summaries of network flow data:
Our approach is based on the study of the ECDFs of different flow characteristics (e.g., flow size or duration distributions).
Using those ECDFs, we define flow categories by means of the probability integral transform (e.g., using decile-delimited intervals).
As we will see, this approach improves the detection of network anomalies and the visualization of network state.
Method description
Probability integral transform:
Let X be a continuous random variable with cumulative distribution function F_X. Then F_X(X) follows a uniform distribution on [0, 1].
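[Figure, panels (a) and (b): probability levels P_i on the [0, 1] axis are mapped through the CDF to category boundaries C_i = F_X^{-1}(P_i).]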
Then, we define flow categories from a set of probability levels, using the CDF of certain flow characteristics.
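The construction of the boundaries C_i = F_X^{-1}(P_i) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes decile levels P_i = i/10, takes the inverse as the empirical quantile function, and uses hypothetical log-normal flow sizes. The names `category_boundaries` and `categorize` are introduced here only for the example.

```python
import bisect
import random


def category_boundaries(sample, n_categories=10):
    """Return the boundaries C_i = F^{-1}(P_i) for P_i = i / n_categories.

    F^{-1} is taken as the empirical quantile function: the smallest
    observed value whose ECDF value is at least P_i.
    """
    ordered = sorted(sample)
    n = len(ordered)
    bounds = []
    for i in range(1, n_categories):
        # ceil(i * n / n_categories) in integer arithmetic (1-based rank).
        rank = (i * n + n_categories - 1) // n_categories
        bounds.append(ordered[max(rank, 1) - 1])
    return bounds


def categorize(value, bounds):
    """Category index: 0 if value <= C_1, ..., len(bounds) if value > C_last."""
    return bisect.bisect_left(bounds, value)


# Hypothetical flow sizes in bytes (log-normal, seeded for repeatability).
rng = random.Random(42)
flow_sizes = [rng.lognormvariate(6.0, 2.0) for _ in range(1000)]

bounds = category_boundaries(flow_sizes)
counts = [0] * (len(bounds) + 1)
for size in flow_sizes:
    counts[categorize(size, bounds)] += 1
# By construction, each decile category holds about 10% of the sample.
```

Because the categories are delimited by quantiles of the modeled distribution, traffic that follows the model fills them evenly; a change in the flow size distribution shows up as an imbalance between categories.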
Keep an eye on the hypotheses!
[Figure: the transform applied to (c) a Gaussian and (d) a Poisson distribution; each example has sub-panels (a) and (b) with vertical axes from 0 to 1.]
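The hypothesis that matters here is continuity: the probability integral transform yields a uniform variable only when X is continuous. The sketch below is an illustration under assumed parameters (a Gaussian with mean 30 and standard deviation 1, and a Poisson with rate 2; neither value is taken from the slides). It applies each distribution's own CDF to simulated samples and reports the fraction of transformed values falling in each tenth of [0, 1].

```python
import math
import random


def gaussian_cdf(x, mu, sigma):
    """CDF of a Gaussian distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def poisson_cdf(k, lam):
    """CDF of a Poisson distribution at integer k >= 0."""
    term = math.exp(-lam)
    total = term
    for j in range(1, k + 1):
        term *= lam / j
        total += term
    return min(total, 1.0)


def poisson_sample(rng, lam):
    """Draw a Poisson variate with Knuth's method (adequate for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def decile_fractions(us):
    """Fraction of values in each of the ten intervals of width 0.1."""
    counts = [0] * 10
    for u in us:
        counts[min(int(u * 10), 9)] += 1
    return [c / len(us) for c in counts]


rng = random.Random(0)
n = 10000
gauss_u = [gaussian_cdf(rng.gauss(30.0, 1.0), 30.0, 1.0) for _ in range(n)]
pois_u = [poisson_cdf(poisson_sample(rng, 2.0), 2.0) for _ in range(n)]

gauss_fracs = decile_fractions(gauss_u)  # close to 0.1 in every interval
pois_fracs = decile_fractions(pois_u)    # mass piles up on a few CDF jumps
```

In the discrete case F_X(X) can only take the values of the CDF at its jumps, so some intervals stay empty and categories defined by equal probability levels are no longer equally populated.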
How can we model a CDF?
Glivenko-Cantelli theorem: the ECDF converges to the CDF as the number of observations increases.
Nonetheless, the computational cost increases when we accumulate all the values of the characteristic under analysis.
Alternative approach: Functional Data Analysis:
Mean Function: F_X^{mean} = \frac{1}{n} \sum_{i=1}^{n} F_{X_i}
Problem: not robust
Functional Depth:
Maximum depth observation.
Median Function (it is the function that maximizes the functional depth we use).
Problem: more computationally expensive
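The mean function can be sketched by evaluating one ECDF per observation period on a common grid and averaging pointwise. The data, the grid, and the helper names below are hypothetical; the third period is made deliberately anomalous to illustrate why the mean is described as not robust.

```python
import bisect


def ecdf_on_grid(sample, grid):
    """Evaluate the ECDF of `sample` at every point of `grid`."""
    ordered = sorted(sample)
    n = len(ordered)
    # bisect_right counts the observations that are <= x.
    return [bisect.bisect_right(ordered, x) / n for x in grid]


def mean_function(curves):
    """Pointwise average of ECDFs evaluated on the same grid."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


# Hypothetical flow sizes (bytes) for three observation periods.
grid = [40, 100, 1500, 10000]
periods = [
    [40, 52, 80, 1500, 1500],
    [40, 44, 149, 501, 1500],
    [9000, 9000, 9500, 9500, 9800],  # anomalous period
]
curves = [ecdf_on_grid(p, grid) for p in periods]
model = mean_function(curves)
```

With these values the two ordinary periods reach 1.0 at 1500 bytes, while the averaged model only reaches about 0.67 there: a single unusual period pulls the whole model away from typical behavior.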
Dataset for the evaluation
To assess the advantages of our method, we have used a real dataset:
Flow records, Spanish Academic Network: more than one million users, more than 7 years of data.
Exporters: 5 NetFlow exporters in different geographical locations (all of them in Spain).
Packet-level sampling: rate of one out of 100 packets.
Period selected for the evaluation of the CDF estimation methods: 30 days.
Analyzing ECDFs to get a model of the typical behavior
[Figure: CCDF plot. x-axis: flow size (bytes), logarithmic scale from 10^1 to 10^8; y-axis: P(X > x), from 0 to 1. Legend: Mean, Deepest, Median. Marked points (x, P(X > x)): (40, 0.9), (44, 0.8), (53, 0.7), (80, 0.6), (149, 0.5), (501, 0.4), (1452, 0.3), (1500, 0.2), (3000, 0.1).]
Figure: Comparison between observed CCDFs (orange line, no marker) for Exporter A, and models obtained using the mean (blue line, circles), deepest (black line, diamonds) and median (red line, triangles) functions.
Empirical comparison (I)
[Figure: five panels, one per exporter (A to E); x-axis from 0 to 30; y-axis scales of 10^5 (A), 10^6 (B), 10^7 (C), 10^6 (D) and 10^6 (E). Legend: Mean, Deepest, Median.]
Figure: Evolution of Pearson's test statistic for all exporters. (Lower is better.)
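For reference, Pearson's test statistic compares observed category counts O_i with the counts E_i expected under the model: the sum over categories of (O_i - E_i)^2 / E_i. The sketch below assumes, as a simplification not stated on the slide, that the categories are equiprobable (as with decile boundaries), so that E_i is the total count divided by the number of categories. The count vectors are hypothetical.

```python
def pearson_statistic(observed, probabilities=None):
    """Pearson's chi-squared statistic: sum of (O_i - E_i)^2 / E_i.

    If `probabilities` is omitted, all categories are assumed equiprobable.
    """
    total = sum(observed)
    k = len(observed)
    if probabilities is None:
        probabilities = [1.0 / k] * k
    statistic = 0.0
    for count, p in zip(observed, probabilities):
        expected = total * p
        statistic += (count - expected) ** 2 / expected
    return statistic


# Hypothetical flow counts per decile category for two periods.
typical = [100, 98, 103, 99, 101, 100, 97, 102, 100, 100]
unusual = [400, 70, 65, 70, 68, 66, 64, 67, 65, 65]

typical_score = pearson_statistic(typical)
unusual_score = pearson_statistic(unusual)
```

A model whose boundaries match the observed traffic keeps the statistic small, which is why lower values indicate a better CDF estimate.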
Empirical comparison (II)
Table: Summary of the evaluation of the different methods to estimate the CDF.
Exporter   Method            # Best
A          Mean function     0
A          Deepest obs.      3
A          Median function   25
B          Mean function     0
B          Deepest obs.      6
B          Median function   22
C          Mean function     20
C          Deepest obs.      8
C          Median function   0
D          Mean function     0
D          Deepest obs.      23
D          Median function   5
E          Mean function     0
E          Deepest obs.      28
E          Median function   0
Final visualization of Dictyogram
[Figure: three panels, (a) Mean, (b) Deepest Observation, (c) Median. x-axis: time of day, 03:00:00 to 21:00:00; y-axis: concurrent flows for each category, up to 4 x 10^4. Two points are marked 1 and 2 in each panel.]
Figure: Dictyogram representation of f_i(t) with their respective size intervals delimited by the deciles given by (a) mean, (b) deepest observed ECDF, and (c) median.
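The series f_i(t) plotted in a Dictyogram can be sketched as a count of the flows that are active at each instant, split by size category. The record layout (start, end, size), the boundaries, and the function names below are assumptions made for this illustration; the normalized variant divides each row by its total.

```python
import bisect


def dictyogram_counts(flows, bounds, times):
    """Count concurrent flows per size category at each instant in `times`.

    flows  : iterable of (start, end, size) tuples (hypothetical layout)
    bounds : increasing category boundaries C_1 < ... < C_{k-1}
    Returns one row per instant; row[i] is f_i(t).
    """
    rows = []
    for t in times:
        row = [0] * (len(bounds) + 1)
        for start, end, size in flows:
            if start <= t < end:  # the flow is active at instant t
                row[bisect.bisect_left(bounds, size)] += 1
        rows.append(row)
    return rows


def normalize(rows):
    """Turn counts into shares so that each non-empty row sums to 1."""
    shares = []
    for row in rows:
        total = sum(row)
        shares.append([c / total if total else 0.0 for c in row])
    return shares


# Hypothetical flows: (start time, end time, size in bytes).
flows = [(0, 10, 40), (2, 6, 1500), (4, 12, 300), (5, 7, 90000)]
bounds = [100, 1000]
counts = dictyogram_counts(flows, bounds, times=[0, 5, 11])
shares = normalize(counts)
```

Stacking the columns of `counts` over time gives the absolute view, and stacking `shares` gives the normalized view in which every instant sums to 1.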
Final visualization of Dictyogram
[Figure: enlarged view of the median panel. x-axis: time of day, 00:00:00 to 23:20:00; y-axis: concurrent flows for each category, 0 to 4 x 10^4. Two points are marked 1 and 2.]
Figure: Zoom on the median panel.
Key remarks
Our method:
Is manager-friendly: it provides statistical summaries based on chosen probability levels, which eases the study of the flows traversing the network.
Links statistical properties to time evolution: it eases the detection of changes in the statistical properties of the characteristics under analysis.
Improves network flow data visualization: it lets managers control the resolution at which the distribution of network flow characteristics is displayed.
Future work
We plan to:
Study how to summarize several different network behaviors in a multivariate uniform distribution, and use other well-known distributions (not only the uniform) for signatures.
Study the distribution of Pearson's test statistic to detect anomalous events.
Test the stability of the CDF estimate (to define criteria for recalibrating the model).
Explore other representations with higher dimensionality.
Thank you!
Questions?
Annex: Functional depth
We use the definition given by:
MS_{n,H}(x) = \min\{SL_n(x), IL_n(x)\}    (1)
where
SL_n(x) = \frac{1}{n\lambda(I)} \sum_{i=1}^{n} \lambda\{t \in I : x(t) \le x_i(t)\}
IL_n(x) = \frac{1}{n\lambda(I)} \sum_{i=1}^{n} \lambda\{t \in I : x(t) \ge x_i(t)\}    (2)
With it, we consider:
Maximum depth observation.
Median Function (it is the function that maximizes the functional depth we use).
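A discretized version of equations (1) and (2) can be sketched as follows. It assumes that every curve is sampled on the same evenly spaced grid, so that the ratio of measures is approximated by a fraction of grid points. Only the maximum depth observation is computed; the median function is not covered by this example. The function names and sample curves are hypothetical.

```python
def functional_depth(x, curves):
    """Depth of curve `x` within `curves`, following equations (1) and (2).

    The measure of {t : x(t) <= x_i(t)} relative to the interval I is
    approximated by the fraction of grid points where the condition holds.
    """
    n = len(curves)
    m = len(x)
    sl = sum(sum(1 for a, b in zip(x, xi) if a <= b) / m for xi in curves) / n
    il = sum(sum(1 for a, b in zip(x, xi) if a >= b) / m for xi in curves) / n
    return min(sl, il)


def deepest_observation(curves):
    """Index of the observed curve with maximum depth."""
    depths = [functional_depth(c, curves) for c in curves]
    return max(range(len(curves)), key=depths.__getitem__)


# Hypothetical ECDFs sampled on a common four-point grid.
curves = [
    [0.10, 0.40, 0.80, 1.00],
    [0.20, 0.50, 0.90, 1.00],
    [0.30, 0.70, 0.95, 1.00],
]
deepest = deepest_observation(curves)
```

The curve that lies between the other two gets the highest depth, which matches the intuition of a "central" observation. Each depth evaluation compares a curve against all n observations, which is consistent with the higher computational cost noted for this approach.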