Exploiting Clustering Techniques for Web Session Inference
A. Bianco, G. Mardente, M. Mellia, M. Munafò, L. Muscariello
(Politecnico di Torino)
Outline
• Web Session Model
• Clustering techniques
• The proposed algorithm
• Performance of the algorithm
• Session statistics
Web session definition
• A single web client generates a succession of TCP flows and think times
[Figure: timeline of TCP flows separated by think times Toff]
• A session is defined here as a set of TCP flows that arrive close enough to one another
• For example, a threshold can be used to discriminate between think times and inter-arrivals of TCP flows
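The threshold idea can be sketched in a few lines. This is only an illustration: the function name `split_sessions`, the sample arrival times and the 10-second threshold are assumptions, not values from the slides.

```python
from typing import List


def split_sessions(arrivals: List[float], threshold: float) -> List[List[float]]:
    """Group flow arrival times into sessions.

    A new session starts whenever the gap to the previous flow exceeds
    `threshold`; such a gap is treated as a think time.
    """
    sessions: List[List[float]] = []
    for t in sorted(arrivals):
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)   # short gap: same session
        else:
            sessions.append([t])     # first flow, or long gap: new session
    return sessions


# Hypothetical arrival times in seconds.
flows = [0.0, 0.4, 1.1, 60.0, 60.3, 200.0]
print(split_sessions(flows, threshold=10.0))
```

The result depends entirely on the chosen threshold, which is the weakness the following slides address.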
Algorithms
• A threshold based approach needs a priori knowledge of the source
• An adaptive algorithm should be able to track traffic variations
• Such an algorithm is expected to be less sensitive to traffic characteristics
• Clustering is the chosen approach
Proposed algorithm
• Three steps
– K-means is run on all samples to obtain a first clustering; K is chosen very large
– Hierarchical clustering is run only on the representatives of each cluster; K is reduced
– K-means is run on all samples again
• To test the algorithm we need traffic whose sessions are known a priori, so the traffic is artificially generated
First Step: K-means
• K is chosen large enough but significantly smaller than the number of samples
• The K farthest flows determine the first partition
• K-means is run for 1000 iterations on all samples
• Each cluster is then represented using a subset of samples, one or two in our algorithm
– The mean value (Centroid method)
– The g-th and (100-g)-th percentiles (Single linkage method if g=0)
[Figure: a cluster with its g-th and (100-g)-th percentiles marked]
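A minimal sketch of this first step, assuming the samples are one-dimensional flow arrival times. The names `kmeans_1d` and `representatives` are hypothetical, the initial centres are simply spread evenly over the sorted samples (the slide's "K farthest flows" rule is not reproduced), and percentiles use a nearest-rank rule chosen here for simplicity.

```python
from statistics import mean
from typing import List, Tuple


def kmeans_1d(samples: List[float], k: int, iterations: int = 1000) -> List[List[float]]:
    """Plain Lloyd K-means on 1-D samples; returns the non-empty clusters."""
    xs = sorted(samples)
    # Assumed initialisation: k centres spread evenly over the sorted samples.
    centres = [xs[(i * (len(xs) - 1)) // max(k - 1, 1)] for i in range(k)]
    clusters: List[List[float]] = []
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for x in xs:
            nearest = min(range(len(centres)), key=lambda j: abs(x - centres[j]))
            clusters[nearest].append(x)
        clusters = [c for c in clusters if c]   # drop empty clusters
        new_centres = [mean(c) for c in clusters]
        if new_centres == centres:              # converged before the iteration cap
            break
        centres = new_centres
    return clusters


def representatives(cluster: List[float], g: float) -> Tuple[float, float]:
    """Return the g-th and (100-g)-th percentiles of a cluster (nearest rank)."""
    xs = sorted(cluster)
    low = xs[round(g / 100 * (len(xs) - 1))]
    high = xs[round((100 - g) / 100 * (len(xs) - 1))]
    return low, high


clusters = kmeans_1d([0.0, 1.0, 2.0, 100.0, 101.0, 102.0], k=2)
print([representatives(c, g=0) for c in clusters])
```

With g=0 the two representatives are the extreme samples of the cluster, which matches the single linkage case named on the slide.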
Second step: a hierarchical method
• A hierarchical method is used only on the representatives
• This method merges clusters until a quality function determines that the optimal number of clusters Nc has been found
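The merging loop can be sketched as follows for one-dimensional clusters described by (low, high) representatives. The slides do not define the quality function, so `choose_partition` uses a stand-in rule (stop just before the largest increase in merge distance); that rule and all names here are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def merge_all(reps: List[Interval]) -> Tuple[List[List[Interval]], List[float]]:
    """Agglomerative merging: always merge the two neighbours with the smallest gap.

    Returns the partition after every step (history[0] is the input) and
    the gap closed at each merge step.
    """
    clusters = sorted(reps)
    history = [list(clusters)]
    distances: List[float] = []
    while len(clusters) > 1:
        gaps = [clusters[i + 1][0] - clusters[i][1] for i in range(len(clusters) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        distances.append(gaps[i])
        merged = (clusters[i][0], max(clusters[i][1], clusters[i + 1][1]))
        clusters[i:i + 2] = [merged]
        history.append(list(clusters))
    return history, distances


def choose_partition(history: List[List[Interval]], distances: List[float]) -> List[Interval]:
    """Stand-in quality rule: keep the partition just before the largest jump."""
    if len(distances) < 2:
        return history[0]
    jumps = [distances[s + 1] - distances[s] for s in range(len(distances) - 1)]
    best = max(range(len(jumps)), key=jumps.__getitem__)
    return history[best + 1]


reps = [(0.0, 1.0), (1.5, 2.0), (2.2, 3.0), (50.0, 51.0), (51.5, 52.0)]
history, distances = merge_all(reps)
print(choose_partition(history, distances))
```

Because only the representatives are merged, the cost of this step depends on K rather than on the number of samples.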
Gamma function typical behaviour
[Figure: gamma versus step]
Third Step: K-means
• A K-means is performed on all samples
• This last step is not critical, but it reassigns samples to clusters, that is, flows to sessions
• It takes little CPU time, so running it costs almost nothing
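A sketch of this refinement, assuming one-dimensional arrival times and (low, high) intervals coming out of the hierarchical step. Seeding the centres at the interval midpoints and the name `refine_sessions` are choices made here, not details given on the slide.

```python
from statistics import mean
from typing import List, Tuple


def refine_sessions(samples: List[float],
                    merged: List[Tuple[float, float]],
                    iterations: int = 10) -> List[List[float]]:
    """Run a few K-means iterations on all samples, seeded by the merged clusters."""
    centres = [(low + high) / 2 for low, high in merged]   # assumed seeding
    sessions: List[List[float]] = []
    for _ in range(iterations):
        sessions = [[] for _ in centres]
        for x in sorted(samples):
            j = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            sessions[j].append(x)
        sessions = [s for s in sessions if s]   # drop empty sessions
        centres = [mean(s) for s in sessions]
    return sessions


print(refine_sessions([0.0, 1.0, 2.0, 3.0, 50.0, 51.0, 52.0],
                      [(0.0, 3.0), (50.0, 52.0)]))
```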
Performance evaluation
• Artificial traffic is generated according to an ON/OFF process
• During ON periods a succession of flows is generated using i.i.d. inter-arrivals
• In this model, inference means deciding whether an inter-arrival is an OFF period or an inter-arrival between flows within an ON period
• Every time the algorithm does not guess correctly, an error is counted
• All variables are assumed to be exponentially distributed
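The evaluation setup can be sketched as below. As a simplification, every session has the same number of flows, whereas the slides take all variables as exponential; the function names and parameters are hypothetical, and a plain threshold stands in for the classifier being scored.

```python
import random
from typing import List, Tuple

Gap = Tuple[float, bool]   # (duration, True if the gap is an OFF period)


def generate_gaps(n_sessions: int, flows_per_session: int,
                  t_arr: float, t_off: float, seed: int = 0) -> List[Gap]:
    """Labelled inter-arrivals of an ON/OFF source with exponential gaps."""
    rng = random.Random(seed)
    gaps: List[Gap] = []
    for s in range(n_sessions):
        for _ in range(flows_per_session - 1):
            gaps.append((rng.expovariate(1 / t_arr), False))   # gap inside an ON period
        if s < n_sessions - 1:
            gaps.append((rng.expovariate(1 / t_off), True))    # OFF period
    return gaps


def error_percentage(gaps: List[Gap], threshold: float) -> float:
    """Percentage of gaps whose guessed label (gap > threshold) is wrong."""
    errors = sum((gap > threshold) != is_off for gap, is_off in gaps)
    return 100 * errors / len(gaps)


gaps = generate_gaps(n_sessions=1000, flows_per_session=10, t_arr=1.0, t_off=500.0)
print(error_percentage(gaps, threshold=50.0))
```

Because the generator labels each gap, any inference method can be scored the same way by replacing the threshold test with its own decision.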
First step sensitivity (1/2)
• If the initial number of clusters is chosen large enough the method is less error prone
• The algorithm is much more sensitive to the value of the idle period
[Figure: percentage of errors versus T_{off}, with curves for K=1000, K=1500, K=2000 and K=2500]
First step sensitivity (2/2)
• Performance is sensitive to the choice of the percentile g
• When clusters are represented by flows at the border of the session (i.e. g=1), the method is less sensitive to traffic
• This is because a cluster has a long and narrow shape, which those representatives model well
[Figure: percentage of errors versus T_{off}, two panels. First panel: single linkage, centroid method, g=1, g=5. Second panel: g=15, g=25, g=35, g=45]
Comparison with threshold based algorithms – exponential case
[Figure: percentage of errors versus T_{off}, two panels. First panel: clustering and thresholds eta=T_{off}/2, T_{off}/4, T_{off}/8. Second panel: thresholds eta=T_{off}/16, T_{off}/32, T_{off}/64, T_{off}/128]
• Threshold based algorithms work well if traffic characteristics are known
• But they are very sensitive to the threshold value
• If sessions are already well separated, because idle periods are large compared with flow inter-arrivals, our algorithm performs very well
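The sensitivity to the threshold can be illustrated analytically. If in-session gaps and OFF periods are exponential with means T_arr and T_off, and a fraction p_off of all gaps are OFF periods, a threshold eta misclassifies an OFF period when it is shorter than eta and an in-session gap when it is longer. The values of `t_arr`, `t_off` and `p_off` below are hypothetical, not taken from the slides.

```python
import math


def threshold_error(eta: float, t_arr: float, t_off: float, p_off: float) -> float:
    """Probability that a threshold eta mislabels a gap (exponential model)."""
    missed_off = 1 - math.exp(-eta / t_off)   # OFF period shorter than eta
    false_off = math.exp(-eta / t_arr)        # in-session gap longer than eta
    return p_off * missed_off + (1 - p_off) * false_off


# Hypothetical parameters: mean gaps in seconds and share of OFF periods.
t_arr, t_off, p_off = 1.0, 500.0, 0.1
for k in range(1, 8):
    eta = t_off / 2 ** k
    print(f"eta = T_off/{2 ** k}: {100 * threshold_error(eta, t_arr, t_off, p_off):.3f}% errors")
```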
Comparison with threshold based algorithms – Pareto case
[Figure: percentage of errors versus T_{off}, two panels. First panel: clustering and thresholds eta=T_{off}/2, T_{off}/4, T_{off}/8. Second panel: thresholds eta=T_{off}/16, T_{off}/32, T_{off}/64, T_{off}/128]
• Threshold based algorithms work well if traffic characteristics are known
• But they are very sensitive to the threshold value
• If sessions are already well separated, because idle periods are large compared with flow inter-arrivals, our algorithm performs very well
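For the Pareto case, gaps with a prescribed mean can be drawn as sketched below. The shape parameter alpha=1.5 is an assumption (the slides do not state it), and the helper names are hypothetical. The standard library's `random.paretovariate(alpha)` returns values of at least 1, so the samples are rescaled to obtain the requested mean, which exists only for alpha > 1.

```python
import random


def pareto_scale(mean_gap: float, alpha: float) -> float:
    """Minimum value x_m of a Pareto law with the given mean (needs alpha > 1)."""
    return mean_gap * (alpha - 1) / alpha


def pareto_gap(rng: random.Random, mean_gap: float, alpha: float = 1.5) -> float:
    """Draw one Pareto-distributed gap whose expected value is mean_gap."""
    return pareto_scale(mean_gap, alpha) * rng.paretovariate(alpha)


def pareto_survival(x: float, mean_gap: float, alpha: float = 1.5) -> float:
    """P(gap > x) for the same Pareto law."""
    scale = pareto_scale(mean_gap, alpha)
    return 1.0 if x <= scale else (scale / x) ** alpha


rng = random.Random(0)
print([round(pareto_gap(rng, mean_gap=500.0), 1) for _ in range(5)])
print(pareto_survival(5000.0, mean_gap=500.0))
```

The survival function decays polynomially rather than exponentially, so very long in-session gaps and very short OFF periods are both more likely than in the exponential case.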
Some statistics on aggregated sessions
[Figure, four panels: PDF and complementary CDF of the number of TCP connections per session; PDF and complementary CDF of the session length [s], with curves "First SYN -> Last TCP Tear-Down" and "First SYN -> Last Data Segment"]
• Session sizes are (broadly) heavy tailed
– Usually each session is made of a few TCP flows
• How flow termination is defined (last TCP tear-down or last data segment) has little effect
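Heavy tails are usually checked on the complementary CDF, as in the plots above. A minimal way to compute it from inferred sessions is sketched here; the function name and the sample data are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def complementary_cdf(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (x, P(X > x)) for every distinct observed value x."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (n - bisect_right(xs, x)) / n) for x in sorted(set(xs))]


# Hypothetical number of TCP connections in each inferred session.
connections_per_session = [1, 1, 2, 10]
print(complementary_cdf(connections_per_session))
```

Plotted on log-log axes, a heavy-tailed sample gives a complementary CDF that decays roughly linearly instead of dropping off sharply.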
Some statistics on aggregated sessions
[Figure, two panels: PDF and complementary CDF of session data [bytes], with curves "Server -> Client" and "Client -> Server"]
• Similar results concerning server to client and client to server data
• The distribution law is similar; the asymmetry is in volume only
Flow’s and session’s inter-arrivals
[Figure: CDF versus time [s], with curves for T_{off} and T_{arr} from the Apr.04 and Oct.02 traces]
• The method infers similar sessions even from very different traces
• Tarr and Toff are well identified
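Once sessions have been inferred, the two inter-arrival samples can be extracted as sketched below. Here T_off is approximated as the time from the last flow arrival of one session to the first arrival of the next; that approximation and the name `session_gaps` are assumptions.

```python
from typing import List, Tuple


def session_gaps(sessions: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Split inter-arrivals into in-session gaps (T_arr) and idle gaps (T_off).

    `sessions` holds sorted flow arrival times, one list per session,
    with the sessions themselves in chronological order.
    """
    t_arr = [b - a for s in sessions for a, b in zip(s, s[1:])]
    t_off = [nxt[0] - cur[-1] for cur, nxt in zip(sessions, sessions[1:])]
    return t_arr, t_off


t_arr, t_off = session_gaps([[0.0, 0.5, 1.0], [60.0, 61.0]])
print(t_arr, t_off)
```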
Conclusions
• Clustering techniques can readily be used to infer web sessions
• The proposed algorithm is a mix of known clustering approaches
• It is able to deal with huge amounts of data
• Sessions seem to be very well recognized