Available at www.sciencedirect.com
Journal homepage: www.elsevier.com/locate/cose
Computers & Security 26 (2007) 137–144
Probabilistic analysis of an algorithm to compute TCP packet round-trip time for intrusion detection
Jianhua Yang*, Shou-Hsuan Stephen Huang
Department of Computer Science, University of Houston, 4800 Calhoun Road, Houston, TX 77204, USA
Article info
Article history:
Received 27 January 2006
Accepted 15 August 2006
Keywords:
Network security
Intrusion detection
Round-trip time
Stepping-stone
TCP packet
Abstract
Estimating the length of a connection chain is challenging and critical in detecting
stepping-stone intrusion. In this paper, we propose a novel method, called the standard
deviation-based clustering approach (SDBA), to estimate the length of an interactive
connection chain by computing round-trip time (RTT). SDBA takes advantage of the
distribution of RTTs and the inter-arrival distribution of 'send' packets. We prove that the
probability of making a correct selection of RTT through SDBA is bounded below by
$1 - 1/q^2$, where $q$ is a number related to the standard deviation of the RTT distribution
and the inter-arrival distribution of send packets. Experimental results show that SDBA can
compete against the best known algorithm in packet-matching rate and accuracy. This paper
also presents the restrictions of SDBA.
© 2006 Elsevier Ltd. All rights reserved.
1. Introduction
Using stepping-stones (Zhang and Paxson, 2000) to attack
others, called stepping-stone intrusion, has become popular
since the Internet came into wide use in every aspect of human
life. An important step in preventing this kind of attack is to detect
it accurately and efficiently. Many methods have been proposed
to detect stepping-stone intrusion, such
as those discussed by Zhang and Paxson (2000) and Yoda
and Etoh (2000). However, they suffer from (1) a high false
positive rate and (2) vulnerability to an intruder's
manipulation. One important way to overcome
these two problems is to estimate the length of a downstream
connection chain by computing packet round-trip time, even though
this may introduce false negative errors because it neglects
the upstream part of the connection chain. Here the 'length'
means the number of connections in a chain. If a connection
chain is estimated to include more than three or four
connections, there is a high probability that the chain is being
used by an intruder, since there is no reason to access a host
through a long chain rather than a direct connection except
in some very special applications.
Estimating the length of the downstream part of a connection chain
has become an important part of detecting
stepping-stone intrusion. So far there are basically two ways
to do this: one is the echo-delay comparison proposed by
Yung (2002), and the other is to match each send packet with its
corresponding echo packet, as proposed by Yang and Huang (2005).
The basic idea of Yung (2002) is to estimate the length of
the whole downstream connection chain by using the reply echo
packets on the chain, to compute the length of the connection
to the nearest downstream host by using the delayed
acknowledgement, called a yardstick, and then to compare
them to decide how long the downstream connection chain
is. The problem with this method comes from the selection of
the yardstick. If the yardstick is long relative to the other connections,
it is likely to introduce false negative errors. Otherwise,
* Corresponding author. The Department of Mathematics and Computer Science, Bennett College, 900 E. Washington Street, Greensboro, NC 27401, USA.
E-mail address: [email protected] (J. Yang).
0167-4048/$ – see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.cose.2006.08.011
the false positive rate is going to be high. Another problem is that
the method used by Yung (2002) to compute packet round-trip
time in a connection chain may be inaccurate, especially
when the network traffic is non-uniform. Yang and Huang's
(2005) method estimates the length of a connection chain
by computing packet round-trip time, and it focuses on
matching each send packet with its corresponding echo packet
precisely. Two algorithms were proposed to match TCP/IP
send and echo packets: the Conservative algorithm and
the Greedy algorithm (Yang and Huang, 2005). The Conservative
algorithm gives accurate packet matches but
only relatively few of them. The Greedy algorithm can
'match' all the send packets, but some of its 'matches' have low
confidence of correctness. Both algorithms struggle to
balance packet-matching rate and accuracy.
Matching TCP send and echo packets precisely, or equivalently computing
TCP packet round-trip time accurately, is challenging and
significant in detecting stepping-stone intrusion. In this
paper, we propose a novel method, SDBA, to compute packet
round-trip time more accurately than the two methods of
Yang and Huang (2005). We also evaluate SDBA by estimating
the probability of making a correct selection of RTT through
the Chebyshev inequality. SDBA can match
more send packets than the Conservative
algorithm with the same correctness, and can compete
against the Greedy algorithm on the number of packets
matched but with a higher matching accuracy. The experimental
results show that SDBA balances packet-matching
rate and accuracy well. Another contribution of this paper is
that it quantifies how good each packet match made by
SDBA is, through a probabilistic analysis of making a correct
selection of RTT. The time complexity of SDBA is also presented.
The rest of this paper is arranged as follows. Section 2 discusses
the motivation for proposing SDBA to compute packet
round-trip time. Section 3 describes SDBA in detail and
presents its probabilistic analysis. Section 4 presents some
experimental results. Finally, in Section 5 we summarize the
paper, discuss limitations of SDBA, and present some future
work.
2. Motivation
Matching each send packet with its corresponding echo packet is trivial
when there are no send–echo pair
overlaps. In other words, it is easy to match each send packet
with its corresponding echo packet when the echo packet is
always received before the next packet is sent. This is often
the case in a local area network rather than on the Internet.
Send–echo pair overlap occurs often on the Internet because
of network delay and competition for CPU time on the host.
For efficiency, the TCP/IP protocol allows a send packet to be echoed
by one or more packets, and allows the next packet to be sent
before the previous one is acknowledged (Ylonen, 2004a,b),
which complicates packet matching. There is no marker available
to identify each packet on the Internet. The only
unencrypted information we can use to identify a packet is its
size, timestamp, and sequence number. A packet's
size cannot reveal any information useful for identifying
the packet because it depends on an encryption key. The
Conservative and Greedy algorithms, which take advantage
of the sequence number to match TCP/IP send and echo
packets, have difficulty achieving a high matching rate and high
accuracy at the same time. When there are unmatched send packets
followed by some echo packets, there is no way to know which
echo packet matches which send packet, except that
the first send packet always matches the first echo
packet, which is the strategy used by the Conservative algorithm.
The strategy used by the Greedy algorithm is to match
them in order, which incurs some error because the mapping
is most probably not one-to-one.
The essential problem of the Conservative and Greedy
algorithms is that they only look at the packets locally when
they try to match each send packet. If we look at the packets
globally rather than locally, we may have additional information
to identify which send packet matches which echo
packet, even when there is send–echo overlap.
Here we take advantage of the timestamp of each packet
and the fact that, even though the packet RTTs of a connection
chain fluctuate because of the uncertainty of network
delay and host load, they should cluster around a specific
RTT value which represents the average traffic of the network.
In other words, the fluctuation of the real RTTs of
send packets on a connection chain over a period of time is
expected to be small with high probability.
We assume that there are several consecutive send packets
followed by some consecutive echo packets by which each
send packet must be echoed. Even though we do not know
which send packet corresponds to which echo packet or
packets, we do know that there must be one or more
echo packets corresponding to a specific send packet, say $s_i$.
So we simply assume that every echo packet could
match $s_i$ and compute the gap between each
echo packet and $s_i$, keeping only positive values. This gives
a data set for send packet $s_i$, and we form the
data sets for the other send packets in the same way. If we
look only at the data set of potential RTTs for $s_i$, there is still no
way to know which one is the real RTT of $s_i$. But if we look
at the data sets for several consecutive send packets, we
find an interesting phenomenon: there is one element
in each data set that is close to one element in each of the other data
sets. The more data sets there are, the higher the confidence.
That is, if we check send and echo packets globally, there is
a way to detect which send packet matches which echo
packet, whatever the scenario is.
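The observation above can be illustrated with a minimal Python sketch. The timestamps are hypothetical values in microseconds chosen for illustration (they are not measurements from this paper), and `candidate_gaps` is a helper name introduced here. Each send packet gets the set of positive gaps to every echo packet, and one value near 250,000 appears in every set.

```python
def candidate_gaps(sends, echoes):
    """For each send timestamp, list the positive gaps to every echo timestamp."""
    return [[e - s for e in echoes if e - s > 0] for s in sends]


# Hypothetical timestamps (microseconds): three sends, then three echoes,
# with every echo arriving roughly 250,000 us after its own send packet.
sends = [0, 100_000, 180_000]
echoes = [250_000, 352_000, 429_000]

gap_sets = candidate_gaps(sends, echoes)
for s, gaps in zip(sends, gap_sets):
    print(s, gaps)
```

Looking at any single list gives no hint of the true RTT, but the values 250,000, 252,000 and 249,000 stand out once the three lists are viewed together.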
The problem then becomes how to extract each RTT from
the data sets formed for the send packets. One obvious way to do
this is to use a data clustering method. Many clustering
approaches are discussed by Mirkin (1996) and Jain and Dubes
(1988). But most of them share a common problem:
the final clustering result depends on a predefined precision
parameter $\varepsilon$. The RTT sequence of a connection
chain in a given time period, however, is supposed to be unique.
Searching for a unique RTT sequence makes most existing
clustering methods unsuitable, especially the methods that only
work with $\varepsilon$. We studied the distribution of real RTTs of
an interactive session and found that most RTTs are concentrated
around their mean value, with very few outliers. This
phenomenon motivates us to take advantage
of the fluctuation of RTTs to extract them. We use
the standard deviation of the RTTs to characterize their fluctuation. The
standard deviation of the real RTTs is expected, with
high probability, to be smaller than that of any other combination
formed by picking one element from each data set. We assume that
the combination with the smallest standard deviation among all the
combinations drawn from the data sets has the highest probability of being the
real packet RTTs. These are the reasons that motivated us to
study SDBA, a standard deviation-based clustering approach,
to compute the packet RTTs of an interactive connection chain.
3. The algorithm and its probabilistic analysis
3.1. RTT algorithm
We are given two sequences $S = \{s_1, s_2, \ldots, s_n\}$ and $E = \{e_1, e_2, \ldots, e_m\}$,
where $s_1, s_2, \ldots, s_n$ are send packets and $e_1, e_2, \ldots, e_m$ are echo
packets corresponding to the send packets in $S$. If a packet $s_i$
in $S$ is echoed by a packet $e_j$ in $E$, we write $s_i \propto e_j$. If all the
packets are captured on a host in a connection chain over a
period of time, the following conditions must be satisfied:
(1) Any send packet in $S$ must be echoed by one or more
packets in $E$; similarly, any echo packet in $E$ must echo
one or more send packets in $S$;
(2) Packets in both $S$ and $E$ are stored in chronological order;
(3) For any two packets $s_i$, $s_j$ in $S$ and $e_p$, $e_q$ in $E$, if $s_i \propto e_p$, $s_j \propto e_q$,
and $i < j$, then we have $p \le q$.
Condition (1) indicates that the relationship between a send
packet and its corresponding echo packets may be one-to-one, many-to-one,
or one-to-many. The RTT of a send packet can be defined
as the gap between the timestamp of the send packet and that
of its corresponding echo packet if the relationship between
them is one-to-one. However, if the relationship is many-to-one
or one-to-many, the gap is not unique. If there are $k$
send packets from $s_i$ to $s_{i+k-1}$ echoed by $e_j$, the RTT of those
send packets is defined as the gap between the timestamp of
$s_{i+k-1}$ and that of $e_j$, i.e., the smallest gap. Similarly, if a send
packet $s_i$ is echoed by $k$ packets $e_j$ to $e_{j+k-1}$, only packets $s_i$
and $e_j$ are involved in the RTT definition of $s_i$. Conditions (2)
and (3) guarantee that send packets are replied to sequentially.
Each send packet must be echoed by one or more
packets successfully at one time, and the value of a send
packet's RTT must be positive. All three conditions
can be justified from the TCP/IP protocol (Ylonen, 2004a,b).
We compute the gaps between the timestamp of each echo
packet in $E$ and that of every send packet in $S$. It is safe to
eliminate the negative values because an RTT must be positive. We
group these differences into sets according to each echo packet
in $E$, forming data sets $E_1, E_2, \ldots, E_m$ for echo packets $e_1, e_2, \ldots,
e_m$, respectively:
$E_1 = \{s_1e_1, s_2e_1, \ldots, s_ne_1\}$,
$E_2 = \{s_1e_2, s_2e_2, \ldots, s_ne_2\}$,
$\ldots$
$E_m = \{s_1e_m, s_2e_m, \ldots, s_ne_m\}$,
where element $s_ie_j$ in $E_j$ represents the gap $e_j - s_i$ between the
timestamp of the $j$th echo packet in $E$ and that of the $i$th send
packet in $S$, with $1 \le i \le n$ and $1 \le j \le m$. For convenience of analysis,
here we generate each data set based on each echo
packet in $E$. This is equivalent to generating each data
set based on each send packet in $S$, as discussed in
Section 2.
We know that each send packet can only be echoed by
one or more packets successfully at one time, which indicates
that in each data set $E_j$ there is only one element that
represents the real RTT of a send packet. We construct a
cluster $X_u$, which is one candidate for the RTT sequence,
by taking one element from each data set $E_j$ ($1 \le j \le m$)
and storing them in $X_u$ according to the chronological order
of the echo packets in $E$. If we go through all the possible
combinations, there will be up to $n^m$ possible clusters, but
only one of them represents the correct RTTs for all echo
packets. Each cluster $X_u$ ($1 \le u \le n^m$) has $m$ elements, while
some of them may share the same echo or send packet.
We remove the send (echo) packets which share the same
echo (send) packet but keep the one with the smaller gap.
Eventually we get $X_u$ ($1 \le u \le n^m$) with each element relating
to a unique send packet and a unique echo packet. Each cluster $X_u$ is
a candidate for the RTTs of the send packets in $S$, while
only one of them is the real one, or close to the real one,
with high probability. We select the one with the smallest standard
deviation among all clusters $X_u$ ($1 \le u \le n^m$) as the
RTTs of the packets in $S$. The following is the simplified algorithm,
called the standard deviation-based clustering approach
(SDBA), to compute packet RTTs given send packet
sequence $S$ and echo packet sequence $E$.
Algorithm SDBA($S$, $E$)
Begin:
(1) Generate data sets $E_j$ ($1 \le j \le m$): $E_j = \{t(i, j) \mid t(i, j) = e_j - s_i,\ i = 1, \ldots, n \ \&\ t(i, j) > 0\}$.
(2) Combine data from sets $E_j$ ($1 \le j \le m$) to form clusters $X_u$
($1 \le u \le n^m$): $X_u = \{t(i_j, j) \in E_j \mid \forall\, 1 \le j \le m \ \&\ i_1 < i_2 < \cdots < i_m\}$.
(3) For each cluster $X$: (a) if $x(i, j), x(i, k) \in X$, $j < k$, then delete
$x(i, j)$, and (b) if $x(i, j), x(k, j) \in X$, $i < k$, then delete $x(k, j)$.
(4) Output $R = \{r_1, r_2, \ldots, r_s\}$ ($1 \le s \le n$), which is the cluster $X$
with the smallest standard deviation among all $X_u$ ($1 \le u \le n^m$).
End
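The selection rule of steps (1), (2) and (4) can be sketched in Python as below. This is a brute-force illustration under stated assumptions, not the authors' implementation: the function name `sdba_bruteforce` and the timestamps are hypothetical, the send indices are taken to be strictly increasing as in step (2), so every echo packet is paired with a distinct send packet and the clean-up of step (3) has nothing left to delete, and the population standard deviation is used as the spread measure.

```python
from itertools import combinations
from statistics import pstdev


def sdba_bruteforce(sends, echoes):
    """Return (send_indices, rtts) of the assignment with the smallest spread."""
    n, m = len(sends), len(echoes)
    if m == 0:
        return None
    best = None
    # Enumerate every strictly increasing index tuple i_1 < i_2 < ... < i_m.
    for idx in combinations(range(n), m):
        gaps = [echoes[j] - sends[i] for j, i in enumerate(idx)]
        if min(gaps) <= 0:  # an RTT must be positive
            continue
        spread = pstdev(gaps)
        if best is None or spread < best[0]:
            best = (spread, idx, gaps)
    return None if best is None else (best[1], best[2])


# Hypothetical capture (microseconds): sends 1 and 2 are both echoed by echo 1.
sends = [0, 100_000, 120_000, 180_000]
echoes = [250_000, 371_000, 429_000]
result = sdba_bruteforce(sends, echoes)
print(result)
```

With this input the chosen send indices skip the first of the two send packets that share an echo, which agrees with the definition of RTT for the many-to-one case given earlier (the gap from the last of the send packets). The enumeration grows combinatorially with the number of packets, so the sketch is only practical for short segments.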
Let us analyze the time and space complexity of this algorithm.
The time complexity of SDBA in the worst
case is $O(n^m)$. This costs too much CPU time and makes
SDBA inefficient for practical use. However, space complexity
is not a serious problem for SDBA because it is not necessary
to store all the combinations. We only need to
remember the combination with the smallest standard deviation.
The smaller $m$ and $n$ are, the better the time complexity.
One way to reduce $m$ and $n$ is to divide a long packet
stream of a user into subsections, based on the following observation:
if the gap between two consecutive send packets,
such as $s_i$ and $s_{i+1}$ with some echo packets in between, is
more than a predefined threshold (usually 1–5 s), it is reasonable
to assume that all the echo packets after $s_{i+1}$ only echo
the send packets from $s_{i+1}$ onward. Even so, the number of send and
echo packets in a subsection could be in the hundreds when monitoring some real
Internet traffic. So the time complexity of SDBA is a serious
problem, even though SDBA could give better results in computing
RTTs than other methods.
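The subdivision described above can be sketched as follows. The function name `split_stream`, the default threshold of 1 s, and the timestamps are assumptions made for illustration; both input lists are assumed to be sorted timestamps in microseconds.

```python
def split_stream(sends, echoes, threshold_us=1_000_000):
    """Split sorted send/echo timestamps into independent (sends, echoes) segments."""
    segments = []
    start = 0      # index of the first send packet in the current segment
    echo_pos = 0   # index of the first echo packet not yet assigned to a segment
    for i in range(len(sends) - 1):
        gap = sends[i + 1] - sends[i]
        # Is at least one echo packet captured between s_i and s_{i+1}?
        has_echo_between = any(
            sends[i] < e < sends[i + 1] for e in echoes[echo_pos:]
        )
        if gap > threshold_us and has_echo_between:
            cut = echo_pos
            while cut < len(echoes) and echoes[cut] < sends[i + 1]:
                cut += 1
            segments.append((sends[start:i + 1], echoes[echo_pos:cut]))
            start, echo_pos = i + 1, cut
    segments.append((sends[start:], echoes[echo_pos:]))
    return segments


# Hypothetical capture with a pause of almost 3 s after the second send packet.
sends = [0, 100_000, 3_000_000, 3_100_000]
echoes = [250_000, 352_000, 3_250_000, 3_351_000]
segments = split_stream(sends, echoes)
print(segments)
```

Each segment can then be processed on its own, which keeps $n$ and $m$ small for the matching step.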
The inefficiency of SDBA comes from its global computation, that is,
looking over all the possibilities of $X$ to find the
RTTs. To make SDBA more efficient, we can replace the global
computation by a local computation. Instead of generating all
the possibilities of $X$, we only generate the clusters which
are the most likely candidates for the RTTs. Taking one element
of the first data set $E_1$ as the first element of candidate $X$, we
then check the second data set $E_2$ and take the element of
$E_2$ that gives $X$ the smallest standard deviation when it
is added to $X$. Similarly, we take one element from
each of the other data sets $E_j$ ($3 \le j \le m$) and add it to $X$ so that
$X$ has the smallest standard deviation compared with selecting
any other element of that data set. Consequently, for the $n$ elements
of data set $E_1$ we generate $n$ candidates. We select the one with the
smallest standard deviation among the $n$ candidates
as the RTTs. Suppose there are $n$ elements in $E_1$ (the worst
case) and $m$ data sets; the complexity of this algorithm is
$O(n \times (n + n) \times (m - 1)) = O(m \times n^2)$. Compared with the time
complexity of SDBA, this is clearly more efficient. The
problem is that we cannot guarantee the correctness of its results
because we do not go over all the possible combinations.
Although we cannot prove the correctness of the
result of SDBA either, we can evaluate the results of
SDBA by computing the probability of making a correct selection
of RTT in SDBA when the cluster $X$ with the smallest standard
deviation is chosen among all the possibilities.
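A minimal Python sketch of this local computation is given below. The name `sdba_local` and the timestamps are hypothetical, and the sketch makes two simplifying assumptions: it does not enforce that each send packet is used only once, and it recomputes the standard deviation from scratch at every step, so it is slower than the $O(m \times n^2)$ bound quoted above (running sums would be needed to reach it).

```python
from statistics import pstdev


def sdba_local(sends, echoes):
    """Greedy variant: grow one candidate cluster from each element of E_1."""
    # E_1 ... E_m: positive gaps between each echo and every send packet.
    gap_sets = [[e - s for s in sends if e - s > 0] for e in echoes]
    if not gap_sets or any(not gaps for gaps in gap_sets):
        return None
    best = None
    for seed in gap_sets[0]:
        cluster = [seed]
        for gaps in gap_sets[1:]:
            # Add the gap that keeps the spread of the cluster smallest.
            cluster.append(min(gaps, key=lambda g: pstdev(cluster + [g])))
        spread = pstdev(cluster)
        if best is None or spread < best[0]:
            best = (spread, cluster)
    return best[1]


# Hypothetical capture (microseconds), same shape as a many-to-one scenario.
sends = [0, 100_000, 120_000, 180_000]
echoes = [250_000, 371_000, 429_000]
rtts = sdba_local(sends, echoes)
print(rtts)
```

Because only one cluster is grown per seed, a poor early choice cannot be undone later, which is the reason the text notes that correctness cannot be guaranteed for this variant.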
3.2. Probabilistic analysis of SDBA
We first estimate the probability of making an incorrect choice
of RTT for any element $r_j$ of a cluster $R$. We assume the distribution
of $R$ is $Z$ with mean $\mu_1$ and standard deviation $\sigma_1$, and the
send packet interval distribution is $Y$ with mean $\mu_2$ and standard
deviation $\sigma_2$. Suppose $r_j$ is selected from $E_j = \{s_1e_j, s_2e_j,
\ldots, s_{k-1}e_j, s_ke_j, s_{k+1}e_j, \ldots, s_ne_j\}$, and assume the correct selection
is $s_ke_j$ but an incorrect element in $E_j$ is selected by
the algorithm. For $R$ to have the smallest
standard deviation, the incorrectly selected element of $E_j$ must
be closer to $\mu_1$ than $s_ke_j$. The two neighbouring elements $s_{k-1}e_j$ and
$s_{k+1}e_j$ have the highest possibility of being selected incorrectly, because
the elements in $E_j$ are in descending order. Here we assume
$s_{k+1}e_j$ is closer to $\mu_1$ than $s_{k-1}e_j$. Consequently, $s_{k+1}e_j$
is selected incorrectly only if
$$\left| s_{k+1}e_j - \mu_1 \right| < \left| s_ke_j - \mu_1 \right|. \qquad (1)$$
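We assume the smallest interval between send packets is $L$.
Most probably the interval between $s_k$ and $s_{k+1}$ is larger
than $L$, while in the worst case it is equal:
$$s_{k+1} - s_k = L = 2q\sigma_1 \qquad (2)$$
Here $q$ is a real number. From Eq. (2), for any echo packet $e_j$, we
have
$$s_{k+1} - e_j + e_j - s_k = 2q\sigma_1$$
$$(s_{k+1} - e_j) - (s_k - e_j) = 2q\sigma_1$$
$$s_ke_j - s_{k+1}e_j = 2q\sigma_1$$
$$(s_ke_j - \mu_1) + (\mu_1 - s_{k+1}e_j) = 2q\sigma_1$$
which, when $\mu_1$ lies between the two gaps, gives
$$\left| s_{k+1}e_j - \mu_1 \right| + \left| s_ke_j - \mu_1 \right| = 2q\sigma_1 \qquad (3)$$
From Eqs. (1) and (3), we have
$$\left| s_ke_j - \mu_1 \right| > q\sigma_1.$$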
We estimate the probability that $r_j$ is selected incorrectly
by using the Chebyshev inequality (Kao, 1996; Feller, 1968) as
follows:
$$P(r_j \text{ is selected incorrectly}) = P(s_{k+1}e_j \text{ is selected}) = P\left(\left| s_{k+1}e_j - \mu_1 \right| < \left| s_ke_j - \mu_1 \right|\right) = P\left(\left| s_ke_j - \mu_1 \right| > q\sigma_1\right) < \frac{1}{q^2}$$
The probability of making an incorrect selection of a packet
RTT is therefore bounded if we select the cluster with the smallest standard
deviation among all $X_u$ ($1 \le u \le n^m$) as the RTT sequence.
In other words, the probability of making a correct selection
of a packet RTT can be estimated by using the following
bound:
$$P(r_j \text{ is selected correctly}) \ge 1 - \frac{1}{q^2}. \qquad (4)$$
The parameter $q$ varies depending on the inter-arrival
distribution of the send packets and the distribution of RTTs.
3.3. Estimation of parameter q
The parameter $q$ is determined by the smallest interval $L$ of
distribution $Y$, so the probability of making an incorrect selection
of a packet RTT is determined by $L$. If the interval
between two consecutive send packets is the smallest one
in $Y$, we get the lowest bound on the probability. However,
the probability that the interval between two consecutive
send packets takes the smallest value of $Y$ is
very small. So in practice we usually do not use the smallest
interval to estimate the parameter $q$. Instead we use the interval $L_p$
for which the cumulative probability $P(x < L_p)$ in $Y$ is
5%. We estimate $L_p$ under the assumption that $Y$ is a Gamma
distribution with shape parameter $\beta$ and scale parameter $\alpha$.
In other words, we select the $L_p$ that satisfies the following
equation:
$$\int_0^{L_p} \frac{(x/\alpha)^{\beta-1} e^{-x/\alpha}}{\alpha\,\Gamma(\beta)}\,dx = 0.05 \qquad (5)$$
where $\Gamma(\beta) = \int_0^{\infty} e^{-u} u^{\beta-1}\,du$.
We can compute $L_p$ from Eq. (5) if $\beta$ and $\alpha$ are known.
Parameters $\beta$ and $\alpha$ vary with keystroke behaviour and network
environment. The usual way to estimate $L_p$ is to take
a sample of send packet inter-arrival times, estimate the
parameters $\beta$ and $\alpha$ by MLE (maximum likelihood
estimation) or other methods (Johnson and Kotz, 1970),
and then compute $L_p$ for the distribution with the
estimated parameters $\beta$ and $\alpha$. This approach is appropriate for
an individual computation, but not convenient for probabilistic
analysis. Here we instead estimate the range
of the parameters $\beta$ and $\alpha$ of the interval distribution of
send packets, and compute the range of $L_p$ from the range
of the estimated parameters. We use the lower bound
of $L_p$ to compute the probability that one element is
selected correctly in SDBA. We can then tell roughly
how well SDBA performs from the probability estimated by using
Eq. (4).
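Solving Eq. (5) for $L_p$ amounts to finding the 5% quantile of a Gamma distribution. The sketch below does this with the Python standard library only; the function names `gamma_cdf` and `gamma_quantile` are introduced here, and the series expansion plus bisection is one simple numerical choice, not necessarily the procedure used for the experiments in this paper.

```python
import math


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the power series of the lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (shape + k)
        total += term
        k += 1
    return total * math.exp(shape * math.log(z) - z - math.lgamma(shape))


def gamma_quantile(p, shape, scale):
    """Smallest x with gamma_cdf(x) >= p, found by bisection."""
    lo, hi = 0.0, scale
    while gamma_cdf(hi, shape, scale) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Parameters of the first sample in Table 1 (alpha in microseconds).
lp = gamma_quantile(0.05, shape=2.043, scale=137_280)
print(round(lp))
```

For the first sample of Table 1 the result should land close to the reported $L_p$ of 51,115 microseconds; small differences are expected because the table lists rounded parameters.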
We ran many experiments with different users and different
environments (i.e., connection chains with different
paths) on the Internet and present some typical examples in
Table 1, where the send packet inter-arrival times of
each sample, as well as $L_p$, are in microseconds. From
the experiments we see that the interval $L_p$ of
send packets with cumulative probability 0.05 ranges from approximately
32,000 to 52,000 microseconds. There is no way to predict the exact
range of $L_p$ from experiments alone without further theoretical
analysis. In Section 4, we will use the lower bound of $L_p$ to
compute the probability of making a correct selection of RTT
in order to evaluate how well SDBA finds the packet RTTs of an interactive
session.
4. Empirical study
We have discussed that a Gamma distribution can be used to
model the inter-arrival times of send packets in an interactive
session. This is the foundation of our probabilistic analysis. So
in this section, we first show that it is reasonable to model the
inter-arrival distribution of send packets by a Gamma distribution.
We have proved that the probability of making a correct
selection of RTT in SDBA is bounded below by the value
given by Eq. (4). We then compute some real examples
to give a practical sense of this bound. SDBA can compete
against the Conservative and the Greedy algorithms in both
packet-matching rate and accuracy. Finally, we compare the
performance of SDBA, the Conservative, and the Greedy
algorithms on packet matching.
4.1. Inter-arrival distribution of send packets
We established a connection chain which spanned the U.S. and
Mexico by using OpenSSH (Ylonen, 1996). There is at least
one host, such as Acl08, on which we have administrator access,
while we have only regular user rights on all the other
hosts. At the starting point of the chain we asked several
students to simulate intruders by typing some commands
independently, and we collected all the send packets on the
corresponding outgoing connection of Acl08. We computed the
intervals of these send packets and used Matlab to fit their
distribution. Before fitting the distribution, we first drew the
Table 1 – The range of $L_p$ estimated by experiment
Sample   Sample size   $\beta$   $\alpha$ (µs)   $L_p$ (µs)
1        1297          2.043     137280          51115
2        990           1.956     137480          46448
3        816           1.4434    212600          33733
4        900           1.809     143970          40541
5        176           1.426     280220          43016
6        800           1.629     172720          37617
7        412           1.364     242270          32874
histogram of these data to see what kind of distribution they
resemble. They look most like a Gamma distribution
with a shape parameter larger than one. We then used the
Matlab distribution fitting function to estimate the shape parameter
$\beta$ and scale parameter $\alpha$.
Once we have obtained these two parameters, we have
a theoretical distribution determined by its shape parameter
$\beta$ and scale parameter $\alpha$. We use the quantile–quantile function
of Matlab to verify how well the Gamma distribution fits the
sample. Fig. 1 shows the verification result for one typical example,
where the X- and Y-axes have scale $10^5$.
In this example the shape and scale parameters are estimated
to be 2.0426 and 137,280, respectively. From Fig. 1 we
found that the points with intervals of more than 400,000 microseconds
are not well fitted by the Gamma distribution, where
the gray dashed line (red in the web version) indicates an ideal
fit. But the points with intervals of less than 400,000 microseconds
are modelled closely by this Gamma distribution
with $\beta = 2.0426$ and $\alpha = 137{,}280$. We are confident about the
value of $L_p$ computed from the Gamma distribution because it
is much less than 400,000.
4.2. Sample experiments
The key idea of the SDBA algorithm is to select the combination
of send–echo gaps with the smallest standard deviation as
the RTT sequence. The best way to verify this idea is to compare
the RTT sequence from SDBA with the corresponding correct
RTTs to see if they are consistent. The problem is that there is no
way to know the correct RTTs for the packets of an interactive
session in general. If there were a way to find the correct RTTs, we
would not need to propose the above algorithm. From Yang and
Huang (2005) we know that matching each send packet with its
corresponding echo packet is trivial when there is no send–echo
pair overlap. So in our first experiment we control the keystroke
speed so as to generate a scenario without send–echo
pair overlap, which makes it easy to compute the correct
RTTs. We then compare the RTT sequence coming
from SDBA with them to verify whether SDBA computes the RTTs correctly.
This validates SDBA on a controlled data set.
Another way to evaluate SDBA when there are send–echo
pair overlaps, which occur often on the Internet where we
Fig. 1 – Verification of send packets interval distribution. Axes: X Quantiles, Y Quantiles.
do not have the correct RTTs, is to assess its performance by computing
the probability of making a correct selection of RTT.
We established a connection chain similar to the one in the previous
section. The students were asked to control their keystroke
speed. We collected all the send and echo packets
over a period of time at Acl08. First we matched the send and
echo packets to compute the correct RTTs, and then used
the send and echo packet sets as the input of SDBA to get
the RTT sequence. We repeated the experiment many times;
one of the comparisons is presented in Fig. 2, where the
Y-axis represents the RTT value in microseconds and the
X-axis represents the RTT index number. This experimental
result shows that the RTTs from SDBA are exactly the same
as the correct RTTs.
The second experiment is for the situation where there are
send–echo pair overlaps. The student participants typed independently
and freely; we captured all the send and echo
packets over a period of time and computed the RTTs with
SDBA. We take $L_p$ as its lower bound 32,874 and compute the
lower bound on the probability of making a correct selection of RTT
by using Eq. (4). Three examples are presented in Table 2,
where the second to the fifth columns are the average value of
the RTTs with the smallest standard deviation in microseconds,
the standard deviation, the value of $q$, and the bound on
the probability, respectively. From the estimated probabilities,
we are confident about the results from SDBA because the
probabilities in these three examples are all higher than
97%. So even if we cannot compare the result from SDBA to
Fig. 2 – Verification of SDBA under the situation without send–echo overlap. Axes: RTT index (X), RTT value in microseconds, scale $10^5$ (Y). Series: Real RTT, SDBA.
Table 2 – The results of probability estimation
Example   $\mu$ (µs)   $\sigma$ (µs)   $q$      $p$
1         264947.0     2810.708        11.695   0.9927
2         265756.3     5514.666        5.9612   0.9719
3         265727.2     5549.605        5.9237   0.9715
a correct RTT, because we do not have one when
there are send–echo pair overlaps, we can still evaluate
SDBA by estimating the probability of making a
correct selection of RTT.
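The probabilities in Table 2 can be checked with a few lines of Python. The function name `correct_selection_bound` is introduced here for illustration. One observation about the inputs: the $q$ values listed in Table 2 are numerically equal to $L_p/\sigma$ (for example 32,874 / 2810.708 ≈ 11.696), whereas Eq. (2) read literally gives $q = L/(2\sigma_1)$. The sketch therefore takes $q$ as an argument, so either convention can be plugged in, and uses $L_p/\sigma$ only to reproduce the table.

```python
def correct_selection_bound(q):
    """Lower bound 1 - 1/q^2 from Eq. (4); it is only informative for q > 1."""
    if q <= 1:
        return 0.0
    return 1.0 - 1.0 / (q * q)


L_P = 32_874  # microseconds, the lower bound of L_p used in the text

# Standard deviations of the three examples in Table 2 (microseconds).
sigmas = (2810.708, 5514.666, 5549.605)
bounds = [round(correct_selection_bound(L_P / s), 4) for s in sigmas]
print(bounds)
```

Under the stricter reading $q = L_p/(2\sigma)$ the same function would return smaller bounds, so the choice of convention matters when interpreting the 97% figure.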
4.3. Packet-matching algorithm comparison
The Conservative algorithm is supposed to give correct packet-matching
results (Yang and Huang, 2005), but only a few
packets are matched when there are send–echo pair overlaps.
If there is no send–echo pair overlap, the Conservative and
Greedy algorithms are both supposed to match TCP packets
correctly.
First, we compare SDBA with the Conservative and Greedy
algorithms in the situation where there is no send–echo pair
Fig. 3 – Packet-matching comparison among the Conservative, Greedy, and SDBA without send–echo pair overlaps. Axes: RTT index (X), RTT value in microseconds, scale $10^5$ (Y). Series: SDBA, Conservative, Greedy.
Fig. 4 – Packet-matching comparison between the Conservative and SDBA with send–echo pair overlaps. Axes: RTT index (X), RTT value in microseconds, scale $10^5$ (Y). Series: SDBA, Conservative.
overlap. As before, we typed as slowly as possible during the
experiment to make sure that no send–echo pairs overlapped.
The three algorithms ran on Acl08 over the same time interval
and monitored the same connection chain. Part of the
packet-matching results of the three algorithms is shown in
Fig. 3, where each point represents the RTT gap for a send
packet. Fig. 3 shows that when there is no send–echo pair
overlap, the three methods produce the same packet-matching
results and therefore compute the same RTTs.
Second, send–echo pair overlaps are very likely on the
Internet, and in that situation we cannot claim that the three
algorithms still give the same packet-matching results.
What we can rely on is that the Conservative algorithm still
gives correct results, although it matches fewer send packets.
Comparing the packet-matching results of SDBA with those of
the Conservative and Greedy algorithms therefore shows the
performance of SDBA in both matching rate and accuracy.
Fig. 4 shows the packet-matching comparison between
the Conservative algorithm and SDBA when there are
send–echo pair overlaps. We collected 169 send packets,
of which 44 were matched by the Conservative algorithm
(only 29 are displayed in Fig. 4 for clarity) and all 169
were matched by SDBA. The RTT gaps found by the
Conservative algorithm are all included in the RTT gaps
found by SDBA. Although we cannot be sure about the
correctness of the remaining RTTs, this comparison gives
a sense of the correctness of the RTTs computed by SDBA.
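The two checks used in this comparison, the matching rate and the containment of one result set in another, can be sketched as follows. The data below are synthetic and only mirror the reported counts (169 send packets, 44 matched by the Conservative algorithm, 169 by SDBA); the RTT values and the function names `matching_rate` and `contained_in` are assumptions made for illustration.

```python
def matching_rate(matched, total_sends):
    """Fraction of send packets for which an RTT was produced."""
    return len(matched) / total_sends


def contained_in(reference, candidate):
    """True if every (send index, RTT gap) of reference also appears in candidate."""
    return all(candidate.get(index) == gap for index, gap in reference.items())


TOTAL_SENDS = 169

# Synthetic results: send index -> RTT gap in microseconds.
sdba = {i: 265_000 + 100 * (i % 7) for i in range(TOTAL_SENDS)}
conservative = {i: sdba[i] for i in range(44)}  # a 44-packet subset

print(f"Conservative rate: {matching_rate(conservative, TOTAL_SENDS):.3f}")
print(f"SDBA rate:         {matching_rate(sdba, TOTAL_SENDS):.3f}")
print("Conservative results contained in SDBA results:",
      contained_in(conservative, sdba))
```

With these counts the Conservative matching rate is 44/169, roughly 26%, against 100% for SDBA.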
We verify the packet-matching rate of SDBA by comparing
it with the Greedy algorithm. Fig. 5 again shows only part
of the packet-matching comparison between SDBA
and the Greedy algorithm. It indicates that most of the
RTTs are consistent, but a few of them are not. Among
the 169 RTTs, 157 RTTs of the Greedy matches are included
in the results of SDBA. But we are not sure about the
[Fig. 5 plot: RTT value (microsecond, x 10^5) against RTT index; series: SDBA, Greedy]
Fig. 5 – Packet-matching comparison between the SDBA and
the Greedy with send–echo pair overlaps.
correctness of the other 12 RTTs of the Greedy algorithm
(for clarity only 7 points are displayed in Fig. 5) until we
compare them with the results of the Conservative algorithm,
which should always give correct results. After this
comparison, we found that at least 4 of the 12 RTTs are
potentially incorrect. Compared with the RTTs found by the
Greedy algorithm, the RTTs found by SDBA are closer to the
ones found by the Conservative algorithm. The experimental
results show that SDBA can compete favorably not only against
the Conservative algorithm in packet-matching accuracy but
also against the Greedy algorithm in packet-matching rate.
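The procedure described above, finding the sends on which two algorithms disagree and then judging those sends against a trusted reference, can be sketched as follows. All data are synthetic and chosen only to mirror the reported counts (12 disputed RTTs out of 169, of which 4 can be checked against the reference); the function names and RTT values are assumptions, not the experimental data.

```python
def disputed_sends(a, b):
    """Send indices matched by both algorithms but with different RTT gaps."""
    return sorted(i for i in a.keys() & b.keys() if a[i] != b[i])


def judge(reference, a, b):
    """For disputed sends that the reference also matched, count who agrees with it."""
    tally = {"a": 0, "b": 0, "neither": 0}
    for i in disputed_sends(a, b):
        if i not in reference:
            continue  # the reference says nothing about this send
        if reference[i] == a[i]:
            tally["a"] += 1
        elif reference[i] == b[i]:
            tally["b"] += 1
        else:
            tally["neither"] += 1
    return tally


# Synthetic results: send index -> RTT gap in microseconds.
sdba = {i: 265_000 + 50 * (i % 5) for i in range(169)}
greedy = dict(sdba)
for i in range(100, 112):  # 12 sends where the greedy result differs
    greedy[i] = sdba[i] + 400_000
# Trusted reference covering 44 sends, 4 of which are disputed.
conservative = {i: sdba[i] for i in list(range(40)) + [100, 103, 106, 109]}

disputed = disputed_sends(sdba, greedy)
print("consistent RTTs:", len(sdba) - len(disputed))
print("disputed RTTs:  ", len(disputed))
print("judged against the reference:", judge(conservative, sdba, greedy))
```

Exact equality of RTT gaps is used here for simplicity; a tolerance could be substituted if the two algorithms timestamp packets differently.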
5. Conclusion and future work
Estimating the length of a downstream connection chain is an
effective way to detect stepping-stone intrusion. The core
technique in estimating the length of a connection chain is
to compute the round-trip time for each send packet by
matching send and echo packets through the chain. We
have proposed the SDBA approach to compute round-trip
time, and a way to evaluate SDBA by probabilistic analysis.
SDBA takes advantage of the fact that the RTTs of a connection
chain cluster around a value that reflects the average network
traffic.
SDBA can compete against the best known packet-matching
algorithm in both matching rate and accuracy. We have
proved that the probability of making a correct selection of
RTT through SDBA is bounded below by $1 - (1/q^2)$, where q
is a number related to the distribution of RTTs and the
inter-arrival distribution of send packets. Experimental
results from real cases showed that SDBA computes correct
RTTs with a probability higher than 97%.
Some problems with SDBA remain. The algorithm is somewhat
inefficient in time complexity; finding a more efficient one
is ongoing future work, although we have discussed it
briefly in Section 3.1. In addition, SDBA can only compute
the packet RTTs for the downstream part of a connection
chain. Finding the packet RTTs for the upstream part is more
challenging, and it would provide a better estimate of the
connection chain length and thus better stepping-stone
detection.
References
Feller W. An introduction to probability theory and its applications, vol. I. New York: John Wiley & Sons, Inc.; 1968.
Jain A, Dubes R. Algorithms for clustering data. New Jersey: Prentice Hall, Inc.; 1988. p. 55–143.
Johnson Norman L, Kotz Samuel. Continuous univariate distributions-1. New York: John Wiley & Sons, Inc.; 1970. p. 166–97.
Kao E. An introduction to stochastic processes. New York: Duxbury Press; 1996.
Mirkin B. Mathematical classification and clustering. Dordrecht, The Netherlands: Kluwer Academic Publishers; 1996. p. 169–98.
Yang Jianhua, Huang Shou-Hsuan Stephen. Matching TCP packets and its application to the detection of long connection chains. In: IEEE proceedings of 19th international conference on advanced information networking and applications (AINA’05), Taipei, Taiwan; March 2005. p. 1005–10.
Ylonen T. SSH – secure login connections over the Internet. In: Sixth USENIX Security Symposium, San Jose, CA, USA; 1996. p. 37–42.
Ylonen T. SSH protocol architecture (draft–IETF document), <http://www.ietf.org/internet-drafts/draft-ietf-secsh-architecture-16.txt>; June 2004a.
Ylonen T. SSH Transport layer protocol (draft IETF document), <http://www.ietf.org/internet-drafts/draft-ietf-secsh-transport-18.txt>; June 2004b.
Yoda K, Etoh H. Finding Connection Chain for Tracing Intruders. In: Proceedings of the sixth European symposium on research in computer security (LNCS 1985), Toulouse, France; 2000. p. 31–42.
Yung Kwong H. Detecting long connecting chains of interactive terminal sessions, RAID 2002. Zurich, Switzerland: Springer Press; October 2002. p. 1–16.
Yin Zhang, Vern Paxson. Detecting stepping-stones. In: Proceedings of the ninth USENIX security symposium, Denver, CO; August 2000. p. 67–81.
Dr. Jianhua Yang is an Assistant Professor in the Department
of Mathematics and Computer Science at Bennett College for
Women, Greensboro, NC. His research interests are computer,
network and information security. Dr. Yang earned his Ph.D.
in Computer Science at the University of Houston. Before
joining Bennett College, Dr. Yang was an Associate Professor
at Beijing Institute of Computer Technology, Beijing, China,
from 1990 to 2000. He is currently a member of IEEE. Dr. Yang
can be reached at [email protected].
Dr. Shou-Hsuan Stephen Huang is a professor of Computer
Science at the University of Houston. His research interests in-
clude data structures and algorithms, intrusion detection and
computer security. Stephen Huang received his Ph.D. degree
from the University of Texas – Austin. He is a senior member
of the IEEE Computer Society. Dr. Huang can be reached at