probabilistic analysis of an algorithm to compute tcp packet round-trip time for intrusion detection



available at www.sciencedirect.com

journal homepage: www.elsevier.com/locate/cose

computers & security 26 (2007) 137–144

Probabilistic analysis of an algorithm to compute TCP packet round-trip time for intrusion detection

Jianhua Yang*, Shou-Hsuan Stephen Huang

Department of Computer Science, University of Houston, 4800 Calhoun Road, Houston, TX 77204, USA

article info

Article history:

Received 27 January 2006

Accepted 15 August 2006

Keywords:

Network security

Intrusion detection

Round-trip time

Stepping-stone

TCP packet

abstract

Estimating the length of a connection chain is both challenging and critical in detecting stepping-stone intrusion. In this paper, we propose a novel method, called the standard deviation-based clustering approach (SDBA), to estimate the length of an interactive connection chain by computing round-trip time (RTT). SDBA takes advantage of the distribution of RTTs and the inter-arrival distribution of "send" packets. We prove that the probability of making a correct selection of RTT through SDBA is bounded below by $1 - 1/q^2$, where $q$ is a number related to the standard deviation of the RTT distribution and the inter-arrival distribution of send packets. Experimental results show that SDBA can compete against the best known algorithm in packet-matching rate and accuracy. This paper also presents the restrictions of SDBA.

© 2006 Elsevier Ltd. All rights reserved.

1. Introduction

Using stepping-stones (Zhang and Paxson, 2000) to attack others, called stepping-stone intrusion, has become popular since the Internet came into wide use in every aspect of human life. An important step in preventing this kind of attack is to detect it accurately and efficiently. Many methods have been proposed to detect stepping-stone intrusion, such as those discussed by Zhang and Paxson (2000) and Yoda and Etoh (2000), but they suffer from (1) a high false positive rate and (2) vulnerability to an intruder's manipulation. One important way of overcoming these two problems is to estimate the length of the downstream connection chain by computing packet round-trip time, even though this may introduce false negative errors because it neglects the upstream part of the connection chain. Here the 'length' means the number of connections in a chain. If a connection chain is estimated to include more than three or four connections, there is a high probability that the chain is being used by an intruder, since there is no reason to access a host through a long chain rather than a direct connection except in some very special applications.

Estimating the length of the downstream part of a connection chain has become an important part of detecting stepping-stone intrusion. So far there are basically two ways to do this: one is the echo-delay comparison proposed by Yung (2002), and the other is to match each send packet with its corresponding echo packet, as proposed by Yang and Huang (2005). The basic idea of Yung (2002) is to estimate the length of the whole downstream connection chain by using the reply echo packet on the chain, to compute the length of the connection to the nearest downstream host by using the delayed acknowledgement, called a yardstick, and then to compare the two to decide how long the downstream connection chain is. The problem with this method comes from the selection of the yardstick. If the yardstick is long relative to the other connections, it is likely to introduce false negative errors; otherwise the false positive rate is going to be high. Another problem is that the method used by Yung (2002) to compute packet round-trip time in a connection chain may be inaccurate, especially when the network traffic is non-uniform. Yang and Huang's (2005) method of estimating the length of a connection chain by computing packet round-trip time focuses on matching each send packet with its corresponding echo packet precisely. Two algorithms were proposed to match TCP/IP send and echo packets: the Conservative algorithm and the Greedy algorithm (Yang and Huang, 2005). The Conservative algorithm gives accurate packet matches, but only relatively few of them. The Greedy algorithm can 'match' all the send packets, but some of the 'matches' have low confidence of correctness. Neither algorithm balances packet-matching rate and accuracy well.

* Corresponding author. The Department of Mathematics and Computer Science, Bennett College, 900 E. Washington Street, Greensboro, NC 27401, USA.

E-mail address: [email protected] (J. Yang).

0167-4048/$ – see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.cose.2006.08.011

Matching TCP send and echo packets precisely, or computing TCP packet round-trip time accurately, is challenging and significant in detecting stepping-stone intrusion. In this paper, we propose a novel method, SDBA, that computes packet round-trip times more accurately than the two methods of Yang and Huang (2005). We also evaluate SDBA by estimating, through the Chebyshev inequality, the probability of making a correct selection of RTT. Compared with the Conservative algorithm, SDBA can match most of the send packets with the same correctness, and it can compete against the Greedy algorithm in the number of packets matched while achieving a higher matching accuracy. The experimental results show that SDBA balances packet-matching rate and accuracy well. Another contribution of this paper is that it indicates how good each packet match made by SDBA is, through a probabilistic analysis of making a correct selection of RTT. The time complexity of SDBA is also presented.

The rest of this paper is arranged as follows. Section 2 discusses the motivation for proposing SDBA to compute packet round-trip time. Section 3 describes SDBA in detail and presents its probabilistic analysis. Section 4 presents some experimental results. Finally, in Section 5 we summarize this paper, discuss the limitations of SDBA, and present some future work.

2. Motivation

Matching each send packet with its corresponding echo packet is trivial when there are no send–echo pair overlaps. In other words, it is easy to match each send packet with its corresponding echo packet when the echo packet is always received before the next packet is sent. This is often the case in a local area network, but not on the Internet, where send–echo pair overlap occurs often because of network delay and competition for CPU time on the host. For efficiency, the TCP/IP protocol allows several send packets to be echoed by one or more packets, and allows the next packet to be sent before the previous one is acknowledged (Ylonen, 2004a,b), both of which complicate packet matching. There is no marker available to identify each packet on the Internet. The only unencrypted information we can use to identify a packet is its size, timestamp, and sequence number. Obviously, the size of a packet cannot leak any information useful for identifying it because it depends on an encryption key. The Conservative and Greedy algorithms, which take advantage of the sequence number to match TCP/IP send and echo packets, have difficulty achieving a high matching rate and high accuracy at the same time. When several unmatched send packets are followed by several echo packets, there is no way to know which echo packet matches which send packet, except that the first send packet always matches the first echo packet; this is the strategy used by the Conservative algorithm. The strategy used by the Greedy algorithm is to match them in order, which incurs some error because the mapping is most probably not one-to-one.

The essential problem of the Conservative and Greedy algorithms is that they only look at the packets locally when they try to match each send packet. If we look at the packets globally rather than locally, we may have additional information to identify which send packet matches which echo packet, even when there is send–echo overlap. Here we take advantage of the timestamp of each packet and of the fact that, even though the packet RTTs of a connection chain fluctuate because of the uncertainty of network delay and host load, they should cluster around a specific RTT value that represents the average traffic of the network. In other words, the fluctuation of the real RTTs of send packets on a connection chain over a period of time is expected to be small with high probability.

We assume that several consecutive send packets are followed by several consecutive echo packets, by which each send packet must be echoed. Even though we do not know which send packet corresponds to which echo packet or packets, we do know that there must be one or more echo packets corresponding to a specific send packet, say $s_i$. So we simply assume that every echo packet may match $s_i$ and compute the gap between each echo packet and $s_i$, keeping only the positive values. This gives a data set for send packet $s_i$; the data sets for the other send packets are formed in the same way. If we look only at the data set of potential RTTs for $s_i$, there is still no way to know which one is the real RTT of $s_i$. But if we look at the data sets for several consecutive send packets, we find a very interesting phenomenon: there is one element in each data set that is close to one element in each of the other data sets. The more data sets there are, the higher the confidence. That is, if we check send and echo packets globally, there is a way to detect which send packet matches which echo packet, whatever the scenario is.

The problem thus becomes how to extract each RTT from the data sets formed for the send packets. One obvious way to do this is to use a data clustering method. There are many clustering approaches, such as those proposed by Mirkin (1996) and Jain and Dubes (1988), but most of them share a common problem: the final clustering result depends on a predefined precision parameter $\varepsilon$. However, the RTT sequence of a connection chain in a certain time period is supposed to be unique. Searching for a unique RTT sequence rules out most existing clustering methods, especially those that only work with $\varepsilon$. We studied the distribution of the real RTTs of an interactive session and found that most RTTs are concentrated around their mean value, with very few outliers. This phenomenon motivates us to take advantage of the fluctuation of RTTs to extract them. We use the standard deviation of RTTs to characterize their fluctuation. The standard deviation of the real RTTs is expected to be smaller, with high probability, than that of any other combination formed by picking one element from each data set at random. We assume that the combination with the smallest standard deviation among all combinations drawn from the data sets has the highest probability of being the real packet RTTs. These are the reasons that motivated us to study SDBA, a standard deviation-based clustering approach, to compute the packet RTTs of an interactive connection chain.

3. The algorithm and its probabilistic analysis

3.1. RTT algorithm

We are given two sequences $S = \{s_1, s_2, \ldots, s_n\}$ and $E = \{e_1, e_2, \ldots, e_m\}$, where $s_1, s_2, \ldots, s_n$ are send packets and $e_1, e_2, \ldots, e_m$ are echo packets corresponding to the send packets in $S$. If a packet $s_i$ in $S$ is echoed by a packet $e_j$ in $E$, we write $s_i \propto e_j$. If all the packets are captured on a host in a connection chain over a period of time, the following conditions must be satisfied:

(1) Any send packet in $S$ must be echoed by one or more packets in $E$; similarly, any echo packet in $E$ must echo one or more send packets in $S$;

(2) Packets in both $S$ and $E$ are stored in chronological order;

(3) For any two packets $s_i$, $s_j$ in $S$ and $e_p$, $e_q$ in $E$, if $s_i \propto e_p$, $s_j \propto e_q$, and $i < j$, then we have $p \le q$.

Condition (1) indicates that the relationship between a send packet and its corresponding echo packets may be one-to-one, many-to-one, or one-to-many. The RTT of a send packet is defined as the gap between the timestamp of the send packet and that of its corresponding echo packet if the relationship between them is one-to-one. However, if the relationship is many-to-one or one-to-many, the gap is not unique. If $k$ send packets $s_i$ to $s_{i+k-1}$ are echoed by $e_j$, the RTT of those send packets is defined as the gap between the timestamp of $s_{i+k-1}$ and that of $e_j$, i.e., the smallest gap. Similarly, if a send packet $s_i$ is echoed by $k$ packets $e_j$ to $e_{j+k-1}$, only packets $s_i$ and $e_j$ are involved in the RTT definition of $s_i$. Conditions (2) and (3) guarantee that send packets are replied to sequentially. Each send packet must be echoed by one or more packets successfully at one time, and the RTT of a send packet must be positive. All three conditions can be justified from the TCP/IP protocol (Ylonen, 2004a,b).

We compute the gaps between the timestamp of each echo packet in $E$ and those of all the send packets in $S$. It is safe to eliminate the negative values, as an RTT must be positive. We group these differences into sets according to each echo packet in $E$, forming data sets $E_1, E_2, \ldots, E_m$ for echo packets $e_1, e_2, \ldots, e_m$, respectively:

$$
\begin{aligned}
E_1 &= \{s_1e_1, s_2e_1, \ldots, s_ne_1\},\\
E_2 &= \{s_1e_2, s_2e_2, \ldots, s_ne_2\},\\
&\;\;\vdots\\
E_m &= \{s_1e_m, s_2e_m, \ldots, s_ne_m\},
\end{aligned}
$$

where element $s_ie_j$ in $E_j$ represents the gap $e_j - s_i$ between the timestamp of the $j$th echo packet in $E$ and that of the $i$th send packet in $S$, with $1 \le i \le n$ and $1 \le j \le m$. For convenience of analysis, here we generate each data set based on an echo packet in $E$. This is equivalent to generating each data set based on a send packet in $S$, as discussed in Section 2.
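As a hedged illustration of the construction above, the following Python sketch builds the data sets $E_j$ from two lists of timestamps. The function name `build_gap_sets` and the sample timestamps are assumptions made for this example; they do not come from the paper.

```python
def build_gap_sets(sends, echoes):
    """Return one dict per echo packet e_j, mapping send index i -> gap e_j - s_i.

    Only positive gaps are kept, because an RTT must be positive.
    Timestamps are assumed to be in microseconds and in chronological order.
    """
    gap_sets = []
    for e in echoes:
        gaps = {i: e - s for i, s in enumerate(sends) if e - s > 0}
        gap_sets.append(gaps)
    return gap_sets


# Hypothetical timestamps (microseconds), chosen only for illustration.
sends = [0, 100_000, 220_000]
echoes = [150_000, 255_000, 372_000]
gap_sets = build_gap_sets(sends, echoes)
for j, gaps in enumerate(gap_sets, start=1):
    print(f"E_{j} = {gaps}")
```

In this made-up trace every data set contains one gap close to 150,000 microseconds (150,000, 155,000, and 152,000), which is the clustering behaviour that Section 2 describes.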

We know that each send packet can only be echoed by one or more packets successfully at one time, which indicates that in each data set $E_j$ there is only one element that represents a real RTT. We construct a cluster $X_u$, which is one candidate for the RTT sequence, by taking one element from each data set $E_j$ ($1 \le j \le m$) and storing them in $X_u$ in the chronological order of the echo packets in $E$. If we go through all the possible combinations, there are up to $n^m$ possible clusters, but only one of them represents the correct RTTs for all echo packets. Each cluster $X_u$ ($1 \le u \le n^m$) has $m$ elements, some of which may share the same echo or send packet. Among the send (echo) packets that share the same echo (send) packet, we keep only the one with the smaller gap and remove the others. Eventually we get clusters $X_u$ ($1 \le u \le n^m$) in which each element relates to a unique send packet and a unique echo packet. Each cluster $X_u$ is a candidate for the RTTs of the send packets in $S$, while only one of them is the real one, or close to the real one, with high probability. We select the cluster with the smallest standard deviation among all clusters $X_u$ ($1 \le u \le n^m$) to be the RTTs of the packets in $S$. The following is the simplified algorithm, called the standard deviation-based clustering approach (SDBA), to compute packet RTTs given send packet sequence $S$ and echo packet sequence $E$.

Algorithm SDBA(S, E)

Begin:

(1) Generate data sets $E_j$ ($1 \le j \le m$): $E_j = \{t(i, j) \mid t(i, j) = e_j - s_i,\ i = 1, \ldots, n\ \&\ t(i, j) > 0\}$.

(2) Combine data from sets $E_j$ ($1 \le j \le m$) to form clusters $X_u$ ($1 \le u \le n^m$): $X_u = \{t(i_j, j) \in E_j \mid \forall\, 1 \le j \le m\ \&\ i_1 < i_2 < \cdots < i_m\}$.

(3) For each cluster $X$: (a) if $x(i, j), x(i, k) \in X$, $j < k$, then delete $x(i, j)$, and (b) if $x(i, j), x(k, j) \in X$, $i < k$, then delete $x(k, j)$.

(4) Output $R = \{r_1, r_2, \ldots, r_s\}$ ($1 \le s \le n$), which is the cluster $X$ with the smallest standard deviation among all $X_u$ ($1 \le u \le n^m$).

End
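The steps above can be sketched as a brute-force search in Python. This is only an illustrative reading of the algorithm, not the authors' implementation: it takes the index constraint of step (2) literally (strictly increasing send indices), under which no two elements of a cluster share a send or echo packet, so step (3) is not needed in the sketch. The name `sdba_bruteforce`, the use of the population standard deviation, and the sample timestamps are all assumptions.

```python
from itertools import product
from statistics import pstdev


def sdba_bruteforce(sends, echoes):
    """Exhaustively pick one positive gap per echo packet, minimising the std. dev."""
    # Step (1): data sets E_j of (send index, gap) pairs with positive gaps only.
    gap_sets = [[(i, e - s) for i, s in enumerate(sends) if e - s > 0]
                for e in echoes]
    best, best_sd = None, float("inf")
    # Step (2): every combination with one element from each E_j.
    for combo in product(*gap_sets):
        idx = [i for i, _ in combo]
        # Keep only clusters whose send indices are strictly increasing.
        if any(a >= b for a, b in zip(idx, idx[1:])):
            continue
        # Step (4): remember only the cluster with the smallest std. dev. so far.
        sd = pstdev([gap for _, gap in combo])
        if sd < best_sd:
            best, best_sd = list(combo), sd
    return best, best_sd


# Hypothetical trace (microseconds): the second and third sends share one echo.
sends = [0, 100_000, 130_000, 300_000]
echoes = [150_000, 282_000, 451_000]
best, sd = sdba_bruteforce(sends, echoes)
print(best, round(sd))
```

On this made-up trace the selected cluster uses sends 0, 2, and 3 with gaps of 150,000, 152,000, and 151,000 microseconds. The search is exponential in the number of echo packets, so it is only usable on very short segments.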

Let us analyze the time and space complexity of this algorithm. The time complexity of SDBA in the worst case is $O(n^m)$. This costs too much CPU time and makes SDBA inefficient for practical use. The space complexity, however, is not a serious problem for SDBA because it is not necessary to store all the combinations; we only need to remember the combination with the smallest standard deviation.

The smaller $m$ and $n$ are, the lower the running time. We can reduce $m$ and $n$ by dividing a user's long packet stream into subsections, based on the following observation: if the gap between two consecutive send packets $s_i$ and $s_{i+1}$, with some echo packets in between, is larger than a predefined threshold (usually 1–5 s), it is reasonable to assume that all the echo packets after $s_{i+1}$ only echo the send packets from $s_{i+1}$ onward. Even so, by monitoring real Internet traffic we found that the number of send and echo packets in a subsection can be in the hundreds. So the time complexity of SDBA is a serious problem, even though it could give better results in computing RTTs than other methods.
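The subdivision rule described above might be sketched as follows. The function name `split_stream`, the `(timestamp, kind)` tuple layout, and the 1 s default threshold are assumptions made for this example.

```python
def split_stream(packets, threshold_us=1_000_000):
    """Split a chronological list of (timestamp_us, kind) packets into subsections.

    kind is "send" or "echo". A new subsection starts at a send packet when the
    gap to the previous send exceeds threshold_us and at least one echo packet
    was seen in between.
    """
    segments, current = [], []
    last_send_ts = None
    echo_since_last_send = False
    for ts, kind in packets:
        if kind == "send":
            if (last_send_ts is not None and echo_since_last_send
                    and ts - last_send_ts > threshold_us):
                segments.append(current)
                current = []
            last_send_ts = ts
            echo_since_last_send = False
        else:
            echo_since_last_send = True
        current.append((ts, kind))
    if current:
        segments.append(current)
    return segments


# Hypothetical trace: a pause of about 2.8 s before the third send.
packets = [(0, "send"), (150_000, "echo"), (200_000, "send"),
           (350_000, "echo"), (3_000_000, "send"), (3_150_000, "echo")]
segments = split_stream(packets)
print([len(seg) for seg in segments])
```

Each subsection can then be processed on its own, which keeps $n$ and $m$ small for the search.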

The inefficiency of SDBA comes from its global computation, that is, examining all the possible clusters $X$ to find the RTTs. To make SDBA more efficient, we can replace the global computation by a local computation. Instead of generating all the possible clusters, we only generate the clusters that are the most likely candidates for the RTTs. Taking one element of the first data set $E_1$ as the first element of a candidate $X$, we then pick from the second data set $E_2$ the element that gives $X$ the smallest standard deviation when it is added to $X$. Similarly, from each of the other data sets $E_j$ ($3 \le j \le m$) we take the element that gives $X$ a smaller standard deviation than any other element of that data set, and add it to $X$. Consequently, for the $n$ elements of data set $E_1$ we generate $n$ candidates, and we select the one with the smallest standard deviation among them as the RTTs. Suppose there are $n$ elements in $E_1$ (the worst case) and $m$ data sets; the complexity of this algorithm is $O(n \times (n + n) \times (m - 1)) = O(m \times n^2)$. Compared with the time complexity of SDBA, this is clearly more efficient. The problem is that we cannot guarantee the correctness of its results because we do not go over all the possible combinations.
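A minimal sketch of this local computation is given below, assuming each data set is a non-empty list of positive gaps. The name `sdba_local` and the sample values are assumptions. For brevity the sketch recomputes the standard deviation from scratch for every trial element, so it does not reach the $O(m \times n^2)$ bound quoted above (running sums would be needed for that), and it does not enforce the ordering constraints between send indices.

```python
from statistics import pstdev


def sdba_local(gap_sets):
    """Greedy variant: seed with each element of E_1, then extend set by set."""
    best, best_sd = None, float("inf")
    for seed in gap_sets[0]:
        candidate = [seed]
        for gaps in gap_sets[1:]:
            # Add the element that keeps the candidate's std. dev. smallest.
            candidate.append(min(gaps, key=lambda g: pstdev(candidate + [g])))
        sd = pstdev(candidate)
        if sd < best_sd:
            best, best_sd = candidate, sd
    return best, best_sd


# Hypothetical data sets E_1..E_3 (microseconds), for illustration only.
gap_sets = [
    [150_000, 50_000, 20_000],
    [282_000, 182_000, 152_000],
    [451_000, 351_000, 321_000, 151_000],
]
best, sd = sdba_local(gap_sets)
print(best, round(sd))
```

On these values the greedy search returns the same cluster that an exhaustive search would find, but as the text notes, that agreement is not guaranteed in general.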

Although we cannot prove the correctness of the result of SDBA either, we can evaluate the results of SDBA by computing the probability of making a correct selection of RTT in SDBA when the cluster $X$ with the smallest standard deviation is chosen among all the possibilities.

3.2. Probabilistic analysis of SDBA

We first estimate the probability of making an incorrect choice of RTT for any element $r_j$ of a cluster $R$. We assume that the distribution of $R$ is $Z$ with mean $\mu_1$ and standard deviation $\sigma_1$, and that the send packet interval distribution is $Y$ with mean $\mu_2$ and standard deviation $\sigma_2$. Suppose $r_j$ is selected from $E_j = \{s_1e_j, s_2e_j, \ldots, s_{k-1}e_j, s_ke_j, s_{k+1}e_j, \ldots, s_ne_j\}$, and assume that the correct selection is $s_ke_j$ but an incorrect element of $E_j$ is selected by the algorithm. For $R$ to have the smallest standard deviation, the incorrectly selected element of $E_j$ must be closer to $\mu_1$ than $s_ke_j$. Because the elements of $E_j$ are in descending order, only the two elements $s_{k-1}e_j$ and $s_{k+1}e_j$ have the highest probability of being selected incorrectly. Here we assume $s_{k+1}e_j$ is closer to $\mu_1$ than $s_{k-1}e_j$. Consequently, $s_{k+1}e_j$ is selected incorrectly only if

$$|s_{k+1}e_j - \mu_1| < |s_ke_j - \mu_1|. \tag{1}$$

We assume the smallest interval between send packets is $L$. Most probably the interval between $s_k$ and $s_{k+1}$ is larger than $L$; in the worst case it is equal:

$$s_{k+1} - s_k = L = 2q\sigma_1 \tag{2}$$

Here $q$ is a real number. From Eq. (2), for any echo packet $e_j$, we have

$$
\begin{aligned}
s_{k+1} - e_j + e_j - s_k &= 2q\sigma_1\\
(s_{k+1} - e_j) - (s_k - e_j) &= 2q\sigma_1\\
s_{k+1}e_j - s_ke_j &= 2q\sigma_1\\
s_{k+1}e_j - \mu_1 + \mu_1 - s_ke_j &= 2q\sigma_1\\
|s_{k+1}e_j - \mu_1| + |s_ke_j - \mu_1| &= 2q\sigma_1
\end{aligned}
\tag{3}
$$

From Eqs. (1) and (3), we have

$$|s_ke_j - \mu_1| > q\sigma_1.$$

We estimate the probability that $r_j$ is selected incorrectly by using the Chebyshev inequality (Kao, 1996; Feller, 1968) as follows:

$$
\begin{aligned}
P(r_j \text{ is selected incorrectly}) &= P(s_{k+1}e_j \text{ is selected})\\
&= P(|s_{k+1}e_j - \mu_1| < |s_ke_j - \mu_1|)\\
&= P(|s_ke_j - \mu_1| > q\sigma_1) < \frac{1}{q^2}
\end{aligned}
$$

The probability of making an incorrect selection of a packet RTT is therefore bounded if we select the cluster with the smallest standard deviation among all $X_u$ ($1 \le u \le n^m$) as the RTT sequence. In other words, the probability of making a correct selection of a packet RTT can be estimated by the following inequality:

$$P(r_j \text{ is selected correctly}) \ge 1 - \frac{1}{q^2}. \tag{4}$$

The parameter $q$ varies depending on the inter-arrival distribution of the send packets and the RTT distribution.
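To give a feel for the numbers, the short Python sketch below evaluates $q = L / (2\sigma_1)$ from Eq. (2) and the bound of Eq. (4). The function name and the input values (a 40,000 microsecond interval and a 4,000 microsecond RTT standard deviation) are hypothetical and chosen only to keep the arithmetic simple.

```python
def correct_selection_bound(interval_us, rtt_std_us):
    """Return (q, lower bound on P(correct)) using L = 2*q*sigma_1 and Eq. (4)."""
    if interval_us <= 0 or rtt_std_us <= 0:
        raise ValueError("interval and standard deviation must be positive")
    q = interval_us / (2.0 * rtt_std_us)
    # For q <= 1 the Chebyshev bound is vacuous, so clamp it at zero.
    bound = max(0.0, 1.0 - 1.0 / (q * q))
    return q, bound


# Hypothetical inputs, in microseconds.
q, bound = correct_selection_bound(40_000, 4_000)
print(f"q = {q:.2f}, P(correct) >= {bound:.2f}")
```

With these inputs $q = 5$ and the bound is 0.96. The bound tightens quickly as the send interval grows relative to the spread of the RTTs.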

3.3. Estimation of parameter q

The parameter $q$ is determined by the smallest interval $L$ of distribution $Y$, so the probability of making an incorrect selection of a packet RTT is determined by $L$. If the interval between two consecutive send packets is the smallest one in $Y$, we get the lowest bound on the probability. However, the probability that the interval between two consecutive send packets takes the smallest value in $Y$ is very small. So in practice we usually do not use the smallest interval to estimate the parameter $q$. Instead, we use the interval $L_p$ for which the cumulative probability $P(x < L_p)$ in $Y$ is 5%. We estimate $L_p$ under the assumption that $Y$ is a Gamma distribution with shape parameter $\beta$ and scale parameter $\alpha$. In other words, we select the $L_p$ that satisfies the following equation:

$$\int_0^{L_p} \frac{(x/\alpha)^{\beta-1} e^{-x/\alpha}}{\alpha\,\Gamma(\beta)}\,dx = 0.05 \tag{5}$$

where $\Gamma(\beta) = \int_0^{\infty} e^{-u} u^{\beta-1}\,du$.

We can compute $L_p$ from Eq. (5) if $\beta$ and $\alpha$ are known. Parameters $\beta$ and $\alpha$ vary with keystroke behaviour and network environment. The usual way to estimate $L_p$ is to take a sample of send packet inter-arrival times, estimate the parameters $\beta$ and $\alpha$ by MLE (maximum likelihood estimation) or other methods (Johnson and Kotz, 1970), and then compute $L_p$ for the distribution with the estimated parameters. This approach is appropriate for an individual computation, but not convenient for probabilistic analysis. Instead, we estimate the range of the parameters $\beta$ and $\alpha$ of the interval distribution of send packets, and compute the range of $L_p$ from the range of the estimated parameters. We use the lower bound of $L_p$ to compute the probability that one element is selected correctly in SDBA. The probability estimated by using Eq. (4) then indicates how well SDBA performs.
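Solving Eq. (5) for $L_p$ amounts to finding the 5% quantile of a Gamma distribution. The sketch below does this with the Python standard library only, using the power series of the lower incomplete gamma function and bisection. The function names are assumptions, and this is one possible numerical approach rather than the procedure used in the paper. The parameters are those of the first sample in Table 1.

```python
import math


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the series gamma(a, z) = z^a e^-z * sum z^n / (a (a+1) ... (a+n))."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (shape + n)
        total += term
    return total * math.exp(shape * math.log(z) - z - math.lgamma(shape))


def gamma_quantile(p, shape, scale, tol=1e-9):
    """Find x with gamma_cdf(x) = p by bisection."""
    lo, hi = 0.0, shape * scale
    while gamma_cdf(hi, shape, scale) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    return 0.5 * (lo + hi)


# beta (shape) and alpha (scale, microseconds) of sample 1 in Table 1.
lp = gamma_quantile(0.05, shape=2.043, scale=137_280)
print(round(lp))
```

The result should land close to the tabulated $L_p$ of 51,115 microseconds for that sample; a small difference is expected because the tabulated parameters are rounded.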

We ran many experiments with different users and different environments (i.e., connection chains with different paths) on the Internet and present some typical examples in Table 1, where the send packet inter-arrival times of each sample, as well as $L_p$, are in microseconds. From the experiments we know that the interval $L_p$ of send packets with cumulative probability 0.05 ranges approximately from 32,000 to 52,000. There is no way to predict the exact range of $L_p$ by experiment alone, without further theoretical analysis. In Section 4, we use the lower bound of $L_p$ to compute the probability of making a correct selection of RTT, in order to evaluate how well SDBA finds the packet RTTs of an interactive session.

4. Empirical study

We have discussed that we can use a Gamma distribution to model the inter-arrival times of send packets in an interactive session. This is the foundation of our probabilistic analysis. So in this section, we first show that it is reasonable to model the inter-arrival distribution of send packets by a Gamma distribution. We have proved that the probability of making a correct selection of RTT in SDBA is bounded by the bound given in Eq. (4); we then work through some real examples to give a practical sense of this bound. SDBA can compete against the Conservative and the Greedy algorithms in both packet-matching rate and accuracy, so finally we compare the performance of SDBA, the Conservative algorithm, and the Greedy algorithm on packet matching.

4.1. Inter-arrival distribution of send packets

We established a connection chain spanning the U.S. and Mexico by using OpenSSH (Ylonen, 1996). There is at least one host, such as Acl08, on which we have administrator access, while we have only regular user rights on all the other hosts. At the starting point of the chain we asked several students to simulate intruders by typing commands independently, and collected all the send packets on the corresponding outgoing connection of Acl08. We computed the intervals of these send packets and used Matlab to fit their distribution. Before fitting the distribution, we first drew the histogram of these data to see what kind of distribution they resemble. We found that they look like a Gamma distribution with a shape parameter larger than one. We then used Matlab's distribution fitting function to estimate the shape parameter $\beta$ and scale parameter $\alpha$.

Table 1 – The range of Lp estimated by experiment

Sample   Size of sample   β        α (µs)   Lp (µs)
1        1297             2.043    137280   51115
2        990              1.956    137480   46448
3        816              1.4434   212600   33733
4        900              1.809    143970   40541
5        176              1.426    280220   43016
6        800              1.629    172720   37617
7        412              1.364    242270   32874

Once we have obtained these two parameters, we have a theoretical distribution determined by its shape parameter $\beta$ and scale parameter $\alpha$. We use the quantile–quantile function of Matlab to verify how well the Gamma distribution fits the sample. Fig. 1 shows the verification result for one typical example, where the X- and Y-axes have scale $10^5$.

In this example the shape and scale parameters are estimated to be 2.0426 and 137,280, respectively. From Fig. 1 we find that the points with intervals of more than 400,000 microseconds are not well fitted by the Gamma distribution, where the gray dashed line (red in the web version) indicates an ideal fit. But the points with intervals of less than 400,000 microseconds are modelled closely by this Gamma distribution with $\beta = 2.0426$ and $\alpha = 137{,}280$. We are confident about the value $L_p$ computed from the Gamma distribution because it is much less than 400,000.

4.2. Sample experiments

The key idea of the SDBA algorithm is to select the combination of send–echo gaps with the smallest standard deviation as the RTT sequence. The best way to verify this is to compare the RTT sequence from SDBA with the corresponding correct RTTs to see if they are consistent. The problem is that there is no general way to know the correct RTTs for the packets of an interactive session; if there were, we would not need the above algorithm. From Yang and Huang (2005) we know that matching each send packet with its corresponding echo packet is trivial when there is no send–echo pair overlap. So in our first experiment we control the keystroke speed to generate a scenario without send–echo pair overlap, which makes it easy to compute the correct RTTs. We then compare the RTT sequence from SDBA with these correct RTTs to verify whether SDBA computes the RTTs correctly. This validates SDBA on a controlled data set.

Another way to evaluate SDBA when there are send–echo pair overlaps, which occur often on the Internet and for which we do not have correct RTTs, is to justify its performance by computing the probability of making a correct selection of RTT.

Fig. 1 – Verification of send packets interval distribution. (Plot not reproduced; axes: X Quantiles and Y Quantiles.)

We established a connection chain similar to the one in the previous section. The students were asked to control their keystroke speed. We collected all the send and echo packets over a period of time at Acl08. First we matched the send and echo packets to compute the correct RTTs, and then we used the same set of send and echo packets as the input of SDBA to get its RTT sequence. We repeated the experiment many times; one of the comparisons is presented in Fig. 2, where the Y-axis represents the RTT value in microseconds and the X-axis represents the RTT index number. This experimental result shows that the RTTs from SDBA are exactly the same as the correct RTTs.
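The reference RTTs of this controlled experiment can be obtained by a trivial pairing, because without overlap every send is followed by its own echo before the next send. The following is a minimal sketch under that assumption; the trace format (time-ordered `(timestamp, kind)` tuples) and the function names are hypothetical and chosen only for illustration.

```python
def match_without_overlap(packets):
    """Pair each send with the echo that follows it and return the RTTs.

    `packets` is a time-ordered list of (timestamp_us, kind) tuples, where
    kind is "send" or "echo" (a hypothetical trace format). A ValueError is
    raised if the trace is not strictly send, echo, send, echo, ...
    """
    if len(packets) % 2 != 0:
        raise ValueError("unmatched packet at end of trace")
    rtts = []
    for i in range(0, len(packets), 2):
        (t_send, k_send), (t_echo, k_echo) = packets[i], packets[i + 1]
        if k_send != "send" or k_echo != "echo":
            raise ValueError("send-echo pair overlap at index %d" % i)
        rtts.append(t_echo - t_send)
    return rtts


def same_rtts(reference, candidate):
    """True when two RTT sequences agree element by element."""
    return list(reference) == list(candidate)


# Toy trace in microseconds (made-up values, not the paper's data).
trace = [
    (0, "send"), (265000, "echo"),
    (900000, "send"), (1164500, "echo"),
    (2000000, "send"), (2266000, "echo"),
]
reference_rtts = match_without_overlap(trace)
```

The RTT sequence produced by any other matching algorithm on the same trace can then be checked against `reference_rtts` with `same_rtts`, which mirrors the comparison plotted in Fig. 2.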

The second experiment is for the situation in which there are send–echo pair overlaps. The student participants typed independently and freely; we captured all the send and echo packets over a period of time and computed the RTTs with SDBA. We take Lp as its lower bound, 32,874, and compute the lower bound of the probability of making a correct selection of RTT by using Eq. (4). Three examples are presented in Table 2, where the second to the fifth columns are the average value of the RTTs with the smallest standard deviation (in microseconds), the standard deviation, the q number, and the bound of the probability, respectively. From the estimated probabilities, we are confident about the result from SDBA because the probabilities in these three examples are all higher than 97%. So even though we cannot compare the result from SDBA with correct RTTs, because we do not have them when there are send–echo pair overlaps, we can still evaluate SDBA by estimating the probability of making a correct selection of RTT.

Fig. 2 – Verification of SDBA under the situation without send–echo overlap. (Plot not reproduced; X-axis: RTT index; Y-axis: RTT value in microseconds, $\times 10^5$; series: Real RTT and SDBA.)

Table 2 – The results of probability estimation

Example   Mean RTT m (microseconds)   Standard deviation s   q        Probability bound p
1         264947.0                    2810.708               11.695   0.9927
2         265756.3                    5514.666               5.9612   0.9719
3         265727.2                    5549.605               5.9237   0.9715
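The last column of Table 2 can be reproduced from the q column with the bound $1 - 1/q^2$ that the paper proves. The sketch below does this arithmetic. It also prints the ratio Lp/s with Lp = 32,874, which happens to be numerically close to the tabulated q values; that closeness is an observation about the table and should not be read as a restatement of Eq. (4). The names `probability_bound` and `TABLE_2` are illustrative.

```python
LP_LOWER_BOUND = 32874  # lower bound of Lp in microseconds, from the text


def probability_bound(q):
    """Lower bound 1 - 1/q^2 on the probability of a correct RTT selection."""
    return 1.0 - 1.0 / (q * q)


# (mean RTT m, standard deviation s, q, p) as listed in Table 2.
TABLE_2 = [
    (264947.0, 2810.708, 11.695, 0.9927),
    (265756.3, 5514.666, 5.9612, 0.9719),
    (265727.2, 5549.605, 5.9237, 0.9715),
]

for m, s, q, p in TABLE_2:
    bound = probability_bound(q)
    ratio = LP_LOWER_BOUND / s  # assumption: compared with q for illustration only
    print(f"q={q:<7} bound={bound:.4f} table p={p} Lp/s={ratio:.4f}")
```

Because the bound grows with q, a smaller standard deviation of the selected RTTs (Example 1) gives a noticeably higher bound than the two examples with roughly twice the deviation.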

4.3. Packet-matching algorithm comparison

The Conservative algorithm is supposed to give correct packet-matching results (Yang and Huang, 2005), but it matches only a few packets when there are send–echo pair overlaps. If there is no send–echo pair overlap, both the Conservative and the Greedy algorithms are supposed to match TCP packets correctly.

First, we compare SDBA with the Conservative and Greedy algorithms in the situation where there is no send–echo pair overlap. As before, in this experiment we kept our typing speed as slow as possible to be sure that there was no send–echo pair overlap. The three algorithms ran on Acl08 over the same time interval to monitor the same connection chain. Part of the packet-matching results of the three algorithms is shown in Fig. 3, where each point represents the RTT gap for a send packet. The result in Fig. 3 shows that, if there is no send–echo pair overlap, the three methods give the same packet-matching results and compute the same RTTs.

Fig. 3 – Packet-matching comparison among the Conservative, Greedy, and SDBA without send–echo pair overlaps. (Plot not reproduced; X-axis: RTT index; Y-axis: RTT value in microseconds, $\times 10^5$; series: SDBA, Conservative, and Greedy.)

Fig. 4 – Packet-matching comparison between the Conservative and SDBA with send–echo pair overlaps. (Plot not reproduced; X-axis: RTT index; Y-axis: RTT value in microseconds, $\times 10^5$; series: SDBA and Conservative.)

Second, send–echo pair overlaps are very likely on the Internet, and we cannot claim that the three algorithms still give the same packet-matching result in this situation. What we can be sure of is that the Conservative algorithm still gives correct results, although with fewer send packets matched. By comparing the packet-matching results of SDBA with those of the Conservative and Greedy algorithms, we can assess the performance of SDBA in both matching rate and accuracy.

Fig. 4 shows the packet-matching comparison between the Conservative algorithm and SDBA when there are send–echo pair overlaps. Here we collected 169 send packets, of which 44 (only 29 are displayed in Fig. 4 for clarity) are matched by the Conservative algorithm and all 169 are matched by SDBA. The RTT gaps found by the Conservative algorithm are all included in the RTT gaps found by SDBA. Even though we are not sure about the correctness of the remaining RTTs, this comparison still gives us a sense of the correctness of the RTTs computed by SDBA.

We verify the packet-matching rate of SDBA by comparing it with the Greedy algorithm. Fig. 5 again shows only part of the packet-matching comparison between SDBA and the Greedy algorithm. It indicates that most of the RTTs are consistent, but a few of them are not. Among the 169 RTTs, 157 RTTs of the Greedy matches are included in the results of SDBA. We cannot be sure about the correctness of the other 12 RTTs of the Greedy algorithm (for clarity only 7 points are displayed in Fig. 5) until we compare them with the results of the Conservative algorithm, which should always give correct results. After this comparison we found that at least 4 of the 12 RTTs are potentially incorrect. Compared with the RTTs found by the Greedy algorithm, the RTTs found by SDBA are closer to the ones found by the Conservative algorithm. The experimental results show that SDBA can compete favorably not only against the Conservative algorithm in packet-matching accuracy but also against the Greedy algorithm in packet-matching rate.

Fig. 5 – Packet-matching comparison between the SDBA and the Greedy with send–echo pair overlaps. (Plot not reproduced; X-axis: RTT index; Y-axis: RTT value in microseconds, $\times 10^5$; series: SDBA and Greedy.)
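The two measures used in this comparison, matching rate and agreement with a trusted result, reduce to simple counting once each algorithm's output is stored as a mapping from send-packet index to RTT. The sketch below assumes that representation; the dictionaries hold made-up toy values, not the paper's measurements, and the function names are illustrative.

```python
def matching_rate(matched, total_sends):
    """Fraction of send packets for which an algorithm reports an RTT."""
    return len(matched) / total_sends


def agreement(reference, candidate):
    """Count sends in `reference` whose RTT is identical in `candidate`.

    Both arguments map a send-packet index to its RTT in microseconds.
    """
    return sum(1 for idx, rtt in reference.items() if candidate.get(idx) == rtt)


# Hypothetical toy results for five send packets (not the paper's data).
sdba = {0: 265000, 1: 264800, 2: 266100, 3: 265300, 4: 264900}
conservative = {0: 265000, 3: 265300}  # few matches, assumed correct
greedy = {0: 265000, 1: 264800, 2: 530000, 3: 265300, 4: 264900}

rate_conservative = matching_rate(conservative, 5)  # 0.4
rate_sdba = matching_rate(sdba, 5)                  # 1.0
included = agreement(conservative, sdba)            # 2, every Conservative RTT
greedy_in_sdba = agreement(greedy, sdba)            # 4, one Greedy RTT differs
```

With the counts reported in the text, the same arithmetic gives a matching rate of 44/169 (about 26%) for the Conservative algorithm against 169/169 for SDBA, and 157/169 (about 93%) of the Greedy RTTs agreeing with SDBA.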

5. Conclusion and future work

Estimating the length of a downstream connection chain is an effective way to detect stepping-stone intrusion. The core technique in estimating the length of a connection chain is to compute the round-trip time for each send packet by matching send and echo packets through the chain. We have proposed the SDBA approach to compute round-trip time, and a way to evaluate SDBA by probabilistic analysis. SDBA takes advantage of the fact that the RTTs of a connection chain cluster around a value that reflects the average network traffic.

SDBA can compete against the best known packet-matching algorithm in both matching rate and accuracy. We have proved that the probability of making a correct selection of RTT through SDBA is bounded below by $1 - (1/q^2)$, where q is a number related to the distribution of RTTs and the inter-arrival distribution of send packets. Experimental results on real cases showed that SDBA computes correct RTTs with a probability higher than 97%.

Some problems with the SDBA algorithm remain. The algorithm is somewhat inefficient in time complexity; finding a more efficient one is under way as future work, although we have discussed it briefly in Section 3.1. In addition, SDBA can compute the packet RTTs only for the downstream part of a connection chain. Finding the packet RTTs for the upstream part of a connection chain is more challenging, and it would give us a better estimate of the connection chain length and thus better stepping-stone detection.

References

Feller W. An introduction to probability theory and its applications, vol. I. New York: John Wiley & Sons, Inc.; 1968.

Jain A, Dubes R. Algorithms for clustering data. New Jersey: Prentice Hall, Inc.; 1988. p. 55–143.

Johnson Norman L, Kotz Samuel. Continuous univariate distributions-1. New York: John Wiley & Sons, Inc.; 1970. p. 166–97.

Kao E. An introduction to stochastic processes. New York: Duxbury Press; 1996.

Mirkin B. Mathematical classification and clustering. Dordrecht, The Netherlands: Kluwer Academic Publishers; 1996. p. 169–98.


Yang Jianhua, Huang Shou-Hsuan Stephen. Matching TCP packets and its application to the detection of long connection chains. In: IEEE proceedings of 19th international conference on advanced information networking and applications (AINA’05), Taipei, Taiwan; March 2005. p. 1005–10.

Ylonen T. SSH – secure login connections over the Internet. In: Sixth USENIX Security Symposium, San Jose, CA, USA; 1996. p. 37–42.

Ylonen T. SSH protocol architecture (draft–IETF document), <http://www.ietf.org/internet-drafts/draft-ietf-secsh-architecture-16.txt>; June 2004a.

Ylonen T. SSH Transport layer protocol (draft IETF document), <http://www.ietf.org/internet-drafts/draft-ietf-secsh-transport-18.txt>; June 2004b.

Yoda K, Etoh H. Finding Connection Chain for Tracing Intruders. In: Proceedings of the sixth European symposium on research in computer security (LNCS 1985), Toulouse, France; 2000. p. 31–42.

Yung Kwong H. Detecting long connecting chains of interactive terminal sessions, RAID 2002. Zurich, Switzerland: Springer Press; October 2002. p. 1–16.

Zhang Yin, Paxson Vern. Detecting stepping-stones. In: Proceedings of the ninth USENIX security symposium, Denver, CO; August 2000. p. 67–81.

Dr. Jianhua Yang is an Assistant Professor in the Department of Mathematics and Computer Science at Bennett College for Women, Greensboro, NC. His research interests are computer, network and information security. Dr. Yang earned his Ph.D. in Computer Science at the University of Houston. Before joining Bennett College, Dr. Yang was an Associate Professor at Beijing Institute of Computer Technology, Beijing, China from 1990 to 2000. He is currently a member of IEEE. Dr. Yang can be reached at [email protected].

Dr. Shou-Hsuan Stephen Huang is a professor of Computer Science at the University of Houston. His research interests include data structures and algorithms, intrusion detection and computer security. Stephen Huang received his Ph.D. degree from the University of Texas – Austin. He is a senior member of the IEEE Computer Society. Dr. Huang can be reached at [email protected].