
JOURNAL OF COMMUNICATIONS AND NETWORKS, VOL. 2, NO. 3, SEPTEMBER 2000

Internet Traffic Measurement and Analysis in a High Speed Network Environment: Workload and Flow Characteristics

Jae-Sung Park, Jai-Yong Lee, and Sang-Bae Lee

Abstract: A study of Internet traffic characterization is essential in designing the next generation Internet. In this paper we characterize aggregated Internet traffic based on traffic logs captured in a high speed Internet access network environment. First, we constructed an Internet traffic measurement and analysis system in a high-speed Internet access network. Then we analyzed the captured traffic in two ways. First, we analyze general Internet traffic characteristics: we present general workload characteristics of Internet traffic at each communication protocol layer and scrutinize how the behavior of upper layer protocols affects the distribution of IP packet sizes. To characterize the aggregated Internet traffic more precisely, we analyze the captured traffic according to an Internet flow model, show its general characteristics, and derive analytic models describing the random variables associated with Internet flow size. In this analysis, we found that an Internet flow consists of only a few packets (over 45% of flows are composed of a single packet), is small in size, and lasts only a few milliseconds. Even though Internet flows are small in size, most Internet traffic is carried by a small number of long-lived or big-sized flows. We also found that Internet flow size follows a log-normal distribution, which shows burstiness over a wide range of time scales. This is in sharp contrast to commonly made modeling choices, in which exponential assumptions dominate and exhibit only short-range dependence, and it is closely related to the self-similarity of the aggregated traffic.

Index Terms: Traffic measurement, workload characteristics, Internet flow, analytic model, LRD, self-similarity.

I. INTRODUCTION

The Internet has experienced explosive growth in both quantity and quality. Thanks to the rapid deployment of high-speed network equipment, the capacity of the Internet backbone has increased from hundreds of Mbps to a few Gbps, and high-speed access network technologies such as cable modems and ADSL have enhanced the capacity of Internet access networks.

Manuscript received January 27, 2000; approved for publication by Dae Young Kim, Division III Editor, August 29, 2000.

The authors are with the Electronic Engineering and Computer Science Department, Yonsei University, 134 Shinchon-dong, Sodaemon-ku, Seoul 120-749, Korea, e-mail: [email protected], [email protected], [email protected].

This work is partly supported by Korea Telecom (KT) and the Korea Science and Engineering Foundation (KOSEF). Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of KT and KOSEF.

1229-2370/00/$10.00 © 2000 KICS

With the rapid transformation of the Internet into a commercial infrastructure, there has also been considerable research on providing more advanced Internet service classes that guarantee users' quality of service demands specified in Service Level Agreements (SLAs), in addition to the traditional Best Effort service. To design such an infrastructure efficiently, it is essential to understand Internet traffic characteristics. To this end, we first construct an Internet traffic measurement and analysis system in a high-speed Internet access network environment. We analyze the measured traffic at the IP level, transport level, and application level. To characterize aggregate Internet traffic more precisely, we also analyze the captured data according to an Internet flow model [1].

There have been a number of studies analyzing Internet traffic in recent years. According to their objectives, these research works can be categorized as quantitative workload analysis and statistical analysis. The purpose of workload analysis is to find the growth trend of the Internet traffic workload and to assess the composition of traffic by application protocol type [2]–[5]. In this analysis, aggregate traffic is classified according to combinations of TCP/IP header fields such as source address, destination address, protocol number, source port number, and destination port number. In [4], [5], the authors showed that there were a few predominant packet sizes and that most packets were small. They also showed that the predominance of small packets was related to the behavior of higher layer protocols. The goal of statistical analysis is to build a mathematical Internet source model by identifying the statistically invariant characteristics of Internet traffic.

It has been validated in several papers that aggregated network traffic is self-similar in nature and that this self-similarity is caused by the high variability of the individual connections that make up the aggregate traffic; see [6], [7] for the LAN context and [8]–[10] for WANs. In [7], the authors analyzed an aggregated LAN traffic trace and showed, using heuristic self-similarity tests, that aggregated LAN traffic is self-similar. They went one step further and showed that the self-similarity of aggregated network traffic is caused by the high variability of the individual connections that make up the aggregate traffic [7]. In WAN traffic analysis, [8], [9] showed that legacy Internet applications such as ftp and telnet have statistical characteristics very different from Poisson models. They used the TCP connection control mechanism and port numbers to decompose aggregated traffic traces into per-application connections. Then, they used a


modified goodness-of-fit test to compare a few a priori selected probability distributions with the empirical distributions of connection-level datasets. [11] used the same self-similarity test method as [6] to validate that WWW traffic also has inherent self-similarity.

In this paper, we perform both workload and statistical analysis using a traffic trace collected from a high-speed commercial Internet access link. In our workload analysis, we present characteristics of Internet traffic at each communication protocol layer and scrutinize how the behavior of upper layer protocols affects the distribution of IP packet sizes. We also find that there are a few predominant packet sizes, in accordance with the results in [4], [5].

In our statistical analysis, we analyzed the captured data both at the flow level and at the aggregated traffic level. For the flow-level analysis, we decompose the captured data into flow datasets according to the Internet flow model [1] and use a goodness-of-fit test to find the most suitable flow model. We find that Internet flows show high variability in their size. According to the mathematical model in [12] and the empirical results in [7], the high variability of Internet flow size causes the self-similarity of aggregated Internet traffic. By performing a traffic self-similarity test using the wavelet-based multi-resolution analysis method [13], [14] and showing that the aggregated Internet traffic is self-similar, we verify the results of previous works that aggregated traffic self-similarity at the network level is caused by the high variability of the individual flows at the source level.

Our approach is similar to that of [8], [9] in that we use a goodness-of-fit test to find the most suitable probability distribution at the source level. But our definition of a source differs from that of [8], [9], which used the TCP connection control mechanism to decompose an aggregate traffic trace collected at some point in a network into TCP connection data. This kind of decomposition has the following drawbacks. First, new multimedia applications have emerged that do not use TCP as their transport protocol. Second, because the Internet is a connectionless datagram network in nature, the data path can change during a TCP connection; in this situation, it is impossible to discern separate TCP connections at a network node. Third, because aggregation of traffic in a network is inevitable, one application's traffic is influenced by the others while it is transferred, so it is hard to derive the traffic characteristics of each application from aggregate traffic logs captured at a node. In this paper, we instead use the timeout-based Internet flow model [1] to discern one connection from another. It is a more flexible approach to extracting connection information from an aggregated traffic profile.

In our flow analysis, we found that an Internet flow consists of a few packets, is small in size, and lasts only a few milliseconds. After classifying Internet flows by size and by duration, we found that even though Internet flows are small in size, most Internet traffic is carried by a small number of long-lived flows or big-sized flows. This suggests that it is possible to control most of the traffic by handling a small number of flows. We also found that the size distribution of Internet flows follows a log-normal distribution, which shows burstiness over a wide range of time scales. This means that Internet flows show high variability in both size and duration.

[Fig. 1 depicts the measurement setup: an optical splitter taps the link between the backbone ATM switch and the access-network ATM switch and feeds a monitoring box (OCxmon with an ATM card, 128 MB RAM, and hard disk). Captured raw data is passed off-line to an analysis system whose on-line analysis functions cover packet counts and byte volumes (total, per protocol, per application), top-10 source/destination IP pairs (per packet count, per byte volume, and per-application byte volume), and flow analysis (packet count, byte size, and duration per flow for various timeout values, per size/duration class); processed data is served to Web clients through a Web server with a CGI interface.]

Fig. 1. Traffic measurement & analysis architecture and environment.

This is in sharp contrast to commonly made modeling choices, in which exponential assumptions with short-range dependent traffic characteristics dominate.

The rest of this paper is organized as follows. In Section II, we describe the traffic measurement and analysis system in a high-speed network environment and give general workload characteristics of aggregate Internet traffic. In Section III, we introduce a statistical approach to finding the most suitable analytic model that reflects the characteristics of the empirical model derived from the measured datasets. In Section IV, we analyze Internet flow characteristics, deriving the size distributions of aggregate flows, long-lived flows, short-lived flows, small-sized flows, medium-sized flows, and big-sized flows. We also perform a self-similarity test on the aggregated traffic to show that it is self-similar and explain the relationship between the high variability of Internet flows and the self-similarity of the aggregated traffic. We also give some possible implications of long-range dependence of network traffic for performance analysis and congestion control. We conclude the paper and discuss future work in Section V.

II. TRAFFIC MEASUREMENT ENVIRONMENT AND WORKLOAD ANALYSIS

A. A Method and an Environment of Traffic Measurement

In order to characterize Internet traffic, we collected traffic at a point within a commercial Internet network in Korea. Fig. 1 shows the traffic measurement and analysis system we implemented. We passively monitored traffic as it traversed from an ATM-based high-speed (155 Mbps) commercial Internet backbone to an access network supporting ADSL subscribers. The overall architecture follows RFC 2722 [15], which was proposed by the IETF Realtime Traffic Flow Measurement (RTFM) Working Group.

To obtain a copy of the optical link signal going from the backbone to the access network, we used an optical splitter to divide the optical signal. To implement the monitoring machine, we installed the CoralReef ver. 3.1.1 library on FreeBSD version 2.7 and configured the monitoring box to capture only the first cell of each packet [9]. The first cell of each packet contains the IP header and transport


Table 1. Summary of measured data sets.

                      Data Set I                 Data Set II
Duration              '99.7.23 PM 1:00-2:00      '99.7.28 PM 10:00-11:00
Total packet count    30,872,930                 25,245,975
TCP packet count      25,968,167                 18,647,290
UDP packet count      4,856,110                  6,566,013
Total data volume     4.5 Gbyte                  2.7 Gbyte
TCP data volume       4.2 Gbyte                  2.3 Gbyte
UDP data volume       331 Mbyte                  408 Mbyte

layer header information. The IP header and the TCP header are each 20 bytes long if no option fields are included. Because the length of an ATM cell payload is 48 bytes, the first cell of a packet may be insufficient to obtain complete header information for each layer when option fields are present. However, option fields are hardly ever used in our measured data set, so we ignored such data in our analysis and captured only the first cell of each packet. This monitoring method has the following advantages. First, we can reduce the storage needed for the captured data by storing only the first cell of each packet. Second, it reduces the overhead of the monitoring box, so it can capture the traffic stream without loss on high-speed links.
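As an illustration of the kind of record that can be extracted from each captured cell, the following sketch parses the IPv4 and TCP/UDP header fields from a 48-byte first-cell payload. It is not the actual OCxmon/CoralReef code; it assumes the payload begins directly with the IPv4 header (any ATM/AAL5 encapsulation overhead is ignored) and that no IP options are present, as was almost always the case in our traces.

```python
import struct

def parse_first_cell(payload: bytes):
    """Extract IPv4 and TCP/UDP header fields from a 48-byte first-cell payload.

    Assumes the payload begins directly with the IPv4 header and that no IP
    options are present (20-byte header).
    """
    # IPv4 header layout (RFC 791): total length at bytes 2-3, protocol at
    # byte 9, source and destination addresses at bytes 12-19.
    _, _, total_len = struct.unpack("!BBH", payload[0:4])
    protocol = payload[9]
    src, dst = struct.unpack("!4s4s", payload[12:20])

    record = {
        "ip_len": total_len,                   # packet size used in the workload analysis
        "protocol": protocol,                  # 6 = TCP, 17 = UDP
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }
    if protocol in (6, 17):                    # TCP/UDP ports follow the 20-byte IP header
        record["sport"], record["dport"] = struct.unpack("!HH", payload[20:24])
    return record
```

Such per-packet records are the input to both the workload analysis and the flow analysis described below.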

The captured data is sent to the analysis system through a dedicated line. In the analysis system, traffic analysis is done in two modes: one is an analysis of general workload characteristics and the other is an analysis of flow characteristics.

B. General Workload Characteristics of Internet Traffic

We measured traffic streams many times, varying the day and the period of the day. In this paper, we present representative datasets. Data set I was measured during working hours and data set II was measured at night. Table 1 shows summary information for each dataset.

Compared to previous measurement-based studies [8], [9], in which the captured data amounted to only a few hundred Mbytes per day, our datasets are up to a few Gbytes in size, so they are representative of a high-speed network environment. Our datasets contain diverse applications such as WWW, active mail, battle-net, audio mail, POP, Kerberos, and SQL, as well as legacy Internet applications such as ftp and telnet.

We analyzed the workload characteristics of Internet traffic in various ways. First, we analyzed the composition of traffic by transport protocol. In data set I, TCP contributed 92.5% of the total traffic volume and 84% of the packet count, while UDP accounted for 7.3% of the total traffic volume and 15.7% of the packet count. In data set II, TCP accounted for 83.7% of the total traffic volume and 73.9% of the packet count, while UDP contributed 16.1% of the total traffic volume and 26% of the packet count.

Fig. 2 shows the cumulative distribution of packet sizes and of bytes by the size of packets. As the figure shows, there is a predominance of small packets, with peaks at the common sizes of 40, 51, 60, 576, and 1500 bytes. The predominance of small packets is related to higher layer protocols. Packets 40–44 bytes in length include TCP control segments such as ACK, NAK, RST, SYN, and FIN packets. Packets 51–60 bytes in length include UDP packets such as

DNS queries and replies, on-line game packets, and telnet packets carrying a single character. Many TCP implementations that do not implement Path Maximum Transmission Unit (MTU) Discovery use either 512 or 536 bytes as the default Maximum Segment Size (MSS) for non-local IP destinations, yielding 552-byte or 576-byte packets (the segment size plus 20 bytes each of IP and TCP header). An MTU of 1500 bytes is characteristic of Ethernet-attached hosts. Because there are predominant packet sizes, the cumulative distribution of packet sizes looks like a step function. This is in sharp contrast to the general assumption that Internet packet sizes follow an exponential distribution.

Almost 85% of the packets are smaller than the typical TCP MSS of 576 bytes, and over 45% of the packets are 40 bytes in length. Note, however, that in terms of bytes the small packets do not account for much of the traffic: while almost 85% of packets are 61 bytes or less, these packets constitute only 25% of the total volume in bytes.
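The two kinds of curves shown in Fig. 2 can be computed from the per-packet size records as follows. This is a minimal sketch rather than the analysis system's code; it assumes the per-packet IP lengths are available in an array (here called `sizes`) and uses NumPy.

```python
import numpy as np

def packet_and_byte_cdf(sizes):
    """Return (unique packet sizes, CDF of packet count, CDF of bytes).

    `sizes` is an array of per-packet IP lengths in bytes; the byte-weighted
    CDF shows what fraction of the total traffic volume is carried by packets
    up to a given size.
    """
    sizes = np.asarray(sizes)
    uniq, counts = np.unique(sizes, return_counts=True)
    pkt_cdf = np.cumsum(counts) / counts.sum()                    # fraction of packets
    byte_cdf = np.cumsum(uniq * counts) / (uniq * counts).sum()   # fraction of bytes
    return uniq, pkt_cdf, byte_cdf
```

Comparing the two curves directly exposes the effect noted above: the 40-byte peak dominates the packet-count CDF but contributes little to the byte-weighted CDF.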

III. STATISTICAL ANALYSIS METHODS

In this section, we introduce the statistical approach used to determine the most suitable analytic model for flow size distributions. It is well known in the statistics community that large datasets almost never have statistically exact descriptions [16]. It is, however, possible to find the most suitable analytic model based on previous research [17], [18]. The approach takes three steps. First, we choose several probability distributions and estimate their parameters from the measured data. Second, a modified goodness-of-fit of each analytic model is tested. Finally, the distribution with the smallest discrepancy measure is selected as the most suitable one.

For the analysis, we consider several distributions that cover a large range of values, as well as the conventional exponential and normal distributions. This choice was motivated by recent advances in traffic modeling research [6], [8], [17]. We also chose a long-tailed distribution, the log-normal distribution.

A. Parameter Estimation

The models we have selected have free parameters that must be estimated from a given dataset before testing the models for validity in describing that dataset. For example, a log-normal distribution requires that the geometric mean and standard deviation be estimated from the dataset. The authors of [20] make the important point that estimating free parameters from datasets alters the significance levels corresponding to statistics such as the Anderson-Darling (A²) test [21] computed from the fitted model. We used the maximum likelihood (ML) estimator, which has statistically rigorous properties, to estimate the free parameters. The ML estimator determines the free parameters so as to maximize the likelihood function, which is defined as follows; the detailed statistical characteristics of the estimator can be found in [22].

Definition 1: The likelihood function of $n$ random variables $x_1, x_2, \ldots, x_n$ is the joint density $g(x_1, x_2, \ldots, x_n; \theta)$ of the $n$ random variables, regarded as a function of $\theta$.



Fig. 2. The distribution of packet sizes and of bytes by the size of the packets carrying them: (a) portion of packet count and of bytes by packet size (Data Set I), (b) cumulative distribution of packet count and of byte volume by packet size (Data Set I), (c) portion of packet count and of bytes by packet size (Data Set II), (d) cumulative distribution of packet count and of byte volume by packet size (Data Set II).

In particular, for a random sample $x_1, x_2, \ldots, x_n$, the likelihood function is given by

$$g(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta), \qquad (1)$$

where we have assumed that $f(x_i; \theta)$ is the population density.

Definition 2: If $\hat{\theta}$ is the value of $\theta$ in the parameter space $\Theta$ that maximizes the likelihood function, then $\hat{\theta}$ is called the maximum likelihood (ML) estimator of $\theta$.

Let $L(\theta) = g(x_1, x_2, \ldots, x_n; \theta)$. Since the likelihood function satisfies certain regularity conditions, the ML estimator of $\theta$ is obtained by solving

$$\frac{dL(\theta)}{d\theta} = 0. \qquad (2)$$

B. Comparing Analytic and Empirical Models

The random variables we want to model all come from distributions with essentially unbounded maxima. Moreover, these distributions are either continuous or, in the case of the amount of data transferred, discrete over the non-negative integers. As such, the values of the variables do not naturally fall into a finite number of categories, which makes finding a statistically exact model impossible [9], [23].

But we can still produce useful analytic models by building on the work of [17], [24] in the following way. In those papers, the authors argue that their empirical models are valuable because the variation in traffic characteristics from site to site and over time is fairly small; therefore the tcplib models, which were derived from UCB datasets, faithfully reproduce the characteristics of wide-area TCP connections. Our aim, likewise, is to find analytic models that are just as good at reproducing the characteristics of real network traffic.

To compare an analytic model with an empirical one, we use some sort of goodness-of-fit metric. While under certain conditions one can apply tests such as A² as metrics [20], they are not appropriate for measuring the fit of an empirical model. Instead, we use the λ² discrepancy measure, which is similar to the chi-square test. The criterion λ² for each model is derived as follows. Suppose that we have observed n instances of a random variable Y that we want to model using another model distribution Z. We partition the distribution Z into M bins. Each bin has a probability p_i associated with it, which is the proportion of the


distribution Z falling into the i-th bin. Let $N_i$ be the number of observations of Y that actually fell into the i-th bin. Then $\lambda^2$ is defined as

$$\lambda^2 = \frac{X^2 - K - M + 1}{n - 1}, \qquad (3)$$

where

$$X^2 = \sum_{i=1}^{M} \frac{(N_i - n p_i)^2}{n p_i}, \qquad K = \sum_{i=1}^{M} \frac{N_i - n p_i}{n p_i}. \qquad (4)$$

We chose the distribution with the smallest value of $\lambda^2$ as the most accurate one. The essential part of Eq. (3) is the $X^2$ statistic of the chi-square test. However, the optimal number of bins to use when computing the chi-square discrepancy measure varies with the size of the dataset and its standard deviation, so the raw chi-square measure cannot be used to compare discrepancies between datasets that have different numbers of bins. Refer to [9] for more details of the statistical approach taken above.
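As an illustration, the λ² measure of Eqs. (3) and (4) can be computed as follows. This sketch assumes the candidate model is supplied through its CDF (for example, a fitted log-normal CDF) and that the bin edges are chosen by the analyst; the binning used in our analysis is not reproduced here.

```python
import numpy as np

def lambda2_discrepancy(samples, bin_edges, model_cdf):
    """Compute the lambda^2 discrepancy of Eqs. (3)-(4).

    samples:   the n observed values of the random variable Y
    bin_edges: M+1 edges partitioning the support into M bins (each bin
               should receive nonzero model probability)
    model_cdf: vectorized CDF of the candidate analytic distribution Z
    """
    samples = np.asarray(samples, dtype=float)
    edges = np.asarray(bin_edges, dtype=float)
    n = len(samples)
    M = len(edges) - 1

    N, _ = np.histogram(samples, bins=edges)   # N_i: observed counts per bin
    p = np.diff(model_cdf(edges))              # p_i: model probability per bin

    expected = n * p
    X2 = np.sum((N - expected) ** 2 / expected)   # chi-square statistic, Eq. (4)
    K = np.sum((N - expected) / expected)         # correction term K, Eq. (4)
    return (X2 - K - M + 1) / (n - 1)             # Eq. (3)
```

The candidate distribution yielding the smallest returned value is then selected as the most suitable model.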

IV. INTERNET FLOW ANALYSIS

We define a flow as a set of packets that satisfy specific temporal and spatial locality conditions, as observed at an internal point of the network. That is, a flow represents actual traffic activity from one or both of its transmission endpoints as perceived at a given network measurement point. A flow is active as long as observed packets that meet the flow specification arrive separated in time by less than a specified timeout value.

As the flow timeout becomes smaller, the probability of mistaking a long-lived flow for several short-lived ones increases. On the contrary, if the timeout is large, there is a tendency to regard several short-lived flows as one long-lived flow. To find an appropriate flow timeout value, we first examined the results of the workload analysis and the general usage patterns of the applications found in our trace datasets. Most applications finish their service within a few seconds, but interactive applications like telnet and WWW last at least tens of seconds. Furthermore, because of user think time, there are idle periods within these active interactive applications, which makes the variation of packet inter-arrival times high. The flow timeout value must therefore be larger than the maximum length of such an active idle period. To determine an appropriate upper bound for the timeout value, we varied the timeout from 4 seconds to 256 seconds and examined the total number of expired flows. We found that the number of expired flows decreases remarkably when the timeout value is changed from 16 seconds to 32 seconds. This indicates that if we increase the timeout value beyond 16 seconds, the probability of mistaking several short-lived flows for one long-lived flow increases. Following this reasoning, we used 15 seconds as our timeout value. For the analysis in this paper, we define a flow based on its protocol, source IP address, destination IP address, source port, destination port, and a 15-second timeout; a packet is considered to belong to the same flow if no more than 15 seconds have passed since the last packet with the same flow attributes. We collected expired flows to analyze the characteristics of Internet flows.
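The following sketch illustrates this timeout-based flow definition. It is not the code of our analysis system; it assumes the captured packets are available as time-ordered (timestamp, 5-tuple key, IP length) records, for example as produced by the header-parsing sketch in Section II.

```python
FLOW_TIMEOUT = 15.0  # seconds, chosen as described above

def assemble_flows(packets, timeout=FLOW_TIMEOUT):
    """Group time-ordered packets into flows keyed by the 5-tuple, expiring a
    flow when the gap between consecutive packets exceeds the timeout.

    `packets` yields (timestamp, five_tuple, ip_length) records in time order;
    expired flow records (packet count, byte count, duration) are yielded.
    """
    active = {}  # five_tuple -> [first_ts, last_ts, packet_count, byte_count]
    for ts, key, length in packets:
        flow = active.get(key)
        if flow is not None and ts - flow[1] > timeout:
            # Inactivity gap exceeded: the old flow expires and a new one starts.
            yield {"key": key, "packets": flow[2], "bytes": flow[3],
                   "duration": flow[1] - flow[0]}
            flow = None
        if flow is None:
            active[key] = [ts, ts, 1, length]
        else:
            flow[1] = ts
            flow[2] += 1
            flow[3] += length
    for key, flow in active.items():  # flush flows still active at the end of the trace
        yield {"key": key, "packets": flow[2], "bytes": flow[3],
               "duration": flow[1] - flow[0]}
```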

Fig. 3. The empirical cumulative distribution of Internet flows: (a) flow size in packets, (b) flow size in bytes, (c) flow duration.

A. General Characteristics of Internet Flow

Fig. 3 shows the cumulative distribution of Internet flow size in packets and in bytes and the cumulative distribution of Internet flow duration. The figure shows the following general characteristics of Internet flows.


Table 2. Analytic model of each flow size distribution.

Category                 Model        λ²     Parameter 1            Parameter 2
Total flow size          Normal       39.7
                         Log-normal   1.9    Log. mean: 6.64        Log. std. dev.: 1.9
                         Exponential  58
Short-lived flow size    Normal       3.25
                         Log-normal   2.58   Log. mean: 8.7         Log. std. dev.: 1.4
                         Exponential  6.27
Long-lived flow size     Normal       21.9   Mean: 23,806,481       Std. dev.: 10,996,724
                         Log-normal   4.0
                         Exponential  16.2
Small-sized flow size    Normal       11.8
                         Log-normal   5.1    Log. mean: 1.3         Log. std. dev.: 0.4
                         Exponential  5.3
Medium-sized flow size   Normal       1.4
                         Log-normal   45     Log. mean: 8.4         Log. std. dev.: 1.3
                         Exponential  7.3
Big-sized flow size      Normal       9.8
                         Log-normal   9.4    Log. mean: 15.8        Log. std. dev.: 1.1
                         Exponential  10.2

First, flows are composed of a very small number of packets; in particular, 46% of all flows consist of only one packet. Second, the size of flows in bytes is small: over 67% of flows are smaller than 1.6 kbyte. Finally, the majority of flows have a very short duration. These facts reflect the workload characteristic that small packets predominate. Because small packets are due to TCP connection control packets and UDP-based services that complete after sending only one packet, flows composed of only one packet predominate.

We classify Internet flows by size and by duration. We call a flow a short-lived flow if its duration is less than 1 second; otherwise we call it a long-lived flow. We also classify flows by their size in bytes. If the size of a flow is less than 1 kbyte, we call it a small-sized flow; if it is larger than 1 Mbyte, we call it a big-sized flow; otherwise, we call it a medium-sized flow.

Each threshold value that delimits flows by size and duration is determined according to the general usage patterns of the applications found in our trace datasets. Small-sized flows correspond to flows that complete their service by transferring only one or two packets, such as ACKs and DNS queries. Because the sizes of these flows range from tens to hundreds of bytes, we chose 1 kbyte as the threshold that delimits a small-sized flow. Applications such as WWW, RealPlayer, and ftp require large bandwidth; their data volume exceeds a few Mbytes, so we chose 1 Mbyte as our second threshold to represent high-bandwidth applications. Medium-sized flows include on-line game and telnet flows, which have long durations but small sizes. In general, typical network applications complete their service by exchanging a few packets, and round trip times (RTTs) between end hosts are typically within a few hundred milliseconds [25], so we selected 1 second as the threshold that separates Internet flows into short-lived and long-lived flows. Long-lived flows include flows that last several seconds or carry a large amount of data. These thresholds may not be exact boundaries, but by adjusting them according to the analysis purpose, the classification method we propose can be used to classify packets into various classes [26].
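A flow record produced as in the earlier sketch can then be labeled using the thresholds just described. This is only an illustrative sketch; whether 1 kbyte and 1 Mbyte are taken as powers of ten (as here) or powers of two is one of the adjustable choices mentioned above.

```python
def classify_flow(flow):
    """Label a flow record by duration (short-/long-lived) and by size
    (small-/medium-/big-sized) using the thresholds described above."""
    lived = "short-lived" if flow["duration"] < 1.0 else "long-lived"
    if flow["bytes"] < 1_000:            # below 1 kbyte
        sized = "small-sized"
    elif flow["bytes"] > 1_000_000:      # above 1 Mbyte
        sized = "big-sized"
    else:
        sized = "medium-sized"
    return lived, sized
```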

We find that even though the majority of Internet flows are

small in size and have a very short duration, the majority of traffic is carried by a small number of big-sized or long-lived flows. Long-lived flows account for 24.5% of the total flow count, but for 96.5% of the total packet count and 98.2% of the total traffic volume. Big-sized flows account for only 0.5% of the total flow count, yet they carry 77.5% of the total packet count and 91.8% of the total traffic volume. This means that it is possible to control the majority of the traffic volume by controlling a small number of flows, which is the basis of data-driven MPLS approaches [27]. The results can also be applied to the development of effective congestion control and optimal resource allocation algorithms.

B. Analytic Model of the Internet Flow

Following the statistical analysis method presented in Section III, we determine the best analytic model for the distribution of total flow size and for the size distributions of flows classified by size and by duration. Because it is impossible to find a statistically exact model for large datasets, we selected the best model as the one that minimizes our discrepancy measure. The reason there is no statistically exact model is that an analytic model cannot fully reflect every aspect of real-world behavior. In a real network environment, there are consistent spikes in the distribution of flow sizes, and even subtle deviations from smooth behavior, because of network congestion and errors. The empirical model derived from the measured dataset can capture these nuances, but an analytic model can easily miss them.

Table 2 summarizes the results of our analysis. The first column shows the categories we examined, the second column the candidate probability distributions, and the third column the discrepancy measure. The fourth and fifth columns show the parameter values estimated by the ML estimator from the measured dataset. Fig. 4 compares the actual cumulative distribution of total flow size to the selected analytic distributions; the x-axis of each graph represents the flow size on a log scale. We can observe in the figure that, whether classified by flow size or by flow duration, all the flow size distributions agree best with the log-normal distribution, which shows bursty characteristics over a wide range of time scales, except for the long-lived flows.



Fig. 4. Comparison of the empirical distributions of flow sizes to analytic distributions (x-axis: flow size in bytes, log scale; y-axis: CDF × 100): (a) total flow size, (b) short-lived flow size, (c) long-lived flow size, (d) small-sized flow size, (e) medium-sized flow size, (f) big-sized flow size.


This is in sharp contrast to commonly made modeling choices in today's traffic engineering theory and practice, where exponential assumptions still dominate and can reproduce the bursty behavior of measured traffic only over a very limited range of time scales. The high variability of traffic sources is closely related to the self-similarity of the aggregated traffic.

C. Aggregate Traffic Self Similarity

To verify the impact of the above result on the self-similarity of the aggregated traffic, we show that the aggregated Internet traffic is self-similar and explain that the reason is the high variability of Internet flow sizes. To validate the self-similarity of our datasets, we follow the discrete wavelet transform (DWT) method proposed in [13]. Compared with other test methods, it is simple, computationally efficient, informative, and rigorous. See [28] for an introduction to wavelets.

Briefly, if X is a self-similar process with Hurst parameter $H \in (0.5, 1)$, then the expectation of the energy $E_j$ that lies within a given bandwidth $2^{-j}$ around frequency $2^{-j}\nu_0$ is given by

$$E[E_j] = E\Big[\frac{1}{N_j}\sum_k |d_{j,k}|^2\Big] = c\,|2^{-j}\nu_0|^{1-2H}, \qquad (5)$$

where $\nu_0$ is a frequency that depends on the wavelet, $c$ is a prefactor that does not depend on $j$, and $N_j$ denotes the number of wavelet coefficients at scale $j$. By plotting $\log_2(E_j)$ against the scale $j$ and identifying scaling regions, breakpoints, and non-scaling behavior, we obtain an unbiased scaling analysis of a given signal X. An approximately straight line in this plot indicates self-similarity, and its slope corresponds to $2H - 1$. For further properties of DWT-based scaling analysis and details of the statistical approach, see [13], [14].
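The following sketch illustrates the scaling analysis of Eq. (5) using a plain Haar DWT and NumPy; it is not the implementation of [13], and the choice of wavelet and of the fitting range of scales are assumptions of the sketch. The input is a packet-count-per-interval series, and the Hurst parameter is estimated from the slope of log₂(E_j) through the relation slope = 2H − 1.

```python
import numpy as np

def haar_log_energies(counts, max_scale=16):
    """Return log2(E_j), j = 1..J, where E_j is the mean squared Haar detail
    coefficient at scale j, computed from a packet-count-per-interval series."""
    x = np.asarray(counts, dtype=float)
    log_energy = []
    for _ in range(max_scale):
        if len(x) < 2:
            break
        x = x[: (len(x) // 2) * 2]                    # truncate to even length
        detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # Haar detail coefficients d_{j,k}
        x = (x[0::2] + x[1::2]) / np.sqrt(2.0)        # Haar approximation -> next scale
        log_energy.append(np.log2(np.mean(detail ** 2)))
    return np.array(log_energy)

def estimate_hurst(log_energy, j_min, j_max):
    """Fit a straight line to log2(E_j) over scales j_min..j_max; the slope
    equals 2H - 1 under Eq. (5), so H = (slope + 1) / 2."""
    j = np.arange(1, len(log_energy) + 1)
    mask = (j >= j_min) & (j <= j_max)
    slope, _ = np.polyfit(j[mask], log_energy[mask], 1)
    return (slope + 1.0) / 2.0
```

For a trace like that of Fig. 5, the fitting range would be chosen in the large-time-scale region, i.e., at scales coarser than the observed breakpoint.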

Fig. 5 shows the scaling analysis of data set I and of a synthetic dataset generated from a Poisson distribution with the same mean as data set I. The datasets represent the packet arrival count in every 1-ms interval. The synthetic dataset generated from the Poisson distribution does not show any scaling property. On the contrary, the actual trace shows straight-line behavior for large values of j, with an estimated Hurst parameter of about 0.91. This indicates asymptotic self-similarity and confirms the previously reported asymptotically self-similar nature of WAN traffic [8], [10].

Mandelbrot [12] presented a mathematical model showing that the superposition of highly variable ON/OFF sources results in self-similar aggregated traffic. In [7], the authors empirically validated Mandelbrot's result using an Ethernet LAN traffic trace. Our results, that Internet flow sizes follow a heavy-tailed distribution and that the aggregated traffic is asymptotically self-similar, conform to these mathematical and empirical proofs.

D. Implications of the Results

The size of Internet flows follows a log-normal distribution. The log-normal distribution is a kind of heavy-tailed distribution, and it is possible to define self-similar stochastic processes with such distributions.

Fig. 5. Global scaling analysis of Data Set I (packet level; aggregation level = 1 ms). The plot shows log₂(E_j) against scale j for the actual trace and for synthetic data from a Poisson model; the actual trace exhibits a small-time-scale region and a large-time-scale region separated by a breakpoint at 256 ms.

These results have a severe impact on network performance analysis and protocol design.

Many analytical studies have shown that self-similar network traffic can have a detrimental impact on network performance, including amplified queuing delay and packet loss rate [29], [30]. One practical effect of traffic self-similarity is that the buffers needed at switches must be bigger than those predicted by traditional queuing analysis and simulation. In [31], the authors pointed out that the common assumption that multiplexing a large number of independent traffic streams results in a Poisson process is not valid. ATM switches designed under this assumption and the corresponding queuing analysis had small buffers (10–100 cells); when these switches were deployed in the field, cell losses were far beyond those expected. In networks with large bandwidth-delay products, the use of very large buffers adds considerable end-to-end delay and delay jitter, which severely impairs the ability to run interactive multimedia applications.

Moreover, the scale-invariant burstiness of an LRD process implies the existence of concentrated periods of high activity over a wide range of time scales, which adversely affects congestion control; at the same time, it is an important correlation structure that may be exploitable for congestion control purposes [30]. Network performance as expressed by throughput, packet loss rate, and packet retransmission rate degrades gradually with increasing heavy-tailedness. The degree to which heavy-tailedness affects self-similarity is determined by how well congestion control is able to shape its source traffic into an on-average constant output stream while conserving flows [29]. We are currently studying effective buffer management schemes to reduce network congestion.

V. CONCLUSION AND FUTURE WORKS

In this paper, we passively monitored traffic as it traversed from an ATM-based high-speed (155 Mbps) Korean commercial Internet backbone to an access network supporting ADSL subscribers. The measured datasets represent a current high-speed


and diverse Internet environment. We analyzed general workload characteristics of Internet traffic and also scrutinized statistical characteristics of Internet flows. In our flow analysis, we classified Internet flows by size and by duration. We also presented an analytic model of Internet flow size and gave some implications of the long-range dependence characteristics of Internet flow size.

The general workload characteristics of Internet traffic are summarized as follows. First, there is a predominance of small packets, with peaks at the common sizes of 40, 51, 60, 576, and 1500 bytes; the predominance of small packets is related to the behavior of the higher layer protocols. Second, over 75% of packets carry TCP-based application data. Third, almost 85% of the packets are smaller than the typical TCP MSS of 576 bytes, and over 45% of the packets are 40 bytes in length. Fourth, even though small packets predominate, they do not account for much of the traffic in bytes: while almost 85% of packets are 61 bytes or less, they constitute only 25% of the byte volume. Fifth, the distribution of IP packet length resembles a step function rather than the generally assumed exponential distribution.

In our statistical flow analysis, we found that an Internet flow consists of only a few packets, is small in size, and lasts only a few milliseconds. Even though Internet flows are small in size, most of the Internet traffic volume is carried by a small number of long-lived and big-sized flows. This suggests that it is possible to control most of the traffic by handling a small number of flows. We also found that Internet flow size follows a log-normal distribution, whether flows are classified by size or by duration. This is in sharp contrast to commonly made modeling choices, in which exponential assumptions dominate and show only short-range dependence. The heavy-tailedness of flow size has detrimental influences on network performance and traffic control. We also demonstrated the asymptotic self-similarity of the aggregated traffic and validated the previous research results that aggregated traffic self-similarity is closely related to the high variability of Internet flows, which follow a heavy-tailed distribution.

For future research, it is necessary to simplify self-similarity by reducing its modeling to a single measure and to develop new network engineering tools and methods that adaptively operate on this measure to provide optimal performance and capacity.

REFERENCES

[1] K. C. Claffy, H. W. Braun, and G. Polyzos, "A parameterizable methodology for internet traffic flow profiling," IEEE J. Select. Areas Commun., Apr. 1996.
[2] R. Caceres, "Measurement of wide-area Internet traffic," UCB/CSD 89/550, University of California, Berkeley, CA, Dec. 1989.
[3] R. Caceres et al., "Characteristics of wide-area TCP/IP conversations," in Proc. ACM SIGCOMM'91, Sept. 1991.
[4] J. Apisdorf et al., "OC3mon: Flexible, affordable, high performance statistics collection," INET'97, Kuala Lumpur, Malaysia, 1997.
[5] K. Thompson, G. Miller, and R. Wilder, "Wide area internet traffic patterns and characteristics," IEEE Network, Nov. 1997. Available at http://www.vbns.net/presentations/papers/MCItraffic.ps.
[6] W. Leland et al., "On the self-similar nature of ethernet traffic (extended version)," IEEE/ACM Trans. Networking, vol. 2, pp. 1–15, 1994.
[7] W. Willinger et al., "Self-similarity through high-variability: Statistical analysis of ethernet LAN traffic at the source level," IEEE/ACM Trans. Networking, vol. 5, pp. 71–86, 1997.
[8] V. Paxson and S. Floyd, "The failure of poisson modeling," in Proc. SIGCOMM'94, Computer Communication Review, vol. 24, London, 1994, pp. 257–268.
[9] V. Paxson, "Empirically derived analytic models of wide-area TCP connections," IEEE/ACM Trans. Networking, vol. 2, no. 4, pp. 316–336, 1994.
[10] A. Feldmann et al., "The changing nature of network traffic: Scaling phenomena," Computer Communication Review, vol. 28, no. 2, Apr. 1998.
[11] M. Crovella and A. Bestavros, "Examining world wide web self-similarity," Rep. BU-CS-95-015, Boston University, 1995.
[12] B. B. Mandelbrot, "Long-run linearity, locally gaussian processes, H-spectra and infinite variances," International Economic Review, vol. 10, pp. 82–113, 1969.
[13] P. Abry and D. Veitch, "Wavelet analysis of long-range-dependent traffic," IEEE Trans. Inform. Theory, vol. 44, pp. 2–15, 1998.
[14] A. Feldmann, A. C. Gilbert, and W. Willinger, "Data networks as cascades: Investigating the multifractal nature of internet WAN traffic," in Proc. ACM SIGCOMM'98, 1998, pp. 25–38.
[15] N. Brownlee, C. Mills, and G. Ruth, "Traffic flow measurement: Architecture," RFC 2722, Oct. 1999.
[16] P. Martin-Löf, "The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set of observational data," Scandinavian J. Statistics, vol. 1, no. 1, pp. 3–18, 1974.
[17] P. Danzig and S. Jamin, "tcplib: A library of TCP internetwork traffic characteristics," Report CS-SYS-91-01, Computer Science Department, USC, 1991.
[18] P. Danzig et al., "An empirical workload model for driving wide-area TCP/IP network simulations," Internetworking: Research and Experience, vol. 3, no. 1, pp. 1–26, 1992.
[19] M. Nabe, M. Murata, and H. Miyahara, "Analysis and modeling of world wide web traffic for capacity dimensioning of internet access lines," J. Performance Evaluation, Elsevier, vol. 34, pp. 249–271, 1998.
[20] R. B. D'Agostino and M. A. Stephens, Goodness-of-Fit Techniques, Marcel Dekker Inc., 1986.
[21] T. W. Anderson and D. A. Darling, "Asymptotic theory of certain goodness-of-fit criteria based on stochastic processes," Ann. Math. Statist., vol. 23, pp. 193–212, 1952.
[22] U. Narayan Bhat, Elements of Applied Stochastic Processes, John Wiley & Sons Inc., 1972.
[23] D. Knuth, Seminumerical Algorithms, Second Edition, Addison-Wesley, 1981.
[24] P. Danzig et al., "An empirical workload model for driving wide-area TCP/IP network simulations," Internetworking: Research and Experience, vol. 3, no. 1, pp. 1–26, 1992.
[25] A. Feldmann et al., "Dynamics of IP traffic: A study of the role of variability and the impact of control," in Proc. ACM SIGCOMM'99, 1999.
[26] K. Nichols et al., "Definition of the differentiated services field (DS field) in the IPv4 and IPv6 headers," RFC 2474, proposed standard, Dec. 1998.
[27] P. Newman et al., "IP switching and gigabit routers," IEEE Commun. Mag., Jan. 1997.
[28] G. Kaiser, A Friendly Guide to Wavelets, Birkhäuser, Boston, 1994.
[29] K. Park, G. Kim, and M. Crovella, "On the effect of traffic self-similarity on network performance," in Proc. SPIE International Conf. on Performance and Control of Network Systems, 1997, pp. 296–310.
[30] P. R. Morin, The Impact of Self-Similarity on Network Performance Analysis, Ph.D. Dissertation, Carleton Univ., Dec. 1995.
[31] W. Willinger, W. Wilson, and M. Taqqu, "Self-similar traffic modeling for high-speed networks," ConneXions, Nov. 1994.

Jae-Sung Park received the B.S. and M.S. degrees in electronic engineering from Yonsei University, Seoul, Korea, in 1995 and 1997, respectively. He is currently working toward the Ph.D. degree in the Department of Electronic Engineering at Yonsei University, Seoul, Korea. His current research areas include traffic measurement and analysis and QoS provisioning and management in the next generation Internet.


Jai-Yong Lee received the B.S. degree in electronic engineering from Yonsei University, Seoul, Korea, in 1977 and the M.S. and Ph.D. degrees in computer engineering from Iowa State University in 1984 and 1987, respectively. From 1977 to 1982, he was a research engineer at the Agency for Defense Development of Korea. From 1987 to 1994 he was an Associate Professor at Pohang Institute of Science and Technology. He is currently a Professor in the Department of Electronic Engineering, Yonsei University. His current research interests include protocol design for QoS management, network management, high speed networks, and conformance testing.

Sang-Bae Lee received the B.S. degree from Seoul National University, Korea, in 1961, the M.S. degree from Stanford University, Stanford, CA, in 1964, and the Ph.D. degree from the University of Newcastle upon Tyne, England, in 1975. He is currently a Professor in the Department of Electronic Engineering, Yonsei University, Seoul, Korea. From 1969 to 1979, he was an Assistant Professor at Seoul National University, and from 1982 to 1983 a Visiting Professor at the University of Newcastle upon Tyne, England. He served as Chairman of the IEEE Korea Section from 1986 to 1987, as Chairman of the Korea Institute of Telematics and Electronics from 1989 to 1990, and as IEE Korea Section Chairman in 1992. He is currently retiring from Yonsei University. His research interests are computer networks, data communication, and graph theory.