
WIRELESS CHANNEL MODELING AND MALWARE DETECTION USING

STATISTICAL AND INFORMATION-THEORETIC TOOLS

By

Syed Ali Khayam

A DISSERTATION

Submitted to Michigan State University

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical and Computer Engineering

2006

ABSTRACT

WIRELESS CHANNEL MODELING AND MALWARE DETECTION USING STATISTICAL

AND INFORMATION-THEORETIC TOOLS

By

Syed Ali Khayam

This is a bipartite thesis that tackles two different research problems: (i) medium

access control (MAC) layer wireless channel modeling and applications of the models in

design, analysis and simulations of wireless systems; and (ii) malicious software

(malware) detection at network endpoints. For both problems, we collect extensive new

datasets which are analyzed and modeled using statistical and information-theoretic tools.

In the first part of this thesis, we provide analysis and modeling of bit-errors at the

802.11b MAC layer. We show that the bit-errors at 2 Mbps and 5.5 Mbps can be modeled

by high-order full-state Markov (FSM) chains. Bit-errors at 11 Mbps are shown to have

long-range dependence (LRD), and consequently a multifractal wavelet model (MWM) is

used to model these LRD bit-errors. The complexity of FSM chains is an exponential

function of the bit-error process’ memory-length. To mitigate the exponential FSM

complexity, we derive guidelines for accurate approximation of an FSM chain of

arbitrary memory-length. These guidelines lead to a novel and accurate constant-

complexity model (CCM) which always consists of five states irrespective of a process'

memory-length.

Two applications of the proposed channel models are explored. First, we use the

models in a novel maximum-likelihood header estimation framework which can be used

by wireless multimedia applications to realize considerable throughput improvements.

Trace-driven wireless video simulations show that the proposed header estimation

framework provides significant improvements over existing techniques. Second, we use

protocol goodput and retransmission metrics to show that inaccurate channel models can

lead to extremely misleading simulation and analytical results. The models proposed in

this thesis, however, provide highly accurate estimates of goodput and retransmissions.

In the second part of this thesis, we propose three endpoint-based anomaly detection

techniques that detect self-propagating malware in real-time by observing deviations

from a behavioral model derived from a benign data profile. In the first technique, we

leverage the Kullback-Leibler (K-L) information divergence of real-time source and

destination ports’ distributions to characterize deviations from the distributions observed

in the benign traffic profile. Experiments using actual endpoint and malware data

demonstrate that the source and destination ports’ distributions are perturbed significantly

on a compromised endpoint. K-L perturbations are used to train support vector machines

which provide almost 100% detection rates and negligible false alarm rates.

The remaining two malware detection techniques proposed in this thesis employ

perturbations in the distribution of keystrokes that are used to initiate network sessions.

We show that the keystrokes’ entropy increases and the session-keystroke mutual

information decreases when an endpoint is compromised by a self-propagating malware.

These two types of perturbations are used for real-time malware detection. Both detectors

provide almost 100% detection rates and very low false alarm rates.

Copyright by SYED ALI KHAYAM 2006


ACKNOWLEDGMENTS

I would like to thank my family for always respecting and supporting my professional

and academic goals. I also thank my academic advisor, Professor Hayder Radha, for

always encouraging me to think out-of-the-box and for helping me identify and refine

research ideas. I sincerely thank my friends, family members and colleagues in WAVES

lab who allowed me to collect network traffic data on their computers. Aparna, Mujahid,

Dmitri and Farshad deserve special mention here for discussing and critiquing the theory,

experiments and writing of my research papers. I also thank Shardha who was a great

friend during my first year, and whom I regrettably forgot to acknowledge in my Master's

thesis. I must also acknowledge the Higher Education Commission of Pakistan and the

National Science Foundation of USA for their continued financial support during my

M.S. and Ph.D. studies. I thank my Ph.D. committee members and Professor Rong Jin for

their technical and editorial guidance. Finally, I thank those associate editors and

anonymous reviewers who gave constructive feedback on my papers. That feedback has

definitely improved the quality of this thesis.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Part A Statistical Models of MAC Layer Wireless Channels and their Applications
CHAPTER A.1 Introduction
  A.1.1 Overview of Contributions
  A.1.2 Organization of this Part
CHAPTER A.2 Related Work
  A.2.1 Channel Modeling
  A.2.2 Cross-Layer Design for Wireless Multimedia
CHAPTER A.3 Background
  A.3.1 802.11b Wireless Networks
  A.3.2 Autocorrelation of Random Processes
  A.3.3 Discrete-Time Markov Chains
  A.3.4 Burst Representation of Binary Wireless Traces
  A.3.5 The Gilbert Channel Model
  A.3.6 Full-State Markov Chains for Wireless Channels
  A.3.7 Long-Range Dependent Processes
  A.3.8 The Multifractal Wavelet Model
  A.3.9 Performance Evaluation Measure
CHAPTER A.4 Empirical Analysis and Accurate Modeling of Wireless Channels
  A.4.1 Wireless Trace Collection
  A.4.2 Empirical Analysis of 802.11b Bit-Errors
    A.4.2.1 Autocorrelation Analysis
    A.4.2.2 Preliminary Empirical Analysis of FSM Chains
    A.4.2.3 Long-Range Dependence in 11 Mbps Bit-Errors
      A.4.2.3.1 LRD Evaluation by Observing Energy at Different Scales
      A.4.2.3.2 LRD Evaluation using Variance-Time Diagrams
      A.4.2.3.3 LRD Evaluation using the Periodogram
  A.4.3 Accurate Modeling of 802.11b Bit-Errors
    A.4.3.1 Bit-Error Modeling at 5.5 Mbps
    A.4.3.2 Bit-Error Modeling at 2 Mbps
    A.4.3.3 Bit-Error Modeling at 11 Mbps
      A.4.3.3.1 The Multifractal Wavelet Model
      A.4.3.3.2 ENK-based Performance Evaluation
      A.4.3.3.3 Performance in Capturing Energy at Different Scales
      A.4.3.3.4 Performance in Capturing the Variance-Time Characteristics
  A.4.4 Discussion
CHAPTER A.5 Complexity Reduction for Markov Channels
  A.5.1 The Hierarchical Markov Model
  A.5.2 The Hidden Markov Model
  A.5.3 FSM Observations
  A.5.4 Observations about FSM Chains
  A.5.5 Markov Chain Lumpability
    A.5.5.1 Lumpability for Wireless Bit-Error Channels
    A.5.5.2 Folded Markov Chains
    A.5.5.3 Evaluation of Folded Markov Chains
  A.5.6 Complexity Reduction by Approximating an FSM Chain’s Good- and Bad-Burst Behavior
    A.5.6.1 Simplification of Good-bursts Distribution
    A.5.6.2 Simplification of Bad-bursts Distribution
    A.5.6.3 Guidelines for Approximating an FSM chain
  A.5.7 Constant-Complexity Model
    A.5.7.1 Performance of the CCM at 2 Mbps
    A.5.7.2 Performance of the CCM at 5.5 Mbps
  A.5.8 Discussion
CHAPTER A.6 Channel Model Based Header Estimation for Wireless Multimedia
  A.6.1 FEC Redundancy Lower Bounds for UDP, UDP-Lite and Header Estimation
    A.6.1.1 Redundancy Bounds on the q-ary Symmetric Channel
      A.6.1.1.1 FEC Redundancy Bound on a UDP based Protocol Stack
      A.6.1.1.2 FEC Redundancy Bound on a UDP-Lite based Protocol Stack
      A.6.1.1.3 FEC Redundancy Bound on a Header Estimation based Protocol Stack
      A.6.1.1.4 Comparison of the FEC Redundancy Bounds
    A.6.1.2 Redundancy Bounds on the Gilbert Channel
      A.6.1.2.1 Bound on a UDP based Protocol Stack
      A.6.1.2.2 Bound on a UDP-Lite based Protocol Stack
      A.6.1.2.3 Bound on a Header Estimation based Protocol Stack
      A.6.1.2.4 Comparison of the FEC Redundancy Bounds
    A.6.1.3 Discussion
  A.6.2 Maximum-Likelihood Header Estimation Framework
    A.6.2.1 Functionality at and below a Receiver’s MAC layer
    A.6.2.2 The Header Estimation Module
    A.6.2.3 Processing at a Receiver’s Network, Transport and Application Layers
  A.6.3 Likelihood Functions for Header Estimation
    A.6.3.1 Header Estimation Likelihood Function for FSM Chains
    A.6.3.2 Header Estimation Likelihood Function of MWM
    A.6.3.3 Extending the FSM Likelihood Function to the CCM
  A.6.4 Performance Evaluation of the Header Estimation Framework
    A.6.4.1 Experimental Setup
    A.6.4.2 Throughput Performance
    A.6.4.3 Comparison of Packet Drops
    A.6.4.4 False Alarm Rate
    A.6.4.5 FEC Performance
    A.6.4.6 Video Performance
  A.6.5 Discussion
CHAPTER A.7 Impacts of Ignoring Channel Memory on Analysis and Simulation of Wireless Systems
  A.7.1 Goodput of an Unreliable Protocol
    A.7.1.1 Goodput of a Wireless Channel
    A.7.1.2 Goodput of a Binary-Symmetric Channel Model
    A.7.1.3 Goodput of a Gilbert Channel Model
    A.7.1.4 Goodput of a Full-state Markov Channel Model
    A.7.1.5 Goodput of a Constant-Complexity Channel Model
    A.7.1.6 Comparison of Estimated Goodputs
  A.7.2 Retransmissions of a Reliable Protocol
    A.7.2.1 Expected Retransmissions on a Wireless Channel
    A.7.2.2 Comparison of Estimated Retransmissions
CHAPTER A.8 Conclusions and Future Work
Part-A References
Part B Self-Propagating Malware Detection at Network Endpoints using Information-Theoretic Tools
CHAPTER B.1 Introduction
  B.1.1 Overview of Contributions
  B.1.2 Organization of this Part
CHAPTER B.2 Related Work
CHAPTER B.3 Background
  B.3.1 Self-Propagating Malware
  B.3.2 Support Vector Machines
CHAPTER B.4 Data Collection and Simulation
  B.4.1 Benign Traffic-Keystroke Profiles
  B.4.2 All-Keystrokes’ Profiles
  B.4.3 Malware Classification
  B.4.4 Real Malware
  B.4.5 Simulated Malware
  B.4.6 Inserting Malware Data in Benign Traffic Profiles
CHAPTER B.5 Malware Detection using Traffic Features
  B.5.1 Malware Detection Using Sample Entropy
    B.5.1.1 Entropy of Source and Destination Ports
    B.5.1.2 Entropy-based Traffic Perturbations in the Infected Profiles
  B.5.2 Malware Detection Using Information Divergence
    B.5.2.1 Kullback-Leibler Divergence of Source and Destination Ports
    B.5.2.2 K-L-based Traffic Perturbations in the Infected Profiles
    B.5.2.3 Evaluating Traffic Perturbations with Other Information Divergences
  B.5.3 Leveraging K-L Perturbations in an SVM-based Framework
    B.5.3.1 SVM Training
    B.5.3.2 Performance Evaluation and Comparison with Existing Techniques
  B.5.4 Summary and Discussion
CHAPTER B.6 Malware Detection using Joint Network-Host Features
  B.6.1 Correlation in the Session-Key Data
  B.6.2 Malware Detection Using Keystroke Entropy
    B.6.2.1 Definition of Keystroke Entropy
    B.6.2.2 Entropy Perturbations in the Infected Profiles
  B.6.3 Malware Detection Using Session-Key Mutual Information
    B.6.3.1 Mutual Information of Sessions and Keys
    B.6.3.2 Mutual Information Perturbations in the Infected Profiles
    B.6.3.3 Automated Detection using Keystroke Perturbations
CHAPTER B.7 Attacks and Countermeasures
  B.7.1 Mimicry Attack
  B.7.2 Attack by Acquiring System-Level Privileges
CHAPTER B.8 Conclusions And Future Work
Part-B References


LIST OF TABLES

Table 1. Packet-Level Statistics at 2, 5.5 and 11 Mbps

Table 2. Performance of MWM and FSM for the 11 Mbps Bit-Error Process

Table 3. Performance of the hMM for 5.5 Mbps Bit-Error Process

Table 4. Performance of the HMM for the 5.5 Mbps Bit-Error Process

Table 5. Empirical Evidence in Support of Observation 2

Table 6. Statistics of the Benign Profiles

Table 7. Information of Malware Used in This Study


LIST OF FIGURES

Figure 1. The Gilbert channel model [81].

Figure 2. Set up for collection of wireless bit-error traces.

Figure 3. Autocorrelation of bit-error traces.

Figure 4. Percentage of unused FSM states at 2 and 5.5 Mbps.

Figure 5. Aggregates of the 11 Mbps energy process at different time scales.

Figure 6. Variance-time diagrams of two 11 Mbps bit-error traces.

Figure 7. Logscale periodogram of two 11 Mbps bit-error traces.

Figure 8. Performances of varying order FSM chains for the 5.5 Mbps MAC layer bit-error process.

Figure 9. Performances of varying order FSM chains for the 2 Mbps MAC layer bit-error process.

Figure 10. Probability mass functions for good- and bad-bursts random variables derived from an 11 Mbps trace. (Only the probabilities of small bursts are shown here.)

Figure 11. Energy processes of actual and synthesized bit-error traces.

Figure 12. Variance-time diagrams of varying order FSM chains for the 11 Mbps bit-error process.

Figure 13. Variance-time diagrams of the MWM for the 11 Mbps bit-error process.

Figure 14. The hierarchical Markov model (hMM) [18].

Figure 15. Transition possibilities for an FSM chain (memory-length, $k = 4$).

Figure 16. Aggregate states $S_i$ and $S_j$ containing FSM states $n, m$ and $2n, 2m$, respectively.

Figure 17. Performance of FMCs formed by folding a 512 state FSM to 256, 128, 64, 32, 16, 8, 4 and 2 states; the FSM process is trained using a 5.5 Mbps trace.

Figure 18. State transitions of an FSM with memory-length $k$ and a good-burst of length $l \geq k$.

Figure 19. State aggregation and transitions for the CCM. Each box represents an aggregate CCM state. The number(s) inside a CCM state are the aggregated FSM states.

Figure 20. ENK based modeling performance versus complexity for the 2 Mbps bit-error process.

Figure 21. ENK based modeling performance versus memory-length for the 2 Mbps bit-error process.

Figure 22. ENK based modeling performance versus complexity for the 5.5 Mbps bit-error process.

Figure 23. ENK based modeling performance versus memory-length for the 5.5 Mbps bit-error process.

Figure 24. Minimum expected FEC redundancies of UDP, UDP-Lite and Ideal Header Estimation over a $q$-ary symmetric channel; $m = 8$, $q = 256$, $L = 30$, $n_H = 60$, $n_D = 452$.

Figure 25. Minimum expected FEC redundancies of UDP, UDP-Lite and Ideal Header Estimation over a Gilbert channel; $m = 8$, $L = 30$, $n_H = 60$, $n_D = 452$.

Figure 26. Interactions between the UDP-based header estimation module and different layers of a wireless receiver’s protocol stack; modified protocol stack layers are shown in different colors and dotted lines represent communications that are not related to packet reception.

Figure 27. Average packet drops for UDP Normal, UDP-Lite and UDP with Header Estimation at different data rates and for varying number of video streams per receiver; each point is averaged over $3 \times (\text{\# of video streams}) \times 5 \times 25$ received video streams.

Figure 28. Codeword construction for video FEC simulations.

Figure 29. Average FEC redundancy required by UDP Normal, UDP-Lite and UDP with Header Estimation at different data rates of an 802.11b LAN; each point is averaged over $3 \times 5 \times 5 \times 25 = 1875$ received video streams.

Figure 30. Average PSNR of video sequences for UDP Normal, UDP-Lite and UDP with Header Estimation using a 30 byte RS codeword with 2 parity bytes; each graph is averaged over $3 \times 5 \times 5 = 75$ received video streams.

Figure 31. Comparison of the average goodput of the actual traces with the goodput estimates provided by BSC, Gilbert, 1024-state Markov, and 5-state CCM models; each result is averaged over five traces.

Figure 32. Comparison of the number of retransmissions per packet estimated by BSC, Gilbert, 1024-state Markov, and 5-state CCM models; each result is averaged over five traces.

Figure 33. Number of retransmissions per packet without the BSC model.

Figure 34. Number of retransmissions per packet without the BSC model.

Figure 35. Source and destination port entropies at infected endpoints. Infection start times are marked with a circle. Infections in (a), (b), and (c) last approximately 15 minutes, while that in (d) lasts approximately one minute. Each non-overlapping time-window is 20 seconds.

Figure 36. Source and destination ports’ K-L divergences at infected endpoints.

Figure 37. Jensen-Shannon (J-S), K- and resistor-average (R-A) divergences of source and destination ports at infected endpoints.

Figure 38. Comparison of detection and false-alarm rates of the proposed K-L/SVM-based malware detector with maximum-entropy and rate-limiting detectors. Each point is averaged over 12 malware with 100 random infections per malware per endpoint.

Figure 39. A generalized flow diagram of the proposed K-L/SVM-based malware detector. The shaded area contains real-time components.

Figure 40. Normalized histograms of 20 most-used session initiation keystrokes. Histograms are generated from the session-key data. Virtual key codes 1 and 13 correspond to the left mouse click and the Enter key, respectively [48].

Figure 41. Normalized histograms of 20 most-used keystrokes. Histograms are generated from the all-keys data. Virtual key codes 40, 38 and 17 correspond to the down arrow key, the up arrow key and the control key, respectively [48].

Figure 42. Entropy of the keystroke histograms at infected endpoints. Infection start times are marked with a circle. Infections last approximately 15 minutes. Each non-overlapping time-window is 60 seconds.

Figure 43. Mutual information of the session and keystroke random variables at infected endpoints. Infection start times are marked with a circle. Infections last approximately 15 minutes. Each non-overlapping time-window is 60 seconds.

Figure 44. Comparison of detection and false-alarm rates of the mutual information based and keystroke-entropy based malware detectors with maximum-entropy [14] and rate-limiting [20] detectors. Each point is averaged over 9 malicious codes with 100 random infections per malicious code per endpoint.


PART A

STATISTICAL MODELS OF MAC LAYER WIRELESS CHANNELS AND THEIR

APPLICATIONS


CHAPTER A.1 INTRODUCTION

Error modeling has been used to improve design of communication channels and

systems for many decades [1]–[7]. Stochastic models of wireless medium access control

(MAC) layer packet-losses and bit-errors have recently attracted significant research

attention [8]–[30]. The main objective of analyzing and modeling MAC-to-MAC [29] or

residual [11] bit-errors is to develop accurate simulators which allow experimentation

without having the actual network in place. Moreover, bit-error analysis and modeling

provide important insights into characteristics of the underlying error random process.

This insight is essential for design and performance evaluation of a wide range of

wireless protocols, applications and services. For instance, accurate channel models can

facilitate design, parameter tuning and verification of the following wireless protocols:

• Wireless congestion control protocols, instead of relying on MAC layer

retransmissions, can use accurate MAC layer error models to differentiate between

losses due to congestion, medium degradation or mobility. The inability of wired

congestion control algorithms to differentiate between different types of losses and the

consequent bandwidth underutilization have been repeatedly highlighted by prior

studies [10], [31]–[40]. Knowledge of losses due to channel errors, which is assumed

in many wireless congestion control solutions, can be provided by a real-time MAC-

to-MAC channel model. Understanding of error frequency and burstiness is also

instrumental in parameter tuning of congestion control protocols.


• Cross-layer protocols can use a real-time channel model to choose between reliable

(e.g., using MAC layer retransmissions) and cross-layer (e.g., ignoring data payload

errors [41]–[51]) protocols.

• Reliable routing protocols [52]–[55] for mobile networks can use MAC-to-MAC

channel models to distinguish reliable routes from shortest routes to different destinations,

if the model is able to provide real-time error characterization at different hops of the

network.

• MAC protocols can decide when to increase/decrease the physical transmission data

rate based on real-time channel estimation. An accurate channel model can predict

future error characteristics, thereby saving the MAC layer protocol the overhead of

switching to an inaccurate lower/higher data rate based on short-term observations.

Similarly, design of many wireless applications can be improved by accurate channel

models. For instance:

• Real-time channel estimation provided by an accurate model can be employed by rate-

adaptive applications to perform channel- and/or source-coding rate adaptation for

efficient bandwidth utilization.

• Design of effective error-control schemes for different wireless applications requires a

thorough understanding of errors above the physical layer [56].

• Error-resilience features of contemporary multimedia codecs can be effectively

designed and verified with knowledge of MAC layer error characteristics.

Note that most benefits of a wireless MAC layer channel model can be realized if the

model is able to provide real-time and online channel characterization and prediction. In


complexity- and power-constrained wireless environments, such channel characterization

is only possible with a low-complexity model. Despite some recent interest in reducing

the complexity of wireless models [23]–[29], development of accurate, pragmatic and

low-complexity wireless channel models is still an open problem.

A.1.1 Overview of Contributions

In this part of the thesis, we analyze and model bit-errors propagated to the 802.11

MAC layer at three physical layer data rates of an 802.11b LAN: 2, 5.5 and 11 Mbps

[57], [58]. Our objective is to develop low-complexity MAC-to-MAC channel models

without compromising modeling accuracy. To that end, Chapter A.4 focuses on empirical

analysis and “accurate” modeling of the bit-errors observed at 2, 5.5 and 11 Mbps. After

identifying accurate bit-error models, in Chapter A.5 we reduce the complexity of these

models by approximating their behavior. In Chapter A.6, the accurate and low-

complexity channel models are used in a header estimation framework to improve

wireless multimedia quality. As a final contribution of this part, Chapter A.7 shows that

inaccurate channel models can provide extremely misleading results for critical wireless

performance metrics.

Chapter A.4 shows that the MAC-to-MAC bit-error characteristics vary with changes

in the physical layer data rate. We show that the error-rate is quite low at 2 Mbps as

compared to 5.5 and 11 Mbps. At 2 Mbps, approximately 95% of the packets are

received without errors, which is a testament to the high physical layer robustness at 2

Mbps. The loss-rate subsequently increases with an increase in data rate.

We observe that the 2 and 5.5 Mbps bit-errors exhibit decaying correlation and a low

memory-length can be identified. Thus the bit-errors at 2 and 5.5 Mbps can be modeled

using Markov chains [59]. However, the bit-errors at 11 Mbps exhibit very high

correlation even at large lags. Such high correlation is reminiscent of long-range

dependence (LRD) [60] in the 11 Mbps bit-error process. We substantiate the LRD

notion through aggregation, variance-time and periodogram analyses.

Bit-errors at 2 and 5.5 Mbps are accurately modeled using high-order full-state

Markov (FSM) chains [59]. The LRD nature of the 11 Mbps bit-errors renders traditional

stochastic models (e.g., Markov, Poisson) ineffective. Therefore, we employ a

multifractal wavelet model (MWM) [61]–[63] to characterize the 11 Mbps bit-error

random process. For comparison, we also model the 11 Mbps bit-error process using

FSM chains of varying orders. We demonstrate that the MWM outperforms the Markov

models in both complexity and channel approximation.

The complexity of FSM chains increases exponentially with respect to the memory-

length. In Chapter A.5, we mitigate the exponential FSM complexity by approximating

the FSM behavior using low-complexity models. We first show that hierarchical, hidden

and lumped Markov models cannot capture the complex behavior of FSM chains.

Consequently, we directly analyze FSM chains and derive important guidelines that

should be followed to realize accurate, effective and low-complexity models. These

guidelines are used to propose a constant-complexity model (CCM) [30] that always

consists of five states irrespective of the underlying process’ memory-length. At both 2

and 5.5 Mbps, the 5-state CCM provides performance that is comparable to the

exponential-complexity FSM chains and better than the linear-complexity models [29].

In Chapter A.6, we leverage the proposed low-complexity channel models in a novel

cross-layer wireless multimedia framework. Under the proposed header estimation

framework, corrupted headers of received packets are estimated using the MAC-to-MAC

channel models. The corrupted packets are in turn passed to the application layer, which

uses forward error correction (FEC) to recover the corrupted data. Trace-driven wireless

video simulations show that significantly better bandwidth utilization and video quality

than UDP [64] and UDP-Lite [41]–[43] protocols can be achieved by employing the

header estimation framework. We also show analytically that an ideal header estimation

scheme will always perform better than UDP and UDP-Lite under realistic wireless

channel conditions.

As a final contribution of this part, Chapter A.7 shows that an inaccurate channel

model that ignores channel memory can provide extremely misleading results. We use

two critical wireless performance metrics, namely goodput and retransmissions, and show

that highly inaccurate estimates of these metrics are obtained if memory-less or 1st order

channel models are used. On the other hand, FSM and CCM channel models, which account
for channel memory, provide very accurate goodput and retransmission estimates.

A.1.2 Organization of this Part

The rest of this part is organized as follows. Chapter A.2 provides an overview of the

related work in this area. Chapter A.3 provides background that is required to understand

the material presented in this part. Chapter A.4 focuses on empirical analyses and

“accurate” modeling of the bit-errors at 2, 5.5 and 11 Mbps. Chapter A.5 reduces the

complexity of the proposed models by evaluating low-complexity alternatives. Chapter

A.6 proposes a header estimation framework for wireless multimedia. Chapter A.7 shows

the impact of ignoring channel memory on the design of wireless systems.

CHAPTER A.2 RELATED WORK

A.2.1 Channel Modeling

Recently, link layer modeling for reliable protocols has received some research

attention [9], [10]. In the context of delay-sensitive traffic, a previous study derived

conditions under which block-based residual/MAC-to-MAC errors can be modeled using

a Markov chain [11]. For AT&T WaveLAN, a trace-based link layer investigation was

conducted in [13]. In the context of link layer modeling, Konrad et al. performed analysis

and presented a Markov-based Trace Analysis (MTA) model algorithm for frame-errors

on GSM networks [14], [15]. Ji et al. [16], [17] compared the performance of the MTA, a
full-state k-th order model, a hidden Markov model and an extended ON (error-free)/OFF
(error-filled) model in capturing the GSM (link layer) frame losses. Based on

the comparison provided in [16], [17], it was concluded that an extended ON/OFF model

with geometric distributions governing the state holding times provides significantly

better results than the other three modeling paradigms.

In view of the increasing popularity of 802.11 networks, we studied the 802.11b link

layer in order to facilitate design of effective cross-layer error-control schemes for the

support of real-time services [18], [19], [45]. Since most error-control schemes operate

on byte and/or packet boundaries, we proposed Markov-based models at the packet and

byte levels. We showed that a 2-state Markov model can characterize the packet-loss

process and a hierarchical Markov Model was proposed for the byte-level errors [18].

Willig et al. [26] have performed the only prior study that attempts to analyze and model

bit-error behavior of 802.11b networks with modeling accuracy as a performance

criterion. There are fundamental differences between the measurements, analyses and

modeling of [26] and this thesis. In [26] the authors attempt to capture the impact of

physical layer parameters (e.g., modulation type, antenna diversity etc.) on the bit-error

rate of a wireless LAN in an industrial setting. In contrast, this thesis performs all
experiments with default physical layer parameters, thereby capturing a realistic channel
that is typical of common home/business/classroom settings. Also, in [26] the error-prone
11 and 5.5 Mbps channels were not evaluated.

Chen et al. [24], [25] investigated Markov chain lumpability to reduce the complexity

of wireless channel models. Since lumpability constraints are too stringent for practical

wireless channels, Chen et al. [24], [25] resorted to an ON-OFF model that stochastically

bounds the sojourn time distributions of the lumped good and bad states. However, and

as asserted by [11], an ON-OFF model assumes geometric (memory-less) distributions

for good and bad periods which is not a valid assumption in most real-life channels.

Bipartite models were proposed for wireless channels by Willig [26]. The accuracy of

bipartite models depends on a selected value of complexity. We argue that model

accuracy is not optional and even a low-complexity model should provide the requisite

accuracy. Moreover, bipartite models require a large number of parameters to achieve a

certain level of accuracy. Köpke et al. [28] used chaotic maps to model 802.11b bit-errors

at low data rates (1 Mbps and 2 Mbps). Due to the focus on low data rates, in [28] it was

observed that: (a) probability of bit-error bursts of more than two bits is very low, and (b)

there is almost no correlation in error traces. The chaotic map model in [28] ignores the

correlation and captures only the heavy-tail behavior of bit-errors. While this assumption

of “no autocorrelation in data” might be suitable for the particular experimental setup

used in [28], it is not generically applicable to network error and loss data. In [20], it was

shown that low-complexity Markov models (such as hidden and hierarchical Markov

models) are inadequate for modeling of an 802.11b link layer bit-error wireless channel.

In [29], two linear-complexity models were proposed which were reasonably effective in

capturing 802.11b bit-errors.

A.2.2 Cross-Layer Design for Wireless Multimedia

The traditional UDP protocol detects and drops corrupted packets using a checksum

operating at the MAC layer [64]. Such packet drops result in significant bandwidth

wastage, especially in the context of error-resilient multimedia applications which can

inherently tolerate some errors and losses in the received content. Larzon et al. proposed

a UDP-Lite protocol which allows delivery of partially corrupted packets to the

application layer [41]–[44]. In its commonly-used form, UDP-Lite disables the MAC

layer checksum while the transport layer partial checksum only covers transport and

application layer headers. Errors in the application layer payload are simply ignored.

Note that support of partial checksum requires modifications to the multimedia senders,

receivers and/or (multicast or multihop) intermediate nodes.

Many wireless cross-layer studies have shown that UDP-Lite performs better than

UDP on contemporary wireless networks [18], [41]–[50]. In [18], it was shown that over

802.11b LANs an application layer FEC must be employed in conjunction with UDP-

Lite. Otherwise the partially corrupted packets delivered by UDP-Lite result in almost

unintelligible multimedia quality. It was also shown in [18] that UDP-Lite over 802.11b

LANs can only work at the 2 and 5.5 Mbps data rates. At the 11 Mbps data rate, the

errors and losses in the received content are too high for effective FEC-based recovery.

CHAPTER A.3 BACKGROUND

This chapter provides the background that is required to understand the contributions

of this part.

A.3.1 802.11b Wireless Networks

Due to their high data rates and use of the time-tested TCP/IP protocol suite, 802.11b

networks have experienced widespread deployment. These LANs are finding their way

into homes and businesses ubiquitously. However, like other wireless technologies,

802.11b networks also suffer from severe quality degradation in the presence of physical

obstructions and inter-symbol interference. Two modes of operation are supported in

802.11 networks [57], [58]: (i) ad hoc mode in which wireless nodes can communicate

with each other directly, and (ii) infrastructure mode in which wireless nodes are

arbitrated using a central entity called an access point (AP).

All 802.11b-compliant networks support four basic physical layer data rates of 1

Mbps, 2 Mbps, 5.5 Mbps and 11 Mbps. Increase in the data rate reduces the robustness of

the 802.11b physical layer. In the infrastructure mode, if the number of retransmission

requests exceeds a certain threshold, the AP drops down to a lower data rate than its

current data rate. To detect frames that need retransmission, 802.11b relies on a 32-bit
frame check sequence (FCS), a checksum computed over the entire MAC layer frame. Positive
acknowledgement (ACK) frames are employed to signal successful transmission of data
frames. If a frame fails the checksum, it is dropped at the receiver’s MAC layer. The
sender, after timing out, schedules a retransmission.
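The drop-on-checksum-failure behavior described above can be illustrated with a minimal sketch. It assumes that Python's `zlib.crc32` is an acceptable stand-in for the 32-bit FCS; the helper names `append_fcs` and `receive`, the byte order of the appended checksum, and the sample payload are hypothetical choices made only for illustration, not details of the 802.11b standard or of the drivers used in this thesis.

```python
import zlib


def append_fcs(body: bytes) -> bytes:
    """Append a 32-bit CRC to the frame body (byte order chosen for illustration)."""
    return body + zlib.crc32(body).to_bytes(4, "little")


def receive(frame: bytes):
    """Return the frame body if the checksum matches, otherwise None (frame dropped)."""
    body, fcs = frame[:-4], frame[-4:]
    if zlib.crc32(body).to_bytes(4, "little") == fcs:
        return body  # a real receiver would now send a positive ACK
    return None      # dropped: the sender times out and schedules a retransmission


sent = append_fcs(b"payload bytes")
corrupted = bytes([sent[0] ^ 0x01]) + sent[1:]  # flip a single bit in transit

print(receive(sent))       # intact frame is delivered
print(receive(corrupted))  # corrupted frame is dropped
```

Note that a single flipped bit is enough to discard the whole frame, which is the behavior the modified drivers in this study bypass in order to observe bit-errors.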

A.3.2 Autocorrelation of Random Processes

Let $X(n_1)$ and $X(n_2)$ be two random variables derived from a random process
$X(t)$. The correlation coefficient of these random variables is defined as [65]

\[ \rho(\eta) = \frac{E\{X(0)\,X(\eta)\} - E\{X(0)\}\,E\{X(\eta)\}}{\sigma_{X(0)}\,\sigma_{X(\eta)}}, \tag{A.1} \]

where $E\{X\}$ and $\sigma_X$ represent the mean and standard deviation of the random variable
$X$, and $\eta$ is the lag between the two samples. When evaluating a dataset, the sample mean and the sample standard deviation are

used to compute the correlation coefficient of (A.1). This sample autocorrelation

coefficient for different values of lag is a direct measure of the level of temporal

dependence in the random process. Lag beyond which the autocorrelation coefficient

drops to an insignificant value corresponds to the memory-length of a random process.

Autocorrelation of a Markov source yields the order of the model required to accurately

characterize the source [66], [67].
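As a rough illustration of how the sample autocorrelation and the resulting memory-length could be computed from a binary trace, consider the sketch below. The function names, the synthetic trace, and the rule "first lag at which the coefficient falls below the threshold" are assumptions made for this example; the threshold value 0.15 is the empirical one quoted later in Section A.4.2.1.

```python
def sample_autocorrelation(x, lag):
    """Sample estimate of (A.1) at the given lag, using the whole-trace mean and variance."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        raise ValueError("autocorrelation is undefined for a constant trace")
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var


def memory_length(x, threshold, max_lag):
    """First lag at which the sample autocorrelation falls below the threshold.

    Returns None when no such lag exists up to max_lag.
    """
    for lag in range(1, max_lag + 1):
        if sample_autocorrelation(x, lag) < threshold:
            return lag
    return None


# Hypothetical trace: bursts of 5 corrupted bits separated by 15 error-free bits.
trace = ([1] * 5 + [0] * 15) * 200
print(memory_length(trace, threshold=0.15, max_lag=40))
```

For a trace whose correlation never drops below the threshold within `max_lag`, the function returns `None`, which corresponds to the situation where no low-order memory-length can be identified.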

A.3.3 Discrete-Time Markov Chains

Markov chains are employed to model statistical data with short-term temporal
dependence. Let a stochastic process $X_n$ take on values denoted by non-negative
integers $0, 1, \ldots, M$. If $X_n = i$ then the process is said to be in state $i$ at time $n$.
Whenever the process is in state $i$ there is a fixed probability that the next state of the
process will be state $j$. If that probability can be expressed as

\[ \Pr\{X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_1 = i_1, X_0 = i_0\} = \Pr\{X_{n+1} = j \mid X_n = i\}, \tag{A.2} \]

for all states $i_0, i_1, \ldots, i_{n-1}, i, j$ and all $n \ge 0$, then $X_n$ is a Markov chain [59]. The
property given in (A.2) is commonly referred to as the Markov Property. Thus, for a
Markov chain the conditional distribution of any future state $X_{n+1}$, given the past states
$X_{n-1}, \ldots, X_1, X_0$ and the present state $X_n$, is independent of the past states and depends
only on the present state. Equation (A.2) is also referred to as the homogeneity property since
it ensures that the transition probabilities do not vary with time.

Let $p_{i,j} = \Pr\{X_{n+1} = j \mid X_n = i\}$ denote the probability of transiting to state $j$
from $i$. Since $p_{i,j}$ represents a probability measure, it exhibits the following properties:
(i) $p_{i,j} \ge 0$ for all $i, j \ge 0$, and (ii) $\sum_{j=0}^{M} p_{i,j} = 1$ for all $i = 0, 1, \ldots, M$. The probability

of transiting to the next state can be represented in a matrix form. This matrix is referred

to as the one-step state transition probability matrix.

The steady-state or stationary probabilities of a Markov chain represent the long-run

proportion of the time spent in each state. Once the transitional probabilities of a Markov

chain are known, the steady-state probabilities of being in a particular state are the unique

non-negative solutions of the following linear system of equations:

\[ \pi_j = \sum_{i=0}^{M} \pi_i\, p_{i,j}, \quad j = 0, 1, \ldots, M, \qquad \sum_{j=0}^{M} \pi_j = 1. \]

For stationary Markov chains, the steady-state and transition probabilities do not vary

with time. Throughout this thesis, we use stationary Markov chains for modeling bit-error
processes. The memory-length of a Markov chain is also referred to as its order.
The discussion in the preceding section noted that autocorrelation analysis can be

performed on the realizations of a random process to determine the appropriate order of

the respective Markov chain. This observation will be used later to identify the orders of

Markov chains.
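The steady-state probabilities of a small chain can be obtained numerically. The sketch below does not solve the linear system directly; it substitutes simple power iteration (repeatedly applying the one-step transition matrix), which converges to the same solution under the assumption that the chain is irreducible and aperiodic. The function name and the example matrix are hypothetical.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100000):
    """Approximate the steady-state vector pi satisfying pi = pi P and sum(pi) = 1.

    P is a row-stochastic matrix given as a list of rows. Power iteration is used
    here as a stand-in for solving the linear system; it assumes an irreducible,
    aperiodic chain.
    """
    m = len(P)
    pi = [1.0 / m] * m  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(m)) for j in range(m)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical two-state chain: state 0 is left with probability 0.1, state 1 with 0.5.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary_distribution(P)
print(pi)
```

For chains with many states, a direct linear solve is usually preferable; the iteration is shown only because it needs nothing beyond the transition matrix itself.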

A.3.4 Burst Representation of Binary Wireless Traces

Wireless bit-error traces are generally represented as a binary time-series $\{x(i)\}_{i=1}^{l}$,
where $x(i) \in \{0, 1\}$ and $l$ is the length of the time-series. Throughout this thesis, we
define $x(i)$ as:

\[ x(i) = \begin{cases} 0, & \text{error-free bit} \\ 1, & \text{corrupted bit.} \end{cases} \]

Without loss of generality, a binary time-series can be represented as an interleaved
sequence of runs (bursts) of good and bad bits ($x(i) = 0$ and $x(i) = 1$), i.e.,
$(I_1, B_1), (I_2, B_2), \ldots, (I_l, B_l)$, where $I_i$ and $B_i$ represent the lengths of the $i$-th good and

bad bursts, respectively. Wireless channel modeling studies have established that this

binary data representation is rather suitable for definition and evaluation of a model

[14]–[17], [20]. The burst-lengths of good and bad bits are used for empirical

performance evaluations in this thesis. (Subsequent sections discuss this in further detail.)
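The burst representation is a run-length encoding of the binary trace. A minimal sketch, assuming the trace is held as a Python list of 0/1 values (the function name and the sample trace are hypothetical):

```python
from itertools import groupby


def burst_lengths(bits):
    """Split a binary trace into lengths of good (0) bursts and bad (1) bursts."""
    good, bad = [], []
    for value, run in groupby(bits):
        length = sum(1 for _ in run)
        (bad if value == 1 else good).append(length)
    return good, bad


good, bad = burst_lengths([0, 0, 0, 1, 1, 0, 1, 0, 0])
print(good, bad)
```

The two returned lists are the samples of the good-burst and bad-burst lengths from which empirical distributions can be built.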

A.3.5 The Gilbert Channel Model

The Gilbert channel was proposed in [81] to model channels with 1st order memory.

Since then, it has been used to model many wireless channels at bit, byte and packet

levels [9]–[11], [13]–[15], [18]–[20], [26]. The Gilbert model captures channel memory
through a two-state Markov chain having a good and a bad state. The probability of the
next (good or bad) symbol depends on whether the last received symbol was
good or bad. With $p_{gb}$ and $p_{bg}$ denoting the good-to-bad and bad-to-good transition
probabilities, the steady-state probabilities of being in the bad and good states are
respectively expressed as:

\[ \pi_b = \frac{p_{gb}}{p_{gb} + p_{bg}} \quad \text{and} \quad \pi_g = \frac{p_{bg}}{p_{gb} + p_{bg}}. \tag{A.3} \]

Clearly, $\pi_b + \pi_g = 1$.

Higher probabilities of staying in the present state (i.e., $p_{gg}$ and $p_{bb}$) indicate the
intensity of channel memory. A more appropriate measure to quantify channel memory
was proposed by Mushkin and Bar-David in [82], where memory $\mu$ of a Gilbert channel
was defined as:

\[ \mu = 1 - p_{gb} - p_{bg}. \tag{A.4} \]

Figure 1. The Gilbert channel model [81]. (Two-state diagram: states Good and Bad, with self-transition probabilities $p_{gg}$ and $p_{bb}$ and cross-transition probabilities $p_{gb}$ and $p_{bg}$.)

It can be easily seen that $-1 \le \mu \le 1$. Moreover, a closer look at the above equations reveals
that

\[ \mu = 0 \;\Rightarrow\; \pi_b = p_{gb} \ \text{and} \ \pi_g = p_{bg}. \tag{A.5} \]

In other words, when $\mu = 0$, the probability of getting a good or a bad symbol at any
time instance is independent of the last symbol value, that is, the channel behaves as a
memory-less channel. In [82], channels with $\mu > 0$ and $\mu < 0$ were referred to as
persistent and oscillatory memory channels, respectively. Real-life channels generally

have persistent memory.
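The relations (A.3) to (A.5) are easy to check numerically. The sketch below computes the steady-state probabilities and the memory of a Gilbert channel and also draws a synthetic trace from it; the function names, the seed, and the parameter values are hypothetical and are not fitted to any trace in this thesis.

```python
import random


def gilbert_summary(p_gb, p_bg):
    """Return (pi_g, pi_b, mu) from (A.3) and (A.4)."""
    pi_b = p_gb / (p_gb + p_bg)
    pi_g = p_bg / (p_gb + p_bg)
    mu = 1.0 - p_gb - p_bg
    return pi_g, pi_b, mu


def simulate_gilbert(p_gb, p_bg, n, rng):
    """Generate n symbols (0 = good, 1 = bad) from a two-state Gilbert chain."""
    state, out = 0, []
    for _ in range(n):
        out.append(state)
        leave = p_gb if state == 0 else p_bg
        if rng.random() < leave:
            state = 1 - state
    return out


print(gilbert_summary(0.1, 0.4))  # persistent memory: mu > 0
print(gilbert_summary(0.3, 0.7))  # mu = 0: memory-less, pi_b equals p_gb
trace = simulate_gilbert(0.1, 0.4, 1000, random.Random(0))
```

The second call illustrates (A.5): when the two transition probabilities sum to one, the stationary probabilities coincide with the transition probabilities and the previous symbol carries no information.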

A.3.6 Full-State Markov Chains for Wireless Channels

Wireless bit-error processes are generally bursty and have a memory-length of greater

than one bit, and therefore these processes cannot be modeled using the Gilbert model.

To make such a process comply with the Markov property of (A.2), a Markov chain is

defined such that at each time instance the process is characterized by as many bits as the

memory-length. At each time instance, a new bit is added to the memory-window and the

oldest bit is dropped from the memory-window. As mentioned before, memory-length of

a Markov chain is also referred to as its order.

For a memory-length of $k$ bits, a full-state Markov (FSM) chain [20] corresponds to
all the $2^k$ different possible combinations of $k$ consecutive bits. Transition probabilities
between states are computed by sliding a $k$-bit memory-window over the data and
counting the number of times a bit-pattern $[x_1, x_2, \ldots, x_k]$ is followed by another
bit-pattern $[y_1, y_2, \ldots, y_k]$. Note that the number of states of an FSM chain increases
exponentially with an increase in memory-length: $2^k$ states for a memory-length of $k$.
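The sliding-window counting procedure can be sketched as follows. This is an illustrative implementation under the assumption that states are stored as tuples of the last $k$ bits and that only visited states are kept; the function name and the toy trace are hypothetical.

```python
from collections import defaultdict


def fit_fsm(bits, k):
    """Estimate FSM transition probabilities with a sliding k-bit memory-window.

    Returns {state: {next_state: probability}}, where each state is a tuple of k bits.
    Only states that actually occur in the trace appear in the result.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(len(bits) - k):
        state = tuple(bits[t:t + k])              # current window
        nxt = tuple(bits[t + 1:t + k + 1])        # window after one new bit arrives
        counts[state][nxt] += 1
    probs = {}
    for state, row in counts.items():
        total = sum(row.values())
        probs[state] = {nxt: c / total for nxt, c in row.items()}
    return probs


probs = fit_fsm([0, 0, 1, 0, 0, 1, 0, 0, 1, 0], k=2)
for state, row in sorted(probs.items()):
    print(state, row)
```

In this toy trace the state (1, 1) is never visited, so it does not appear in the fitted chain even though an order-2 FSM nominally has four states. Each state has at most two successors, because the next window is determined by the single incoming bit.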

A.3.7 Long-Range Dependent Processes

Long-range-dependent processes belong to a generic class of scaling or self-similar

processes [68], [69]. Self-similar processes exhibit similar statistical behavior at different

scales – zooming into or out-of a sample path of the process gives a new process

realization which is statistically similar to the original. A self-similar process $X(t)$
satisfies the relation:

\[ X(t) \overset{d}{=} c^{H}\, X(t/c), \tag{A.6} \]

where $\overset{d}{=}$ represents equivalence in finite-dimensional distributions, $c$ is a scaling

(compression/dilation) factor and H is known as the Hurst parameter. Self-similar

processes are also referred to as H-ss processes. It is not possible to define a characteristic

scale for H-ss processes which implies that these processes are scale-invariant. A self-

similar process with stationary increments is referred to as an H-sssi process [68]–[70].

Long-range dependent (LRD) processes model stationary increments of a second-

order self-similar process. The Hurst parameter of an LRD process is $1/2 < H < 1$. Also,
the autocovariance $r[k]$ of an LRD process is of the form:

\[ r[k] \sim c_r\, k^{-(2-2H)}, \tag{A.7} \]

where $\sim$ represents asymptotic equivalence and $c_r$ is a positive and constant scaling
factor. From (A.7) and the constraint on $H$ it can be seen that summing the
autocorrelation function results in a divergent power series [71], $\sum_k r[k] = \infty$. Thus

all samples of an LRD process depend heavily on previous samples, thereby resulting in

occurrence of clusters of similar values. For the present binary process, this observation

simply implies long bursts of zeros and ones.

An important property of LRD processes is that they can be equivalently

characterized in terms of the behavior of the aggregated process:

\[ X^{(m)}[k] = \frac{1}{m} \sum_{i=(k-1)m+1}^{km} X[i], \tag{A.8} \]

where $m$ is the aggregation level. For an LRD process (and in general for all
second-order self-similar processes), $\operatorname{var}\{X^{(m)}[k]\} = m^{2H-2} \operatorname{var}\{X[k]\}$. Thus for an LRD
process, a log-log plot of $\operatorname{var}\{X^{(m)}[k]\}$ as a function of $m$ is strictly linear with a slope
of $2H - 2$ [70]. This plot, generally known as the variance-time diagram, can be used to

ascertain the presence of LRD in the data and can also render an estimate of H .
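A variance-time estimate of $H$ can be sketched by aggregating the trace as in (A.8), computing the variance at each aggregation level, and fitting a least-squares line to the log-log points. The function names, the aggregation levels, and the synthetic independent trace below are assumptions made for illustration; an independent trace has no long-range dependence, so its estimate should come out near $H = 1/2$.

```python
import math
import random


def aggregate(x, m):
    """Aggregated process of (A.8): non-overlapping block means of size m."""
    blocks = len(x) // m
    return [sum(x[b * m:(b + 1) * m]) / m for b in range(blocks)]


def variance(v):
    mean = sum(v) / len(v)
    return sum((a - mean) ** 2 for a in v) / len(v)


def hurst_variance_time(x, levels):
    """Estimate H from the slope (2H - 2) of log var{X^(m)} versus log m."""
    pts = [(math.log(m), math.log(variance(aggregate(x, m)))) for m in levels]
    mx = sum(p[0] for p in pts) / len(pts)
    my = sum(p[1] for p in pts) / len(pts)
    slope = (sum((a - mx) * (b - my) for a, b in pts)
             / sum((a - mx) ** 2 for a, _ in pts))
    return 1.0 + slope / 2.0


# Hypothetical memory-less trace with a 30% bit-error probability.
rng = random.Random(0)
bits = [1 if rng.random() < 0.3 else 0 for _ in range(20000)]
h = hurst_variance_time(bits, [1, 2, 4, 8, 16, 32, 64])
print(round(h, 2))
```

An estimate well above 1/2 on a real trace would be consistent with LRD, although a single estimator of this kind is usually cross-checked against other analyses.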

The power spectral density of an LRD random process is the Fourier transform of the

autocorrelation of (A.7), and has been shown to be [70]:

\[ I(\omega) = C_H \left( 2 \sin\frac{\omega}{2} \right)^{2} \sum_{i=-\infty}^{\infty} \frac{1}{\lvert \omega + 2\pi i \rvert^{2H+1}} \sim C_H \lvert \omega \rvert^{1-2H} \quad \text{as } \omega \to 0, \tag{A.9} \]

where $\omega$ is a frequency and $C_H$ is a constant. Note that the spectral density is
proportional to $\omega^{1-2H}$ for frequencies close to the origin. A log-log plot of the power
spectral density as a function of the frequency has a slope of $1 - 2H$, which can be used
to estimate $H$.

A.3.8 The Multifractal Wavelet Model

The multifractal wavelet model (MWM) was proposed in [61]–[63] to analyze and
model LRD network data. The MWM has shown promise in modeling various LRD
network phenomena [61]–[63]. The MWM relies on the premise that network data is

inherently non-negative and generally spiky. Both these properties are clearly true for

wireless bit-error data. Moreover, the scaling properties of wireless bit-errors can be

adequately characterized by wavelet-based analysis.

The MWM employs the Haar wavelet family and applies a constraint that the input

training data are always non-negative. For the Haar wavelet, the scaling and wavelet

coefficients are computed recursively as

\[ U_{j,k} = \frac{1}{\sqrt{2}} \left( U_{j+1,2k} + U_{j+1,2k+1} \right) \quad \text{and} \quad W_{j,k} = \frac{1}{\sqrt{2}} \left( U_{j+1,2k} - U_{j+1,2k+1} \right), \tag{A.10} \]

where $U_{j,k}$ and $W_{j,k}$ respectively represent the scaling and wavelet coefficients at time
$k$ and scale/level $j$. With the Haar scaling function, the scaling coefficients are simply
averaged versions of the input signal and thus, due to the non-negative nature of the data,
the scaling coefficients are always non-negative, $U_{j,k} \ge 0$. Rearranging (A.10) yields

\[ U_{j+1,2k} = \frac{1}{\sqrt{2}} \left( U_{j,k} + W_{j,k} \right) \quad \text{and} \quad U_{j+1,2k+1} = \frac{1}{\sqrt{2}} \left( U_{j,k} - W_{j,k} \right). \tag{A.11} \]

In the first equation of (A.11), to keep the next level’s scaling coefficients ($U_{j+1,2k}$’s)
non-negative, negative wavelet coefficients are constrained as $\lvert W_{j,k} \rvert \le U_{j,k}$. Similarly,
to keep the $U_{j+1,2k+1}$’s non-negative, the positive wavelet coefficients are constrained
as $W_{j,k} \le U_{j,k}$. Combining these two constraints gives a non-negativity constraint that

\[ \lvert W_{j,k} \rvert \le U_{j,k}. \tag{A.12} \]

The above constraint simply ensures that once the inverse transform is taken, the resultant
process is always non-negative. Alternatively, the constraint can be implemented as

\[ W_{j,k} = A_{j,k}\, U_{j,k}, \tag{A.13} \]

where $A_{j,k}$ is a random variable defined over the interval $[-1, 1]$.

In order to train the MWM to match the wireless bit-error traces, two random
variables need to be captured. The first random variable is the scaling coefficient at the
coarsest scale, $U_{j_0,k_0}$. The second set of random variables is the $A_{j,k}$’s at each level,
which in turn yield the wavelet coefficients (A.13) at that level. Once a general sense of
probability distribution is ascertained for these random variables, the
expectation-maximization algorithm [76], [77] can be used to fit that distribution to the actual dataset.
The training and synthesis algorithm is detailed in [61]. The complexity of synthesizing a
length-$N$ trace using the MWM is $O(N)$.
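The coarse-to-fine synthesis implied by (A.11) and (A.13) can be sketched as follows. This is only an illustration of the recursion and of the non-negativity guarantee: the multipliers $A_{j,k}$ are drawn from a uniform distribution on $[-1, 1]$ purely as a placeholder, whereas the actual MWM fits their distribution to the training data as detailed in [61]. The function names, the seed, and the coarsest coefficient value are hypothetical.

```python
import math
import random

SQRT2 = math.sqrt(2.0)


def haar_step(u):
    """One analysis step of (A.10): finer coefficients -> coarser scaling and wavelet."""
    half = len(u) // 2
    scaling = [(u[2 * k] + u[2 * k + 1]) / SQRT2 for k in range(half)]
    wavelet = [(u[2 * k] - u[2 * k + 1]) / SQRT2 for k in range(half)]
    return scaling, wavelet


def synthesize(u_coarse, levels, rng):
    """Refine one coarse scaling coefficient into 2**levels non-negative samples."""
    u = [u_coarse]
    for _ in range(levels):
        finer = []
        for u_jk in u:
            a = rng.uniform(-1.0, 1.0)          # placeholder distribution for A_{j,k}
            w = a * u_jk                        # (A.13), so |W| <= U holds by construction
            finer.append((u_jk + w) / SQRT2)    # (A.11), even child
            finer.append((u_jk - w) / SQRT2)    # (A.11), odd child
        u = finer
    return u


trace = synthesize(8.0, levels=6, rng=random.Random(1))
print(len(trace), min(trace) >= 0.0)
```

Each level doubles the number of coefficients, so the total work is proportional to the final length, in line with the $O(N)$ synthesis complexity stated above.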

A.3.9 Performance Evaluation Measure

Entropy is a measure of the average number of bits required to represent all outcomes

of a probability distribution. The Kullback-Leibler divergence quantifies the difference in

the entropies of two probability distributions [78]. In [20] we proposed an entropy

normalized Kullback-Leibler (ENK) divergence measure to quantify the accuracy of a

channel model. The ENK divergence quantifies the source-coding-like overhead incurred

by employing a model instead of the actual source. For two probability distributions
$p(X)$ and $q(X)$ defined over a common alphabet $\Psi$, the ENK divergence is defined as:

\[ \mathrm{ENK}\big( p(X) \,\|\, q(X) \big) = \frac{\sum_{X \in \Psi} p(X) \log_2 \dfrac{p(X)}{q(X)}}{-\sum_{X \in \Psi} p(X) \log_2 p(X)}, \tag{A.14} \]

where the numerator and denominator respectively represent the Kullback-Leibler

divergence and entropy functions.

The ENK divergence inherits basic properties of the Kullback-Leibler divergence: (a)
non-negativity, $\mathrm{ENK}(p \,\|\, q) \ge 0$, (b) non-symmetry, $\mathrm{ENK}(p \,\|\, q) \ne \mathrm{ENK}(q \,\|\, p)$, and (c)
$\mathrm{ENK}(p \,\|\, q) = 0 \Leftrightarrow p = q$. Small values of ENK divergence indicate that a model

accurately approximates the actual source. We would expect the ENK between two actual

traces to be a very small value as the traces are realized by the same random source.

Therefore we employ the ENK divergence between two 802.11b traces as a performance

evaluation reference for the ENK divergence between an actual trace and a trace

artificially generated by a model.

The ENK divergence relies on the fact that an appropriate random variable X is

being used to characterize the underlying source. We employ two random variables for

performance evaluation of all the models in this thesis: (i) burst-length of good bits I ,

where I takes positive integer values; (ii) burst-length of bad/corrupted bits B , where

B also takes positive integer values. Throughout this thesis, we refer to I and B as

good-bursts and bad-bursts random variables, respectively.
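A minimal sketch of computing the ENK divergence of (A.14) between two empirical burst-length distributions is given below. The function names and the toy distributions are hypothetical. Flooring $q(X)$ at a small `eps` where the model assigns zero probability is an assumption of this sketch, made only to keep the result finite; it is not part of the definition.

```python
import math
from collections import Counter


def empirical_pmf(samples):
    """Relative frequencies of the observed values (e.g., burst-lengths)."""
    counts = Counter(samples)
    n = len(samples)
    return {value: c / n for value, c in counts.items()}


def enk_divergence(p, q, eps=1e-12):
    """Entropy normalized Kullback-Leibler divergence of (A.14), in bits per bit."""
    kl = sum(pv * math.log2(pv / max(q.get(x, 0.0), eps))
             for x, pv in p.items() if pv > 0)
    entropy = -sum(pv * math.log2(pv) for pv in p.values() if pv > 0)
    if entropy == 0:
        raise ValueError("ENK is undefined for a zero-entropy source")
    return kl / entropy


p = empirical_pmf([1, 2, 1, 2])        # burst-lengths from a hypothetical actual trace
q = empirical_pmf([1, 2, 2, 2])        # burst-lengths from a hypothetical model trace
print(enk_divergence(p, q), enk_divergence(q, p), enk_divergence(p, p))
```

The three printed values illustrate the properties listed above: the divergence is non-negative, it changes when the arguments are swapped, and it vanishes when the two distributions coincide.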

CHAPTER A.4 EMPIRICAL ANALYSIS AND ACCURATE MODELING OF WIRELESS CHANNELS

In this chapter, we first describe the wireless trace collection experiment. We then

evaluate the correlation in the bit-error traces collected at 2, 5.5 and 11 Mbps. We

observe that the correlation at 2 and 5.5 Mbps exhibits a decaying trend, but the 11 Mbps

traces have high correlation even at large lags. Due to their manageable correlation, we

use Markov chains to model the 2 and 5.5 Mbps bit-errors. We show that full-state

Markov (FSM) chains provide highly accurate models of the 2 and 5.5 Mbps bit-errors.

Moreover, we show that FSM chains have unused states which can be ignored to reduce

the complexity of the FSM-based channel modeling paradigm.

Unlike the bit-errors at 2 and 5.5 Mbps, the 11 Mbps bit-error process requires a

model that can capture long-memory. We evaluate the 11 Mbps bit-errors using scaling,

variance-time and periodogram analyses. These evaluations substantiate the presence of

long-memory or long-range dependence (LRD) in 11 Mbps bit-errors. Consequently, we

employ a multifractal wavelet model (MWM) to characterize the 11 Mbps bit-errors. We

show that the MWM captures second-order statistics of the 11 Mbps bit-errors much

more accurately than Markov chains.

A.4.1 Wireless Trace Collection

For this study, five wireless receivers were used to simultaneously collect error traces

on an 802.11b LAN. The receivers were placed at different locations in a room, while the

access point (AP) was placed in a room across a hallway from the receivers to simulate a

realistic home/classroom/office setting as shown in Figure 2.

The receivers’ MAC layer device drivers were modified to pass corrupted packets to

higher layers. The receivers were Linux clients using DLink DWL-650 wireless cards

with the open source linux-wlan-ng device drivers [72]. To capture packets at high

transmission rates, packet dissectors were implemented inside the device drivers. These

packet dissectors ensured that only packets pertinent to our wireless experiment are

processed, while all other packets are simply dropped. Each experiment comprised one

million packets with a payload of 1,000 bytes each, i.e., each trace had approximately 1

GB of data.

A wired sender was used to send multicast packets with a predetermined payload on

the wireless LAN; multicasting disabled MAC layer retransmissions. The sender used

different transmission rates ranging from 4 Kbps to 1 Mbps for each experiment. At the

physical layer, the auto rate selection feature of the AP was disabled and for each

experiment the AP was forced to transmit at a fixed data rate. Each trace collection

experiment was repeated multiple times at 2, 5.5 and 11 Mbps physical layer data rates

and at different times of day.

Table 1 provides some statistics of the traces collected for this study. The packet error

rate is computed as

\[ \text{packet error rate} = \frac{\text{packets received with one or more errors}}{\text{total received packets}}. \]

As expected, the average packet error rate increases with an increase in the physical layer

data rate. In particular, the average packet error rate increases from approximately 10%

at 5.5 Mbps to almost 40% at 11 Mbps. Thus traditional higher layer protocols that drop

all corrupted packets (e.g., 802.11 MAC, UDP, TCP etc.) experience profound losses at

11 Mbps, and consequently there is room for considerable improvement. Since the
wireless receivers were placed at different locations, the receivers experienced different
packet error rates. The minimum and maximum error rates in Table 1 outline that the
receivers were experiencing both good and bad link conditions.

Figure 2. Set up for collection of wireless bit-error traces. (Diagram labels: Room 1, Room 2, Sender, 802.11b AP, Receiver 0 through Receiver 4, modified linux-wlan-ng drivers, bit-error traces.)

Table 1. Packet-Level Statistics at 2, 5.5 and 11 Mbps

Data rate    Average Packet Error Rate    Min Packet Error Rate    Max Packet Error Rate
2 Mbps       5.97%                        0.75%                    14.31%
5.5 Mbps     9.79%                        0.61%                    22.74%
11 Mbps      39.5%                        10.99%                   77.83%

In our initial experiments all wireless receivers maintained Line of Sight (LoS) with

the access point (AP). The AP was forced to transmit at 2, 5.5 and 11 Mbps for each

trace. It was observed that with clear LoS, the error-rate (at all bitrates) was extremely

low. Such excellent performance rendered further LoS study inconsequential. Hence, we

positioned the receivers in separate rooms to simulate a more realistic

business/classroom/home-network wireless setup as shown in Figure 2.

A.4.2 Empirical Analysis of 802.11b Bit-Errors

To maintain focus, throughout Chapters A.4 and A.5 we show results for two traces at

each physical layer data rate. These traces are collected at the same receiver under similar

conditions. The results for the remaining traces and receivers are similar.

A.4.2.1 Autocorrelation Analysis

The sample autocorrelations of the 2 Mbps, 5.5 Mbps and 11 Mbps bit-error traces are illustrated in Figure 3. Clearly, the correlation at 11 Mbps is higher than that at 2 and 5.5 Mbps. Let us first concentrate on the autocorrelation of the 2 and 5.5 Mbps traces. The autocorrelation at both data rates is a decaying function, i.e., the level of temporal dependence decreases with increasing lag. Based on the examples provided in Figure 3, we take the memory-length to be the lag beyond which the normalized correlation is less than 0.15, an empirically determined threshold. We observed that in some traces the correlation does not drop significantly below 0.15, even for very large lags. In general, however, the bit-errors exhibited rapidly decaying correlation as in Figure 3. Extensive performance evaluation suggests that a correlation of less than 15% does not play a significant role in the error process characteristics.

Based on the threshold, the memory-lengths of the 5.5 Mbps traces of Figure 3 are 12

and 14 respectively. The correlation of both 2 Mbps traces drops below 0.15 at the lag of

16. Hence, we use memory-length 14 and 16 as the maximum order of the 5.5 and 2

Mbps Markov chains respectively. Since the memory-lengths of the 2 and 5.5 Mbps bit-

error processes are not very large as compared to the 11 Mbps process, high-order

Markov chains can appropriately model these processes.
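The threshold rule above can be illustrated with a short sketch. It assumes that "memory-length" means the first lag at which the normalized sample autocorrelation drops below the threshold; the function names `sample_autocorrelation` and `memory_length` are hypothetical and are not taken from the thesis.

```python
def sample_autocorrelation(x, max_lag):
    """Normalized sample autocorrelation r[0..max_lag] of a sequence x."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    c0 = sum(v * v for v in centered)
    if c0 == 0:
        raise ValueError("constant sequence has undefined autocorrelation")
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / c0
        for lag in range(max_lag + 1)
    ]


def memory_length(acf, threshold=0.15):
    """First lag >= 1 at which the autocorrelation is below the threshold.

    Returns None when the correlation never drops below the threshold,
    which is the behavior the text associates with long-range dependence.
    """
    for lag in range(1, len(acf)):
        if acf[lag] < threshold:
            return lag
    return None
```

For a real trace, `x` would be the 0/1 bit-error sequence and `max_lag` would be chosen larger than the expected memory-length.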

Contrary to the correlations at 2 and 5.5 Mbps, Figure 3 clearly shows that the 11

Mbps bit-error process has high correlation even at large lags. This is reminiscent of

long-range dependence since a low-order memory-length cannot be identified for the 11

Mbps bit-error process. Consequently, Markov models cannot be used to model 11 Mbps

bit-errors.

Figure 3. Autocorrelation of bit-error traces. (x-axis: lag; y-axis: sample autocorrelation; curves: trace 2 Mbps 1, trace 2 Mbps 2, trace 5.5 Mbps 1, trace 5.5 Mbps 2, trace 11 Mbps 1, trace 11 Mbps 2.)

A.4.2.2 Preliminary Empirical Analysis of FSM Chains

In accordance with the discussion in Section A.3.4, we represent the bit-error traces as a binary series $\{x(i)\}_{i=1}^{l}$, where $x(i) = 1$ represents a bit-error and $l$ is the length of the series. Also, for a memory-length of $k$, a full-state Markov (FSM) chain has states corresponding to all possible $2^k$ combinations of $k$ consecutive bits. The complexity (i.e., number of states) of the FSM chains increases exponentially with an increase in memory-length. Previous studies employed low-order Markov chains [8], [14]–[17]. However, due to the present interest in capturing high-order behavior, we provide analysis and modeling with high-order FSM chains.

For efficient and accurate representation of the transition probability data, and to reduce the complexity, we examined the FSM transition probability matrices for bit-patterns that never occur in the collected traces. We refer to such bit-patterns as the unused states. These states result in all-zero columns in the transition probability matrix. An all-zero column implies that the probability of jumping to that state from any state is zero. While other methods for judicious selection of Markov states exist [67], retaining only the used states provides a simple and effective method of minimizing the model complexity.
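As an illustration of the used/unused-state bookkeeping, the following minimal sketch counts FSM transitions with a bit-by-bit sliding window and reports the percentage of $k$-bit patterns that never occur. All function names are hypothetical, and the sketch assumes the trace is a list of 0/1 integers longer than $k$; it is not the tooling used for the results in this chapter.

```python
from collections import defaultdict


def fsm_transition_counts(bits, k):
    """Count transitions between k-bit states using a bit-by-bit sliding window."""
    mask = (1 << k) - 1
    counts = defaultdict(lambda: defaultdict(int))
    state = 0
    for b in bits[:k]:                      # first k bits form the initial state
        state = (state << 1) | b
    for b in bits[k:]:
        nxt = ((state << 1) | b) & mask     # drop the oldest bit, append the new one
        counts[state][nxt] += 1
        state = nxt
    return {s: dict(t) for s, t in counts.items()}


def transition_probabilities(counts):
    """Normalize each row of transition counts into probabilities."""
    return {
        s: {t: c / sum(row.values()) for t, c in row.items()}
        for s, row in counts.items()
    }


def used_states(bits, k):
    """Set of k-bit patterns that occur at least once in the trace."""
    mask = (1 << k) - 1
    seen, state = set(), 0
    for i, b in enumerate(bits):
        state = ((state << 1) | b) & mask
        if i >= k - 1:                      # window is full from index k-1 onward
            seen.add(state)
    return seen


def unused_percentage(bits, k):
    """Percentage of the 2^k possible states that never occur in the trace."""
    total = 1 << k
    return 100.0 * (total - len(used_states(bits, k))) / total
```

Storing only the observed rows and columns, as the nested dictionaries do here, is one simple way to keep the model restricted to the used states.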

The percentage of unused states for each order is shown in Figure 4. The number of unused states grows as the order of the Markov chain is increased. For example, for a $2^{12}$-state model, approximately 80% and 30% of the states are never used at 2 and 5.5 Mbps, respectively. We lay special emphasis on this observation since the total number of states directly corresponds to the complexity of the model. All following FSM results employ the used states only. We recognize that the number of unused states will decrease as the channel is observed for a significantly longer period of time, i.e., the number of unused states decreases as the trace length grows. However, as substantiated by the FSM performance evaluation in later sections, FSM chains perform quite reasonably without the unused states, which implies that the unused states do not play a significant role in overall channel characterization.

A.4.2.3 Long-Range Dependence in 11 Mbps Bit-Errors

The autocorrelation analysis in Section A.4.2.1 provided initial indications that the 11

Mbps bit-errors are LRD in nature. This section substantiates this preliminary notion of

LRD by analyzing the 11 Mbps bit-error process in further detail.

A.4.2.3.1 LRD Evaluation by Observing Energy at Different Scales

Since LRD processes typically demonstrate second-order self-similarity, zooming out

from a sample path of the process should yield a path similar to the original in second-

order statistics. As shown by (A.8), in order to determine scaling in a process, we can

define an aggregate process by dividing a bit-error trace of length l into non-overlapping

blocks of length $m$ and averaging over each block. The resultant aggregate sample path averages $m$ points from the actual sample path. Due to the $\{0,1\}$ representation of the bit-errors, a level-$m$ aggregate process represents the normalized energy of bit-errors in non-overlapping windows of size $m$. We henceforth use the terms aggregate process and energy process synonymously. Aggregation smoothes high variances in the sample path and provides an on-average zoomed-out version of the actual sample path. Thus energy processes at different aggregation levels outline the impact of aggregation on the short-term second moment of the process.

Figure 4. Percentage of unused FSM states at 2 and 5.5 Mbps. (x-axis: number of states, logscale; y-axis: percentage of unused states; curves: 5.5 Mbps, 2 Mbps.)

Figure 5 shows a sample path and three of its aggregates. The top panel is a process sample path $X^{(1)}[k]$ outlining the unnormalized energy (i.e., the total number of errors) observed in each packet (packet transmission time = 1 second). The second panel is a level-4 aggregate of the first sample path, which depicts the average energy observed in four packets. Thus, the first point in this level-4 aggregate sample path is

$$X^{(4)}[1] = \frac{1}{4}\left( X^{(1)}[1] + X^{(1)}[2] + X^{(1)}[3] + X^{(1)}[4] \right).$$

Similarly, the remaining two panels are aggregates at levels 8 and 16. Each aggregate path zooms out of the actual sample path, and no statistically differentiating features are revealed by simple observation. Thus it can be inferred that the decrease in variability with increased smoothing is very slow. This slow decay is further highlighted by the analysis of second-order statistics in the following section.

Figure 5. Aggregates of the 11 Mbps energy process at different time scales. (Panels, top to bottom: X(1), X(4), X(8), X(16).)
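The block averaging used to build these aggregates can be sketched in a few lines. The function name `aggregate` is hypothetical, and dropping a trailing partial block is an assumption made here for simplicity.

```python
def aggregate(x, m):
    """Average x over non-overlapping blocks of length m.

    A trailing block with fewer than m samples is dropped (an assumption
    of this sketch, not something stated in the text).
    """
    return [sum(x[i:i + m]) / m for i in range(0, len(x) - m + 1, m)]
```

For example, the first point of `aggregate(x, 4)` is the mean of the first four samples of `x`, matching the level-4 expression above. Aggregating a level-2 aggregate again at level 2 gives the level-4 aggregate whenever the length of `x` is a multiple of 4.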

A.4.2.3.2 LRD Evaluation using Variance-Time Diagrams

Recall that for an LRD process, the variance of the aggregate process satisfies

$$\operatorname{var} X^{(m)} = m^{2H-2} \operatorname{var} X^{(1)}.$$

Variance-time diagrams plot the logscale variance of the aggregate process as a function of the logscale aggregation level. Second-order self-similarity is implied if the logscale decay in the variance is linear, that is, the change in the logarithm of the variance is directly proportional to the change in the logarithm of the aggregation level. For an LRD process, the Hurst parameter $H$ can then be estimated by fitting a least-squares line through the plot. A stationary second-order self-similar process is said to be long-range dependent if $1/2 < H < 1$.

The variance-time diagrams of two 11 Mbps bit-error traces for different aggregation levels are given in Figure 6. For both traces under consideration, the variance has a mostly linear decay with respect to the aggregation level. Least-squares lines of order 1 are fit to the data points of the two variance-time diagrams. The slopes of the two least-squares lines of Figure 6 (a) and (b) are $-0.3401$ and $-0.287$, respectively. In accordance with the above discussion, an estimate of the Hurst parameter, $H$, can be obtained by noting that the slope of the variance-time plot should be equal to $2H - 2$. This results in Hurst parameter estimates of $H = 0.83$ and $H = 0.857$ for the two traces. The two values of $H$ are quite close to each other, as expected for two realizations of the same random process. Further, both Hurst estimates satisfy the $1/2 < H < 1$ condition, thus implying that the 11 Mbps bit-error process is long-range dependent. To further substantiate the LRD notion, in the following section we provide LRD analysis using a frequency-domain estimator.
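The variance-time procedure can be sketched as follows. This is a minimal illustration rather than the code used to produce Figure 6: the helper names are hypothetical, base-10 logarithms and the population variance are assumptions, and aggregation levels whose variance is zero are skipped.

```python
import math


def aggregate(x, m):
    """Average x over non-overlapping blocks of length m (partial block dropped)."""
    return [sum(x[i:i + m]) / m for i in range(0, len(x) - m + 1, m)]


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def least_squares_slope(xs, ys):
    """Slope of the order-1 least-squares line through the points (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / sxx


def hurst_from_slope(slope):
    """Invert slope = 2H - 2 to obtain the Hurst parameter."""
    return 1.0 + slope / 2.0


def hurst_variance_time(x, levels):
    """Estimate H from the slope of log(var(X^(m))) versus log(m)."""
    log_m, log_var = [], []
    for m in levels:
        agg = aggregate(x, m)
        if len(agg) < 2:
            continue
        var = variance(agg)
        if var > 0:
            log_m.append(math.log10(m))
            log_var.append(math.log10(var))
    return hurst_from_slope(least_squares_slope(log_m, log_var))
```

With the slopes reported above, `hurst_from_slope(-0.3401)` evaluates to 0.82995 and `hurst_from_slope(-0.287)` to 0.8565, which round to the quoted estimates of 0.83 and 0.857. For an uncorrelated sequence the slope is close to $-1$, so the estimate is close to 0.5.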

Figure 6. Variance-time diagrams of two 11 Mbps bit-error traces. (a) trace 11 Mbps 1: Hurst parameter estimate, $H = 0.83$. (b) trace 11 Mbps 2: Hurst parameter estimate, $H = 0.857$. (x-axis: log(m); y-axis: log(Var(X^(m))).)

A.4.2.3.3 LRD Evaluation using the Periodogram

A periodogram renders an estimate of the power spectral density of a process. The periodogram is simply the squared magnitude of the discrete-time Fourier transform of the samples. Mathematically, the periodogram of a discrete-time process $X_n$ is given as:

$$I(\omega) = \frac{1}{2\pi N} \left| \sum_{k=1}^{N} X_k e^{-ik\omega} \right|^2, \qquad \text{(A.15)}$$

where $\omega$ is the frequency, $N$ is the total number of samples and $i = \sqrt{-1}$. Recall from Section A.3.7 that the spectral density of an LRD process is proportional to $\omega^{1-2H}$ near the origin, $\omega = 0$. Since the periodogram of (A.15) is an estimate of the spectral density, a regression of the logarithm of the periodogram on the logarithm of the frequency $\omega$ should render an order-1 polynomial with a slope of $1 - 2H$. A frequency-domain estimate of $H$ can thus be obtained by fitting an order-1 least-squares line through a log-log plot of the periodogram versus the frequencies. In general, only the lowest 10% of the periodogram frequencies are used for this estimation, since the approximation only holds true near the origin.

The logscale periodograms of the two 11 Mbps traces are shown in Figure 7 (a) and (b), respectively. The slopes of the least-squares lines fit to these periodograms yield Hurst parameter estimates of $H = 0.874$ and $H = 0.877$ for the two traces. These estimates satisfy the $1/2 < H < 1$ condition, thus substantiating that the 11 Mbps random process has long-range dependence. Further, note that these estimates are quite close to the estimates rendered by the variance-time diagrams of the last section.

Figure 7. Logscale periodogram of two 11 Mbps bit-error traces. (a) trace 11 Mbps 1: Hurst parameter estimate, $H = 0.874$. (b) trace 11 Mbps 2: Hurst parameter estimate, $H = 0.877$.
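A minimal sketch of the periodogram estimator is given below. It evaluates (A.15) directly at the Fourier frequencies $\omega_j = 2\pi j / N$ (an assumption; the thesis does not state which frequencies were used) with a plain $O(N^2)$ sum, so it is only practical for short traces. The function names are hypothetical.

```python
import cmath
import math


def periodogram(x, num_freqs):
    """Evaluate (A.15) at the first num_freqs Fourier frequencies 2*pi*j/N."""
    n = len(x)
    result = []
    for j in range(1, num_freqs + 1):
        w = 2.0 * math.pi * j / n
        s = sum(xk * cmath.exp(-1j * w * k) for k, xk in enumerate(x, start=1))
        result.append((w, abs(s) ** 2 / (2.0 * math.pi * n)))
    return result


def hurst_from_spectrum(points):
    """Fit log(I) against log(w) and invert slope = 1 - 2H."""
    pts = [(math.log(w), math.log(i)) for w, i in points if i > 0]
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts)
    return (1.0 - sxy / sxx) / 2.0


def hurst_periodogram(x, fraction=0.10):
    """Estimate H from the lowest `fraction` of the periodogram frequencies."""
    num_freqs = max(2, int(fraction * (len(x) // 2)))
    return hurst_from_spectrum(periodogram(x, num_freqs))
```

Because the individual periodogram ordinates are noisy, an estimate of this kind is only meaningful when many low-frequency points are available; a production implementation would normally use an FFT instead of the direct sum.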

A.4.3 Accurate Modeling of 802.11b Bit-Errors

A.4.3.1 Bit-Error Modeling at 5.5 Mbps

The autocorrelation analysis in preceding sections outlined a maximum memory-length of 14 for the 5.5 Mbps bit-error process. A memory-length of 14 corresponds to an FSM chain with $2^{14} = 16384$ states. ENK-based performances [1] of FSM chains with varying memory-lengths are given in Figure 8. FSM chains perform remarkably well for the bad-bursts random variable. Note that even smaller order chains perform adequately, with a source coding overhead of less than or approximately equal to 0.03 in all cases. However, the good-bursts random variable incurs significant overhead for smaller order chains. For example, the two-state chain renders an overhead of approximately 0.5 and is therefore not a viable option. For higher-order chains, the overhead decreases and drops to a reasonable level, beginning at the 511-state model. Due to data over-fitting considerations, we assume that any overhead less than 0.1 is acceptable. Thus we conclude that all FSM chains of orders 9 and above render appropriate models for the 5.5 Mbps bit-error process.

[1] The terms performance and accuracy are used synonymously throughout this thesis.

A.4.3.2 Bit-Error Modeling at 2 Mbps

The performances of varying order FSM chains are shown in Figure 9. For both random variables, small order FSM models incur profound overhead. For instance, for the good-bursts random variable the overhead of the order-1 (two-state) chain is approximately 0.8. Although lower order FSM chains cannot model the bit-error process effectively, as we move to higher order chains the overhead decreases substantially and drops to a reasonable level. The 5.5 Mbps results showed that the order of an adequate model can be smaller than the order suggested by the data correlation. In Figure 9 we therefore only provide analysis up to order 10, since the performance improvement saturates after the order-10 (548-state) model.

It is clear from Figure 9 that low-order FSM chains incur significant ENK overhead and hence are unable to capture the 2 Mbps bit-error behavior. For both random variables, the FSM performance improves with an increase in the order of the model. The accuracy of the order-10 FSM chains is comparable to the divergence between two actual traces. Hence, we conclude that the 548-state FSM renders a good model of the 2 Mbps MAC layer bit-error process.

A.4.3.3 Bit-Error Modeling at 11 Mbps

In earlier sections, we revealed the LRD nature of the 11 Mbps bit-errors using autocorrelation and scaling analyses. Based on the results from the last two sections, one can conjecture that if high-order FSM simulations are performed, it might be possible to identify an appropriate Markov process of an order lower than what is outlined by the autocorrelation. However, ascertaining such a model order might require simulations with very high-order FSM chains, which are computationally infeasible. In this section we show that a multifractal wavelet model (MWM) captures the LRD characteristics of the 11 Mbps bit-error process quite accurately. Although Markov models cannot capture the LRD nature of bit-errors at 11 Mbps, we use Markov chains as a performance reference when evaluating the performance of the MWM.

Figure 8. Performances of varying order FSM chains for the 5.5 Mbps MAC layer bit-error process: (a) good-bursts, (b) bad-bursts. (x-axis: number of states; y-axis: ENK; curves: FSM, 5.5 Mbps bit-error traces.)

Figure 9. Performances of varying order FSM chains for the 2 Mbps MAC layer bit-error process: (a) good-bursts, (b) bad-bursts. (x-axis: number of states; y-axis: ENK; curves: FSM, 2 Mbps bit-error traces.)

A.4.3.3.1 The Multifractal Wavelet Model

We used the MWM toolbox [73] to train an MWM. An actual 11 Mbps trace (i.e., a bit-error sequence of zeros and ones) was used for MWM training. Various simulations were performed with $\beta$, point-mass and hybrid $\beta$/point-mass probability distributions for the $A_{j,k}$ random variables, and with Gaussian and log-normal distributions for the $U_{j_0,k_0}$ random variable. We observed that the performance of the MWM was quite insensitive to the choice of the probability distribution used for the MWM random variables. For brevity we only report results for the $\beta$ distribution.

A.4.3.3.2 ENK-based Performance Evaluation

A cautionary note is in order before we proceed with ENK-based performance evaluation of the MWM at 11 Mbps. Due to its reliance on entropy, the ENK divergence compares the skew of the probability distributions, but does not place much emphasis on the second-order statistics (e.g., energy, variance etc.) of the distributions. The MWM (or for that matter any model of LRD data), on the other hand, is designed to capture scaling phenomena (and the consequent second-order statistics) of an LRD random process. For an LRD process, a comparison that uses only the ENK of the good- and bad-burst distributions can therefore be misleading, and the ENK divergence by itself cannot completely quantify the MWM performance. In addition to ENK, it is imperative that the second-order statistics of the random process be compared with those of the model. We perform such second-order performance evaluation in the subsequent sections.

The ENK-based performances of the MWM and FSM chains are tabulated in Table 2, where FSM($x$) represents an FSM chain with $x$ states. The good-bursts ENK overhead of the MWM is lower than that of the 16-state FSM chain, while its bad-bursts overhead is higher than that of the 16-state FSM chain. The MWM ENK overhead is slightly worse than that of the 4096-state FSM chain. For instance, for the first actual trace the MWM’s good-bursts ENK divergence is $0.127 - 0.091 = 0.036$ more than that of the 4096-state FSM. For the same example, the bad-bursts ENK overhead of the MWM is $0.093 - 0.00094 = 0.09206$ more than that of the 4096-state FSM.

Table 2. Performance of MWM and FSM for the 11 Mbps Bit-Error Process

                                  good-bursts    bad-bursts
ENK(trace1 || trace2)             0.0586         0.00032
ENK(trace1 || MWM)                0.127          0.093
ENK(trace1 || FSM(16))            0.174          0.002
ENK(trace1 || FSM(4096))          0.091          0.00094
ENK(trace2 || MWM)                0.143          0.096
ENK(trace2 || FSM(16))            0.189          0.002
ENK(trace2 || FSM(4096))          0.088          0.0017

The slightly superior performance of the FSM chains is due to the fact that the FSM model is extremely adept at capturing the short-term correlation structure of the random process. This short-term dependence arises from small bursts of good and bad bits. Such small bursts are quite prevalent even in an LRD process such as the present 11 Mbps bit-error process. To substantiate this claim, we show the small burst probabilities of the good- and bad-bursts random variables in Figure 10. Note that burst-lengths of 1, 2, 3, 4 and 5 constitute 78.35% and 98.03% of the probability space of the good- and bad-bursts random variables, respectively. This small-burst behavior is characterized very adequately by an FSM chain. Since the skew of both probability distributions is dictated by these highly probable small bursts, the ENK overhead of the FSM chains is quite low, although FSM chains cannot capture the long-term process correlation. The skew-oriented bias of the ENK measure masks the long-term correlation properties of a random process, which are exhibited in the spread (i.e., the variance) of the probability distribution. We henceforth focus solely on second-order analysis of the models under consideration.

Figure 10. Probability mass functions for good- and bad-bursts random variables derived from an 11 Mbps trace: (a) good-bursts, (b) bad-bursts. (Only the probabilities of small bursts are shown here. Axes: burst length versus burst probability.)

Figure 11. Energy processes of actual and synthesized bit-error traces: (a) aggregation level = 8, (b) aggregation level = 256. (Panels in each part: trace 11 Mbps, MWM, 16 state FSM, 4096 state FSM.)
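The small-burst probabilities discussed in this section can be tabulated from a trace with a short sketch. It assumes that a good burst is a maximal run of error-free bits (0s) and a bad burst is a maximal run of corrupted bits (1s); the function names `burst_pmfs` and `small_burst_mass` are hypothetical.

```python
from collections import Counter
from itertools import groupby


def burst_pmfs(bits):
    """Empirical PMFs of good-burst (runs of 0) and bad-burst (runs of 1) lengths."""
    good, bad = Counter(), Counter()
    for bit, run in groupby(bits):
        length = sum(1 for _ in run)
        (bad if bit == 1 else good)[length] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {l: c / total for l, c in sorted(counter.items())} if total else {}

    return normalize(good), normalize(bad)


def small_burst_mass(pmf, max_len=5):
    """Total probability of bursts no longer than max_len."""
    return sum(p for length, p in pmf.items() if length <= max_len)
```

Applied to an 11 Mbps trace, `small_burst_mass` with `max_len=5` would give the kind of figure quoted above for burst-lengths 1 through 5.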

A.4.3.3.3 Performance in Capturing Energy at Different Scales

We first consider energy in non-overlapping windows of the bit-error traces. As mentioned previously, the definition of energy (as given in (A.8) and explained in prior sections) outlines the second moment of the random process in short-term windows. Two examples of an energy process derived from an actual source, together with energy processes synthesized using the MWM, the 16-state FSM chain and the 4096-state FSM chain, are illustrated in Figure 11. It can be observed that the FSM chains project overly pessimistic energy estimates (i.e., very high error rates), whereas the MWM in general has less energy per window than the actual 11 Mbps bit-error process. By simple observation, it can be deduced that the MWM captures the energy characteristics of the 11 Mbps bit-error process better than the Markov chains. In the next section, we compare the aggregated variance-time behavior of the FSM chains and the MWM.

A.4.3.3.4 Performance in Capturing the Variance-Time Characteristics

In this section, we evaluate the accuracy of the MWM and the FSM chains in modeling the decay of aggregated variances. Figure 12 shows the variance-time behavior of FSM chains. Clearly, the FSM chains can capture the short-term correlations of the random process with outstanding accuracy, as shown in the top-left corner of Figure 12. However, the performance degrades sharply as the dependence (i.e., linear decay of the variance) persists at higher scales. Not surprisingly, more and more correlation is captured as we increase the memory-length of FSM chains. Thus, if the complexity of an FSM chain that captures all the scales present in the data can be afforded, then such an FSM chain can render a highly accurate model.

Unfortunately, in practical LRD processes the correlation typically persists at very high scales. In such a case, a model that is designed to capture the correlation structure at different scales (e.g., the MWM) is more suitable than FSM chains. This observation is outlined in Figure 13 (a), which shows that the MWM can capture the decay of variance of the 11 Mbps process quite accurately within an additive constant. The phenomenon of a model’s inability to capture the exact variance values is well-known in LRD literature. It has been diagnosed that this problem arises because of non-stationarities introduced by jumps in the mean and slowly decaying trends [74]. (The jumps in the mean of the 11 Mbps bit-error process can be easily observed in Figure 11.) Teverovsky and Taqqu [74] proposed to eliminate this problem by fitting the function $C_1 + C_2 m^{2H-2}$ to the variance-time diagram of the LRD process. The corrective factors, $C_1$ and $C_2$, can then be added to the variances produced by a model.

In the present problem, the corrective factors were $C_1 = 0$ and $C_2 = 3.71$. The variance-time diagram of the MWM with the corrective factors is given in Figure 13 (b). Clearly, the corrected MWM captures the decay in the variance quite accurately. Thus we deduce that the MWM is an effective model of the long-range dependence present in the 11 Mbps bit-error process.
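One simple way to obtain such corrective factors is sketched below. It assumes that $H$ has already been estimated, in which case $C_1 + C_2 m^{2H-2}$ is linear in $C_1$ and $C_2$ and an ordinary least-squares fit suffices. This is an illustrative simplification, not necessarily the fitting procedure of [74], and the function name is hypothetical.

```python
def fit_corrective_factors(levels, variances, hurst):
    """Least-squares fit of variances ~ C1 + C2 * m**(2H - 2) for a fixed H.

    levels    -- aggregation levels m
    variances -- variance of the aggregate process at each level
    hurst     -- previously estimated Hurst parameter (assumed known here)
    """
    z = [m ** (2.0 * hurst - 2.0) for m in levels]   # regressor m^(2H-2)
    n = len(z)
    mz = sum(z) / n
    mv = sum(variances) / n
    szz = sum((a - mz) ** 2 for a in z)
    szv = sum((a - mz) * (b - mv) for a, b in zip(z, variances))
    c2 = szv / szz
    c1 = mv - c2 * mz
    return c1, c2
```

When the input variances follow the model exactly, the fit returns the generating constants; with measured variances the result depends on the chosen aggregation levels and on the quality of the $H$ estimate.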

A.4.4 Discussion

In this chapter, we proposed accurate models of MAC layer bit-error channels at the 2, 5.5 and 11 Mbps data rates of an 802.11b LAN. While the MWM for 11 Mbps bit-errors has linear complexity in synthesizing and predicting bit-error behavior, the FSM chain model’s complexity increases exponentially with respect to the memory-length. The following chapter reduces the exponential complexity by approximating FSM chains’ behavior using low-complexity models.

Figure 12. Variance-time diagrams of varying order FSM chains for the 11 Mbps bit-error process. (x-axis: log(m); y-axis: log(Var(X^(m))); curves: 4 state FSM, 16 state FSM, 4096 state FSM, trace 11 Mbps.)

Figure 13. Variance-time diagrams of the MWM for the 11 Mbps bit-error process: (a) without corrective factors, (b) with corrective factors. (x-axis: log(m); y-axis: log(Var(X^(m))); curves: trace 11 Mbps, MWM.)

CHAPTER A.5 COMPLEXITY REDUCTION FOR MARKOV CHANNELS

Most benefits of a wireless MAC layer channel model can be realized if the model is

able to provide real-time and online channel characterization and prediction. In

complexity- and power-constrained wireless and mobile environments, such channel

characterization is only possible with a low-complexity model. Despite some recent

interest in reducing the complexity of wireless models [20]–[29], development of

accurate, pragmatic and low-complexity wireless channel models is still an open

problem. Since low-complexity models have not been thoroughly explored and verified

for contemporary wireless and mobile networks, many of the protocols, applications and

systems mentioned in Chapter A.1 have not been realized in practical wireless systems.

The number of states of an FSM chain is an exponential function of the random process’ memory-length: $2^k$ states for a process with a memory-length of $k$ bits. This phenomenon is commonly referred to as state explosion. Due to state explosion, although FSM chains can provide accurate models of wireless bit-errors, their high complexity renders them impractical for realistic wireless environments.

To reduce FSM chains’ complexity, in this chapter we first consider hierarchical [18] and hidden [75] Markov models. We observe that these models cannot accurately characterize the bit-error channels under consideration. Consequently, we focus on directly approximating FSM chain behavior. We first make insightful observations about underlying characteristics of an FSM model. As a first direct approximation model, we derive and evaluate a new class of lumped Markov chains [59]. However, we observe that lumped Markov chains are also unable to approximate the behavior of FSM chains.

Finally, we analyze how FSM chains capture good- and bad-burst behavior of wireless channels. Using this analysis, we derive important guidelines for the realization of accurate, effective and low-complexity models. These guidelines lead to a constant-complexity model (CCM) that always comprises five states irrespective of the memory-length. We show that the performance of the 5-state CCM in modeling of the 2 and 5.5 Mbps MAC layer bit-error channels is comparable to that of exponential-complexity FSM chains and better than that of linear-complexity models [29].

A.5.1 The Hierarchical Markov Model

The hierarchical Markov model (hMM) is based on the observation that error traces exhibit clear delineation between highly bursty error regions and relatively low error regions. Therefore, in an hMM [18], severe- and low-burst regions are identified in the bit-error traces. Each of the burst states has an embedded two-state Markov model as depicted in Figure 14. One of the challenges of the hMM is the delineation of the high-level severe- and low-burst states. The work in [18] employed a state demarcation heuristic to delineate the low- and severe-burst states in the error traces. The state demarcation heuristic relied on two empirically determined thresholds to transit between severe- and low-burst states. One of the thresholds, say threshold 1, determines whether or not a burst of bad bits is a small isolated burst between mostly good bits. The other threshold, say threshold 2, ascertains the number of good bits which can characterize the end of a long/severe burst of bad bits. Small thresholds can make the process transit erratically between the high-level states, whereas large thresholds can unnecessarily increase the sojourn time spent in a high-level state. There is unfortunately no good method of determining the best values of these thresholds, and heuristic experimentation is needed to find reasonably accurate values.
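To make the role of the two thresholds concrete, the following sketch shows one plausible reading of such a demarcation heuristic. It is not the algorithm of [18]: the switching rules, the function name `demarcate` and the choice to start in the low-burst state are all assumptions made for illustration.

```python
from itertools import groupby


def demarcate(bits, threshold1, threshold2):
    """Label every bit as belonging to the 'low' or 'severe' burst state.

    Assumed rules (illustrative only):
      - in the low-burst state, a run of bad bits (1s) longer than
        threshold1 is no longer an isolated burst and switches to 'severe';
      - in the severe-burst state, a run of good bits (0s) of at least
        threshold2 marks the end of the severe burst and switches to 'low'.
    """
    labels = []
    state = "low"                      # assumption: traces start in the low-burst state
    for bit, run in groupby(bits):
        length = sum(1 for _ in run)
        if state == "low" and bit == 1 and length > threshold1:
            state = "severe"
        elif state == "severe" and bit == 0 and length >= threshold2:
            state = "low"
        labels.extend([state] * length)
    return labels
```

Under these assumed rules, small thresholds cause frequent switching between the high-level states, while large thresholds keep the process in its current state for longer, which mirrors the trade-off described above.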

Table 3 outlines the ENK-based performance of the hMM for varying values of the two thresholds. The ENK divergence of the actual traces [row 1 of Table 3] provides a reference value for performance evaluation of the hMM. Irrespective of the threshold values, the hMM always incurs a high overhead of more than 0.6 for the good-bursts random variable. The bad-bursts random variable usually takes small values since most of the bits are not corrupted. From Table 3, it can be seen that for the bad-bursts random variable, the ENK distance between the hMM and actual traces is quite small for all thresholds. This ENK overhead is nevertheless much larger than the ENK divergence between the actual traces. From these results, we conclude that the hMM cannot capture the present MAC layer wireless bit-errors.

Figure 14. The hierarchical Markov model (hMM) [18]. (Two high-level states, low-burst and severe-burst, each embed a two-state good/bad Markov chain with transition probabilities Pgg1, Pgb1, Pbg1, Pbb1 and Pgg2, Pgb2, Pbg2, Pbb2, respectively.)

Table 3. Performance of the hMM for 5.5 Mbps Bit-Error Process

                                                  good-bursts    bad-bursts
ENK(trace1 || trace2)                             0.0086         0.0029
ENK(trace1 || hMM), threshold1=threshold2=10      0.621          0.009

A.5.2 The Hidden Markov Model

To apply hidden Markov models (HMMs) to this problem, we need discriminative statistical features that can be used to train the HMM. After much experimentation, we found that bit-error energy in non-overlapping windows can serve as an effective discriminative feature for the low and severe bit-error conditions. Due to the present $\{0,1\}$ representation of bit-errors, the error-rate simply corresponds to the energy process defined in Section A.4.2.3.1 [equation (A.8)], with the window size representing the aggregation level. We use the bit-error energy as input to the HMM’s Baum-Welch forward-backward training algorithm [76], [77].

We ran simulations for varying window sizes and for varying numbers of HMM states. Table 4 enumerates performances of three HMMs; similar trends were observed for other HMM experiments. Note that the HMM performance is quite sensitive to the window length. For instance, the HMM over a 1000-bit window has far inferior performance to the 2000-bit HMM, even though the 2000-bit HMM has fewer states. In general, the HMM performance improved with an increase in window size. This improvement, however, saturated once the window size became greater than the packet size.

Comparing the good-bursts column of Table 4 with Table 3 reveals that the HMMs

with 3 and 5 states yield better good-bursts performance than the hMM. However, the

ENK values for the bad-bursts random variable in all the HMM cases are orders of

magnitude greater than the respective values for the hMM traces. Hence, we conclude

that, while the HMM improves the modeling of good-bursts, it does not model the bad-

bursts adequately. Thus the overall performance of HMM modeling for the experimental

error traces is unsatisfactory.

The poor performance of HMMs arises because, unlike other problem areas where well-defined characteristics of the input data are available for training, the bit-error traces of this study do not provide robust training features. Furthermore, the HMM assumes that the time spent in a state is exponentially distributed, which is not an accurate assumption for wireless bit-errors. This assumption of exponentially distributed state sojourn times results in inaccurate HMM parameterization.

Table 4. Performance of the HMM for the 5.5 Mbps Bit-Error Process

                                                        good-bursts    bad-bursts
ENK(trace1 || HMM), window=2000 bits, HMM states=3      0.403          0.685
ENK(trace2 || HMM), window=2000 bits, HMM states=3      0.409          0.731

Since the hierarchical and hidden Markov models cannot capture the bit-error

process, we now focus on directly approximating FSM chains’ behavior. This direct FSM

approximation will be performed by aggregating FSM chain states.

A.5.3 FSM Observations

In this section, we first state two intuitive observations regarding FSM chains. These

observations are used in subsequent sections to derive important characteristics of FSM

chains. It is important to outline here how we intend to approximate FSM chains. The

approximate models of this thesis are developed by creating partitions of the FSM chain

state space. All FSM states in a partition are then simply aggregated/grouped into a single

aggregate state of the approximate process. Hence, this section mainly addresses the

following question: How should one define partitions on the FSM state space such that

the resulting aggregate process accurately approximates the underlying FSM chain? In

other words, we are trying to find out which FSM states can be aggregated together

without compromising the FSM model’s performance.

A.5.4 Observations about FSM Chains

The first observation is a direct consequence of the binary nature of the present wireless bit-error process:

Observation 1. If a bit-by-bit sliding window is used to compute the transition probabilities of a $2^k$-state FSM chain, then from a current state, $X_n = i$, in one transition the FSM chain can transit to only two possible states given by:

$$X_{n+1} = \begin{cases} 2i \bmod 2^k, \\ (2i+1) \bmod 2^k, \end{cases} \tag{A.16}$$

where $k$ is the memory-length of the FSM chain and $i \in \{0, 1, \ldots, 2^k - 1\}$ is an arbitrary state in the FSM chain’s state space.

An example given in Figure 15 demonstrates this observation. A memory-length of four, $k = 4$, is used in this example, so the set of all possible FSM states is $\{0, 1, 2, \ldots, 2^4 - 1 = 15\}$. The current state is $X_n = (0110)_2 = (6)_{10}$ and, as the window slides by one bit, the 0 in the most significant bit position will be dropped and a bit will be added to the least significant bit position. Since the data are binary, the chain can transit to either $(1100)_2 = (12)_{10}$ or $(1101)_2 = (13)_{10}$. Thus in essence Observation 1 implies that at each slide of the memory-window the process’ current state $i$ is subjected to three operations: a left-shift by one bit, which yields $2i$; followed by an addition of a zero ($2i + 0 = 2i$) or an addition of a one ($2i + 1$) at the least significant bit (LSB) position; followed by a modulus operation, which ensures that if the current state of the process is $X_n = 2^{k-1}$ then the next state wraps around to state 0 (for $X_{n+1} = 2i$) or state 1 (for $X_{n+1} = 2i + 1$). For instance, in the preceding example with $k = 4$, if $X_n = (1000)_2 = (8)_{10} = 2^{k-1}$ then the next state will be either $X_{n+1} = (2 \times 8) \bmod 2^4 = (0000)_2$ or $X_{n+1} = (2 \times 8 + 1) \bmod 2^4 = (0001)_2$. Since each FSM state has two transition possibilities, each row of the FSM transition probability matrix will have at most two non-zero entries, given by $p_{i,\,2i \bmod 2^k}$ and $p_{i,\,(2i+1) \bmod 2^k} = 1 - p_{i,\,2i \bmod 2^k}$.
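The shift, append and modulus operations described in Observation 1 can be sketched in a few lines of Python. This is only an illustration: the function name `next_states` is hypothetical and does not appear in the thesis.

```python
def next_states(i, k):
    """Return the two FSM states reachable from state i in one window slide.

    Illustrative helper (name not from the thesis): shift the k-bit window
    left by one bit, append a 0 or a 1, and keep the k least significant bits.
    """
    n_states = 2 ** k
    if not 0 <= i < n_states:
        raise ValueError("state must lie in 0 .. 2^k - 1")
    zero_next = (2 * i) % n_states      # a good bit (0) enters the window
    one_next = (2 * i + 1) % n_states   # a bad bit (1) enters the window
    return zero_next, one_next
```

For the example with $k = 4$, `next_states(6, 4)` returns `(12, 13)` and `next_states(8, 4)` returns `(0, 1)`, matching the wrap-around case discussed in the text.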

[Figure 15: a four-bit sliding window holding 0 1 1 0 gives $X_n = 6$; sliding in a 0 gives $X_{n+1} = 2 \times 6 = 12$, and sliding in a 1 gives $X_{n+1} = 2 \times 6 + 1 = 13$.]

Figure 15. Transition possibilities for an FSM chain (memory-length, $k = 4$).

Table 5. Empirical Evidence in Support of Observation 2

           2 Mbps    5.5 Mbps
$\pi_0$    0.997     0.974

Intuitively, one can claim that the number of error-free bits received over any reasonable wireless channel should be much greater than the number of corrupted bits. The second observation stated below formulates this claim in terms of FSM chain parameters:

Observation 2. The steady-state probability of state 0 of an FSM chain for wireless channels is much greater than the steady-state probabilities of all other states,

$$\pi_0 \gg \sum_{j=1}^{2^k - 1} \pi_j, \tag{A.17}$$

where $k$ represents the memory-length and $\pi_i$ represents the steady-state probability of being in state $i$ of the FSM chain.

The above observation implies that the mean-time spent in state 0 of the FSM chain

(i.e., the state with no errors) is much greater than the mean-time spent in all other states.

It can be intuitively argued that this observation holds for real-life wireless channels. For

instance, Table 5 gives the steady-state probabilities of the 802.11b 2 Mbps bit-error

FSM chain of order 10 and the 5.5 Mbps bit-error FSM chain of order 9. Since the

steady-state probability of staying in the good (all-zero) FSM state is very close to one

for both the channels shown in Table 5, we can safely claim that Observation 2 holds for

the wireless channels currently under consideration.

A.5.5 Markov Chain Lumpability

We first evaluate direct applicability of the well-known Markov chain lumpability

technique [59] to the wireless modeling problem under investigation. Chen et al. [24],

[25] showed that on some wireless channels lumpability might be a viable option for

reducing channel modeling complexity. We specialize the general definition of

lumpability to the binary FSM case using the observations made in the previous section.

A.5.5.1 Lumpability for Wireless Bit-Error Channels

Let the state space of an FSM with $2^k$ states be given as $H = \{0, 1, \ldots, 2^k - 1\}$. Now consider a new process with state space $S = \{S_0, S_1, \ldots, S_{N-1}\}$, where $N \le 2^k$. Let the FSM states belonging to $H$ be disjointly distributed among states of the new process. In other words, each element of $S$ is in turn a set containing one or more FSM states and is henceforth referred to as an aggregate state. If we impose a condition that an FSM state cannot exist in two different aggregate states simultaneously then the set $S$ constitutes a partition of the FSM state space.

Before proceeding further, we employ Observation 1 to prove a necessary condition

for defining partitions of the FSM state space. This condition is stated as a lemma.

LEMMA 1. The next state in an aggregate process can be accurately determined only if the FSM states $2i \bmod 2^k$ and $(2i+1) \bmod 2^k$ do not belong to the same aggregate state,

$$(2i \bmod 2^k) \in S_j \;\Rightarrow\; ((2i+1) \bmod 2^k) \notin S_j, \tag{A.18}$$

where $k$ is the memory-length, $i \in H$ and $S_j \in S$.

Proof: Lemma 1 is easily proven by contradiction. In essence, this lemma implies that both transition possibilities of an FSM state cannot be aggregated in a single state. As mentioned in Observation 1, $2i \bmod 2^k$ and $(2i+1) \bmod 2^k$ are the only possible transitions for FSM state $i$. Let there exist an aggregate state $S_j$ that contains both FSM states $2i \bmod 2^k$ and $(2i+1) \bmod 2^k$. Also, let $S_q$ represent an aggregate state that contains FSM state $i$. Then $p_{S_q,S_j}$ does not give any information about whether a good- or a bad-bit should be added to the memory-window. ∎
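The condition of Lemma 1 is easy to test mechanically for a candidate partition. The following is a minimal sketch, assuming a partition is represented as a list of Python sets of FSM states; both this representation and the name `satisfies_lemma1` are illustrative choices, not part of the thesis.

```python
def satisfies_lemma1(partition, k):
    """Check that no aggregate state holds both successors of any FSM state.

    `partition` is a list of sets of FSM states (hypothetical representation
    chosen for this sketch); k is the memory-length of the FSM chain.
    """
    n_states = 2 ** k
    for i in range(n_states):
        zero_next = (2 * i) % n_states
        one_next = (2 * i + 1) % n_states
        for aggregate in partition:
            # Lemma 1 is violated if both successors share an aggregate state.
            if zero_next in aggregate and one_next in aggregate:
                return False
    return True
```

For a memory-length of 2, the partition `[{0, 2}, {1, 3}]` passes this check, whereas `[{0, 1}, {2, 3}]` fails because states 0 and 1, the two successors of state 0, fall in the same aggregate state.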

Let $p_{i,S_j} = \sum_{h \in S_j} p_{i,h}$; then $p_{i,S_j}$ represents the probability of moving from FSM state $i$ to aggregate state $S_j$ in one step of the FSM chain. Given Lemma 1 and using Observation 1, $p_{i,S_j}$ can be written as

$$p_{i,S_j} = \begin{cases} p_{i,\,2i \bmod 2^k}, & (2i \bmod 2^k) \in S_j, \\ 1 - p_{i,\,2i \bmod 2^k}, & ((2i+1) \bmod 2^k) \in S_j, \\ 0, & \text{otherwise.} \end{cases}$$

An FSM chain is lumpable [59] with respect to a partition if for every choice of an

FSM chain starting vector the lumped process is a Markov chain and the transition

probabilities do not depend on the choice of the FSM starting vector. A process is said to

be weakly lumpable [59] with respect to a partition if at least one starting vector leads to a

Markov chain.

The strong lumpability theorem [59] is stated as:

THEOREM 1. A necessary and sufficient condition for an FSM to be lumpable with respect to a partition $S = \{S_0, S_1, \ldots, S_{N-1}\}$ is that for each pair of aggregate states $S_i$ and $S_j$, $p_{h,S_j}$ has the same value for every FSM state $h \in S_i$.

See [59] for proof of a general case of this theorem.

The strong lumpability condition asserts that all FSM states belonging to an aggregate state should have the same probability of moving to any given aggregate state. We illustrate it using an example. Figure 16 shows two aggregate states, $S_i = \{n, m\}$ and $S_j = \{2n \bmod 2^k, 2m \bmod 2^k\}$, where for ease of notation the aggregate set $\{2n \bmod 2^k, 2m \bmod 2^k\}$ is written as $\{2n, 2m\}$. As outlined in Observation 1, FSM state $2n$ represents one of the two transition possibilities of FSM state $n$. The probability of this transition is denoted as $p_{n,2n}$ in Figure 16. Similarly, FSM state $m$ can move to FSM state $2m$ in one transition and this probability is denoted as $p_{m,2m}$. The overall probability of moving from aggregate state $S_i$ to aggregate state $S_j$ is given as $p_{S_i,S_j}$. For this example, the lumpability condition requires that $p_{S_i,S_j} = p_{n,2n} = p_{m,2m}$.
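The strong lumpability condition of Theorem 1 can be checked numerically for a given transition probability matrix and partition. The sketch below assumes the matrix is a list of rows and the partition a list of sets; the function name, the tolerance parameter and the example values are assumptions made for illustration only.

```python
def is_strongly_lumpable(P, partition, tol=1e-12):
    """Test the strong lumpability condition for matrix P and a partition.

    P is a row-stochastic matrix given as a list of lists; `partition` is a
    list of sets of state indices. Both representations, and the tolerance
    `tol`, are assumptions of this sketch.
    """
    for target in partition:
        for source in partition:
            # Probability of moving from each state in `source` into `target`.
            probs = [sum(P[h][j] for j in target) for h in source]
            if max(probs) - min(probs) > tol:
                return False
    return True
```

As an illustration with made-up numbers, a 4-state matrix whose rows 0 and 2 are identical and whose rows 1 and 3 are identical is lumpable with respect to `[{0, 2}, {1, 3}]`; changing a single entry of row 2 breaks the condition, which is the situation one would expect for probabilities estimated from measured traces.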

Since accurate wireless modeling necessitates the derivation of the Markov model

parameters from traces collected over an actual network, it is virtually impossible to

guarantee that the consequent FSM chain will have a transition probability matrix that

strongly or weakly satisfies the lumpability condition. (This assertion can be easily

verified by considering any of the real-life traces collected over actual wireless MAC

channels.) We hence deduce that lumpability in its precise form is not generically

applicable to wireless channel modeling.

The above discussion motivates a new question: Can we somehow relax the lumpability conditions so that they are more readily applicable to the wireless modeling problem under investigation? The following section tackles this question.


A.5.5.2 Folded Markov Chains

The lumpability condition is too stringent to be enforced on wireless models. In this

section, we modify an FSM chain’s state transition probabilities such that the modified

chain can be divided into two equal-sized partitions that satisfy the strong lumpability

condition. We show that this state aggregation procedure can be applied recursively to a

transition probability matrix to achieve a desired level of complexity. Then we use the

802.11b MAC layer bit-error channel for empirical performance evaluation of this new

class of models.

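[Figure 16: aggregate state $S_i$ contains FSM states $n$ and $m$; aggregate state $S_j$ contains FSM states $2n$ and $2m$. The transitions shown are $n \to 2n$ with probability $p_{n,2n}$, $m \to 2m$ with probability $p_{m,2m}$, and $S_i \to S_j$ with probability $p_{S_i,S_j}$.]

Figure 16. Aggregate states $S_i$ and $S_j$ containing FSM states $\{n, m\}$ and $\{2n, 2m\}$, respectively.

We first note that to reach FSM states $2i$ and $2i + 1$, for $0 \le i \le 2^{k-1} - 1$, in a single transition, the current state of the FSM chain should be either state $i$ or state $2^{k-1} + i$. In other words, the following pairs of FSM states have the same set of next possible states:

$$(0,\, 2^{k-1}),\ (1,\, 1 + 2^{k-1}),\ \ldots,\ (2^{k-1} - 1,\ 2^{k-1} - 1 + 2^{k-1} = 2^k - 1). \tag{A.19}$$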

For instance, a 4-state (i.e., memory-length of 2 bits) FSM transition probability matrix is given by

$$\begin{bmatrix} p_{0,0} & p_{0,1} & 0 & 0 \\ 0 & 0 & p_{1,2} & p_{1,3} \\ p_{2,0} & p_{2,1} & 0 & 0 \\ 0 & 0 & p_{3,2} & p_{3,3} \end{bmatrix}.$$

We can see that states 0 and 2 have the same one-step transition possibilities since in one transition both these states can either transit to state 0 or state 1. Now if the probability of transiting to state 0 is the same from both states 0 and 2 then these states will satisfy the lumpability condition and hence can be aggregated together. Similarly, states 1 and 3 have the same transition possibilities. If the probability of transiting to state 2 is the same for both states 1 and 3 then they can be aggregated together. That is, if the above conditions are satisfied then the FSM chain can be lumped with respect to partitions $S_0 = \{0,\ 0 + 2^{2-1} = 2\}$ and $S_1 = \{1,\ 1 + 2^{2-1} = 3\}$, thereby giving the following transition probability matrix:

$$\begin{bmatrix} p_{S_0,S_0} & p_{S_0,S_1} \\ p_{S_1,S_0} & p_{S_1,S_1} \end{bmatrix}.$$

Based on the observation that the states in each pair $(i,\ 2^{k-1} + i)$ have the same one-step transition possibilities, we propose to modify an FSM chain’s transition probability matrix as follows:

$$\hat{p}_{i,\,2i \bmod 2^k} = \hat{p}_{2^{k-1}+i,\,2i \bmod 2^k} = \frac{p_{i,\,2i \bmod 2^k} + p_{2^{k-1}+i,\,2i \bmod 2^k}}{2}$$

and

$$\hat{p}_{i,\,(2i+1) \bmod 2^k} = \hat{p}_{2^{k-1}+i,\,(2i+1) \bmod 2^k} = 1 - \hat{p}_{i,\,2i \bmod 2^k}, \tag{A.20}$$

for $i = 0, 1, \ldots, 2^k - 1$, where $p_{i,j}$ and $\hat{p}_{i,j}$ represent the transition probabilities of the original and modified FSM chains, respectively. After this transformation, state pairs $(i,\ 2^{k-1} + i)$ in the modified transition probability matrix clearly satisfy the lumpability constraint and can be aggregated together.

Using the above strategy, any $2^k \times 2^k$ FSM transition probability matrix can be modified and folded about $2^{k-1}$ to give a Markov chain with exactly half the number of states. Since the basic transition probability structure is retained after the folding operation, this state reduction procedure can in fact be applied recursively to a $2^k$-state FSM chain to give a $2^m$-state folded Markov chain, where $m$ is an integer such that $1 \le m < k$. We henceforth refer to these models as folded Markov chains (FMCs).
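The folding operation can be sketched as follows, under the assumption that the transition probability matrix is stored as a list of rows and that aggregate state $i$ of the folded chain stands for the FSM pair $(i,\ 2^{k-1} + i)$. The function names `fold_once` and `fold_to` are hypothetical.

```python
def fold_once(P, k):
    """Fold a 2^k-state FSM transition matrix into 2^(k-1) states.

    Rows i and i + 2^(k-1) are averaged and then merged into aggregate
    state i, which is one reading of the modification described above.
    """
    if k < 2:
        raise ValueError("folding needs a memory-length of at least 2")
    n_states = 2 ** k
    half = n_states // 2
    folded = [[0.0] * half for _ in range(half)]
    for i in range(half):
        zero_next = (2 * i) % n_states
        one_next = (2 * i + 1) % n_states
        # Average the two "append a 0" probabilities of the paired states.
        p_zero = (P[i][zero_next] + P[i + half][zero_next]) / 2.0
        # FSM state s belongs to aggregate state s mod 2^(k-1).
        folded[i][zero_next % half] = p_zero
        folded[i][one_next % half] = 1.0 - p_zero
    return folded


def fold_to(P, k, m):
    """Apply fold_once repeatedly until the chain has 2^m states (1 <= m < k)."""
    while k > m:
        P = fold_once(P, k)
        k -= 1
    return P
```

Each call halves the number of states while preserving the two-successor structure of Observation 1, which is why the procedure can be repeated.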

A folded process is a coarse approximation of an FSM chain because folding simply

ensures a non-zero transition probability between two aggregate states. However, the

FSM transition probabilities for the FSM states that are aggregated/grouped together may

be very different. Thus a folded process represents an on-average behavior of the FSM.

This fact will become clear in the performance evaluation section.

At this point, the following question may be raised: How is a $2^m$-state FMC different from a $2^m$-state FSM? A $2^m$-state FSM represents a process with a memory-length of $m$, whereas a $2^m$-state FMC might be a folded version of an FSM with a memory-length greater than $m$. For instance, in the following section we compare the performance of a 64-state FSM with that of a 64-state FMC. While the number of states is the same in both models, the FSM has a memory-length of 6 whereas the FMC is formed by performing three folding operations on an FSM with a memory-length of 9.

A.5.5.3 Evaluation of Folded Markov Chains

We fold the order-9 FSM chain for the 5.5 Mbps process to FMCs having 256, 128,

64, 32, 16, 8, 4 and 2 states. The performance comparison of these FMCs with FSMs of

memory-lengths 2, 3, 4, 5, 6, 7 and 8 is provided in Figure 17. It can be clearly seen that

the FMC performance for any number of states is similar to or worse than the FSM. Thus

the FMCs do not provide any improvement in performance over varying order FSMs.

FMC performance was similar for a 2 Mbps bit-error channel. It can be deduced that capturing only the on-average FSM behavior, as the FMCs do, is not sufficient, and that more statistical characteristics of FSM chains should be incorporated in an effective model.

[Figure 17: two panels plotting ENK against the number of states (2 to 512) for the FSM and FMC models: (a) good-bursts, (b) bad-bursts.]

Figure 17. Performance of FMCs formed by folding a 512 state FSM to 256, 128, 64, 32, 16, 8, 4 and 2 states; the FSM process is trained using a 5.5 Mbps trace.

A.5.6 Complexity Reduction by Approximating an FSM Chain’s Good- and Bad-Burst Behavior

Now that we have established that lumpability and relaxed versions of it cannot

capture the complex wireless bit-error behavior, we focus on analyzing how an FSM

chain captures a channel’s good- and bad-bursts. To that end, in this section we derive

generalized probability distributions of good- and bad-bursts for an FSM chain of

arbitrary order. The probability distributions are derived in terms of FSM chain transition

and steady-state probabilities. These distributions render useful insights into important

FSM characteristics, which are used to develop guidelines for defining FSM state space

partitions. (Recall that the objective of the present analysis is to ascertain partitions of

FSM state space. FSM states in a particular partition are then grouped together to form an

aggregate state in the low-complexity approximating process.) We want to define the

FSM state space partitions such that the resulting aggregate process, while being less

complex, closely matches the FSM chain characteristics.

Let $H$ and $S$ denote the state spaces of an FSM chain and an aggregate (approximate) process, respectively. Let $i \in H$ and $S_i \in S$ denote two arbitrary states of the FSM and the approximate process, respectively. From Lemma 1 we have a necessary condition that should be imposed on the aggregate states $S_i$. To simplify notation, from this point forward we drop the $\bmod\ 2^k$ operation (where $k$ is the memory-length) on each FSM chain state. Thus, an FSM state $(i \bmod 2^k)$ is simply written as state $i$. As in previous sections, let $I$ and $B$ denote the good- and bad-bursts random variables, respectively. We want to derive closed-form expressions for the distributions of $I$ and $B$ in terms of FSM chain parameters. We expect the expressions for good- and bad-burst random variables to render insights into how an FSM chain captures these random variables. The following theorem states the FSM probability distribution of good-bursts:

THEOREM 2. The probability distribution of a good-burst of length exactly $l$, $\Pr\{I = l\}$, for an FSM chain of memory-length $k$ is

$$\Pr\{I = l\} = \sum_{i=0}^{2^{k-1}-1} \pi_{2i+1} \times \prod_{j=0}^{\min\{k-2,\,l-2\}} p_{(2i+1)2^{j},\,(2i+1)2^{j+1}} \times \mu_i, \quad \forall\, k, l > 0,$$

where

$$\mu_i = \begin{cases} p_{(2i+1)2^{l-1},\,(2i+1)2^{l}} \times p_{(2i+1)2^{l},\,(2i+1)2^{l+1}+1}, & l < k, \\ p_{2^{k-1},0} \times p_{0,0}^{\,l-k} \times p_{0,1}, & l \ge k. \end{cases} \tag{A.21}$$

Proof: Before proceeding with the proof, we recall that the subscripts of all transition and steady-state probabilities are modulo $2^k$. Let us focus on the proof of the $l \ge k$ case since the proof of the other case is much simpler and follows a similar procedure. Given any current state, a good-burst (i.e., burst of 0’s) will start if the current state has a 1 in the LSB position of the memory-window, i.e., the current state represents an odd-numbered FSM state $X_n = 2i + 1$, $0 \le i \le 2^{k-1} - 1$.

[Figure 18: state path starting from the initial state $X_n = 2i + 1 = (x_{k-1}, x_{k-2}, \ldots, x_1, x_0 = 1)_2$, then $X_{n+1} = (2i+1)2 \bmod 2^k$, $X_{n+2} = (2i+1)2^2 \bmod 2^k$, …, $X_{n+k-1} = (2i+1)2^{k-1} \bmod 2^k = 2^{k-1}$, $X_{n+k} = 0$, …, $X_{n+l} = 0$, $X_{n+l+1} = 1$.]

Figure 18. State transitions of an FSM with memory-length $k$ and a good-burst of length $l \ge k$.

Without loss of generality, consider the state path given in Figure 18. For a good-burst of length $l$ starting in the current odd-numbered FSM state, the next $k - 1$ transitions will follow the state path $(2i+1),\ (2i+1)2,\ (2i+1)2^2,\ \ldots,\ (2i+1)2^{k-1}$. Note that $(2i+1)2^{k-1} \bmod 2^k = 2^{k-1}$ and, based on the discussion in Observation 1, the process wraps around to FSM state 0 at this point, i.e., at point $X_{n+k-1} = 2^{k-1}$, the good-burst continues and the process wraps around, $X_{n+k} = 0$. This transition sequence is followed by $l - k$ zero bits, i.e., the next $l - k$ transitions are from state 0 to state 0, giving $X_{n+k} = X_{n+k+1} = X_{n+k+2} = \cdots = X_{n+l} = 0$. The good-burst ends when a one bit is encountered at the $(l+1)$st transition, and the FSM process moves to $X_{n+l+1} = (000 \ldots 01)_2 = (1)_{10}$. This state-transition path, when expressed in the form of probabilities, will have to be summed over all possible odd-valued FSM states,

$$\begin{aligned} \Pr\{I = l\} ={} & \pi_1 \times p_{1,\,1 \cdot 2} \times p_{1 \cdot 2,\,1 \cdot 2^2} \times \cdots \times p_{1 \cdot 2^{k-2},\,1 \cdot 2^{k-1}} \times p_{1 \cdot 2^{k-1},\,0} \times p_{0,0}^{\,l-k} \times p_{0,1} \\ & + \pi_3 \times p_{3,\,3 \cdot 2} \times p_{3 \cdot 2,\,3 \cdot 2^2} \times \cdots \times p_{3 \cdot 2^{k-2},\,3 \cdot 2^{k-1}} \times p_{3 \cdot 2^{k-1},\,0} \times p_{0,0}^{\,l-k} \times p_{0,1} \\ & + \cdots \\ & + \pi_{2^k-1} \times p_{2^k-1,\,(2^k-1)2} \times p_{(2^k-1)2,\,(2^k-1)2^2} \times \cdots \times p_{(2^k-1)2^{k-2},\,(2^k-1)2^{k-1}} \times p_{(2^k-1)2^{k-1},\,0} \times p_{0,0}^{\,l-k} \times p_{0,1}. \end{aligned}$$

Taking out common terms yields

$$\Pr\{I = l\} = p_{2^{k-1},0} \times p_{0,0}^{\,l-k} \times p_{0,1} \times \sum_{i=0}^{2^{k-1}-1} \pi_{2i+1} \prod_{j=0}^{k-2} p_{(2i+1)2^{j},\,(2i+1)2^{j+1}},$$

which is the same as the expression in Theorem 2 for all $l \ge k$. ∎

Some explanation of the good-bursts probability distribution given above is as follows: Let $n$ denote the discrete time index at which a good-burst started. The last bit received before the good-burst must be a corrupted bit, i.e., $x(n-1) = 1$. Thus, at time instance $n - 1$ the FSM chain’s memory-window had a “1” at the LSB position. In other words, the FSM chain was in an odd state, i.e., $X_{n-1} = 2i + 1$, where $0 \le i \le 2^{k-1} - 1$. For the good-bursts probability distribution, we have to account for (or sum over) all the odd states of the FSM chain. This fact explains the $\pi_{2i+1}$’s in the additive expression of (A.21). For a good-burst of $l$ bits, the $l$ bits following $x(n-1) = 1$ must be error-free, i.e., $x(n) = 0,\ x(n+1) = 0,\ \ldots,\ x(n+l-1) = 0$. This results in the multiplicative expression following each $\pi_{2i+1}$. Thus the multiplicative expression characterizes the state transition path for $l$ good bits starting in FSM state $2i + 1$. Since the good-burst ends after $l$ bits, the $(n+l)$-th bit must be corrupted, i.e., $x(n+l) = 1$. The $\mu_i$ expression characterizes the transition on the $(n+l)$-th step depending on whether the total burst-length is smaller or longer than the memory-window.
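The state path described above can also be evaluated numerically. The sketch below does not code (A.21) term by term; instead it walks the path used in the proof (start in an odd state, append $l$ zeros, then append a one), which should agree with the theorem for any $l > 0$. The function name and the list-based representation of the transition matrix and steady-state vector are assumptions of this sketch.

```python
def good_burst_prob(P, pi, k, l):
    """Evaluate Pr{I = l} by following the state path of a good-burst.

    P is the 2^k x 2^k transition matrix (list of rows) and pi the vector of
    state probabilities; both layouts are assumptions of this sketch.
    """
    n_states = 2 ** k
    total = 0.0
    for start in range(1, n_states, 2):       # all odd (LSB = 1) states
        prob = pi[start]
        state = start
        for _ in range(l):                    # l error-free bits
            nxt = (2 * state) % n_states
            prob *= P[state][nxt]
            state = nxt
        # The burst ends when a corrupted bit enters the window.
        prob *= P[state][(2 * state + 1) % n_states]
        total += prob
    return total
```

For $l \ge k$ the value returned factors into the common term $p_{2^{k-1},0}\, p_{0,0}^{\,l-k}\, p_{0,1}$ times a sum over the odd states, as in the last step of the proof.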

Similar to Theorem 2, the probability distribution of a bad-burst of length l is given

in the following theorem:

THEOREM 3. The probability distribution of a bad-burst of length exactly $l$, $\Pr\{B = l\}$, for an FSM chain of memory-length $k$ is

$$\Pr\{B = l\} = \sum_{i=0}^{2^{k-1}-1} \pi_{2i} \times \prod_{j=0}^{\min\{k-2,\,l-2\}} p_{(2i+1)2^{j}-1,\,(2i+1)2^{j+1}-1} \times \mu_i, \quad \forall\, k, l > 0,$$

where

$$\mu_i = \begin{cases} p_{(2i+1)2^{l-1}-1,\,(2i+1)2^{l}-1} \times p_{(2i+1)2^{l}-1,\,(2i+1)2^{l+1}-2}, & l < k, \\ p_{2^{k-1}-1,\,2^{k}-1} \times p_{2^{k}-1,\,2^{k}-1}^{\,l-k} \times p_{2^{k}-1,\,2^{k}-2}, & l \ge k. \end{cases} \tag{A.22}$$

The proof of this theorem is omitted because it is very similar to the proof of Theorem 2.

The expressions for the good- and bad-burst probability distributions given in (A.21) and (A.22) are rather convoluted. Hence, in their present forms, these expressions neither offer any obvious insight into the FSM chain behavior nor are they amenable to further analysis. In the following section, we employ Observation 2 to simplify the probability distribution expressions of (A.21) and (A.22). The simplification in turn leads us to the design guidelines that should be followed by a low-complexity model.


A.5.6.1 Simplification of Good-bursts Distribution

We know from Observation 2 that the steady-state probability of FSM state 0 is very

high. Consequently, the steady-state probabilities of odd FSM states in the good-bursts

expression of (A.21) are negligible. The terms involving a transition to or from state 0 of

the FSM will hence dominate the good-burst probability distribution of (A.21).

Moreover, since the channel usually stays in the good state for practical wireless

networks, the good-burst length should in general be significantly greater than the

memory-length. Hence, an effective good-bursts probability distribution $\Pr\{I = l\}$ should accurately capture the $l \ge k$ behavior. An approximation of the good-bursts probability distribution of (A.21) for $l \ge k$ can be written as:

$$\Pr\{I = l\} \approx p_{2^{k-1},0} \left(p_{0,0}\right)^{l-k} p_{0,1}, \quad \forall\, l \ge k > 0. \tag{A.23}$$

Although the above expression is an approximation of the FSM chain’s good-bursts

probability distribution, it is clearly more insightful. For instance, note that the parameter characterizing this approximate probability distribution is the probability of a good bit transmission followed by another good bit transmission ($p_{0,0}$), since this is the only parameter in (A.23) that involves the good-burst length, $l$. Hence, one important

consideration while grouping FSM states should be that the all-zero (i.e., no-error) FSM

state is not grouped with a large number of other states. This is a natural consequence of

Observation 2 which implies that the mean time spent in the all-zero (i.e., no-error) FSM

state is significantly higher than all other FSM states.
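One way to make the role of $p_{0,0}$ concrete, assuming nothing beyond the approximation (A.23) itself, is to take the ratio of consecutive burst-length probabilities:

$$\frac{\Pr\{I = l+1\}}{\Pr\{I = l\}} \approx \frac{p_{2^{k-1},0}\; p_{0,0}^{\,l+1-k}\; p_{0,1}}{p_{2^{k-1},0}\; p_{0,0}^{\,l-k}\; p_{0,1}} = p_{0,0}, \qquad l \ge k.$$

Under (A.23), the tail of the good-burst distribution therefore decays geometrically at rate $p_{0,0}$, so any distortion of $p_{0,0}$ introduced by state aggregation compounds with the burst length.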


Similarly, in addition to the FSM state 0, two other important FSM states are state $2^{k-1}$ and state 1 since $p_{2^{k-1},0}$ and $p_{0,1}$ are the only parameters, other than $p_{0,0}$, that appear in the approximate probability distribution given in (A.23). Hence, due to their relative importance in describing real-life wireless channels, a good model, in addition to FSM state 0, should not group FSM states 1 and $2^{k-1}$ with too many other states. This guideline will be employed to define the constant-complexity model.

A.5.6.2 Simplification of Bad-bursts Distribution

For the bad-bursts probability distribution of (A.22), we again invoke Observation 2 and neglect the terms in (A.22) that are not multiplied with $\pi_0$. Using this approximation, the bad-bursts distribution (A.22) can be written as:

$$\Pr\{B = l\} \approx \pi_0 \times \prod_{j=0}^{\min\{k-2,\,l-2\}} p_{2^{j}-1,\,2^{j+1}-1} \times \mu_0,$$

where

$$\mu_0 = \begin{cases} p_{2^{l-1}-1,\,2^{l}-1} \times p_{2^{l}-1,\,2^{l+1}-2}, & l < k, \\ p_{2^{k-1}-1,\,2^{k}-1} \times p_{2^{k}-1,\,2^{k}-1}^{\,l-k} \times p_{2^{k}-1,\,2^{k}-2}, & l \ge k. \end{cases} \tag{A.24}$$

The only terms appearing in (A.24) after the approximation involve FSM states 0, $2^k - 2$, and $2^j - 1$, for any $1 \le j \le k$. From Observation 2 and the good-bursts approximation, we have already established that FSM state 0 should not be aggregated with many other states. This deduction is reasserted here. Moreover, it is preferable not to aggregate FSM state $2^k - 2$ with many other states. Also, if possible, all FSM states $2^j - 1$, where $1 \le j \le k$, should not be grouped with too many other states.

A.5.6.3 Guidelines for Approximating an FSM chain

Based on the analyses of previous sections, we now define guidelines that should be

followed to develop partitions on the FSM state space. FSM states in each partition are

then aggregated to give a low-complexity aggregate model. The FSM state aggregation

procedure is based on the underlying assumption that there is a given complexity budget.

That is, the required number of states in the aggregate model is specified beforehand.

FSM state aggregation should result in a model which has the required number of states.

Given a complexity budget in the form of the total number of states and based on

preceding discussions, we define the following guidelines that should be followed to

develop an aggregate model with total number of states satisfying the complexity budget:

Guideline 1. Any FSM chain state aggregation should satisfy the condition given in Lemma 1.

Guideline 2. FSM state 0 should not be aggregated with other states.

Guideline 3. FSM states $2^{k-1}$ and 1 should be aggregated with a minimal number of other states.

Guideline 4. FSM states $2^k - 2$ and $2^j - 1$, for all $1 \le j \le k$, should be aggregated with a minimal number of other states.

Note that Guideline 1 and Guideline 2 are more assertive than Guideline 3 and Guideline 4. This is due to the analysis provided in the previous section, which outlined that: (i) Guideline 1 is necessary for an accurate model, and (ii) Guideline 2, which is a consequence of Observation 2, is asserted by the approximate distributions of both good- and bad-bursts. Also note that Guideline 1, Guideline 2, and Guideline 3 can be easily satisfied in a low-complexity model. However, Guideline 4 is somewhat problematic because putting each FSM state $2^j - 1$, for all $1 \le j \le k$, in a separate partition (i.e., separate aggregate state) makes the total number of states of the approximate model an increasing function of the memory-length $k$. Thus, satisfying Guideline 4 implies that the resultant complexity (i.e., number of states) of the aggregate model will be at least a linear function of the memory-length. We, on the other hand, want to keep the number of states in the model independent of the underlying process’ memory-length. In the following section, we develop a constant-complexity model which adheres to the first three guidelines. Performance evaluation of the model for 802.11b channels demonstrates that although the proposed model ignores Guideline 4, it approximates an FSM chain’s behavior with outstanding accuracy.

A.5.7 Constant-Complexity Model

In this section, we propose a constant-complexity model (CCM) which adheres to

Guideline 1, Guideline 2, and Guideline 3. Here, it should be emphasized that the FSM

state space partitioning presented in this section is only one of the many possible state

assignments. Future low-complexity channel models can define other state partitions

which should perform adequately as long as the above guidelines are followed.

The CCM keeps FSM states 0, 1 and $2^{k-1}$ each in a separate partition, while grouping all the remaining FSM states into two partitions. The resulting model always has 5 states irrespective of the memory-length. The structure and transition possibilities of the CCM are illustrated in Figure 19. Figure 19 shows that the CCM assigns separate states to FSM states 0, 1 and $2^{k-1}$, thereby adhering to Guideline 2 and Guideline 3. All remaining even FSM states are grouped in a single aggregate CCM state, while all remaining odd FSM states are grouped in another aggregate state. Note that none of the CCM states contains both an odd and an even FSM state, i.e., an aggregate state either contains even FSM states or odd FSM states. Thus Guideline 1, which states that FSM states $2i$ and $2i + 1$ should not be aggregated together, is also satisfied by the CCM. Based on our analysis, this 5-state CCM should follow the behavior of the underlying $2^k$-state FSM quite closely. This CCM efficacy will be highlighted in the next section, where we compare its performance with FSM and linear-complexity models.
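The CCM state assignment described above can be sketched as a mapping from FSM states to five aggregate labels. The labels 0 to 4 and the function names are hypothetical choices for this illustration, and the sketch assumes $k \ge 3$, since for smaller memory-lengths some of the five groups would be empty or would coincide.

```python
def ccm_state(i, k):
    """Map FSM state i to one of five aggregate CCM states.

    Labels (hypothetical, chosen for this sketch):
      0 -> FSM state 0, 1 -> FSM state 1, 2 -> FSM state 2^(k-1),
      3 -> all remaining even states, 4 -> all remaining odd states.
    """
    if k < 3:
        raise ValueError("this sketch assumes a memory-length of at least 3")
    if not 0 <= i < 2 ** k:
        raise ValueError("state must lie in 0 .. 2^k - 1")
    if i == 0:
        return 0
    if i == 1:
        return 1
    if i == 2 ** (k - 1):
        return 2
    return 3 if i % 2 == 0 else 4


def ccm_partition(k):
    """List the FSM states that fall into each aggregate CCM state."""
    groups = {label: [] for label in range(5)}
    for i in range(2 ** k):
        groups[ccm_state(i, k)].append(i)
    return groups
```

Because every aggregate state contains states of a single parity, the two successors $2i$ and $2i + 1$ of any FSM state always land in different aggregate states, which is the condition required by Lemma 1.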

[Figure 19: five boxes, one per aggregate CCM state, containing FSM states $\{0\}$, $\{1\}$, $\{2^{k-1}\}$, $\{2, 4, 6, \ldots, 2^{k-1} - 2,\ 2^{k-1} + 2, \ldots, 2^k - 2\}$ and $\{3, 5, 7, \ldots, 2^k - 1\}$, with the transitions between them.]

Figure 19. State aggregation and transitions for the CCM. Each box represents an aggregate CCM state. The number(s) inside a CCM state are the aggregated FSM states.

A.5.7.1 Performance of the CCM at 2 Mbps

We provide an ENK-based performance comparison between the 548-state FSM and the 5-state CCM for memory-lengths ranging from 1 up to 10 in Figure 20 and Figure 21. We also compare performance with the previously proposed short-term energy model (SEM) and zero-crossing model (ZCM) [29]. The complexity of these two models increases linearly with the memory-length. The performance of the 548-state FSM model serves as the benchmark for evaluating the CCM, SEM and ZCM. The longest memory-length of 10 yields a 548-state FSM, an 11-state SEM, a 10-state ZCM and a 5-state CCM. Let us first focus on Figure 20, which plots performance versus complexity for FSM chains, CCM, SEM and ZCM. Although all memory-lengths from 1 up to 10 were evaluated, to show the results clearly this figure only plots the ENK values for a certain number of states.

Due to the fixed CCM complexity, only the ENK performance of one CCM (corresponding to a memory-length of 8) is shown in Figure 20. This particular CCM was chosen since it rendered the best overall performance. The performance of CCM models for the remaining memory-lengths will be discussed shortly. It is clear from Figure 20 that for the good-bursts random variable the CCM performs as well as the 548-state FSM. For the same complexity as the CCM (i.e., 5 states), the linear-complexity models have higher ENK overhead. However, the performance of higher order linear-complexity (SEM and ZCM) models is reasonable. Hence, it can be deduced that the CCM captures the good-bursts behavior of the 2 Mbps wireless MAC layer channel accurately and with fewer states than any other model under consideration. Similarly, Figure 20 shows that the CCM ENK overhead for the bad-bursts random variable is also very small and is quite comparable to the corresponding FSM, SEM and ZCM. Specifically, the CCM incurs an ENK overhead of 0.053 as opposed to 0.018 for the 4-state FSM, 0.039 for the 5-state SEM and 0.0386 for the 5-state ZCM.

[Figure 20: two panels plotting ENK against the number of states (2 to 1024) for the FSM, CCM (memory-length = 8), SEM and ZCM: (a) good-bursts, (b) bad-bursts.]

Figure 20. ENK based modeling performance versus complexity for the 2 Mbps bit-error process.

[Figure 21: two panels plotting ENK against memory-length for the FSM, CCM, SEM and ZCM: (a) good-bursts, (b) bad-bursts.]

Figure 21. ENK based modeling performance versus memory-length for the 2 Mbps bit-error process.

Figure 21 provides further insight into the performance of CCM for different

memory-lengths. From Figure 21 it can be observed that the CCM performance for all

orders is better than the FSM model, SEM and ZCM for the good-bursts random variable.

In case of the bad-bursts random variable, the performance of all models with memory-

lengths greater than 3 is comparable. The CCM performance for small orders is better

than the linear-complexity models. For high orders, while both linear- and constant-

complexity models have slightly greater overhead than the FSM model, the CCM

performance is comparable to its linear-complexity counterparts.

The ENK divergence highlights that the CCM provides an accurate and low-

complexity bit-error model for 802.11b LANs operating at 2 Mbps. This performance

substantiates our initial analysis which outlined that a 5-state CCM can render a

performance that is comparable to the respective $2^k$-state FSM chain. As shown in [29],

the linear-complexity (SEM and ZCM) models also yield very good ENK based

performances.

A.5.7.2 Performance of the CCM at 5.5 Mbps

ENK based performances of the FSM chains, CCM, SEM and ZCM at 5.5 Mbps are

outlined in Figure 22 and Figure 23. Only a CCM with memory-length of 6 is shown in

Figure 22 since it renders the best overall (good- and bad-bursts) performance. It is clear

from Figure 22 that the CCM performance for the good-bursts random variable is

Figure 22. ENK based modeling performance versus complexity for the 5.5 Mbps bit-error process: (a) good-bursts; (b) bad-bursts. [Plots omitted: ENK versus model complexity.]

Figure 23. ENK based modeling performance versus memory-length for the 5.5 Mbps bit-error process: (a) good-bursts; (b) bad-bursts. [Plots omitted: ENK versus memory-length for the FSM, CCM, SEM and ZCM.]

comparable to or better than all other modeling techniques. Note however that the ZCM

performs slightly better than the CCM. Thus the CCM and ZCM, even at low orders,

capture the good-bursts behavior of the 5.5 Mbps channel very accurately. Similarly,

Figure 22 shows that the CCM ENK overhead for the bad-bursts random variable is also

very small. Figure 23 outlines the performance rendered by CCMs corresponding to

different memory-lengths. From Figure 23 it can be observed that the CCM performance

for all orders is better than or comparable to the FSM, SEM and the ZCM for the good-

bursts random variable. In case of the bad-bursts random variable, the performances of all

the models except the SEM are similar.

Thus, while keeping both complexity and modeling performance under consideration,

the ENK divergence asserts that the CCM outperforms its linear-complexity counterparts

in modeling of the 802.11b bit-errors at 5.5 Mbps.

A.5.8 Discussion

At this point, we have developed accurate and low-complexity models for the

wireless bit-error channels under consideration. In the following chapters, we explore the

application and usefulness of these models. Specifically, the next chapter uses these

models in a novel wireless multimedia framework. The last contribution chapter of this

part quantifies the inaccuracies that are incurred if channel memory is ignored and a low-

order FSM model is used to simulate and analyze wireless systems.

CHAPTER A.6 CHANNEL MODEL BASED HEADER ESTIMATION FOR WIRELESS MULTIMEDIA

Wireless channels incur unpredictable and time-varying packet losses due to channel

interference and node mobility. This data loss is particularly detrimental for real-time

communications since their delay constraints generally do not allow retransmission-based

recovery of lost packets. Consequently, recent multimedia standards have introduced

enhanced error resilience and concealment features (e.g., slices in JVT/H.264 [83] and

reversible VLC in MPEG-4 [84]) to cater for bandwidth-constrained and error-prone

wireless channels. Distortion in multimedia quality at a wireless receiver can be

substantially decreased if corrupted packets, instead of being dropped, are relayed to the

multimedia application. The application can then decide to retain, drop or recover the

corrupted packets.

To improve packet throughput at a wireless receiver, enhanced robustness is provided

at the physical layer of emerging wireless protocol stacks. Nevertheless, residual/MAC-

to-MAC errors not corrected by the physical layer cause checksum failures at higher

(MAC and transport) layers, leading to a significant number of packet drops. The UDP-Lite
protocol was proposed to address this problem [41]-[44]. As explained in Section

A.2.2, the proposed UDP-Lite based transport schemes ignore errors in the application

layer payload, but drop all packets that have one or more bit-errors in the IP, the UDP, or

the application layer headers.

It has been shown that UDP-Lite based partial protection with application layer

forward error correction (FEC) improves wireless bandwidth utilization [41]-[51].

Support of partial protection necessitates changes to the standard protocols at the

multimedia transmitter and/or intermediate network nodes. In many realistic scenarios,

modifications to multimedia servers and/or intermediate nodes cannot be dictated by the

end-receivers. We argue that the requirement of transmitter modifications in UDP-Lite

has hampered its wide-spread deployment. Furthermore, frequent header errors result in

significant packet drops for UDP-Lite, especially at high data rates².

UDP-Lite’s shortcomings can be addressed by a receiver-based scheme that, in

addition to ignoring payload errors, can estimate corrupted header fields. For such a

header estimation scheme to be practical, modifications below the application layer

should only be made to the wireless receiver. Thus no additional information (such as

FEC redundancy) is available for header estimation at the receiver. However, the

corrupted payloads relayed to the receiver’s application layer by a header estimation

scheme can and should be corrected using application layer FEC decoding. In this

chapter, we propose a cross-layer header estimation methodology that uses the MAC
layer bit-error channel models developed in the previous chapters to estimate the
corrupted headers of a packet.

Before outlining the actual header estimation methodology, we derive and present

sound analytical conditions for the region-of-operation under which header estimation

performs better/worse than UDP and UDP-Lite. We clearly show that for any realistic

wireless system, the FEC redundancy required by header estimation is always lower than

² In [18] the authors showed that, under realistic settings of an 802.11b network, a UDP-Lite based protocol stack drops 5.87% and 36.7% of packets at 5.5 and 11 Mbps, respectively.

that required by the UDP and UDP-Lite protocols. Since FEC is generally performed at the byte level, analysis

is provided for an arbitrary symbol size with the implicit assumption that the symbol size

is greater than one bit. We demonstrate the efficacy of header estimation for two

important classes of symbol-level wireless channels: symmetric/memory-less channels

and Gilbert channels. We show that an ideal header estimation scheme can provide

redundancy reduction (or goodput improvement) of up to 75% over UDP and UDP-Lite.

The analysis in the first part of this chapter serves as a motivation to develop a

practical, effective and accurate header estimation framework to improve wireless

multimedia quality. We propose a header estimator that can use the accurate MAC layer

bit-error channel models developed in the preceding chapter to estimate the corrupted

critical header fields (CHF) of a packet, while non-critical header fields are simply

ignored. At a header estimation-based UDP multimedia receiver, the most likely

transmitted CHF is estimated through channel parameters. The proposed scheme requires

no modifications to the standard protocols at senders and/or intermediate nodes. Only

minor protocol stack modifications are needed at the receiver. We map header estimation

to a problem of maximum-likelihood (ML) estimation of known parameters in noise [79].

We derive likelihood functions for an arbitrary-order full-state Markov chain model and a

multifractal wavelet model [61]-[63]. The FSM likelihood function is then extended to
provide a likelihood for the constant-complexity model. Trace-driven video simulations

at varying data rates of an 802.11b LAN show that the proposed scheme provides

significantly better throughput and multimedia quality than normal UDP and UDP-Lite.

A.6.1 FEC Redundancy Lower Bounds for UDP, UDP-Lite and Header Estimation

In this section, we derive theoretical bounds on the improvements provided by an
ideal header estimation scheme with application layer FEC operating on a $q$-ary
symmetric channel (SC) and a Gilbert channel (GC). Throughout this section, we
consider a MAC layer channel which sends and receives symbols of size $m$ bits. We
assume that this symbol size is equal to the FEC symbol size. Since FEC is generally not
performed on the bit-level, for the following theoretical analysis we assume that $m > 1$.

The term “ideal header estimation” implies that all corrupted packets intended for a

receiver are passed to its application layer. We derive lower bounds on the expected

amount of FEC redundancy required to successfully decode one FEC block. Naturally,

we want the amount of redundancy to be as low as possible for efficient utilization of

scarce wireless bandwidth. The bounds derived in this section answer the following
question: Under what conditions does header estimation require less FEC redundancy
for payload correction than UDP and UDP-Lite?

As mentioned before, we assume that the transmitter packetizes and transmits
symbols of arbitrary size $m$, where $m$ is also the FEC symbol size. A block-based
maximum distance separable (MDS) FEC scheme capable of simultaneously correcting
errors and erasures operates at the transmitter and receiver application layers. The
transmitter packetizes each FEC block into $l$ packets, with each packet having a data
payload of $n_D$ symbols. Before transmitting each packet, a header of size $n_H$ is
appended to the packet. Thus each packet has a fixed length of $n_H + n_D$ symbols. The
FEC algorithm only protects the data symbols, and hence the FEC block-length is $n_D l$
symbols. A total of $r$ out of the $n_D l$ symbols are redundant. A packet dropped by a
protocol below the application layer is treated as a packet erasure by the FEC decoder.
Since the FEC decoder is operating at symbol level, each packet erasure will result in $n_D$
symbol erasures. We are assuming that before decoding, the FEC decoder can identify
missing packets or packet erasures in an FEC block. This can, for example, be achieved
by transmitting an FEC-protected sequence number in each packet.

Let $X$ and $Y$ be two random variables which respectively characterize the number
of errors and erasures observed at the wireless receiver before FEC decoding. An MDS
code can recover all errors $X$ and erasures $Y$ if $2X + Y \le r$ [80].
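The decoding condition above can be illustrated with a minimal sketch. The function name and the value $n_D = 4$ are hypothetical choices made only for this illustration; they do not come from the experiments in this thesis.

```python
def mds_can_decode(num_errors: int, num_erasures: int, redundancy: int) -> bool:
    """Check the MDS condition 2X + Y <= r for X errors and Y erasures."""
    return 2 * num_errors + num_erasures <= redundancy


# Illustrative values only: one dropped packet costs n_D symbol erasures.
n_D = 4
r = 4

# A single dropped packet (Y = n_D, X = 0) fits within r = 4 redundant symbols.
print(mds_can_decode(num_errors=0, num_erasures=n_D, redundancy=r))

# Three symbol errors need 2 * 3 = 6 redundant symbols, which exceeds r = 4.
print(mds_can_decode(num_errors=3, num_erasures=0, redundancy=r))
```

Each symbol error counts twice against the redundancy budget while each erasure counts once, which is why the bounds that follow weight $E\{X\}$ by two.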

A.6.1.1 Redundancy Bounds on the q-ary Symmetric Channel

The inputs and outputs of a $q$-ary symmetric channel (SC) are derived from an
alphabet of $q = 2^m$ symbols. An SC is characterized by a single parameter $p$, the
probability that a transmitted symbol $x_j$ is received as $x_i \neq x_j$:

$$p = \Pr\{x_i \text{ is received} \mid x_j \text{ is transmitted}\} \quad \text{for } i \neq j.$$

The overall probability of a symbol $x_j$ being corrupted over an SC is:

$$p_{SC} = \Pr\{\text{symbol error}\} = (2^m - 1)\,p. \tag{A.25}$$

We now derive FEC redundancy lower bounds on UDP, UDP-Lite and header estimation

based protocol stacks operating on an SC channel.

A.6.1.1.1 FEC Redundancy Bound on a UDP based Protocol Stack

Traditional wireless protocol stacks perform a checksum on the entire packet and
drop all packets that fail the checksum. While the checksum is generally performed at
both UDP and MAC layers, for simplicity and brevity, we refer to a protocol stack that
drops all corrupted packets as a UDP protocol stack. Throughout this chapter, dropped
packets are treated as erasures by the wireless receiver’s application layer FEC decoder;
each dropped packet results in $n_D$ erased symbols. Since UDP drops all corrupted
packets, the number of errors in the received data is always equal to zero,
$\Pr\{X = 0 \mid \text{UDP}\} = 1$. In this section, we derive an expression for the expected value of
the number of erasures, $Y$, observed with the UDP protocol.

For UDP, an $n_D$-symbol erasure will occur whenever a received packet has one or
more symbol-errors. Let $\varepsilon_{udp,SC}$ denote the probability of observing a UDP packet
erasure over an SC:

$$\varepsilon_{udp,SC} = 1 - (1 - p_{SC})^{n_H + n_D},$$

where $p_{SC}$ is the probability of symbol error given in (A.25). The probability of having
$k$ packet erasures over UDP is:

$$\Pr\{k \text{ pkt eras} \mid \text{UDP}\} = \binom{l}{k}\, \varepsilon_{udp,SC}^{\,k}\, (1 - \varepsilon_{udp,SC})^{l-k},$$

where $l$ is the total number of packets containing one FEC block. Then the expected
value of packet erasures is

$$E\{\text{\# of pkt eras} \mid \text{UDP}\} = l\,\varepsilon_{udp,SC} \;\Rightarrow\; E\{\text{\# of symbol eras} \mid \text{UDP}\} = E\{Y \mid \text{UDP}\} = n_D l\,\varepsilon_{udp,SC}.$$

Since $\Pr\{X = 0 \mid \text{UDP}\} = 1$, $E\{X \mid \text{UDP}\} = 0$. Thus the average amount of
redundancy required by a FEC decoder operating on a UDP protocol stack is

$$r_{udp,SC} \ge n_D l\,\varepsilon_{udp,SC} \ge n_D l \left(1 - (1 - p_{SC})^{n_H + n_D}\right). \tag{A.26}$$

A.6.1.1.2 FEC Redundancy Bound on a UDP-Lite based Protocol Stack

Since a UDP-Lite protocol stack drops all packets that have header errors, the
probability of UDP-Lite packet erasures over an SC is
$\varepsilon_{udplite,SC} = \Pr\{\text{corrupt hdr}\} = 1 - (1 - p_{SC})^{n_H}$. Consequently, the expected
number of UDP-Lite erasures is

$$E\{\text{pkt eras} \mid \text{UDPLite}\} = l\,\varepsilon_{udplite,SC} \;\Rightarrow\; E\{\text{symbol eras} \mid \text{UDPLite}\} = E\{Y \mid \text{UDPLite}\} = n_D l\,\varepsilon_{udplite,SC} = n_D l \left(1 - (1 - p_{SC})^{n_H}\right).$$

In addition to erasures, a UDP-Lite protocol stack will also have errors in the application
layer payload. The probability of having $k$ symbol errors in the total
$h = n_D l - E\{Y \mid \text{UDPLite}\}$ symbols received at the FEC decoder is

$$\Pr\{k \text{ symbol errs} \mid \text{UDPLite}\} = \Pr\{X = k \mid \text{UDPLite}\} = \binom{h}{k}\, p_{SC}^{\,k}\, (1 - p_{SC})^{h-k}.$$

The expected number of symbol errors is

$$E\{\text{symbol errs} \mid \text{UDPLite}\} = E\{X \mid \text{UDPLite}\} = h\,p_{SC} = \left(n_D l - E\{Y \mid \text{UDPLite}\}\right) p_{SC} = n_D l\,(1 - p_{SC})^{n_H}\, p_{SC}.$$

Thus the total expected redundancy required to recover the errors and losses in an FEC
block over a UDP-Lite protocol stack is

$$r_{udplite,SC} \ge n_D l\,\varepsilon_{udplite,SC} + 2E\{X \mid \text{UDPLite}\} \ge n_D l \left(1 - (1 - p_{SC})^{n_H} + 2 p_{SC} (1 - p_{SC})^{n_H}\right),$$

$$r_{udplite,SC} \ge n_D l \left(1 - (1 - 2p_{SC})(1 - p_{SC})^{n_H}\right). \tag{A.27}$$

A.6.1.1.3 FEC Redundancy Bound on a Header Estimation based Protocol Stack

Under an ideal header estimation protocol stack, there are no erasures since all
packets are passed to the FEC decoder regardless of whether there are errors in the
headers or payload. That is, $\Pr\{Y = 0 \mid \text{HdrEst}\} = 1 \Rightarrow E\{Y \mid \text{HdrEst}\} = 0$. Based on
previous derivations, the expected number of symbol errors is
$E\{X \mid \text{HdrEst}\} = n_D l\,p_{SC}$. Thus the total expected amount of redundancy required by an
ideal header estimation scheme that passes all packets to the FEC decoder is

$$r_{hdrest,SC} \ge 2 n_D l\,p_{SC}. \tag{A.28}$$
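The three lower bounds (A.26), (A.27) and (A.28) can be evaluated numerically with a short sketch. Dividing each bound by the block length $n_D l$ gives the fraction of the block that must be redundant, so $l$ cancels. The function name is hypothetical, and the values $n_H = 60$ and $n_D = 452$ are taken from the caption of Figure 24. The output is not guaranteed to match the percentages quoted for that figure exactly.

```python
def sc_redundancy_fractions(p_sc: float, n_h: int, n_d: int) -> dict:
    """Lower bounds (A.26)-(A.28), each divided by the block length n_D * l."""
    header_ok = (1.0 - p_sc) ** n_h           # no symbol error in the header
    packet_ok = (1.0 - p_sc) ** (n_h + n_d)   # no symbol error in the packet
    return {
        "udp": 1.0 - packet_ok,                           # (A.26)
        "udplite": 1.0 - (1.0 - 2.0 * p_sc) * header_ok,  # (A.27)
        "hdrest": 2.0 * p_sc,                             # (A.28)
    }


# Header and payload sizes (in symbols) from the caption of Figure 24.
N_H, N_D = 60, 452

for p_sc in (0.0013, 0.01):
    bounds = sc_redundancy_fractions(p_sc, N_H, N_D)
    percentages = {name: round(100.0 * frac, 2) for name, frac in bounds.items()}
    print(p_sc, percentages)
```

Because the header estimation bound depends only on $p_{SC}$, it grows linearly with the symbol error probability, whereas the UDP and UDP-Lite bounds grow with the packet and header lengths as well.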

A.6.1.1.4 Comparison of the FEC Redundancy Bounds

We now compare the minimum expected FEC redundancy of UDP and UDP-Lite
with header estimation. Let us first compare the minimum redundancies of UDP-Lite and
header estimation:

$$\begin{aligned}
\min(r_{udplite,SC}) - \min(r_{hdrest,SC}) &= n_D l \left(1 - (1 - p_{SC})^{n_H}(1 - 2p_{SC}) - 2p_{SC}\right) \\
&= n_D l \left(1 - (1 - p_{SC})^{n_H}\right)(1 - 2p_{SC}).
\end{aligned}$$

Clearly, $\min(r_{udplite,SC}) - \min(r_{hdrest,SC}) > 0$ when $p_{SC} < 0.5$. Thus

$$\min(r_{udplite,SC}) > \min(r_{hdrest,SC}) \quad \text{when } p_{SC} < 0.5. \tag{A.29}$$

The condition $p_{SC} < 0.5$ is true for any realistic wireless channel, and therefore in all
practical wireless environments header estimation should always require less FEC
redundancy than UDP-Lite. In fact, on most wireless channels, $p_{SC} \ll 0.5$.

Now let us compare the minimum redundancy of header estimation with UDP:

$$\begin{aligned}
\min(r_{udp,SC}) - \min(r_{hdrest,SC}) &= n_D l \left(1 - (1 - p_{SC})^{n_H + n_D} - 2p_{SC}\right) \\
&= -n_D l \left(2p_{SC} - 1 + (1 - p_{SC})^{n_H + n_D}\right).
\end{aligned}$$

It can be easily shown that:

$$\min(r_{udp,SC}) > \min(r_{hdrest,SC}) \quad \text{when } p_{SC} < 0.49 \text{ and } (n_H + n_D) \ge 6. \tag{A.30}$$

In accordance with prior discussions, we know that the $p_{SC} < 0.49$ condition is true for
any realistic wireless channel. Also, the size of a wireless packet (headers included),
$n_H + n_D$, is always greater than 6. For instance, in 802.11b networks, even without any
payload data, the total size of MAC, IP and UDP headers is 60 bytes.
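Conditions (A.29) and (A.30) can be spot-checked numerically. The sketch below evaluates the two differences derived above, normalized by $n_D l$, over a grid of symbol error probabilities. The grid spacing, the packet lengths tried and the function names are illustrative assumptions, and a grid check is not a proof.

```python
def udplite_minus_hdrest(p_sc: float, n_h: int) -> float:
    """(min r_udplite - min r_hdrest) / (n_D * l), closed form above (A.29)."""
    return (1.0 - (1.0 - p_sc) ** n_h) * (1.0 - 2.0 * p_sc)


def udp_minus_hdrest(p_sc: float, packet_len: int) -> float:
    """(min r_udp - min r_hdrest) / (n_D * l), with packet_len = n_H + n_D."""
    return 1.0 - (1.0 - p_sc) ** packet_len - 2.0 * p_sc


# Symbol error probabilities 0.001, 0.002, ..., 0.489 (all below 0.49).
grid = [k / 1000.0 for k in range(1, 490)]

a29_holds = all(udplite_minus_hdrest(p, 60) > 0.0 for p in grid)
a30_holds = all(udp_minus_hdrest(p, n) > 0.0 for p in grid for n in (6, 60, 512))

print("A.29 holds on grid:", a29_holds)
print("A.30 holds on grid:", a30_holds)

# With only five symbols per packet the difference turns negative near 0.49,
# which is consistent with the packet-length requirement in (A.30).
print("five-symbol packet at p = 0.489:", udp_minus_hdrest(0.489, 5))
```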

Figure 24 plots the minimum expected FEC redundancies required by UDP, UDP-Lite
and header estimation for symbol error probabilities ranging between 0 and 0.1. It
can be clearly seen that header estimation requires significantly lower redundancy than
both UDP and UDP-Lite. Note that the difference in redundancy increases with an
increase in the probability of symbol error. For $p_{SC} = 0.0013$, the percentage of
bandwidth used for redundancy is approximately 0.25%, 7.6% and 47.9% for header
estimation, UDP-Lite and UDP, respectively. The FEC redundancy difference becomes
much wider for $p_{SC} = 0.01$, where header estimation, UDP-Lite, and UDP respectively
use 2.02%, 47.04% and 99.48% of bandwidth in redundant symbol transmission. For
$p_{SC} = 0.06$ and higher, the gap between UDP-Lite and UDP narrows with each using
78.15% and 100% of bandwidth for redundancy, while header estimation requires
4.84% redundancy - a remarkable goodput improvement of approximately 73% over
UDP-Lite and of approximately 95% over UDP.

Figure 24. Minimum expected FEC redundancies of UDP, UDP-Lite and Ideal Header Estimation over a $q$-ary symmetric channel; $m = 8$, $q = 256$, $L = 30$, $n_H = 60$, $n_D = 452$. [Plot omitted: redundant FEC symbols (%) versus probability of symbol error, $p_{SC}$, for UDP, UDP-Lite and Header Estimation.]

Thus, while header estimation always requires less FEC redundancy, the advantages
are dramatic for somewhat high error-rate channels, e.g., the 5.5 and 11 Mbps 802.11b
channels. Later in this chapter, we confirm these theoretical findings using a practical
header estimator that is tested using actual wireless error traces. In the next section, we
derive similar bounds for the Gilbert channel.

A.6.1.2 Redundancy Bounds on the Gilbert Channel

Consider the one-hop symbol-level Gilbert wireless channel of Figure 1. The Gilbert

channel (GC) [81] has been used to model many wireless channels [9]-[11], [13]-[15],
[18]-[20], [26]. In this section, we compare minimum expected FEC redundancies of

UDP-Lite and UDP with header estimation over a GC.

A.6.1.2.1 Bound on a UDP based Protocol Stack

Let $\varepsilon_{udp,GC}$ denote the probability of observing a packet erasure on a UDP protocol
stack operating over a Gilbert channel (GC). Then $\varepsilon_{udp,GC}$ is the probability of having
one or more symbol-errors in the received packet, and can be expressed as:

$$\begin{aligned}
\varepsilon_{udp,GC} = \Pr\{\text{corrupt pkt}\} &= 1 - \pi_g\, p_{gg}^{\,n_H + n_D} - \pi_b\, p_{bg}\, p_{gg}^{\,n_H + n_D - 1} \\
&= 1 - \pi_g\, p_{gg}^{\,n_H + n_D - 1} \\
&= 1 - (1 - \pi_b)\left(1 - \pi_b(1 - \mu)\right)^{n_H + n_D - 1},
\end{aligned} \tag{A.31}$$

where $\pi_g$ and $\pi_b$ respectively represent the steady-state probabilities of staying in the
good and bad states and $\mu$ is the Gilbert channel’s memory as defined in (A.4). Using the
derivations in Section A.6.1.1.1, we can express the average amount of redundancy
required by a FEC decoder operating on a UDP protocol stack as

$$r_{udp,GC} \ge n_D l\,\varepsilon_{udp,GC}. \tag{A.32}$$

A.6.1.2.2 Bound on a UDP-Lite based Protocol Stack

A UDP-Lite based protocol stack drops all packets that have header errors. Thus the
probability of packet erasures, $\varepsilon_{lite,GC}$, of UDP-Lite over a GC is:

$$\varepsilon_{lite,GC} = \Pr\{\text{corrupt hdr}\} = 1 - \pi_g\, p_{gg}^{\,n_H} - \pi_b\, p_{bg}\, p_{gg}^{\,n_H - 1} = 1 - \pi_g\, p_{gg}^{\,n_H - 1}. \tag{A.33}$$

Using derivations of Section A.6.1.1.2, the expected number of UDP-Lite erasures is

$$E\{Y \mid \text{UDPLite}\} = n_D l\,\varepsilon_{lite,GC} = n_D l \left(1 - \pi_g\, p_{gg}^{\,n_H - 1}\right).$$

In addition to erasures, a UDP-Lite protocol stack will also have errors in the application
layer payload. The probability of a symbol error over the Gilbert channel is

$$p_{GC} = \Pr\{\text{symbol err} \mid \text{UDPLite}\} = \pi_g\, p_{gb} + \pi_b\, p_{bb} = \pi_b. \tag{A.34}$$

Then the expected number of UDP-Lite symbol errors is

$$E\{X \mid \text{UDPLite}\} = n_D l \left(1 - \varepsilon_{lite,GC}\right) p_{GC},$$

and the lower bound on the total expected redundancy required to recover the errors and
losses over a UDP-Lite protocol stack over a GC is

$$r_{udplite,GC} \ge n_D l\,\varepsilon_{lite,GC} + 2E\{X \mid \text{UDPLite}\} \ge n_D l \left(\varepsilon_{lite,GC} + 2\left(1 - \varepsilon_{lite,GC}\right)\pi_b\right). \tag{A.35}$$

A.6.1.2.3 Bound on a Header Estimation based Protocol Stack

Using the reasoning of Section A.6.1.1.3, the total (expected) amount of redundancy
required by an ideal header estimation scheme over a GC is

$$r_{hdrest,GC} \ge 2 n_D l\,\pi_b. \tag{A.36}$$

Comparison of the above bound with the bounds in (A.32) and (A.35) reveals that

minimum expected redundancy required by header estimation is independent of channel

memory. The redundancy is simply a function of the probability of error. Thus the

performance of header estimation will remain unchanged with changes in channel

memory. On the other hand, the redundancy required by UDP and UDP-Lite is high for

low-memory channels and the redundancy decreases with an increase in channel

memory.
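The dependence on channel memory described above can be checked with a minimal sketch of the bounds (A.32), (A.35) and (A.36), each divided by $n_D l$. The function name is hypothetical. The relation $p_{gg} = 1 - \pi_b(1 - \mu)$ is read off the last line of (A.31), and the packet dimensions and $\pi_b = 0.01$ are those listed for Figure 25(b).

```python
def gilbert_redundancy_fractions(pi_b: float, mu: float, n_h: int, n_d: int) -> dict:
    """Lower bounds (A.32), (A.35), (A.36), each divided by n_D * l."""
    pi_g = 1.0 - pi_b
    p_gg = 1.0 - pi_b * (1.0 - mu)                    # from the last line of (A.31)
    eps_udp = 1.0 - pi_g * p_gg ** (n_h + n_d - 1)    # (A.31)
    eps_lite = 1.0 - pi_g * p_gg ** (n_h - 1)         # (A.33)
    return {
        "udp": eps_udp,                                       # (A.32)
        "udplite": eps_lite + 2.0 * (1.0 - eps_lite) * pi_b,  # (A.35)
        "hdrest": 2.0 * pi_b,                                 # (A.36)
    }


PI_B, N_H, N_D = 0.01, 60, 452
memories = [0.0, 0.2, 0.4, 0.6, 0.8]

for mu in memories:
    bounds = gilbert_redundancy_fractions(PI_B, mu, N_H, N_D)
    print(mu, {name: round(100.0 * frac, 2) for name, frac in bounds.items()})
```

At $\mu = 0$ the UDP erasure probability reduces to $1 - (1 - \pi_b)^{n_H + n_D}$, the symmetric-channel expression with $p_{SC} = \pi_b$, which is a useful consistency check between the two channel models.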

A.6.1.2.4 Comparison of the FEC Redundancy Bounds

First, let us compare minimum expected FEC redundancies of UDP-Lite and header
estimation:

$$\begin{aligned}
\min(r_{udplite,GC}) - \min(r_{hdrest,GC}) &= n_D l\,\varepsilon_{lite,GC} + 2 n_D l \left(1 - \varepsilon_{lite,GC}\right) p_{GC} - 2 n_D l\,p_{GC} \\
&= n_D l\,\varepsilon_{lite,GC} \left(1 - 2p_{GC}\right) > 0 \quad \text{for } p_{GC} < 0.5.
\end{aligned}$$

That is,

$$\min(r_{udplite,GC}) > \min(r_{hdrest,GC}) \quad \text{when } p_{GC} < 0.5. \tag{A.37}$$

This condition is similar to the one derived for the $q$-ary symmetric channel, implying
that header estimation should perform better than UDP-Lite as long as the average
probability of error is less than 0.5. For any reasonable Gilbert wireless channel, the
probability of symbol error should be considerably smaller than 0.5.

We now compare minimum expected redundancies of UDP and header estimation

over a GC:

$$\min(r_{udp,GC}) - \min(r_{hdrest,GC}) = n_D l\,\varepsilon_{udp,GC} - 2 n_D l\,p_{GC},$$

where $\varepsilon_{udp,GC}$ and $p_{GC}$ are given in (A.31) and (A.34), respectively. Plugging in the
values of $\varepsilon_{udp,GC}$ and $p_{GC}$, and simplifying with the steady-state relations
$\pi_g p_{gg} + \pi_b p_{bg} = \pi_g$ and $\pi_g p_{gb} + \pi_b p_{bb} = \pi_b$, gives

$$\begin{aligned}
\min(r_{udp,GC}) - \min(r_{hdrest,GC}) &= n_D l \left(1 - \pi_g\, p_{gg}^{\,n_H + n_D} - \pi_b\, p_{bg}\, p_{gg}^{\,n_H + n_D - 1} - 2\pi_g\, p_{gb} - 2\pi_b\, p_{bb}\right) \\
&= n_D l \left(1 - \pi_g\, p_{gg}^{\,n_H + n_D - 1} - 2\pi_b\right) \\
&= -n_D l \left(\pi_g\, p_{gg}^{\,n_H + n_D - 1} + 2\pi_b - 1\right).
\end{aligned}$$

Based on the above comparison, we obtain the following condition:

$$\min(r_{udp,GC}) > \min(r_{hdrest,GC}) \quad \text{when } \pi_g \to 1. \tag{A.38}$$

The above inequality is generally true because on any practical wireless channel
$n_H + n_D$ will always be greater than one symbol. Also, $p_{gg}$, the overall probability of
staying in the error-free state, is generally very high. Thus FEC comparison of UDP

versus header estimation for the Gilbert channel essentially converges to the same

conclusion as the symmetric channel: Unless the channel has an unreasonably high error-

rate, header estimation will always utilize wireless bandwidth more efficiently than UDP.

Figure 25 shows the percentage of redundant symbols in each FEC block for UDP,

UDP-Lite and header estimation over a Gilbert Channel. The redundancy is plotted

against channel memory while fixing the probability of error. The leftmost points in

Figure 25 represent the memory-less case. It can be seen that header estimation always

requires less FEC redundancy to recover corrupted packets than UDP and UDP-Lite.

This difference in the amount of required redundancy gets more significant with an

increase in the probability of error. In general, due to the large number and bursts of good

symbols in a high memory channel, the amount of redundancy required by UDP and

UDP-Lite decreases with an increase in channel memory. In all cases, the redundancy

required by header estimation is extremely low and independent of the channel memory.

Thus, while the design of FEC schemes for UDP and UDP-Lite needs to take channel

memory into account, an accurate header estimator can be deployed on a wireless

network without any knowledge of the underlying channel’s memory.

Figure 25. Minimum expected FEC redundancies of UDP, UDP-Lite and Ideal Header Estimation over a Gilbert channel; $m = 8$, $L = 30$, $n_H = 60$, $n_D = 452$: (a) $p_{GC} = \pi_b = 0.001$; (b) $p_{GC} = \pi_b = 0.01$. [Plots omitted: redundant FEC symbols (%) versus channel memory, $\mu$.]

A.6.1.3 Discussion

At this point, we have theoretically verified that a protocol employing header

estimation should require less FEC redundancy at a wireless receiver than UDP and

UDP-Lite. This naturally brings us to the practical question of how to realize an accurate

header estimation technique for wireless environments. The following section addresses

this question by designing a header estimation scheme which utilizes the channel models

proposed in preceding chapters.

A.6.2 Maximum-Likelihood Header Estimation Framework

The maximum-likelihood estimation scheme proposed in this section only estimates

the critical header fields (CHF) that can uniquely identify a UDP multimedia session at a

receiver and are not liable to change during the course of the multimedia session. In our

experiments, we treat the following as CHF: (i) destination MAC address, (ii) source IP

address, (iii) destination IP address, (iv) source port, and (v) destination port.

Nevertheless, all mathematical treatment is provided for a general case of N critical

fields.

Under the proposed methodology, a list of active CHF (i.e., CHF of sessions that are

currently being received) is provided to a header estimation module by the multimedia

application(s). On receiving the first error-free packet of a new session, the multimedia

application adds the new session’s CHF information to the list of active multimedia

sessions. Whenever a corrupted packet is received, a likelihood score of its critical fields

is computed with respect to each entry of the CHF list. The CHF rendering the highest

likelihood are chosen as the estimated CHF of the received (corrupted) packet.

The main objective of header estimation is to pass the maximum number of (error-free

and corrupted) packets to the application layer using only parameters of a MAC layer bit-

error channel model. We defer discussion on how an application can make use of the

corrupted packets to subsequent sections.

A.6.2.1 Functionality at and below a Receiver’s MAC layer

Figure 26 outlines the interactions between the proposed header estimation module

and different layers of a wireless receiver’s protocol stack. The packets after wireless

physical layer processing are passed to the MAC layer which verifies the packet’s

checksum to determine if the received packet has errors. Instead of dropping a corrupted

packet, the packet and its checksum information (i.e., packet passed/failed the checksum)

Figure 26. Interactions between the UDP-based header estimation module and different layers of a wireless receiver’s protocol stack; modified protocol stack layers are shown in different colors and dotted lines represent communications that are not related to packet reception. [Diagram omitted. Blocks: Wireless Channel; Physical Layer; MAC Layer without UDP Pkt Drops; Header Estimation Module; Network and Transport Layers; Network and Transport Layers with Disabled Checksums; Application Layer. Labeled flows: received pkt after physical layer processing; error-free pkt; corrupt UDP pkt which has either dst MAC or dst IP address of local receiver; corrupt UDP pkt with estimated CHF; pkt after network and transport layer processing; updated channel model parameters; list of active CHF.]

is passed to a module that checks the transport type, the destination MAC address, and

the destination IP address of the received packet. Header estimation is invoked only for

UDP packets, while TCP and network layer traffic are handled by the conventional

protocol stack. Furthermore, the MAC layer does not attempt retransmission-based

recovery of corrupt UDP packets, i.e., ACKs are sent even for corrupt UDP packets.

Instead of MAC retransmissions, header estimation with application layer FEC is used to

recover from errors in the packet. Such retransmission-less recovery is well-suited for

delay-sensitive real-time communications.

Header estimation is invoked when all of the following conditions are satisfied: (i) a

corrupt UDP packet is received, (ii) either the destination MAC or the destination IP

address matches the local receiver’s addresses, and (iii) there are one or more active

multimedia sessions on the receiver. Three scenarios exist when a packet is received:

(i) Packet is error-free: No need to perform header estimation.

(ii) Packet is corrupt and the packet is intended for the local receiver: Header estimation

is invoked and an ACK is sent to the last hop network entity to avoid MAC layer

retransmissions.

(iii) Packet is corrupt and the packet is not intended for the local receiver: This case

represents a false alarm when, due to channel errors, either destination MAC or

destination IP of a packet not intended for the local receiver gets mapped to the MAC or

IP address of a receiver. Due to the receiver-based nature of the present scheme, false

alarms cannot be detected at a receiver’s MAC layer. Thus header estimation is invoked

even for false alarm packets, and a MAC layer ACK is sent to the last hop network

entity.
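The three invocation conditions and the three reception scenarios above can be summarized in a short sketch. The class, field and function names are hypothetical and do not correspond to any real protocol stack API. Note that for a corrupt packet the transport type and addresses are themselves read from possibly corrupted headers, which is exactly why the false alarm scenario cannot be ruled out at this stage.

```python
from dataclasses import dataclass


@dataclass
class ReceivedPacket:
    """Hypothetical view of a packet after the MAC layer checksum."""
    passed_checksum: bool
    is_udp: bool
    dst_mac: str
    dst_ip: str


def should_invoke_header_estimation(pkt: ReceivedPacket, local_mac: str,
                                    local_ip: str, active_sessions: int) -> bool:
    """Return True when conditions (i)-(iii) are all satisfied."""
    if pkt.passed_checksum or not pkt.is_udp:
        return False  # error-free packets and non-UDP traffic use the normal stack
    addressed_to_us = pkt.dst_mac == local_mac or pkt.dst_ip == local_ip
    return addressed_to_us and active_sessions > 0


local_mac, local_ip = "00:11:22:33:44:55", "10.0.0.9"

# Corrupt UDP packet whose destination MAC was damaged but whose IP still matches.
corrupt = ReceivedPacket(False, True, "00:11:22:33:44:5d", "10.0.0.9")
print(should_invoke_header_estimation(corrupt, local_mac, local_ip, active_sessions=1))

# Error-free packet: header estimation is not needed.
clean = ReceivedPacket(True, True, local_mac, local_ip)
print(should_invoke_header_estimation(clean, local_mac, local_ip, active_sessions=1))
```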

A.6.2.2 The Header Estimation Module

The header estimation module employs a likelihood function to find the most likely

transmitted CHF given: (i) the received CHF, (ii) a list of active CHF, and (iii)

parameters of the MAC layer error channel model. The list of active CHF is provided by

the receiver’s application layer as shown in Figure 26. The transmitted/active CHF that

renders the maximum value of the likelihood function is chosen as the estimated CHF.

The corrupt packet and the estimated CHF are passed to higher layers. In essence, the

present header estimation problem is the estimation-theoretic problem of maximum-

likelihood (ML) estimation of known parameters in noise [79].

A.6.2.3 Processing at a Receiver’s Network, Transport and Application Layers

The corrupted packets along with the estimated CHF are passed by the header

estimation module to the receiver’s network layer. The network layer performs its regular

operation with two modifications: (a) instead of the (possibly corrupted) IP addresses in

the network layer header, the estimated IP addresses are treated as the true IP addresses;

(b) network layer checksum on IP headers is disabled. At the UDP layer, source and

destination ports are taken from the estimated CHF and the corrupted packets are passed

to the (estimated) multimedia application.

A.6.3 Likelihood Functions for Header Estimation

In this section, we derive header estimation likelihood functions for two previously
proposed classes of MAC layer channel models, namely the full-state Markov (FSM)
model and the multifractal wavelet model (MWM). Let $\Lambda_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$ denote
an ordered set of $N$ critical header fields for an arbitrary multimedia session $i$. As
mentioned before, in this chapter we have $N = 5$, where $x_{i1}$, $x_{i2}$, $x_{i3}$, $x_{i4}$ and $x_{i5}$
correspond to the destination MAC, source IP, destination IP, source port, and destination
port of multimedia session $i$, respectively. A receiver receives $M \ge 1$ simultaneous
multimedia streams. Let $\Omega = \{\Lambda_1, \Lambda_2, \ldots, \Lambda_M\}$ denote an unordered set of CHF, each
corresponding to a currently active multimedia session on a given receiver. Note that
each $\Lambda_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\} \in \Omega$ is in turn a set of critical fields corresponding to a
given session, where the first subscript of $x$ is the session index and the second subscript
is the CHF index. Let $\tilde{\Lambda}_r$ denote the set of CHF of a received packet, i.e.,
$\tilde{\Lambda}_r = \{\tilde{x}_{r1}, \tilde{x}_{r2}, \ldots, \tilde{x}_{rN}\}$ is a possibly corrupted version of a $\Lambda_i \in \Omega$. Let $\hat{\Lambda}_r$ denote
the estimated CHF.

Let $\mathrm{X}$ represent a stochastic MAC layer channel model characterizing the bit-error channel over which a receiver is receiving its packets. Then, for a critical header field $x_{ij}$ (i.e., critical field $j$ for a multimedia session $i$), our objective is to derive the likelihood function $\Pr\{\tilde{x}_{rj} \mid x_{ij}, \mathrm{X}\}$ in terms of the parameters of $\mathrm{X}$. In other words, given parameters of a channel model $\mathrm{X}$, we want to find the likelihood that a transmitted critical header field $x_{ij}$ (after possible channel corruptions) was received as $\tilde{x}_{rj}$. We assume that the likelihood functions of all CHF are independent. Thus the $\Pr\{\tilde{x}_{rj} \mid x_{ij}, \mathrm{X}\}$'s for each critical field can be ascertained independently and then the overall likelihood considering all critical fields is:

$$\Pr\{\tilde{\Lambda}_r \mid \Lambda_i, \mathrm{X}\} = \prod_{j=1}^{N} \Pr\{\tilde{x}_{rj} \mid x_{ij}, \mathrm{X}\}, \qquad \text{(A.39)}$$

where $1 \leq i \leq M$ is the session index and $j$ is the CHF index. Once $\Pr\{\tilde{\Lambda}_r \mid \Lambda_i, \mathrm{X}\}$ has been computed for all $1 \leq i \leq M$, the CHF estimate $\hat{\Lambda}_r$ is simply the $\Lambda_i$ that renders the maximum $\Pr\{\tilde{\Lambda}_r \mid \Lambda_i, \mathrm{X}\}$.
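The selection rule in (A.39) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-field likelihood used here (`field_likelihood`) is a memoryless stand-in with a hypothetical bit-error rate `eps`, not the FSM or MWM likelihood derived in the following sections, and the field values and widths are made up for the example.

```python
from math import prod


def field_likelihood(received: int, candidate: int, width: int, eps: float) -> float:
    """Stand-in per-field likelihood (assumption): memoryless channel with bit-error rate eps."""
    errors = bin(received ^ candidate).count("1")  # Hamming weight of the XOR pattern
    return (eps ** errors) * ((1.0 - eps) ** (width - errors))


def estimate_chf(received, active_sessions, widths, eps=0.01):
    """Return (session index, likelihood) of the active CHF set with the largest product likelihood."""
    best_index, best_score = -1, -1.0
    for i, session in enumerate(active_sessions):
        # Product over the critical fields, as in (A.39).
        score = prod(
            field_likelihood(r, x, w, eps)
            for r, x, w in zip(received, session, widths)
        )
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


# Two hypothetical sessions with two 8-bit fields each; the received packet
# matches the second session except for one flipped bit in the first field.
sessions = [(0x0F, 0xA0), (0xF0, 0x0A)]
index, score = estimate_chf((0xF1, 0x0A), sessions, widths=(8, 8))
print(index, score)
```

Swapping `field_likelihood` for a channel-model-specific function leaves the argmax loop unchanged, which is why the remaining derivations concentrate on the per-field likelihood.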

The challenge of this ML-based header estimation lies in the derivation of a likelihood function $\Pr\{\tilde{x}_{rj} \mid x_{ij}, \mathrm{X}\}$ of a critical field, given parameters of a wireless channel model. In the following sections, we derive likelihood functions for FSM and MWM channel models.

A.6.3.1 Header Estimation Likelihood Function for FSM Chains

In this section, we derive the CHF likelihood function for a $k$-th order FSM chain $\mathrm{X}_n$, where $n$ is the bit time index. For clarity, in this chapter we deviate slightly from the previously used FSM chain notation and represent the transition probability between FSM states $i$ and $j$ as $\Pr\{i \to j\}$. We focus on one arbitrary critical field $x_{ij} \in \Lambda_i$ by fixing the CHF index $j$. Henceforth $x_i$ and $\tilde{x}_r$ respectively represent the critical field $j$ from $\Lambda_i$ and the received critical field $j$. Let us define a new variable:

$$z_i = \tilde{x}_r \oplus x_i, \qquad \text{(A.40)}$$

where $\oplus$ represents a bitwise exclusive-OR operation. The set bits of $z_i$ mark the bit locations that differ between $\tilde{x}_r$ and $x_i$. Assuming that the differing bits are in fact the errors introduced by the FSM channel, $\Pr\{\tilde{x}_r \mid x_i, \mathrm{X}_n\}$ is the likelihood of observing error pattern $z_i$ on the channel.

Recall from previous discussions [see Figure 15] that an FSM chain in state $v_i$ can only transit to two FSM states, $2v_i + 0$ or $2v_i + 1$; all FSM states are taken mod $2^k$. Thus, when the bit appended to $2v_i$ is $z_i[k+1]$, we get the transition probability $\Pr\{v_i \to 2v_i + z_i[k+1]\}$. From state $2v_i + z_i[k+1]$, the process will transit to $2(2v_i + z_i[k+1]) + z_i[k+2] = 4v_i + 2z_i[k+1] + z_i[k+2]$. Using similar logic, the process will next transit to $2(4v_i + 2z_i[k+1] + z_i[k+2]) + z_i[k+3] = 8v_i + 4z_i[k+1] + 2z_i[k+2] + z_i[k+3]$. A recursive relationship in the transition probabilities can be identified at this point. Generalizing the recursive relationship yields the header estimation likelihood function for a $k$-th order FSM chain as follows:

$$\Pr\{\tilde{x}_r \mid x_i, \mathrm{X}_n = \mathrm{FSM}\} = \pi_{v_i} \, \Pr\left\{ v_i \to (2v_i + z_i[k+1]) \bmod 2^k \right\} \prod_{a=2}^{W-k} \Pr\left\{ \left( 2^{a-1} v_i + \sum_{b=0}^{a-2} 2^{a-2-b} z_i[k+b+1] \right) \bmod 2^k \to \left( 2^{a} v_i + \sum_{b=0}^{a-1} 2^{a-1-b} z_i[k+b+1] \right) \bmod 2^k \right\}, \qquad \text{(A.41)}$$

where $n$ is the bit time index, $v_i$ is the FSM state represented by the first $k$ bits of $z_i$, $W$ represents the number of bits in the critical field, $\pi_x$ represents the steady-state probability of being in FSM state $x$, $\Pr\{x \to y\}$ is the transition probability of going from FSM state $x$ to state $y$, and $z_i[x]$ represents the value of $z_i$ at the $x$-th bit location.

The FSM likelihood function answers the following question: What is the probability that channel errors have changed $x_i$ to $\tilde{x}_r$? Since $z_i = \tilde{x}_r \oplus x_i$ denotes the bit pattern that would be observed if the channel changed $x_i$ to $\tilde{x}_r$, we have to find the probability that the channel $\mathrm{X}_n$ produced the bit-error pattern $z_i$. Clearly, the FSM channel’s initial state must be $v_i$ because $v_i$ denotes the FSM state represented by the first memory-window of $z_i$, leading to the $\pi_{v_i}$ term. This initial state must be followed by a unique sequence of state transitions that result in the bit-error pattern $z_i$. To quantify the probability that an FSM channel will follow this “unique state sequence”, recall that in one transition the FSM process can only transit to two possible states. Also, due to the Markov property, the probability of transiting to one of the two possible states is only dependent on the present state. The final likelihood score of $x_i$ is hence characterized by a multiplication of the transition probabilities of this unique state sequence, as represented by the multiplicative $\Pr\{x \to y\}$ terms in the likelihood function.
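The state walk described above can be illustrated with a short Python sketch. The data layout is an assumption made for the example: `pi[s]` holds the steady-state probability of state `s`, and `trans[s][b]` holds the probability that the next error bit is `b` given state `s`, so that the next state is `(2*s + b) mod 2**k`. The numeric values are hypothetical and describe a first-order chain only.

```python
def fsm_likelihood(z_bits, k, pi, trans):
    """Likelihood of the error pattern z_bits (list of 0/1, first bit first) under a k-th order FSM chain."""
    if len(z_bits) < k:
        raise ValueError("the error pattern must be at least k bits long")
    mask = (1 << k) - 1
    # The first k bits of the pattern fix the initial state v_i.
    state = 0
    for bit in z_bits[:k]:
        state = (state << 1) | bit
    likelihood = pi[state]
    # Every remaining bit selects one of the two transitions out of the current state.
    for bit in z_bits[k:]:
        likelihood *= trans[state][bit]
        state = ((state << 1) | bit) & mask
    return likelihood


# Hypothetical first-order chain (k = 1): state 0 = last bit good, state 1 = last bit bad.
pi = [0.9, 0.1]
trans = [[0.95, 0.05], [0.45, 0.55]]
print(fsm_likelihood([0, 0, 1, 0], 1, pi, trans))
```

Because each pattern corresponds to exactly one state sequence, the likelihoods of all $2^W$ patterns of a given width sum to one, which is a convenient sanity check on trained parameters.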

A.6.3.2 Header Estimation Likelihood Function of MWM

Recall from Section A.3.8 that the multifractal wavelet model (MWM) uses expectation-maximization to model two random variables: (i) the scaling coefficient at the coarsest scale $U_{j_0,k_0}$, where $j_0$ and $k_0$ represent the coarsest scale and time, respectively; (ii) $A_{j,k}$ random variables defined over a $[-1, 1]$ interval, with $j$ and $k$ representing the scale and time, respectively. In previous chapters, we showed that the 11 Mbps bit-errors have long-range dependence which can be captured using the MWM. Therefore, in this section we derive the likelihood function for an MWM. Previously we used the bit-error sequences of zeros and ones to train the MWM. Derivation of a likelihood function for an MWM trained using such a strategy is somewhat difficult. Consequently, in this chapter we train an MWM using the number of bit-errors in a packet as the training sequence.

Let $\mathrm{X}_n$ denote the MWM process, where $n$ represents discrete packet time instances. It was shown in [61] that due to the use of the Haar wavelet transform, the MWM-predicted number of errors $e[n]$ in packet $n$ can be expressed as $e[n] = 2^{-m/2} U_{m,n}$ for $n = 0, 1, \ldots, 2^m - 1$. If the packets have a fixed size $C$, then the probability of a bit-error in the packet received at packet time index $n$ is $p[n] = e[n]/C = 2^{-m/2} U_{m,n}/C$. Now note that each received bit is basically a value taken from a binary time series of length $l$, i.e., $\{x[i]\}_{i=0}^{l}$, $x[i] \in \{0, 1\}$, where $i$ represents the discrete bit time index. Based on equation (A.40), $\sum_{b=1}^{W} z_i[b]$ yields the total number of bits that are different between $\tilde{x}_r$ and $x_i$, i.e., the Hamming distance between $\tilde{x}_r$ and $x_i$. If the set bits of $z_i$ are in fact the errors introduced by an MWM wireless channel, then the probability of having $\sum_{b=1}^{W} z_i[b]$ errors is $(p[n])^{\sum_{b=1}^{W} z_i[b]}$, and the probability of having $W - \sum_{b=1}^{W} z_i[b]$ correct bits is $(1 - p[n])^{W - \sum_{b=1}^{W} z_i[b]}$. The likelihood of the bit pattern $z_i$ is then the product of these two probabilities. Thus the MWM likelihood function is as follows:

$$\Pr\{\tilde{x}_r \mid x_i, \mathrm{X}_n = \mathrm{MWM}\} = \left( \frac{2^{-m/2} U_{m,n}}{C} \right)^{\sum_{b=1}^{W} z_i[b]} \left( 1 - \frac{2^{-m/2} U_{m,n}}{C} \right)^{W - \sum_{b=1}^{W} z_i[b]}, \qquad \text{(A.42)}$$

where $n = 0, 1, \ldots, 2^m - 1$ is the packet time index, $m$ is the number of scales used to train the MWM, $C$ is the number of bits in a packet, $W$ is the number of bits in the critical field, $z_i$ is given in (A.40), and $U_{m,n}$ is the scaling coefficient at scale $m$ and time $n$.

Similar to (A.41), the MWM likelihood function renders the probability that the bit-error pattern $z_i = \tilde{x}_r \oplus x_i$ is observed on an MWM channel. Since the probability of a bit-error in packet $n$ is given by $2^{-m/2} U_{m,n}/C$, the probability of observing $\sum_{b=1}^{W} z_i[b]$ bit-errors in packet $n$ is $(p[n])^{\sum_{b=1}^{W} z_i[b]} = \left( 2^{-m/2} U_{m,n}/C \right)^{\sum_{b=1}^{W} z_i[b]}$. Treating error-free and corrupted bits as the two outcomes of a Bernoulli random variable yields the MWM likelihood expression.
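As a hedged illustration of (A.42), the following Python sketch evaluates the Bernoulli-style likelihood for one critical field. The function name, argument names and numeric values are assumptions for the example; in particular, `u_mn` stands for a scaling coefficient that would come from a trained MWM, which is not modeled here.

```python
def mwm_likelihood(z_bits, u_mn, m, packet_bits):
    """Likelihood of the error pattern z_bits given scaling coefficient u_mn at scale m."""
    # Per-bit error probability p[n] = 2^(-m/2) * U_{m,n} / C.
    p = (2.0 ** (-m / 2.0)) * u_mn / packet_bits
    if not 0.0 <= p <= 1.0:
        raise ValueError("the implied bit-error probability must lie in [0, 1]")
    errors = sum(z_bits)            # Hamming distance between received and candidate field
    correct = len(z_bits) - errors  # W minus the Hamming distance
    return (p ** errors) * ((1.0 - p) ** correct)


# Hypothetical numbers: m = 2 and U = 8 predict 4 bit-errors in a 400-bit packet (p = 0.01).
print(mwm_likelihood([0, 1, 0, 0], u_mn=8.0, m=2, packet_bits=400))
```

Unlike the FSM case, this likelihood depends only on how many bits differ, not on where they occur within the field.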

Once the $\Pr\{\tilde{x}_r \mid x_i, \mathrm{X}_n\}$'s for all currently active sessions, $1 \leq i \leq M$, are computed using the FSM or MWM likelihood functions, the CHF set $\Lambda_i$ of the session $i$ that renders the maximum $\Pr\{\tilde{\Lambda}_r \mid \Lambda_i, \mathrm{X}\}$ is chosen as the estimated CHF, $\hat{\Lambda}_r$. We also introduce a provision that a packet is dropped if the maximum likelihood is less than 0.25 because in such a case the estimation confidence is very low.


A.6.3.3 Extending the FSM Likelihood Function to the CCM

The complexity of an MWM to generate a length l sequence is linear. However, the

complexity of FSM chains grows exponentially with respect to memory-length. Due to

their exponential complexity, FSM chains are unreasonably complex to be employed in

the header estimation framework. Therefore, in this section we extend the FSM

likelihood function to the CCM so that the approximating CCM model can be used for

header estimation instead of the FSM model.

Let $S_x$ denote the aggregate CCM state that contains FSM state $x$. Since the CCM aggregates FSM states, using (A.41) the likelihood function for the CCM can be rewritten as:

$$\Pr\{\tilde{x}_r \mid x_i, \mathrm{X}_n = \mathrm{CCM}\} = \pi_{S_{v_i}} \, \Pr\left\{ S_{v_i} \to S_{2v_i + z_i[k+1]} \right\} \prod_{a=2}^{W-m} \Pr\left\{ S_{2^{a-1} v_i + \sum_{b=0}^{a-2} 2^{a-2-b} z_i[k+b+1]} \to S_{2^{a} v_i + \sum_{b=0}^{a-1} 2^{a-1-b} z_i[k+b+1]} \right\},$$

where the subscripts of all aggregate states $S_x$ are modulo $2^m$ and all other parameters are defined in Section A.6.3.1. The low complexity of the CCM clearly makes it a natural alternative to FSM chains in the present header estimation methodology. In all subsequent performance evaluations of the header estimation methodology, we use CCMs instead of FSM chains and show that the likelihood function rendered by the CCM is highly accurate.


A.6.4 Performance Evaluation of the Header Estimation Framework

A.6.4.1 Experimental Setup

We use the wireless traces described in Section A.4.1 to simulate the wireless

channel. For video evaluations, we report throughput, FEC and PSNR results for five

multimedia receivers. Each receiver receives multiple video streams with a maximum of

five video streams. At each physical layer data rate, we repeat video experiments using

three distinct wireless trace-sets that were collected at different times of day. Video

experiments for each trace-set are repeated 25 times starting at different randomly

selected locations inside the error traces. Thus the throughput and FEC results for 2, 5.5 and 11 Mbps are each averaged over $3 \times 5 \times 5 \times 25 = 1{,}875$ received video streams. Due to the high complexity of video decoding, for each trace-set the PSNR results are reported for one (randomly selected) video experiment, that is, PSNR results for 2, 5.5 and 11 Mbps are each averaged over $3 \times 5 \times 5 = 75$ received video streams.

For each packet transmission, a 512 byte packet (452 bytes of video payload and 60

bytes of headers) was corrupted using the bit-error traces. The models used for likelihood

computation on all receivers were trained using error traces which were not used in the

video experiments. In accordance with the results of Sections A.4.3.1 and A.4.3.2, FSM

chains of order-9 and order-10 were employed for the 5.5 and 2 Mbps bit-error processes,

and an MWM trained using the number of bit-errors in a packet was employed for the 11

Mbps process. Each FSM chain was folded to a 5-state CCM.

Video sequences were compressed using the H.264 video coding standard [83], [85].

The sequences had a QCIF frame size and were encoded at a frame-rate of 30 fps. The


streams were encoded at different source coding bitrates ranging from 100 kbps to 1

Mbps. A slice mode with fixed number of 452 bytes per slice was used for encoding [83].

Intra frame period was set to 12, i.e., each group of pictures (GOP) had 12 frames.

Varying numbers of video streams were assigned to the wireless receivers. Transmission

of packets from each stream was simulated in a round robin fashion according to source

bitrates. In order to achieve successful video decoding, in the simulations we introduced a

provision that the first frame of the video sequence (i.e., the very first I-frame of the first

GOP) is always received correctly.

A.6.4.2 Throughput Performance

The term throughput here refers to the ratio of the total number of packets relayed to the receiver’s application to the total number of packets sent by the sender’s application layer. That is, throughput comprises both error-free and corrupted packets. The percentage packet drop rate is $(1 - \text{throughput}) \times 100$. Figure 27 outlines the packet drops incurred by UDP Normal, UDP-Lite and UDP with header estimation at 2, 5.5, and 11 Mbps. The results are averaged over all receivers and multimedia streams and hence the packet drops are referred to as average packet drops. The leftmost points in Figure 27 (a), (b), and (c) depict the simplest case, in which each receiver receives only one multimedia stream. The number of video streams per receiver is then incremented. More than one multimedia stream per receiver is an important scenario for video conferencing applications.


A.6.4.3 Comparison of Packet Drops

It can be clearly seen in Figure 27 (a), (b) and (c) that header estimation always incurs fewer packet drops than normal UDP and UDP-Lite. The header estimation packet drops include: (i) packets that were dropped because both the destination IP and the destination MAC address were corrupted, and (ii) packets whose critical fields were incorrectly estimated (resulting in false alarms). At 2 Mbps, header estimation packet drops are approximately 0.2%, as opposed to approximately 0.4% and 1% in case of UDP-Lite and normal UDP, respectively. Since the 2 Mbps channel has receivers with very low packet error rates, the margin of improvement is small. At 5.5 Mbps, UDP with header estimation provides approximately 4% and 2% throughput improvements over normal UDP and UDP-Lite, respectively. Due to the very high data rate at 11 Mbps, the header estimation packet drops increase to about 3%, but this packet drop rate is still substantially lower than that of normal UDP ($\approx 15\%$) and UDP-Lite ($\approx 30\%$).

[Figure 27: plots of average packet drops (%) versus video streams per receiver (1 to 5) for UDP Normal, UDP Lite and UDP Hdr Est; panels (a) 2 Mbps, (b) 5.5 Mbps, (c) 11 Mbps.]

Figure 27. Average packet drops for UDP Normal, UDP-Lite and UDP with Header Estimation at different data rates and for varying number of video streams per receiver; each point is averaged over $(3 \times \text{\# of video streams} \times 5 \times 25)$ received video streams.

A.6.4.4 False Alarm Rate

A false alarm is a packet that is not intended for a multimedia session, but is relayed

to that session. There are three sources of false alarms: (i) due to channel errors, either

destination MAC or destination IP address of a packet (not intended for the local

receiver) gets mapped to the MAC or IP address of the receiver; (ii) a corrupted packet is

inaccurately estimated; (iii) a corrupt non-multimedia UDP packet is received when one

or more multimedia sessions are active.

For the five streams per receiver case, cumulative false alarm rates are 0.07%, 0.52%, and 1.3% at 2, 5.5 and 11 Mbps, respectively. While these false alarm rates are quite low, false alarms must be detected because they can desynchronize the video and/or FEC decoders. To

detect false alarms, we protected the 2 byte H.264 slice sequence numbers (in the RTP

header, with one slice per packet) with 4 bytes of redundancy to ensure that these

sequence numbers can always be recovered at the receiver. A receiver dropped all

packets whose slice numbers were much larger or smaller than the next/expected slice

number. For applications which do not have a slice/packet sequence number, a small

incremental packet sequence number with parity bytes can be easily inserted into each

packet by the sender’s application layer. This sequence number based scheme also

provides erasure locations (i.e., dropped packets) to the FEC decoder.
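The sequence-number check described above could look roughly like the following Python sketch. The window size is a hypothetical parameter (the text only says "much larger or smaller"), and the wraparound handling for 2-byte sequence numbers is an assumption added for the example.

```python
def accept_slice(seq, expected, window=64, modulus=1 << 16):
    """Return True if seq lies within +/- window of expected, allowing for 16-bit wraparound.

    window is a hypothetical tolerance; packets outside it are treated as false alarms.
    """
    forward = (seq - expected) % modulus   # distance if seq is ahead of expected
    backward = (expected - seq) % modulus  # distance if seq is behind expected
    return min(forward, backward) <= window


print(accept_slice(10, 8))      # close to the expected slice number
print(accept_slice(5000, 8))    # far from the expected slice number
print(accept_slice(2, 65530))   # close once wraparound is taken into account
```

A packet rejected by this check would be reported to the FEC decoder as an erasure rather than passed on as data.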


A.6.4.5 FEC Performance

We now evaluate the amount of FEC redundancy required by the application to

recover from errors and packet drops in the multimedia content. Since the corrupted

packets contain many error-free bytes, this error-free data should facilitate application

layer FEC decoding. As mentioned earlier, for an MDS FEC code if a codeword has $2t$ redundant symbols then a maximum of $t$ transmission errors in that block can be corrected [80]. For the same amount of redundancy, $2t$ erasures can be recovered. In the UDP-Lite and UDP with header estimation scenarios, for an FEC codeword with $e_1$ erasures (i.e., packet drops) and $e_2$ errors, if $e_1 \leq 2t$ then the FEC decoding algorithm can recover the $e_1$ erasures. After erasure decoding, $e_2$ errors can be corrected if $e_2 \leq \lfloor (2t - e_1)/2 \rfloor$.

We simulate MDS forward error correction for all three (UDP Normal, UDP-Lite,

UDP with header estimation) protocol variants. A codeword length of $N = 30$ bytes is

used for all experiments. Each codeword is composed of one byte from a different packet,

where each packet consists of 452 bytes of data payload. Thus each packet contributes to

452 separate RS codewords, and each codeword spans over 30 packets. The FEC

construction is shown pictorially in Figure 28. For all protocol stack variants, we treat

packet drops as erasures in the received codewords. Note in Figure 28 that a packet drop

results in an erasure in 452 codewords.
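The interleaved construction can be sketched as follows, assuming a block is represented as a list of 30 payloads in which a dropped packet is marked by `None` (this representation and the function name are assumptions for illustration, and no actual RS encoding or decoding is performed).

```python
def build_codewords(packets, payload_len=452):
    """Form one codeword per payload byte position, taking one byte from each packet.

    packets: list of bytes objects (payloads) or None for a dropped packet.
    Returns payload_len codewords; each is a list with one symbol (or None) per packet.
    """
    return [
        [None if payload is None else payload[j] for payload in packets]
        for j in range(payload_len)
    ]


# Hypothetical block of 30 packets; packet i carries 452 copies of the byte value i.
block = [bytes([i]) * 452 for i in range(30)]
block[3] = None  # simulate one packet drop
codewords = build_codewords(block)
print(len(codewords), len(codewords[0]))
```

In this layout a single dropped packet shows up as exactly one erased symbol in every codeword, which is what lets the FEC decoder treat packet drops as erasures with known locations.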


Since normal UDP does not have corrupted packets, all parity bytes are used for

erasure decoding. Unlike normal UDP, FEC codewords for UDP with header estimation

have errors due to corrupted packets and erasures due to incorrect estimations and/or

false alarms. Similarly, FEC codewords for UDP-Lite have errors due to corrupted

packets and erasures due to packets with corrupted headers. For performance evaluation, we define a simple measure called decodable probability:

$$p_d = \frac{\text{decodable codewords received}}{\text{codewords transmitted}},$$

where a codeword with $e_1$ erasures and $e_2$ errors is decodable only if $2t \geq e_1 + 2e_2$. Clearly, $0 \leq p_d \leq 1$ and $p_d = 1$ implies that all received codewords were successfully decoded.
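A minimal Python sketch of this measure is given below. It assumes that the erasure and error counts of each codeword are already known (for example from a simulation), and the function names are chosen for illustration only.

```python
def is_decodable(erasures, errors, parity):
    """MDS decodability condition: parity (= 2t) >= erasures + 2 * errors."""
    return erasures + 2 * errors <= parity


def decodable_probability(codewords, parity):
    """Fraction of codewords, given as (erasures, errors) pairs, that satisfy the condition."""
    codewords = list(codewords)
    decodable = sum(1 for e1, e2 in codewords if is_decodable(e1, e2, parity))
    return decodable / len(codewords)


# Hypothetical outcome for four codewords with two parity bytes (2t = 2).
outcomes = [(2, 0), (0, 1), (1, 1), (0, 0)]
print(decodable_probability(outcomes, parity=2))
```

The example makes the asymmetry visible: two parity bytes cover two erasures or one error, but not one of each.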

[Figure 28: each of the 30 packets (packet header followed by a 452-byte payload) contributes one byte to each of RS codewords 1 through 452; a packet drop introduces an erasure in all 452 RS codewords.]

Figure 28. Codeword construction for video FEC simulations.


Figure 29 outlines the decodable probability as a function of the number of message bytes in an RS codeword for the five streams per receiver experiment. At each data rate, the results are averaged over all the experiments. From Figure 29 (a), it is clear that at 2 Mbps normal UDP and UDP-Lite require 6 redundant bytes per 30-byte RS codeword for almost 100% recovery; that is, approximately 20% bandwidth is wasted in redundancy. UDP with header estimation achieves almost error-free recovery even if two redundant bytes are sent per 28 message bytes, i.e., approximately 7% bandwidth is used for redundant symbols. From Figure 29 (b), it can be observed that, due to the increased error-rate at 5.5 Mbps, the performance gap between UDP with header estimation and the other protocols widens. Normal UDP and UDP-Lite waste approximately 33% bandwidth on FEC redundancy to achieve almost 100% recovery. UDP with header estimation achieves almost 100% recovery by wasting merely 20% bandwidth on FEC redundancy. Figure 29 (c) shows that at 11 Mbps the improvements provided by UDP with header estimation are quite significant; UDP with header estimation requires approximately 27% redundancy for almost 100% recovery, while both normal UDP and UDP-Lite require 53% redundancy. Thus header estimation salvages the high error rate 11 Mbps channel.

[Figure 29: plots of average decodable probability versus message bytes per block (16 to 28); panels (a) 2 Mbps, (b) 5.5 Mbps, (c) 11 Mbps.]

Figure 29. Average FEC redundancy required by UDP Normal, UDP-Lite and UDP with Header Estimation at different data rates of an 802.11b LAN; each point is averaged over $3 \times 5 \times 5 \times 25 = 1875$ received video streams.

A.6.4.6 Video Performance

In this section, we present results for the 5 streams per receiver experiment, with a

fixed rate FEC having two redundant bytes per RS codeword of 30 bytes. The average

GOP-by-GOP peak signal-to-noise ratio (PSNR) plots at different data rates are given in

Figure 30. All PSNR results are averaged over 75 received video streams. Since we

allow the very first video (I) frame of the first GOP to be received without any errors and

losses, PSNR of the first GOP is not plotted. The dotted line in Figure 30 represents

PSNR of error-free video, which provides a performance upper bound for the protocols

under consideration. PSNR of UDP with header estimation is the closest to the PSNR of

the error-free video at all data rates. At 2 and 5.5 Mbps, respective average PSNRs of

normal UDP and UDP-Lite are approximately 10 dB and 25 dB lower than the PSNR of

UDP with header estimation. However, at 11 Mbps the PSNR of UDP with header

estimation is approximately 25 dB higher than the PSNRs of normal UDP and UDP-Lite,

both of which render equally and extremely low PSNRs at 11 Mbps.


A.6.5 Discussion

In this chapter, we developed an effective header estimation framework for wireless

multimedia applications. The proposed framework used the channel models proposed in

preceding chapters to provide significant improvements in wireless bandwidth utilization.

In the following chapter, we show another use of the proposed channel models by

quantifying the simulation and analysis inaccuracies that are incurred if channel memory

is ignored.

[Figure 30: plots of average PSNR versus GOP (5 to 20) for Error-free, UDP Normal, UDP Lite and UDP Hdr Est; three panels, of which the surviving panel label is (c) 11 Mbps.]

Figure 30. Average PSNR of video sequences for UDP Normal, UDP-Lite and UDP with Header Estimation using a 30 byte RS codeword with 2 parity bytes; each graph is averaged over $3 \times 5 \times 5 = 75$ received video streams.


CHAPTER A.7

IMPACTS OF IGNORING CHANNEL MEMORY ON ANALYSIS AND SIMULATION OF WIRELESS SYSTEMS

Results of the preceding chapters have established that the MAC layer wireless bit-

error channels have memory. We have also shown that accurate and low-complexity

models can be developed to capture the underlying channel’s memory. The burstiness

and the consequent memory of wireless channels are well-accepted concepts in the

wireless research community. However, much of the contemporary research continues to

use memory-less binary-symmetric and 1st order Gilbert channels for bit-level theoretical

analysis and experimental evaluation of wireless protocols and applications [86]-[100].

The impacts of these simplistic bit-error channel models on the design and evaluation of

wireless systems are largely unexplored.

In this chapter, we quantify the impact of bit-level Markovian channel memory on the

performance of two commonly-used and very meaningful wireless performance metrics:

the expected goodput of an unreliable protocol and the expected number of per-packet

retransmissions for a reliable wireless protocol operating on a single-hop wireless

network. Due to the analytical intractability of the multifractal wavelet model, we focus

solely on the Markov-based channel models considered in this thesis. We derive the two

protocol performance metrics in terms of the parameters of four channel models of

varying memory-lengths, namely a memory-less binary-symmetric channel (BSC) model,

a two-state Gilbert channel (GC) model [81], an order-10 (1024 state) full-state Markov

chain, and an order-20 constant-complexity model (CCM). These models are trained


using actual 802.11b MAC layer bit-error traces and subsequently the trained models are

used to estimate the goodput and retransmissions.

We show that extremely misleading estimates of goodput and retransmissions are

obtained when using a BSC or a GC. In particular, for the retransmission metric the

results obtained under the memory-less assumption can be orders of magnitude more

pessimistic than what is observed on the actual channel. On the other hand, the estimates

provided by channel models with high-order memory (i.e., 1024 state FSM and constant-

complexity models) are highly accurate.

A.7.1 Goodput of an Unreliable Protocol

In this section, we quantify the goodput of an abstract unreliable protocol, such as

the UDP protocol [64], operating over wireless links. Here goodput refers to the ratio

between the number of received error-free packets and the total number of transmitted

packets. We compare how accurately the following bit-error wireless channel models

estimate the goodput of a wireless channel: (i) a memory-less binary-symmetric channel

(BSC) model, (ii) a 2-state Gilbert channel (GC) model [81], (iii) a full-state Markov

(FSM) channel model, and (iv) a constant-complexity channel model (CCM). We first

analytically derive packet goodput in terms of the channel models’ parameters. We train

these models using actual traces and then estimate the traces’ goodputs using the trained

models. If a model accurately characterizes the bit-error channel then it should provide a

goodput estimate that is very close to the trace-based goodput.


A.7.1.1 Goodput of a Wireless Channel

Since contemporary wireless stacks perform a checksum on each packet to detect and

drop corrupted packets, the present abstract protocol drops all packets with one or more

bit-errors. To cater for end-to-end sessions with multiple hops that include a wired

(Internet) segment followed by a wireless access segment, we assume that only the last

transmission hop is a wireless link. We assume an uncongested path between the sender

and the receiver. Also, the wireless hop employs a CSMA/CA mechanism to resolve

channel contentions, and therefore the number of collisions is negligible. These

assumptions ensure that all packet drops are due to channel noise and interference; i.e.,

for simplicity of analysis, we ignore packet drops due to congestion or collisions.

Since we define goodput as the ratio between the number of received error-free packets and the total number of transmitted packets, goodput is simply the probability $\gamma$ of receiving an error-free packet on the wireless channel. Goodput is constrained by $0 \leq \gamma \leq 1$, where $\gamma = 0$ represents the limiting case when all the received packets have errors and are therefore dropped, and $\gamma = 1$ represents the limiting case when all the received packets are error-free.

We first derive expressions of goodput estimates $\hat{\gamma}$ in terms of the parameters of the trained channel models. Second, we compute the actual goodput $\gamma$ of the bit-error traces used in this study. Then for each wireless trace, we train all four channel models considered in this chapter. Finally, the actual and estimated goodputs ($\gamma$ and the $\hat{\gamma}$'s) are compared.


A.7.1.2 Goodput of a Binary-Symmetric Channel Model

A binary symmetric channel (BSC) is a special case of the $q$-ary symmetric channel mentioned in the last chapter. Specifically, a BSC is a stateless channel that corrupts every transmitted bit with a probability $\varepsilon$. Consequently, goodput or the probability of receiving an error-free packet of length $L$ over a BSC is simply given by:

$$\hat{\gamma}_{\mathrm{BSC}} = \Pr\{\text{error-free pkt} \mid \mathrm{BSC}\} = (1 - \varepsilon)^{L}. \qquad \text{(A.43)}$$

Given training bit-error data, the parameter $\varepsilon$ is computed by taking the ratio between the number of bad bits and the total number of bits in the training data.
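To make the comparison between actual and estimated goodput concrete, the following Python sketch trains the BSC parameter on a bit-error trace and evaluates (A.43), alongside the trace-based goodput. The trace here is synthetic and hypothetical (a single burst of errors), not one of the 802.11b traces used in this thesis, and splitting the trace into back-to-back packets is an assumption of the example.

```python
def trace_goodput(error_bits, packet_len):
    """Fraction of consecutive packet_len-bit packets in the trace with no bit-errors."""
    n_packets = len(error_bits) // packet_len
    clean = sum(
        1
        for p in range(n_packets)
        if not any(error_bits[p * packet_len:(p + 1) * packet_len])
    )
    return clean / n_packets


def bsc_goodput(error_bits, packet_len):
    """BSC estimate: eps = bad bits / total bits, goodput = (1 - eps)^L."""
    eps = sum(error_bits) / len(error_bits)
    return (1.0 - eps) ** packet_len


# Synthetic trace: 1000 bits with one burst of 10 consecutive bit-errors.
trace = [0] * 1000
for position in range(500, 510):
    trace[position] = 1

print(trace_goodput(trace, 100), bsc_goodput(trace, 100))
```

On this bursty toy trace the memory-less estimate falls well below the trace-based goodput, because the BSC spreads the ten errors uniformly over all packets instead of concentrating them in one.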

A.7.1.3 Goodput of a Gilbert Channel Model

The Gilbert channel (GC) [81] is a 1st order Markov chain with a good and a bad state. In the present bit-error modeling context, the two Gilbert states jointly capture a process with a memory-length of one bit. The probability of the next (good or bad) bit is dependent on whether the last received bit was good or bad. Transitions to the good state result in error-free bits, while transitions to the bad state yield corrupted bits. Due to the present notation, we represent the good and bad states as state 0 and state 1, respectively. The GC is completely characterized using two parameters, $p_{0,0}$ and $p_{1,1}$, where 0 represents the error-free state and 1 represents the error state. Although both BSC and GC are special cases of FSM chains, we treat them separately because of their widespread use in wireless studies [86]-[100].

As shown in the last chapter, goodput or the probability of receiving an error-free packet of length $L$ over a GC is given by:

$$\hat{\gamma}_{\mathrm{GC}} = \Pr\{\text{error-free pkt} \mid \mathrm{GC}\} = \pi_0 \, p_{0,0}^{L} + \pi_1 \, p_{1,0} \, p_{0,0}^{L-1} = p_{0,0}^{L-1} \left( \pi_0 \, p_{0,0} + \pi_1 \, p_{1,0} \right) = \pi_0 \, p_{0,0}^{L-1}, \qquad \text{(A.44)}$$

where the last equality follows from the steady-state condition $\pi_0 = \pi_0 \, p_{0,0} + \pi_1 \, p_{1,0}$. The above expression shows that the probability of getting a good packet over a Gilbert channel model is simply the probability of starting in the error-free state and then staying in that state for the length of the packet.
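The simplification in (A.44) can be checked numerically with a short Python sketch. The steady-state probabilities are computed from the two Gilbert parameters using the standard two-state result $\pi_0 = p_{1,0}/(p_{0,1} + p_{1,0})$; the parameter values are hypothetical.

```python
def gilbert_goodput(p00, p11, packet_len):
    """Return (expanded form, simplified form) of the Gilbert-channel goodput in (A.44)."""
    p01 = 1.0 - p00
    p10 = 1.0 - p11
    pi0 = p10 / (p01 + p10)  # steady-state probability of the good state
    pi1 = 1.0 - pi0          # steady-state probability of the bad state
    expanded = pi0 * p00 ** packet_len + pi1 * p10 * p00 ** (packet_len - 1)
    simplified = pi0 * p00 ** (packet_len - 1)
    return expanded, simplified


# Hypothetical parameters: long good runs, short error bursts, 100-bit packets.
expanded, simplified = gilbert_goodput(p00=0.999, p11=0.6, packet_len=100)
print(expanded, simplified)
```

Both forms agree up to floating-point rounding, as the steady-state identity predicts.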

A.7.1.4 Goodput of a Full-state Markov Channel Model

The probability of receiving an error-free packet of $L$ bits on a $k$-th order FSM channel model depends on the present state of the model. If the last received bit was error-free, then the least-significant bit in the memory-window is zero, implying that the FSM chain is in an even state. On the other hand, if the last received bit was corrupted, then the FSM chain is in an odd state.

Let us first focus on the scenario of currently being in an even state and then receiving $L$ consecutive good bits. Throughout this chapter, we invoke the realistic assumption that $L > k$, where $k$ is the memory-length of the process. Let FSM state $2i$, $0 \le i \le 2^{k-1}-1$, be the current even state of the FSM channel model. Since all FSM states have the $\bmod\ 2^k$ operation, unless otherwise stated, we drop the $\bmod\ 2^k$ operation throughout the following text. Recall from Observation 1 in Section A.5.4 that every FSM state $i$ can transit to only two other states. Thus the current state $2i$ can transit to either state $2(2i)$ or state $2(2i)+1$. Since we are only concerned with bursts of error-free bits, the probability of getting an error-free bit starting in state $2i$ is $p_{2i,\,2(2i)}$. Now for the length of the memory-window, the next $k-1$ transitions will be between even states, giving the following state sequence:

$$2^0(2i) = 2i \;\rightarrow\; 2^1(2i) = 2(2i) \;\rightarrow\; \cdots \;\rightarrow\; 2^{k-1}(2i) \bmod 2^k = 0.$$

Thus after these $k-1$ transitions the process will be in FSM state $0$. From that state, to get the remaining error-free bits, the next $L-(k-1)$ transitions will be from state $0$ to state $0$. To generalize the above discussion in terms of FSM chain parameters, the probability of getting a burst of $L$ good bits starting in FSM state $2i$ is given by

$$\pi_{2i} \prod_{j=0}^{k-2} p_{2^j(2i),\,2^{j+1}(2i)} \left(p_{0,0}\right)^{L-(k-1)}.$$

This probability has to be summed over all possible even FSM states, yielding

$$\sum_{i=0}^{2^{k-1}-1} \pi_{2i} \prod_{j=0}^{k-2} p_{2^j(2i),\,2^{j+1}(2i)} \left(p_{0,0}\right)^{L-(k-1)}.$$

An expression for the probability of getting an error-free packet starting in an odd

FSM state can be derived similarly. Adding these expressions gives the goodput of an

FSM channel model as follows:

$$
\begin{aligned}
\hat{\gamma}_{FSM} &= \Pr\{\text{error-free pkt} \mid FSM\} \\
&= \sum_{i=0}^{2^{k-1}-1} \left[ \pi_{2i} \prod_{j=0}^{k-2} p_{2^j(2i),\,2^{j+1}(2i)} \left(p_{0,0}\right)^{L-(k-1)} + \pi_{2i+1} \prod_{j=0}^{k-1} p_{2^j(2i+1),\,2^{j+1}(2i+1)} \left(p_{0,0}\right)^{L-k} \right] \\
&= \left(p_{0,0}\right)^{L-k} \sum_{i=0}^{2^{k-1}-1} \left[ \pi_{2i} \prod_{j=0}^{k-1} p_{2^j(2i),\,2^{j+1}(2i)} + \pi_{2i+1} \prod_{j=0}^{k-1} p_{2^j(2i+1),\,2^{j+1}(2i+1)} \right].
\end{aligned}
\tag{A.45}
$$

The above expression gives the overall probability of getting L consecutive error-free

bits by summing over all possible state paths starting in an even or an odd FSM state.
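The path-sum above can be sketched in a few lines of code. The sketch assumes that FSM states are the integers $0,\dots,2^k-1$, that the successor of state $s$ after a good bit is $2s \bmod 2^k$, and that the chain is described by a hypothetical list `p_good[s]` holding the probability that the next bit is error-free given state $s$. The helper `stationary` (simple power iteration) is likewise an illustrative assumption, not the thesis's training procedure.

```python
def stationary(p_good, k, iters=2000):
    """Approximate steady-state probabilities of a k-th order FSM chain.

    p_good[s] = Pr(next bit error-free | state s); a good bit shifts a 0
    into the memory-window, an error shifts in a 1.
    """
    n = 1 << k
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [0.0] * n
        for s in range(n):
            nxt[(2 * s) % n] += pi[s] * p_good[s]
            nxt[(2 * s + 1) % n] += pi[s] * (1.0 - p_good[s])
        pi = nxt
    return pi


def fsm_goodput(pi, p_good, k, L):
    """Probability of L consecutive error-free bits, summed over all start states."""
    n = 1 << k
    assert len(pi) == n and len(p_good) == n and L > k
    # After k good bits every start state has been flushed to state 0,
    # so the remaining L-k transitions are all 0 -> 0.
    tail = p_good[0] ** (L - k)
    total = 0.0
    for s in range(n):
        prob, state = pi[s], s
        for _ in range(k):
            prob *= p_good[state]
            state = (2 * state) % n
        total += prob
    return tail * total
```

For $k = 1$ the chain has two states and the result reduces to the Gilbert goodput $\pi_0\, p_{0,0}^{L-1}$, which gives a quick sanity check on the indexing.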

A.7.1.5 Goodput of a Constant-Complexity Channel Model

The constant-complexity model (CCM) aggregates states of the FSM chain as shown in Figure 19. Recall that the CCM aggregates states of an FSM chain of arbitrary order into a five-state model. Specifically, FSM states $0$, $1$ and $2^{k-1}$ are kept in three isolated states of the CCM. The remaining even FSM states are aggregated into one CCM state, while the remaining odd FSM states are aggregated into another CCM state. Throughout the following text, we refer to the five CCM states as $c_0$, $c_1$, $c_{2^{k-1}}$, $c_{even}$ and $c_{odd}$. Note that at any time instance, if the process is in state $c_0$, $c_{2^{k-1}}$ or $c_{even}$, then the last received bit was error-free. Similarly, the CCM being in state $c_1$ or state $c_{odd}$ implies that the last received bit was corrupted. The probability of transiting from current CCM state $c_i$ to CCM state $c_j$ is denoted by $p_{c_i,c_j}$, and $\pi_{c_i}$ represents the steady-state probability of being in CCM state $c_i$.

To get a burst of $L$ error-free bits on a CCM-based channel, we have to consider that the CCM can be in any of the five states when the burst starts. If the CCM is in state $c_0$ at the start of the burst, then the probability that the following $L$ bits are error-free is simply given by $\left(p_{0,0}\right)^L$ (since $c_0$ is the isolated FSM state $0$, $p_{c_0,c_0} = p_{0,0}$). If the process is in state $c_1$, for the next bit to be error-free, the CCM should transit to state $c_{even}$. This transition has to be followed by $k-3$ good bits, i.e., $k-3$ transitions from $c_{even}$ to $c_{even}$. After that the CCM should transit to state $c_{2^{k-1}}$ and then to state $c_0$. Once in state $c_0$, the process will continue being in that state for the following $L-k$ transitions. Summarizing the above discussion gives the probability of receiving an error-free packet starting in state $c_1$ as

$$\pi_{c_1}\, p_{c_1,c_{even}} \left(p_{c_{even},c_{even}}\right)^{k-3} p_{c_{even},c_{2^{k-1}}}\, p_{c_{2^{k-1}},c_0} \left(p_{c_0,c_0}\right)^{L-k}.$$

Similar expressions can be derived for the remaining CCM states. Now summing over all possible initial states gives the complete expression for CCM goodput as

$$
\begin{aligned}
\hat{\gamma}_{CCM} &= \Pr\{\text{error-free pkt} \mid CCM\} \\
&= \pi_{c_0} \left(p_{c_0,c_0}\right)^{L} + \pi_{c_1}\, p_{c_1,c_{even}} \left(p_{c_{even},c_{even}}\right)^{k-3} p_{c_{even},c_{2^{k-1}}}\, p_{c_{2^{k-1}},c_0} \left(p_{c_0,c_0}\right)^{L-k} \\
&\quad + \pi_{c_{odd}}\, p_{c_{odd},c_{even}} \left(p_{c_{even},c_{even}}\right)^{k-3} p_{c_{even},c_{2^{k-1}}}\, p_{c_{2^{k-1}},c_0} \left(p_{c_0,c_0}\right)^{L-k} \\
&\quad + \pi_{c_{even}} \left(p_{c_{even},c_{even}}\right)^{k-2} p_{c_{even},c_{2^{k-1}}}\, p_{c_{2^{k-1}},c_0} \left(p_{c_0,c_0}\right)^{L-k} \\
&\quad + \pi_{c_{2^{k-1}}}\, p_{c_{2^{k-1}},c_0} \left(p_{c_0,c_0}\right)^{L-1}.
\end{aligned}
\tag{A.46}
$$

Similar to the FSM expression of (A.45), the above probability sums over all possible

CCM state paths of receiving an error-free packet of length L bits.
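A minimal sketch of the five-term CCM sum follows, assuming the model parameters are supplied as two hypothetical dictionaries: `pi` maps a state name to its steady-state probability and `p` maps a `(from, to)` pair to a transition probability. The state names `"c0"`, `"c1"`, `"half"` (standing for $c_{2^{k-1}}$), `"even"` and `"odd"` are illustrative labels only, and the example parameter values are made up.

```python
def ccm_goodput(pi, p, k, L):
    """Probability of L consecutive error-free bits on the five-state CCM."""
    assert k >= 3 and L > k
    # Shared ending of every path that passes through c_even:
    # c_even -> c_{2^(k-1)} -> c_0, then L-k self-transitions in c_0.
    tail = p["even", "half"] * p["half", "c0"] * p["c0", "c0"] ** (L - k)
    stay_even = p["even", "even"]
    return (
        pi["c0"] * p["c0", "c0"] ** L
        + pi["c1"] * p["c1", "even"] * stay_even ** (k - 3) * tail
        + pi["odd"] * p["odd", "even"] * stay_even ** (k - 3) * tail
        + pi["even"] * stay_even ** (k - 2) * tail
        + pi["half"] * p["half", "c0"] * p["c0", "c0"] ** (L - 1)
    )


# Made-up example parameters: uniform start distribution, every
# "good bit" transition has probability 0.9.
STATES = ["c0", "c1", "half", "even", "odd"]
example_pi = {s: 0.2 for s in STATES}
example_p = {(a, b): 0.9 for a in STATES for b in STATES}
```

Each of the five paths contains exactly $L$ transitions, so when every good-bit transition probability equals the same value $q$ the result collapses to $q^L$; this is a convenient check that the exponents add up.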

A.7.1.6 Comparison of Estimated Goodputs

In this section, we compare the goodput estimates provided by the channel models against the goodput computed from an actual trace. For comparison with a trace, we first train all four models (BSC, GC, FSM and CCM) using that trace. We then plug the trained parameters of these models into equations (A.43) to (A.46) to obtain each model's goodput estimate for the trace.

[Figure 31: bar chart of goodput % (vertical axis, 0 to 100) at 2 Mbps and 5.5 Mbps (horizontal axis). Legend: actual traces; binary-symmetric channel model; 2-state Gilbert channel model; 1024-state full-state Markov channel model; 5-state constant-complexity channel model.]

Figure 31. Comparison of the average goodput of the actual traces with the goodput estimates provided by BSC, Gilbert, 1024-state Markov, and 5-state CCM models; each result is averaged over five traces.

Actual and estimated goodputs are compared in Figure 31. The results in Figure 31 are averaged over five traces at each physical layer data rate. The CCM is trained by aggregating states of an order-20 FSM chain. A packet length of 100 bytes is used to compute actual and estimated goodputs. For both data rates, the goodput estimates provided by the binary-symmetric and Gilbert channels are highly pessimistic and inaccurate: the BSC and the GC estimate goodputs of approximately 20% and 30%, respectively, while the actual goodput is approximately 97%. Since the Gilbert channel incorporates one bit of memory, its goodput estimate is somewhat better than that of the memoryless binary-symmetric channel. However, both of these channel models are too inaccurate to be used in any realistic measurement or analytical study. The order-10 full-state Markov model provides very accurate goodput estimates because it incorporates high-order channel memory. While being significantly less complex than the FSM model, the CCM provides estimates that are even better than those of the order-10 FSM model because the CCM is constructed by aggregating states of an order-20 FSM chain.

A.7.2 Retransmissions of a Reliable Protocol

In this section, we show that the expected number of retransmissions per packet can be modeled as a simple function of the goodput. We then compare the retransmission estimates provided by the models under consideration.

A.7.2.1 Expected Retransmissions on a Wireless Channel

In this section, we quantify the expected number of retransmissions experienced by a packet being transported by an abstract reliable protocol, such as the transmission control protocol (TCP) [101] or the 802.11 MAC layer protocol [58]. We focus only on the retransmission-due-to-channel-noise aspect of reliable protocols by employing the following simple abstraction: keep retransmitting until the packet is received correctly. We acknowledge that this abstraction is somewhat unrealistic because reliable protocols generally stop retransmitting after a certain threshold. However, this abstraction allows us to quantify the worst-case performance of the channel models under consideration. As in the previous section, the abstract reliable protocol at the receiver drops all packets with one or more bit-errors. We also carry over the assumption from the last section that only the last transmission hop is a wireless link.

Let $X$ denote the random variable representing the total number of retransmissions required to successfully transmit a packet under the abstract retransmission protocol. Due to the present abstraction, $X$ can be modeled as a geometric random variable with parameter $\gamma$, where $\gamma$ is defined in the last section as the true probability of a successful packet on the wireless channel. More specifically, the probability that a packet will experience $m$ retransmissions can be expressed as $\Pr\{X = m\} = \gamma\,(1-\gamma)^m$. Consequently, the expected number of retransmissions $\beta$ is

$$\beta = E\{X\} = \frac{1}{\gamma} - 1. \tag{A.47}$$

As expected intuitively, the expected number of retransmissions decreases as the probability of a good packet increases: an increase in $\gamma$ causes the $1/\gamma$ term to decrease.
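For completeness, the expectation can be worked out directly from the geometric probability mass function. The derivation below is a standard series manipulation, written with $q = 1-\gamma$ as shorthand (a symbol introduced here only for convenience) and assuming $0 < \gamma \le 1$.

```latex
\begin{align*}
E\{X\} &= \sum_{m=0}^{\infty} m \,\gamma\,(1-\gamma)^m
        = \gamma\, q \sum_{m=1}^{\infty} m\, q^{m-1}, \qquad q = 1-\gamma \\
       &= \gamma\, q \cdot \frac{1}{(1-q)^2}
        = \frac{\gamma\,(1-\gamma)}{\gamma^{2}}
        = \frac{1-\gamma}{\gamma}
        = \frac{1}{\gamma} - 1 .
\end{align*}
```

As a quick check, $\gamma = 0.5$ gives one expected retransmission per packet, and $\gamma = 1$ gives none.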

Until this point, we have assumed that we accurately know the value of $\gamma$, the true probability of a successful packet on the wireless channel. In wireless simulations, an estimate $\hat{\gamma}$ of this parameter is provided by a wireless channel model. From the last section, we know that equations (A.43) to (A.46) provide the $\hat{\gamma}$ estimates for the BSC, GC, FSM and CCM channel models. Given the $\hat{\gamma}$ estimates, the estimated number of retransmissions per packet can be computed as:

$$\hat{\beta} = \frac{1}{\hat{\gamma}} - 1. \tag{A.48}$$

Plugging in equations (A.43) to (A.46) renders each channel model’s estimate of per-packet retransmissions.

A.7.2.2 Comparison of Estimated Retransmissions

To compute the average number of retransmissions per packet from an actual trace, we divide the trace into 100-byte packets. Then, to emulate transmission of packet $i$, we count the burst-length of corrupted packets including and following packet $i$. This burst-length is the number of retransmissions that packet $i$ will experience. The burst-lengths (i.e., retransmissions) of all the emulated packets are accumulated. Finally, the accumulated retransmission count is normalized by the total number of error-free packet transmissions.
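The procedure above can be sketched as follows. This is one possible reading of the description, not the author's code: it assumes the trace is a sequence of 0/1 bit-error flags, that each emulated packet occupies consecutive packet slots until one slot is error-free, and that a burst of corrupted packets at the very end of the trace (which never ends in a delivery) is ignored. The function names are hypothetical.

```python
def packet_flags(bit_errors, packet_bits=800):
    """Split a 0/1 bit-error trace into packets (800 bits = 100 bytes).

    Returns one boolean per complete packet; True marks a corrupted packet,
    i.e. a packet containing one or more bit-errors.
    """
    n_packets = len(bit_errors) // packet_bits
    return [
        any(bit_errors[i * packet_bits:(i + 1) * packet_bits])
        for i in range(n_packets)
    ]


def trace_retransmissions(corrupted):
    """Average retransmissions per delivered packet for a list of packet flags."""
    total_retx = 0   # accumulated burst-lengths
    delivered = 0    # error-free packet transmissions
    i = 0
    while i < len(corrupted):
        burst = 0
        while i < len(corrupted) and corrupted[i]:
            burst += 1
            i += 1
        if i == len(corrupted):
            break    # assumption: drop a trailing burst with no delivery
        total_retx += burst
        delivered += 1
        i += 1       # consume the error-free transmission
    return total_retx / delivered if delivered else float("inf")
```

Under this reading the result equals the number of corrupted packets divided by the number of error-free packets (apart from a trailing burst), which has the same form as $1/\gamma - 1$ when $\gamma$ is the fraction of error-free packets.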

As before, parameters of the channel models are derived from the traces against which they are being compared. Note that the results of (A.48) are not computed by taking the reciprocal of the averaged goodput results of Figure 31. Instead, the retransmission estimates are computed by applying equation (A.48) to a model that is trained specifically for a particular trace. Since equation (A.48) takes the reciprocal of $0 \le \hat{\gamma} \le 1$, a model with low $\hat{\gamma}$ can render very high values of $\hat{\beta}$.

Figure 32 plots the average number of retransmissions per packet observed in an actual trace against the retransmission estimates provided by the binary-symmetric, Gilbert, full-state Markov and constant-complexity channel models. Figure 32 shows that the estimates provided by the BSC model are grossly inaccurate. For instance, at 2 Mbps the BSC model estimates the expected number of retransmissions per packet to be approximately 700, whereas the average number of per-packet retransmissions observed in the actual traces is about 0.02. The highly inaccurate retransmission estimates by the BSC are mostly due to receiver-4’s traces. The goodput estimate of the BSC model for this trace is approximately 0.0003 at 2 Mbps. Putting this value into equation (A.48) gives an extremely inaccurate estimate of more than 3000 retransmissions per packet. This simple result shows the scale of inaccuracy that is incurred if channel memory is completely ignored during theoretical or experimental verification of a wireless system.
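The sensitivity of the retransmission estimate to a small goodput estimate is easy to reproduce numerically. The sketch below applies the reciprocal relation to the goodput value of 0.0003 quoted above; the function names are illustrative, and the second helper only shows, for made-up inputs, why averaging per-trace estimates differs from taking the reciprocal of an averaged goodput.

```python
import math


def estimated_retransmissions(gamma_hat: float) -> float:
    """Estimated retransmissions per packet, 1/gamma_hat - 1."""
    if not 0.0 <= gamma_hat <= 1.0:
        raise ValueError("gamma_hat must be a probability in [0, 1]")
    if gamma_hat == 0.0:
        return math.inf  # no packet ever gets through
    return 1.0 / gamma_hat - 1.0


def mean_per_trace_estimate(gamma_hats):
    """Average of per-trace estimates (each trace has its own trained model)."""
    return sum(estimated_retransmissions(g) for g in gamma_hats) / len(gamma_hats)


def estimate_from_mean_goodput(gamma_hats):
    """Reciprocal relation applied to the averaged goodput, shown for contrast."""
    return estimated_retransmissions(sum(gamma_hats) / len(gamma_hats))
```

With a goodput estimate of 0.0003 the first function returns a value above 3000, in line with the figure quoted in the text. Because the relation is convex in the goodput estimate, a single trace with a very low estimate dominates the per-trace average, whereas it has far less effect on the reciprocal of the averaged goodput.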

[Figure 32: bar chart of retransmissions per packet (vertical axis, 0 to 800) at 2 Mbps and 5.5 Mbps (horizontal axis). Legend: actual traces; binary-symmetric channel model; 2-state Gilbert channel model; 1024-state full-state Markov channel model; 5-state constant-complexity channel model.]

Figure 32. Comparison of the number of retransmissions per packet estimated by BSC, Gilbert, 1024-state Markov, and 5-state CCM models; each result is averaged over five traces.

[Figure 33: bar chart of retransmissions per packet (vertical axis, 0 to 40) at 2 Mbps and 5.5 Mbps (horizontal axis). Legend: actual traces; 2-state Gilbert channel model; 1024-state full-state Markov channel model; 5-state constant-complexity channel model.]

Figure 33. Number of retransmissions per packet without the BSC model.

[Figure 34: bar chart of retransmissions per packet (vertical axis, 0 to 0.16) at 2 Mbps and 5.5 Mbps (horizontal axis). Legend: actual traces; 1024-state full-state Markov channel model; 5-state constant-complexity channel model.]

Figure 34. Number of retransmissions per packet without the BSC model.

The estimates of the BSC model are so overwhelmingly inaccurate that the remaining plots are not clearly visible in Figure 32. Therefore, in Figure 33 we plot the results without the BSC model. From Figure 33, it can be seen that at 2 Mbps even the Gilbert channel provides very inaccurate estimates of the expected number of retransmissions. The GC estimate is closer to the actual traces at 5.5 Mbps, but is still significantly worse than the FSM and CCM models. Figure 34 only shows the estimates by the 1024-state FSM model and the CCM. Since these channel models incorporate high-order memory, their estimates are extremely close to the retransmissions observed in the actual traces.

CHAPTER A.8 CONCLUSIONS AND FUTURE WORK

In this part of the thesis, we showed that 802.11b MAC layer bit-errors at 2 and 5.5

Mbps are Markovian, while bit-errors at 11 Mbps are long-range dependent. We

demonstrated that high-order full-state Markov (FSM) chains can model the bit-errors at

2 and 5.5 Mbps. A multifractal wavelet model (MWM) was used to characterize 11 Mbps

bit-errors. We mitigated the complexity of FSM chains by approximating FSM behavior

using a constant-complexity model which always comprised five states and was highly

accurate. We employed the proposed channel models to estimate corrupted packet

headers in an FEC-based wireless multimedia framework. This novel framework

provided significant improvements in bandwidth utilization and multimedia quality.

Finally, we highlighted some of the inaccuracies that are incurred by using inaccurate

models. These inaccuracies can be avoided by using the constant-complexity model

proposed in this thesis.

As future work, we will study the applicability of the proposed models to other wireless channels. Another ongoing extension of this work is to incorporate the proposed channel models into network simulators, such as ns-2 [102] and Qualnet [103]. We are also investigating alternative methods that can reduce the complexity of the header estimation framework. Finally, we intend to extend analysis similar to Chapter A.7 to other wireless protocols and systems so that we can quantify the inaccuracies that are incurred by inaccurate channel models.

PART-A REFERENCES

[1] B. D. Fritchman, “A Binary Channel Characterization using Partitioned Markov Chains,” IEEE Transactions on Information Theory, vol. 13, pp. 221–227, April 1967.

[2] S. Tsai, “Markov Characterization of the HF Channel,” IEEE Transactions on Communication Technology, vol. 17, pp. 24–32, February 1969.

[3] H. O. Burton and D. Sullivan, “Errors and Error Control,” Proceedings of the IEEE, pp. 1293–1301, November 1972.

[4] H. A. Blank and P. J. Trafton, “A Markov Error Channel Model,” IEEE National Telecommunications Conference, December 1973.

[5] R. T. Chien, A. H. Haddad, B. Goldberg and E. Moyes, “An Analytic Error Model for Real Channels,” IEEE International Conference on Communications (ICC), June 1972.

[6] A. H. Haddad, S. Tsai, B. Goldberg, G. C. Ranieri, “Markov Gap Models for Real Communication Channels,” IEEE Transactions on Communications, vol. 23, no. 11, pp. 1189–1197, 1975.

[7] L. N. Kanal and A. R. K. Sastry, “Models for Channels with Memory and Applications to Error Control,” Proceedings of the IEEE, vol. 66, no. 7, pp. 724–744, 1978.

[8] M. Yajnik, S. Moon, J. Kurose, and Don Towsley, “Measurement and Modelling of the Temporal Dependence in Packet Loss,” IEEE Infocom, March 1999.

[9] M. Zorzi and R. R. Rao, “On Channel Modeling for Delay Analysis of Packet Communications over Wireless Links,” Allerton Conference on Communications, Control and Computing, September 1998.

[10] H. Balakrishnan and R. Katz, “Explicit Loss Notification and Wireless Web Performance,” IEEE Globecom, November 1998.

[11] M. Zorzi and R. R. Rao, “On the Statistics of Block Errors in Bursty Channels,” IEEE Transactions on Communications, vol. 45, no. 6, pp. 660–667, June 1997.

[12] R. R. Rao, “Higher Layer Perspectives on Modeling the Wireless Channel,” IEEE ITW, June 1998.

[13] G. T. Nguyen, R. Katz, and B. Noble, “A Trace-based Approach for Modeling Wireless Channel Behavior,” Winter Simulation Conference, December 1996.

[14] A. Konrad, B. Y. Zhao, A. D. Joseph, and R. Ludwig, “A Markov-based Channel Model Algorithm for Wireless Networks,” ACM Wireless Networks Journal (WINET), vol. 9, pp. 189 – 199, 2003.

[15] A. Konrad, B. Y. Zhao, A. D. Joseph, and R. Ludwig, “A Markov-based Channel Model Algorithm for Wireless Networks,” ACM Mobicom Workshop on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM), July 2001.

[16] P. Ji, B. Liu, D. Towsley, Z. Ge, and J. Kurose, “Modeling Frame-level Errors in GSM Wireless Channels,” Performance Evaluation Journal, vol. 55, no. 1-2, pp. 165–181, January 2004.

[17] P. Ji, B. Liu, D. Towsley, and J. Kurose, “Modeling Frame-level Errors in GSM Wireless Channels,” IEEE Globecom, November 2002.

[18] S. A. Khayam, S. Karande, H. Radha, and D. Loguinov, “Performance Analysis and Modeling of Errors and Losses over 802.11b LANs for High-Bitrate Real-time Multimedia,” Signal Processing: Image Communication, vol. 18, no. 7, pp. 575–595, August 2003.

[19] S. Karande, S. A. Khayam, M. Krappel, and H. Radha, “Analysis and Modeling of Errors at the 802.11b Link Layer,” IEEE International Conference on Multimedia and Expo (ICME), July 2003.

[20] S. A. Khayam and H. Radha, “Markov-based Modeling of Wireless Local Area Networks,” ACM Mobicom Workshop on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM), September 2003.

[21] S. A. Khayam, S. Aviyente, H. Radha, and J. R. Deller, Jr. “Markov and Multifractal Wavelet Models for Wireless MAC-to-MAC Channels,” Performance Evaluation, to appear.

[22] S. A. Khayam, S. Aviyente and H. Radha, “On Long-Range Dependence in High-Bitrate Wireless Residual Channels,” Conference on Information Sciences and Systems (CISS), March 2005.

[23] R. R. Rao, “Higher Layer Perspectives on Modeling the Wireless Channel,” IEEE ITW, June 1998.

[24] A. M. Chen and R. R. Rao, “Wireless Channel Models – Coping with Complexity,” Wireless Multimedia Network Technologies, Kluwer Academic Publishers, pp. 271–288, 1999.

[25] A. M. Chen and R. R. Rao, “On Tractable Wireless Channel Models,” IEEE PIMRC, September 1998.

[26] A. Willig, M. Kubisch, C. Hoene, and A. Wolisz, “Measurements of a Wireless Link in an Industrial Environment using an IEEE 802.11-Compliant Physical Layer,” IEEE Transactions on Industrial Electronics, vol. 49, no. 6, pp. 1265–1282, 2002.

[27] A. Willig, “A New Class of Packet- and Bit-Level Models for Wireless Channels,” IEEE PIMRC, October 2001.

[28] A. Köpke, A. Willig, and H. Karl, “Chaotic Maps as Parsimonious Bit Error Models of Wireless Channels,” IEEE Infocom, March 2003.

[29] S. A. Khayam and H. Radha, “Linear-Complexity Models for Wireless MAC-to-MAC Channels,” ACM Wireless Networks (WINET) Journal, vol. 11, no. 5, September 2005.

[30] S. A. Khayam and H. Radha, “Constant-Complexity Models for Wireless Channels,” IEEE Infocom, April 2006.

[31] R. Caceres and L. Iftode, “Improving the Performance of Reliable Transport Protocols in Mobile Computing Environments,” IEEE Journal on Selected Areas in Communications (JSAC), vol. 13, no. 5, 1995.

[32] A. Bakre and B. R. Badrinath, “I-TCP: Indirect TCP for Mobile Hosts,” IEEE ICDCS, May 1995.

[33] R. Yavatkar and N. Bhagwat, “Improving End-to-End Performance of TCP over Mobile Internetworks,” Workshop on Mobile Computing Systems and Applications, Dec. 1994.

[34] H. Balakrishnan, V. N. Padmanabhan, S. Seshan and R. H. Katz, “A Comparison of Mechanisms for Improving TCP Performance over Wireless Links,” IEEE/ACM Transactions on Networking, 1997.

[35] G. Holland and N. Vaidya, “Analysis of TCP Performance over Mobile Ad Hoc Networks,” ACM Wireless Networks (WINET), vol. 8, pp. 275–288, 2002.

[36] Z. Fu, X. Meng, and S. Lu, “How Bad TCP Can Perform In Mobile Ad Hoc Networks,” IEEE ISCC, 2002.

[37] K. Chandran, S. Raghunathan, S. Venkatesan, and R. Prakash, “A Feedback based Scheme for Improving TCP Performance in Ad-hoc Wireless Networks,” IEEE ICDCS, 1998.

[38] T. D. Dyer and R. V. Boppana, “A Comparison of TCP Performance over Three Routing Protocols for Mobile Ad Hoc Networks,” ACM MobiHoc, Oct. 2001.

[39] C. Parsa and J. J. Garcia-Luna-Aceves, “Improving TCP Performance over Wireless Networks at the Link Layer,” Mobile Networks and Applications, vol. 5, pp. 57–71, 2000.

[40] M. Gerla, K. Tang, and R. Bagrodia, “TCP Performance in Wireless Multihop Networks,” IEEE WMCSA, 1999.

[41] L-A. Larzon, M. Degermark, S. Pink, L-E. Jonsson, and G. Fairhurst, “The Lightweight User Datagram Protocol (UDP-Lite),” RFC 3828, July 2004.

[42] L-A. Larzon, M. Degermark, and S. Pink, “UDP-Lite for real time multimedia applications,” IEEE ICC, Jun. 1999.

[43] L-A. Larzon, M. Degermark, and S. Pink, “Efficient use of wireless bandwidth for multimedia applications,” IEEE MOMUC, Oct. 2000.

[44] L-A. Larzon, M. Degermark, and S. Pink, “The Lightweight User Datagram Protocol (UDP-Lite),” RFC 3828, Jul. 2004.

[45] H. Zheng and J. Boyce, “An Improved UDP Protocol for Video Transmission Over Internet-to-Wireless Networks,” IEEE Transactions on Multimedia, vol. 3, no. 3, pp. 356–365, September 2001.

[46] H. Zheng, “Optimizing Wireless Multimedia Transmissions through Cross Layer Design,” IEEE International Conference on Multimedia and Expo (ICME), July 2003.

[47] A. Singh, A. Konrad, and A. D. Joseph, “Performance evaluation of UDP-Lite for cellular video,” ACM NOSSDAV, 2001.

[48] A. Servetti and J. C. De Martin, “Error tolerant MAC extension for speech communications over 802.11 WLANs,” IEEE VTC, 2005.

[49] C. H. Shih, Y. M. Tou, and C. K. Shieh, “A self-regulated redundancy control scheme for wireless video transmission,” IEEE WirelessCom, 2005.

[50] E. Masala, M. Bottero, and J. C. De Martin, “MAC-level partial checksum for H.264 video transmission over 802.11 ad hoc wireless networks,” IEEE VTC, 2005.

[51] S. A. Khayam, S. Karande, M. Krappel, and H. Radha, “Cross-Layer Protocol Design for Real-time Multimedia Applications over 802.11b Networks,” IEEE International Conference on Multimedia and Expo (ICME), July 2003.

[52] Z. Ye, S. V. Krishnamurthy, and S. K. Tripathi, “A Framework for Reliable Routing in Mobile Ad Hoc Networks,” IEEE Infocom, 2003.

[53] J. Tang, G. Xue, and W. Zhang, “Reliable Routing in Mobile Ad Hoc Networks Based on Mobility Prediction,” IEEE MASS, Oct. 2004.

[54] S. Mueller, R. P. Tsang, and D. Ghosal, “Multipath Routing in Mobile Ad Hoc Networks: Issues and Challenges,” Lecture Notes in Computer Science, 2004.

[55] P. Papadimitratos, Z. J. Haas, and E. G. Sirer, “Path Set Selection in Mobile Ad Hoc Networks,” ACM MobiHoc, Jun. 2002.

[56] F. Zhai, Y. Eisenberg, T. N. Pappas, R. Berry, and A. K. Katsaggelos, “Rate-Distortion Optimized Product Code Forward Error Correction for Video Transmission over IP-based Wireless Networks,” International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2004.

[57] ISO/IEC 8802-11:1999(E), “Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications,” August 1999.

[58] IEEE Std 802.11b-1999, “Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Higher-Speed Physical Layer Extension in the 2.4 GHz band,” September 1999.

[59] J. G. Kemeny and J. L. Snell, Finite Markov Chains, Springer-Verlag: New York, 1976.

[60] D. Cox, “Long-Range Dependence: A Review,” Statistics: An Appraisal, pp. 55 – 74, 1984.

[61] R. Riedi, M. Crouse, V. Ribeiro, and R. Baraniuk, “A Multifractal Wavelet Model with Application to Network Traffic,” IEEE Transactions on Information Theory, 45(3), pp. 992–1018, 1999.

[62] P. Abry, R. Baraniuk, P. Flandrin, R. Riedi, and D. Veitch, “Multiscale Nature of Network Traffic,” IEEE Signal Processing Magazine, 19(3), pp. 28–46, May 2002.

[63] V. Ribeiro, R. Riedi, and R. Baraniuk, “Wavelets and Multifractals for Network Traffic Modeling and Inference,” IEEE ICASSP, May 2001.

[64] J. Postel, “User datagram protocol,” RFC 768, Aug. 1980.

[65] P. Brockwell and R. Davis, Introduction to Time Series and Forecasting, Springer-Verlag, 1996.

[66] N. Merhav, M. Gutman, and J. Ziv, “On the Estimation of the Order of a Markov Chain and Universal Data Compression,” IEEE Transactions on Information Theory, vol. 35, pp. 1014–1019, September 1989.

[67] M. J. Weinberger, J. J. Rissanen, and M. Feder, “A Universal Finite Memory Source,” IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 643–652, 1995.

[68] W. Willinger, V. Paxson, R. H. Riedi, and M. S. Taqqu, “Long-Range Dependence and Data Network Traffic,” Long Range Dependence: Theory and Applications, Birkhäuser, pp. 373–407, 2002.

[69] P. Abry, P. Flandrin, M. Taqqu, and D. Veitch, “Wavelets for the Analysis, Estimation and Synthesis of Scaling Data,” Self Similar Network Traffic Analysis and Performance Evaluation, Wiley, 2000.

[70] R. J. Adler, R. E. Feldman, and M. Taqqu, A Practical Guide to Heavy Tails, Birkhäuser, 1998.

[71] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, Dover, pp. 564–565, 1972.

[72] Homepage of linux-wlan-ng device drivers, http://www.linux-wlan.org.

[73] Multifractal Wavelet Model Toolbox, http://www-dsp.rice.edu/software/mwm.shtml.

[74] V. Teverovsky and M. Taqqu, “Testing for Long-range Dependence in the Presence of Shifting Mean or a Slowly Declining Trend using a Variance-type Estimator,” Journal of Time Series Analysis, vol. 18, no. 3, pp. 279–304, May 1997.

[75] L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, February 1989.

[76] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, 1977.

[77] L. E. Baum, “An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of Markov Processes,” Inequalities, vol. 3, no. 1, pp. 1–8, 1972.

[78] T. Cover and J. Thomas, Elements of Information Theory, Wiley: New York, 1991.

[79] H. L. Van Trees, Detection, Estimation, and Modulation Theory, Part I, Wiley: New York, 2001.

[80] R. E. Blahut, Theory and Practice of Error Control Codes, Addison-Wesley, May 1984.

[81] E. N. Gilbert, “Capacity of a Burst Noise Channel,” Bell System Technical Journal, vol. 39, pp. 1253–1265, September 1960.

[82] M. Mushkin and I. Bar-David, “Capacity and coding for the Gilbert-Elliot channels,” IEEE Transactions on Information Theory, vol. 35, no. 6, pp. 1277–1290, November 1989.

[83] ISO/IEC JTC 1/SC29/WG11 and ITU-T SG16 Q.6, “Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264|ISO/IEC 14496-10 AVC),” Mar. 2003.

[84] ISO/IEC JTC 1/SC29/WG11, “Text of ISO/IEC 14496-2:2001 (Unifying N2502, N3301, N3056, and N3664),” Doc. N4350, July 2001.

[85] H.264/AVC Software Coordination webpage, http://iphome.hhi.de/suehring/tml.

[86] A. Natu and D. Taubman, “Unequal protection of JPEG2000 code-streams in wireless channels,” IEEE Globecom, Nov. 2002.

[87] M. Grangetto, E. Magli, G. Olmo, “Reliable JPEG 2000 wireless imaging by means of error-correcting coder,” IEEE ICME, June 2004.

[88] D. Krishnaswamy and S. Kalluri, “Multi-level weighted combining of retransmitted vectors in wireless communications,” IEEE VTC, 2006.

[89] C. E. Koksal and H. Balakrishnan, “Quality-aware routing metrics for time-varying wireless mesh networks,” IEEE Journal on Selected Areas in Communications (JSAC), to appear.

[90] J. Farber and K. Zeger, “Optimality of the natural binary code for quantizers with channel optimized decoders,” IEEE ISIT, July 2003.

[91] W. S. Lee, M. R. Pickering, M. R. Frater, and J. F. Arnold, “Error resilience in video and multiplexing layers for very low bit-rate video coding systems,” IEEE Journal on Selected Areas in Communications (JSAC), vol. 15, no. 9, pp. 1764–1774, 1997.

[92] X. Luo and G. B. Giannakis, “Energy-constrained optimal quantization for wireless sensor networks,” IEEE SECON, Oct. 2004.

[93] M. U. Ilyas and H. Radha, “End-to-end channel capacity of a wireless sensor network under reachback,” CISS, Mar. 2006.

[94] M. Godavarti and A. O. Hero III, “Diversity and degrees of freedom in wireless communications,” IEEE ICASSP, May 2001.

[95] H. Dong, I. D. Chakeres, A. Gersho, E. Belding-Royer, and J. D. Gibson, “Selective bit-error checking at the MAC layer for voice over mobile ad hoc networks with IEEE 802.11,” IEEE WCNC, Mar. 2004.

[96] L. Bononi, M. Conti, and E. Gregori, “Runtime Optimization of IEEE 802.11 Wireless LANs Performance,” IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 1, 2004.

[97] C-F. Chiasserini and E. Magli, “Energy-Efficient Coding and Error Control for Wireless Video-Surveillance Networks,” Telecommunication Systems, vol. 26, no. 2, pp. 369–387, 2004.

[98] Z-H. Tan, P. Dalsgaard, and B. Lindberg, “A subvector-based error concealment algorithm for speech recognition over mobile networks,” IEEE ICASSP, May 2004.

[99] W. S. Lee, M. R. Frater, M. R. Pickering, and J. F. Arnold, “A robust codec for transmission of very low bit-rate video over channels with bursty errors,” IEEE Transactions on Circuits and Systems for Video Technology (CSVT), vol. 10, no. 8, pp. 1403–1412, December 2000.

[100] L. Zhong, F. Alajaji, and G. Takahara, “A queue-based model for wireless Rayleigh fading channels with memory,” IEEE VTC, Sep. 2005.

[101] J. Postel, “Transmission control protocol,” RFC 793, Sep. 1981.

[102] Homepage of the ns-2 network simulator, http://www.isi.edu/nsnam/ns/.

[103] Homepage of the Qualnet network simulator, http://www.scalable-networks.com/.

PART B

SELF-PROPAGATING MALWARE DETECTION AT NETWORK ENDPOINTS

USING INFORMATION-THEORETIC TOOLS

CHAPTER B.1 INTRODUCTION

A recent and dramatic increase in automated network intrusions has necessitated defense mechanisms that can curb the spread of self-propagating malicious software³ (malware) in real-time. Moreover, the rapid evolution and mutation of malware call for detection of novel (i.e., previously unknown) attacks with few, if any, assumptions about the attack strategy. To that end, network-based anomaly detectors attempt to flag behavior that is anomalous or abnormal for a networked entity or a user [1]–[29]. The challenge of anomaly detection systems is the characterization of benign behavior. Most contemporary anomaly detectors are either (a) network-based systems that detect anomalies by observing unusual network traffic patterns [2]–[24] or (b) host-based systems that detect anomalies by monitoring an endpoint’s operating system (OS) behavior, for instance by tracking OS audit logs, processes, command-lines or keystrokes [25]–[29]. Contemporary anomaly detectors tend to be computationally complex with high false alarm rates and slow response times [30]–[32].

Since network endpoints⁴ are serving as extremely potent and viable launch pads and carriers for malware infections [34], [35], it is important that real-time and effective defenses be developed specifically for network endpoints. Recently, there has been some interest in network- and host-based malware detection at endpoints [25]–[29], [19]–[21] or at servers close to the endpoints [22]–[24]. Most of these studies leverage some characteristics of past malware for endpoint-based malware detection. While these malware characteristics hold true for some of the contemporary malware, their validity and efficacy are currently being questioned [37]–[42]. Consequently, there is a growing interest in developing behavioral signatures of benign/legitimate behavior [1]. Once a robust behavioral model is in place, malicious activity can be detected using deviations from benign behavior rather than relying on prior experiences of malicious activity.

³ Due to the present focus on detection of self-propagating malicious software, throughout this thesis the term malware corresponds to self-propagating malware.

⁴ “An endpoint is an individual computer system or device that acts as a network client and serves as a workstation or personal computing device.” [33].

The objective of behavioral anomaly detection is the characterization of an endpoint’s

benign behavior. Naturally, it is desirable to identify behavioral features that will get

perturbed if the endpoint is compromised by any (past, present, or future) self-

propagating malware. This work identifies such behavioral features using information-

theoretic tools and leverages these features for real-time malware detection at network

endpoints.

B.1.1 Overview of Contributions

To obtain benign behavioral profiles of end-users, we have spent 12 months

collecting traffic statistics of a diverse set of endpoints in home, office, and university

settings. An endpoint’s traffic profile contains information about session-level network

activity, such as one-way hashed source and destination IP addresses, session direction

(incoming or outgoing), source and destination ports, timestamps, and keystrokes that are

used to initiate sessions. For malicious activity, we use a diverse set of real and simulated

worms. These worms vary in their propagation rates and scanning techniques. We

evaluate the benign data profiles for behavioral features that get perturbed when the

endpoint is compromised by a self-propagating malware. Based on the identified features,

we propose three malware detection techniques.


The first malware detection technique proposed in this thesis is truly network-based

since it only uses traffic features for malware detection. This technique relies on the

premise that the vulnerabilities targeted by any malware are associated with a small

number of source or destination ports. Thus, on a compromised machine, the distribution

of source or destination ports on which a host communicates should be perturbed after

infection. Information-theoretic measures can quantify such perturbations in port

distributions. We first evaluate whether entropy of port distributions can be used to detect

worms. We observe that in many cases entropy cannot identify malware-related port

perturbations because it captures the variance of a distribution rather than the frequencies

of individual ports. As an alternative technique, we propose the use of the Kullback-

Leibler (K-L) divergence measure [43] to characterize perturbations in source and

destination port profiles as a means of detecting attempts of malware propagation. In our

framework, we record a small version of each endpoint’s benign traffic profile and

continuously compare it using the K-L divergence measure to the port histograms

observed in the last window of t time-units. Our results with the collected benign and

malware data show that K-L divergence of port histograms is perturbed significantly on

compromised endpoints, which allows very accurate detection of malicious activity by

simply observing each host’s traffic. We also experiment with three other information

divergence measures, namely the Jensen-Shannon (J-S) divergence, the K-divergence

and the resistor-average (R-A) divergence [44], [45]. However, these divergence

measures do not provide any substantial improvements over the K-L divergence.
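
The window-by-window comparison described above can be sketched as follows. This is a minimal illustration, not the implementation used in this thesis: the function names, the additive smoothing constant eps, the direction of the divergence (window relative to benign profile), and the toy port lists are all assumptions made for the example.

```python
import math
from collections import Counter


def port_histogram(ports, support, eps=1e-6):
    """Normalized histogram over `support`.

    Additive smoothing `eps` is an assumption of this sketch; it keeps every
    bin non-zero so that the K-L divergence stays finite.
    """
    counts = Counter(ports)
    total = sum(counts[p] for p in support) + eps * len(support)
    return {p: (counts[p] + eps) / total for p in support}


def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x)), in bits."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p)


# Hypothetical destination ports: a benign profile and two observation windows.
benign_ports = [80] * 70 + [443] * 20 + [53] * 10
clean_window = [80] * 7 + [443] * 2 + [53] * 1
infected_window = clean_window + [445] * 40  # scan traffic on one attack port

support = sorted(set(benign_ports) | set(clean_window) | set(infected_window))
benign = port_histogram(benign_ports, support)

d_clean = kl_divergence(port_histogram(clean_window, support), benign)
d_infected = kl_divergence(port_histogram(infected_window, support), benign)
print(f"clean window: {d_clean:.4f} bits, infected window: {d_infected:.4f} bits")
```

In this toy setting the clean window yields a divergence close to zero, while the window containing scans on a port that is rare in the benign profile yields a much larger value.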

We use a very small subset of K-L divergences derived from the normal user profiles

and the malware data to train support vector machines (SVMs) [46] for each endpoint.


The trained SVMs are then tested using all other malware which are embedded at

multiple random instances in the normal profiles. For all our experimental evaluations,

we observe almost 100% detection accuracy and negligible false alarm rates. We

compare the performance of our proposed malware detector with two existing anomaly

detectors, namely the maximum-entropy detector [14] and the rate-limiting detector [20].

We show that the proposed K-L/SVM-based detector provides consistently and

substantially better performance than the techniques of [14] and [20].

The remaining two malware detection techniques proposed in this thesis are joint

network-host anomaly detectors which exploit the observation that when a user is

actively using his/her computer most of the benign traffic is triggered by a small subset of

keystrokes and mouse clicks. Based on this observation, we propose to correlate the last

input from the keyboard or mouse hardware buffer with every new network session. We

use marginal keystroke data to show that the session initiation keys are not necessarily

the keys that an end-user presses most frequently. To effectively exploit the session-keystroke correlation

in a real-time and automated fashion, we propose two information-theoretic measures,

namely keystrokes’ entropy and session-keystroke mutual information [43].

We compute the keystrokes’ entropy and mutual information on a window-by-

window basis. We observe that the entropy is consistently low and mutual information is

somewhat high in the time windows containing benign data. However, once malicious

traffic with keystrokes drawn from the marginal keystroke distribution is inserted into the benign profile, there is a

significant increase in the entropy and simultaneously there is a decrease in the mutual

information. These entropy and mutual information perturbations arise because

many keys that are generally used very frequently by the users are never used to


initiate legitimate network activity. For a user who is active on his/her endpoint, the

malicious network sessions that are not initiated by the user are logged with unlikely and

diverse keystrokes thereby changing the keystrokes’ distribution.
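
As a rough illustration of the two measures, the sketch below computes sample entropy and sample mutual information for one window. The choice of the destination port as the session variable, the key codes, and the window contents are assumptions for the example only; the definitions actually used by the detectors are given in Chapter B.6.

```python
import math
from collections import Counter


def entropy(symbols):
    """Sample entropy H(X), in bits, of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(pairs):
    """Sample mutual information I(X; Y) = H(X) + H(Y) - H(X, Y), in bits."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return entropy(xs) + entropy(ys) - entropy(pairs)


# Hypothetical window: one (virtual key code, destination port) pair per session.
benign_window = [(0x0D, 80)] * 6 + [(0x01, 443)] * 2
# Hypothetical malicious sessions logged with unrelated, diverse keys.
malicious = [(0x41 + i, 445) for i in range(8)]
infected_window = benign_window + malicious

h_benign = entropy([k for k, _ in benign_window])
h_infected = entropy([k for k, _ in infected_window])
mi_benign = mutual_information(benign_window)
print(f"H(keys): benign={h_benign:.3f} bits, infected={h_infected:.3f} bits")
print(f"I(key; port) in the benign window: {mi_benign:.3f} bits")
```

In this toy window the keystroke entropy rises once sessions with diverse keys are added, which is the kind of perturbation the entropy detector looks for.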

To create an automated detection tool based on the keystroke distributions, we use a

small subset of the benign profiles to generate the joint and marginal distributions of

keystrokes and network sessions. Based on the statistics of these distributions, we

develop an entropy threshold above which, and a mutual information threshold below which, an alarm is raised. For

both entropy and mutual information based detectors, we observe almost 100% detection

accuracy and very low false-alarm rates. Overall the mutual information detector has

lower false alarm rates than the entropy detector. Nevertheless, both detectors provide

significantly better performance than the existing maximum-entropy and rate-limiting

detectors.

B.1.2 Organization of this Part

The rest of this part is structured as follows. Chapter B.2 describes related work in

this area. Chapter B.3 provides brief background on self-propagating malware and

support vector machines. Chapter B.4 details the benign endpoint profiles and malware

collected/simulated for this study. Chapter B.5 proposes a network-based information-

theoretic technique which detects malware by leveraging the K-L divergence between

benign and real-time traffic features in an SVM framework. Chapter B.6 presents two

other techniques which respectively employ entropy and mutual information of

keystrokes and network sessions to detect malware. Chapter B.7 identifies possible

attacks on the proposed malware detectors and discusses defenses against these attacks.


Chapter B.8 summarizes key conclusions of this part and outlines our future research

directions.


CHAPTER B.2 RELATED WORK

Most of the contemporary studies perform network-based anomaly detection at the

enterprise network perimeter or the local network perimeter. Zou et al. [2] propose a

malware warning center (MWC) and distributed ingress and egress sensors at a local

network’s perimeter. Similarly, Wu et al. [3] propose a network architecture and a

distributed algorithm to detect multi-vector worms. Schechter et al. [4] use a combination

of rate limiting and portscan detection in a local network worm detector. Jung et al. [5]

develop a network-level fast portscan detector that uses a threshold random walk (TRW)

on typical access patterns to infer whether a host is malicious or benign. Weaver et al. [6]

simplify the TRW algorithm to make it more amenable to hardware and software

implementations. The simplified algorithm of [6] can accurately detect very low rate

worms. Soule et al. [11] apply a Kalman filter to normal traffic and then use multiple

anomaly detection techniques to detect abnormal behavior. Kim et al. [12] propose that

gateway routers score each packet based on its legitimacy. Similarly, anomaly detectors

that monitor blocks of unused IP addresses are also becoming increasingly popular

[15]–[18].

There has been some recent interest in detecting malware at servers near the

endpoints. Whyte et al. [22] detect worms by monitoring (at the gateway router)

connections that are not preceded by a DNS address resolution request. Gupta and Sekar

[23] detect changes in traffic volume at a mail server to detect mass mailing worms.

Xiong [24] traces attachments at mail servers to detect mass mailing worms.


Barford et al. [9] use time-frequency signal analysis to develop a change detection

algorithm. Krishnamurthy et al. [10] propose a sketch-based change detection algorithm.

Lakhina et al. [7], [8] propose a subspace method to detect and characterize network-

wide volumetric traffic anomalies. The authors then extend their work in [13] and use

entropy to detect anomalies. Another recent study by Gu et al. [14] uses maximum-

entropy estimation to quantify a baseline distribution at a network gateway or router,

which is in turn used to classify anomalous activity using the K-L divergence.

The most commonly used endpoint-based network-level malware detection technique

is rate limiting. This technique proposed by Twycross and Williamson [19], [20] limits

the rate of an endpoint’s network traffic to curb and detect malware propagation. Sellke

et al. [21] extend rate limiting by proposing a branching worm propagation model and in

turn using this model to develop a window-based rate limiting mechanism.

Wong et al. [37], [38] show that rate limiting is not very effective on endpoints or

local network perimeter, but can provide effective malware throttling if deployed on

backbone routers. Panjwani et al. [42] evaluated whether portscans are precursors to

malicious attacks. It was concluded in [42] that over 50% of attacks are not preceded by a

portscan and, therefore, “port scans should not be considered as precursors to an attack.”

Moreover, Li et al. [39] show that statistical filtering-based defense mechanisms are

effective when they are adapted in accordance with an attack. In [39] it is also shown that

the performance of a statistical filter degrades significantly if the attacker is more

adaptive than the filter.

In the host-based anomaly detection context, most of the existing detectors

characterize benign user behavior by modeling commands given by a user in a textual OS


environment [26]–[29]. Due to the high market penetration of graphical operating

systems, it is important to model graphical behavioral features of end-users.

A recent technique called BINDER [25] correlates keystrokes with OS processes and

raises an alarm whenever a process is initiated without an end-user’s input. There are

important differences between BINDER and the detector proposed in this thesis. First,

BINDER is purely host-based and does not employ any network session information.

Second, BINDER cannot detect memory-resident malicious codes because its detector is

invoked only when a new process is created. (There have been many well-known worms

that were memory-resident; the two most famous examples are CodeRed II and Witty.)

Since our technique uses both network and host information, it can detect memory-

resident malware. Lastly, BINDER requires a whitelist of legitimate applications before

deployment. The detector proposed in this thesis can be deployed out-of-the-box after

which all training is done online.


CHAPTER B.3 BACKGROUND

In this section, we provide background material which is required to understand the

contributions of this part.

B.3.1 Self-Propagating Malware

Self-propagating malware is a recent term that is used to refer to a malicious code that

has the ability to spread from one compromised computer to another computer without

any human intervention. These malicious codes generally target vulnerabilities in

background processes or services that are continuously running on vulnerable hosts. After

compromising a vulnerable computer, a self-propagating malware tries to locate and

infect other vulnerable hosts on the network. The process of locating vulnerable hosts is

called scanning. Over the last few years, malware have evolved to use very sophisticated

scanning and infection techniques [41].

There are two prevalent types of self-propagating malware:

• Worms: A worm is a standalone malicious code that propagates copies of itself to

vulnerable computers;

• Bots: A bot is a malware which after infecting a computer contacts a central

command and control server. This server in turn makes the compromised computer

part of a bot network (botnet) of compromised computers. These botnets are

subsequently remote controlled by the central server.


After compromising vulnerable computers, malware can use these computers to

launch distributed denial of service (DDoS) attacks, relay spam or steal personal

information.

B.3.2 Support Vector Machines

Given training vectors $\mathbf{x}_i \in \mathbb{R}^n$, $i = 1, 2, \ldots, l$, in two classes, and a vector $\mathbf{y} \in \mathbb{R}^l$

such that each $y_i \in \{+1, -1\}$, a C-SVM for non-separable data considers the following

primal optimization problem [46]:

$$\min \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_{i=1}^{l} \alpha_i y_i \left( \mathbf{w}^T K(\mathbf{s}_i, \mathbf{x}) + b \right)$$

such that derivatives of the objective function vanish with respect to $\alpha_i$ and subject to the

constraint that $\alpha_i \ge 0$, $i = 1, 2, \ldots, l$. In the objective function, $\mathbf{w}$ is a vector perpendicular to the

hyperplane that separates the positive and negative points, $C$ is a parameter that is used

to cost the $\alpha_i$’s, $K(\mathbf{s}_i, \mathbf{x})$ is a non-linear kernel that maps the input data to another

(possibly infinite dimensional) Euclidean space, and the $\mathbf{s}_i$’s are points called the support

vectors that maximize the separation between the positive and negative examples. We use

a degree-3 radial basis kernel function to train the C-SVM.
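
For comparison with the formulation above, the soft-margin C-SVM primal is commonly stated in textbooks with slack variables and an explicit feature map. The symbols $\xi_i$ (slack), $\phi$ (feature map) and $\gamma$ (kernel width) below are the usual textbook notation and are assumptions here; they are not necessarily the notation of [46].

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
  \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_{i=1}^{l} \xi_i
\quad \text{subject to} \quad
  y_i\left(\mathbf{w}^T \phi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i,
  \qquad \xi_i \ge 0, \quad i = 1, 2, \ldots, l

K(\mathbf{x}_i, \mathbf{x}_j)
  = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)
  = \exp\left(-\gamma \,\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right)
```

The second line is the Gaussian radial basis kernel in its usual form; the parameter $C$ trades off margin width against the total slack.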


CHAPTER B.4 DATA COLLECTION AND SIMULATION

In this section, we explain the two main datasets collected for this study. The first

dataset comprises benign traffic and keystroke profiles collected from several hosts with

regular human users. The second dataset comprises real and simulated malware traffic.

Since university policy and user reservations prohibited us from infecting operational

endpoints with malware, we first identify network- and host-based features perturbed by

the introduction of malicious code into each system and then perform offline analysis by

inserting malicious traffic at random instances in the endpoints’ benign traffic profiles.

B.4.1 Benign Traffic-Keystroke Profiles

Our first step towards the development of a network-based malware detector was to

collect pertinent network and OS-based data. We started by investing up to 12 months in

monitoring network/OS profiles of a diverse set of 13 endpoints. Users of these

endpoints included home users, research students, and technical/administrative staff with

Windows 2000/XP laptop and desktop computers. The laptop endpoints were used by

their users both at home and at work. Some endpoints, in particular home computers,

were shared among multiple users. The endpoints used in this study were running

different types of applications, including peer-to-peer file sharing software, online

multimedia applications, network games, SQL/SAS clients etc.

Data were collected by a multi-threaded Windows application called argus, which

runs as a background process storing network and keystroke activity in a log file. The log


file is periodically and securely uploaded to a secure copy (SCP) server. argus only

logs session-level information where a session corresponds to bidirectional

communication between two IP addresses. Communication between the same pair of IP addresses

on different ports is considered part of the same network session. This session-level

granularity reduces the complexity of the malware detector, while providing complete

information about sessions originating from or terminating at an endpoint. Each session is

logged using the information contained in the first packet of the session. A session

expires if it does not send/receive a packet for more than τ seconds. In the collected data,

τ is set to 10 minutes.
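
The session definition above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the argus source code: the Packet type, the function name, and the use of the remote IP address alone as the session key (with the local endpoint implied) are hypothetical.

```python
from dataclasses import dataclass

SESSION_TIMEOUT_S = 600.0  # tau = 10 minutes, as in the collected data


@dataclass
class Packet:
    timestamp: float  # seconds
    remote_ip: str
    src_port: int
    dst_port: int


def new_sessions(packets, timeout=SESSION_TIMEOUT_S):
    """Yield the first packet of every session.

    A session is keyed by the remote IP only (ports are ignored), and it
    expires when no packet is seen for more than `timeout` seconds.
    """
    last_seen = {}  # remote_ip -> timestamp of the most recent packet
    for pkt in sorted(packets, key=lambda p: p.timestamp):
        previous = last_seen.get(pkt.remote_ip)
        if previous is None or pkt.timestamp - previous > timeout:
            yield pkt  # first packet of a new session: this is what gets logged
        last_seen[pkt.remote_ip] = pkt.timestamp


# Hypothetical trace: two packets to one host on different ports share a
# session; a packet after more than 10 idle minutes starts a new one.
trace = [
    Packet(0.0, "192.0.2.1", 49152, 80),
    Packet(5.0, "192.0.2.1", 49153, 443),
    Packet(10.0, "192.0.2.2", 49154, 53),
    Packet(700.0, "192.0.2.1", 49155, 80),
]
session_starts = [p.timestamp for p in new_sessions(trace)]
print(session_starts)
```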

For each logged session, argus also logs the last keystroke or mouse click that was

pressed before the first packet of the session. We generically refer to keyboard and mouse

inputs as keystrokes or keys in this thesis. The last keystroke is associated with a session

only if the key was pressed no more than λ seconds before the session. If there was no

key pressed in the last λ seconds before a session then a void keystroke value of zero is

inserted. In the collected traces, λ is set to 10 seconds. Throughout this thesis, we only

focus on sessions with non-zero keys. We assume that the last pressed key has initiated

the associated session, that is, an inherent correlation relationship is assumed between the

last key and the consequent session. Clearly, this correlation will not be present when a

malicious code is trying to propagate from an oblivious end-user’s computer, and hence

perturbations in the session-keystroke correlation can be leveraged at that point to detect

the malicious code.
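
The keystroke association rule can be sketched as below. The function and constant names are hypothetical, and the handling of a key pressed exactly λ seconds before the session (treated here as valid) is an assumption.

```python
import bisect

KEY_WINDOW_S = 10.0  # lambda = 10 seconds, as in the collected traces
VOID_KEY = 0         # value logged when no recent keystroke exists


def session_initiation_key(session_time, key_times, key_codes, window=KEY_WINDOW_S):
    """Return the code of the last key pressed at or before `session_time`,
    or VOID_KEY if that key is older than `window` seconds.

    `key_times` must be sorted in ascending order and aligned with `key_codes`.
    """
    idx = bisect.bisect_right(key_times, session_time) - 1
    if idx < 0 or session_time - key_times[idx] > window:
        return VOID_KEY
    return key_codes[idx]


# Hypothetical keystroke log (seconds, virtual key codes).
key_times = [1.0, 4.0, 30.0]
key_codes = [0x41, 0x0D, 0x01]

print(session_initiation_key(5.0, key_times, key_codes))   # key pressed 1 s earlier
print(session_initiation_key(20.0, key_times, key_codes))  # last key is 16 s old
```

Sessions that receive the void value are the ones excluded from the session-key analysis in this thesis.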

Each entry of the log file has the following seven fields:


<session id, direction, protocol, src port, dst port,

timestamp, virtual key code>,

whose explanation is given below:

• session id: 20-byte SHA-1 hash [47] of the concatenated hostname and remote IP

address. Hashing preserves privacy, which is important because the collected data are

going to be publicly available;

• direction: one byte flag indicating outgoing unicast, incoming unicast, outgoing

broadcast, or incoming broadcast packets;

• protocol: transport-layer protocol (i.e., TCP or UDP) of the packet;

• src port: source port of the packet;

• dst port: destination port of the packet;

• timestamp: millisecond-resolution time of session initiation;

• virtual key code: one byte virtual key code, as defined by Microsoft’s MSDN

library [48], of the last (keyboard or mouse) keystroke that was pressed before the

session. In view of our stringent privacy considerations, we only log the very last

keystroke that was pressed right before the first packet of a new session. Throughout

this thesis, we refer to this jointly collected session and keystroke data as session-key

or key-session data. Moreover, keystrokes observed in this joint profile are referred to

as the session initiation keys.
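
One possible in-memory representation of a log entry is sketched below. The class and function names, the UTF-8 encoding and plain concatenation used before hashing, the numeric direction code, and the sample values are assumptions; only the seven fields and the 20-byte SHA-1 session identifier come from the description above.

```python
import hashlib
from dataclasses import dataclass


def make_session_id(hostname: str, remote_ip: str) -> bytes:
    """20-byte SHA-1 digest of the concatenated hostname and remote IP.

    The encoding and the absence of a separator are assumptions of this sketch.
    """
    return hashlib.sha1((hostname + remote_ip).encode("utf-8")).digest()


@dataclass(frozen=True)
class LogEntry:
    session_id: bytes      # 20-byte SHA-1 hash
    direction: int         # one-byte flag; the numeric codes are assumed
    protocol: str          # "TCP" or "UDP"
    src_port: int
    dst_port: int
    timestamp_ms: int      # millisecond-resolution session initiation time
    virtual_key_code: int  # one byte; 0 means no recent keystroke


entry = LogEntry(
    session_id=make_session_id("host-a", "192.0.2.7"),
    direction=0,            # assumed code for outgoing unicast
    protocol="TCP",
    src_port=49152,
    dst_port=80,
    timestamp_ms=1_136_073_600_000,
    virtual_key_code=0x0D,  # Enter key (VK_RETURN)
)
print(entry.session_id.hex(), len(entry.session_id))
```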

Some pertinent statistics of the collected benign data are listed in Table 6. Diversity

of the endpoints used in this study is evident from Table 6, which shows that the

endpoints operate in different environments (and hence run different types of

applications). Also, the total size of the dataset (i.e., total number of sessions) varies from


11,996 for endpoint 13 to 444,345 for endpoint 4. In general, we observed that home

computers generate significantly higher traffic volumes than office and university

computers because: (i) they are generally shared between multiple users, and (ii) they run

peer-to-peer and multimedia applications. The high traffic volumes of home computers

are also evident from the high mean sessions per second [column 4].

Another interesting observation is that, with the exception of home computers, the

observed endpoints generally use a small set of source and destination ports very

frequently [columns 5 and 6]. (The source and destination port frequencies in Table 6 are

computed for outgoing unicast packets.) This observation holds particularly true for

destination ports because in most cases ten destination ports are used approximately 90%

of the times – endpoints 3 and 4 being the exceptions here. This is a preliminary

indication that port usage is a statistic that is somewhat consistent across endpoints, and

therefore can be leveraged to detect malicious activity. Also, later in the thesis it is shown

that the different benign behavior of home endpoints poses a considerable challenge to

malware detectors.

Table 6. Statistics of the Benign Profiles

Endpoint  Endpoint Type       Total     Mean Session  Cumulative Freq of  Cumulative Freq of  Cumulative Freq of
ID        (Home/Office/Univ)  Sessions  Rate (sps)    Ten Most-Used       Ten Most-Used       Ten Most-Used
                                                      Src Ports (%)       Dst Ports (%)       Session Keys (%)
1         Office              33,487    0.25          90.37               88.06               96.01
2         Office              21,066    0.22          47.8                87.53               92.32
3         Home                373,009   1.92          3.95                37.29               94.01
4         Home                444,345   5.28          5.86                10.82               94.86
5         Home/Univ           27,873    0.44          15.91               99.27               95.25
6         Univ                60,979    0.19          54.95               94.0                95.49
7         Univ                171,601   0.28          40.7                96.75               95.56
8         Univ                41,809    0.52          66.1                96.44               96.13
9         Univ                235,133   0.41          44.1                94.84               95.48
10        Univ                152,048   0.21          75.19               95.11               95.27
11        Univ                207,187   0.31          38.85               95.2                95.14
12        Home/Univ           100,702   0.33          24.78               95.0                95.13
13        Univ                11,996    0.23          44.56               95.98               95.95

The last important observation is that without exception all of the observed endpoints

use a small set of session initiation keys very frequently [column 7]. (The session

initiation key frequencies in Table 6 are computed for outgoing unicast packets with

non-zero keys.) In fact, on all hosts more than 90% of the sessions are initiated using 10

keys. This is a preliminary indication that the correlation of the session-key data is

consistent across endpoints and therefore can be leveraged to detect malicious activity.

The joint session-key data described above provides us correlated information of

keystroke and sessions. In other words, this data can be used to develop a joint session-

key probability distribution. In addition to the correlated/joint data, the keystroke-based

detectors proposed later in this thesis also require marginal distributions of keystrokes.

That is, we need a distribution of all the keystrokes that are pressed on an endpoint. The

following section describes this data.

B.4.2 All-Keystrokes’ Profiles

To develop a marginal distribution of keystrokes, we had to log all the keys that are

pressed on a host. Due to strict privacy constraints imposed by the university, and due in

part to user reservations, it was not possible to collect such data on all the participating

hosts. We installed a custom-developed keylogger on two computers [endpoints 5 and

12] and collected keystroke data for more than a month. Each entry of the keylogger


contains two fields: <timestamp, keystroke>, which are in the same format as

described in the last section.

This dataset is referred to as the all-keys data. For the remaining endpoints, an

average of the all-keys data of endpoints 5 and 12 is used for the keystrokes’ marginal

distribution. This marginal keystroke distribution is simply a normalized histogram of the

frequency of usage of the keystrokes.
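
A minimal sketch of the normalized histogram and of the averaging step is given below. The text does not state whether the two endpoints' raw logs are pooled or their distributions are averaged; the sketch assumes the latter, and the function names and key sequences are hypothetical.

```python
from collections import Counter


def marginal_distribution(key_codes):
    """Normalized histogram: key code -> relative frequency of usage."""
    counts = Counter(key_codes)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def average_distributions(p, q):
    """Element-wise average of two distributions over the union of their keys."""
    return {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in set(p) | set(q)}


# Hypothetical all-keys logs from two endpoints (virtual key codes).
keys_a = [65, 65, 66, 13]
keys_b = [13, 13, 32, 32]

p_a = marginal_distribution(keys_a)
p_b = marginal_distribution(keys_b)
p_avg = average_distributions(p_a, p_b)
print(sorted(p_avg.items()))
```

Averaging two normalized histograms gives each endpoint equal weight regardless of how many keystrokes it logged, which differs from pooling the raw logs when the log sizes differ.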

In addition to benign data, we have also collected malware data generated by real

malicious codes. The following section explains collection of the malicious traffic data.

B.4.3 Malware Classification

To generate traffic patterns for each malware, we infected a vulnerable machine with

a malware and observed the traffic generated by the malware using the argus data

utility described in the previous section. (The vulnerable machines used here are different

from the operational endpoints used for benign profile collection.) This section details the

malware collected and simulated in this study. Before we describe malware data

collection, explanation of some terminology is in order.

After compromising a vulnerable host, a malware tries to infect other computers by

sending out scan packets with infectious payloads. A vulnerable machine gets infected if

it receives and processes a scan packet. Throughout this text, scan packets generated by a

malware after compromising a vulnerable host are referred to as outgoing scan packets.

Based on the outgoing scan packets, we classify malware into two broad categories:

• Destination-port malware: destination ports of scan packets are fixed, but the source

ports may be arbitrary;


• Source-port malware: source ports of scan packets are fixed, but the destination ports

may be arbitrary.

In the former case, we call the destination ports of a malware attack ports and source

ports non-attack ports. In the latter case, the roles are reversed and we call source ports

attack and destination ports non-attack. With the exception of the Witty worm [59],

[60], all contemporary malware are destination-port malware. However, the above

classification is important to understand later results. Note that a source/destination port

malware can be multi-vector [41] targeting multiple vulnerabilities simultaneously. We

now describe the malware used in this study.

B.4.4 Real Malware

A critical aim of our study is to use real and diverse malware data to test our detection

techniques. To this end, we installed original and unpatched releases of Windows 2000

and Windows XP on a computer using Microsoft Virtual PC 2004 [49]. The advantage of

using virtual machines (VMs) was that once a virtual host was infected, we could

reinstall it by overwriting just a few key files. We assigned static IP addresses to both

virtual machines and connected them to the Internet. These hosts were then compromised

by the following malware: Zotob.G [50], Forbot-FU [51], Sdbot-AFR [52], and

Dloader-NY [53]. We also requested network administrators and research

collaborators in our university to share malware binaries and source codes with us. This

way we acquired SoBig.E@mm [54] and the C source code of MyDoom.A@mm [55],

which are mass-mailing worms. Finally, we downloaded binaries or source codes of the

following worms from the Internet: Blaster [56], Rbot-AQJ [57], and RBOT.CCC

[58].


Table 7 shows the diversity of the malware used in this thesis. The malware have

different (and sometimes multiple) attack ports and transport protocols. Also, these

malware include both high- and low-rate malware; Dloader-NY has the highest scan

rate of 46.84 scans per second (sps), while MyDoom-A and Rbot-AQJ have very low

scan rates of 0.14 and 0.68 sps, respectively. We show later that the low-rate

MyDoom-A and Rbot-AQJ are more difficult to detect than high-rate malware. Blaster is one

of the two worms that are used to generate negative examples for SVM training later in

the document.

All real malware collected for this study fall into the widely prevalent category of

destination-port malware. While these malware provided us with a good base for

evaluating our proposed techniques, we wanted to test our methods against an even

broader class of attacks. Consequently, we simulated three additional malware that were

somewhat different from the ones described above. These simulated malware and their

distinguishing characteristics are described next.

Table 7. Information of Malware Used in This Study

Malware       Release Date  Avg. Scan Rate (sps)  Port(s) Used
Blaster       Aug 2003      10.5                  TCP 135, 4444, UDP 69
Dloader-NY    Jul 2005      46.84                 TCP 135, 139
Forbot-FU     Sep 2005      32.53                 TCP 445
MyDoom-A      Jan 2006      0.14                  TCP 3127–3198
RBOT.CCC      Aug 2005      9.7                   TCP 139, 445
Rbot-AQJ      Oct 2005      0.68                  TCP 139, 769
Sdbot-AFR     Jan 2006      28.26                 TCP 445
SoBig.E       Jun 2003      21.57                 TCP 135, UDP 53
Zotob.G       Jun 2003      39.34                 TCP 135, 445, UDP 137
Witty         Mar 2004      357.0                 UDP 4000
CodeRed II    Jul 2004      4.95                  TCP 80
Sim Src Port  Simulated     3.57                  TCP 1500


B.4.5 Simulated Malware

The first malware simulated for this study is the Witty worm [59], [60]. Among

other distinguishing characteristics, this worm has two unique properties that are of direct

consequence here: (a) it uses a fixed source port 4000 to propagate, while the destination

port is selected randomly; and (b) after every 20,000 transmitted packets Witty

overwrites a random block on the hard disk of the compromised host. Therefore, Witty

not only falls in the rare source-port malware category, but it also potentially crashes

compromised hosts after dispatching only 20,000 scan packets. On an endpoint with

broadband connectivity, Witty demonstrates an average scan rate of 357 sps, peaking

out at 970 sps [60]. At this rate, 20,000 scan packets can be transmitted (and the infected

host crashed) in less than a minute, which presents a tremendous challenge to real-time

detectors. We simulated the Witty worm using the exact pseudo random number

generator parameters and pseudo code provided in [60]. We only test the worst-case

scenario with 20,000 scan packets at the average scan rate of 357 sps.
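
As a quick check of the timing claim, dividing the 20,000-packet budget by the two scan rates quoted above gives the time available to a detector before the first disk overwrite. The symbol $t_{\text{crash}}$ is introduced here only for illustration.

```latex
t_{\text{crash}}^{\text{avg}} = \frac{20\,000 \text{ packets}}{357 \text{ packets/s}} \approx 56.0 \text{ s},
\qquad
t_{\text{crash}}^{\text{peak}} = \frac{20\,000 \text{ packets}}{970 \text{ packets/s}} \approx 20.6 \text{ s}
```

Both values are below one minute, so a detector has to react within that time to be useful against Witty.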

In addition to Blaster, we employ Witty as the second worm for training the

SVMs in the network-based malware detector proposed in the following chapter. To

comprehensively evaluate the performance of the proposed detector for source port

malware, we simulate a worm that sends scan packets with a fixed TCP source port of

1500 at an average scan rate of 3.57 sps; note that this scan rate is exactly one-hundredth of

Witty’s average scan rate, which makes this simulated worm challenging to

detect.

The last simulated malware of this study is an HTTP worm. We acknowledge that it

is unlikely that an endpoint will be running a service that can be infected by an HTTP


malware. Nevertheless, we simulate an HTTP worm because such worms use destination port

80, which is a very common port in the benign profile of an endpoint. Thus it is quite

challenging for network-based frequency/histogram detectors to detect malicious HTTP

traffic. We simulate the HTTP-based CodeRed II worm [61] using an average scan

rate of 4.95 sps [62]. Table 7 gives additional information about the simulated malware.

B.4.6 Inserting Malware Data in Benign Traffic Profiles

A vulnerable VM was infected with each of the malicious codes. We then used

argus to log malicious traffic traces from the VM in the same format as the benign

session-key data. While this provided us complete information about the malicious

sessions, we did not have information about the keystrokes that a user will be pressing

when a malicious code is trying to propagate after compromising his/her machine. The

only way to realistically generate such data is to infect participating endpoints with

malicious codes without informing the user of that endpoint. Clearly, such a procedure is

not possible. Therefore, for each malicious session we generate an associated keystroke

using the marginal keystroke distribution generated from the all-keys data.

Armed with this information, we insert T minutes of malicious traffic data of each

malicious code in the benign session-key profile of each endpoint at a random time

instance. Specifically, for a given endpoint’s benign session-key profile, we first generate

a random infection time $t_I$ (with millisecond accuracy) between the endpoint’s first and

last session times. Given $n$ malicious sessions starting at times $t_1, \ldots, t_n$, where $t_n \le T$,

we create a special infected profile of each host with these sessions appearing at times

$t_I + t_1, \ldots, t_I + t_n$. Thus in most cases once a malware’s traffic is completely inserted

into a benign profile, the resultant profile contains interleaved benign and malicious

sessions starting at $t_I$ and ending at $t_I + t_n$. For all malware used in this study, we use

$T = 15$ minutes.
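
The insertion procedure can be sketched as follows. The function name, the representation of a session as a (time in milliseconds, record) tuple, and the toy profiles are assumptions made for illustration.

```python
import random

T_MS = 15 * 60 * 1000  # T = 15 minutes, expressed in milliseconds


def insert_malware(benign, malicious, rng=random, duration_ms=T_MS):
    """Merge malicious sessions into a benign profile at a random time.

    `benign` is a time-sorted list of (time_ms, record) tuples; `malicious`
    holds (offset_ms, record) tuples measured from the start of the malware
    trace. Returns (t_infect, merged profile sorted by time).
    """
    first, last = benign[0][0], benign[-1][0]
    t_infect = rng.randint(first, last)  # infection time, millisecond resolution
    shifted = [(t_infect + t, rec) for t, rec in malicious if t <= duration_ms]
    merged = sorted(benign + shifted, key=lambda item: item[0])
    return t_infect, merged


# Hypothetical profiles; records are placeholders for full log entries.
benign = [(0, "b0"), (1_000, "b1"), (5_000, "b2")]
malicious = [(10, "m0"), (20, "m1")]
t_infect, infected = insert_malware(benign, malicious, random.Random(7))
print(t_infect, infected)
```

Passing a seeded random.Random instance, as in the usage lines, makes the infection time reproducible across runs.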

We are now ready to use the infected profiles to characterize traffic and keystroke

perturbations observed when an endpoint is compromised by a malicious code. In the

following two chapters, we propose two malware detection techniques that use the data

described in this section.

CHAPTER B.5 MALWARE DETECTION USING TRAFFIC FEATURES

In this chapter, we propose the first of the two information-theoretic malware

detection techniques developed in this thesis. This technique is purely network-based and

does not utilize the keystroke data described in the last section. Thus in this chapter

malware are detected using only traffic perturbations. Like prior endpoint-based studies,

throughout this thesis we focus solely on outgoing unicast traffic since incoming packets

can be easily blocked using firewalls.

We observe that the vulnerabilities targeted by all malware are associated with a

small number of source or destination ports. Thus, on a compromised machine the

distribution of source or destination ports on which a host communicates should be

perturbed after infection. These perturbations can be quantified using information-

theoretic measures. This chapter evaluates the efficacy of using port perturbations as

features for malware detection and identifies appropriate measures to quantify these

perturbations.

B.5.1 Malware Detection Using Sample Entropy

Lakhina et al. [13] in a recent work showed that sample entropy of source and

destination ports observed at a border router can reveal traffic anomalies. We first

evaluate whether entropy is an appropriate feature to detect traffic anomalies at an

endpoint and then propose an alternative framework that significantly surpasses the

performance of prior detectors when used at endpoints.

B.5.1.1 Entropy of Source and Destination Ports

Entropy characterizes the degree of dispersal (or concentration) of a probability

distribution without regard to the actual values of the random variable under

consideration. This degree of dispersal is characterized by the variance of a probability

distribution. To compute sample traffic entropy as proposed in [13], we generate usage

frequency histograms of source and destination ports for outgoing packets using a

20-second window (other window sizes produce qualitatively similar results). Source and

destination port histograms for each window are computed by counting the number of

times a particular port is used during the window.

Let $S_n$ and $D_n$ denote the sets of source and destination ports observed in window

$n$, respectively. Define $X_n = \{p_i^n, i \in S_n\}$ and $Y_n = \{q_j^n, j \in D_n\}$ to be respectively

the source and destination port histograms derived from window $n$, where $p_i^n$ is the

number of times source port $i$ was used in time-window $n$ and $q_j^n$ is the number of

times destination port $j$ was used in time-window $n$. Also let us define

$p^n = \sum_{i \in S_n} p_i^n$ as the aggregate frequency of source ports observed in window $n$ and

$q^n = \sum_{j \in D_n} q_j^n$ as the corresponding frequency of destination ports. Then sample

entropies of the source and destination port histograms for window $n$ can be computed

as:

$$H(X_n) = -\sum_{i \in S_n} \frac{p_i^n}{p^n} \log_2 \frac{p_i^n}{p^n} \quad \text{and} \quad H(Y_n) = -\sum_{j \in D_n} \frac{q_j^n}{q^n} \log_2 \frac{q_j^n}{q^n}. \qquad \text{(B.49)}$$

If there is no traffic in a window $n$ (i.e., $p^n = 0$ or $q^n = 0$) then malware detection is

not performed. For simplicity, we refer to sample entropy as entropy and the normalized

port histograms as port distributions.
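The computation in (B.49) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `(time_s, port)` packet layout, and the toy packet list are hypothetical, and windows without traffic are simply absent from the result rather than evaluated.

```python
import math
from collections import Counter

def sample_entropy(ports):
    """Sample entropy (bits) of the ports seen in one window, per (B.49)."""
    counts = Counter(ports)            # p_i^n for each port i in S_n
    total = sum(counts.values())       # p^n
    if total == 0:
        return None                    # no traffic: detection is skipped
    # p * log2(1/p) is the same as -p * log2(p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def windowed_entropy(packets, window_s=20):
    """packets: iterable of (time_s, port). Returns {window index: entropy}."""
    windows = {}
    for t, port in packets:
        windows.setdefault(int(t // window_s), []).append(port)
    return {n: sample_entropy(p) for n, p in sorted(windows.items())}

# Hypothetical destination ports of five outgoing packets.
packets = [(0.5, 80), (3.0, 80), (7.2, 21), (12.0, 80), (25.0, 443)]
entropies = windowed_entropy(packets)
```

In this toy trace the first window mixes two ports and therefore has non-zero entropy, while the second window contains a single port and has zero entropy.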

B.5.1.2 Entropy-based Traffic Perturbations in the Infected Profiles

Figure 35 shows source and destination port entropies of four different endpoints

infected with one random instance of Blaster, MyDoom-A, Rbot-AQJ, and Witty.

[Figure 35: four panels, each plotting destination port entropy and source port entropy against time window: (a) endpoint 1, Blaster; (b) endpoint 5, MyDoom; (c) endpoint 9, Rbot-AQJ; (d) endpoint 13, Witty.]

Figure 35. Source and destination port entropies at infected endpoints. Infection start times are marked with a circle. Infections in (a), (b), and (c) last approximately 15 minutes, while that in (d) lasts approximately one minute. Each non-overlapping time-window is 20 seconds.

Figure 35 shows that the attack port entropies do not reveal any discernible perturbations.

However, in some cases entropy perturbations in non-attack ports can provide useful

information about infection. For instance, Figure 35 (a) shows that the entropy of source

ports exhibits a sudden increase at the time of infection. Similar behavior is observed in

Figure 35 (d), where the non-attack (destination) ports’ entropy jumps at the infection

time. This phenomenon is not observed for low-rate (MyDoom-A and Rbot-AQJ)

malware as shown in Figure 35 (b) and (c). The jump in non-attack ports’ entropy for

high-rate malware is due to the fact that most endpoints initiate only a few sessions

during any given time-window [see Table 6]. Once compromised, while the attack port is

fixed, an endpoint starts communicating through a large number (i.e., one per scan

packet) of non-attack ports. Thus the degree of dispersal of non-attack ports increases

dramatically, in turn leading to an increase in the entropy. Since low-rate malware do not

initiate a lot of simultaneous sessions, no perturbations in the non-attack ports are

discernible for low-rate malware. Thus we conclude that entropy cannot detect

perturbations caused by low-rate malware.

Results of Figure 35 are at odds with [13] which showed that the entropy of

destination ports was perturbed significantly on compromised networks. The failure of

entropy-based anomaly detection in Figure 35 is due to the huge difference in the volume

of traffic observed at a network’s perimeter as opposed to that observed at an endpoint.

During an attack, a perimeter router still observes a considerable amount of traffic on

benign ports, thus perturbing the port distributions enough so as to allow entropy-based

detectors to discern the attack. However, this phenomenon does not occur at individual

endpoints as explained by the following example. Consider two windows of activity

observed by an endpoint’s entropy-based anomaly detector. The first window has benign

activity with 9 HTTP sessions on port 80 and 1 FTP session on port 21. The second

window contains malicious activity with 900 malicious sessions on port 135 and 100

malicious sessions on port 500. After normalization, both of these windows will render

the same port distribution $\{p_i^n / p^n, i \in S_n\} = \{0.9, 0.1\}$, which is a Bernoulli

distribution with parameter 0.9. Consequently, the entropy of both malicious and benign

windows will be exactly the same although the traffic behavior in each case is completely

different. Thus anomalous activity can go undetected because entropy does not take the

actual values of source/destination ports into consideration.
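The example above can be checked numerically with a short sketch. The helper name `entropy_bits` is an assumption of this illustration; the session counts are those of the example.

```python
import math

def entropy_bits(counts):
    """Entropy (bits) of a {port: session count} histogram."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

benign = {80: 9, 21: 1}            # 9 HTTP sessions, 1 FTP session
malicious = {135: 900, 500: 100}   # 1000 malicious sessions

h_benign = entropy_bits(benign)
h_malicious = entropy_bits(malicious)
print(h_benign, h_malicious)       # both are about 0.469 bits
```

Both windows normalize to the distribution {0.9, 0.1}, so the two entropies coincide even though neither the ports nor the session volumes have anything in common.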

Results presented in this section show that the entropy-based framework of [13] is not

a robust indicator of infection at endpoints: attack port entropies reveal no discernible perturbations,

and non-attack port entropies are not perturbed by low-rate malware. Entropy fails to highlight

anomalies because it does not take the actual values of the source and destination ports

into consideration. To address this problem, the following section employs an

information-theoretic measure that compares port probabilities in the current window to

the corresponding port probabilities in an endpoint’s benign profile.

B.5.2 Malware Detection Using Information Divergence

At this point, we have established that for effective malware detection we need to use

a measure that compares the frequencies of individual ports. Information divergence

measures can provide such comparison. In this section, we evaluate four information

divergence measures. First, we evaluate the widely-used Kullback-Leibler divergence.

B.5.2.1 Kullback-Leibler Divergence of Source and Destination Ports

The Kullback-Leibler (K-L) divergence [43] is an information-theoretic measure of

the similarity or dissimilarity between two probability distributions. Let us denote the

benign source and destination port histograms derived from an endpoint’s benign profile

as $X = \{p_i, i \in S\}$ and $Y = \{q_j, j \in D\}$, where $S$ and $D$ respectively denote the sets

of source and destination ports observed in the benign profile. Then the K-L divergence

between the benign and currently observed port histograms can be expressed as:

$$D(X_n \| X) = \sum_{i \in S_n} \frac{p_i^n}{p^n} \log_2 \frac{p_i^n / p^n}{p_i / p} \quad \text{and} \quad D(Y_n \| Y) = \sum_{j \in D_n} \frac{q_j^n}{q^n} \log_2 \frac{q_j^n / q^n}{q_j / q}, \qquad \text{(B.50)}$$

where $p = \sum_{i \in S} p_i$ and $q = \sum_{j \in D} q_j$ respectively represent the aggregate source and

destination port frequencies observed in the benign profile. Note that K-L divergence is

an asymmetric measure. The advantages of using window-based metrics $X_n$ and $Y_n$ as

primary distributions of the K-L divergence are twofold: (a) fewer sessions are observed

in a window as opposed to the benign profile, $|S_n| < |S|$ and $|D_n| < |D|$, which

reduces the complexity of real-time detection; and (b) better detection accuracy can be

achieved if we focus on the specific ports engaged in communication during the current

window $n$.

We generate port histograms of benign profiles using the first 100 sessions on an

endpoint. The training time for the endpoints of this study ranged from 12 hours to 5

days with an average of approximately 2 days. We train with only 100 sessions to

quantify worst-case performance of the proposed detector.

To effectively leverage K-L divergence in the present endpoint-based anomaly

detector, we introduce the following provisions. First, in (B.50) if $p_i = 0$ and $p_i^n > 0$

for any $i$, then $D(X_n \| X)$ is set to $\infty$. The same problem arises for $D(Y_n \| Y)$ in (B.50).

In other words, $X_n$ and $Y_n$ must be absolutely continuous with respect to $X$ and $Y$, respectively.

To achieve this, before training we initialize the benign histograms with $p_i = 1$ and

$q_i = 1$, for $i = 0, \ldots, 65535$, which assigns never-used ports very small, non-zero

frequencies.
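A minimal sketch of (B.50) together with the first provision follows. The function names and the toy training and window data are assumptions of this illustration; the constant-factor normalization described next is not applied here.

```python
import math
from collections import Counter

NUM_PORTS = 65536

def benign_histogram(training_ports):
    """Benign counts with every port initialized to 1, so no count is zero."""
    hist = Counter(training_ports)
    total = sum(hist.values()) + NUM_PORTS   # aggregate p, including the ones
    return hist, total

def kl_divergence(window_ports, benign):
    """D(X_n || X) in bits, summed over the ports seen in the window only."""
    hist, p = benign
    window = Counter(window_ports)
    p_n = sum(window.values())
    d = 0.0
    for port, count in window.items():
        w = count / p_n                      # p_i^n / p^n
        b = (hist.get(port, 0) + 1) / p      # initialized benign frequency
        d += w * math.log2(w / b)
    return d

# Hypothetical destination ports: 100 training sessions, then two windows.
benign = benign_histogram([80] * 90 + [21] * 10)
d_usual = kl_divergence([80] * 9 + [21], benign)
d_attack = kl_divergence([135] * 900 + [500] * 100, benign)
```

Unlike entropy, the divergence separates the two windows of the earlier example: the window on never-seen ports 135 and 500 yields a larger value than the window on the usual ports 80 and 21, and both values stay finite because of the initialization.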

Second, it is well-known that scaling of training data improves the performance of

learning tools by making the training process better behaved and by mitigating the bias

towards larger input values. Therefore, we normalize the K-L divergence values by a

constant factor.

Finally, to reduce complexity and to filter out noise due to benign data, we introduce

a provision to ignore overtly benign behavior. From the training data, we generate a

histogram of session volume (i.e., total number of sessions) in a window. After

normalization, we compute the histogram’s mean $\mu_e$ and variance $\sigma_e^2$ for each endpoint

$e$. We invoke malware detection only when the total number of sessions observed in a

window is greater than $\gamma = \mu_e + \sigma_e$. The value of $\gamma$ varied between 3 and 13

sessions per 20-second window, with an average of 6.6 sessions per 20-second window,

for the endpoints considered in this study.
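The volume check can be sketched as follows. This is an illustration only: it computes the mean and standard deviation directly from per-window session counts rather than from a normalized histogram, it assumes the population standard deviation, and the training counts are hypothetical.

```python
import statistics

def volume_threshold(sessions_per_window):
    """gamma = mu_e + sigma_e from the training windows of one endpoint."""
    mu = statistics.mean(sessions_per_window)
    sigma = statistics.pstdev(sessions_per_window)   # population std. dev.
    return mu + sigma

def should_run_detection(session_count, gamma):
    """Detection is invoked only for windows busier than gamma."""
    return session_count > gamma

# Hypothetical session counts observed in ten 20-second training windows.
training_counts = [2, 3, 1, 4, 2, 3, 5, 2, 3, 5]
gamma = volume_threshold(training_counts)
```

With these counts the threshold is roughly 4.26 sessions per window, so a window with 5 sessions is evaluated and a window with 4 sessions is ignored.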

[Figure 36: six panels, each plotting destination port K-L divergence and source port K-L divergence against time window: (a) endpoint 1, Blaster; (b) endpoint 5, MyDoom; (c) endpoint 9, Rbot-AQJ; (d) endpoint 3, SoBig; (e) endpoint 10, Zotob; (f) endpoint 13, Witty.]

Figure 36. Source and destination ports’ K-L divergences at infected endpoints.

B.5.2.2 K-L-based Traffic Perturbations in the Infected Profiles

The K-L divergences of different endpoints randomly infected with a single infection

of each malware are outlined in Figure 36. We first focus on malware with high scan

rates. From Figure 36 (a), (d), (e), and (f), it is clear that the K-L divergence highlights

anomalous behavior in both attack and non-attack ports for malware with high scan rates.

Comparing Figure 36 (a), (d), (e), and (f) with entropy-based perturbations of Figure 35

(a) and (d) establishes the effectiveness of using a port-by-port divergence measure to

highlight traffic anomalies. Specifically, a K-L-based anomaly detector can reveal

perturbations in the attack port distribution, which is an important characteristic that was

completely missed by entropy. Moreover, for high scan rate malware of Figure 36 (a),

(d), (e), and (f), perturbations in the non-attack port distribution in the K-L divergence are

much more pronounced than the entropy perturbations. These perturbations are revealed for

both destination-port [i.e., Blaster, Zotob.G, and SoBig.E] and source-port [i.e., Witty]

malware.

For the low-rate malware [MyDoom-A and Rbot-AQJ], Figure 36 (b) and (c) show

obvious perturbations in the attack port divergence. Comparing these two figures with

Figure 35 (b) and (c) clearly establishes the advantages of using K-L-based detection features

as opposed to entropy. Due to the low rate of these malware, even the K-L divergence

cannot reveal non-attack-port perturbations. Nevertheless, our results show that

perturbations in the attack port feature are more than sufficient to detect infection.

B.5.2.3 Evaluating Traffic Perturbations with Other Information Divergences

In the last section, we observed that the attack ports’ K-L divergence always gets

perturbed on a compromised endpoint. However, the non-attack ports are not perturbed

for low-rate worms. In this section, we evaluate three other information-theoretic

divergence measures with the objective of identifying a measure which can

simultaneously highlight perturbations in both attack and non-attack port distributions for

low-rate worms. A brief description of the three measures follows.

As before, let $X$ and $Y$ respectively represent the source and destination port

distributions observed in the benign profiles, and $X_n$ and $Y_n$ respectively represent the

source and destination port distributions observed in window $n$. The first information

measure that we employ is the Jensen-Shannon (J-S) divergence measure [44] defined

as:

$$J(X_n \| X) = H(\pi_1 X_n + \pi_2 X) - \pi_1 H(X_n) - \pi_2 H(X)$$

$$\text{and} \quad J(Y_n \| Y) = H(\pi_1 Y_n + \pi_2 Y) - \pi_1 H(Y_n) - \pi_2 H(Y), \qquad \text{(B.51)}$$

where $\pi_1$ and $\pi_2$ are weighting factors such that $\pi_1 + \pi_2 = 1$, and $H(\cdot)$ is the entropy

function.

The second information divergence measure used in this work is the K directed

divergence measure defined as [44]:

$$K(X_n \| X) = \sum_{i \in S_n} \frac{p_i^n}{p^n} \log_2 \frac{p_i^n / p^n}{\frac{p_i^n}{2 p^n} + \frac{p_i}{2 p}} \qquad \text{(B.52)}$$

$$\text{and} \quad K(Y_n \| Y) = \sum_{j \in D_n} \frac{q_j^n}{q^n} \log_2 \frac{q_j^n / q^n}{\frac{q_j^n}{2 q^n} + \frac{q_j}{2 q}},$$

where parameters of the above expressions are defined in (B.49) and (B.50).

The third and last information measure that we use is the Resistor-Average (R-A)

divergence measure defined as [45]:

$$\frac{1}{R(X_n \| X)} \equiv \frac{1}{D(X_n \| X)} + \frac{1}{D(X \| X_n)}$$

$$\text{and} \quad \frac{1}{R(Y_n \| Y)} \equiv \frac{1}{D(Y_n \| Y)} + \frac{1}{D(Y \| Y_n)}, \qquad \text{(B.53)}$$

where $D(\cdot \| \cdot)$ is the K-L divergence.
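The three measures can be sketched as follows. This is an illustration under stated assumptions: both distributions are given as already-normalized probability lists aligned over a shared, hypothetical port list with strictly positive entries (so that both K-L directions in the R-A divergence are finite), and equal weights are assumed for the J-S divergence.

```python
import math

def entropy(p):
    """Entropy in bits of a probability list."""
    return sum(x * math.log2(1 / x) for x in p if x > 0)

def kl(p, q):
    """K-L divergence D(p || q) in bits; q must be non-zero where p is."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def js(p, q, pi1=0.5, pi2=0.5):
    """Jensen-Shannon divergence of (B.51) with weights pi1 + pi2 = 1."""
    mix = [pi1 * x + pi2 * y for x, y in zip(p, q)]
    return entropy(mix) - pi1 * entropy(p) - pi2 * entropy(q)

def k_div(p, q):
    """K directed divergence of (B.52)."""
    return sum(x * math.log2(x / (x / 2 + y / 2)) for x, y in zip(p, q) if x > 0)

def resistor_average(p, q):
    """R-A divergence of (B.53): 1/R = 1/D(p||q) + 1/D(q||p)."""
    d_pq, d_qp = kl(p, q), kl(q, p)
    if d_pq == 0 or d_qp == 0:
        return 0.0
    return 1 / (1 / d_pq + 1 / d_qp)

# Hypothetical normalized port distributions over three shared ports.
window = [0.8, 0.1, 0.1]     # window distribution
benign = [0.5, 0.3, 0.2]     # benign-profile distribution
r = resistor_average(window, benign)
```

The sketch makes two properties visible: the R-A value does not change when the arguments are swapped, and it never exceeds the smaller of the two K-L divergences from which it is built.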

Figure 37 shows source and destination port perturbations characterized by J-S, K

and R-A divergences. We only show perturbations for the low-rate MyDoom worm

because perturbations due to high-rate worms are adequately highlighted by the K-L

divergence. Figure 37 clearly shows that none of the three divergences under

consideration can highlight perturbations in the source (non-attack) ports. Thus in the

non-attack ports’ context, the divergences under consideration do not provide any

advantages. Also, from Figure 37 (a) and (b) it can be seen that even the destination

(attack) port perturbations in the J-S and K divergences are not as clear and pronounced as

in the K-L case [see Figure 36 (b)]. The R-A divergence, however, provides clear

perturbations in the destination ports [Figure 37 (c)]. This divergence measure also has

the advantage of being symmetric, i.e., $R(X_n \| X) = R(X \| X_n)$ and

$R(Y_n \| Y) = R(Y \| Y_n)$. On the other hand, R-A divergence is more complex than K-L

divergence because it requires computation of two K-L divergences. One of these

divergences has to be computed over the entire sample space of the benign profile,

thereby presenting a significant complexity overhead for an endpoint. Hence, in problem

areas where divergence symmetry is important and complexity is not an issue, R-A

divergence is more appropriate than K-L divergence. In the present malware detection

context, we continue to use the K-L divergence for the remainder of this chapter.

In the following section, we train a machine learning tool using the K-L divergence of

benign and malicious data, which is then used for automated malware detection and

comparison of our approach with prior methods.

[Figure 37: three panels for endpoint 5, MyDoom, each plotting the destination port and source port divergence against time window: (a) J-S divergence; (b) K divergence; (c) resistor-average divergence.]

Figure 37. Jensen-Shannon (J-S), K- and resistor-average (R-A) divergences of source and destination ports at infected endpoints.

B.5.3 Leveraging K-L Perturbations in an SVM-based Framework

To use K-L divergences of source and destination ports for automated malware

detection, we first used a simple thresholding mechanism where K-L values above and

below a certain threshold were flagged as anomalous. This simple technique, however,

resulted in high false alarm rates. Consequently, in this section we resort to the

sophisticated support vector machines (SVMs) [46] for real-time malware detection. We

first train the SVMs using K-L divergence values derived from a subset of the benign

profiles and malware data. The SVMs are then used to detect malware in the infected

profiles. We also compare the performance of the proposed detector with the techniques

proposed in [14] and [20].

From contemporary machine learning tools, we select SVMs to classify K-L

divergence values because: (a) SVMs are not probabilistic in nature. Probabilistic

intrusion detectors generally do not take the basic rate of incidence into account, thus

yielding low Bayesian detection rates. (b) SVMs employ a small subset of training

examples (called support vectors) for classification, and all remaining examples are

irrelevant to the classification task. Thus SVMs can train with very few positive (benign)

and negative (malware-based) examples, allowing timely and low-complexity training.

Few negative examples also improve detection rates for novel malware. (c) SVMs are

inherently designed for binary-decision tasks, such as anomaly detection.

B.5.3.1 SVM Training

In this section, we use a small subset of malicious and benign data to train support

vector machines for real-time detection of malware using the K-L divergence. Given

positive and negative training examples, an SVM finds a classification boundary that

maximizes the distance between the two classes, while minimizing classification error.

We use a degree-3 radial basis kernel function to train a C-SVM [46].

It should be noted that the use of malware-based, negative training examples does not

compromise the proposed detector’s ability to detect novel malware. The negative

examples only provide a rough quantification of the magnitude of K-L perturbations on a

compromised endpoint. This quantification can be provided by any malware that

highlights perturbations in the source and/or destination ports’ distributions. In general, a

high-rate source-port malware in conjunction with a high-rate destination-port malware

can encompass the increase and decrease in the K-L divergences of source and

destination ports. Malware traces can be hardcoded into the detector, and the training

algorithm can merge the malware traces with an endpoint’s traffic logs to compute the

negative K-L divergence examples.

We use the source and destination ports’ K-L divergence values to train two SVMs

for each endpoint. To train the SVMs, we take ten K-L divergence values from the

benign traffic profile. These values comprise the positive examples. We then take a total

of 13 negative examples by computing K-L divergence of benign traffic windows with

Blaster- and Witty-infected windows. Performance evaluations in the next section

illustrate that this small subset of the available training data can provide highly accurate

detection of novel malware, where the term novel refers to all the remaining malware not

used for SVM training.

B.5.3.2 Performance Evaluation and Comparison with Existing Techniques

In this section, we compare the performance of the proposed malware detector with

two existing techniques proposed in [14] and [20]. The rate-limiting detector [20] is the

only other technique that is designed specifically for endpoints and the maximum-entropy

detector [14] is one of the only two information-theoretic anomaly detection techniques.

We use the same parameter values and learning/detection algorithms that were

employed in [14] and [20]. We also tried to compare with the entropy-based technique by

Lakhina et al. [13]. However, we observed that it was impractical to migrate the detector

of [13] to endpoints because the detector required projection of high-dimensional feature

metrics into benign and anomalous subspaces at a border router. On an endpoint, the

same technique will result in only 3 possible subspaces, and in most cases it is not

possible to classify them as benign and anomalous using the thresholding technique of

[13].

[Figure 38: two panels plotted against endpoint ID (1 to 13) for the proposed K-L/SVM-based detector, the maximum-entropy detector, and the rate-limiting detector: (a) average detection rate (%); (b) average false-alarm rate (%).]

Figure 38. Comparison of detection and false-alarm rates of the proposed K-L/SVM-based malware detector with maximum-entropy and rate-limiting detectors. Each point is averaged over 12 malware with 100 random infections per malware per endpoint.

We inserted 100 non-overlapping infections of each malware in every endpoint’s

benign profile. As discussed earlier, each infection was approximately $T = 15$ minutes,

with the exception of Witty that had each infection lasting approximately $T = 1$

minute (i.e., 20,000 packets at 357 sps). Hence, all results provided in this section are

averaged over one hundred experiments per endpoint per malware. We compute detection

and false alarm rates for each experiment as follows. For 100 infections of a particular

malware on an endpoint, the percentage detection rate for that malware is computed by

simply counting the number of infections that are detected by the malware detector. The

false alarm rate is computed by taking the ratio of the total number of false alarms to

the total number of evaluated time-windows (i.e., windows with one or more sessions).
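The two metrics can be written down in a few lines. The function names and the counts below are hypothetical and serve only to make the definitions concrete; they are not results of this study.

```python
def detection_rate(detected_flags):
    """Percentage of inserted infections that were detected."""
    return 100.0 * sum(detected_flags) / len(detected_flags)

def false_alarm_rate(false_alarms, evaluated_windows):
    """Percentage of evaluated windows (one or more sessions) flagged in error."""
    return 100.0 * false_alarms / evaluated_windows

# Hypothetical outcome of 100 infections of one malware on one endpoint.
flags = [True] * 97 + [False] * 3
dr = detection_rate(flags)
# Hypothetical counts: 9 false alarms over 2000 evaluated windows.
far = false_alarm_rate(9, 2000)
```

Note that the detection rate is counted per infection, whereas the false alarm rate is counted per evaluated time-window.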

The average detection and false alarm rates for each endpoint are shown in Figure 38.

It can be seen in Figure 38 (b) that the proposed K-L/SVM-based detector has negligible

false alarm rates at all endpoints. The highest false alarm rate we observed was

approximately 0.45%, with endpoints 1, 6, 7, and 8 exhibiting almost no false alarms

at all. Also, the detection rate of the proposed technique is 100% for all endpoints except

endpoints 2 and 4; for endpoints 2 and 4, some instances of the low-rate MyDoom-A

and Rbot-AQJ worms were not detected. Nevertheless, even for endpoints 2 and 4, the

average detection rate is above 90%. Hence, overall the proposed K-L/SVM-based

malware detector provides very high accuracy for the diverse set of endpoints considered

in this study.

Let us now compare the proposed detector and the maximum-entropy detector of

[14]. Figure 38 (a) shows that the proposed K-L/SVM-based detector provides much

higher detection rates than the maximum-entropy detector. Also, for the maximum-

entropy detector, the false alarm rates for the home endpoints [endpoints 3 and 4] are

extremely high. We believe that the high false alarm rates are due to peer-to-peer

applications running on the home endpoints of this study. Moreover, the maximum-entropy

detector was designed for deployment at the perimeter where even in a short period of

time most of the 2,348 packet classes of [14] were observed. On an endpoint, many of

these classes are not present in the benign training data. We observed that even if the

maximum-entropy training is performed using a lot of benign data, the performance still

does not improve. (The maximum-entropy model was trained using 100 and 1000

benign sessions, but the performance in both cases was identical.) Also note that due to

the use of a sliding window, the maximum-entropy detector has higher training

complexity and incurs an inherent detection delay that is not present in our detector. The

run-time complexities of the two techniques are comparable as the maximum-entropy

technique requires frequent computation of K-L divergence over a large sample space of

2,348 outcomes, whereas our technique computes K-L divergence over small sample

spaces followed by SVM classification.

For the rate-limiting detector [20], the detection rates of all endpoints except endpoint

2 are much lower than those of the proposed K-L/SVM-based detector. Also, much like the

maximum-entropy detector, the false-alarm rates for home endpoints are quite high. (A

false alarm is raised when the rate-limiter reports an anomaly, but the session queue of

the rate limiter has no malicious sessions.) Thus the performance of the rate-limiting

detector, although better than the maximum-entropy detector, is still much worse than the

K-L/SVM-based detector proposed in this thesis. The inferior performance of the rate-

limiting detector shows that simply monitoring traffic volume at an endpoint is not

sufficient. In addition to session volume, the actual characteristics of the traffic must also

be taken into account for accurate detection.

Based on the results of this section, we conclude that the K-L/SVM-based malware

detector proposed in this chapter provides significantly better performance than the

techniques of [14] and [20].

[Figure 39: flow diagram. Training components: network traces of a source port worm and a destination port worm; generate source and destination port histograms from benign profile data until more than d benign sessions are observed; store benign histograms of source and destination ports, and µ and σ; train SVMs using K-L divergence of benign and worm data, yielding the SVM parameters. Real-time components: observe all sessions on an endpoint; once the time since last detection exceeds t seconds, form the source and destination port histograms of the last window; if sessions in last window > µ + σ, compute source and destination ports’ K-L divergences using benign and window-based histograms; use SVMs to classify the source and destination ports’ K-L values as benign or anomalous.]

Figure 39. A generalized flow diagram of the proposed K-L/SVM-based malware detector. The shaded area contains real-time components.

B.5.4 Summary and Discussion

Figure 39 outlines the data flow of the proposed malware detection technique. In

summary, once deployed the detector initially uses d sessions to characterize benign

source and destination port histograms. Traces of a high-rate source port and a high-rate

destination port malware are hardcoded in the detector. K-L divergence of the benign and

malware-based histograms is used to train SVMs. Parameters of the SVMs are then used

for real-time detection of malware. After every window of t seconds, the detector checks

whether the total number of sessions in the window is more than what was statistically

observed in the benign profile. If so, the detector computes the K-L divergence between

the window-based histograms and the benign histograms. The trained SVMs are then

used to classify the source and destination port K-L divergences.
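The real-time portion of this flow can be sketched as follows. This is not the implementation evaluated in this chapter: the `classify` callable is a hypothetical stand-in for the two trained SVMs, the fixed threshold of 12 bits in the usage lines is arbitrary, and the constant-factor normalization of the K-L values is omitted.

```python
import math
from collections import Counter

def kl_bits(window, benign, benign_total):
    """D(window || benign) in bits, with benign counts initialized to 1."""
    n = sum(window.values())
    return sum((c / n) * math.log2((c / n) / ((benign.get(port, 0) + 1) / benign_total))
               for port, c in window.items())

def detect(windows, benign_src, benign_dst, gamma, classify):
    """Yield (window index, verdict) for windows busy enough to evaluate.

    windows: list of lists of (src_port, dst_port) sessions, one list per window.
    classify: callable(src_kl, dst_kl) -> bool, standing in for the trained SVMs.
    """
    src_total = sum(benign_src.values()) + 65536
    dst_total = sum(benign_dst.values()) + 65536
    for n, sessions in enumerate(windows):
        if len(sessions) <= gamma:
            continue                       # overtly benign volume: skip
        src = Counter(s for s, _ in sessions)
        dst = Counter(d for _, d in sessions)
        yield n, classify(kl_bits(src, benign_src, src_total),
                          kl_bits(dst, benign_dst, dst_total))

# Hypothetical benign histograms and two windows: a quiet one and a scan.
benign_src = Counter({1025: 50, 1026: 50})
benign_dst = Counter({80: 90, 21: 10})
quiet = [(1025, 80)] * 3
scan = [(2000 + k, 135) for k in range(200)]
alarms = list(detect([quiet, scan], benign_src, benign_dst, gamma=6.6,
                     classify=lambda s, d: s > 12 or d > 12))
```

The quiet window is never evaluated because its session volume is below the threshold, while the scanning window is evaluated and flagged by the stand-in classifier.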

In this chapter, we proposed a network-based malware detector that can detect self-

propagating malicious codes in real-time by leveraging the K-L divergence of benign and

real-time traffic features. In the following chapter, we use the data collected in this thesis

to develop another malware detection technique that correlates both network and host/OS

features to detect self-propagating malware.

CHAPTER B.6 MALWARE DETECTION USING JOINT NETWORK-HOST FEATURES

Traditional anomaly detectors are either host- or network-based. We argue that

significant improvements can be achieved if both network and host features are correlated

and then employed in a joint framework. To that end, in this chapter we propose two

endpoint-based joint network-host anomaly detectors both of which exploit the

observation that when a user is actively using his/her computer most of the benign traffic

is triggered by a small subset of keystrokes and mouse clicks. Based on this observation,

we propose to correlate the last input from the keyboard or mouse hardware buffer with

every new network session in a novel entropy-based information-theoretic framework.

B.6.1 Correlation in the Session-Key Data

As mentioned before, we focus solely on outgoing unicast traffic. Also, for the

present anomaly detector we only focus on the scenario when the end-user is actively

using his/her computer, although he/she may not be accessing the Internet. This is

achieved by only processing sessions with non-zero keystroke values; recall that a zero

keystroke value implies that no key was pressed right before the session. Detection when

a user is inactive cannot employ keystroke data, thereby requiring purely network-based

approaches.

Figure 40 shows the normalized frequencies of the 20 most-used session initiation

keys for two endpoints. In both cases, more than 85% of the time network sessions are

initiated by the left mouse click or the Enter key. (Similar results are

observed for the remaining endpoints.)

[Figure 40: normalized histograms (frequency versus virtual key code) of the 20 most-used session initiation keystrokes for (a) endpoint 5 and (b) endpoint 12.]

Figure 40. Normalized histograms of 20 most-used session initiation keystrokes. Histograms are generated from the session-key data. Virtual key codes 1 and 13 correspond to the left mouse click and the Enter key, respectively [48].

Figure 41 shows the normalized histograms of all

the keystrokes that are pressed on a host. Note that the all-keys distribution looks quite

different from the session-key distribution of Figure 40. For one thing, the marginal all-keys distribution of Figure 41 is much more spread out than the session-key distribution of Figure 40; that is, the variance of the marginal all-keys distribution is greater than that of the session-key distribution. Also, in contrast to the session-key histogram, the two most commonly used keys account for less than 50% of all key presses. Lastly, neither the left mouse click nor the Enter key is among the two most commonly used keys in either Figure 41 (a) or (b). These results can be summarized as follows: (i) users frequently

employ only a few session initiation keys to trigger network sessions, thus there is strong

correlation between these few session initiation keys and network sessions; (ii)

frequencies of session initiation keys are very consistent across different users,

consequently making this a common benign feature that can be leveraged to detect

abnormal behavior; (iii) frequencies of keys that are generally used on a host are quite

different from frequencies of session initiation keys.
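The contrast between the two histograms can be illustrated with a minimal sketch. The key logs below are hypothetical toy values (not the data collected for this study), and the helper names `normalized_histogram` and `top_k_share` are introduced only for illustration.

```python
from collections import Counter


def normalized_histogram(key_codes):
    """Map each virtual key code to its relative frequency in key_codes."""
    counts = Counter(key_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: count / total for key, count in counts.items()}


def top_k_share(key_codes, k=2):
    """Fraction of all observations covered by the k most frequent key codes."""
    frequencies = sorted(normalized_histogram(key_codes).values(), reverse=True)
    return sum(frequencies[:k])


# Hypothetical toy logs, not the thesis data (1 = left mouse click, 13 = Enter).
session_keys = [1] * 6 + [13] * 3 + [40]
all_keys = [40] * 3 + [38] * 2 + [1, 13, 17, 32, 65]

print(top_k_share(session_keys))  # share of the two dominant session-initiation keys
print(top_k_share(all_keys))      # share of the two dominant keys overall
```

In this toy example the two dominant session-initiation keys cover a much larger share of observations than the two dominant keys in the all-keys log, which is the qualitative behavior described above.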

Figure 41. Normalized histograms of the 20 most-used keystrokes (x-axis: virtual key code; y-axis: frequency) for (a) endpoint 5 and (b) endpoint 12. Histograms are generated from the all-keys data. Virtual key codes 40, 38 and 17 correspond to the down arrow key, the up arrow key and the control key, respectively [48].

Based on the above discussion, we deduce that session-key correlation is a feature that is common across users and can be used for malware detection. There are two information-theoretic measures that can formally leverage this observation for real-time worm detection. The first measure is the entropy of the keystroke histogram observed in a time window. Since entropy quantifies the degree of dispersal or concentration of a

probability distribution, Figure 41 suggests that the keystroke entropy in a malware-infected window should be higher than in benign windows, where only a few keystrokes are being used to initiate sessions. The second information-theoretic measure that we use

to quantify the keystroke perturbations is mutual information. From Figure 40 it can be

deduced that in a benign time window mutual information of sessions and keystrokes that

are used to initiate the sessions should be very high. On the other hand, in a malware-

infected window this mutual information should decrease as the keystrokes will be drawn

from the marginal all-keys distributions. The following sections formally describe the

entropy and mutual information based detectors.

B.6.2 Malware Detection Using Keystroke Entropy

B.6.2.1 Definition of Keystroke Entropy

Entropy is an information-theoretic measure that can capture the spread/variance of a distribution quite effectively [43]. Define $X_n = \{p_i^n,\ i \in K_n\}$ as the histogram of keystrokes in a time-window $n$, where $p_i^n$ is the number of times keystroke $i$ was used in time-window $n$. Note that due to MSDN’s virtual key code definition, $K_n = \{1, 2, \ldots, 255\}$. Let $p^n = \sum_{i \in K_n} p_i^n$ be the aggregate frequency of keystrokes observed in window $n$. Then the sample entropy of the keystroke histogram for window $n$ is

$$H(X_n) = -\sum_{i \in K_n} \frac{p_i^n}{p^n} \log_2 \frac{p_i^n}{p^n}. \tag{B.54}$$

If there is no traffic in a window $n$ (i.e., $p^n = 0$) then malware detection is not performed. Based on previous results, we know that for legitimate sessions, $X_n$ has small variance and therefore the keystrokes’ entropy should be low. On the other hand, once a self-propagating malicious code starts initiating sessions, the keystrokes will be drawn from the marginal keystroke distribution of the all-keys data. Hence the variance and consequently the entropy of $X_n$ should increase.

We compute keystroke entropy on a window-by-window basis. The results reported in this chapter use a window size of 60 seconds. In each window with one or more sessions, we compute the keystroke histogram $X_n$, which is used in equation (B.54) to compute the entropy. The marginal keystroke histogram is generated from the first 500 entries of the all-keys data.
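A minimal sketch of the per-window entropy computation of equation (B.54) is given below. It assumes that the session-initiation keys of one window are available as a list of virtual key codes; the function name `keystroke_entropy` is hypothetical, and keys with zero count are skipped under the usual convention that $0 \log 0 = 0$.

```python
import math
from collections import Counter


def keystroke_entropy(window_keys):
    """Sample entropy in bits of the session-initiation keys seen in one window.

    Returns None for a window without sessions, mirroring the rule that
    detection is not performed when the aggregate frequency is zero.
    """
    counts = Counter(window_keys)
    total = sum(counts.values())
    if total == 0:
        return None
    # Only keys that actually occurred contribute (0 * log 0 is taken as 0).
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical windows: one dominated by a single key, one spread over four keys.
concentrated = keystroke_entropy([1, 1, 1, 1])
dispersed = keystroke_entropy([1, 13, 40, 38])
```

A window in which every session is initiated by the same key yields zero entropy, whereas a window whose keys are spread evenly over four codes yields two bits, which is the direction of change the detector relies on.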

B.6.2.2 Entropy Perturbations in the Infected Profiles

We use the infected profiles described in Section B.4.6 to evaluate the performance of

the entropy-based detector throughout this chapter. Since the present detector does not

rely on source and destination ports, there is no need to evaluate against the simulated

malware described in Section B.4.5. Therefore, throughout this chapter we only focus on

detection using the 9 real worms collected for this study. When we used keystroke-

entropy for detection of randomly inserted infections, we observed a number of noisy

spikes due to variations in benign user behavior. We use a median filter to remove the

spikes that arise due to inherent changes in legitimate user behavior. Henceforth, all

results use an order-7 median filter.
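The effect of the order-7 median filter can be sketched as follows. The text does not state whether the filter window is centered or causal; the sketch assumes a causal window over the current and the six preceding values, which suits real-time use, and the function name `median_filter` is hypothetical.

```python
from statistics import median


def median_filter(values, order=7):
    """Causal sliding median over the last `order` values.

    The window is shorter at the start of the series, where fewer than
    `order` values are available (an assumption of this sketch).
    """
    return [
        median(values[max(0, i - order + 1): i + 1])
        for i in range(len(values))
    ]


# Hypothetical entropy series: an isolated spike followed by a sustained shift.
spike = [1.0] * 5 + [9.0] + [1.0] * 4
shift = [1.0] * 7 + [5.0] * 7

filtered_spike = median_filter(spike)
filtered_shift = median_filter(shift)
```

An isolated one-window spike is removed entirely, while a sustained change passes through once it occupies the majority of the window, i.e., after four windows for an order-7 filter.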

The entropies of different endpoints randomly infected with a single infection of a

malicious code are outlined in Figure 42. It can be observed in Figure 42 that keystrokes’


entropy clearly highlights anomalous behavior in all cases. The increase in entropy is

revealed for both high- and low-rate malware, and for endpoints with high and low

session rates. Thus we conclude that entropy of keystroke histograms is a robust feature

that can be leveraged for self-propagating malware detection on network endpoints.

Figure 42. Entropy of the keystroke histograms at infected endpoints (x-axis: time window; y-axis: keystroke entropy): (a) endpoint 1, Blaster; (b) endpoint 3, Forbot-FU; (c) endpoint 6, MyDoom-A; (d) endpoint 9, Rbot-AQJ; (e) endpoint 11, SoBig.E; (f) endpoint 13, Zotob.G. Infection start times are marked with a circle. Infections last approximately 15 minutes. Each non-overlapping time-window is 60 seconds.

B.6.3 Malware Detection Using Session-Key Mutual Information

In this section, in addition to the keystroke distribution, we also characterize the

session information in a probabilistic framework. We show that the conditional mutual

information of the session and keystroke distributions can clearly highlight anomalous

behavior.

B.6.3.1 Mutual Information of Sessions and Keys

Mutual information [43] is an information-theoretic measure of the statistical dependence between two random variables. Consider two random variables $X$ and $Y$ with marginal distributions $p(x)$ and $p(y)$, and a joint distribution $p(x, y)$. The mutual information of these random variables is defined as

$$I(X;Y) = \sum_{x} \sum_{y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}. \tag{B.55}$$

Mutual information is a non-negative measure of the dependence between $X$ and $Y$, with $I(X;Y) = 0$ when $X$ and $Y$ are independent. In general, $I(X;Y)$ increases with an increase in the correlation between $X$ and $Y$.
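The definition in (B.55) can be checked numerically with a short sketch. The joint distributions below are hypothetical toy examples, and `mutual_information` is an illustrative helper that takes the joint pmf as a dictionary keyed by $(x, y)$ pairs.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits for a joint pmf given as {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    # Zero-probability cells contribute nothing to the sum.
    return sum(
        p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


# Hypothetical joint pmfs: independent variables versus fully dependent ones.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
dependent = {(0, 0): 0.5, (1, 1): 0.5}

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

For the independent case the measure is zero, and for two perfectly coupled binary variables it equals one bit.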

To leverage mutual information in the present context, we define $X$ as a binary random variable which characterizes whether or not a session was initiated in the last time window. That is,

$$X \in \{\, 0 \Rightarrow \text{no session in time window},\ \ 1 \Rightarrow \text{one or more sessions in time window} \,\}.$$

Moreover, we define $Y$ as a random variable characterizing the keystrokes’ probability distribution. Specifically, the marginal distribution $p(Y)$ is simply the normalized all-keys histogram, such as the ones shown in Figure 41. Then the session-keystroke mutual information can be written as:

$$I(X;Y) = \sum_{y \in K_n} \left[ p(x=0, y) \log_2 \frac{p(x=0, y)}{p(x=0)\,p(y)} + p(x=1, y) \log_2 \frac{p(x=1, y)}{p(x=1)\,p(y)} \right]. \tag{B.56}$$

We derive the marginal distribution $p(X)$ using the first 500 entries of each endpoint’s benign session-key profile. More specifically, $p(X)$ is computed by counting the total number of windows $n$ with one or more sessions between the 1st session and the 500th session. We also count the total number of windows $N$ (with and without sessions) in that time frame. Then, $p(X=1) = n/N$ and $p(X=0) = 1 - p(X=1)$. The joint distribution $p(x=1, y=j)$ then simply corresponds to the joint probability that a network session was initiated using keystroke $j$.
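The window-counting estimate of the session marginal can be sketched as below. The sketch assumes that session start times are available in seconds and that windows are aligned to the first training session; both the alignment and the function name `session_prior` are assumptions made for illustration.

```python
def session_prior(session_times, window=60.0):
    """Estimate (p(X=1), p(X=0)) from session start times given in seconds.

    n is the number of windows containing at least one session and N is the
    total number of windows between the first and the last training session.
    """
    if not session_times:
        raise ValueError("at least one session is required")
    start = min(session_times)
    occupied = {int((t - start) // window) for t in session_times}
    total_windows = max(occupied) + 1
    p_x1 = len(occupied) / total_windows
    return p_x1, 1.0 - p_x1


# Hypothetical training sessions: 3 of the 5 spanned windows contain a session.
p_x1, p_x0 = session_prior([0.0, 10.0, 70.0, 250.0])
```

In practice the list would hold the start times of the first 500 benign sessions of an endpoint.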

From the data collection chapter, we know that the keystroke information is not logged when there are no network sessions in a window. That is, we do not have the distribution $p(x=0, y)$. Hence we cannot use the mutual information expression of (B.56) in its present form. To resolve this problem, we employ a partial mutual information measure $I(X=1;Y)$, which only uses the $p(x=1, y)$ probability distribution. Since the partial mutual information employs only one outcome of the random variable $X$, it can be written as

$$I(X=1;Y) = \sum_{y} p(x=1, y) \log_2 \frac{p(x=1, y)}{p(x=1)\,p(y)}. \tag{B.57}$$

Note that due to the binary nature of the session random variable $X$, the partial mutual information $I(X=1;Y)$ is in fact the self-information of $p(x=1, y)$ normalized by $p(x=1)$. For brevity, we continue to refer to this measure as mutual information.
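One possible reading of the partial mutual information is sketched below. The text does not spell out how the per-window joint distribution is normalized, so the sketch assumes $p(x=1, y) = p(x=1)\,p(y \mid x=1)$ with $p(y \mid x=1)$ taken as the window's normalized session-key histogram. The `floor` parameter, which guards against keys absent from the all-keys marginal, and the toy distributions are likewise hypothetical.

```python
import math
from collections import Counter


def partial_mutual_information(window_keys, p_x1, p_y, floor=1e-6):
    """Partial mutual information I(X=1;Y) in bits for one time window.

    window_keys: session-initiation key codes observed in the window.
    p_x1:        trained probability that a window contains a session.
    p_y:         trained all-keys marginal, {key code: probability}.
    """
    counts = Counter(window_keys)
    total = sum(counts.values())
    if total == 0:
        return None  # no sessions in the window, so nothing to evaluate
    mi = 0.0
    for key, count in counts.items():
        joint = p_x1 * count / total              # assumed p(x=1, y)
        marginal = max(p_y.get(key, 0.0), floor)  # avoid division by zero
        mi += joint * math.log2(joint / (p_x1 * marginal))
    return mi


# Hypothetical all-keys marginal (1 = left mouse click, 13 = Enter).
p_y = {1: 0.1, 13: 0.1, 40: 0.4, 38: 0.4}
benign = partial_mutual_information([1, 1, 13, 13], 0.5, p_y)
infected = partial_mutual_information([1, 13] + [40] * 4 + [38] * 4, 0.5, p_y)
```

Under these assumptions the measure is large when the window is dominated by keys that are rare in the all-keys marginal, and it falls to zero when the window's keys follow the marginal itself, which matches the qualitative behavior expected of benign and infected windows.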

The above characterization describes the correlation between network sessions and keystrokes in a simple and intuitive manner. Based on previous results, we know that for legitimate activity $X$ and $Y$ are highly correlated. Therefore, their mutual information should be high. Once a self-propagating malicious code starts initiating sessions, the keystrokes will be drawn from the marginal $p(Y)$ distribution and therefore the correlation between $X$ and $Y$ should drop.

As in the last section, results reported in this chapter use a window size of 60 seconds. In each window with one or more sessions, we compute the joint conditional distribution $p(x, y \mid x = 1)$, which is used to compute the conditional mutual information. The marginals $p(X)$ and $p(Y)$ are generated from the first 500 values of the session-key and all-keys data, respectively.

B.6.3.2 Mutual Information Perturbations in the Infected Profiles

Similar to the entropy-based keystroke perturbations, we observed some noisy mutual information spikes. Therefore, like the entropy-based technique we use an order-7 median filter to remove these spikes. The mutual information of different endpoints randomly infected with a single infection of a malicious code is outlined in Figure 43. Session-keystroke mutual information clearly highlights anomalous behavior for both high- and low-rate malware and endpoints. In the benign data, the mutual information is consistently high because only a few keys are used to initiate most of the sessions. Once an endpoint is compromised, its marginal keystrokes get flagged as session initiation keys. The mutual information drops in Figure 43 occur because the marginal all-keys distribution has very little correlation with network sessions.

The keystroke-based measures proposed in this chapter are fairly independent of the

rate of session initiation. This is a unique attribute of the present techniques because other

network-based anomaly detectors implicitly or explicitly use this rate for detection.

Consequently detection and false alarm rates of such detectors are dependent on the

scanning rate of the malicious code. The techniques proposed in this chapter jointly

consider sessions and keystrokes and are therefore not entirely dependent on the session

rate.

In the following section, we develop an automated tool that uses keystroke entropy

and mutual information values for real-time malware detection.

Figure 43. Mutual information of the session and keystroke random variables at infected endpoints (x-axis: time window; y-axis: session-key mutual information): (a) endpoint 1, Blaster; (b) endpoint 3, Forbot-FU; (c) endpoint 6, MyDoom-A; (d) endpoint 9, Rbot-AQJ; (e) endpoint 11, SoBig.E; (f) endpoint 13, Zotob.G. Infection start times are marked with a circle. Infections last approximately 15 minutes. Each non-overlapping time-window is 60 seconds.

B.6.3.3 Automated Detection using Keystroke Perturbations

As mentioned in previous sections, we use an order-7 median filter to filter out the noise in the keystroke entropy and mutual information values. To leverage the filtered values in a real-time and automated fashion, we train the entropy detector using the first 50 benign keystroke entropy values, and the mutual information based detector using the first 10 benign mutual information values of an endpoint. We find the sample mean and sample standard deviation of the entropy values of an endpoint. An alarm is raised when the filtered entropy value observed in a window is more than the mean plus three standard deviations. Similarly, we find the sample mean and sample standard deviation of the mutual information values. An alarm is raised when the filtered mutual information value in a window is less than the mean minus one standard deviation.
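The thresholding rule can be sketched as follows. The training values are hypothetical, and the helper names are introduced only for illustration. The entropy alarm fires above the mean plus three standard deviations; the mutual information alarm is taken here to fire below the mean minus one standard deviation, an assumption that is consistent with mutual information dropping under infection.

```python
from statistics import mean, stdev


def train_threshold(benign_values, k, direction):
    """Return mean + k*std ("above") or mean - k*std ("below") of benign values."""
    mu, sigma = mean(benign_values), stdev(benign_values)  # sample statistics
    return mu + k * sigma if direction == "above" else mu - k * sigma


def alarm(value, threshold, direction):
    """True when a filtered value crosses the trained threshold."""
    return value > threshold if direction == "above" else value < threshold


# Hypothetical filtered training values (the study uses 50 and 10 values).
entropy_threshold = train_threshold([1.0, 1.2, 0.8, 1.1, 0.9], k=3, direction="above")
mi_threshold = train_threshold([20.0, 22.0, 18.0, 21.0, 19.0], k=1, direction="below")

entropy_alarm = alarm(3.0, entropy_threshold, "above")
mi_alarm = alarm(12.0, mi_threshold, "below")
```

Because both thresholds depend only on a sample mean and a sample standard deviation, training amounts to a single pass over the first few filtered values of an endpoint.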

We use the infected profiles used in the last chapter for performance evaluation of the present malware detectors. Thus there are 100 non-overlapping random infections of each malicious code in every endpoint’s benign profile. As discussed earlier, each infection lasts approximately $T = 15$ minutes. Hence, all results provided in this section are averaged over one hundred experiments per endpoint per malicious code. We compute detection and false alarm rates for each experiment as follows. For 100 infections of a particular malicious code on an endpoint, the percentage detection rate for that malicious code is computed by simply counting the number of infections that are detected by the malware detector. The false alarm rate is computed by taking the ratio of the total number of false alarms to the total number of evaluated time-windows (i.e., windows with one or more sessions).
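The two performance metrics can be written out as a short sketch. The counts used below are hypothetical and are not results from this study; the function names are illustrative.

```python
def detection_rate(detected_flags):
    """Percentage of infections for which the detector raised an alarm."""
    if not detected_flags:
        raise ValueError("at least one infection is required")
    return 100.0 * sum(detected_flags) / len(detected_flags)


def false_alarm_rate(false_alarms, evaluated_windows):
    """Percentage of evaluated windows (those with sessions) that raised a false alarm."""
    if evaluated_windows <= 0:
        raise ValueError("at least one evaluated window is required")
    return 100.0 * false_alarms / evaluated_windows


# Hypothetical outcome: 99 of 100 infections detected, 12 false alarms in 500 windows.
dr = detection_rate([True] * 99 + [False])
far = false_alarm_rate(12, 500)
```

Note that windows without sessions are excluded from the denominator of the false alarm rate, since the detectors are not invoked in such windows.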


The average detection and false alarm rates of the entropy and mutual information based detectors are shown in Figure 44. Figure 44 (a) shows that the detection rate of the entropy-based technique is 100% for all endpoints and all malware. The detection rate of the mutual information detector is 100% for all endpoints except endpoint 1, which has an average detection rate of 99.66%. Thus both the proposed detectors provide very high detection accuracy. Figure 44 (b) shows that the mutual information detector has negligible false alarm rates. The keystroke-entropy detector has slightly higher false alarm rates than the mutual information detector; the highest false alarm rate of 2.39% was observed at endpoint 12. Hence, overall, both malware detectors proposed in this chapter provide very high accuracy for the diverse set of endpoints and malware considered in this study.

The proposed detectors provide much higher detection rates than the maximum-entropy and the rate-limiting detectors. As mentioned before, the false alarm rates of the maximum-entropy and rate-limiting detectors for the high session rate endpoints (endpoints 3 and 4) are extremely high. The reasons for the inferior performance of these detectors have been highlighted in the last chapter.

The detection accuracy of the keystroke-based detectors proposed in this chapter is better than that of the K-L/SVM-based detector of the last chapter. The false alarm rate of the keystroke-entropy detector is slightly higher than that of the K-L/SVM-based detector. The mutual information detector provides false alarm rates which are comparable to those of the K-L/SVM detector. Also, the keystroke-based detectors have lower complexity than the K-L/SVM based detector since they do not require a complex learning tool for automated detection. The complexity of computing the keystrokes’ entropy and mutual information is also low because these measures are computed on a very small sample space comprising only the session initiation keystrokes used in the last time-window. However, the training time required for the keystroke-based detectors is higher than that of the K-L/SVM detector. The high detection accuracy and low complexity of the keystroke-based malware detectors are a consequence of jointly using network- and host/OS-level information. In summary, if high detection accuracy and low complexity are the main objectives, then the keystroke-based detectors should be used. If low false alarm rates and small training times are desired, then the K-L/SVM-based detector is more suitable. Nevertheless, all detectors proposed in this thesis provide highly accurate and fast detection of self-propagating malware.

Figure 44. Comparison of (a) detection rates and (b) false-alarm rates of the mutual information based and keystroke-entropy based malware detectors with maximum-entropy [14] and rate-limiting [20] detectors (x-axis: endpoint ID; y-axis: average rate in %). Each point is averaged over 9 malicious codes with 100 random infections per malicious code per endpoint.

CHAPTER B.7 ATTACKS AND COUNTERMEASURES

In this chapter, we discuss attacks that can circumvent the proposed malware

detectors and possible countermeasures to mitigate these attacks.

B.7.1 Mimicry Attack

In a mimicry attack [66], a malware tries to hide its traffic inside benign traffic to

avoid detection. There are two mimicry attacks that can be launched against the K-

L/SVM based malware detector. Under the first attack, a malware can use ports that are

frequently used by an endpoint. While this attack can mimic non-attack ports, mimicry of

attack ports is not possible because vulnerabilities targeted by a malware are associated

with fixed ports, and consequently the destination ports of outgoing scan packets are

fixed. Thus, even with mimicked non-attack ports, the proposed detector can detect

perturbations in the attack port distribution, as shown by the CodeRed II results in

Section B.5.2.2.

Another type of mimicry attack on the K-L/SVM detector can be launched by a very

low-rate malware which can hide its traffic within benign traffic, while keeping the total

number of sessions under γ , where γ is the threshold number of sessions below which

malware detection is not invoked. As mentioned in Section B.5.2.1, for the endpoints of this study the values of γ were very small, ranging between 0.15 and 0.65 sessions per minute, with an average of 0.33 sessions per minute. A mimicking malware with fewer than γ sessions per time-window will have a very slow propagation rate, and hence will allow human countermeasures.

A mimicry attack can be launched against the keystroke-based detectors by a malware

which always initiates its scanning sessions after a certain predefined time has elapsed

since the last keystroke. Such a malicious session will not be evaluated by the proposed

keystroke-based detectors. To mitigate this attack, the time threshold for logging the

session initiation keystroke can be made adaptive. Also, we are currently investigating

the efficacy of the keystroke-based detectors in a scenario when the last keystroke is

always logged irrespective of the time elapsed since that keystroke.

B.7.2 Attack by Acquiring System-Level Privileges

On an endpoint where security policies and user-privileges are not appropriately

defined, a malware after compromising the endpoint can gain system-level privileges and

can in turn disable the malware detector or overwrite keyboard/mouse buffers [33]. This

vulnerability is a consequence of the design of contemporary operating systems and the

lack of appropriate user rights management. All endpoint-based malware detectors suffer

from this vulnerability. This attack can be mitigated by appropriate security policing and

user management. To completely defeat this attack, a trusted computing platform [67] or

a virtual machine [49] must be employed. Design of such operating systems is presently

an area of active research [68]-[71].

CHAPTER B.8 CONCLUSIONS AND FUTURE WORK

In this part, we proposed information-theoretic malware detection techniques for

network endpoints. The first technique leveraged the K-L divergence from an endpoint’s

benign port usage to detect malicious activity. The second set of techniques used the

entropy and mutual information of keystrokes that are used to initiate network sessions to

detect malware propagation. All of the proposed techniques were highly accurate and

provided significant improvements over existing methods.

As future work, we intend to increase the number of endpoints on which data are

collected. Moreover, we are currently collecting data on local area networks to see if the

network-based malware detector of Section B.5 can provide good performance when

deployed on LANs. We are also investigating effective countermeasures against the

attacks outlined in the last chapter.


PART-B REFERENCES

[1] D. Ellis, J. G. Aiken, K. S. Attwood, and S. D. Tenaglia, “A behavioral approach to worm detection,” ACM WORM, October 2004.

[2] C. C. Zou, L. Gao, W. Gong, and D. Towsley, “Monitoring and early warning of Internet worms,” ACM CCS, October 2003.

[3] J. Wu, S. Vangala, and L. Gao, “An effective architecture and algorithm for detecting worms with various scan techniques,” NDSS, February 2004.

[4] S. E. Schechter, J. Jung, and A. W. Berger, “Fast detection of scanning worm infections,” RAID, September 2004.

[5] J. Jung, V. Paxson, A. W. Berger, and H. Balakrishnan, “Fast portscan detection using sequential hypothesis testing,” IEEE Symposium on Security and Privacy, May 2004.

[6] N. Weaver, S. Staniford, and V. Paxson, “Very fast containment of scanning worms,” Usenix Security Symposium, August 2004.

[7] A. Lakhina, M. Crovella, and C. Diot, “Diagnosing network-wide traffic anomalies,” ACM Sigcomm, August/September 2004.

[8] A. Lakhina, M. Crovella, and C. Diot, “Characterization of network-wide traffic anomalies in traffic flows,” ACM/Usenix IMC, October 2004.

[9] P. Barford, J. Kline, D. Plonka, and A. Ron, “A signal analysis of network traffic anomalies,” ACM/Usenix IMC, November 2002.

[10] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen, “Sketch-based change detection: Methods, evaluation, and applications,” ACM/Usenix IMC, October 2003.

[11] A. Soule, K. Salamatian, and N. Taft, “Combining filtering and statistical methods for anomaly detection,” ACM/Usenix IMC, October 2005.

[12] Y. Kim, W. C. Lau, M. C. Chuah, and H. J. Chao, “PacketScore: Statistics-based overload control against distributed denial-of-service attacks,” IEEE Infocom, March 2004.


[13] A. Lakhina, M. Crovella, and C. Diot, “Mining anomalies using traffic feature distributions,” ACM Sigcomm, August 2005.

[14] Y. Gu, A. McCullum, and D. Towsley, “Detecting anomalies in network traffic using maximum entropy estimation,” ACM/Usenix IMC, October 2005.

[15] D. Moore, C. Shannon, G. M. Voelker, and S. Savage, “Network Telescopes,” CAIDA technical report, http://www.caida.org/outreach/papers/2004/tr-2004-04/.

[16] E. Cooke, M. Bailey, Z. M. Mao, D. Watson, F. Jahanian, and D. McPherson, “Toward Understanding Distributed Blackhole Placement,” ACM WORM, October 2004.

[17] M. Bailey, E. Cooke, F. Jahanian, J. Nazario, and D. Watson, “The Internet Motion Sensor: A distributed blackhole monitoring system,” NDSS, February 2005.

[18] D. Dagon, X. Qin, G. Gu, and W. Lee, “HoneyStat: Local worm detection using Honeypots,” RAID, September 2004.

[19] J. Twycross and M. M. Williamson, “Implementing and testing a virus throttle,” Usenix Security Symposium, August 2003.

[20] M. M. Williamson, “Throttling viruses: Restricting propagation to defeat malicious mobile code,” ACSAC, December 2002.

[21] S. Sellke, N. B. Shroff, and S. Bagchi, “Modeling and automated containment of worms,” DSN, June/July 2005.

[22] D. Whyte, E. Kranakis, and P. C. van Oorschot, “DNS-based detection of scanning worms in an enterprise network,” NDSS, February 2005.

[23] A. Gupta and R. Sekar, “An approach for detecting self-propagating email using anomaly detection,” RAID, September 2003.

[24] J. Xiong, “ACT: Attachment chain tracing scheme for email virus detection and control,” ACM WORM, October 2004.

[25] W. Cui, R. H. Katz and W-T. Tan, “BINDER: An Extrusion-based Break-In Detector for Personal Computers,” Usenix Security Symposium, April 2005.


[26] K. Ilgun, R. A. Kemmerer, and P. A. Porras, “State Transition Analysis: A Rule-based Intrusion Detection Approach,” IEEE Transactions on Software Engineering, vol. 21, no. 3, pp. 181-199, March 1995.

[27] S. Jha, K. Tan, and R.A. Maxion, “Markov Chains, Classifiers, and Intrusion Detection,” IEEE CSFW, June 2001.

[28] N. Ye, “A Markov Chain Model of Temporal Behavior for Anomaly Detection,” IEEE Workshop on Information Assurance and Security, June 2000.

[29] W. DuMouchel, “Computer Intrusion Detection Based on Bayes Factors for Comparing Command Transition Probabilities,” Tech. Rep. 91, National Institute of Statistical Sciences, 1999.

[30] A. Lazarevic, A. Ozgur, L. Ertoz, J. Srivastava, and V. Kumar, “A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection,” SIAM Conference on Data Mining, May 2003.

[31] R. P. Lippmann et al., “The 1998 DARPA/AFRL Off-line Intrusion Detection Evaluation,” RAID, September 1998.

[32] R. P. Lippmann, J.W. Haines, D. J. Fried, J. Korba, and K. Das, “The 1999 DARPA Off-line Intrusion Detection Evaluation,” ACM Computer Networks, vol. 34, 4, October 2000.

[33] Endpoint Security Homepage, http://www.endpointsecurity.org/.

[34] “Symantec Internet Security Threat Report - Trends for January 05 - June 05,” Volume VIII, September 2005.

[35] T. Raschke, “The New Security Challenge: Endpoints,” IDC/F-Secure, August 2005.

[36] N. Weaver, D. Ellis, S. Staniford, and V. Paxson, “Worms vs. Perimeters: The case for Hard-LANs,” IEEE Symposium on High Performance Interconnects (Hot Interconnects), August 2004.

[37] C. Wong, C. Wang, D. Song, S. Bielski, and G. R. Ganger, “Dynamic quarantine of Internet worms,” DSN, July 2004.

[38] C. Wong, S. Bielski, A. Studer, and C. Wang, “Empirical Analysis of Rate Limiting Mechanisms,” RAID, September 2005.


[39] Q. Li, E-C Chang, and M. C. Chan, “On effectiveness of DDOS attacks on statistical filtering,” IEEE Infocom, March 2005.

[40] A. Kuzmanovic and E. W. Knightly, “Low-rate TCP-targeted denial of service attacks,” ACM Sigcomm, August 2003.

[41] S. Staniford, V. Paxson, and N. Weaver, “How to 0wn the Internet in your spare time,” Usenix Security Symposium, August 2002.

[42] S. Panjwani, S. Tan, K. M. Jarrin, and M. Cukier, “An experimental evaluation to determine if port scans are precursor to an attack,” DSN, June/July 2005.

[43] T. M. Cover and J. A. Thomas, “Elements of Information Theory,” Wiley-Interscience, 1991.

[44] J. Lin, “Divergence Measures Based on the Shannon Entropy,” IEEE Transactions on Information Theory, vol. 37, no. 3, January 1991.

[45] D. H. Johnson and S. Sinanovic, “Symmetrizing the Kullback-Leibler Distance,” Technical Report, March 2001.

[46] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121- 167, 1998.

[47] “The Secure Hash Algorithm,” FIPS PUB 180-1, April 1995.

[48] MSDN Library, http://msdn.microsft.com.

[49] Microsoft Virtual PC 2004, http://www.microsoft.com/Windows/virtualpc.

[50] Symantec Security Response, W32.Zotob.G, http://securityresponse.symantec.com/avcenter/venc/data/w32.zotob.g.html.

[51] Sophos Virus Info, W32/Forbot-FU, http://www.sophos.com/virusinfo/analyses/w32forbotfu.html.

[52] Sophos Virus Info, W32/Sdbot-AFR, http://www.sophos.com/virusinfo/analyses/w32sdbotafr.html.

[53] Sophos Virus Info, Troj/Dloader-NY, http://www.sophos.com/virusinfo/analyses/trojdloaderny.html.


[54] Symantec Security Response, W32.SoBig.E@mm, http://securityresponse.symantec.com/avcenter/venc/data/[email protected].

[55] Symantec Security Response, W32.MyDoom.A@mm, http://securityresponse.symantec.com/avcenter/venc/data/[email protected].

[56] Symantec Security Response, W32.Blaster.Worm, http://securityresponse.symantec.com/avcenter/venc/data/w32.blaster.worm.html.

[57] Symantec Security Response, W32/Rbot-AQJ, http://www.sophos.com/virusinfo/analyses/w32rbotaqj.html.

[58] TrendMicro Virus Encyclopedia, WORM_RBOT.CCC, http://au.trendmicro-europe.com/smb/vinfo/encyclopedia.php?LYstr=VMAINDATA&vNav=3&VName=WORM_RBOT.CCC.

[59] Symantec Security Response, W32.Witty.Worm, http://securityresponse.symantec.com/avcenter/venc/data/w32.witty.worm.html.

[60] C. Shannon and D. Moore, “The spread of the Witty worm,” IEEE Security & Privacy, vol. 2, no. 4, pp. 46- 50, July/August 2004.

[61] Symantec Security Response, CodeRed II, http://securityresponse.symantec.com/avcenter/venc/data/codered.ii.html.

[62] D. Moore, C. Shannon, and J. Brown, “Code-Red: A case study on the spread and victims of an Internet worm,” ACM/Usenix IMC, November 2002.

[63] A. Kumar, V. Paxson, and N. Weaver, “Exploiting underlying structure for detailed reconstruction of an Internet-scale event,” ACM/Usenix IMC, October 2005.

[64] W. S. Sarle, “AI FAQ,” http://www.faqs.org/faqs/ai-faq/neural-nets/.

[65] S. Axelsson, “The base-rate fallacy and its implications for the difficulty of intrusion detection,” RAID, September 1999.

[66] D. Wagner and P. Soto, “Mimicry Attacks on Host-Based Intrusion Detection Systems,” ACM CCS, Nov. 2002.

[67] Trusted Computing Alliance, https://www.trustedcomputinggroup.org.


[68] G. Dunlap, S. King, S. Cinar, M. Basrai, and P. Chen, “ReVirt: Enabling intrusion analysis through virtual-machine logging and replay,” Usenix OSDI, December 2002.

[69] T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh, “Terra: A virtual machine-based platform for trusted computing,” ACM SOSP, October 2003.

[70] B. W. Lampson, “Computer security in the real world,” IEEE Computer, vol. 37, no. 6, pp. 37–46, June 2004.

[71] M. Rosenblum and T. Garfinkel, “Virtual Machine Monitors: Current technology and future trends,” IEEE Computer, (38)5, pp. 39–47, May 2005.
