WIRELESS CHANNEL MODELING AND MALWARE DETECTION USING
STATISTICAL AND INFORMATION-THEORETIC TOOLS
By
Syed Ali Khayam
A DISSERTATION
Submitted to Michigan State University
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Electrical and Computer Engineering
2006
ABSTRACT
WIRELESS CHANNEL MODELING AND MALWARE DETECTION USING STATISTICAL
AND INFORMATION-THEORETIC TOOLS
By
Syed Ali Khayam
This is a bipartite thesis that tackles two different research problems: (i) medium
access control (MAC) layer wireless channel modeling and applications of the models in
design, analysis and simulations of wireless systems; and (ii) malicious software
(malware) detection at network endpoints. For both problems, we collect extensive new
datasets which are analyzed and modeled using statistical and information-theoretic tools.
In the first part of this thesis, we provide analysis and modeling of bit-errors at the
802.11b MAC layer. We show that the bit-errors at 2 Mbps and 5.5 Mbps can be modeled
by high-order full-state Markov (FSM) chains. Bit-errors at 11 Mbps are shown to have
long-range dependence (LRD), and consequently a multifractal wavelet model (MWM) is
used to model these LRD bit-errors. The complexity of FSM chains is an exponential
function of the bit-error process’ memory-length. To mitigate the exponential FSM
complexity, we derive guidelines for accurate approximation of an FSM chain of
arbitrary memory-length. These guidelines lead to a novel and accurate constant-
complexity model (CCM) which always consists of five states irrespective of a process'
memory-length.
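The exponential growth mentioned above can be illustrated with a minimal sketch, assuming a binary bit-error trace (1 = error) and a state defined as the k most recent bits. The function names and the toy trace are hypothetical and are not taken from the thesis; the sketch only shows why a memory-length of k implies up to 2^k states.

```python
from collections import defaultdict


def fsm_state_count(k: int) -> int:
    """Number of states of a binary full-state Markov chain with memory-length k."""
    return 2 ** k


def train_fsm(trace, k):
    """Estimate P(next bit = 1 | previous k bits) from a 0/1 trace.

    The state is the tuple of the k most recent bits. Only states that
    actually occur in the trace receive an entry.
    """
    counts = defaultdict(lambda: [0, 0])  # state -> [count(next=0), count(next=1)]
    for i in range(k, len(trace)):
        state = tuple(trace[i - k:i])
        counts[state][trace[i]] += 1
    return {s: c[1] / (c[0] + c[1]) for s, c in counts.items()}


# Hypothetical toy trace (1 = bit-error); not data from the thesis.
toy_trace = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
model = train_fsm(toy_trace, k=2)
```

With k = 10 the chain already has 1024 states, which is the motivation for a model whose size does not depend on k.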
Two applications of the proposed channel models are explored. First, we use the
models in a novel maximum-likelihood header estimation framework which can be used
by wireless multimedia applications to realize considerable throughput improvements.
Trace-driven wireless video simulations show that the proposed header estimation
framework provides significant improvements over existing techniques. Second, we use
protocol goodput and retransmission metrics to show that inaccurate channel models can
lead to extremely misleading simulation and analytical results. The models proposed in
this thesis, however, provide highly accurate estimates of goodput and retransmissions.
In the second part of this thesis, we propose three endpoint-based anomaly detection
techniques that detect self-propagating malware in real-time by observing deviations
from a behavioral model derived from a benign data profile. In the first technique, we
leverage the Kullback-Leibler (K-L) information divergence of real-time source and
destination ports’ distributions to characterize deviations from the distributions observed
in the benign traffic profile. Experiments using actual endpoint and malware data
demonstrate that the source and destination ports’ distributions are perturbed significantly
on a compromised endpoint. K-L perturbations are used to train support vector machines
which provide almost 100% detection rates and negligible false alarm rates.
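As a rough illustration of the divergence-based idea, the sketch below computes a K-L divergence (in bits) between the port histogram of a current time-window and that of a benign profile. The direction of the divergence, the additive smoothing constant `eps`, and the toy port lists are assumptions made for this example; the thesis may define these details differently.

```python
import math
from collections import Counter


def kl_divergence(window_ports, benign_ports, eps=1e-6):
    """D(p || q) in bits: p from the current window, q from the benign profile.

    Additive smoothing (eps) over the union of observed ports keeps q > 0
    wherever p > 0. The smoothing is an assumption of this sketch.
    """
    p_counts, q_counts = Counter(window_ports), Counter(benign_ports)
    support = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(support)
    q_total = sum(q_counts.values()) + eps * len(support)
    d = 0.0
    for port in support:
        p = (p_counts[port] + eps) / p_total
        q = (q_counts[port] + eps) / q_total
        d += p * math.log2(p / q)
    return d


# Hypothetical destination-port samples; not data from the thesis.
benign_profile = [80] * 50 + [443] * 40 + [53] * 10
normal_window = [80] * 5 + [443] * 4 + [53] * 1
infected_window = normal_window + [445] * 30  # burst of sessions to one unseen port

kl_normal = kl_divergence(normal_window, benign_profile)
kl_infected = kl_divergence(infected_window, benign_profile)
```

A window that follows the benign proportions yields a divergence near zero, while a window dominated by a port absent from the profile yields a large value. The magnitude of that value depends strongly on `eps`, so it should be read qualitatively.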
The remaining two malware detection techniques proposed in this thesis employ
perturbations in the distribution of keystrokes that are used to initiate network sessions.
We show that the keystrokes’ entropy increases and the session-keystroke mutual
information decreases when an endpoint is compromised by self-propagating malware.
These two types of perturbations are used for real-time malware detection. Both detectors
provide almost 100% detection rates and very low false alarm rates.
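The two keystroke-based quantities can be sketched as follows, assuming Shannon entropy of an empirical histogram and the identity I(X;Y) = H(X) + H(Y) - H(X,Y). The session labels, key codes and sample lists are hypothetical; how the thesis actually defines the session and keystroke random variables is not shown in this excerpt.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Shannon entropy (bits) of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information_bits(pairs):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from (x, y) samples."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(pairs)


# Hypothetical keys observed just before a session starts (virtual key codes).
benign_keys = [1] * 6 + [13] * 2
infected_keys = [1] * 2 + [13] * 2 + [40] * 2 + [38] * 2

# Hypothetical (session label, key) pairs.
benign_pairs = [("web", 1)] * 4 + [("mail", 13)] * 4
infected_pairs = ([("web", 1)] * 2 + [("web", 13)] * 2
                  + [("mail", 1)] * 2 + [("mail", 13)] * 2)

h_benign = entropy_bits(benign_keys)
h_infected = entropy_bits(infected_keys)
mi_benign = mutual_information_bits(benign_pairs)
mi_infected = mutual_information_bits(infected_pairs)
```

In this toy data, sessions that start independently of user input spread the key histogram (higher entropy) and break the coupling between sessions and keys (lower mutual information), which matches the direction of the perturbations described above.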
Copyright by SYED ALI KHAYAM 2006
ACKNOWLEDGMENTS
I would like to thank my family for always respecting and supporting my professional
and academic goals. I also thank my academic advisor, Professor Hayder Radha, for
always encouraging me to think out-of-the-box and for helping me identify and refine
research ideas. I sincerely thank my friends, family members and colleagues in WAVES
lab who allowed me to collect network traffic data on their computers. Aparna, Mujahid,
Dmitri and Farshad deserve special mention here for discussing and critiquing the theory,
experiments and writing of my research papers. I also thank Shardha who was a great
friend during my first year, and who I regretfully forgot to acknowledge in my Masters
thesis. I must also acknowledge the Higher Education Commission of Pakistan and the
National Science Foundation of USA for their continued financial support during my
M.S. and Ph.D. studies. I thank my Ph.D. committee members and Professor Rong Jin for
their technical and editorial guidance. Finally, I thank those associate editors and
anonymous reviewers who gave constructive feedback on my papers. That feedback has
definitely improved the quality of this thesis.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
Part A Statistical Models of MAC Layer Wireless Channels and their Applications
CHAPTER A.1 Introduction
  A.1.1 Overview of Contributions
  A.1.2 Organization of this Part
CHAPTER A.2 Related Work
  A.2.1 Channel Modeling
  A.2.2 Cross-Layer Design for Wireless Multimedia
CHAPTER A.3 Background
  A.3.1 802.11b Wireless Networks
  A.3.2 Autocorrelation of Random Processes
  A.3.3 Discrete-Time Markov Chains
  A.3.4 Burst Representation of Binary Wireless Traces
  A.3.5 The Gilbert Channel Model
  A.3.6 Full-State Markov Chains for Wireless Channels
  A.3.7 Long-Range Dependent Processes
  A.3.8 The Multifractal Wavelet Model
  A.3.9 Performance Evaluation Measure
CHAPTER A.4 Empirical Analysis and Accurate Modeling of Wireless Channels
  A.4.1 Wireless Trace Collection
  A.4.2 Empirical Analysis of 802.11b Bit-Errors
    A.4.2.1 Autocorrelation Analysis
    A.4.2.2 Preliminary Empirical Analysis of FSM Chains
    A.4.2.3 Long-Range Dependence in 11 Mbps Bit-Errors
      A.4.2.3.1 LRD Evaluation by Observing Energy at Different Scales
      A.4.2.3.2 LRD Evaluation using Variance-Time Diagrams
      A.4.2.3.3 LRD Evaluation using the Periodogram
  A.4.3 Accurate Modeling of 802.11b Bit-Errors
    A.4.3.1 Bit-Error Modeling at 5.5 Mbps
    A.4.3.2 Bit-Error Modeling at 2 Mbps
    A.4.3.3 Bit-Error Modeling at 11 Mbps
      A.4.3.3.1 The Multifractal Wavelet Model
      A.4.3.3.2 ENK-based Performance Evaluation
      A.4.3.3.3 Performance in Capturing Energy at Different Scales
      A.4.3.3.4 Performance in Capturing the Variance-Time Characteristics
  A.4.4 Discussion
CHAPTER A.5 Complexity Reduction for Markov Channels
  A.5.1 The Hierarchical Markov Model
  A.5.2 The Hidden Markov Model
  A.5.3 FSM Observations
  A.5.4 Observations about FSM Chains
  A.5.5 Markov Chain Lumpability
    A.5.5.1 Lumpability for Wireless Bit-Error Channels
    A.5.5.2 Folded Markov Chains
    A.5.5.3 Evaluation of Folded Markov Chains
  A.5.6 Complexity Reduction by Approximating an FSM Chain’s Good- and Bad-Burst Behavior
    A.5.6.1 Simplification of Good-bursts Distribution
    A.5.6.2 Simplification of Bad-bursts Distribution
    A.5.6.3 Guidelines for Approximating an FSM chain
  A.5.7 Constant-Complexity Model
    A.5.7.1 Performance of the CCM at 2 Mbps
    A.5.7.2 Performance of the CCM at 5.5 Mbps
  A.5.8 Discussion
CHAPTER A.6 Channel Model Based Header Estimation for Wireless Multimedia
  A.6.1 FEC Redundancy Lower Bounds for UDP, UDP-Lite and Header Estimation
    A.6.1.1 Redundancy Bounds on the q-ary Symmetric Channel
      A.6.1.1.1 FEC Redundancy Bound on a UDP based Protocol Stack
      A.6.1.1.2 FEC Redundancy Bound on a UDP-Lite based Protocol Stack
      A.6.1.1.3 FEC Redundancy Bound on a Header Estimation based Protocol Stack
      A.6.1.1.4 Comparison of the FEC Redundancy Bounds
    A.6.1.2 Redundancy Bounds on the Gilbert Channel
      A.6.1.2.1 Bound on a UDP based Protocol Stack
      A.6.1.2.2 Bound on a UDP-Lite based Protocol Stack
      A.6.1.2.3 Bound on a Header Estimation based Protocol Stack
      A.6.1.2.4 Comparison of the FEC Redundancy Bounds
    A.6.1.3 Discussion
  A.6.2 Maximum-Likelihood Header Estimation Framework
    A.6.2.1 Functionality at and below a Receiver’s MAC layer
    A.6.2.2 The Header Estimation Module
    A.6.2.3 Processing at a Receiver’s Network, Transport and Application Layers
  A.6.3 Likelihood Functions for Header Estimation
    A.6.3.1 Header Estimation Likelihood Function for FSM Chains
    A.6.3.2 Header Estimation Likelihood Function of MWM
    A.6.3.3 Extending the FSM Likelihood Function to the CCM
  A.6.4 Performance Evaluation of the Header Estimation Framework
    A.6.4.1 Experimental Setup
    A.6.4.2 Throughput Performance
    A.6.4.3 Comparison of Packet Drops
    A.6.4.4 False Alarm Rate
    A.6.4.5 FEC Performance
    A.6.4.6 Video Performance
  A.6.5 Discussion
CHAPTER A.7 Impacts of Ignoring Channel Memory on Analysis and Simulation of Wireless Systems
  A.7.1 Goodput of an Unreliable Protocol
    A.7.1.1 Goodput of a Wireless Channel
    A.7.1.2 Goodput of a Binary-Symmetric Channel Model
    A.7.1.3 Goodput of a Gilbert Channel Model
    A.7.1.4 Goodput of a Full-state Markov Channel Model
    A.7.1.5 Goodput of a Constant-Complexity Channel Model
    A.7.1.6 Comparison of Estimated Goodputs
  A.7.2 Retransmissions of a Reliable Protocol
    A.7.2.1 Expected Retransmissions on a Wireless Channel
    A.7.2.2 Comparison of Estimated Retransmissions
CHAPTER A.8 Conclusions and Future Work
Part-A References
Part B Self-Propagating Malware Detection at Network Endpoints using Information-Theoretic Tools
CHAPTER B.1 Introduction
  B.1.1 Overview of Contributions
  B.1.2 Organization of this Part
CHAPTER B.2 Related Work
CHAPTER B.3 Background
  B.3.1 Self-Propagating Malware
  B.3.2 Support Vector Machines
CHAPTER B.4 Data Collection and Simulation
  B.4.1 Benign Traffic-Keystroke Profiles
  B.4.2 All-Keystrokes’ Profiles
  B.4.3 Malware Classification
  B.4.4 Real Malware
  B.4.5 Simulated Malware
  B.4.6 Inserting Malware Data in Benign Traffic Profiles
CHAPTER B.5 Malware Detection using Traffic Features
  B.5.1 Malware Detection Using Sample Entropy
    B.5.1.1 Entropy of Source and Destination Ports
    B.5.1.2 Entropy-based Traffic Perturbations in the Infected Profiles
  B.5.2 Malware Detection Using Information Divergence
    B.5.2.1 Kullback-Leibler Divergence of Source and Destination Ports
    B.5.2.2 K-L-based Traffic Perturbations in the Infected Profiles
    B.5.2.3 Evaluating Traffic Perturbations with Other Information Divergences
  B.5.3 Leveraging K-L Perturbations in an SVM-based Framework
    B.5.3.1 SVM Training
    B.5.3.2 Performance Evaluation and Comparison with Existing Techniques
  B.5.4 Summary and Discussion
CHAPTER B.6 Malware Detection using Joint Network-Host Features
  B.6.1 Correlation in the Session-Key Data
  B.6.2 Malware Detection Using Keystroke Entropy
    B.6.2.1 Definition of Keystroke Entropy
    B.6.2.2 Entropy Perturbations in the Infected Profiles
  B.6.3 Malware Detection Using Session-Key Mutual Information
    B.6.3.1 Mutual Information of Sessions and Keys
    B.6.3.2 Mutual Information Perturbations in the Infected Profiles
    B.6.3.3 Automated Detection using Keystroke Perturbations
CHAPTER B.7 Attacks and Countermeasures
  B.7.1 Mimicry Attack
  B.7.2 Attack by Acquiring System-Level Privileges
CHAPTER B.8 Conclusions And Future Work
Part-B References
LIST OF TABLES
Table 1. Packet-Level Statistics at 2, 5.5 and 11 Mbps
Table 2. Performance of MWM and FSM for the 11 Mbps Bit-Error Process
Table 3. Performance of the hMM for 5.5 Mbps Bit-Error Process
Table 4. Performance of the HMM for the 5.5 Mbps Bit-Error Process
Table 5. Empirical Evidence in Support of Observation 2
Table 6. Statistics of the Benign Profiles
Table 7. Information of Malware Used in This Study
LIST OF FIGURES
Figure 1. The Gilbert channel model [81].
Figure 2. Set up for collection of wireless bit-error traces.
Figure 3. Autocorrelation of bit-error traces.
Figure 4. Percentage of unused FSM states at 2 and 5.5 Mbps.
Figure 5. Aggregates of the 11 Mbps energy process at different time scales.
Figure 6. Variance-time diagrams of two 11 Mbps bit-error traces.
Figure 7. Logscale periodogram of two 11 Mbps bit-error traces.
Figure 8. Performances of varying order FSM chains for the 5.5 Mbps MAC layer bit-error process.
Figure 9. Performances of varying order FSM chains for the 2 Mbps MAC layer bit-error process.
Figure 10. Probability mass functions for good- and bad-bursts random variables derived from an 11 Mbps trace. (Only the probabilities of small bursts are shown here.)
Figure 11. Energy processes of actual and synthesized bit-error traces.
Figure 12. Variance-time diagrams of varying order FSM chains for the 11 Mbps bit-error process.
Figure 13. Variance-time diagrams of the MWM for the 11 Mbps bit-error process.
Figure 14. The hierarchical Markov model (hMM) [18].
Figure 15. Transition possibilities for an FSM chain (memory-length, $k = 4$).
Figure 16. Aggregate states $S_i$ and $S_j$ containing FSM states $n, m$ and $2n, 2m$, respectively.
Figure 17. Performance of FMCs formed by folding a 512 state FSM to 256, 128, 64, 32, 16, 8, 4 and 2 states; the FSM process is trained using a 5.5 Mbps trace.
Figure 18. State transitions of an FSM with memory-length $k$ and a good-burst of length $l \geq k$.
Figure 19. State aggregation and transitions for the CCM. Each box represents an aggregate CCM state. The number(s) inside a CCM state are the aggregated FSM states.
Figure 20. ENK based modeling performance versus complexity for the 2 Mbps bit-error process.
Figure 21. ENK based modeling performance versus memory-length for the 2 Mbps bit-error process.
Figure 22. ENK based modeling performance versus complexity for the 5.5 Mbps bit-error process.
Figure 23. ENK based modeling performance versus memory-length for the 5.5 Mbps bit-error process.
Figure 24. Minimum expected FEC redundancies of UDP, UDP-Lite and Ideal Header Estimation over a $q$-ary symmetric channel; $m = 8$, $q = 256$, $L = 30$, $n_H = 60$, $n_D = 452$.
Figure 25. Minimum expected FEC redundancies of UDP, UDP-Lite and Ideal Header Estimation over a Gilbert channel; $m = 8$, $L = 30$, $n_H = 60$, $n_D = 452$.
Figure 26. Interactions between the UDP-based header estimation module and different layers of a wireless receiver’s protocol stack; modified protocol stack layers are shown in different colors and dotted lines represent communications that are not related to packet reception.
Figure 27. Average packet drops for UDP Normal, UDP-Lite and UDP with Header Estimation at different data rates and for varying number of video streams per receiver; each point is averaged over $3 \times (\text{\# of video streams}) \times 5 \times 25$ received video streams.
Figure 28. Codeword construction for video FEC simulations.
Figure 29. Average FEC redundancy required by UDP Normal, UDP-Lite and UDP with Header Estimation at different data rates of an 802.11b LAN; each point is averaged over $3 \times 5 \times 5 \times 25 = 1875$ received video streams.
Figure 30. Average PSNR of video sequences for UDP Normal, UDP-Lite and UDP with Header Estimation using a 30 byte RS codeword with 2 parity bytes; each graph is averaged over $3 \times 5 \times 5 = 75$ received video streams.
Figure 31. Comparison of the average goodput of the actual traces with the goodput estimates provided by BSC, Gilbert, 1024-state Markov, and 5-state CCM models; each result is averaged over five traces.
Figure 32. Comparison of the number of retransmissions per packet estimated by BSC, Gilbert, 1024-state Markov, and 5-state CCM models; each result is averaged over five traces.
Figure 33. Number of retransmissions per packet without the BSC model.
Figure 34. Number of retransmissions per packet without the BSC model.
Figure 35. Source and destination port entropies at infected endpoints. Infection start times are marked with a circle. Infections in (a), (b), and (c) last approximately 15 minutes, while that in (d) lasts approximately one minute. Each non-overlapping time-window is 20 seconds.
Figure 36. Source and destination ports’ K-L divergences at infected endpoints.
Figure 37. Jensen-Shannon (J-S), K- and resistor-average (R-A) divergences of source and destination ports at infected endpoints.
Figure 38. Comparison of detection and false-alarm rates of the proposed K-L/SVM-based malware detector with maximum-entropy and rate-limiting detectors. Each point is averaged over 12 malware with 100 random infections per malware per endpoint.
Figure 39. A generalized flow diagram of the proposed K-L/SVM-based malware detector. The shaded area contains real-time components.
Figure 40. Normalized histograms of 20 most-used session initiation keystrokes. Histograms are generated from the session-key data. Virtual key codes 1 and 13 correspond to the left mouse click and the Enter key, respectively [48].
Figure 41. Normalized histograms of 20 most-used keystrokes. Histograms are generated from the all-keys data. Virtual key codes 40, 38 and 17 correspond to the down arrow key, the up arrow key and the control key, respectively [48].
Figure 42. Entropy of the keystroke histograms at infected endpoints. Infection start times are marked with a circle. Infections last approximately 15 minutes. Each non-overlapping time-window is 60 seconds.
Figure 43. Mutual information of the session and keystroke random variables at infected endpoints. Infection start times are marked with a circle. Infections last approximately 15 minutes. Each non-overlapping time-window is 60 seconds.
Figure 44. Comparison of detection and false-alarm rates of the mutual information based and keystroke-entropy based malware detectors with maximum-entropy [14] and rate-limiting [20] detectors. Each point is averaged over 9 malicious codes with 100 random infections per malicious code per endpoint.
PART A
STATISTICAL MODELS OF MAC LAYER WIRELESS CHANNELS AND THEIR
APPLICATIONS
CHAPTER A.1 INTRODUCTION
Error modeling has been used to improve design of communication channels and
systems for many decades [1]–[7]. Stochastic models of wireless medium access control
(MAC) layer packet-losses and bit-errors have recently attracted significant research
attention [8]–[30]. The main objective of analyzing and modeling MAC-to-MAC [29] or
residual [11] bit-errors is to develop accurate simulators which allow experimentation
without having the actual network in place. Moreover, bit-error analysis and modeling
provide important insights into characteristics of the underlying error random process.
This insight is essential for design and performance evaluation of a wide range of
wireless protocols, applications and services. For instance, accurate channel models can
facilitate design, parameter tuning and verification of the following wireless protocols:
• Wireless congestion control protocols, instead of relying on MAC layer
retransmissions, can use accurate MAC layer error models to differentiate between
losses due to congestion, medium degradation or mobility. The inability of wired
congestion control algorithms to differentiate between different types of losses and the
consequent bandwidth underutilization have been repeatedly highlighted by prior
studies [10], [31]–[40]. Knowledge of losses due to channel errors, which is assumed
in many wireless congestion control solutions, can be provided by a real-time MAC-
to-MAC channel model. Understanding of error frequency and burstiness is also
instrumental in parameter tuning of congestion control protocols.
• Cross-layer protocols can use a real-time channel model to choose between reliable
(e.g., using MAC layer retransmissions) and cross-layer (e.g., ignoring data payload
errors [41]–[51]) protocols.
• Reliable routing protocols [52]–[55] for mobile networks can use MAC-to-MAC
channel models to differentiate between reliable and shortest routes to different destinations,
if the model is able to provide real-time error characterization at different hops of the
network.
• MAC protocols can decide when to increase/decrease the physical transmission data
rate based on real-time channel estimation. An accurate channel model can predict
future error characteristics, thereby saving the MAC layer protocol the overhead of
switching to an inaccurate lower/higher data rate based on short-term observations.
Similarly, design of many wireless applications can be improved by accurate channel
models. For instance:
• Real-time channel estimation provided by an accurate model can be employed by rate-
adaptive applications to perform channel- and/or source-coding rate adaptation for
efficient bandwidth utilization.
• Design of effective error-control schemes for different wireless applications requires a
thorough understanding of errors above the physical layer [56].
• Error-resilience features of contemporary multimedia codecs can be effectively
designed and verified with knowledge of MAC layer error characteristics.
Note that most benefits of a wireless MAC layer channel model can be realized if the
model is able to provide real-time and online channel characterization and prediction. In
complexity- and power-constrained wireless environments, such channel characterization
is only possible with a low-complexity model. Despite some recent interest in reducing
the complexity of wireless models [23]–[29], development of accurate, pragmatic and
low-complexity wireless channel models is still an open problem.
A.1.1 Overview of Contributions
In this part of the thesis, we analyze and model bit-errors propagated to the 802.11
MAC layer at three physical layer data rates of an 802.11b LAN: 2, 5.5 and 11 Mbps
[57], [58]. Our objective is to develop low-complexity MAC-to-MAC channel models
without compromising modeling accuracy. To that end, Chapter A.4 focuses on empirical
analysis and “accurate” modeling of the bit-errors observed at 2, 5.5 and 11 Mbps. After
identifying accurate bit-error models, in Chapter A.5 we reduce the complexity of these
models by approximating their behavior. In Chapter A.6, the accurate and low-
complexity channel models are used in a header estimation framework to improve
wireless multimedia quality. As a final contribution of this part, Chapter A.7 shows that
inaccurate channel models can provide extremely misleading results for critical wireless
performance metrics.
Chapter A.4 shows that the MAC-to-MAC bit-error characteristics vary with changes
in the physical layer data rate. We show that the error-rate is quite low at 2 Mbps as
compared to 5.5 and 11 Mbps. At 2 Mbps, approximately 95% of the packets are
received without errors, which is a testament to the high physical layer robustness at 2
Mbps. The loss-rate subsequently increases with an increase in data rate.
We observe that the 2 and 5.5 Mbps bit-errors exhibit decaying correlation and a low
memory-length can be identified. Thus the bit-errors at 2 and 5.5 Mbps can be modeled
using Markov chains [59]. However, the bit-errors at 11 Mbps exhibit very high
correlation even at large lags. Such high correlation is reminiscent of long-range
dependence (LRD) [60] in the 11 Mbps bit-error process. We substantiate the LRD
notion through aggregation, variance-time and periodogram analyses.
Bit-errors at 2 and 5.5 Mbps are accurately modeled using high-order full-state
Markov (FSM) chains [59]. The LRD nature of the 11 Mbps bit-errors renders traditional
stochastic models (e.g., Markov, Poisson) ineffective. Therefore, we employ a
multifractal wavelet model (MWM) [61]–[63] to characterize the 11 Mbps bit-error
random process. For comparison, we also model the 11 Mbps bit-error process using
FSM chains of varying orders. We demonstrate that the MWM outperforms the Markov
models in both complexity and channel approximation.
The complexity of FSM chains increases exponentially with respect to the memory-
length. In Chapter A.5, we mitigate the exponential FSM complexity by approximating
the FSM behavior using low-complexity models. We first show that hierarchical, hidden
and lumped Markov models cannot capture the complex behavior of FSM chains.
Consequently, we directly analyze FSM chains and derive important guidelines that
should be followed to realize accurate, effective and low-complexity models. These
guidelines are used to propose a constant-complexity model (CCM) [30] that always
comprises five states irrespective of the underlying process’ memory-length. At both 2
and 5.5 Mbps, the 5-state CCM provides performance that is comparable to the
exponential-complexity FSM chains and better than the linear-complexity models [29].
In Chapter A.6, we leverage the proposed low-complexity channel models in a novel
cross-layer wireless multimedia framework. Under the proposed header estimation
framework, corrupted headers of received packets are estimated using the MAC-to-MAC
channel models. The corrupted packets are in turn passed to the application layer, which
uses forward error correction (FEC) to recover the corrupted data. Trace-driven wireless
video simulations show that significantly better bandwidth utilization and video quality
than UDP [64] and UDP-Lite [41]–[43] protocols can be achieved by employing the
header estimation framework. We also show analytically that an ideal header estimation
scheme will always perform better than UDP and UDP-Lite under realistic wireless
channel conditions.
As a final contribution of this part, Chapter A.7 shows that an inaccurate channel
model that ignores channel memory can provide extremely misleading results. We use
two critical wireless performance metrics, namely goodput and retransmissions, and show
that highly inaccurate estimates of these metrics are obtained if memory-less or 1st order
channel models are used. On the other hand, FSM and CCM channel models which cater
for channel memory provide very accurate goodput and retransmission estimates.
A.1.2 Organization of this Part
The rest of this part is organized as follows. Chapter A.2 provides an overview of the
related work in this area. Chapter A.3 provides background that is required to understand
the material presented in this part. Chapter A.4 focuses on empirical analyses and
“accurate” modeling of the bit-errors at 2, 5.5 and 11 Mbps. Chapter A.5 reduces the
complexity of the proposed models by evaluating low-complexity alternatives. Chapter
A.6 proposes a header estimation framework for wireless multimedia. Chapter A.7 shows
the impact of ignoring channel memory on the design of wireless systems.
CHAPTER A.2 RELATED WORK
A.2.1 Channel Modeling
Recently, link layer modeling for reliable protocols has received some research
attention [9], [10]. In the context of delay-sensitive traffic, a previous study derived
conditions under which block-based residual/MAC-to-MAC errors can be modeled using
a Markov chain [11]. For AT&T WaveLAN, a trace-based link layer investigation was
conducted in [13]. In the context of link layer modeling, Konrad et al. performed analysis
and presented a Markov-based Trace Analysis (MTA) algorithm for frame-errors
on GSM networks [14], [15]. Ji et al. [16], [17] compared the performance of the MTA, a
full-state k-th order model, a hidden Markov model and an extended
ON (error-free)/OFF (error-filled) model in capturing the GSM (link layer) frame losses. Based on
the comparison provided in [16], [17], it was concluded that an extended ON/OFF model
with geometric distributions governing the state holding times provides significantly
better results than the other three modeling paradigms.
In view of the increasing popularity of 802.11 networks, we studied the 802.11b link
layer in order to facilitate design of effective cross-layer error-control schemes for the
support of real-time services [18], [19], [45]. Since most error-control schemes operate
on byte and/or packet boundaries, we proposed Markov-based models at the packet and
byte levels. We showed that a 2-state Markov model can characterize the packet-loss
process and a hierarchical Markov Model was proposed for the byte-level errors [18].
Willig et al. [26] have performed the only prior study that attempts to analyze and model
bit-error behavior of 802.11b networks with modeling accuracy as a performance
criterion. There are fundamental differences between the measurements, analyses and
modeling of [26] and this thesis. In [26] the authors attempt to capture the impact of
physical layer parameters (e.g., modulation type, antenna diversity etc.) on the bit-error
rate of a wireless LAN in an industrial setting. In contrast, this thesis performs all experiments with
default physical layer parameters, thereby capturing a realistic channel that is
omnipresent in most common home/business/classroom settings. Also, in [26] the error-
prone 11 and 5.5 Mbps channels were not evaluated.
Chen et al. [24], [25] investigated Markov chain lumpability to reduce the complexity
of wireless channel models. Since lumpability constraints are too stringent for practical
wireless channels, Chen et al. [24], [25] resorted to an ON-OFF model that stochastically
bounds the sojourn time distributions of the lumped good and bad states. However, and
as asserted by [11], an ON-OFF model assumes geometric (memory-less) distributions
for good and bad periods which is not a valid assumption in most real-life channels.
Bipartite models were proposed for wireless channels by Willig [26]. The accuracy of
bipartite models depends on a selected value of complexity. We argue that model
accuracy is not optional and even a low-complexity model should provide the requisite
accuracy. Moreover, bipartite models require a large number of parameters to achieve a
certain level of accuracy. Köpke et al. [28] used chaotic maps to model 802.11b bit-errors
at low data rates (1 Mbps and 2 Mbps). Due to the focus on low data rates, in [28] it was
observed that: (a) probability of bit-error bursts of more than two bits is very low, and (b)
there is almost no correlation in error traces. The chaotic map model in [28] ignores the
correlation and captures only the heavy-tail behavior of bit-errors. While this assumption
of “no autocorrelation in data” might be suitable for the particular experimental setup
used in [28], it is not generically applicable to network error and loss data. In [20], it was
shown that low-complexity Markov models (such as hidden and hierarchical Markov
models) are inadequate for modeling of an 802.11b link layer bit-error wireless channel.
In [29], two linear-complexity models were proposed which were reasonably effective in
capturing 802.11b bit-errors.
A.2.2 Cross-Layer Design for Wireless Multimedia
The traditional UDP protocol detects and drops corrupted packets using a checksum
operating at the MAC layer [64]. Such packet drops result in significant bandwidth
wastage, especially in the context of error-resilient multimedia applications which can
inherently tolerate some errors and losses in the received content. Larzon et al. proposed
a UDP-Lite protocol which allows delivery of partially corrupted packets to the
application layer [41]–[44]. In its commonly-used form, UDP-Lite disables the MAC
layer checksum while the transport layer partial checksum only covers transport and
application layer headers. Errors in the application layer payload are simply ignored.
Note that support of partial checksum requires modifications to the multimedia senders,
receivers and/or (multicast or multihop) intermediate nodes.
Many wireless cross-layer studies have shown that UDP-Lite performs better than
UDP on contemporary wireless networks [18], [41]–[50]. In [18], it was shown that over
802.11b LANs an application layer FEC must be employed in conjunction with UDP-
Lite. Otherwise the partially corrupted packets delivered by UDP-Lite result in almost
unintelligible multimedia quality. It was also shown in [18] that UDP-Lite over 802.11b
LANs can only work at the 2 and 5.5 Mbps data rates. At the 11 Mbps data rate, the
errors and losses in the received content are too high for effective FEC-based recovery.
CHAPTER A.3 BACKGROUND
This chapter provides the background that is required to understand the contributions
of this part.
A.3.1 802.11b Wireless Networks
Due to their high data rates and use of the time-tested TCP/IP protocol suite, 802.11b
networks have experienced widespread deployment. These LANs are finding their way
into homes and businesses ubiquitously. However, like other wireless technologies,
802.11b networks also suffer from severe quality degradation in the presence of physical
obstructions and inter-symbol interference. Two modes of operation are supported in
802.11 networks [57], [58]: (i) ad hoc mode in which wireless nodes can communicate
with each other directly, and (ii) infrastructure mode in which wireless nodes are
arbitrated using a central entity called an access point (AP).
All 802.11b-compliant networks support four basic physical layer data rates of 1
Mbps, 2 Mbps, 5.5 Mbps and 11 Mbps. Increase in the data rate reduces the robustness of
the 802.11b physical layer. In the infrastructure mode, if the number of retransmission
requests exceeds a certain threshold, the AP drops down to a lower data rate than its
current data rate. For retransmissions, 802.11b relies on a 32-bit frame check sequence
(FCS) that computes checksum over the entire MAC layer frame. Positive
acknowledgement (ACK) frames are employed to signal successful transmission of data
frames. If a frame fails the checksum, it is dropped at the receiver’s MAC layer. The
sender, after timing out, schedules a retransmission.
A.3.2 Autocorrelation of Random Processes
Let $X(n_1)$ and $X(n_2)$ be two random variables derived from a random process
$X(t)$. The correlation coefficient of these random variables at lag $\eta$ is defined as [65]
\[ \rho(\eta) = \frac{\mathrm{E}\{X(0)X(\eta)\} - \mathrm{E}\{X(0)\}\,\mathrm{E}\{X(\eta)\}}{\sigma_{X(0)}\,\sigma_{X(\eta)}}, \tag{A.1} \]
where $\mathrm{E}\{X\}$ and $\sigma_X$ represent the mean and standard deviation of the random variable
$X$. When evaluating a dataset, the sample mean and the sample standard deviation are
used to compute the correlation coefficient of (A.1). This sample autocorrelation
coefficient for different values of lag is a direct measure of the level of temporal
dependence in the random process. The lag beyond which the autocorrelation coefficient
drops to an insignificant value corresponds to the memory-length of a random process.
Autocorrelation of a Markov source yields the order of the model required to accurately
characterize the source [66], [67].
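As an illustration of how (A.1) is applied to a dataset, the following minimal Python sketch computes the sample autocorrelation coefficient of a trace at a given lag and reads off a memory-length. The function names, the threshold argument, and the choice of the first lag at which the coefficient falls below the threshold are assumptions made for this sketch; they are not taken from the measurement code used in this thesis.

```python
from statistics import fmean, pstdev


def sample_autocorrelation(x, lag):
    """Sample correlation coefficient between x[i] and x[i + lag], following (A.1)."""
    if lag <= 0 or lag >= len(x):
        raise ValueError("lag must satisfy 0 < lag < len(x)")
    head, tail = x[:-lag], x[lag:]
    std_head, std_tail = pstdev(head), pstdev(tail)
    if std_head == 0 or std_tail == 0:
        return 0.0  # assumption: treat a constant segment as uncorrelated
    cross = fmean(a * b for a, b in zip(head, tail))
    return (cross - fmean(head) * fmean(tail)) / (std_head * std_tail)


def memory_length(x, threshold, max_lag):
    """First lag whose autocorrelation is below `threshold`, or None if there is none."""
    for lag in range(1, max_lag + 1):
        if sample_autocorrelation(x, lag) < threshold:
            return lag
    return None
```

For a long trace, `memory_length(trace, 0.15, 40)` would scan lags 1 through 40; the particular threshold is an empirical choice and has to be justified separately.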
A.3.3 Discrete-Time Markov Chains
Markov chains are employed to model statistical data with short-term temporal
dependence. Let a stochastic process $X_n$ take on values denoted by the non-negative
integers $0, 1, \ldots, M$. If $X_n = i$ then the process is said to be in state $i$ at time $n$.
Whenever the process is in state $i$ there is a fixed probability that the next state of the
process will be state $j$. If that probability can be expressed as
\[ \Pr\{X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_1 = i_1, X_0 = i_0\} = \Pr\{X_{n+1} = j \mid X_n = i\}, \tag{A.2} \]
for all states $i_0, i_1, \ldots, i_{n-1}, i, j$ and all $n \ge 0$, then $X_n$ is a Markov chain [59]. The
property given in (A.2) is commonly referred to as the Markov property. Thus, for a
Markov chain the conditional distribution of any future state $X_{n+1}$, given the past states
$X_0, X_1, \ldots, X_{n-1}$ and the present state $X_n$, is independent of the past states and depends
only on the present state. Because the probability on the right-hand side of (A.2) is fixed,
i.e., it does not depend on $n$, the chain is also homogeneous: the transition probabilities
do not vary with time.
Let $p_{i,j} = \Pr\{X_{n+1} = j \mid X_n = i\}$ denote the probability of transiting to state $j$
from $i$. Since $p_{i,j}$ represents a probability measure, it exhibits the following properties:
(i) $p_{i,j} \ge 0$ for all $i, j \ge 0$, and (ii) $\sum_{j=0}^{M} p_{i,j} = 1$ for all $i = 0, 1, \ldots, M$. The probability
of transiting to the next state can be represented in a matrix form. This matrix is referred
to as the one-step state transition probability matrix.
The steady-state or stationary probabilities of a Markov chain represent the long-run
proportion of the time spent in each state. Once the transition probabilities of a Markov
chain are known, the steady-state probabilities of being in a particular state are the unique
non-negative solutions of the following linear system of equations:
\[ \pi_j = \sum_{i=0}^{M} \pi_i\, p_{i,j}, \quad j = 0, 1, \ldots, M, \qquad \sum_{j=0}^{M} \pi_j = 1. \]
For stationary Markov chains, the steady-state and transition probabilities do not vary
with time. Throughout this thesis, we use stationary Markov chains for modeling bit-error
processes. The memory-length of a Markov chain is also referred to as its order.
The discussion in the preceding section showed that autocorrelation analysis can be
performed on the realizations of a random process to determine the appropriate order of
the respective Markov chain. This observation will be used later to identify the orders of
Markov chains.
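The steady-state probabilities described in this section can be computed numerically from a one-step transition probability matrix. The sketch below is one possible way of doing so, using power iteration instead of a direct linear solve; that choice, the function name, and the tolerance are assumptions for illustration only. The iteration converges only for irreducible, aperiodic chains.

```python
def steady_state(P, tol=1e-12, max_iter=100_000):
    """Approximate the stationary distribution of a row-stochastic matrix P.

    Starting from the uniform distribution, repeatedly apply pi <- pi P until
    successive iterates differ by less than `tol` in every component.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")
```

For a two-state chain with rows `[0.9, 0.1]` and `[0.4, 0.6]`, the result is close to `[0.8, 0.2]`, which satisfies both equations of the linear system above.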
A.3.4 Burst Representation of Binary Wireless Traces
Wireless bit-error traces are generally represented as a binary time-series $\{x(i)\}_{i=1}^{l}$,
where $x(i) \in \{0, 1\}$ and $l$ is the length of the time-series. Throughout this thesis, we
define $x(i)$ as:
\[ x(i) = \begin{cases} 0, & \text{error-free bit} \\ 1, & \text{corrupted bit.} \end{cases} \]
Without loss of generality, a binary time-series can be represented as an interleaved
sequence of runs (bursts) of good and bad bits ($x(i) = 0$ and $x(i) = 1$), i.e.,
$(I_1, B_1), (I_2, B_2), \ldots, (I_l, B_l)$, where $I_i$ and $B_i$ represent the lengths of the $i$-th good and
bad bursts, respectively. Wireless channel modeling studies have established that this
binary data representation is rather suitable for definition and evaluation of a model
[14]–[17], [20]. The burst-lengths of good and bad bits are used for empirical
performance evaluations in this thesis. (Subsequent sections discuss this in further detail.)
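The burst representation can be obtained from a binary trace by run-length encoding. The following short Python sketch shows one way to do this; the function name is hypothetical, and the trace is assumed to be a sequence of 0 (error-free) and 1 (corrupted) values as defined above.

```python
from itertools import groupby


def burst_lengths(bits):
    """Split a 0/1 trace into lists of good-burst and bad-burst lengths."""
    good, bad = [], []
    for value, run in groupby(bits):
        length = sum(1 for _ in run)  # length of this run of identical bits
        if value == 1:
            bad.append(length)
        else:
            good.append(length)
    return good, bad
```

For example, the trace `[0, 0, 0, 1, 1, 0, 1, 0, 0]` yields good bursts `[3, 1, 2]` and bad bursts `[2, 1]`.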
A.3.5 The Gilbert Channel Model
The Gilbert channel was proposed in [81] to model channels with 1st order memory.
Since then, it has been used to model many wireless channels at bit, byte and packet
levels [9]–[11], [13]–[15], [18]–[20], [26]. The Gilbert model captures channel memory
through a two-state Markov chain having a good and a bad state. The probability of the
next (good or bad) symbol depends on whether the last received symbol was
good or bad. The steady-state probabilities of being in the bad and good states are
respectively expressed as:
\[ \pi_b = \frac{p_{gb}}{p_{gb} + p_{bg}} \quad \text{and} \quad \pi_g = \frac{p_{bg}}{p_{gb} + p_{bg}}. \tag{A.3} \]
Clearly, $\pi_b + \pi_g = 1$.
Figure 1. The Gilbert channel model [81]. [Diagram: two states, Good and Bad, with transition probabilities $p_{gb}$, $p_{bg}$, $p_{bb}$ and $p_{gg}$.]
Higher probabilities of staying in the present state (i.e., $p_{gg}$ and $p_{bb}$) indicate the
intensity of channel memory. A more appropriate measure to quantify channel memory
was proposed by Mushkin and Bar-David in [82], where the memory $\mu$ of a Gilbert channel
was defined as:
\[ \mu = 1 - p_{gb} - p_{bg}. \tag{A.4} \]
It can be easily seen that $-1 \le \mu \le 1$. Moreover, a closer look at the above equations reveals
that
\[ \mu = 0 \;\Rightarrow\; \pi_b = p_{gb} \text{ and } \pi_g = p_{bg}. \tag{A.5} \]
In other words, when $\mu = 0$, the probability of getting a good or a bad symbol at any
time instance is independent of the last symbol value, that is, the channel behaves as a
memory-less channel. In [82], channels with $\mu > 0$ and $\mu < 0$ were referred to as
persistent and oscillatory memory channels, respectively. Real-life channels generally
have persistent memory.
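The quantities in (A.3) and (A.4) are simple functions of the two cross-transition probabilities. The Python sketch below evaluates them and also generates a synthetic bit-error trace from a Gilbert channel. The function names, the seeded generator, and the convention of starting in the good state are assumptions for illustration.

```python
import random


def gilbert_parameters(p_gb, p_bg):
    """Return (pi_b, pi_g, mu) for a Gilbert channel, following (A.3) and (A.4)."""
    total = p_gb + p_bg
    pi_b = p_gb / total
    pi_g = p_bg / total
    mu = 1.0 - p_gb - p_bg
    return pi_b, pi_g, mu


def simulate_gilbert(p_gb, p_bg, n, seed=0):
    """Generate n symbols (0 = good, 1 = bad) from a two-state Markov chain."""
    rng = random.Random(seed)
    state = 0  # assumption: the channel starts in the good state
    trace = []
    for _ in range(n):
        trace.append(state)
        if state == 0:
            state = 1 if rng.random() < p_gb else 0
        else:
            state = 0 if rng.random() < p_bg else 1
    return trace
```

With `p_gb = 0.1` and `p_bg = 0.4`, the channel has persistent memory ($\mu = 0.5$), and the fraction of bad symbols in a long simulated trace should approach $\pi_b = 0.2$.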
A.3.6 Full-State Markov Chains for Wireless Channels
Wireless bit-error processes are generally bursty and have a memory-length of greater
than one bit, and therefore these processes cannot be modeled using the Gilbert model.
To make such a process comply with the Markov property of (A.2), a Markov chain is
defined such that at each time instance the process is characterized by as many bits as the
memory-length. At each time instance, a new bit is added to the memory-window and the
oldest bit is dropped from the memory-window. As mentioned before, memory-length of
a Markov chain is also referred to as its order.
For a memory-length of $k$ bits, a full-state Markov (FSM) chain [20] corresponds to
all the $2^k$ different possible combinations of $k$ consecutive bits. Transition probabilities
between states are computed by sliding a $k$-bit memory-window over the data and
counting the number of times a bit-pattern $[x_1, x_2, \ldots, x_k]$ is followed by another
bit-pattern $[y_1, y_2, \ldots, y_k]$. Note that the number of states of an FSM chain increases
exponentially with an increase in memory-length: $2^k$ states for a memory-length of $k$.
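The sliding-window counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the trace is a list of 0/1 values and that states are stored as $k$-bit tuples; the function name and the dictionary-of-dictionaries layout are hypothetical. Only states that actually occur in the trace appear in the result.

```python
from collections import Counter, defaultdict


def fit_fsm(bits, k):
    """Estimate FSM transition probabilities with a k-bit sliding window.

    Returns {state: {next_state: probability}}, where each state is a tuple
    of k consecutive bits and next_state is the window shifted by one bit.
    """
    counts = defaultdict(Counter)
    for i in range(len(bits) - k):
        state = tuple(bits[i:i + k])
        next_state = tuple(bits[i + 1:i + k + 1])
        counts[state][next_state] += 1
    model = {}
    for state, row in counts.items():
        total = sum(row.values())
        model[state] = {nxt: c / total for nxt, c in row.items()}
    return model
```

Storing only the observed states keeps the table small for sparse traces, since many of the $2^k$ bit-patterns may never occur in practice.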
A.3.7 Long-Range Dependent Processes
Long-range dependent processes belong to a generic class of scaling or self-similar
processes [68], [69]. Self-similar processes exhibit similar statistical behavior at different
scales: zooming into or out of a sample path of the process gives a new process
realization which is statistically similar to the original. A self-similar process $X(t)$
satisfies the relation:
\[ X(t) \stackrel{d}{=} c^{H} X(t/c), \tag{A.6} \]
where $\stackrel{d}{=}$ represents equivalence in finite-dimensional distributions, $c$ is a scaling
(compression/dilation) factor and $H$ is known as the Hurst parameter. Self-similar
processes are also referred to as H-ss processes. It is not possible to define a characteristic
scale for H-ss processes, which implies that these processes are scale-invariant. A
self-similar process with stationary increments is referred to as an H-sssi process [68]–[70].
Long-range dependent (LRD) processes model stationary increments of a second-
order self-similar process. The Hurst parameter of an LRD process satisfies $1/2 < H < 1$. Also,
the autocovariance $r[k]$ of an LRD process is of the form:
\[ r[k] \sim c_r\, |k|^{-(2-2H)}, \tag{A.7} \]
where $\sim$ represents asymptotic equivalence and $c_r$ is a positive and constant scaling
factor. From (A.7) and the constraint on $H$ it can be seen that summing the
autocorrelation function results in a divergent series [71], $\sum_k r[k] = \infty$. Thus
all samples of an LRD process depend heavily on previous samples, thereby resulting in
occurrence of clusters of similar values. For the present binary process, this observation
simply implies long bursts of zeros and ones.
An important property of LRD processes is that they can be equivalently
characterized in terms of the behavior of the aggregated process:
\[ X^{(m)}[k] = \frac{1}{m} \sum_{i=(k-1)m+1}^{km} X[i], \tag{A.8} \]
where $m$ is the aggregation level. For an LRD process (and in general for all second-
order self-similar processes), $\mathrm{var}\{X^{(m)}[k]\} = m^{2H-2}\, \mathrm{var}\{X[k]\}$. Thus for an LRD
process, a log-log plot of $\mathrm{var}\{X^{(m)}[k]\}$ as a function of $m$ is strictly linear with a slope
of $2H - 2$ [70]. This plot, generally known as the variance-time diagram, can be used to
ascertain the presence of LRD in the data and can also render an estimate of $H$.
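The variance-time estimate of $H$ can be sketched as follows: aggregate the series as in (A.8) at several levels $m$, fit a straight line to the log-log variance plot, and recover $H$ from the slope $2H - 2$. The function names, the use of non-overlapping blocks, and the ordinary least-squares fit are assumptions for this sketch rather than the exact procedure used for the results in this thesis.

```python
import math
from statistics import fmean, pvariance


def aggregate(x, m):
    """Aggregated process of (A.8): means over non-overlapping blocks of length m."""
    n_blocks = len(x) // m
    return [fmean(x[b * m:(b + 1) * m]) for b in range(n_blocks)]


def variance_time_hurst(x, levels):
    """Estimate H from the least-squares slope of log var(X^(m)) versus log m."""
    points = []
    for m in levels:
        v = pvariance(aggregate(x, m))
        if v > 0:  # skip levels whose variance cannot be shown on a log scale
            points.append((math.log(m), math.log(v)))
    if len(points) < 2:
        raise ValueError("need at least two usable aggregation levels")
    mean_x = fmean(p[0] for p in points)
    mean_y = fmean(p[1] for p in points)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in points)
             / sum((a - mean_x) ** 2 for a, _ in points))
    return 1.0 + slope / 2.0  # slope = 2H - 2
```

For an uncorrelated series the variance of the aggregated process decays as $1/m$, so the estimate should be close to $H = 0.5$; values well above 0.5 point towards long-range dependence.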
The power spectral density of an LRD random process is the Fourier transform of the
autocorrelation of (A.7), and has been shown to be [70]:
\[ I(\omega) = C_H \left( 2 \sin \frac{\omega}{2} \right)^{2} \sum_{i=-\infty}^{\infty} \frac{1}{|\omega + 2\pi i|^{2H+1}} \sim C_H\, |\omega|^{1-2H} \quad \text{as } \omega \to 0, \tag{A.9} \]
where $\omega$ is a frequency and $C_H$ is a constant. Note that the spectral density is
proportional to $|\omega|^{1-2H}$ for frequencies close to the origin. A log-log plot of the power
spectral density as a function of the frequency has a slope of $1 - 2H$, which can be used
to estimate $H$.
A.3.8 The Multifractal Wavelet Model
The multifractal wavelet model (MWM) was proposed in [61]–[63] to analyze and
model LRD network data. The MWM has shown promise in modeling various LRD
network phenomena [61]–[63]. The MWM relies on the premise that network data is
inherently non-negative and generally spiky. Both these properties are clearly true for
wireless bit-error data. Moreover, the scaling properties of wireless bit-errors can be
adequately characterized by wavelet-based analysis.
The MWM employs the Haar wavelet family and applies a constraint that the input
training data are always non-negative. For the Haar wavelet, the scaling and wavelet
coefficients are computed recursively as
\[ U_{j,k} = \frac{1}{\sqrt{2}} \left( U_{j+1,2k} + U_{j+1,2k+1} \right) \quad \text{and} \quad W_{j,k} = \frac{1}{\sqrt{2}} \left( U_{j+1,2k} - U_{j+1,2k+1} \right), \tag{A.10} \]
where $U_{j,k}$ and $W_{j,k}$ respectively represent the scaling and wavelet coefficients at time
$k$ and scale/level $j$. With the Haar scaling function, the scaling coefficients are simply
averaged versions of the input signal and thus, due to the non-negative nature of the data,
the scaling coefficients are always non-negative, $U_{j,k} \ge 0$. Rearranging (A.10) yields
\[ U_{j+1,2k} = \frac{1}{\sqrt{2}} \left( U_{j,k} + W_{j,k} \right) \quad \text{and} \quad U_{j+1,2k+1} = \frac{1}{\sqrt{2}} \left( U_{j,k} - W_{j,k} \right). \tag{A.11} \]
In the first equation of (A.11), to keep the next level’s scaling coefficients (the $U_{j+1,2k}$’s)
non-negative, negative wavelet coefficients are constrained as $|W_{j,k}| \le U_{j,k}$. Similarly,
to keep the $U_{j+1,2k+1}$’s non-negative, the positive wavelet coefficients are constrained
as $W_{j,k} \le U_{j,k}$. Combining these two constraints gives a non-negativity constraint that
\[ |W_{j,k}| \le U_{j,k}. \tag{A.12} \]
The above constraint simply ensures that once the inverse transform is taken, the resultant
process is always non-negative. Alternatively, the constraint can be implemented as
\[ W_{j,k} = A_{j,k}\, U_{j,k}, \tag{A.13} \]
where $A_{j,k}$ is a random variable defined over the interval $[-1, 1]$.
In order to train the MWM to match the wireless bit-error traces, two random
variables need to be captured. The first random variable is the scaling coefficient at the
coarsest scale, $U_{j_0,k_0}$. The second set of random variables is the $A_{j,k}$’s at each level
which in turn yield the wavelet coefficients (A.13) at that level. Once a general sense of
probability distribution is ascertained for these random variables, the expectation-
maximization algorithm [76], [77] can be used to fit that distribution to the actual dataset.
The training and synthesis algorithm is detailed in [61]. The complexity of synthesizing a
length-$N$ trace using the MWM is $O(N)$.
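The coarse-to-fine structure of (A.10), (A.11) and (A.13) can be illustrated with a short Python sketch. It is not the training and synthesis algorithm of [61]: the function names are hypothetical, and the multiplier distribution is supplied by the caller as a placeholder, whereas the MWM fits a distribution for the $A_{j,k}$’s at each level. The sketch only shows why multipliers confined to $[-1, 1]$ keep the synthesized signal non-negative and why synthesis costs $O(N)$ operations.

```python
import math

INV_SQRT2 = 1.0 / math.sqrt(2.0)


def haar_step(fine):
    """One analysis step of (A.10): fine-scale U values -> coarser (U, W) lists."""
    half = len(fine) // 2
    U = [INV_SQRT2 * (fine[2 * k] + fine[2 * k + 1]) for k in range(half)]
    W = [INV_SQRT2 * (fine[2 * k] - fine[2 * k + 1]) for k in range(half)]
    return U, W


def synthesize(root, depth, draw_multiplier):
    """Refine one coarsest-scale coefficient `depth` times using (A.11) and (A.13).

    `draw_multiplier` is a caller-supplied function returning a value in [-1, 1];
    it stands in for the per-level multiplier distribution of the model.
    """
    level = [root]
    for _ in range(depth):
        finer = []
        for u in level:
            w = draw_multiplier() * u          # (A.13): W = A * U
            finer.append(INV_SQRT2 * (u + w))  # (A.11), even child
            finer.append(INV_SQRT2 * (u - w))  # (A.11), odd child
        level = finer
    return level
```

Each refinement doubles the number of coefficients, so producing $N = 2^{\text{depth}}$ samples touches roughly $2N$ values in total.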
A.3.9 Performance Evaluation Measure
Entropy is a measure of the average number of bits required to represent all outcomes
of a probability distribution. The Kullback-Leibler divergence quantifies the difference in
the entropies of two probability distributions [78]. In [20] we proposed an entropy
normalized Kullback-Leibler (ENK) divergence measure to quantify the accuracy of a
channel model. The ENK divergence quantifies the source-coding-like overhead incurred
by employing a model instead of the actual source. For two probability distributions
$p(X)$ and $q(X)$ defined over a common alphabet $\Psi$, the ENK divergence is defined as:
\[ ENK\big(p(X) \,\|\, q(X)\big) = \frac{\sum_{X \in \Psi} p(X) \log_2 \dfrac{p(X)}{q(X)}}{-\sum_{X \in \Psi} p(X) \log_2 p(X)}, \tag{A.14} \]
where the numerator and denominator respectively represent the Kullback-Leibler
divergence and entropy functions.
The ENK divergence inherits basic properties of the Kullback-Leibler divergence: (a)
non-negativity, $ENK(p \,\|\, q) \ge 0$, (b) non-symmetry, $ENK(p \,\|\, q) \ne ENK(q \,\|\, p)$, and (c)
$ENK(p \,\|\, q) = 0 \Leftrightarrow p = q$. Small values of ENK divergence indicate that a model
accurately approximates the actual source. We would expect the ENK between two actual
traces to be a very small value as the traces are realized by the same random source.
Therefore we employ the ENK divergence between two 802.11b traces as a performance
evaluation reference for the ENK divergence between an actual trace and a trace
artificially generated by a model.
The ENK divergence relies on the fact that an appropriate random variable X is
being used to characterize the underlying source. We employ two random variables for
performance evaluation of all the models in this thesis: (i) burst-length of good bits I ,
where I takes positive integer values; (ii) burst-length of bad/corrupted bits B , where
B also takes positive integer values. Throughout this thesis, we refer to I and B as
good-bursts and bad-bursts random variables, respectively.
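A minimal Python sketch of the ENK computation in (A.14) is given below, applied to empirical distributions such as those of the good-burst or bad-burst lengths. The function names are hypothetical, and the small floor `eps` placed on $q(X)$ is an assumption added to avoid division by zero when the model never produces a value that occurs in the actual trace; it is not part of the definition.

```python
import math
from collections import Counter


def empirical_pmf(samples):
    """Relative frequency of each distinct value in `samples`."""
    counts = Counter(samples)
    n = len(samples)
    return {value: c / n for value, c in counts.items()}


def enk_divergence(p, q, eps=1e-12):
    """Entropy-normalized Kullback-Leibler divergence of (A.14), in base 2."""
    kl = 0.0
    entropy = 0.0
    for x, px in p.items():
        if px <= 0.0:
            continue
        qx = max(q.get(x, 0.0), eps)  # assumption: floor q to keep the log finite
        kl += px * math.log2(px / qx)
        entropy -= px * math.log2(px)
    if entropy == 0.0:
        raise ValueError("p has zero entropy, so ENK is undefined")
    return kl / entropy
```

In use, `p` would be `empirical_pmf` of the burst lengths from an actual trace and `q` the same statistic computed from a model-generated trace.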
CHAPTER A.4 EMPIRICAL ANALYSIS AND ACCURATE MODELING OF WIRELESS CHANNELS
In this chapter, we first describe the wireless trace collection experiment. We then
evaluate the correlation in the bit-error traces collected at 2, 5.5 and 11 Mbps. We
observe that the correlation at 2 and 5.5 Mbps exhibits a decaying trend, but the 11 Mbps
traces have high correlation even at large lags. Due to their manageable correlation, we
use Markov chains to model the 2 and 5.5 Mbps bit-errors. We show that full-state
Markov (FSM) chains provide highly accurate models of the 2 and 5.5 Mbps bit-errors.
Moreover, we show that FSM chains have unused states which can be ignored to reduce
the complexity of the FSM-based channel modeling paradigm.
Unlike the bit-errors at 2 and 5.5 Mbps, the 11 Mbps bit-error process requires a
model that can capture long-memory. We evaluate the 11 Mbps bit-errors using scaling,
variance-time and periodogram analyses. These evaluations substantiate the presence of
long-memory or long-range dependence (LRD) in 11 Mbps bit-errors. Consequently, we
employ a multifractal wavelet model (MWM) to characterize the 11 Mbps bit-errors. We
show that the MWM captures second-order statistics of the 11 Mbps bit-errors much
more accurately than Markov chains.
A.4.1 Wireless Trace Collection
For this study, five wireless receivers were used to simultaneously collect error traces
on an 802.11b LAN. The receivers were placed at different locations in a room, while the
access point (AP) was placed in a room across a hallway from the receivers to simulate a
realistic home/classroom/office setting as shown in Figure 2.
The receivers’ MAC layer device drivers were modified to pass corrupted packets to
higher layers. The receivers were Linux clients using DLink DWL-650 wireless cards
with the open source linux-wlan-ng device drivers [72]. To capture packets at high
transmission rates, packet dissectors were implemented inside the device drivers. These
packet dissectors ensured that only packets pertinent to our wireless experiment are
processed, while all other packets are simply dropped. Each experiment comprised one
million packets with a payload of 1,000 bytes each, i.e., each trace had approximately 1
GB of data.
A wired sender was used to send multicast packets with a predetermined payload on
the wireless LAN; multicasting disabled MAC layer retransmissions. The sender used
different transmission rates ranging from 4 Kbps to 1 Mbps for each experiment. At the
physical layer, the auto rate selection feature of the AP was disabled and for each
experiment the AP was forced to transmit at a fixed data rate. Each trace collection
experiment was repeated multiple times at 2, 5.5 and 11 Mbps physical layer data rates
and at different times of day.
Table 1 provides some statistics of the traces collected for this study. The packet error
rate is computed as
\[ \text{pkt error rate} = \frac{\text{pkts received with one or more errors}}{\text{total received pkts}}. \]
As expected, the average packet error rate increases with an increase in the physical layer
data rate. In particular, the average packet error rate increases from approximately 10%
at 5.5 Mbps to almost 40% at 11 Mbps. Thus traditional higher layer protocols that drop
all corrupted packets (e.g., 802.11 MAC, UDP, TCP etc.) experience profound losses at
11 Mbps, and consequently there is room for considerable improvement. Since the
wireless receivers were placed at different locations, the receivers experienced different
packet error rates. The minimum and maximum error rates in Table 1 show that the
receivers were experiencing both good and bad link conditions.
Figure 2. Set up for collection of wireless bit-error traces. [Diagram labels: Room 1, Room 2, sender, 802.11b AP, Receiver 0 through Receiver 4, modified linux-wlan-ng drivers, bit-error traces.]
Table 1. Packet-Level Statistics at 2, 5.5 and 11 Mbps
Data rate   Average Packet Error Rate   Min Packet Error Rate   Max Packet Error Rate
2 Mbps      5.97%                       0.75%                   14.31%
5.5 Mbps    9.79%                       0.61%                   22.74%
11 Mbps     39.5%                       10.99%                  77.83%
In our initial experiments all wireless receivers maintained Line of Sight (LoS) with
the access point (AP). The AP was forced to transmit at 2, 5.5 and 11 Mbps for each
trace. It was observed that with clear LoS, the error-rate (at all bitrates) was extremely
low. Such excellent performance rendered further LoS study inconsequential. Hence, we
positioned the receivers in separate rooms to simulate a more realistic
business/classroom/home-network wireless setup as shown in Figure 2.
A.4.2 Empirical Analysis of 802.11b Bit-Errors
To maintain focus, throughout Chapters A.4 and A.5 we show results for two traces at
each physical layer data rate. These traces are collected at the same receiver under similar
conditions. The results for the remaining traces and receivers are similar.
A.4.2.1 Autocorrelation Analysis
The sample autocorrelations of 2 Mbps, 5.5 Mbps and 11 Mbps bit-error traces are
illustrated in Figure 3. Clearly, the correlation at 11 Mbps is higher than that at 2 and 5.5
Mbps. Let us first concentrate on the autocorrelation of 2 and 5.5 Mbps traces. It is clear
that the autocorrelation at both data rates is a decaying function, i.e., the level of temporal
dependence is decreasing with time. From the examples provided in Figure 3, we assume
that the memory-length is determined by the lag beyond which the normalized correlation
is less than 0.15, an empirically-determined threshold. We observed that in some traces
the correlation does not drop significantly below 0.15, even for very large lags.
However, in general the bit-errors exhibited rapidly decaying correlation as in Figure 3.
26
Extensive performance evaluation suggests that correlation of less than 15% does not
play a significant role in the error process characteristics.
Based on the threshold, the memory-lengths of the 5.5 Mbps traces of Figure 3 are 12
and 14 respectively. The correlation of both 2 Mbps traces drops below 0.15 at the lag of
16. Hence, we use memory-length 14 and 16 as the maximum order of the 5.5 and 2
Mbps Markov chains respectively. Since the memory-lengths of the 2 and 5.5 Mbps bit-
error processes are not very large as compared to the 11 Mbps process, high-order
Markov chains can appropriately model these processes.
Contrary to the correlations at 2 and 5.5 Mbps, Figure 3 clearly shows that the 11
Mbps bit-error process has high correlation even at large lags. This is reminiscent of
long-range dependence since a low-order memory-length cannot be identified for the 11
Mbps bit-error process. Consequently, Markov models cannot be used to model 11 Mbps
bit-errors.
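The threshold rule described above can be sketched in a few lines. This is a minimal illustration, not the code used in the thesis: the function names `sample_autocorrelation` and `memory_length` are hypothetical, and the memory-length is assumed to be the first lag at which the normalized correlation falls below the 0.15 threshold.

```python
import numpy as np

def sample_autocorrelation(x, max_lag):
    """Normalized sample autocorrelation of x for lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    return np.array([np.dot(centered[:-lag], centered[lag:]) / denom
                     for lag in range(1, max_lag + 1)])

def memory_length(acf, threshold=0.15):
    """First lag (1-based) whose correlation is below threshold, or None.

    Assumption: acf[0] corresponds to lag 1, as returned above.
    """
    below = np.flatnonzero(np.asarray(acf) < threshold)
    return int(below[0]) + 1 if below.size else None
```

A return value of `None` corresponds to the case noted in the text where the correlation never drops below the threshold within the examined lags, as for the 11 Mbps traces.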
[Figure 3 plot: sample autocorrelation (vertical axis) versus lag (horizontal axis, 1 to 40) for traces 2 Mbps 1, 2 Mbps 2, 5.5 Mbps 1, 5.5 Mbps 2, 11 Mbps 1 and 11 Mbps 2.]
Figure 3. Autocorrelation of bit-error traces.
A.4.2.2 Preliminary Empirical Analysis of FSM Chains
In accordance with the discussion in Section A.3.4, we represent the bit-error traces
as a binary series $\{x(i)\}_{i=1}^{l}$, where $x(i) = 1$ represents a bit-error and $l$ is the length of
the series. Also, for a memory-length of $k$, a full-state Markov (FSM) chain has states
corresponding to all possible $2^k$ combinations of $k$ consecutive bits. The complexity
(i.e., number of states) of the FSM chains increases exponentially with an increase in
memory-length. Previous studies employed low-order Markov chains [8], [14]–[17].
However, due to the present interest in capturing high-order behavior, we provide
analysis and modeling with high-order FSM chains.
For efficient and accurate representation of the transition probability data and to
reduce the complexity we examined the FSM transition probability matrices for bit-
patterns that never occur in the collected traces. We refer to such bit-patterns as the
unused states. These states result in all-zero columns in the transition probability matrix.
An all-zero column implies that the probability of jumping to that state from any state is
zero. While other methods for judicious selection of Markov states exist [67], discarding
unused states provides a simple and effective method of minimizing the model complexity.
The percentage of unused states for each order is shown in Figure 4. It can be
observed that the number of unused states grows as the order of the Markov chain is
increased. For example, in case of a $2^{12}$-state model, at 2 and 5.5 Mbps approximately
80% and 30% of the states are never used. We lay special emphasis on this observation since
the total number of states directly corresponds to the complexity of the model. All
following FSM results will employ the used states only. Here we recognize that the
number of unused states will decrease as the channel is observed for a significantly long
period of time, i.e., number of unused states is inversely proportional to the trace length.
However, and as substantiated by the FSM performance evaluation in later sections, FSM
chains perform quite reasonably without the unused states thereby implying that the
unused states do not play a significant role in overall channel characterization.
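As a rough illustration of how the unused-state percentage could be measured, the following sketch slides a $k$-bit window over a trace and counts which of the $2^k$ patterns never occur. The function names are hypothetical, and the sketch assumes that a state is "unused" exactly when its bit-pattern never appears in the trace.

```python
import numpy as np

def used_state_counts(bits, k):
    """Occurrences of each k-bit pattern seen by a bit-by-bit sliding window."""
    bits = np.asarray(bits, dtype=np.int64)
    weights = 2 ** np.arange(k - 1, -1, -1)          # most significant bit first
    windows = np.lib.stride_tricks.sliding_window_view(bits, k)
    states = windows @ weights                        # integer label of each window
    return np.bincount(states, minlength=2 ** k)

def unused_percentage(bits, k):
    """Percentage of the 2**k FSM states that never occur in the trace."""
    counts = used_state_counts(bits, k)
    return 100.0 * np.count_nonzero(counts == 0) / counts.size
```

For the short trace `[0, 0, 0, 0, 1, 0, 0, 0]` and `k = 2`, the pattern `11` never occurs, so one of the four states is unused.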
A.4.2.3 Long-Range Dependence in 11 Mbps Bit-Errors
The autocorrelation analysis in Section A.4.2.1 provided initial indications that the 11
Mbps bit-errors are LRD in nature. This section substantiates this preliminary notion of
LRD by analyzing the 11 Mbps bit-error process in further detail.
A.4.2.3.1 LRD Evaluation by Observing Energy at Different Scales
Since LRD processes typically demonstrate second-order self-similarity, zooming out
from a sample path of the process should yield a path similar to the original in second-
order statistics. As shown by (A.8), in order to determine scaling in a process, we can
define an aggregate process by dividing a bit-error trace of length l into non-overlapping
blocks of length m and averaging over each block. The resultant aggregate sample path
[Figure 4 plot: percentage of unused states (vertical axis) versus number of states (horizontal axis, logscale, 4 to 16384) for the 2 Mbps and 5.5 Mbps traces.]
Figure 4. Percentage of unused FSM states at 2 and 5.5 Mbps.
averages $m$ points from the actual sample path. Due to the $\{0,1\}$ representation of the
bit-errors, a level-$m$ aggregate process represents the normalized energy of bit-errors in
non-overlapping windows of size $m$. We henceforth use the terms aggregate process
and energy process synonymously. Aggregation smoothes high variances in the sample
path and provides an on-average zoomed-out version of the actual sample path. Thus
energy processes at different aggregation levels outline the impact of aggregation on the
short-term second moment of the process.
Figure 5 outlines three aggregate processes. The top figure is a process sample path
$X^{(1)}[k]$ outlining the unnormalized energy (i.e., the total number of errors) observed in
each packet (packet transmission time=1 second). The second figure is a level-4
aggregate of the first sample path which depicts the average energy observed in four
packets. Thus, the first point in this level-4 aggregate sample path is
$$X^{(4)}[1] = \frac{1}{4}\left(X^{(1)}[1] + X^{(1)}[2] + X^{(1)}[3] + X^{(1)}[4]\right).$$
[Figure 5 plots: four stacked sample paths labeled X(1), X(4), X(8) and X(16), each over indices 0 to 100.]
Figure 5. Aggregates of the 11 Mbps energy process at different time scales.
Similarly, the remaining two figures are aggregates at levels 8 and 16. Each aggregate
path is zooming out of the actual sample path and no statistically differentiating features
are revealed by simple observation. Thus it can be inferred that the decrease in variability
with increased smoothing is very slow. This slow-varying decay is further highlighted by
the analysis of second-order statistics in the following section.
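The block-averaging operation behind these aggregate paths can be sketched as follows. The function name `aggregate` is hypothetical, and the sketch assumes that any incomplete trailing block is simply dropped.

```python
import numpy as np

def aggregate(x, m):
    """Level-m aggregate: mean over non-overlapping blocks of length m.

    Samples left over after the last full block are discarded (an assumption).
    """
    x = np.asarray(x, dtype=float)
    n_blocks = x.size // m
    return x[:n_blocks * m].reshape(n_blocks, m).mean(axis=1)
```

Applied to a per-packet error-count series with `m = 4`, the first output point is the average of the first four input points, matching the level-4 example above.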
A.4.2.3.2 LRD Evaluation using Variance-Time Diagrams
Recall that for an LRD process, the variance $\mathrm{var}\,X^{(m)}$ of the aggregate process is
equal to $m^{2H-2}\,\mathrm{var}\,X^{(1)}$. Variance-time diagrams plot the logscale variance of the
aggregate process as a function of the logscale aggregation level. Second-order self-
similarity is implied if the logscale decay in the variance is strictly linear, that is, the
logarithm of the variance decreases in direct proportion to the logarithm of the
aggregation level. For an LRD process,
the Hurst parameter $H$ can then be estimated by fitting a least-squares line through the
plot. A stationary second-order self-similar process is said to be long-range dependent if
$1/2 < H < 1$.
The variance-time diagrams of two 11 Mbps bit-error traces for different aggregation
levels are given in Figure 6. Clearly, for both the traces under consideration, the variance
has a mostly linear decay with respect to the aggregation level. Least-squares lines of
order-1 are fit to the data points of the two variance-time diagrams. The slopes of the two
least-squares lines of Figure 6 (a) and (b) are $-0.3401$ and $-0.287$, respectively. In
accordance with the above discussion, an estimate of the Hurst parameter, $H$, can be
obtained by noting that the slope of the variance-time plot should be equal to $2H - 2$.
This results in Hurst parameter estimates of $H = 0.83$ and $H = 0.857$ for the two
traces. The two values of $H$ are quite close to each other because the two traces are
realizations of the same random process. Further, for both Hurst estimates the
$1/2 < H < 1$ condition is satisfied, thus implying that the 11 Mbps bit-error process is
long-range dependent. To further substantiate the LRD notion, in the following section we
provide LRD analysis using a frequency-domain estimator.
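A variance-time estimate of this kind can be sketched as below. This is an illustrative reconstruction rather than the thesis code: the function name and the choice of aggregation levels are assumptions, and base-10 logarithms are used (the slope does not depend on the base).

```python
import numpy as np

def hurst_variance_time(x, levels):
    """Estimate H from the slope of log(var(X^(m))) versus log(m).

    Returns (H, slope), using slope = 2H - 2, i.e. H = 1 + slope / 2.
    """
    x = np.asarray(x, dtype=float)
    log_m, log_var = [], []
    for m in levels:
        n_blocks = x.size // m
        agg = x[:n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        log_m.append(np.log10(m))
        log_var.append(np.log10(agg.var()))
    slope, _ = np.polyfit(log_m, log_var, 1)   # order-1 least-squares line
    return 1.0 + slope / 2.0, slope
```

With the slopes reported above, $H = 1 + (-0.3401)/2 \approx 0.83$ and $H = 1 + (-0.287)/2 \approx 0.857$. For an uncorrelated sequence the slope should be close to $-1$, giving $H \approx 0.5$.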
[Figure 6 plots: log(Var(X^(m))) (vertical axis) versus log(m) (horizontal axis, 0 to 3.5) for each of the two traces.]
(a) trace 11 Mbps 1: Hurst parameter estimate, $H = 0.83$
(b) trace 11 Mbps 2: Hurst parameter estimate, $H = 0.857$
Figure 6. Variance-time diagrams of two 11 Mbps bit-error traces.
A.4.2.3.3 LRD Evaluation using the Periodogram
A periodogram renders an estimate of the power spectral density of a process. The
periodogram is simply the square of the magnitude of the discrete-time Fourier
transformed samples. Mathematically, the periodogram of a discrete-time process $X_n$ is
given as:
$$I(\omega) = \frac{1}{2\pi N}\left|\sum_{k=1}^{N} X_k e^{-ik\omega}\right|^2, \qquad \text{(A.15)}$$
where $\omega$ is the frequency, $N$ is the total number of samples and $i = \sqrt{-1}$. Recall from
Section A.3.7 that the spectral density of an LRD process is proportional to $\omega^{1-2H}$ near
the origin, $\omega = 0$. Since the periodogram of (A.15) is an estimate of the spectral density,
a regression of the logarithm of the periodogram on the logarithm of the frequency $\omega$
should render an order-1 polynomial with a slope of $1 - 2H$. A frequency-domain
estimate of $H$ can thus be obtained by fitting an order-1 least-squares line through a log-log
plot of the periodogram versus the frequencies. In general, only the lowest 10% of the
frequencies of the periodogram are used for this estimation since the approximation only
holds true near the origin.
(a) trace 11 Mbps 1: Hurst parameter estimate, $H = 0.874$
(b) trace 11 Mbps 2: Hurst parameter estimate, $H = 0.877$
Figure 7. Logscale periodogram of two 11 Mbps bit-error traces.
The logscale periodograms of the two 11 Mbps traces are shown in Figure 7 (a) and
(b), respectively. The slopes of the least-squares lines fitted to these periodograms
yield Hurst parameter estimates of $H = 0.874$ and
$H = 0.877$ for the two traces. These estimates satisfy the $1/2 < H < 1$ condition, thus
substantiating that the 11 Mbps random process has long-range dependence. Further, note
that these estimates are quite close to the estimates rendered by the variance-time
diagrams of the last section.
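The frequency-domain estimator can be sketched in the same spirit. The function name is hypothetical; the sketch assumes the periodogram is evaluated at the Fourier frequencies $2\pi j/N$ with an FFT, skips the zero frequency, and regresses over the lowest 10% of the remaining frequencies.

```python
import numpy as np

def hurst_periodogram(x, fraction=0.10):
    """Estimate H from a log-log regression of the periodogram near the origin.

    Returns (H, slope), using slope = 1 - 2H, i.e. H = (1 - slope) / 2.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # Periodogram at the Fourier frequencies 2*pi*j/n, j = 0 .. n//2.
    power = np.abs(np.fft.rfft(x)) ** 2 / (2.0 * np.pi * n)
    omega = 2.0 * np.pi * np.arange(power.size) / n
    # Lowest `fraction` of the non-zero frequencies (at least two points).
    n_low = max(2, int(fraction * (power.size - 1)))
    slope, _ = np.polyfit(np.log10(omega[1:n_low + 1]),
                          np.log10(power[1:n_low + 1]), 1)
    return (1.0 - slope) / 2.0, slope
```

For white noise the spectrum is flat, so the fitted slope should be near zero and the estimate near $H = 0.5$; an LRD trace should yield a clearly negative slope and $H$ between 1/2 and 1.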
A.4.3 Accurate Modeling of 802.11b Bit-Errors
A.4.3.1 Bit-Error Modeling at 5.5 Mbps
The autocorrelation analysis in preceding sections outlined a maximum memory-
length of 14 for the 5.5 Mbps bit-error process. A memory-length of 14 corresponds to an
FSM chain with $2^{14} = 16384$ states. ENK-based performances¹ of FSM chains with
varying memory-lengths are given in Figure 8. FSM chains perform remarkably well for
the bad-bursts random variable. Note that even smaller order chains perform adequately
with the source coding overhead of less than or approximately equal to 0.03 for all cases.
However, the good-bursts random variable incurs significant overhead for smaller order
chains. For example, the two-state chain renders an overhead of approximately 0.5 and is
therefore not a viable option. For higher-order chains, the overhead decreases and drops
to a reasonable level, beginning at the 511-state model. Due to data over-fitting
considerations, we assume that any overhead less than 0.1 is acceptable. Thus we
conclude that all FSM chains of orders 9 and above render appropriate models for the 5.5
Mbps bit-error process.
¹ The terms performance and accuracy are used synonymously throughout this thesis.
A.4.3.2 Bit-Error Modeling at 2 Mbps
The performances of varying order FSM chains are shown in Figure 9. For both
random variables, small order FSM models incur profound overhead. For instance, for
the good-bursts random variable the overhead of the order-1 (two-state) chain is
approximately 0.8 . Although lower order FSM chains cannot model the bit-error process
effectively, as we move to higher order chains the overhead decreases substantially and
drops to a reasonable level. The 5.5 Mbps FSM results showed that the actual model
order can be smaller than what is outlined by the data correlation. Accordingly, in Figure 9 we only
provide analysis up to order 10, since the performance improvement saturates after the
order-10 (548-state) model.
It is clear from Figure 9 that low-order FSM chains incur significant ENK overhead
and hence are unable to capture the 2 Mbps bit-error behavior. For both random variables,
the FSM performance subsequently improves with an increase in the order of the model.
The accuracy of the order-10 FSM chains is comparable to the divergence between two
actual traces. Hence, we conclude that the 548-state FSM chain renders a good model of the 2
Mbps MAC layer bit-error process.
A.4.3.3 Bit-Error Modeling at 11 Mbps
In earlier sections, we revealed the LRD nature of the 11 Mbps bit-errors using
autocorrelation and scaling analyses. Based on the results from the last two sections, one
can conjecture that if high-order FSM simulations are performed, it might be possible to
identify an appropriate Markov process of an order lower than what is outlined by the
autocorrelation. However, ascertaining such a model order might require simulations with
[Figure 8 plots: ENK overhead (vertical axis) versus number of states (horizontal axis, 2 to 512) for FSM chains trained on 5.5 Mbps bit-error traces.]
(a) good-bursts (b) bad-bursts
Figure 8. Performances of varying order FSM chains for the 5.5 Mbps MAC layer bit-error process.
[Figure 9 plots: ENK overhead (vertical axis) versus number of states (horizontal axis, 2 to 1024) for FSM chains trained on 2 Mbps bit-error traces.]
(a) good-bursts (b) bad-bursts
Figure 9. Performances of varying order FSM chains for the 2 Mbps MAC layer bit-error process.
high-order FSM chains, which is computationally infeasible. In this section we show that
a multifractal wavelet model (MWM) captures the LRD characteristics of the 11 Mbps
bit-error process quite accurately. Although Markov models cannot capture the LRD
nature of bit-errors at 11 Mbps, we use Markov chains as a performance reference when
evaluating the performance of the MWM.
A.4.3.3.1 The Multifractal Wavelet Model
We used the MWM toolbox [73] to train an MWM. An actual 11 Mbps trace (i.e., a
bit-error sequence of zeros and ones) was used for MWM training. Various simulations
were performed with $\beta$, point-mass and hybrid $\beta$/point-mass probability distributions
for the $A_{j,k}$ random variables and Gaussian and log-normal distributions for the $U_{j_0,k_0}$
random variable. We observed that the performance of the MWM was quite insensitive to
the choice of probability distribution used to capture the MWM random variables.
For brevity we only report results for the $\beta$ distribution.
A.4.3.3.2 ENK-based Performance Evaluation
A cautionary note is in order before we proceed with ENK-based performance
evaluation of the MWM at 11 Mbps. Due to its reliance on entropy, the ENK divergence
compares the skew of the probability distributions, but does not place much emphasis on
the second-order statistics (e.g., energy, variance etc.) of the distributions. The MWM (or
for that matter any model of LRD data), on the other hand, is designed to capture scaling
phenomena (and the consequent second-order statistics) of an LRD random process. Thus
for an LRD process, comparison only using ENK of good- and bad-burst distributions
can be misleading, and ENK divergence by itself cannot render an appropriate measure
to completely quantify the MWM performance. In addition to ENK, it is imperative that
second-order statistics of the random process be compared with the model. We perform
such second-order performance evaluation in the subsequent sections.
The ENK-based performances of the MWM and FSM chains are tabulated in Table 2,
where FSM($x$) represents an FSM chain with $x$ states. The good-bursts ENK overhead
of the MWM is lower than that of the 16-state FSM chain, while its bad-bursts overhead is
higher than that of the 16-state FSM chain. The MWM ENK overhead is slightly worse than that of
the 4096-state FSM chain. For instance, for the first actual trace the MWM’s ENK good-bursts
divergence is $0.127 - 0.091 = 0.036$ more than the 4096-state FSM. For the same
example, the bad-bursts ENK overhead of MWM is $0.093 - 0.00094 = 0.09206$ more
than the 4096-state FSM.
Table 2. Performance of MWM and FSM for the 11 Mbps Bit-Error Process
                           good-bursts   bad-bursts
ENK(trace1, trace2)        0.0586        0.00032
ENK(trace1, MWM)           0.127         0.093
ENK(trace1, FSM(16))       0.174         0.002
ENK(trace1, FSM(4096))     0.091         0.00094
ENK(trace2, MWM)           0.143         0.096
ENK(trace2, FSM(16))       0.189         0.002
ENK(trace2, FSM(4096))     0.088         0.0017
The slightly superior performance of the FSM chains is due to the fact that the FSM
model is extremely adept at capturing the short-term correlation structure of the random
process. This short-term dependence arises from small bursts of good and bad bits.
Such small bursts are quite prevalent even in an LRD process such as the present 11
Mbps bit-error process. To substantiate this claim, we show the small burst probabilities
of the good- and bad-bursts random variables in Figure 10. Note that burst-lengths of 1,
2, 3, 4 and 5 constitute 78.35% and 98.03% of the probability space of the good- and
bad-bursts random variables, respectively. This small burst behavior is very adequately
[Figure 10 plots: burst probability (vertical axis) versus burst length (horizontal axis, 0 to 50).]
(a) good-bursts (b) bad-bursts
Figure 10. Probability mass functions for good- and bad-bursts random variables derived from an 11 Mbps trace. (Only the probabilities of small bursts are shown here.)
[Figure 11 plots: energy processes of the 11 Mbps trace, the MWM, the 16-state FSM chain and the 4096-state FSM chain.]
(a) aggregation level=8 (b) aggregation level=256
Figure 11. Energy processes of actual and synthesized bit-error traces.
characterized by an FSM chain. Since the skew of both probability distributions is
dictated by these highly probable small bursts, the ENK overhead of the FSM chains is
quite low, although FSM chains cannot capture the long-term process correlation. The
skew-oriented bias of the ENK measure masks the long-term correlation properties of a
random process, which is exhibited in the spread (i.e., the variance) of the probability
distribution. We henceforth focus solely on second-order analysis of the models under
consideration.
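The good- and bad-burst distributions discussed above can be extracted from a binary trace with a simple run-length pass. The sketch below is illustrative only: the function name is hypothetical, and it assumes that a good burst is a maximal run of 0s (error-free bits) and a bad burst is a maximal run of 1s (bit-errors).

```python
from collections import Counter
from itertools import groupby

def burst_pmfs(bits):
    """Return (good_pmf, bad_pmf): dicts mapping burst length -> probability."""
    runs = {0: Counter(), 1: Counter()}
    for value, group in groupby(bits):
        runs[value][sum(1 for _ in group)] += 1   # length of this maximal run

    def normalize(counter):
        total = sum(counter.values())
        return {length: count / total
                for length, count in sorted(counter.items())}

    return normalize(runs[0]), normalize(runs[1])
```

The share of the probability space taken by short bursts, such as the 78.35% and 98.03% figures quoted above for lengths 1 to 5, would then be `sum(pmf.get(n, 0.0) for n in range(1, 6))` for each distribution.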
A.4.3.3.3 Performance in Capturing Energy at Different Scales
We first consider energy in non-overlapping windows of the bit-error traces. As
mentioned previously, the definition of energy (as given in (A.8) and explained in prior
sections) outlines the second moment of the random process in short-term windows. Two
examples of an energy process derived from an actual source and energy processes
synthesized using the MWM, the 16-state FSM chain and the 4096-state FSM chain are
illustrated in Figure 11. It can be observed that the FSM chains project overly pessimistic
energy estimates (i.e., very high error rates), whereas the MWM in general has less
energy per window than the actual 11 Mbps bit-error process. By simple observation, it
can be deduced that the MWM captures the energy characteristics of the 11 Mbps bit-
error process better than the Markov chains. In the next section, we compare the
aggregated variance-time behavior of the FSM chains and the MWM.
A.4.3.3.4 Performance in Capturing the Variance-Time Characteristics
In this section, we evaluate the accuracy of the MWM and the FSM chains in
modeling the decay of aggregated variances. Figure 12 shows the variance-time behavior
of FSM chains. Clearly, the FSM chains can capture the short-term correlations of the
random process with outstanding accuracy as shown in the top-left corner of Figure 12.
However, the performance degrades sharply as the dependence (i.e., linear decay of the
variance) persists at higher scales. Not surprisingly, more and more correlation is
captured as we increase the memory-length of FSM chains. Thus if the complexity of an
FSM chain that captures all the scales present in the data can be afforded then such an
FSM chain can render a highly accurate model.
Unfortunately, in practical LRD processes the correlation typically persists at very
high scales. In such a case, a model that is designed to capture the correlation structure at
different scales (e.g., the MWM) is more suitable than FSM chains. This observation is
outlined in Figure 13 (a), which shows that the MWM can capture the decay of variance
of the 11 Mbps bit-error process quite accurately within an additive constant. The phenomenon of a
model’s inability to capture the exact variance values is well-known in LRD literature. It
has been diagnosed that this problem arises because of non-stationarities introduced by
jumps in the mean and slow decaying trends [74]. (The jumps in the mean of the 11 Mbps
bit-error process can be easily observed in Figure 11.) Teverovsky and Taqqu [74]
proposed to eliminate this problem by fitting the function $C_1 + C_2 m^{2H-2}$ to the
variance-time diagram of the LRD process. The corrective factors, $C_1$ and $C_2$, can then
be added to the variances produced by a model.
In the present problem, the corrective factors were $C_1 = 0$ and $C_2 = 3.71$.
The variance-time diagram of the MWM with the corrective factors is given in Figure 13 (b).
Clearly, the corrected MWM captures the decay in the variance quite accurately. Thus we
deduce that MWM is an effective model of the long-range dependence present in the 11
Mbps bit-error process.
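One simple way to obtain corrective factors of this form is sketched below. It is a simplification and not necessarily the procedure of [74] or of this thesis: the function name is hypothetical, and the Hurst parameter is assumed to be already known and held fixed, so that fitting $C_1 + C_2 m^{2H-2}$ reduces to ordinary linear least squares in $C_1$ and $C_2$.

```python
import numpy as np

def fit_corrective_factors(levels, variances, hurst):
    """Least-squares fit of variances ~ C1 + C2 * m**(2H - 2) with H fixed."""
    m = np.asarray(levels, dtype=float)
    design = np.column_stack([np.ones_like(m), m ** (2.0 * hurst - 2.0)])
    target = np.asarray(variances, dtype=float)
    (c1, c2), *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(c1), float(c2)
```

Here `levels` are the aggregation levels $m$ and `variances` are the corresponding aggregated variances (not their logarithms).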
A.4.4 Discussion
In this chapter, we proposed accurate models of MAC layer bit-error channels at 2,
5.5 and 11 Mbps data rates of an 802.11b LAN. While the MWM for 11 Mbps bit-
errors has linear complexity in synthesizing and predicting bit-error behavior, the FSM
chain model’s complexity increases exponentially with respect to the memory-length.
[Figure 12 plot: log(Var(X^(m))) (vertical axis) versus log(m) (horizontal axis, 0 to 4) for the 4-state, 16-state and 4096-state FSM chains and the 11 Mbps trace.]
Figure 12. Variance-time diagrams of varying order FSM chains for the 11 Mbps bit-error process.
[Figure 13 plots: log(Var(X^(m))) (vertical axis) versus log(m) (horizontal axis, 0 to 4) for the 11 Mbps trace and the MWM.]
(a) without corrective factors (b) with corrective factors
Figure 13. Variance-time diagrams of the MWM for the 11 Mbps bit-error process.
The following chapter reduces the exponential complexity by approximating FSM
chains’ behavior using low-complexity models.
CHAPTER A.5 COMPLEXITY REDUCTION FOR MARKOV CHANNELS
Most benefits of a wireless MAC layer channel model can be realized if the model is
able to provide real-time and online channel characterization and prediction. In
complexity- and power-constrained wireless and mobile environments, such channel
characterization is only possible with a low-complexity model. Despite some recent
interest in reducing the complexity of wireless models [20]–[29], development of
accurate, pragmatic and low-complexity wireless channel models is still an open
problem. Since low-complexity models have not been thoroughly explored and verified
for contemporary wireless and mobile networks, many of the protocols, applications and
systems mentioned in Chapter A.1 have not been realized in practical wireless systems.
The number of states of an FSM chain is an exponential function of the random
process’ memory-length: $2^k$ states for a process with a memory-length of $k$ bits. This
phenomenon is commonly referred to as state explosion. Due to state explosion, although
FSM chains can provide accurate models of wireless bit-errors, their high complexity
renders them impractical for realistic wireless environments.
To reduce FSM chains’ complexity, in this chapter we first consider hierarchical [18]
and hidden [75] Markov models. We observe that these models cannot accurately
characterize the bit-error channels under consideration. Consequently, we focus on
directly approximating FSM chain behavior. We first make insightful observations about
underlying characteristics of an FSM model. As a first direct approximation model, we
derive and evaluate a new class of lumped Markov chains [59]. However, we observe that
lumped Markov chains are also unable to approximate the behavior of FSM chains.
Finally, we analyze how FSM chains capture good- and bad-burst behavior of
wireless channels. Using this analysis, we derive important guidelines for the realization
of accurate, effective and low-complexity models. These guidelines lead to a
constant-complexity model (CCM) that always comprises five states irrespective of the
memory-length. We show that the performance of the 5-state CCM in modeling of the 2 and 5.5
Mbps MAC layer bit-error channels is comparable to exponential-complexity FSM
chains and better than linear-complexity models [29].
A.5.1 The Hierarchical Markov Model
The hierarchical Markov model (hMM) is based on the observation that error traces
exhibit clear delineation between highly bursty error regions and relatively low error
regions. Therefore, in an hMM [18], severe- and low-burst regions are identified in the
bit-error traces. Each of the burst states has an embedded two-state Markov model as
depicted in Figure 14. One of the challenges of the hMM model is the delineation of the
high-level severe- and low-burst states. The work in [18] employed a state demarcation
heuristic to delineate the low- and severe-burst states in the error traces. The state
demarcation heuristic relied on two empirically determined thresholds to transit between
severe- and low-burst states. One of the thresholds, say threshold 1, determines whether
or not a burst of bad bits is a small isolated burst between mostly good bits. The other
threshold, say threshold 2, ascertains the number of good bits which can characterize the
end of a long/severe burst of bad bits. Small thresholds can make the process transit
erratically between the high-level states, whereas large thresholds can unnecessarily
increase the sojourn time spent in a high-level state. Unfortunately, there is no good
method of determining the best values of these thresholds, and heuristic experimentation
is needed to find reasonably accurate values.
Table 3 outlines the ENK-based performance of the hMM for varying values of the
two thresholds. The ENK divergence of the actual traces [row 1 of Table 3] provides a
[Figure 14 diagram: two high-level states, a low-burst state and a severe-burst state, each embedding a two-state (bad/good) Markov chain with transition probabilities Pbb1, Pbg1, Pgb1, Pgg1 and Pbb2, Pbg2, Pgb2, Pgg2, respectively.]
Figure 14. The hierarchical Markov model (hMM) [18].
Table 3. Performance of the hMM for 5.5 Mbps Bit-Error Process
                                               good-bursts   bad-bursts
ENK(trace1, trace2)                            0.0086        0.0029
ENK(trace1, hMM), threshold1=threshold2=10     0.621         0.009
ENK(trace2, hMM), threshold1=threshold2=10     0.68          0.0106
ENK(trace1, hMM), threshold1=threshold2=25     0.62          0.009
ENK(trace2, hMM), threshold1=threshold2=25     0.676         0.0111
ENK(trace1, hMM), threshold1=threshold2=50     0.587         0.009
ENK(trace2, hMM), threshold1=threshold2=50     0.594         0.013
ENK(trace1, hMM), threshold1=threshold2=100    0.624         0.009
ENK(trace2, hMM), threshold1=threshold2=100    0.682         0.011
reference value for performance evaluation of the hMM. Irrespective of
the threshold values, the hMM always incurs a high overhead of more than 0.58 for the
good-bursts random variable. The bad-bursts random variable usually takes small values
since most of the bits are not corrupted. From Table 3, it can be seen that for the bad-
bursts random variable, the ENK distance between the hMM and actual traces is quite
small for all thresholds. This ENK overhead is nevertheless several times larger than the ENK
divergence between the actual traces. From these results, we conclude that the hMM
cannot capture the present MAC layer wireless bit-errors.
A.5.2 The Hidden Markov Model
To apply hidden Markov models (HMMs) to this problem, we need discriminative
statistical features that can be used to train the HMM. After much experimentation, we
found that bit-error energy in non-overlapping windows can serve as an effective
discriminative feature for the low and severe bit-error conditions. Due to the present
$\{0,1\}$ representation of bit-errors, the error-rate simply corresponds to the energy process
defined in Section A.4.2.3.1 [equation (A.8)] with the window size representing the
aggregation level. We use the bit-error energy as input to the HMM’s Baum-Welch
forward-backward training algorithm [76], [77].
We ran simulations for varying window sizes and for varying numbers of HMM states.
Table 4 enumerates performances of three HMMs; similar trends were observed for other
HMM experiments. Note that the HMM performance is quite sensitive to the window
length. For instance, the HMM over a 1000-bit window performs far worse
than the 2000-bit HMM, even though the 2000-bit HMM has fewer
states. In general, the HMM performance improved with an increase in
window size. This improvement, however, saturated once the window size became
greater than the packet size.
Comparing the good-bursts column of Table 4 with Table 3 reveals that the HMMs
with 3 and 8 states yield better good-bursts performance than the hMM. However, the
ENK values for the bad-bursts random variable in all the HMM cases are orders of
magnitude greater than the respective values for the hMM traces. Hence, we conclude
that, while the HMM improves the modeling of good-bursts, it does not model the bad-
bursts adequately. Thus the overall performance of HMM modeling for the experimental
error traces is unsatisfactory.
The poor performance of HMMs arises because, unlike other problem areas where well-
defined characteristics of the input data are available for training, the bit-error traces of
this study do not provide robust training features. Furthermore, the HMM assumes that
the time spent in a state is exponentially distributed, which is not an
accurate assumption for wireless bit-errors. This assumption of exponentially distributed
state sojourn times results in inaccurate HMM parameterization.
Table 4. Performance of the HMM for the 5.5 Mbps Bit-Error Process
                                                    good-bursts   bad-bursts
ENK(trace1, HMM), window=2000 bits, HMM states=3    0.403         0.685
ENK(trace2, HMM), window=2000 bits, HMM states=3    0.409         0.731
ENK(trace1, HMM), window=1000 bits, HMM states=5    2.466         2.408
ENK(trace2, HMM), window=1000 bits, HMM states=5    2.406         2.562
ENK(trace1, HMM), window=4096 bits, HMM states=8    0.109         0.175
ENK(trace2, HMM), window=4096 bits, HMM states=8    0.114         0.179
Since the hierarchical and hidden Markov models cannot capture the bit-error
process, we now focus on directly approximating FSM chains’ behavior. This direct FSM
approximation will be performed by aggregating FSM chain states.
A.5.3 FSM Observations
In this section, we first state two intuitive observations regarding FSM chains. These
observations are used in subsequent sections to derive important characteristics of FSM
chains. It is important to outline here how we intend to approximate FSM chains. The
approximate models of this thesis are developed by creating partitions of the FSM chain
state space. All FSM states in a partition are then simply aggregated/grouped into a single
aggregate state of the approximate process. Hence, this section mainly addresses the
following question: How should one define partitions on the FSM state space such that
the resulting aggregate process accurately approximates the underlying FSM chain? In
other words, we are trying to find out which FSM states can be aggregated together
without compromising the FSM model’s performance.
A.5.4 Observations about FSM Chains
The first observation is a direct consequence of the binary nature of the present
wireless bit-error process:
Observation 1. If a bit-by-bit sliding window is used to compute the transition probabilities of a $2^k$-state FSM chain, then from a current state, $X_n = i$, in one transition the FSM chain can transit to only two possible states given by:
$$X_{n+1} = \begin{cases} (2i) \bmod 2^k, \\ (2i+1) \bmod 2^k, \end{cases} \tag{A.16}$$
where $k$ is the memory-length of the FSM chain and $i \in \{0, 1, \ldots, 2^k - 1\}$ is an arbitrary state in the FSM chain’s state space.
An example given in Figure 15 demonstrates this observation. A memory-length of four, $k = 4$, is used in this example, so the set of all possible FSM states is $\{0, 1, 2, \ldots, 2^4 - 1 = 15\}$. The current state is $X_n = (0110)_2 = (6)_{10}$ and as the window slides by one bit, the 0 in the most significant bit position will be dropped and a bit will be added to the least significant bit position. Since the data are binary, the chain can transit to either $(1100)_2 = (12)_{10}$ or $(1101)_2 = (13)_{10}$. Thus in essence Observation 1 implies that at each slide of the memory-window the process’ current state $i$ is subjected to three operations: a left-shift by one bit which yields $2i$, followed by an addition of a zero ($2i + 0 = 2i$) or an addition of a one ($2i + 1$) at the least significant bit (LSB) position, followed by a modulus operation which ensures that if the current state of the process is $X_n = 2^{k-1}$ then the next state wraps around to state 0 (for $X_{n+1} = 2i$) or state 1 (for $X_{n+1} = 2i + 1$). For instance, in the preceding example with $k = 4$, if $X_n = (1000)_2 = (8)_{10} = 2^{k-1}$ then the next state will be either $X_{n+1} = (2 \times 8) \bmod 2^4 = (0000)_2$ or $X_{n+1} = (2 \times 8 + 1) \bmod 2^4 = (0001)_2$. Since each FSM state has two transition possibilities, each row of the FSM transition probability matrix will have at most two non-zero entries, given by $p_{i,(2i) \bmod 2^k}$ and $p_{i,(2i+1) \bmod 2^k} = 1 - p_{i,(2i) \bmod 2^k}$.
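The shift, append and modulus steps of Observation 1 can be sketched in a few lines of Python. This is only an illustration: the function name `next_states` is an assumption introduced here, and states are taken to be integers in the range 0 to 2^k - 1 as in the text.

```python
from typing import Tuple


def next_states(i: int, k: int) -> Tuple[int, int]:
    """Return the two states reachable from state i of a 2**k-state FSM chain.

    Hypothetical helper (not from the thesis): left-shift the window (2*i),
    append a 0 or a 1 at the LSB, then wrap around with mod 2**k.
    """
    size = 2 ** k
    if not 0 <= i < size:
        raise ValueError("state must lie in 0 .. 2**k - 1")
    return (2 * i) % size, (2 * i + 1) % size


# The k = 4 example from the text: state 6 moves to 12 or 13,
# and state 8 = 2**(k-1) wraps around to 0 or 1.
print(next_states(6, 4))
print(next_states(8, 4))
```

Because every state has exactly these two successors, a transition matrix built this way has at most two non-zero entries per row.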
Intuitively, one can claim that the number of error-free bits received over any
reasonable wireless channel should be much more than the number of corrupted bits. The
second observation stated below formulates this claim in terms of FSM chain parameters:
Observation 2. The steady-state probability of state 0 of an FSM chain for wireless channels is much greater than the steady-state probabilities of all other states,
$$\pi_0 \gg \sum_{j=1}^{2^k-1} \pi_j, \tag{A.17}$$
where $k$ represents the memory-length and $\pi_i$ represents the steady-state probability of being in state $i$ of the FSM chain.
[Figure 15. Transition possibilities for an FSM chain (memory-length, $k = 4$). The sliding window over the bit sequences 0 1 1 0 0 and 0 1 1 0 1 takes $X_n = 6$ to $X_{n+1} = 2 \times 6 = 12$ or $X_{n+1} = 2 \times 6 + 1 = 13$, respectively.]
Table 5. Empirical Evidence in Support of Observation 2
| | 2 Mbps | 5.5 Mbps |
|---|---|---|
| $\pi_0$ | 0.997 | 0.974 |
The above observation implies that the mean-time spent in state 0 of the FSM chain
(i.e., the state with no errors) is much greater than the mean-time spent in all other states.
It can be intuitively argued that this observation holds for real-life wireless channels. For
instance, Table 5 gives the steady-state probabilities of the 802.11b 2 Mbps bit-error
FSM chain of order 10 and the 5.5 Mbps bit-error FSM chain of order 9. Since the
steady-state probability of staying in the good (all-zero) FSM state is very close to one
for both the channels shown in Table 5, we can safely claim that Observation 2 holds for
the wireless channels currently under consideration.
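A minimal sketch of how empirical quantities like those in Table 5 could be estimated from a bit-error trace, assuming the bit-by-bit sliding window of Observation 1. The function name `fsm_statistics`, the list-of-0/1 trace format and the toy trace are assumptions made for illustration; they are not the thesis' tooling or data.

```python
from collections import Counter


def fsm_statistics(bits, k):
    """Estimate state occupancy and transition probabilities of a 2**k-state FSM.

    bits: sequence of 0/1 bit-error flags (1 = corrupted bit); hypothetical format.
    Returns (pi, p) where pi[s] is the fraction of transitions that start in
    state s and p[(s, t)] is the empirical probability of moving from s to t.
    """
    if len(bits) <= k:
        raise ValueError("trace must be longer than the memory-length")
    size = 2 ** k
    state = 0
    for b in bits[:k]:                      # fill the first memory-window
        state = (2 * state + b) % size
    visits, moves = Counter(), Counter()
    for b in bits[k:]:                      # slide the window one bit at a time
        nxt = (2 * state + b) % size
        visits[state] += 1
        moves[(state, nxt)] += 1
        state = nxt
    total = sum(visits.values())
    pi = {s: c / total for s, c in visits.items()}
    p = {(s, t): c / visits[s] for (s, t), c in moves.items()}
    return pi, p


# Toy trace with a single bit-error: the all-zero state dominates the occupancy.
pi, p = fsm_statistics([0] * 20 + [1] + [0] * 20, k=2)
print(pi[0], p[(0, 1)])
```

On a mostly error-free trace the estimate of the state-0 occupancy is close to one, which is the behavior Observation 2 describes.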
A.5.5 Markov Chain Lumpability
We first evaluate direct applicability of the well-known Markov chain lumpability
technique [59] to the wireless modeling problem under investigation. Chen et al. [24],
[25] showed that on some wireless channels lumpability might be a viable option for
reducing channel modeling complexity. We specialize the general definition of
lumpability to the binary FSM case using the observations made in the previous section.
A.5.5.1 Lumpability for Wireless Bit-Error Channels
Let the state space of an FSM with $2^k$ states be given as $H = \{0, 1, \ldots, 2^k - 1\}$. Now consider a new process with state space $S = \{S_0, S_1, \ldots, S_{N-1}\}$, where $N \leq 2^k$. Let the FSM states belonging to $H$ be disjointly distributed among states of the new process. In other words, each element of $S$ is in turn a set containing one or more FSM states and is henceforth referred to as an aggregate state. If we impose a condition that an FSM state cannot exist in two different aggregate states simultaneously then the set $S$ constitutes a partition of the FSM state space.
Before proceeding further, we employ Observation 1 to prove a necessary condition
for defining partitions of the FSM state space. This condition is stated as a lemma.
LEMMA 1. The next state in an aggregate process can be accurately determined only if the FSM states $(2i) \bmod 2^k$ and $(2i+1) \bmod 2^k$ do not belong to the same aggregate state,
$$(2i) \bmod 2^k \in S_j \Rightarrow (2i+1) \bmod 2^k \notin S_j, \tag{A.18}$$
where $k$ is the memory-length, $i \in H$ and $S_j \in S$.
Proof: Lemma 1 is easily proven by contradiction. In essence, this lemma implies that both transition possibilities of an FSM state cannot be aggregated in a single state. As mentioned in Observation 1, $(2i) \bmod 2^k$ and $(2i+1) \bmod 2^k$ are the only possible transitions for FSM state $i$. Let there exist an aggregate state $S_j$ that contains both FSM states $(2i) \bmod 2^k$ and $(2i+1) \bmod 2^k$. Also, let $S_q$ represent an aggregate state that contains FSM state $i$. Then $p_{S_q,S_j}$ does not give any information about whether a good- or a bad-bit should be added to the memory-window. $\blacksquare$
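The condition of Lemma 1 is easy to test mechanically for a candidate partition. The sketch below is illustrative only: `satisfies_lemma1` is a hypothetical helper, and a partition is assumed to be given as a list of lists of integer FSM states.

```python
def satisfies_lemma1(partition, k):
    """Check that no aggregate state holds both (2i) mod 2**k and (2i+1) mod 2**k.

    partition: list of lists of FSM states (assumed disjoint and covering
    0 .. 2**k - 1); this representation is an assumption of the sketch.
    """
    size = 2 ** k
    for block in partition:
        members = set(block)
        for i in range(size):
            if (2 * i) % size in members and (2 * i + 1) % size in members:
                return False
    return True


# For k = 2, {0, 2} / {1, 3} keeps the two successors of every state apart,
# whereas {0, 1} / {2, 3} puts both successors of state 0 in one aggregate state.
print(satisfies_lemma1([[0, 2], [1, 3]], 2))
print(satisfies_lemma1([[0, 1], [2, 3]], 2))
```

Since $(2i) \bmod 2^k$ is always even and $(2i+1) \bmod 2^k$ is the next odd number, any partition whose blocks are purely even or purely odd passes this check.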
Let $p_{i,S_j} = \sum_{k \in S_j} p_{i,k}$, then $p_{i,S_j}$ represents the probability of moving from FSM state $i$ to aggregate state $S_j$ in one step of the FSM chain. Given Lemma 1 and using Observation 1, $p_{i,S_j}$ can be written as
$$p_{i,S_j} = \begin{cases} p_{i,(2i) \bmod 2^k}, & (2i) \bmod 2^k \in S_j, \\ 1 - p_{i,(2i) \bmod 2^k}, & (2i+1) \bmod 2^k \in S_j, \\ 0, & \text{otherwise.} \end{cases}$$
An FSM chain is lumpable [59] with respect to a partition if for every choice of an
FSM chain starting vector the lumped process is a Markov chain and the transition
probabilities do not depend on the choice of the FSM starting vector. A process is said to
be weakly lumpable [59] with respect to a partition if at least one starting vector leads to a
Markov chain.
The strong lumpability theorem [59] is stated as:
THEOREM 1. A necessary and sufficient condition for an FSM to be lumpable with respect to a partition $S = \{S_0, S_1, \ldots, S_{N-1}\}$ is that for each pair of aggregate sets $S_i$ and $S_j$, $p_{i,S_j}$ has the same value for every FSM state in $S_i$.
See [59] for proof of a general case of this theorem.
The strong lumpability condition asserts that all FSM states belonging to an aggregate state should have the same probability of moving to any given aggregate state. We illustrate it using an example. Figure 16 shows two aggregate states, $S_i = \{n, m\}$ and $S_j = \{(2n) \bmod 2^k, (2m) \bmod 2^k\}$, where for ease of notation aggregate set $\{(2n) \bmod 2^k, (2m) \bmod 2^k\}$ is written as $\{2n, 2m\}$. As outlined in Observation 1, FSM state $2n$ represents one of the two transition possibilities of FSM state $n$. The probability of this transition is denoted as $p_{n,2n}$ in Figure 16. Similarly, FSM state $m$ can move to FSM state $2m$ in one transition and this probability is denoted as $p_{m,2m}$. The overall probability of moving from aggregate state $S_i$ to aggregate state $S_j$ is given as $p_{S_i,S_j}$. For this example, the lumpability condition requires that $p_{S_i,S_j} = p_{n,2n} = p_{m,2m}$.
Since accurate wireless modeling necessitates the derivation of the Markov model
parameters from traces collected over an actual network, it is virtually impossible to
guarantee that the consequent FSM chain will have a transition probability matrix that
strongly or weakly satisfies the lumpability condition. (This assertion can be easily
verified by considering any of the real-life traces collected over actual wireless MAC
channels.) We hence deduce that lumpability in its precise form is not generically
applicable to wireless channel modeling.
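To make the difficulty concrete, the strong lumpability condition of Theorem 1 can be checked numerically for a given transition matrix and partition. The sketch below is an illustration under stated assumptions: `is_strongly_lumpable` is a hypothetical name, the matrix is a list of row lists, and the probabilities are made-up values rather than measurements.

```python
def is_strongly_lumpable(P, partition, tol=1e-9):
    """Check Theorem 1: within each block, every state must have the same
    total probability of moving into each block of the partition.

    P: square transition matrix as a list of rows; partition: list of lists
    of state indices. Both representations are assumptions of this sketch.
    """
    for target in partition:
        for block in partition:
            sums = [sum(P[s][t] for t in target) for s in block]
            if max(sums) - min(sums) > tol:
                return False
    return True


# 4-state FSM (memory-length 2) with illustrative probabilities.
P_equal = [[0.9, 0.1, 0.0, 0.0],
           [0.0, 0.0, 0.3, 0.7],
           [0.9, 0.1, 0.0, 0.0],
           [0.0, 0.0, 0.3, 0.7]]
P_trace_like = [[0.9, 0.1, 0.0, 0.0],
                [0.0, 0.0, 0.3, 0.7],
                [0.8, 0.2, 0.0, 0.0],   # differs from row 0, as real traces would
                [0.0, 0.0, 0.3, 0.7]]
partition = [[0, 2], [1, 3]]
print(is_strongly_lumpable(P_equal, partition))
print(is_strongly_lumpable(P_trace_like, partition))
```

Probabilities estimated from a measured trace will almost never agree to within a tight tolerance, which is the practical obstacle described above.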
The above discussion motivates a new question: Can we somehow relax the lumpability condition so that it is more readily applicable to the wireless modeling problem under investigation? The following section tackles this question.
A.5.5.2 Folded Markov Chains
The lumpability condition is too stringent to be enforced on wireless models. In this
section, we modify an FSM chain’s state transition probabilities such that the modified
chain can be divided into two equal-sized partitions that satisfy the strong lumpability
condition. We show that this state aggregation procedure can be applied recursively to a
transition probability matrix to achieve a desired level of complexity. Then we use the
802.11b MAC layer bit-error channel for empirical performance evaluation of this new
class of models.
We first note that to reach FSM states $2i$ and $2i + 1$ for $0 \leq i \leq 2^{k-1} - 1$ in a single transition, the current state of the FSM chain should be either state $i$ or state $(2^{k-1} + i)$. In other words, the following pairs of FSM states have the same set of next possible states:
$$\{0, 2^{k-1}\}, \{1, 1 + 2^{k-1}\}, \ldots, \{2^{k-1} - 1, 2^{k-1} - 1 + 2^{k-1} = 2^k - 1\}. \tag{A.19}$$
[Figure 16. Aggregate states $S_i$ and $S_j$ containing FSM states $\{n, m\}$ and $\{2n, 2m\}$, respectively. The transitions $n \to 2n$ and $m \to 2m$ are labelled $p_{n,2n}$ and $p_{m,2m}$; the aggregate transition is labelled $p_{S_i,S_j}$.]
For instance, a 4-state (i.e., memory-length of 2 bits) FSM transition probability matrix is given by
$$\begin{bmatrix} p_{0,0} & p_{0,1} & 0 & 0 \\ 0 & 0 & p_{1,2} & p_{1,3} \\ p_{2,0} & p_{2,1} & 0 & 0 \\ 0 & 0 & p_{3,2} & p_{3,3} \end{bmatrix}.$$
We can see that states 0 and 2 have the same one-step transition possibilities since in
one transition both these states can either transit to state 0 or state 1 . Now if the
probability of transiting to state 0 is the same from both states 0 and 2 then these states
will satisfy the lumpability condition and hence can be aggregated together. Similarly,
states 1 and 3 have the same transition possibilities. If the probability of transiting to
state 2 is the same for both states 1 and 3 then they can be aggregated together. That is,
if the above conditions are satisfied then the FSM chain can be lumped with respect to partitions $S_0 = \{0, 0 + 2^{2-1} = 2\}$ and $S_1 = \{1, 1 + 2^{2-1} = 3\}$, thereby giving the following transition probability matrix:
$$\begin{bmatrix} p_{S_0,S_0} & p_{S_0,S_1} \\ p_{S_1,S_0} & p_{S_1,S_1} \end{bmatrix}.$$
Based on the observation that the state pair $(i, 2^{k-1} + i)$ has the same one-step transition possibilities, we propose to modify an FSM chain’s transition probability matrix as follows:
$$\hat{p}_{i,(2i) \bmod 2^k} = \hat{p}_{2^{k-1}+i,(2i) \bmod 2^k} = \frac{p_{i,(2i) \bmod 2^k} + p_{2^{k-1}+i,(2i) \bmod 2^k}}{2}$$
and
$$\hat{p}_{i,(2i+1) \bmod 2^k} = \hat{p}_{2^{k-1}+i,(2i+1) \bmod 2^k} = 1 - \hat{p}_{i,(2i) \bmod 2^k} \tag{A.20}$$
for $i = 0, 1, \ldots, 2^k - 1$, where $p_{i,j}$ and $\hat{p}_{i,j}$ represent the transition probabilities of the original and modified FSM chains. After this transformation, state pairs $(i, 2^{k-1} + i)$ in the modified transition probability matrix clearly satisfy the lumpability constraint and can be aggregated together.
Using the above strategy, any $2^k \times 2^k$ FSM transition probability matrix can be modified and folded about $2^{k-1}$ to give a Markov chain with exactly half the number of states. Since the basic transition probability structure is retained after the folding operation, this state reduction procedure can in fact be applied recursively to a $2^k$-state FSM chain to give a $2^m$-state folded Markov chain, where $m$ is an integer such that $1 \leq m < k$. We henceforth refer to these models as folded Markov chains (FMCs).
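One folding step can be sketched as follows. This is a minimal illustration of the averaging in (A.20) followed by aggregation of each pair of states $i$ and $2^{k-1} + i$, not the thesis' implementation: the function name `fold`, the list-of-rows matrix format and the numeric values are assumptions.

```python
def fold(P):
    """Fold a 2**k-state FSM transition matrix into a 2**(k-1)-state chain.

    For each pair (i, i + half) the probability of appending a 0 is averaged,
    and the pair becomes aggregate state i of the folded chain. Assumes P has
    the FSM structure of Observation 1 and at least 4 states.
    """
    size = len(P)
    half = size // 2
    if half < 2 or 2 * half != size:
        raise ValueError("need an even number of states, at least 4")
    Q = [[0.0] * half for _ in range(half)]
    for i in range(half):
        even = (2 * i) % size                        # successor reached by appending a 0
        avg = 0.5 * (P[i][even] + P[i + half][even])
        Q[i][(2 * i) % half] = avg                   # aggregate state holding 'even'
        Q[i][(2 * i + 1) % half] = 1.0 - avg         # aggregate state holding 'even + 1'
    return Q


# 4-state example with illustrative probabilities: rows 0 and 2 are averaged,
# as are rows 1 and 3.
P = [[0.9, 0.1, 0.0, 0.0],
     [0.0, 0.0, 0.3, 0.7],
     [0.7, 0.3, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]]
print(fold(P))
```

Because the folded matrix keeps the two-successor structure, calling `fold` again on its output (while it still has at least 4 states) gives the recursive reduction described above.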
A folded process is a coarse approximation of an FSM chain because folding simply
ensures a non-zero transition probability between two aggregate states. However, the
FSM transition probabilities for the FSM states that are aggregated/grouped together may
be very different. Thus a folded process represents an on-average behavior of the FSM.
This fact will become clear in the performance evaluation section.
At this point, the following question may be raised: How is a $2^m$-state FMC different from a $2^m$-state FSM? A $2^m$-state FSM represents a process with a memory-length of $m$, whereas a $2^m$-state FMC might be a folded version of an FSM with a memory-length greater than $m$. For instance, in the following section we compare the performance of a 64-state FSM with that of a 64-state FMC. While the number of states is the same in both the models, the FSM has a memory-length of 6 whereas the FMC is formed by performing three folding operations on an FSM with a memory-length of 9.
A.5.5.3 Evaluation of Folded Markov Chains
We fold the order-9 FSM chain for the 5.5 Mbps process to FMCs having 256, 128,
64, 32, 16, 8, 4 and 2 states. The performance comparison of these FMCs with FSMs of
memory-lengths 2, 3, 4, 5, 6, 7 and 8 is provided in Figure 17. It can be clearly seen that
the FMC performance for any number of states is similar to or worse than the FSM. Thus
the FMCs do not provide any improvement in performance over varying order FSMs.
FMC performance was similar for a 2 Mbps bit-error channel. It can be deduced that capturing only the on-average FSM behavior, as the FMCs do, is not sufficient and that more statistical characteristics of FSM chains should be incorporated in an effective model.
[Figure 17. Performance of FMCs formed by folding a 512 state FSM to 256, 128, 64, 32, 16, 8, 4 and 2 states; the FSM process is trained using a 5.5 Mbps trace. Panels: (a) good-bursts, (b) bad-bursts. Each panel plots ENK against the number of states (2 to 512) for the FSM and the FMC.]
A.5.6 Complexity Reduction by Approximating an FSM Chain’s Good- and Bad-Burst Behavior
Now that we have established that lumpability and relaxed versions of it cannot
capture the complex wireless bit-error behavior, we focus on analyzing how an FSM
chain captures a channel’s good- and bad-bursts. To that end, in this section we derive
generalized probability distributions of good- and bad-bursts for an FSM chain of
arbitrary order. The probability distributions are derived in terms of FSM chain transition
and steady-state probabilities. These distributions render useful insights into important
FSM characteristics, which are used to develop guidelines for defining FSM state space
partitions. (Recall that the objective of the present analysis is to ascertain partitions of
FSM state space. FSM states in a particular partition are then grouped together to form an
aggregate state in the low-complexity approximating process.) We want to define the
FSM state space partitions such that the resulting aggregate process, while being less
complex, closely matches the FSM chain characteristics.
Let $H$ and $S$ denote the state spaces of an FSM chain and an aggregate (approximate) process, respectively. Let $i \in H$ and $S_i \in S$ denote two arbitrary states of the FSM and the approximate process, respectively. From Lemma 1 we have a necessary condition that should be imposed on the aggregate states $S_i$. To simplify notation, from this point forward we drop the $\bmod\ 2^k$ operation (where $k$ is the memory-length) on each FSM chain state. Thus, an FSM state $(i) \bmod 2^k$ is simply written as state $i$. As in previous sections, let $I$ and $B$ denote the good- and bad-burst random variables, respectively. We want to derive closed-form expressions for the distributions of $I$ and $B$ in terms of FSM chain parameters. We expect the expressions for good- and bad-burst random variables to render insights into how an FSM chain captures these random variables. The following theorem states the FSM probability distribution of good-bursts:
THEOREM 2. The probability distribution of a good-burst of length exactly $l$, $\Pr\{I = l\}$, for an FSM chain of memory-length $k$ is
$$\Pr\{I = l\} = \sum_{i=0}^{2^{k-1}-1} \pi_{2i+1} \times \prod_{j=0}^{\min\{k-2,\,l-2\}} p_{(2i+1)2^j,\,(2i+1)2^{j+1}} \times \mu_i, \quad \forall k, l > 0,$$
where
$$\mu_i = \begin{cases} p_{(2i+1)2^{l-1},\,(2i+1)2^l} \times p_{(2i+1)2^l,\,(2i+1)2^{l+1}+1}, & l < k, \\ p_{2^{k-1},0} \times p_{0,0}^{\,l-k} \times p_{0,1}, & l \geq k. \end{cases} \tag{A.21}$$
Proof: Before proceeding with the proof, we recall that the subscripts of all transition and steady-state probabilities are modulo $2^k$. Let us focus on the proof of the $l \geq k$ case since the proof of the other case is much simpler and follows a similar procedure. Given any current state, a good-burst (i.e., burst of 0’s) will start if the current state has a 1 in the LSB position of the memory-window, i.e., the current state represents an odd-numbered FSM state $X_n = 2i + 1$, $0 \leq i \leq 2^{k-1} - 1$.
[Figure 18. State transitions of an FSM with memory-length $k$ and a good-burst of length $l \geq k$: initial state $X_n = 2i + 1$ (memory-window $x_{k-1}, x_{k-2}, \ldots, x_1, x_0 = 1$), then $X_{n+1} = (2i+1)2 \bmod 2^k$, $X_{n+2} = (2i+1)2^2 \bmod 2^k$, ..., $X_{n+k-1} = (2i+1)2^{k-1} \bmod 2^k = 2^{k-1}$, $X_{n+k} = 0$, ..., $X_{n+l} = 0$, $X_{n+l+1} = 1$.]
Without loss of generality, consider the state path given in Figure 18. For a good-burst of length $l$ starting in the current odd-numbered FSM state, the next $k - 1$ transitions will be $(2i+1), (2i+1)2, (2i+1)2^2, \ldots, (2i+1)2^{k-1}$. Note that $(2i+1)2^{k-1} \bmod 2^k = 2^{k-1}$ and, based on the discussion in Observation 1, the process wraps around to FSM state 0 at this point, i.e., at point $X_{n+k-1} = 2^{k-1}$, the good-burst continues and the process wraps around, $X_{n+k} = 0$. This transition sequence is followed by $l - k$ zero bits, i.e., the next $l - k$ transitions are from state 0 to state 0 giving $X_{n+k} = X_{n+k+1} = X_{n+k+2} = \cdots = X_{n+l} = 0$. The good-burst ends when a one bit is encountered at the $(l+1)$st transition, and the FSM process moves to $X_{n+l+1} = (000 \ldots 01)_2 = (1)_{10}$. This state-transition path when expressed in the form of probabilities will have to be summed over all possible odd-valued FSM states,
$$\begin{aligned} \Pr\{I = l\} ={} & \pi_1 \times p_{1,1(2)} \times p_{1(2),1(2^2)} \times \cdots \times p_{1(2^{k-2}),1(2^{k-1})} \times p_{1(2^{k-1}),0} \times p_{0,0}^{\,l-k} \times p_{0,1} \\ & + \pi_3 \times p_{3,3(2)} \times p_{3(2),3(2^2)} \times \cdots \times p_{3(2^{k-2}),3(2^{k-1})} \times p_{3(2^{k-1}),0} \times p_{0,0}^{\,l-k} \times p_{0,1} \\ & + \cdots \\ & + \pi_{2^k-1} \times p_{2^k-1,(2^k-1)2} \times p_{(2^k-1)2,(2^k-1)2^2} \times \cdots \times p_{(2^k-1)2^{k-2},(2^k-1)2^{k-1}} \times p_{(2^k-1)2^{k-1},0} \times p_{0,0}^{\,l-k} \times p_{0,1}. \end{aligned}$$
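Taking out common terms, and noting that $(2i+1)2^{k-1} \bmod 2^k = 2^{k-1}$ for every $i$, yields
$$\Pr\{I = l\} = p_{2^{k-1},0} \times p_{0,0}^{\,l-k} \times p_{0,1} \times \sum_{i=0}^{2^{k-1}-1} \pi_{2i+1} \prod_{j=0}^{k-2} p_{(2i+1)2^j,\,(2i+1)2^{j+1}},$$
which is the same as the expression in Theorem 2 for all $l \geq k$. $\blacksquare$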
Some explanation of the good-bursts probability distribution given above is as follows: Let $n$ denote the discrete time index at which a good-burst started. The last bit received before the good-burst must be a corrupted bit, i.e., $x(n-1) = 1$. Thus, at time instance $n - 1$ the FSM chain’s memory-window had a “1” at the LSB position. In other words, the FSM chain was in an odd state, i.e., $X_{n-1} = 2i + 1$, where $0 \leq i \leq 2^{k-1} - 1$. For the good-bursts probability distribution, we have to account for (or sum over) all the odd states of the FSM chain. This fact explains the $\pi_{2i+1}$’s in the additive expression of (A.21). For a good-burst of $l$ bits, the $l$ bits following $x(n-1) = 1$ must be error-free, i.e., $x(n) = 0, x(n+1) = 0, \ldots, x(n+l-1) = 0$. This results in the multiplicative expression following each $\pi_{2i+1}$. Thus the multiplicative expression characterizes the state transition path for $l$ good bits starting in FSM state $2i + 1$. Since the good-burst ends after $l$ bits, the $(n+l)$-th bit must be corrupted, i.e., $x(n+l) = 1$. The $\mu_i$ expression characterizes the transition on the $(n+l)$-th step depending on whether the total burst-length is smaller or longer than the memory-window.
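The path-by-path reading of Theorem 2 can be written as a short function. The sketch below assumes the form of (A.21) described above (a sum over odd start states, a product over the first transitions of the burst, and a final factor that depends on whether $l < k$). The name `good_burst_pmf`, the list-based inputs and the numeric values are illustrative assumptions, not data from the thesis.

```python
def good_burst_pmf(P, pi, k, l):
    """Evaluate Pr{I = l} for a 2**k-state FSM chain, following (A.21).

    P: transition matrix (list of rows); pi: state probabilities (list);
    all state indices are reduced modulo 2**k, as in the proof.
    """
    size = 2 ** k
    total = 0.0
    for i in range(2 ** (k - 1)):
        s = 2 * i + 1                                  # odd state: last bit was an error
        prob = pi[s]
        for j in range(min(k - 2, l - 2) + 1):         # leading zeros of the burst
            prob *= P[(s * 2 ** j) % size][(s * 2 ** (j + 1)) % size]
        if l < k:                                      # burst ends inside the window
            a = (s * 2 ** (l - 1)) % size
            b = (s * 2 ** l) % size
            c = (s * 2 ** (l + 1) + 1) % size
            prob *= P[a][b] * P[b][c]
        else:                                          # burst is at least k bits long
            prob *= P[2 ** (k - 1)][0] * P[0][0] ** (l - k) * P[0][1]
        total += prob
    return total


# Illustrative 4-state chain (k = 2); the probabilities are made up.
P = [[0.9, 0.1, 0.0, 0.0],
     [0.0, 0.0, 0.3, 0.7],
     [0.8, 0.2, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]]
pi = [0.7, 0.1, 0.1, 0.1]
print([good_burst_pmf(P, pi, 2, l) for l in (1, 2, 3)])
```

A useful consistency check is that summing the values over all $l \geq 1$ gives the probability that an error bit is followed by a good bit, $\sum_i \pi_{2i+1}\, p_{2i+1,\,2(2i+1)}$, since every good-burst must eventually end.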
Similar to Theorem 2, the probability distribution of a bad-burst of length l is given
in the following theorem:
THEOREM 3. The probability distribution of a bad-burst of length exactly $l$, $\Pr\{B = l\}$, for an FSM chain of memory-length $k$ is
$$\Pr\{B = l\} = \sum_{i=0}^{2^{k-1}-1} \pi_{2i} \times \prod_{j=0}^{\min\{k-2,\,l-2\}} p_{(2i+1)2^j-1,\,(2i+1)2^{j+1}-1} \times \mu_i, \quad \forall k, l > 0,$$
where
$$\mu_i = \begin{cases} p_{(2i+1)2^{l-1}-1,\,(2i+1)2^l-1} \times p_{(2i+1)2^l-1,\,(2i+1)2^{l+1}-2}, & l < k, \\ p_{2^{k-1}-1,\,2^k-1} \times p_{2^k-1,\,2^k-1}^{\,l-k} \times p_{2^k-1,\,2^k-2}, & l \geq k. \end{cases} \tag{A.22}$$
Proof of this theorem is skipped because it is very similar to the proof of Theorem 2.
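Although the proof is omitted, the bad-burst path can be traced in code in the same spirit: a burst of ones starts from an even state $2i$, and after $j$ consecutive ones the chain is in state $(2i+1)2^j - 1$ (modulo $2^k$). The sketch below follows that reading of (A.22). The name `bad_burst_pmf` and the numeric values are assumptions for illustration.

```python
def bad_burst_pmf(P, pi, k, l):
    """Evaluate Pr{B = l} for a 2**k-state FSM chain, following (A.22).

    P: transition matrix (list of rows); pi: state probabilities (list);
    all state indices are reduced modulo 2**k.
    """
    size = 2 ** k
    last = size - 1                                    # all-ones state 2**k - 1
    total = 0.0
    for i in range(2 ** (k - 1)):
        s = 2 * i + 1
        prob = pi[2 * i]                               # even state: last bit was good
        for j in range(min(k - 2, l - 2) + 1):         # leading ones of the burst
            prob *= P[(s * 2 ** j - 1) % size][(s * 2 ** (j + 1) - 1) % size]
        if l < k:                                      # burst ends inside the window
            a = (s * 2 ** (l - 1) - 1) % size
            b = (s * 2 ** l - 1) % size
            c = (s * 2 ** (l + 1) - 2) % size
            prob *= P[a][b] * P[b][c]
        else:                                          # window fills up with ones
            prob *= P[2 ** (k - 1) - 1][last] * P[last][last] ** (l - k) * P[last][last - 1]
        total += prob
    return total


# Illustrative 4-state chain (k = 2); the probabilities are made up.
P = [[0.9, 0.1, 0.0, 0.0],
     [0.0, 0.0, 0.3, 0.7],
     [0.8, 0.2, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]]
pi = [0.7, 0.1, 0.1, 0.1]
print([bad_burst_pmf(P, pi, 2, l) for l in (1, 2, 3)])
```

As with good-bursts, the values summed over all $l \geq 1$ equal the probability that a good bit is followed by an error bit, $\sum_i \pi_{2i}\, p_{2i,\,4i+1}$.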
The expressions for the good- and bad-burst probability distributions given in (A.21) and
(A.22) are rather convoluted. Hence in their present forms, these expressions neither offer
any obvious insight into the FSM chain behavior nor are they amenable to further
analysis. In the following section, we employ Observation 2 to simplify the probability
distribution expressions of (A.21) and (A.22). The simplification in turn leads us to the
design guidelines that should be followed by a low complexity model.
A.5.6.1 Simplification of Good-bursts Distribution
We know from Observation 2 that the steady-state probability of FSM state 0 is very
high. Consequently, the steady-state probabilities of odd FSM states in the good-bursts
expression of (A.21) are negligible. The terms involving a transition to or from state 0 of
the FSM will hence dominate the good-burst probability distribution of (A.21).
Moreover, since the channel usually stays in the good state for practical wireless
networks, the good-burst length should in general be significantly greater than the
memory-length. Hence, an effective good-bursts probability distribution $\Pr\{I = l\}$ should accurately capture the $l \geq k$ behavior. An approximation of the good-bursts probability distribution of (A.21) for $l \geq k$ can be written as:
$$\Pr\{I = l\} \approx p_{2^{k-1},0} \, (p_{0,0})^{l-k} \, p_{0,1}, \quad \forall l \geq k > 0. \tag{A.23}$$
Although the above expression is an approximation of the FSM chain’s good-bursts probability distribution, it is clearly more insightful. For instance, note that the parameter characterizing this approximate probability distribution is the probability of a good bit transmission followed by another good bit transmission ($p_{0,0}$), since this is the only parameter in (A.23) whose exponent involves the good-burst length, $l$. Hence, one important consideration while grouping FSM states should be that the all-zero (i.e., no-error) FSM state is not grouped with a large number of other states. This is a natural consequence of Observation 2 which implies that the mean time spent in the all-zero (i.e., no-error) FSM state is significantly higher than that spent in all other FSM states.
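As a quick sanity check on the form of (A.23), offered here as a worked step rather than as part of the original derivation, the approximation can be summed over all $l \geq k$. The step assumes $p_{0,0} < 1$ and uses $p_{0,1} = 1 - p_{0,0}$, which follows from Observation 1 because state 0 can only move to state 0 or state 1:

```latex
\sum_{l=k}^{\infty} p_{2^{k-1},0}\, p_{0,0}^{\,l-k}\, p_{0,1}
  = p_{2^{k-1},0}\, p_{0,1} \sum_{m=0}^{\infty} p_{0,0}^{\,m}
  = \frac{p_{2^{k-1},0}\, p_{0,1}}{1 - p_{0,0}}
  = p_{2^{k-1},0},
\qquad
\frac{p_{2^{k-1},0}\, p_{0,0}^{\,l+1-k}\, p_{0,1}}{p_{2^{k-1},0}\, p_{0,0}^{\,l-k}\, p_{0,1}} = p_{0,0}
\quad (l \geq k).
```

Under the approximation, the tail of the good-burst distribution is therefore geometric with ratio $p_{0,0}$, which is consistent with the emphasis placed on keeping FSM state 0 in an aggregate state of its own.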
Similarly, in addition to the FSM state 0, two other important FSM states are state $2^{k-1}$ and state 1 since $p_{2^{k-1},0}$ and $p_{0,1}$ are the only parameters, other than $p_{0,0}$, that appear in the approximate probability distribution given in (A.23). Hence, due to their relative importance in describing real-life wireless channels, a good model, in addition to FSM state 0, should not group FSM states 1 and $2^{k-1}$ with too many other states. This guideline will be employed to define the constant-complexity model.
A.5.6.2 Simplification of Bad-bursts Distribution
For the bad-bursts probability distribution of (A.22), we again invoke Observation 2 and neglect the terms in (A.22) that are not multiplied with $\pi_0$. Using this approximation, the bad-bursts distribution (A.22) can be written as:
$$\Pr\{B = l\} \approx \pi_0 \times \prod_{j=0}^{\min\{k-2,\,l-2\}} p_{2^j-1,\,2^{j+1}-1} \times \mu_0,$$
where
$$\mu_0 = \begin{cases} p_{2^{l-1}-1,\,2^l-1} \times p_{2^l-1,\,2^{l+1}-2}, & l < k, \\ p_{2^{k-1}-1,\,2^k-1} \times p_{2^k-1,\,2^k-1}^{\,l-k} \times p_{2^k-1,\,2^k-2}, & l \geq k. \end{cases} \tag{A.24}$$
The only terms appearing in (A.24) after the approximation involve FSM states 0, $2^k - 2$, and $2^j - 1$, for any $1 \leq j \leq k$. From Observation 2 and the good-bursts approximation, we have already established that FSM state 0 should not be aggregated with many other states. This deduction is reasserted here. Moreover, it is preferable not to aggregate FSM state $2^k - 2$ with many other states. Also, if possible, the FSM states $2^j - 1$, where $1 \leq j \leq k$, should not be grouped with too many other states.
A.5.6.3 Guidelines for Approximating an FSM chain
Based on the analyses of previous sections, we now define guidelines that should be
followed to develop partitions on the FSM state space. FSM states in each partition are
then aggregated to give a low-complexity aggregate model. The FSM state aggregation
procedure is based on the underlying assumption that there is a given complexity budget.
That is, the required number of states in the aggregate model is specified beforehand.
FSM state aggregation should result in a model which has the required number of states.
Given a complexity budget in the form of the total number of states and based on
preceding discussions, we define the following guidelines that should be followed to
develop an aggregate model with total number of states satisfying the complexity budget:
Guideline 1: Any FSM chain state aggregation should satisfy the condition given in Lemma 1.
Guideline 2: FSM state 0 should not be aggregated with other states.
Guideline 3: FSM states $2^{k-1}$ and 1 should be aggregated with a minimal number of other states.
Guideline 4: FSM states $2^k - 2$ and $2^j - 1$, for all $1 \leq j \leq k$, should be aggregated with a minimal number of other states.
Note that Guideline 1 and Guideline 2 are more assertive than Guideline 3 and Guideline 4. This is due to the analysis provided in the previous section, which outlined that: (i) Guideline 1 is necessary for an accurate model, and (ii) Guideline 2, which is a consequence of Observation 2, is asserted by the approximate distributions of both good- and bad-bursts. Also note that Guideline 1, Guideline 2, and Guideline 3 can be easily satisfied in a low-complexity model. However, Guideline 4 is somewhat problematic because putting each $2^j - 1$ FSM state, for all $1 \leq j \leq k$, in a separate partition (i.e., separate aggregate state) makes the total number of states of the approximate model an increasing function of the memory-length $k$. Thus, satisfying Guideline 4 implies that the resultant complexity (i.e., number of states) of the aggregate model will at least be a linear function of the memory-length. We, on the other hand, want to keep the number of states in the model independent of the underlying process’ memory-length. In the following section, we develop a constant-complexity model which adheres to the first three guidelines. Performance evaluation of the model for 802.11b channels demonstrates that although the proposed model ignores Guideline 4, it closely approximates an FSM chain’s behavior.
A.5.7 Constant-Complexity Model
In this section, we propose a constant-complexity model (CCM) which adheres to
Guideline 1, Guideline 2, and Guideline 3. Here, it should be emphasized that the FSM
state space partitioning presented in this section is only one of the many possible state
assignments. Future low-complexity channel models can define other state partitions
which should perform adequately as long as the above guidelines are followed.
The CCM keeps FSM states 0, 1 and $2^{k-1}$ each in a separate partition, while grouping all the remaining FSM states into two partitions. The resulting model always has 5 states irrespective of the memory-length. The structure and transition possibilities of the CCM are illustrated in Figure 19. As Figure 19 shows, the CCM assigns separate states to FSM states 0, 1 and $2^{k-1}$, thereby adhering to Guideline 2 and Guideline 3. All remaining even FSM states are grouped in a single aggregate CCM state, while all remaining odd FSM states are grouped in another aggregate state. Note that none of the CCM states contains both an odd and an even FSM state, i.e., an aggregate state either contains even FSM states or odd FSM states. Thus Guideline 1, which states that FSM states $2i$ and $2i + 1$ should not be aggregated together, is also satisfied by the CCM. Based on our analysis, this 5-state CCM should follow the behavior of the underlying $2^k$-state FSM quite closely. This CCM efficacy is highlighted in the next section where we compare its performance with FSM and linear-complexity models.
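The CCM state assignment described above can be sketched as a simple mapping from FSM states to five aggregate states. The numeric labels 0 to 4, the function names and the restriction to $k \geq 3$ (so that all five groups are non-empty and distinct) are assumptions of this illustration, not part of the thesis.

```python
def ccm_state(i, k):
    """Map FSM state i (0 .. 2**k - 1) to one of five CCM aggregate states.

    Hypothetical labels: 0 -> {0}, 1 -> {1}, 2 -> {2**(k-1)},
    3 -> remaining even states, 4 -> remaining odd states.
    """
    if k < 3 or not 0 <= i < 2 ** k:
        raise ValueError("sketch assumes k >= 3 and 0 <= i < 2**k")
    if i == 0:
        return 0
    if i == 1:
        return 1
    if i == 2 ** (k - 1):
        return 2
    return 3 if i % 2 == 0 else 4


def ccm_partition(k):
    """Return the five groups of FSM states as lists, indexed by CCM label."""
    groups = [[] for _ in range(5)]
    for i in range(2 ** k):
        groups[ccm_state(i, k)].append(i)
    return groups


# For k = 4 the sixteen FSM states collapse into five groups.
for label, members in enumerate(ccm_partition(4)):
    print(label, members)
```

Every group is purely even or purely odd, so the two successors $2i$ and $2i + 1$ of any FSM state always fall in different groups, as Guideline 1 requires.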
A.5.7.1 Performance of the CCM at 2 Mbps
We provide ENK based performance comparison between the 548-state FSM and the 5-state CCM for memory-lengths ranging from 1 up to 10 in Figure 20 and Figure 21. We also compare performance with the previously proposed short-term energy model (SEM) and zero-crossing model (ZCM) [29]. These two models constrain the complexity to increase linearly with the memory-length. Performance of the 548-state FSM model serves as the benchmark for performance evaluation of the CCM, SEM and ZCM. The longest memory-length of 10 yields a 548-state FSM, an 11-state SEM, a 10-state ZCM and a 5-state CCM. Let us first focus on Figure 20 which plots performance versus complexity for FSM chains, CCM, SEM and ZCM. Although all memory-lengths from 1 up to 10 were evaluated, to show the results clearly this figure only plots the ENK values for a certain number of states.
[Figure 19. State aggregation and transitions for the CCM. Each box represents an aggregate CCM state. The number(s) inside a CCM state are the aggregated FSM states: $\{0\}$, $\{1\}$, $\{2^{k-1}\}$, $\{2, 4, 6, \ldots, 2^{k-1} - 2, 2^{k-1} + 2, \ldots, 2^k - 2\}$ and $\{3, 5, 7, \ldots, 2^k - 1\}$.]
Due to the fixed CCM complexity, only the ENK performance of one CCM (corresponding to a memory-length of 8) is shown in Figure 20. This particular CCM was chosen since it rendered the best overall performance. The performance of CCM models for the remaining memory-lengths will be discussed shortly. It is clear from Figure 20 that for the good-bursts random variable the CCM performs as well as the 548-state FSM. For the same complexity as the CCM (i.e., 5 states), the linear-complexity models have higher ENK overhead. However, the performance of higher order linear-complexity (SEM and ZCM) models is reasonable. Hence, it can be deduced that the CCM captures the good-bursts behavior of the 2 Mbps wireless MAC layer channel accurately and with fewer states than any other model under consideration. Similarly, Figure 20 shows that the CCM ENK overhead for the bad-bursts random variable is also very small and is quite comparable to the corresponding FSM, SEM and ZCM. Specifically, the CCM incurs an ENK overhead of 0.053 as opposed to 0.018 for the 4-state FSM, 0.039 for the 5-state SEM and 0.0386 for the 5-state ZCM.
[Figure 20. ENK based modeling performance versus complexity for the 2 Mbps bit-error process. Panels: (a) good-bursts, (b) bad-bursts. Each panel plots ENK against the number of states (log scale) for the FSM, CCM (memory-length=8), SEM and ZCM.]
[Figure 21. ENK based modeling performance versus memory-length for the 2 Mbps bit-error process. Panels: (a) good-bursts, (b) bad-bursts. Each panel plots ENK against the memory-length for the FSM, CCM, SEM and ZCM.]
Figure 21 provides further insight into the performance of CCM for different
memory-lengths. From Figure 21 it can be observed that the CCM performance for all
orders is better than the FSM model, SEM and ZCM for the good-bursts random variable.
In case of the bad-bursts random variable, the performance of all models with memory-
lengths greater than 3 is comparable. The CCM performance for small orders is better
than the linear-complexity models. For high orders, while both linear- and constant-
complexity models have slightly greater overhead than the FSM model, the CCM
performance is comparable to its linear-complexity counterparts.
The ENK divergence highlights that the CCM provides an accurate and low-complexity bit-error model for 802.11b LANs operating at 2 Mbps. This performance substantiates our initial analysis which outlined that a 5-state CCM can render a performance that is comparable to the respective $2^k$-state FSM chain. As shown in [29], the linear-complexity (SEM and ZCM) models also yield very good ENK based performances.
A.5.7.2 Performance of the CCM at 5.5 Mbps
ENK based performances of the FSM chains, CCM, SEM and ZCM at 5.5 Mbps are
outlined in Figure 22 and Figure 23. Only a CCM with memory-length of 6 is shown in
Figure 22 since it renders the best overall (good- and bad-bursts) performance. It is clear
from Figure 22 that the CCM performance for the good-bursts random variable is
comparable to or better than all other modeling techniques. Note however that the ZCM
performs slightly better than the CCM. Thus the CCM and ZCM, even at low orders,
capture the good-bursts behavior of the 5.5 Mbps channel very accurately. Similarly,
Figure 22 shows that the CCM ENK overhead for the bad-bursts random variable is also
very small. Figure 23 outlines the performance rendered by CCMs corresponding to
different memory-lengths. From Figure 23 it can be observed that the CCM performance
for all orders is better than or comparable to the FSM, SEM and the ZCM for the
good-bursts random variable. In case of the bad-bursts random variable, the performances of all
the models except the SEM are similar.
[Figure 22. ENK based modeling performance versus complexity for the 5.5 Mbps bit-error process: (a) good-bursts, (b) bad-bursts. Horizontal axis: number of states (log scale); vertical axis: ENK. Curves: FSM, CCM (memory-length = 6), SEM, ZCM.]
[Figure 23. ENK based modeling performance versus memory-length for the 5.5 Mbps bit-error process: (a) good-bursts, (b) bad-bursts. Horizontal axis: memory-length; vertical axis: ENK. Curves: FSM, CCM, SEM, ZCM.]
Thus, taking both complexity and modeling performance into consideration,
the ENK divergence indicates that the CCM outperforms its linear-complexity counterparts
in modeling of the 802.11b bit-errors at 5.5 Mbps.
A.5.8 Discussion
At this point, we have developed accurate and low-complexity models for the
wireless bit-error channels under consideration. In the following chapters, we explore the
application and usefulness of these models. Specifically, the next chapter uses these
models in a novel wireless multimedia framework. The last contribution chapter of this
part quantifies the inaccuracies that are incurred if channel memory is ignored and a low-
order FSM model is used to simulate and analyze wireless systems.
CHAPTER A.6 CHANNEL MODEL BASED HEADER ESTIMATION FOR WIRELESS
MULTIMEDIA
Wireless channels incur unpredictable and time-varying packet losses due to channel
interference and node mobility. This data loss is particularly detrimental for real-time
communications since their delay constraints generally do not allow retransmission-based
recovery of lost packets. Consequently, recent multimedia standards have introduced
enhanced error resilience and concealment features (e.g., slices in JVT/H.264 [83] and
reversible VLC in MPEG-4 [84]) to cater for bandwidth-constrained and error-prone
wireless channels. Distortion in multimedia quality at a wireless receiver can be
substantially decreased if corrupted packets, instead of being dropped, are relayed to the
multimedia application. The application can then decide to retain, drop or recover the
corrupted packets.
To improve packet throughput at a wireless receiver, enhanced robustness is provided
at the physical layer of emerging wireless protocol stacks. Nevertheless, residual/MAC-
to-MAC errors not corrected by the physical layer cause checksum failures at higher
(MAC and transport) layers, leading to a significant number of packet drops. The UDP-
Lite protocol was proposed to address this problem [41]- [44]. As explained in Section
A.2.2, the proposed UDP-Lite based transport schemes ignore errors in the application
layer payload, but drop all packets that have one or more bit-errors in the IP, the UDP, or
the application layer headers.
It has been shown that UDP-Lite based partial protection with application layer
forward error correction (FEC) improves wireless bandwidth utilization [41]- [51].
Support of partial protection necessitates changes to the standard protocols at the
multimedia transmitter and/or intermediate network nodes. In many realistic scenarios,
modifications to multimedia servers and/or intermediate nodes cannot be dictated by the
end-receivers. We argue that the requirement of transmitter modifications in UDP-Lite
has hampered its wide-spread deployment. Furthermore, frequent header errors result in
significant packet drops for UDP-Lite, especially at high data rates (see footnote 2).
UDP-Lite’s shortcomings can be addressed by a receiver-based scheme that, in
addition to ignoring payload errors, can estimate corrupted header fields. For such a
header estimation scheme to be practical, modifications below the application layer
should only be made to the wireless receiver. Thus no additional information (such as
FEC redundancy) is available for header estimation at the receiver. However, the
corrupted payloads relayed to the receiver’s application layer by a header estimation
scheme can and should be corrected using application layer FEC decoding. In this
chapter, we propose a cross-layer header estimation methodology that uses the MAC
layer bit-error channel models developed in the previous chapters to estimate the
corrupted headers of a packet.
Before outlining the actual header estimation methodology, we derive and present
sound analytical conditions for the region-of-operation under which header estimation
performs better/worse than UDP and UDP-Lite. We clearly show that for any realistic
wireless system, the FEC redundancy required by header estimation is always lower than
that required by the UDP and UDP-Lite protocols. Since FEC is generally performed at the
byte level, analysis is provided for an arbitrary symbol size with the implicit assumption
that the symbol size is greater than one bit. We demonstrate the efficacy of header
estimation for two important classes of symbol-level wireless channels:
symmetric/memory-less channels and Gilbert channels. We show that an ideal header
estimation scheme can provide redundancy reduction (or goodput improvement) of up to
75% over UDP and UDP-Lite.
Footnote 2: In [18] the authors showed that under realistic settings of an 802.11b network, packets dropped by a UDP-Lite based protocol stack are 5.87% and 36.7% at 5.5 and 11 Mbps, respectively.
The analysis in the first part of this chapter serves as a motivation to develop a
practical, effective and accurate header estimation framework to improve wireless
multimedia quality. We propose a header estimator that can use the accurate MAC layer
bit-error channel models developed in the preceding chapter to estimate the corrupted
critical header fields (CHF) of a packet, while non-critical header fields are simply
ignored. At a header estimation-based UDP multimedia receiver, the most likely
transmitted CHF is estimated through channel parameters. The proposed scheme requires
no modifications to the standard protocols at senders and/or intermediate nodes. Only
minor protocol stack modifications are needed at the receiver. We map header estimation
to a problem of maximum-likelihood (ML) estimation of known parameters in noise [79].
We derive likelihood functions for an arbitrary-order full-state Markov chain model and a
multifractal wavelet model [61]-[63]. The FSM likelihood function is then extended to
provide the likelihood under the constant-complexity model. Trace-driven video simulations
at varying data rates of an 802.11b LAN show that the proposed scheme provides
significantly better throughput and multimedia quality than normal UDP and UDP-Lite.
A.6.1 FEC Redundancy Lower Bounds for UDP, UDP-Lite and Header Estimation
In this section, we derive theoretical bounds on the improvements provided by an
ideal header estimation scheme with application layer FEC operating on a $q$-ary
symmetric channel (SC) and a Gilbert channel (GC). Throughout this section, we
consider a MAC layer channel which sends and receives symbols of size $m$ bits. We
assume that this symbol size is equal to the FEC symbol size. Since FEC is generally not
performed on the bit-level, for the following theoretical analysis we assume that $m > 1$.
The term “ideal header estimation” implies that all corrupted packets intended for a
receiver are passed to its application layer. We derive lower bounds on the expected
amount of FEC redundancy required to successfully decode one FEC block. Naturally,
we want the amount of redundancy to be as low as possible for efficient utilization of
scarce wireless bandwidth. The bounds derived in this section answer the following
question: Under what conditions does header estimation require less FEC redundancy
for payload correction than UDP and UDP-Lite?
As mentioned before, we assume that the transmitter packetizes and transmits
symbols of arbitrary size $m$, where $m$ is also the FEC symbol size. A block-based
maximum distance separable (MDS) FEC scheme capable of simultaneously correcting
errors and erasures operates at the transmitter and receiver application layers. The
transmitter packetizes each FEC block into $l$ packets, with each packet having a data
payload of $n_D$ symbols. Before transmitting each packet, a header of size $n_H$ is
appended to the packet. Thus each packet has a fixed length of $n_H + n_D$ symbols. The
FEC algorithm only protects the data symbols, and hence the FEC block-length is $n_D l$
symbols. A total of $r$ out of the $n_D l$ symbols are redundant. A packet dropped by a
protocol below the application layer is treated as a packet erasure by the FEC decoder.
Since the FEC decoder is operating at symbol level, each packet erasure will result in $n_D$
symbol erasures. We assume that before decoding, the FEC decoder can identify
missing packets or packet erasures in an FEC block. This can, for example, be achieved
by transmitting an FEC-protected sequence number in each packet.
Let $X$ and $Y$ be two random variables which respectively characterize the number
of errors and erasures observed at the wireless receiver before FEC decoding. An MDS
code can recover all errors $X$ and erasures $Y$ if $2X + Y \le r$ [80].
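The decoding condition above can be illustrated with a minimal Python sketch. The function names and the numbers in the demo ($n_D = 452$, $r = 1000$) are hypothetical and chosen only for illustration; the sketch checks the MDS condition $2X + Y \le r$ after converting dropped packets into symbol erasures.

```python
def erasures_from_dropped_packets(dropped_packets: int, n_d: int) -> int:
    """Each dropped packet erases its n_D payload symbols in the FEC block."""
    return dropped_packets * n_d


def mds_block_decodable(num_errors: int, num_erasures: int, redundancy: int) -> bool:
    """MDS decoding condition: 2X + Y <= r (errors cost twice as much as erasures)."""
    return 2 * num_errors + num_erasures <= redundancy


if __name__ == "__main__":
    n_d, r = 452, 1000  # hypothetical payload size and redundancy, in symbols
    y = erasures_from_dropped_packets(2, n_d)  # two dropped packets
    print(mds_block_decodable(40, y, r))  # 2*40 + 904 = 984 symbols needed
    print(mds_block_decodable(50, y, r))  # 2*50 + 904 = 1004 symbols needed
```

Because a symbol error consumes two redundant symbols while an erasure consumes one, a scheme that converts packet erasures into a small number of symbol errors can reduce the redundancy needed per block.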
A.6.1.1 Redundancy Bounds on the q-ary Symmetric Channel
The inputs and outputs of a $q$-ary symmetric channel (SC) are drawn from an
alphabet of $q = 2^m$ symbols. An SC is characterized by a single parameter $p$, the
probability that a transmitted symbol $x_j$ is received as $x_i \ne x_j$:
$$p = \Pr\{x_i \text{ is received} \mid x_j \text{ is transmitted}\} \quad \text{for } i \ne j.$$
The overall probability of a symbol $x_j$ being corrupted over an SC is:
$$p_{SC} = \Pr\{\text{symbol error}\} = (2^m - 1)\,p. \tag{A.25}$$
We now derive FEC redundancy lower bounds on UDP, UDP-Lite and header estimation
based protocol stacks operating on an SC channel.
A.6.1.1.1 FEC Redundancy Bound on a UDP based Protocol Stack
Traditional wireless protocol stacks perform a checksum on the entire packet and
drop all packets that fail the checksum. While the checksum is generally performed at
both UDP and MAC layers, for simplicity and brevity, we refer to a protocol stack that
drops all corrupted packets as a UDP protocol stack. Throughout this chapter, dropped
packets are treated as erasures by the wireless receiver’s application layer FEC decoder;
each dropped packet results in $n_D$ erased symbols. Since UDP drops all corrupted
packets, the number of errors in the received data is always equal to zero,
$\Pr\{X = 0 \mid \text{UDP}\} = 1$. In this section, we derive an expression for the expected value of
the number of erasures, $Y$, observed with the UDP protocol.
For UDP, an $n_D$-symbol erasure will occur whenever a received packet has one or
more symbol-errors. Let $\varepsilon_{udp,SC}$ denote the probability of observing a UDP packet
erasure over an SC:
$$\varepsilon_{udp,SC} = 1 - (1 - p_{SC})^{n_H + n_D},$$
where $p_{SC}$ is the probability of symbol error given in (A.25). The probability of having
$k$ packet erasures over UDP is:
$$\Pr\{k \text{ pkt eras} \mid \text{UDP}\} = \binom{l}{k}\,\varepsilon_{udp,SC}^{\,k}\,(1 - \varepsilon_{udp,SC})^{\,l-k},$$
where $l$ is the total number of packets containing one FEC block. Then the expected
value of packet erasures is
$$E\{\#\text{ of pkt eras} \mid \text{UDP}\} = l\,\varepsilon_{udp,SC} \;\Rightarrow\; E\{\#\text{ of symbol eras} \mid \text{UDP}\} = E\{Y \mid \text{UDP}\} = n_D\, l\,\varepsilon_{udp,SC}.$$
Since $\Pr\{X = 0 \mid \text{UDP}\} = 1$, $E\{X \mid \text{UDP}\} = 0$. Thus the average amount of
redundancy required by a FEC decoder operating on a UDP protocol stack is
$$r_{udp,SC} \ge n_D\, l\,\varepsilon_{udp,SC} \ge n_D\, l\left[1 - (1 - p_{SC})^{n_H + n_D}\right]. \tag{A.26}$$
A.6.1.1.2 FEC Redundancy Bound on a UDP-Lite based Protocol Stack
Since a UDP-Lite protocol stack drops all packets that have header errors, the
probability of UDP-Lite packet erasures over an SC is
$\varepsilon_{udplite,SC} = \Pr\{\text{corrupt hdr}\} = 1 - (1 - p_{SC})^{n_H}$. Consequently, the expected
number of UDP-Lite erasures is
$$E\{\text{pkt eras} \mid \text{UDPLite}\} = l\,\varepsilon_{udplite,SC} \;\Rightarrow\; E\{\text{symbol eras} \mid \text{UDPLite}\} = E\{Y \mid \text{UDPLite}\} = n_D\, l\,\varepsilon_{udplite,SC} = n_D\, l\left[1 - (1 - p_{SC})^{n_H}\right].$$
In addition to erasures, a UDP-Lite protocol stack will also have errors in the application
layer payload. The probability of having $k$ symbol errors in the total
$h = n_D l - E\{Y \mid \text{UDPLite}\}$ symbols received at the FEC decoder is
$$\Pr\{k \text{ symbol errs} \mid \text{UDPLite}\} = \Pr\{X = k \mid \text{UDPLite}\} = \binom{h}{k}\, p_{SC}^{\,k}\,(1 - p_{SC})^{\,h-k}.$$
The expected number of symbol errors is
$$E\{\text{symbol errs} \mid \text{UDPLite}\} = E\{X \mid \text{UDPLite}\} = h\,p_{SC} = \left(n_D l - E\{Y \mid \text{UDPLite}\}\right) p_{SC} = n_D\, l\, p_{SC}\,(1 - p_{SC})^{n_H}.$$
Thus the total expected redundancy required to recover the errors and losses in an FEC
block over a UDP-Lite protocol stack is
$$r_{udplite,SC} \ge n_D\, l\,\varepsilon_{udplite,SC} + 2E\{X \mid \text{UDPLite}\} \ge n_D\, l\left[1 - (1 - p_{SC})^{n_H}\right] + 2\,n_D\, l\, p_{SC}\,(1 - p_{SC})^{n_H},$$
$$r_{udplite,SC} \ge n_D\, l\left[1 - (1 - p_{SC})^{n_H}(1 - 2p_{SC})\right]. \tag{A.27}$$
A.6.1.1.3 FEC Redundancy Bound on a Header Estimation based Protocol Stack
Under an ideal header estimation protocol stack, there are no erasures since all
packets are passed to the FEC decoder regardless of whether there are errors in the
headers or payload. That is, $\Pr\{Y = 0 \mid \text{HdrEst}\} = 1 \Rightarrow E\{Y \mid \text{HdrEst}\} = 0$. Based on
previous derivations, the expected number of symbol errors is
$E\{X \mid \text{HdrEst}\} = n_D\, l\, p_{SC}$. Thus the total expected amount of redundancy required by an
ideal header estimation scheme that passes all packets to the FEC decoder is
$$r_{hdrest,SC} \ge 2\,n_D\, l\, p_{SC}. \tag{A.28}$$
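The three symmetric-channel bounds (A.26), (A.27) and (A.28) can be evaluated numerically. The following is a minimal sketch, not the code used to produce the thesis figures; the function name `sc_redundancy_fractions` is hypothetical, and each bound is divided by the FEC block length $n_D l$ so that the result is a fraction of the block.

```python
def sc_redundancy_fractions(p_sc: float, n_h: int, n_d: int) -> dict:
    """Lower bounds (A.26)-(A.28), each divided by the FEC block length n_D * l."""
    if not 0.0 <= p_sc <= 1.0:
        raise ValueError("p_sc must be a probability")
    packet_ok = (1.0 - p_sc) ** (n_h + n_d)  # no symbol error in the whole packet
    header_ok = (1.0 - p_sc) ** n_h          # no symbol error in the header
    return {
        "udp": 1.0 - packet_ok,                             # (A.26)
        "udp_lite": 1.0 - header_ok * (1.0 - 2.0 * p_sc),   # (A.27)
        "hdr_est": 2.0 * p_sc,                              # (A.28)
    }


if __name__ == "__main__":
    # Header and payload sizes (in symbols) as quoted in the Figure 24 caption.
    for p in (0.001, 0.01):
        bounds = sc_redundancy_fractions(p, n_h=60, n_d=452)
        print(p, {name: round(100.0 * frac, 2) for name, frac in bounds.items()})
```

Normalizing by $n_D l$ removes the dependence on the number of packets per block, so the comparison depends only on $p_{SC}$, $n_H$ and $n_D$.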
A.6.1.1.4 Comparison of the FEC Redundancy Bounds
We now compare the minimum expected FEC redundancy of UDP and UDP-Lite
with header estimation. Let us first compare the minimum redundancies of UDP-Lite and
header estimation:
$$\begin{aligned}
\min(r_{udplite,SC}) - \min(r_{hdrest,SC}) &= n_D\, l\left[1 - (1 - p_{SC})^{n_H}(1 - 2p_{SC}) - 2p_{SC}\right] \\
&= n_D\, l\left[1 - (1 - p_{SC})^{n_H}\right](1 - 2p_{SC}).
\end{aligned}$$
Clearly, $\min(r_{udplite,SC}) - \min(r_{hdrest,SC}) > 0$ when $p_{SC} < 0.5$. Thus
$$\min(r_{udplite,SC}) > \min(r_{hdrest,SC}) \quad \text{when } p_{SC} < 0.5. \tag{A.29}$$
The condition $p_{SC} < 0.5$ is true for any realistic wireless channel, and therefore in all
practical wireless environments header estimation should always require less FEC
redundancy than UDP-Lite. In fact, on most wireless channels, $p_{SC} \ll 0.5$.
Now let us compare the minimum redundancy of header estimation with UDP:
$$\begin{aligned}
\min(r_{udp,SC}) - \min(r_{hdrest,SC}) &= n_D\, l\left[1 - (1 - p_{SC})^{n_H + n_D} - 2p_{SC}\right] \\
&= -\,n_D\, l\left[2p_{SC} - 1 + (1 - p_{SC})^{n_H + n_D}\right].
\end{aligned}$$
It can be easily shown that:
$$\min(r_{udp,SC}) > \min(r_{hdrest,SC}) \quad \text{when } p_{SC} < 0.49 \text{ and } (n_H + n_D) \ge 6. \tag{A.30}$$
In accordance with prior discussions, we know that the $p_{SC} < 0.49$ condition is true for
any realistic wireless channel. Also, the size of a wireless packet (headers included),
$n_H + n_D$, is always greater than 6. For instance, in 802.11b networks, even without any
payload data, the total size of MAC, IP and UDP headers is 60 bytes.
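One way to see why condition (A.30) holds is sketched below. This is a supplementary argument under the stated definitions, not necessarily the derivation used by the author; the helper function $f_n$ is introduced here only for illustration, with the UDP versus header estimation difference equal to $n_D\, l\, f_n(p_{SC})$ for $n = n_H + n_D$.

```latex
\begin{align*}
f_n(p) &= (1 - 2p) - (1 - p)^{n}, \qquad n = n_H + n_D, \\
f_n(0) &= 0, \qquad f_n''(p) = -n(n-1)(1-p)^{n-2} < 0 \quad (0 \le p < 1,\ n \ge 2), \\
f_6(0.49) &= 0.02 - 0.51^{6} \approx 0.02 - 0.0176 > 0, \\
f_5(0.49) &= 0.02 - 0.51^{5} \approx 0.02 - 0.0345 < 0 .
\end{align*}
```

Since $f_6$ is concave with $f_6(0) = 0$ and $f_6(0.49) > 0$, it is positive on $0 < p \le 0.49$. For $n \ge 6$, $(1-p)^n \le (1-p)^6$, so $f_n(p) \ge f_6(p) > 0$ on the same interval. The value $f_5(0.49) < 0$ shows why a packet length of at least 6 symbols is needed for the bound to hold up to $p_{SC} = 0.49$.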
Figure 24 plots the minimum expected FEC redundancies required by UDP,
UDP-Lite and header estimation for symbol error probabilities ranging between 0 and 0.1. It
can be clearly seen that header estimation requires significantly lower redundancy than
both UDP and UDP-Lite. Note that the difference in redundancy increases with an
increase in the probability of symbol error. For $p_{SC} = 0.0013$, the percentage of
bandwidth used for redundancy is approximately 0.25%, 7.6% and 47.9% for header
estimation, UDP-Lite and UDP, respectively. The FEC redundancy difference becomes
much wider for $p_{SC} = 0.01$, where header estimation, UDP-Lite, and UDP respectively
use 2.02%, 47.04% and 99.48% of bandwidth in redundant symbol transmission. For
$p_{SC} = 0.06$ and higher, the gap between UDP-Lite and UDP narrows, with the two using
78.15% and 100% of bandwidth for redundancy, respectively, while header estimation requires
4.84% redundancy - a remarkable goodput improvement of approximately 73% over
UDP-Lite and of approximately 95% over UDP.
[Figure 24. Minimum expected FEC redundancies of UDP, UDP-Lite and Ideal Header Estimation over a $q$-ary symmetric channel; $m = 8$, $q = 256$, $L = 30$, $n_H = 60$, $n_D = 452$. Horizontal axis: probability of symbol error, $p_{SC}$ (0 to 0.1); vertical axis: redundant FEC symbols (%). Curves: UDP, UDP-Lite, Header Estimation.]
Thus, while header estimation always requires less FEC redundancy, the advantages
are dramatic for somewhat high error-rate channels, e.g., the 5.5 and 11 Mbps 802.11b
channels. Later in this chapter, we validate these theoretical findings using a practical
header estimator that is tested using actual wireless error traces. In the next section, we
derive similar bounds for the Gilbert channel.
A.6.1.2 Redundancy Bounds on the Gilbert Channel
Consider the one-hop symbol-level Gilbert wireless channel of Figure 1. The Gilbert
channel (GC) [81] has been used to model many wireless channels [9]- [11], [13]- [15],
[18]- [20], [26]. In this section, we compare minimum expected FEC redundancies of
UDP-Lite and UDP with header estimation over a GC.
A.6.1.2.1 Bound on a UDP based Protocol Stack
Let $\varepsilon_{udp,GC}$ denote the probability of observing a packet erasure on a UDP protocol
stack operating over a Gilbert channel (GC). Then $\varepsilon_{udp,GC}$ is the probability of having
one or more symbol-errors in the received packet, and can be expressed as:
$$\begin{aligned}
\varepsilon_{udp,GC} = \Pr\{\text{corrupt pkt}\} &= 1 - \left(\pi_g\, p_{gg}^{\,n_H + n_D} + \pi_b\, p_{bg}\, p_{gg}^{\,n_H + n_D - 1}\right) \\
&= 1 - \pi_g\, p_{gg}^{\,n_H + n_D - 1} \\
&= 1 - (1 - \pi_b)\left(1 - \pi_b(1 - \mu)\right)^{n_H + n_D - 1},
\end{aligned} \tag{A.31}$$
where $\pi_g$ and $\pi_b$ respectively represent the steady-state probabilities of staying in the
good and bad states and $\mu$ is the Gilbert channel’s memory as defined in (A.4). Using the
derivations in Section A.6.1.1.1, we can express the average amount of redundancy
required by a FEC decoder operating on a UDP protocol stack as
$$r_{udp,GC} \ge n_D\, l\,\varepsilon_{udp,GC}. \tag{A.32}$$
A.6.1.2.2 Bound on a UDP-Lite based Protocol Stack
A UDP-Lite based protocol stack drops all packets that have header errors. Thus the
probability of packet erasures, $\varepsilon_{lite,GC}$, of UDP-Lite over a GC is:
$$\varepsilon_{lite,GC} = \Pr\{\text{corrupt hdr}\} = 1 - \left(\pi_g\, p_{gg}^{\,n_H} + \pi_b\, p_{bg}\, p_{gg}^{\,n_H - 1}\right) = 1 - \pi_g\, p_{gg}^{\,n_H - 1}. \tag{A.33}$$
Using derivations of Section A.6.1.1.2, the expected number of UDP-Lite erasures is
$$E\{Y \mid \text{UDPLite}\} = n_D\, l\,\varepsilon_{lite,GC} = n_D\, l\left(1 - \pi_g\, p_{gg}^{\,n_H - 1}\right).$$
In addition to erasures, a UDP-Lite protocol stack will also have errors in the application
layer payload. The probability of a symbol error over the Gilbert channel is
$$p_{GC} = \Pr\{\text{symbol err} \mid \text{UDPLite}\} = \pi_g\, p_{gb} + \pi_b\, p_{bb} = \pi_b. \tag{A.34}$$
Then the expected number of UDP-Lite symbol errors is
$$E\{X \mid \text{UDPLite}\} = n_D\, l\,(1 - \varepsilon_{lite,GC})\, p_{GC},$$
and the lower bound on the total expected redundancy required to recover the errors and
losses over a UDP-Lite protocol stack over a GC is
$$r_{udplite,GC} \ge n_D\, l\,\varepsilon_{lite,GC} + 2E\{X \mid \text{UDPLite}\} \ge n_D\, l\left[\varepsilon_{lite,GC} + 2\,(1 - \varepsilon_{lite,GC})\,\pi_b\right]. \tag{A.35}$$
A.6.1.2.3 Bound on a Header Estimation based Protocol Stack
Using the reasoning of Section A.6.1.1.3, the total (expected) amount of redundancy
required by an ideal header estimation scheme over a GC is
$$r_{hdrest,GC} \ge 2\,n_D\, l\,\pi_b. \tag{A.36}$$
Comparison of the above bound with the bounds in (A.32) and (A.35) reveals that
minimum expected redundancy required by header estimation is independent of channel
memory. The redundancy is simply a function of the probability of error. Thus the
performance of header estimation will remain unchanged with changes in channel
memory. On the other hand, the redundancy required by UDP and UDP-Lite is high for
low-memory channels and the redundancy decreases with an increase in channel
memory.
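The dependence of the Gilbert-channel bounds on channel memory can be explored with a short Python sketch. It assumes a Gilbert channel in which the bad state always produces a symbol error (consistent with $p_{GC} = \pi_b$ in (A.34)) and assumes that the memory in (A.4) is $\mu = 1 - p_{gb} - p_{bg}$, which gives $p_{gg} = 1 - \pi_b(1 - \mu)$; the function name is hypothetical and the bounds are normalized by $n_D l$.

```python
def gilbert_redundancy_fractions(pi_b: float, mu: float, n_h: int, n_d: int) -> dict:
    """Bounds (A.32), (A.35), (A.36), each divided by the FEC block length n_D * l.

    Assumption: mu = 1 - p_gb - p_bg, so that p_gg = 1 - pi_b * (1 - mu).
    """
    if not (0.0 <= pi_b < 1.0 and 0.0 <= mu < 1.0):
        raise ValueError("pi_b and mu must lie in [0, 1)")
    pi_g = 1.0 - pi_b
    p_gg = 1.0 - pi_b * (1.0 - mu)
    eps_udp = 1.0 - pi_g * p_gg ** (n_h + n_d - 1)   # (A.31)
    eps_lite = 1.0 - pi_g * p_gg ** (n_h - 1)        # (A.33)
    return {
        "udp": eps_udp,                                          # (A.32)
        "udp_lite": eps_lite + 2.0 * (1.0 - eps_lite) * pi_b,    # (A.35)
        "hdr_est": 2.0 * pi_b,                                   # (A.36)
    }


if __name__ == "__main__":
    for mu in (0.0, 0.5, 0.9):
        print(mu, gilbert_redundancy_fractions(pi_b=0.01, mu=mu, n_h=60, n_d=452))
```

Under these assumptions, the `hdr_est` entry does not change with `mu`, while the `udp` and `udp_lite` entries shrink as `mu` grows, which is the behavior described in the text.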
A.6.1.2.4 Comparison of the FEC Redundancy Bounds
First, let us compare minimum expected FEC redundancies of UDP-Lite and header
estimation:
$$\begin{aligned}
\min(r_{udplite,GC}) - \min(r_{hdrest,GC}) &= n_D\, l\,\varepsilon_{lite,GC} + 2\,n_D\, l\,(1 - \varepsilon_{lite,GC})\,p_{GC} - 2\,n_D\, l\,p_{GC} \\
&= n_D\, l\,\varepsilon_{lite,GC}\,(1 - 2p_{GC}) > 0 \quad \text{for } p_{GC} < 0.5.
\end{aligned}$$
That is,
$$\min(r_{udplite,GC}) > \min(r_{hdrest,GC}) \quad \text{when } p_{GC} < 0.5. \tag{A.37}$$
This condition is similar to the one derived for the $q$-ary symmetric channel, implying
that header estimation should perform better than UDP-Lite as long as the average
probability of error is less than 0.5. For any reasonable Gilbert wireless channel, the
probability of symbol error should be considerably smaller than 0.5.
We now compare minimum expected redundancies of UDP and header estimation
over a GC:
$$\min(r_{udp,GC}) - \min(r_{hdrest,GC}) = n_D\, l\,\varepsilon_{udp,GC} - 2\,n_D\, l\,p_{GC},$$
where $\varepsilon_{udp,GC}$ and $p_{GC}$ are given in (A.31) and (A.34), respectively. Plugging in the
values of $\varepsilon_{udp,GC}$ and $p_{GC}$ gives
$$\begin{aligned}
\min(r_{udp,GC}) - \min(r_{hdrest,GC})
&= n_D\, l\left[1 - \pi_g\, p_{gg}^{\,n_H + n_D} - \pi_b\, p_{bg}\, p_{gg}^{\,n_H + n_D - 1} - 2\left(\pi_g\, p_{gb} + \pi_b\, p_{bb}\right)\right] \\
&= n_D\, l\left[1 - 2\pi_b - \pi_g\, p_{gg}^{\,n_H + n_D - 1}\right] \\
&= n_D\, l\left[\pi_g\left(2 - p_{gg}^{\,n_H + n_D - 1}\right) - 1\right].
\end{aligned}$$
Based on the above comparison, we obtain the following condition:
$$\min(r_{udp,GC}) > \min(r_{hdrest,GC}) \quad \text{when } \pi_g \to 1. \tag{A.38}$$
The above inequality is generally true because on any practical wireless channel
$n_H + n_D$ will always be greater than one symbol. Also, $p_{gg}$, the overall probability of
staying in the error-free state, is generally very high. Thus FEC comparison of UDP
versus header estimation for the Gilbert channel essentially converges to the same
conclusion as the symmetric channel: Unless the channel has an unreasonably high
error-rate, header estimation will always utilize wireless bandwidth more efficiently than UDP.
Figure 25 shows the percentage of redundant symbols in each FEC block for UDP,
UDP-Lite and header estimation over a Gilbert Channel. The redundancy is plotted
against channel memory while fixing the probability of error. The leftmost points in
Figure 25 represent the memory-less case. It can be seen that header estimation always
requires less FEC redundancy to recover corrupted packets than UDP and UDP-Lite.
This difference in the amount of required redundancy gets more significant with an
increase in the probability of error. In general, due to the large number and bursts of good
symbols in a high memory channel, the amount of redundancy required by UDP and
UDP-Lite decreases with an increase in channel memory. In all cases, the redundancy
required by header estimation is extremely low and independent of the channel memory.
Thus, while the design of FEC schemes for UDP and UDP-Lite needs to take channel
memory into account, an accurate header estimator can be deployed on a wireless
network without any knowledge of the underlying channel’s memory.
[Figure 25. Minimum expected FEC redundancies of UDP, UDP-Lite and Ideal Header Estimation over a Gilbert channel; $m = 8$, $L = 30$, $n_H = 60$, $n_D = 452$: (a) $p_{GC} = \pi_b = 0.001$, (b) $p_{GC} = \pi_b = 0.01$. Horizontal axis: channel memory, $\mu$; vertical axis: redundant FEC symbols (%). Curves: UDP, UDP-Lite, Header Estimation.]
A.6.1.3 Discussion
At this point, we have theoretically verified that a protocol employing header
estimation should require less FEC redundancy at a wireless receiver than UDP and
UDP-Lite. This naturally brings us to the practical question of how to realize an accurate
header estimation technique for wireless environments. The following section addresses
this question by designing a header estimation scheme which utilizes the channel models
proposed in preceding chapters.
A.6.2 Maximum-Likelihood Header Estimation Framework
The maximum-likelihood estimation scheme proposed in this section only estimates
the critical header fields (CHF) that can uniquely identify a UDP multimedia session at a
receiver and are not liable to change during the course of the multimedia session. In our
experiments, we treat the following as CHF: (i) destination MAC address, (ii) source IP
address, (iii) destination IP address, (iv) source port, and (v) destination port.
Nevertheless, all mathematical treatment is provided for a general case of N critical
fields.
Under the proposed methodology, a list of active CHF (i.e., CHF of sessions that are
currently being received) is provided to a header estimation module by the multimedia
application(s). On receiving the first error-free packet of a new session, the multimedia
application adds the new session’s CHF information to the list of active multimedia
sessions. Whenever a corrupted packet is received, a likelihood score of its critical fields
is computed with respect to each entry of the CHF list. The CHF rendering the highest
likelihood are chosen as the estimated CHF of the received (corrupted) packet.
The main objective of header estimation is to pass the maximum number of (error-free
and corrupted) packets to the application layer using only parameters of a MAC layer bit-
error channel model. We defer discussion on how an application can make use of the
corrupted packets to subsequent sections.
A.6.2.1 Functionality at and below a Receiver’s MAC layer
Figure 26 outlines the interactions between the proposed header estimation module
and different layers of a wireless receiver’s protocol stack. The packets after wireless
physical layer processing are passed to the MAC layer which verifies the packet’s
checksum to determine if the received packet has errors. Instead of dropping a corrupted
packet, the packet and its checksum information (i.e., packet passed/failed the checksum)
are passed to a module that checks the transport type, the destination MAC address, and
the destination IP address of the received packet. Header estimation is invoked only for
UDP packets, while TCP and network layer traffic are handled by the conventional
protocol stack. Furthermore, the MAC layer does not attempt retransmission-based
recovery of corrupt UDP packets, i.e., ACKs are sent even for corrupt UDP packets.
Instead of MAC retransmissions, header estimation with application layer FEC is used to
recover from errors in the packet. Such retransmission-less recovery is well-suited for
delay-sensitive real-time communications.
[Figure 26. Interactions between the UDP-based header estimation module and different layers of a wireless receiver’s protocol stack; modified protocol stack layers are shown in different colors and dotted lines represent communications that are not related to packet reception. Blocks: Wireless Channel, Physical Layer, MAC Layer without UDP Pkt Drops, Header Estimation Module, Network and Transport Layers, Network and Transport Layers with Disabled Checksums, Application Layer. Labeled flows: received pkt after physical layer processing; error-free pkt; corrupt UDP pkt which has either dst MAC or dst IP address of local receiver; corrupt UDP pkt with estimated CHF; pkt after network and transport layer processing; updated channel model parameters; list of active CHF.]
Header estimation is invoked when all of the following conditions are satisfied: (i) a
corrupt UDP packet is received, (ii) either the destination MAC or the destination IP
address matches the local receiver’s addresses, and (iii) there are one or more active
multimedia sessions on the receiver. Three scenarios exist when a packet is received:
(i) Packet is error-free: No need to perform header estimation.
(ii) Packet is corrupt and the packet is intended for the local receiver: Header estimation
is invoked and an ACK is sent to the last hop network entity to avoid MAC layer
retransmissions.
(iii) Packet is corrupt and the packet is not intended for the local receiver: This case
represents a false alarm when, due to channel errors, either destination MAC or
destination IP of a packet not intended for the local receiver gets mapped to the MAC or
IP address of the local receiver. Due to the receiver-based nature of the present scheme, false
alarms cannot be detected at a receiver’s MAC layer. Thus header estimation is invoked
even for false alarm packets, and a MAC layer ACK is sent to the last hop network
entity.
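The invocation conditions and the three scenarios above can be summarized as a small decision function. This is an illustrative Python sketch only; the type and field names (`ReceivedFrame`, `Action`, `mac_layer_action`) are hypothetical and do not correspond to any real protocol stack API. Note that for a corrupt packet the transport type and addresses are read from possibly corrupted header bits, which is why false alarms (scenario iii) cannot be distinguished from scenario (ii) at this point.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONVENTIONAL_STACK = "conventional_stack"          # normal processing path
    ESTIMATE_HEADER_AND_ACK = "estimate_header_and_ack"  # header estimation path


@dataclass(frozen=True)
class ReceivedFrame:
    checksum_ok: bool       # MAC checksum result
    is_udp: bool            # transport type as read from the received header
    dst_mac_matches: bool   # destination MAC equals the local MAC address
    dst_ip_matches: bool    # destination IP equals the local IP address


def mac_layer_action(frame: ReceivedFrame, active_sessions: int) -> Action:
    """Decide whether header estimation is invoked for a received frame."""
    if frame.checksum_ok:
        # Scenario (i): error-free packet, no header estimation needed.
        return Action.CONVENTIONAL_STACK
    addressed_to_us = frame.dst_mac_matches or frame.dst_ip_matches
    if frame.is_udp and addressed_to_us and active_sessions > 0:
        # Scenarios (ii) and (iii): estimate the header and ACK the frame
        # so that the last hop does not retransmit it.
        return Action.ESTIMATE_HEADER_AND_ACK
    return Action.CONVENTIONAL_STACK
```

For example, a corrupt UDP frame whose destination IP matches the local address triggers estimation only when at least one multimedia session is active.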
A.6.2.2 The Header Estimation Module
The header estimation module employs a likelihood function to find the most likely
transmitted CHF given: (i) the received CHF, (ii) a list of active CHF, and (iii)
parameters of the MAC layer error channel model. The list of active CHF is provided by
the receiver’s application layer as shown in Figure 26. The transmitted/active CHF that
renders the maximum value of the likelihood function is chosen as the estimated CHF.
The corrupt packet and the estimated CHF are passed to higher layers. In essence, the
present header estimation problem is the estimation-theoretic problem of maximum-
likelihood (ML) estimation of known parameters in noise [79].
A.6.2.3 Processing at a Receiver’s Network, Transport and Application Layers
The corrupted packets along with the estimated CHF are passed by the header
estimation module to the receiver’s network layer. The network layer performs its regular
operation with two modifications: (a) instead of the (possibly corrupted) IP addresses in
the network layer header, the estimated IP addresses are treated as the true IP addresses;
(b) network layer checksum on IP headers is disabled. At the UDP layer, source and
destination ports are taken from the estimated CHF and the corrupted packets are passed
to the (estimated) multimedia application.
A.6.3 Likelihood Functions for Header Estimation
In this section, we derive header estimation likelihood functions for two previously
proposed classes of MAC layer channel models, namely the full-state Markov (FSM)
model and the multifractal wavelet model (MWM). Let $\Lambda_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$ denote
an ordered set of $N$ critical header fields for an arbitrary multimedia session $i$. As
mentioned before, in this chapter we have $N = 5$, where $x_{i1}$, $x_{i2}$, $x_{i3}$, $x_{i4}$ and $x_{i5}$
correspond to the destination MAC, source IP, destination IP, source port, and destination
port of multimedia session $i$, respectively. A receiver receives $M \ge 1$ simultaneous
multimedia streams. Let $\Omega = \{\Lambda_1, \Lambda_2, \ldots, \Lambda_M\}$ denote an unordered set of CHF, each
corresponding to a currently active multimedia session on a given receiver. Note that
each $\Lambda_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\} \in \Omega$ is in turn a set of critical fields corresponding to a
given session, where the first subscript of $x$ is the session index and the second subscript
is the CHF index. Let $\tilde{\Lambda}_r$ denote the set of CHF of a received packet, i.e.,
$\tilde{\Lambda}_r = \{\tilde{x}_{r1}, \tilde{x}_{r2}, \ldots, \tilde{x}_{rN}\}$ is a possibly corrupted version of a $\Lambda_i \in \Omega$. Let $\hat{\Lambda}_r$ denote
the estimated CHF.
Let $\mathrm{X}$ represent a stochastic MAC layer channel model characterizing the bit-error
channel over which a receiver is receiving its packets. Then, for a critical header field $x_{ij}$
(i.e., critical field $j$ for a multimedia session $i$), our objective is to derive the likelihood
function $\Pr\{\tilde{x}_{rj} \mid x_{ij}, \mathrm{X}\}$ in terms of the parameters of $\mathrm{X}$. In other words, given
parameters of a channel model $\mathrm{X}$, we want to find the likelihood that a transmitted
critical header field $x_{ij}$ (after possible channel corruptions) was received as $\tilde{x}_{rj}$. We
assume that the likelihood functions of all CHF are independent. Thus the $\Pr\{\tilde{x}_{rj} \mid x_{ij}, \mathrm{X}\}$
for each critical field can be ascertained independently and then the overall likelihood
considering all critical fields is:
$$\Pr\{\tilde{\Lambda}_r \mid \Lambda_i, \mathrm{X}\} = \prod_{j=1}^{N} \Pr\{\tilde{x}_{rj} \mid x_{ij}, \mathrm{X}\}, \tag{A.39}$$
where $1 \leq i \leq M$ is the session index and $j$ is the CHF index. Once $\Pr\{\tilde{\Lambda}_r \mid \Lambda_i, \mathrm{X}\}$ has
been computed for all $1 \leq i \leq M$, the CHF estimate $\hat{\Lambda}_r$ is simply the $\Lambda_i$ that renders
the maximum $\Pr\{\tilde{\Lambda}_r \mid \Lambda_i, \mathrm{X}\}$.
The challenge of this ML-based header estimation lies in the derivation of a
likelihood function $\Pr\{\tilde{x}_{rj} \mid x_{ij}, \mathrm{X}\}$ of a critical field, given parameters of a wireless
channel model. In the following sections, we derive likelihood functions for FSM and
MWM channel models.
A.6.3.1 Header Estimation Likelihood Function for FSM Chains
In this section, we derive the CHF likelihood function for a $k$-th order FSM chain
$\mathrm{X}_n$, where $n$ is the bit time index. For clarity, in this chapter we deviate slightly from
the previously used FSM chain notation and represent the transition probability between FSM
states $i$ and $j$ as $\Pr\{i \to j\}$. We focus on one arbitrary critical field
$x_{ij} \in \Lambda_i$ by fixing the CHF index $j$. Henceforth $x_i$ and $\tilde{x}_r$ respectively represent the
critical field $j$ from $\Lambda_i$ and the received critical field $j$. Let us define a new variable:
$$z_i = \tilde{x}_r \oplus x_i, \tag{A.40}$$
where $\oplus$ represents a binary exclusive-OR operation. $z_i$ marks the bit locations that
differ between $\tilde{x}_r$ and $x_i$. Assuming that the differing bits are in fact the errors
introduced by the FSM channel, $\Pr\{\tilde{x}_r \mid x_i, \mathrm{X}_n\}$ is the likelihood of observing error pattern $z_i$
on the channel.
Recall from previous discussions [see Figure 15] that an FSM chain in state $v_i$ can
only transit to two FSM states, $2v_i + 0$ or $2v_i + 1$; all FSM states are taken $\bmod\ 2^k$. Thus,
when the bit added to $2v_i$ is $z_i[k+1]$, we get the transition probability $\Pr\{v_i \to 2v_i + z_i[k+1]\}$.
From state $2v_i + z_i[k+1]$, the process will transit to
$2(2v_i + z_i[k+1]) + z_i[k+2] = 4v_i + 2z_i[k+1] + z_i[k+2]$. Using similar logic, the
process will next transit to
$2(4v_i + 2z_i[k+1] + z_i[k+2]) + z_i[k+3] = 8v_i + 4z_i[k+1] + 2z_i[k+2] + z_i[k+3]$.
A recursive relationship in the transition probabilities can be identified at this point.
Generalizing the recursive relationship yields the header estimation likelihood function
for a $k$-th order FSM chain as follows:
$$
\begin{aligned}
\Pr\{\tilde{x}_r \mid x_i, \mathrm{X}_n = \text{FSM}\} = {} & \pi_{v_i} \Pr\{v_i \to (2v_i + z_i[k+1]) \bmod 2^k\} \\
& \times \prod_{a=2}^{W-k} \Pr\Big\{ \Big(2^{a-1} v_i + \sum_{b=0}^{a-2} 2^{a-2-b} z_i[k+1+b]\Big) \bmod 2^k \\
& \qquad\qquad \to \Big(2^{a} v_i + \sum_{b=0}^{a-1} 2^{a-1-b} z_i[k+1+b]\Big) \bmod 2^k \Big\},
\end{aligned}
\tag{A.41}
$$
where $n$ is the bit time index, $v_i$ is the FSM state represented by the first $k$ bits of $z_i$,
$W$ represents the number of bits in the critical field, $\pi_x$ represents the steady-state
probability of being in FSM state $x$, $\Pr\{x \to y\}$ is the transition probability of going
from FSM state $x$ to state $y$, and $z_i[x]$ represents the value of $z_i$ at the $x$-th bit
location.
The FSM likelihood function answers the following question: What is the probability
that channel errors have changed $x_i$ to $\tilde{x}_r$? Since $z_i = \tilde{x}_r \oplus x_i$ denotes the bit pattern
that would be observed if the channel changed $x_i$ to $\tilde{x}_r$, we have to find the probability
that the channel $\mathrm{X}_n$ produced the bit-error pattern $z_i$. Clearly, the FSM channel's initial
state must be $v_i$ because $v_i$ denotes the FSM state represented by the first memory-window
of $z_i$, leading to the $\pi_{v_i}$ term. This initial state must be followed by a unique
sequence of state transitions that result in the bit-error pattern $z_i$. To quantify the
probability that an FSM channel will follow this "unique state sequence", recall that in
one transition the FSM process can only transit to two possible states. Also, due to the
Markov property, the probability of transiting to one of the two possible states is only
dependent on the present state. The final likelihood score of $x_i$ is hence characterized by
a multiplication of the transition probabilities of this unique state sequence, as
represented by the multiplicative $\Pr\{x \to y\}$ terms in the likelihood function.
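The state walk described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function name `fsm_likelihood`, the bit-list representation of $z_i$ (first bit first), and the dictionary layout of the transition probabilities are all assumptions made for the example.

```python
from typing import Dict, Sequence, Tuple


def fsm_likelihood(
    z: Sequence[int],
    k: int,
    pi: Sequence[float],
    trans: Dict[Tuple[int, int], float],
) -> float:
    """Likelihood of the error pattern z under a k-th order FSM chain.

    z     : error pattern bits (0 = good, 1 = corrupted), first bit first
    pi    : pi[s] is the steady-state probability of FSM state s
    trans : trans[(s, t)] is the transition probability from state s to t
    (All three containers are hypothetical layouts chosen for this sketch.)
    """
    if len(z) < k:
        raise ValueError("the field must be at least k bits long")
    mask = (1 << k) - 1  # keeps states modulo 2^k

    # Initial state v_i: the first k bits of z, oldest bit most significant.
    state = 0
    for bit in z[:k]:
        state = (state << 1) | bit
    prob = pi[state]

    # Each remaining bit forces exactly one transition: s -> (2s + bit) mod 2^k.
    for bit in z[k:]:
        nxt = ((state << 1) | bit) & mask
        prob *= trans[(state, nxt)]
        state = nxt
    return prob
```

Walking the bits one at a time avoids forming the powers of two in (A.41) explicitly; the shift-and-mask step is the same recursion.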
A.6.3.2 Header Estimation Likelihood Function of MWM
Recall from Section A.3.8 that the multifractal wavelet model (MWM) uses
expectation-maximization to model two random variables: (i) the scaling coefficient at
the coarsest scale $U_{j_0,k_0}$, where $j_0$ and $k_0$ represent the coarsest scale and time,
respectively; (ii) $A_{j,k}$ random variables defined over the $[-1,1]$ interval, with $j$ and $k$
representing the scale and time, respectively. In previous chapters, we showed that the 11
Mbps bit-errors have long-range dependence which can be captured using the MWM.
Therefore, in this section we derive the likelihood function for an MWM. Previously we
used the bit-error sequences of zeros and ones to train the MWM. Derivation of a
likelihood function for an MWM trained using such a strategy is somewhat difficult.
Consequently, in this chapter we train an MWM using the number of bit-errors in a
packet as the training sequence.
Let $\mathrm{X}_n$ denote the MWM process, where $n$ represents discrete packet time
instances. It was shown in [61] that due to the use of the Haar wavelet transform, the
MWM-predicted number of errors $e[n]$ in packet $n$ can be expressed as
$e[n] = 2^{-m/2} U_{m,n}$ for $n = 0, 1, \ldots, 2^m - 1$. If the packets have a fixed size $C$, then the
probability of a bit-error in the packet received at packet time index $n$ is
$p[n] = e[n]/C = 2^{-m/2} U_{m,n}/C$. Now note that each received bit is basically a value
taken from a binary time series of length $l$, i.e., $\{x[i]\}_{i=0}^{l}$, $x[i] \in \{0,1\}$, and $i$
represents the discrete bit time index. Based on equation (A.40), $\sum_{b=1}^{W} z_i[b]$ yields the
total number of bits that differ between $\tilde{x}_r$ and $x_i$, i.e., the Hamming distance
between $\tilde{x}_r$ and $x_i$. If the bits of $z_i$ are in fact the errors introduced by an MWM wireless
channel, then the probability of having $\sum_{b=1}^{W} z_i[b]$ errors is $(p[n])^{\sum_{b=1}^{W} z_i[b]}$,
and the probability of having $W - \sum_{b=1}^{W} z_i[b]$ correct bits is
$(1 - p[n])^{W - \sum_{b=1}^{W} z_i[b]}$. The likelihood of the bit pattern $z_i$ is then the product of the
probabilities of the above events. Thus the MWM likelihood function is as follows:
$$
\Pr\{\tilde{x}_r \mid x_i, \mathrm{X}_n = \text{MWM}\} = \left(\frac{2^{-m/2} U_{m,n}}{C}\right)^{\sum_{b=1}^{W} z_i[b]} \left(1 - \frac{2^{-m/2} U_{m,n}}{C}\right)^{W - \sum_{b=1}^{W} z_i[b]},
\tag{A.42}
$$
where $n = 0, 1, \ldots, 2^m - 1$ is the packet time index, $m$ is the number of scales used to
train the MWM, $C$ is the number of bits in a packet, $W$ is the number of bits in the
critical field, $z_i$ is given in (A.40), and $U_{m,n}$ is the scaling coefficient at scale $m$
and time $n$.
Similar to (A.41), the MWM likelihood function renders the probability that the bit-error
pattern $z_i = \tilde{x}_r \oplus x_i$ is observed on an MWM channel. Since the probability of a bit-error
in packet $n$ is given by $2^{-m/2} U_{m,n}/C$, the probability of observing $\sum_{b=1}^{W} z_i[b]$
bit-errors in packet $n$ is $(p[n])^{\sum_{b=1}^{W} z_i[b]} = (2^{-m/2} U_{m,n}/C)^{\sum_{b=1}^{W} z_i[b]}$. Treating error-free
and corrupted bits as the two outputs of a Bernoulli random variable yields the
MWM likelihood expression.
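As a rough illustration of the Bernoulli form of the MWM likelihood, the sketch below evaluates it for one critical field. The function name `mwm_likelihood` and its argument names are hypothetical; the scaling coefficient $U_{m,n}$ is assumed to be supplied by an already trained MWM.

```python
from typing import Sequence


def mwm_likelihood(z: Sequence[int], u_mn: float, m: int, packet_bits: int) -> float:
    """Likelihood of error pattern z on an MWM channel.

    z           : error pattern bits of the critical field (1 = corrupted)
    u_mn        : scaling coefficient U_{m,n} for the packet (assumed given)
    m           : number of scales used to train the MWM
    packet_bits : packet size C in bits
    """
    # Per-bit error probability p[n] = 2^(-m/2) * U_{m,n} / C.
    p = (2.0 ** (-m / 2.0)) * u_mn / packet_bits
    if not 0.0 <= p <= 1.0:
        raise ValueError("bit-error probability must lie in [0, 1]")

    errors = sum(z)            # Hamming weight of z
    correct = len(z) - errors  # W minus the number of errors
    return (p ** errors) * ((1.0 - p) ** correct)
```

Only the Hamming weight of $z_i$ matters here, so two patterns with the same number of differing bits receive the same score, unlike the FSM case where bit positions matter.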
Once the $\Pr\{\tilde{x}_r \mid x_i, \mathrm{X}_n\}$ for all currently active sessions, $1 \leq i \leq M$, are computed
using the FSM or MWM likelihood functions, the session $i$ that renders the maximum
$\Pr\{\tilde{\Lambda}_r \mid \Lambda_i, \mathrm{X}\}$ is chosen as the estimated CHF, $\hat{\Lambda}_r$. We also introduce a provision that a
packet is dropped if the maximum likelihood is less than 0.25 because in such a case the
estimation confidence is very low.
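The selection step can be sketched as follows. This is only an illustration under stated assumptions: header fields are represented as integers with known bit widths, `field_likelihood` stands for whichever per-field likelihood (FSM, CCM or MWM) is in use, and the helper names `xor_bits` and `estimate_session` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def xor_bits(received: int, candidate: int, width: int) -> List[int]:
    """Error pattern z = received XOR candidate, as bits, most significant first."""
    z = received ^ candidate
    return [(z >> (width - 1 - b)) & 1 for b in range(width)]


def estimate_session(
    received: Sequence[int],
    sessions: Sequence[Sequence[int]],
    widths: Sequence[int],
    field_likelihood: Callable[[List[int]], float],
    threshold: float = 0.25,
) -> Optional[int]:
    """Return the index of the most likely session, or None to drop the packet."""
    best_index, best_score = None, 0.0
    for i, fields in enumerate(sessions):
        # Product over the critical fields, as in (A.39).
        score = 1.0
        for x_r, x_i, width in zip(received, fields, widths):
            score *= field_likelihood(xor_bits(x_r, x_i, width))
        if score > best_score:
            best_index, best_score = i, score
    # Low-confidence estimates are dropped (threshold 0.25 in the text).
    if best_index is None or best_score < threshold:
        return None
    return best_index
```

A caller would pass the received critical fields and the per-session field tuples; a return value of `None` corresponds to the packet being dropped.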
A.6.3.3 Extending the FSM Likelihood Function to the CCM
The complexity of an MWM to generate a length l sequence is linear. However, the
complexity of FSM chains grows exponentially with respect to memory-length. Due to
their exponential complexity, FSM chains are unreasonably complex to be employed in
the header estimation framework. Therefore, in this section we extend the FSM
likelihood function to the CCM so that the approximating CCM model can be used for
header estimation instead of the FSM model.
Let $S_x$ denote the aggregate CCM state that contains FSM state $x$. Since the CCM
aggregates FSM states, using (A.41) the likelihood function for the CCM can be rewritten
as:
$$
\begin{aligned}
\Pr\{\tilde{x}_r \mid x_i, \mathrm{X}_n = \text{CCM}\} = {} & \pi_{S_{v_i}} \Pr\{S_{v_i} \to S_{2v_i + z_i[k+1]}\} \\
& \times \prod_{a=2}^{W-k} \Pr\Big\{ S_{2^{a-1} v_i + \sum_{b=0}^{a-2} 2^{a-2-b} z_i[k+1+b]} \to S_{2^{a} v_i + \sum_{b=0}^{a-1} 2^{a-1-b} z_i[k+1+b]} \Big\},
\end{aligned}
$$
where the subscripts of all aggregate states $S_x$ are modulo $2^k$ and all other parameters
are defined in Section A.6.3.1. The low complexity of the CCM clearly makes it a natural
alternative to FSM chains in the present header estimation methodology. In all
subsequent performance evaluations of the header estimation methodology, we use
CCMs instead of FSM chains and show that the likelihood function rendered by the CCM
is highly accurate.
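A hedged sketch of the CCM variant is given below. It walks the same FSM state sequence as before but looks up probabilities for the aggregate states. The five-way mapping follows the CCM description used later in the text (FSM states 0, 1 and $2^{k-1}$ isolated, remaining even and odd states pooled); the state labels, function names, and container layouts are assumptions of this example, and $k \geq 3$ is assumed.

```python
from typing import Dict, Sequence, Tuple


def ccm_state(x: int, k: int) -> str:
    """Map FSM state x (0 <= x < 2^k) to a label for its aggregate CCM state."""
    if x == 0:
        return "c0"
    if x == 1:
        return "c1"
    if x == 1 << (k - 1):
        return "half"  # the isolated state 2^(k-1)
    return "even" if x % 2 == 0 else "odd"


def ccm_likelihood(
    z: Sequence[int],
    k: int,
    pi: Dict[str, float],
    trans: Dict[Tuple[str, str], float],
) -> float:
    """Likelihood of error pattern z using aggregate (CCM) probabilities."""
    if len(z) < k:
        raise ValueError("the field must be at least k bits long")
    mask = (1 << k) - 1
    state = 0
    for bit in z[:k]:  # first memory-window gives the initial FSM state
        state = (state << 1) | bit
    prob = pi[ccm_state(state, k)]
    for bit in z[k:]:
        nxt = ((state << 1) | bit) & mask
        prob *= trans[(ccm_state(state, k), ccm_state(nxt, k))]
        state = nxt
    return prob
```

The FSM state is still tracked exactly, but only five steady-state probabilities and a handful of transition probabilities need to be stored, which is the point of the aggregation.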
A.6.4 Performance Evaluation of the Header Estimation Framework
A.6.4.1 Experimental Setup
We use the wireless traces described in Section A.4.1 to simulate the wireless
channel. For video evaluations, we report throughput, FEC and PSNR results for five
multimedia receivers. Each receiver receives multiple video streams with a maximum of
five video streams. At each physical layer data rate, we repeat video experiments using
three distinct wireless trace-sets that were collected at different times of day. Video
experiments for each trace-set are repeated 25 times starting at different randomly
selected locations inside the error traces. Thus the throughput and FEC results for 2, 5.5
and 11 Mbps are each averaged over $3 \times 5 \times 5 \times 25 = 1{,}875$ received video streams. Due
to the high complexity of video decoding, for each trace-set the PSNR results are
reported for one (randomly selected) video experiment, that is, PSNR results for 2, 5.5
and 11 Mbps are each averaged over $3 \times 5 \times 5 = 75$ received video streams.
For each packet transmission, a 512 byte packet (452 bytes of video payload and 60
bytes of headers) was corrupted using the bit-error traces. The models used for likelihood
computation on all receivers were trained using error traces which were not used in the
video experiments. In accordance with the results of Sections A.4.3.1 and A.4.3.2, FSM
chains of order-9 and order-10 were employed for the 5.5 and 2 Mbps bit-error processes,
and an MWM trained using the number of bit-errors in a packet was employed for the 11
Mbps process. Each FSM chain was folded to a 5-state CCM.
Video sequences were compressed using the H.264 video coding standard [83], [85].
The sequences had a QCIF frame size and were encoded at a frame-rate of 30 fps. The
streams were encoded at different source coding bitrates ranging from 100 kbps to 1
Mbps. A slice mode with fixed number of 452 bytes per slice was used for encoding [83].
Intra frame period was set to 12, i.e., each group of pictures (GOP) had 12 frames.
Varying numbers of video streams were assigned to the wireless receivers. Transmission
of packets from each stream was simulated in a round robin fashion according to source
bitrates. In order to achieve successful video decoding, in the simulations we introduced a
provision that the first frame of the video sequence (i.e., the very first I-frame of the first
GOP) is always received correctly.
A.6.4.2 Throughput Performance
The term throughput here refers to the ratio of the total number of packets relayed to
the receiver’s application to the total number of packets sent by the sender’s application
layer. That is, throughput comprises both error-free and corrupted packets. The
percentage packet drop rate is $(1 - \text{throughput}) \times 100$. Figure 27 outlines the packet
drops incurred by UDP Normal, UDP-Lite and UDP with header estimation at 2, 5.5, and
11 Mbps. The results are averaged over all receivers and multimedia streams and hence
the packet drops are referred to as average packet drops. The leftmost points in Figure 27
(a), (b), and (c) depict the simplest case, in which each receiver is receiving only one multimedia
stream. The number of video streams per receiver is then incremented. More than one
multimedia stream per receiver is an important scenario for video conferencing applications.
A.6.4.3 Comparison of Packet Drops
It can be clearly seen in Figure 27 (a), (b) and (c) that header estimation always incurs
fewer packet drops than normal UDP and UDP-Lite. The header estimation packet drops
include: (i) packets that were dropped because both the destination IP and the destination
MAC address were corrupted, and (ii) packets whose critical fields were incorrectly
estimated (resulting in false alarms). At 2 Mbps, header estimation packet drops are
approximately 0.2%, as opposed to approximately 0.4% and 1% in case of UDP-Lite
and normal UDP. Since the 2 Mbps channel has receivers with very low packet error
rates, the margin of improvement is small. At 5.5 Mbps, UDP with header estimation
provides approximately 4% and 2% throughput improvements over normal UDP and
UDP-Lite. Due to the very high data rate at 11 Mbps, the header estimation packet drops
increase to about 3%, but this packet drop rate is still substantially lower than that of
normal UDP ($\approx 15\%$) and UDP-Lite ($\approx 30\%$).
[Figure 27 plots: panels (a) 2 Mbps, (b) 5.5 Mbps, (c) 11 Mbps; x-axis: video streams per receiver (1 to 5); y-axis: average packet drops %; curves: UDP Normal, UDP Lite, UDP Hdr Est]
Figure 27. Average packet drops for UDP Normal, UDP-Lite and UDP with Header Estimation at different data rates and for varying number of video streams per receiver; each point is averaged over $3 \times (\text{number of video streams}) \times 5 \times 25$ received video streams.
A.6.4.4 False Alarm Rate
A false alarm is a packet that is not intended for a multimedia session, but is relayed
to that session. There are three sources of false alarms: (i) due to channel errors, either
destination MAC or destination IP address of a packet (not intended for the local
receiver) gets mapped to the MAC or IP address of the receiver; (ii) a corrupted packet is
inaccurately estimated; (iii) a corrupt non-multimedia UDP packet is received when one
or more multimedia sessions are active.
For the five streams per receiver case, cumulative false alarm rates are 0.07%,
0.52%, and 1.3% at 2, 5.5 and 11 Mbps, respectively. While these false alarms are quite low, they
must be detected because they can desynchronize the video and/or FEC decoders. To
detect false alarms, we protected the 2 byte H.264 slice sequence numbers (in the RTP
header, with one slice per packet) with 4 bytes of redundancy to ensure that these
sequence numbers can always be recovered at the receiver. A receiver dropped all
packets whose slice numbers were much larger or smaller than the next/expected slice
number. For applications which do not have a slice/packet sequence number, a small
incremental packet sequence number with parity bytes can be easily inserted into each
packet by the sender’s application layer. This sequence number based scheme also
provides erasure locations (i.e., dropped packets) to the FEC decoder.
A.6.4.5 FEC Performance
We now evaluate the amount of FEC redundancy required by the application to
recover from errors and packet drops in the multimedia content. Since the corrupted
packets contain many error-free bytes, this error-free data should facilitate application
layer FEC decoding. As mentioned earlier, for an MDS FEC code if a codeword has $2t$
redundant symbols then a maximum of $t$ transmission errors in that block can
be corrected [80]. For the same amount of redundancy, $2t$ erasures can be recovered. In
the UDP-Lite and UDP with header estimation scenarios, for an FEC codeword with $e_1$
erasures (i.e., packet drops) and $e_2$ errors, if $e_1 \leq 2t$ then the FEC decoding algorithm
can recover the $e_1$ erasures. After erasure decoding, $e_2$ errors can be corrected if
$e_2 \leq \lfloor (2t - e_1)/2 \rfloor$.
We simulate MDS forward error correction for all three (UDP Normal, UDP-Lite,
UDP with header estimation) protocol variants. A codeword length of $N = 30$ bytes is
used for all experiments. Each codeword is composed of one byte from a different packet,
where each packet consists of 452 bytes of data payload. Thus each packet contributes to
452 separate RS codewords, and each codeword spans over 30 packets. The FEC
construction is shown pictorially in Figure 28. For all protocol stack variants, we treat
packet drops as erasures in the received codewords. Note in Figure 28 that a packet drop
results in an erasure in 452 codewords.
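The interleaving described above can be illustrated with a short sketch. It assumes, for the example only, that each packet payload is a `bytes` object and that a dropped packet is represented by `None`; the function name `build_codewords` is hypothetical, and no actual Reed-Solomon decoding is performed.

```python
from typing import List, Optional, Sequence


def build_codewords(
    packets: Sequence[Optional[bytes]], payload_len: int
) -> List[List[Optional[int]]]:
    """Form one codeword per payload byte position across a block of packets.

    Codeword j takes byte j from every packet in the block. A dropped packet
    (None) therefore shows up as an erasure (None) in every codeword.
    """
    codewords = []
    for j in range(payload_len):
        codewords.append([None if p is None else p[j] for p in packets])
    return codewords


# Example: a block of 30 packets with 452-byte payloads, packet index 3 dropped.
block = [bytes([n]) * 452 for n in range(30)]
block[3] = None
codewords = build_codewords(block, 452)
erasures_per_codeword = {cw.count(None) for cw in codewords}  # {1}
```

With this layout a single packet drop costs each of the 452 codewords exactly one symbol, so the loss is spread evenly instead of wiping out a whole codeword.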
Since normal UDP does not have corrupted packets, all parity bytes are used for
erasure decoding. Unlike normal UDP, FEC codewords for UDP with header estimation
have errors due to corrupted packets and erasures due to incorrect estimations and/or
false alarms. Similarly, FEC codewords for UDP-Lite have errors due to corrupted
packets and erasures due to packets with corrupted headers. For performance evaluation,
we define a simple measure called decodable probability:
$$p_d = \frac{\text{decodable codewords received}}{\text{codewords transmitted}},$$
where a codeword with $e_1$ erasures and $e_2$ errors is decodable only if $2t \geq e_1 + 2e_2$.
Clearly, $0 \leq p_d \leq 1$ and $p_d = 1$ implies that all received codewords were successfully
decoded.
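The decodability condition and the resulting decodable probability are simple enough to state in code. The sketch below is illustrative only: it counts erasures and errors per codeword and applies the MDS bound, without running an actual decoder. The names `is_decodable` and `decodable_probability`, and the `(erasures, errors)` tuple layout, are assumptions.

```python
from typing import Iterable, Tuple


def is_decodable(erasures: int, errors: int, parity: int) -> bool:
    """MDS bound: a codeword with 2t parity symbols is decodable iff e1 + 2*e2 <= 2t."""
    return erasures + 2 * errors <= parity


def decodable_probability(
    codewords: Iterable[Tuple[int, int]], parity: int
) -> float:
    """Fraction of transmitted codewords that satisfy the MDS bound.

    codewords : one (erasures, errors) pair per transmitted codeword
    parity    : number of redundant symbols 2t per codeword
    """
    counts = list(codewords)
    if not counts:
        raise ValueError("at least one codeword is required")
    good = sum(1 for e1, e2 in counts if is_decodable(e1, e2, parity))
    return good / len(counts)
```

For instance, with two parity bytes a codeword survives two packet drops or one corrupted byte, but not one of each.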
[Figure 28 diagram: 30 packets (pkt num 1 to 30), each with a pkt hdr and a pkt payload of 452 bytes; RS codewords 1 to 452 run across the packets, so a pkt drop will introduce an erasure in all the 452 RS codewords]
Figure 28. Codeword construction for video FEC simulations.
Figure 29 outlines the decodable probability as a function of the number of message
bytes in an RS codeword for the five streams per receiver experiment. At each data rate,
the results are averaged over all the experiments. From Figure 29 (a), it is clear that at 2
Mbps normal UDP and UDP-Lite require 6 redundant bytes per RS codeword for almost 100%
recovery; that is, approximately 20% bandwidth is wasted in redundancy. UDP with
header estimation achieves almost error-free recovery even if two redundant bytes are
sent per 28 message bytes - approximately 7% bandwidth is used for redundant
symbols. From Figure 29 (b), it can be observed that, due to the increased error-rate at 5.5
Mbps, the performance gap between UDP with header estimation and the other protocols
widens. Normal UDP and UDP-Lite waste approximately 33% bandwidth on FEC
redundancy to achieve almost 100% recovery. UDP with header estimation achieves
almost 100% recovery by wasting merely 20% bandwidth on FEC redundancy. Figure
29 (c) shows that at 11 Mbps the improvements provided by UDP with header estimation
are quite significant; UDP with header estimation requires approximately 27%
redundancy for almost 100% recovery, while both normal UDP and UDP-Lite require
53% redundancy. Thus header estimation salvages the high error rate 11 Mbps channel.
[Figure 29 plots: panels (a) 2 Mbps, (b) 5.5 Mbps, (c) 11 Mbps; x-axis: message bytes per block (16 to 28); y-axis: average decodable probability; curves: UDP Normal, UDP Lite, UDP Hdr Est]
Figure 29. Average FEC redundancy required by UDP Normal, UDP-Lite and UDP with Header Estimation at different data rates of an 802.11b LAN; each point is averaged over $3 \times 5 \times 5 \times 25 = 1875$ received video streams.
A.6.4.6 Video Performance
In this section, we present results for the 5 streams per receiver experiment, with a
fixed rate FEC having two redundant bytes per RS codeword of 30 bytes. The average
GOP-by-GOP peak signal-to-noise ratio (PSNR) plots at different data rates are given in
Figure 30. All PSNR results are averaged over 75 received video streams. Since we
allow the very first video (I) frame of the first GOP to be received without any errors and
losses, PSNR of the first GOP is not plotted. The dotted line in Figure 30 represents
PSNR of error-free video, which provides a performance upper bound for the protocols
under consideration. PSNR of UDP with header estimation is the closest to the PSNR of
the error-free video at all data rates. At 2 and 5.5 Mbps, respective average PSNRs of
normal UDP and UDP-Lite are approximately 10 dB and 25 dB lower than the PSNR of
UDP with header estimation. However, at 11 Mbps the PSNR of UDP with header
estimation is approximately 25 dB higher than the PSNRs of normal UDP and UDP-Lite,
both of which render equally and extremely low PSNRs at 11 Mbps.
A.6.5 Discussion
In this chapter, we developed an effective header estimation framework for wireless
multimedia applications. The proposed framework used the channel models proposed in
preceding chapters to provide significant improvements in wireless bandwidth utilization.
In the following chapter, we show another use of the proposed channel models by
quantifying the simulation and analysis inaccuracies that are incurred if channel memory
is ignored.
[Figure 30 plots: panels (a) 2 Mbps, (b) 5.5 Mbps, (c) 11 Mbps; x-axis: GOP (5 to 20); y-axis: average PSNR; curves: Error-free, UDP Normal, UDP Lite, UDP Hdr Est]
Figure 30. Average PSNR of video sequences for UDP Normal, UDP-Lite and UDP with Header Estimation using a 30 byte RS codeword with 2 parity bytes; each graph is averaged over $3 \times 5 \times 5 = 75$ received video streams.
CHAPTER A.7 IMPACTS OF IGNORING CHANNEL MEMORY ON ANALYSIS AND SIMULATION OF WIRELESS SYSTEMS
Results of the preceding chapters have established that the MAC layer wireless bit-error
channels have memory. We have also shown that accurate and low-complexity
models can be developed to capture the underlying channel’s memory. The burstiness
and the consequent memory of wireless channels are well-accepted concepts in the
wireless research community. However, much of the contemporary research continues to
use memory-less binary-symmetric and 1st order Gilbert channels for bit-level theoretical
analysis and experimental evaluation of wireless protocols and applications [86]-[100].
The impacts of these simplistic bit-error channel models on the design and evaluation of
wireless systems are largely unexplored.
In this chapter, we quantify the impact of bit-level Markovian channel memory on the
performance of two commonly-used and very meaningful wireless performance metrics:
the expected goodput of an unreliable protocol and the expected number of per-packet
retransmissions for a reliable wireless protocol operating on a single-hop wireless
network. Due to the analytical intractability of the multifractal wavelet model, we focus
solely on the Markov-based channel models considered in this thesis. We derive the two
protocol performance metrics in terms of the parameters of four channel models of
varying memory-lengths, namely a memory-less binary-symmetric channel (BSC) model,
a two-state Gilbert channel (GC) model [81], an order-10 (1024 state) full-state Markov
chain, and an order-20 constant-complexity model (CCM). These models are trained
using actual 802.11b MAC layer bit-error traces and subsequently the trained models are
used to estimate the goodput and retransmissions.
We show that extremely misleading estimates of goodput and retransmissions are
obtained when using a BSC or a GC. In particular, for the retransmission metric the
results obtained under the memory-less assumption can be orders of magnitude more
pessimistic than what is observed on the actual channel. On the other hand, the estimates
provided by channel models with high-order memory (i.e., 1024 state FSM and constant-
complexity models) are highly accurate.
A.7.1 Goodput of an Unreliable Protocol
In this section, we quantify the goodput of an abstract unreliable protocol - such as
the UDP protocol [64] - operating over wireless links. Here goodput refers to the ratio
between the number of received error-free packets and the total number of transmitted
packets. We compare how accurately the following bit-error wireless channel models
estimate the goodput of a wireless channel: (i) a memory-less binary-symmetric channel
(BSC) model, (ii) a 2-state Gilbert channel (GC) model [81], (iii) a full-state Markov
(FSM) channel model, and (iv) a constant-complexity channel model (CCM). We first
analytically derive packet goodput in terms of the channel models’ parameters. We train
these models using actual traces and then estimate the traces’ goodputs using the trained
models. If a model accurately characterizes the bit-error channel then it should provide a
goodput estimate that is very close to the trace-based goodput.
A.7.1.1 Goodput of a Wireless Channel
Since contemporary wireless stacks perform a checksum on each packet to detect and
drop corrupted packets, the present abstract protocol drops all packets with one or more
bit-errors. To cater for end-to-end sessions with multiple hops that include a wired
(Internet) segment followed by a wireless access segment, we assume that only the last
transmission hop is a wireless link. We assume an uncongested path between the sender
and the receiver. Also, the wireless hop employs a CSMA/CA mechanism to resolve
channel contentions, and therefore the number of collisions is negligible. These
assumptions ensure that all packet drops are due to channel noise and interference; i.e.,
for simplicity of analysis, we ignore packet drops due to congestion or collisions.
Since we define goodput as the ratio between the number of received error-free
packets and the total number of transmitted packets, goodput is simply the probability $\gamma$
of receiving an error-free packet on the wireless channel. Goodput is constrained by
$0 \leq \gamma \leq 1$, where $\gamma = 0$ represents the limiting case when all the received packets have
errors and are therefore dropped, and $\gamma = 1$ represents the limiting case when all the
received packets are error-free.
We first derive expressions of goodput estimates $\hat{\gamma}$ in terms of the parameters of the
trained channel models. Second, we compute the actual goodput $\gamma$ of the bit-error traces
used in this study. Then for each wireless trace, we train all four channel models
considered in this chapter. Finally, the actual and estimated goodputs ($\gamma$ and the $\hat{\gamma}$'s) are
compared.
A.7.1.2 Goodput of a Binary-Symmetric Channel Model
A binary symmetric channel (BSC) is a special case of the $q$-ary symmetric channel
mentioned in the last chapter. Specifically, a BSC is a stateless channel that corrupts every
transmitted bit with a probability $\varepsilon$. Consequently, goodput or the probability of
receiving an error-free packet of length $L$ over a BSC is simply given by:
$$\hat{\gamma}_{BSC} = \Pr\{\text{error-free pkt} \mid \text{BSC}\} = (1 - \varepsilon)^L. \tag{A.43}$$
Given training bit-error data, the parameter $\varepsilon$ is computed by taking the ratio between the
number of bad bits and the total number of bits in the training data.
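A minimal sketch of the BSC estimate follows: the error probability is trained as the fraction of bad bits in a trace, and the goodput of (A.43) is evaluated from it. The function names and the list-of-bits trace representation are assumptions made for this example.

```python
from typing import Sequence


def estimate_epsilon(trace: Sequence[int]) -> float:
    """Bit-error probability: bad bits divided by total bits in the training trace."""
    if not trace:
        raise ValueError("the training trace must not be empty")
    return sum(trace) / len(trace)


def bsc_goodput(epsilon: float, packet_bits: int) -> float:
    """Probability that all L bits of a packet are error-free on a BSC, (A.43)."""
    return (1.0 - epsilon) ** packet_bits
```

Because the BSC has no memory, errors are spread uniformly over packets; on a bursty channel with the same average bit-error rate, the errors cluster in fewer packets, which is why this estimate tends to be pessimistic.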
A.7.1.3 Goodput of a Gilbert Channel Model
The Gilbert channel (GC) [81] is a 1st order Markov chain with a good and a bad
state. In the present bit-error modeling context, the two Gilbert states jointly capture a
process with a memory-length of one bit. The probability of the next (good or bad) bit is
dependent on whether the last received bit was good or bad. Transitions to the good
state result in error-free bits, while transitions to the bad state yield corrupted bits. Due to
the present notation, we represent the good and bad states as state 0 and state 1,
respectively. The GC is completely characterized using two parameters, $p_{0,0}$ and $p_{1,1}$,
where 0 represents the error-free state and 1 represents the error state. Although both
BSC and GC are special cases of FSM chains, we treat them separately because of their
widespread use in wireless studies [86]- [100].
As shown in the last chapter, goodput or the probability of receiving an error-free
packet of length $L$ over a GC is given by:
$$
\begin{aligned}
\hat{\gamma}_{GC} &= \Pr\{\text{error-free pkt} \mid \text{GC}\} \\
&= \pi_0 (p_{0,0})^{L} + \pi_1 p_{1,0} (p_{0,0})^{L-1} \\
&= (p_{0,0})^{L-1} (\pi_0 p_{0,0} + \pi_1 p_{1,0}) \\
&= \pi_0 (p_{0,0})^{L-1}.
\end{aligned}
\tag{A.44}
$$
The above expression shows that the probability of getting a good packet over a Gilbert
channel model is simply the probability of starting in the error-free state and then staying
in that state for the length of the packet.
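The Gilbert goodput can be checked numerically with the sketch below. It assumes the chain is parameterized by $p_{0,0}$ and $p_{1,1}$ as in the text and uses the standard two-state steady-state solution $\pi_0 = (1 - p_{1,1}) / (2 - p_{0,0} - p_{1,1})$; the function names are hypothetical.

```python
from typing import Tuple


def gilbert_steady_state(p00: float, p11: float) -> Tuple[float, float]:
    """Steady-state probabilities (pi_0, pi_1) of a two-state Markov chain."""
    denom = 2.0 - p00 - p11
    if denom <= 0.0:
        raise ValueError("p00 and p11 must not both equal 1")
    pi0 = (1.0 - p11) / denom
    return pi0, 1.0 - pi0


def gilbert_goodput(p00: float, p11: float, packet_bits: int) -> float:
    """Closed form of (A.44): pi_0 * p00^(L-1)."""
    pi0, _ = gilbert_steady_state(p00, p11)
    return pi0 * p00 ** (packet_bits - 1)


def gilbert_goodput_expanded(p00: float, p11: float, packet_bits: int) -> float:
    """Expanded form of (A.44), summing over both possible starting states."""
    pi0, pi1 = gilbert_steady_state(p00, p11)
    p10 = 1.0 - p11
    return pi0 * p00 ** packet_bits + pi1 * p10 * p00 ** (packet_bits - 1)
```

The two functions agree because the steady-state condition $\pi_0 = \pi_0 p_{0,0} + \pi_1 p_{1,0}$ collapses the bracketed term in (A.44) to $\pi_0$.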
A.7.1.4 Goodput of a Full-state Markov Channel Model
The probability of receiving an error-free packet of L bits on a k -th order FSM
channel model is dependent on the present state of the model. If the last received bit was
error-free then the least-significant bit in the memory-window will be zero, implying that
the FSM chain is in an even state. On the other hand, if the last received bit was corrupted
then the FSM chain would be in an odd state.
Let us first focus on the scenario of currently being in an even state and then
receiving $L$ consecutive good bits. Throughout this chapter, we invoke a realistic
assumption that $L > k$, where $k$ is the memory-length of the process. Let FSM state $2i$,
$0 \leq i \leq 2^{k-1} - 1$, be the current even state of the FSM channel model. Since all FSM
states are subject to the $\bmod\ 2^k$ operation, unless otherwise stated, we drop the $\bmod\ 2^k$ operation
throughout the following text. Recall from Observation 1 in Section A.5.4 that every
FSM state $i$ can transit to only two other states. Thus the current state $2i$ can transit to
either state $2(2i)$ or state $2(2i) + 1$. Since we are only concerned with bursts of error-free
bits, the probability of getting an error-free bit starting in state $2i$ is $p_{2i,\,2(2i)}$. Now
for the length of the memory-window, the next $k - 1$ transitions will be between even
states giving the following state sequence:
$$(2^0)(2i) = 2i \to (2^1)(2i) = 2(2i) \to \cdots \to (2^{k-1})(2i) \bmod 2^k = 0.$$
Thus after these $k - 1$ transitions the process will be in FSM state $0$. From that state, to
get the remaining error-free bits, the next $L - (k - 1)$ transitions will be from state $0$ to
state $0$. To generalize the above discussion in terms of FSM chain parameters, the
probability of getting a burst of $L$ good bits starting in FSM state $2i$ is given by
$\pi_{2i} \, (p_{0,0})^{L-(k-1)} \prod_{j=0}^{k-2} p_{(2^j)(2i),\,(2^{j+1})(2i)}$. This probability has to be summed over all
possible even FSM states yielding $\sum_{i=0}^{2^{k-1}-1} \pi_{2i} \, (p_{0,0})^{L-(k-1)} \prod_{j=0}^{k-2} p_{(2^j)(2i),\,(2^{j+1})(2i)}$.
An expression for the probability of getting an error-free packet starting in an odd
FSM state can be derived similarly. Adding these expressions gives the goodput of an
FSM channel model as follows:
$$
\begin{aligned}
\hat{\gamma}_{FSM} &= \Pr\{\text{error-free pkt} \mid \text{FSM}\} \\
&= \sum_{i=0}^{2^{k-1}-1} \Big[ \pi_{2i} \, (p_{0,0})^{L-(k-1)} \prod_{j=0}^{k-2} p_{(2^j)(2i),\,(2^{j+1})(2i)} + \pi_{2i+1} \, (p_{0,0})^{L-k} \prod_{j=0}^{k-1} p_{(2^j)(2i+1),\,(2^{j+1})(2i+1)} \Big] \\
&= (p_{0,0})^{L-k} \sum_{i=0}^{2^{k-1}-1} \Big[ \pi_{2i} \, p_{0,0} \prod_{j=0}^{k-2} p_{(2^j)(2i),\,(2^{j+1})(2i)} + \pi_{2i+1} \prod_{j=0}^{k-1} p_{(2^j)(2i+1),\,(2^{j+1})(2i+1)} \Big].
\end{aligned}
\tag{A.45}
$$
The above expression gives the overall probability of getting L consecutive error-free
bits by summing over all possible state paths starting in an even or an odd FSM state.
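The sum over starting states can also be computed by walking the all-good-bits path from every FSM state, which is what the following sketch does. It is an illustration under assumptions: states are integers in $[0, 2^k)$, `pi` is a list of steady-state probabilities, `trans` is a dictionary keyed by `(from_state, to_state)`, and the function name is hypothetical.

```python
from typing import Dict, Sequence, Tuple


def fsm_goodput(
    k: int,
    packet_bits: int,
    pi: Sequence[float],
    trans: Dict[Tuple[int, int], float],
) -> float:
    """Probability of L consecutive error-free bits on a k-th order FSM chain."""
    if packet_bits <= k:
        raise ValueError("this sketch assumes L > k, as in the text")
    mask = (1 << k) - 1
    # After k good bits every starting state has been flushed to state 0,
    # so the remaining L - k transitions are all 0 -> 0.
    tail = trans[(0, 0)] ** (packet_bits - k)

    total = 0.0
    for start in range(1 << k):
        prob, state = pi[start], start
        for _ in range(k):
            nxt = (state << 1) & mask  # shift in a good (0) bit
            prob *= trans[(state, nxt)]
            state = nxt
        total += prob * tail
    return total
```

For $k = 1$ this reduces to the Gilbert expression, which gives a quick sanity check of the implementation.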
A.7.1.5 Goodput of a Constant-Complexity Channel Model
The constant-complexity model (CCM) aggregates states of the FSM chain as shown
in Figure 19. Recall that the CCM aggregates states of an FSM chain of arbitrary order to
a five state model. Specifically, FSM states $0$, $1$ and $2^{k-1}$ are kept in three isolated
states of the CCM. The remaining even FSM states are aggregated into one CCM state,
while the remaining odd FSM states are aggregated into another CCM state. Throughout
the following text, we refer to the five CCM states as $c_0$, $c_1$, $c_{2^{k-1}}$, $c_{\mathrm{even}}$ and $c_{\mathrm{odd}}$.
Note that at any time instance, if the process is in states $c_0$, $c_{2^{k-1}}$ or $c_{\mathrm{even}}$ then the last
received bit was error-free. Similarly, the CCM being in state $c_1$ or state $c_{\mathrm{odd}}$ implies
that the last received bit was corrupted. The probability of transiting from current CCM
state $c_i$ to CCM state $c_j$ is denoted by $p_{c_i,c_j}$, and $\pi_{c_i}$ represents the steady-state
probability of being in CCM state $c_i$.
To get a burst of $L$ error-free bits on a CCM-based channel, we have to consider that
the CCM can be in any of the five states when the burst starts. If the CCM is in state $c_0$
at the start of the burst then the probability that the following $L$ bits are error-free is
simply given by $\left(p_{0,0}\right)^{L}$. If the process is in state $c_1$, for the next bit to be error-free, the
CCM should transit to state $c_{even}$. This transition has to be followed by $k-3$ good bits,
i.e., $k-3$ transitions from $c_{even}$ to $c_{even}$. After that the CCM should transit to state
$c_{2^{k-1}}$ and then to state $c_0$. Once in state $c_0$, the process will continue being in that state
for the following $L-k$ transitions. Summarizing the above discussion gives the probability
of receiving an error-free packet starting in state $c_1$ as
$\pi_{c_1}\, p_{c_1,c_{even}} \left(p_{c_{even},c_{even}}\right)^{k-3} p_{c_{even},c_{2^{k-1}}}\, p_{c_{2^{k-1}},c_0} \left(p_{c_0,c_0}\right)^{L-k}$. Similar expressions
can be derived for the remaining CCM states. Now summing over all possible initial
states gives the complete expression for CCM goodput as
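$$
\begin{aligned}
\hat{\gamma}_{CCM} &= \Pr\{\text{error-free pkt} \mid CCM\} \\
&= \pi_{c_0}\left(p_{c_0,c_0}\right)^{L}
 + \pi_{c_1}\, p_{c_1,c_{even}} \left(p_{c_{even},c_{even}}\right)^{k-3} p_{c_{even},c_{2^{k-1}}}\, p_{c_{2^{k-1}},c_0} \left(p_{c_0,c_0}\right)^{L-k} \\
&\quad + \pi_{c_{odd}}\, p_{c_{odd},c_{even}} \left(p_{c_{even},c_{even}}\right)^{k-3} p_{c_{even},c_{2^{k-1}}}\, p_{c_{2^{k-1}},c_0} \left(p_{c_0,c_0}\right)^{L-k} \\
&\quad + \pi_{c_{even}} \left(p_{c_{even},c_{even}}\right)^{k-2} p_{c_{even},c_{2^{k-1}}}\, p_{c_{2^{k-1}},c_0} \left(p_{c_0,c_0}\right)^{L-k} \\
&\quad + \pi_{c_{2^{k-1}}}\, p_{c_{2^{k-1}},c_0} \left(p_{c_0,c_0}\right)^{L-1}.
\end{aligned}
\tag{A.46}
$$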
Similar to the FSM expression of (A.45), the above probability sums over all possible
CCM state paths of receiving an error-free packet of length L bits.
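As a rough illustration of how (A.46) could be evaluated, the sketch below transcribes the five starting-state terms as we read them from the derivation above. The dictionary keys ("0", "1", "top" for state 2^(k-1), "even", "odd") and the function name are hypothetical labels introduced here, and the sketch assumes k >= 3 and L >= k so that all exponents are non-negative.

```python
def ccm_goodput(pi, p, k, L):
    """Evaluate the five-term CCM goodput sum.

    pi : dict mapping a CCM state label to its steady-state probability
    p  : dict mapping (from_label, to_label) to a transition probability
    Labels (assumed): "0", "1", "top" (state 2^(k-1)), "even", "odd".
    """
    if k < 3 or L < k:
        raise ValueError("this sketch assumes k >= 3 and L >= k")
    p00 = p[("0", "0")]
    p_ee = p[("even", "even")]
    # Common tail: leave c_even via c_top, enter c0, then stay for L - k bits.
    tail = p[("even", "top")] * p[("top", "0")] * p00 ** (L - k)
    return (pi["0"] * p00 ** L
            + pi["1"] * p[("1", "even")] * p_ee ** (k - 3) * tail
            + pi["odd"] * p[("odd", "even")] * p_ee ** (k - 3) * tail
            + pi["even"] * p_ee ** (k - 2) * tail
            + pi["top"] * p[("top", "0")] * p00 ** (L - 1))
```

The cost of this evaluation does not depend on the number of FSM states, only on the five aggregated probabilities, which is the practical appeal of the CCM.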
A.7.1.6 Comparison of Estimated Goodputs
In this section, we compare the goodput estimates provided by the channel models
against the goodput computed from an actual trace. For comparison with a trace, we first
train all four models (BSC, GC, FSM and CCM) using that trace. We then plug the
trained parameters of these models into equations (A.43) to (A.46) in order to get
each model's goodput estimate for the trace.

[Figure 31: bar chart of goodput (%) at 2 Mbps and 5.5 Mbps for the actual traces, the binary-symmetric channel model, the 2-state Gilbert channel model, the 1024-state full-state Markov channel model, and the 5-state constant-complexity channel model.]
Figure 31. Comparison of the average goodput of the actual traces with the goodput
estimates provided by BSC, Gilbert, 1024-state Markov, and 5-state CCM models; each result is averaged over five traces.
Actual and estimated goodputs are compared in Figure 31. The results in Figure 31
are averaged over five traces at each physical layer data rate. The CCM is trained by
aggregating states of an order-20 FSM chain. A packet length of 100 bytes is used to
compute actual and estimated goodputs. At both data rates, the goodput estimates
provided by the binary-symmetric and Gilbert channels are highly pessimistic and
inaccurate: the goodputs estimated by the BSC and the GC are approximately 20% and
30%, respectively, while the actual goodput is approximately 97%. Since the Gilbert
channel incorporates one bit of memory, its goodput estimate is slightly better than
that of the memoryless binary-symmetric channel. However, both of these channel
models are too inaccurate to be used in any realistic measurement or analytical study.
The order-10 (1024-state) full-state Markov model provides very accurate goodput
estimates because it incorporates high-order channel memory. While being significantly
less complex than the FSM model, the CCM provides estimates that are even better than
those of the order-10 FSM model because the CCM is constructed by aggregating states
of an order-20 FSM chain.
A.7.2 Retransmissions of a Reliable Protocol
In this section, we show that the expected number of retransmissions per packet can
be modeled as a simple function of the goodput. We then compare the retransmission
estimates provided by the models under consideration.
A.7.2.1 Expected Retransmissions on a Wireless Channel
In this section, we quantify the expected number of retransmissions experienced by a
packet being transported by an abstract reliable protocol, such as the transmission
control protocol (TCP) [101] or the 802.11 MAC layer protocol [58]. We only focus on
the retransmission-due-to-channel-noise aspect of reliable protocols by employing the
following simple abstraction: keep retransmitting until the packet is received correctly.
We acknowledge that this abstraction is somewhat unrealistic because reliable protocols
generally stop retransmitting after a certain threshold. However, this abstraction allows us
to quantify the worst-case performance of the channel models under consideration. As in
the previous section, the abstract reliable protocol at the receiver drops all packets with
one or more bit-errors. We also carry over the assumption from the last section that only
the last transmission hop is a wireless link.
Let $X$ denote the random variable representing the total number of retransmissions
required to successfully transmit a packet under the abstract retransmission protocol. Due
to the present abstraction, $X$ can be modeled as a geometric random variable with
parameter $\gamma$, where $\gamma$ is defined in the last section as the true probability of a successful
packet on the wireless channel. More specifically, the probability that a packet will
experience $m$ retransmissions can be expressed as $\Pr\{X = m\} = \gamma\,(1-\gamma)^{m}$.
Consequently, the expected number of retransmissions $\beta$ is
$$\beta = E\{X\} = \frac{1}{\gamma} - 1. \tag{A.47}$$
As expected intuitively, the expected number of retransmissions is inversely related
to the probability of a good packet; an increase in the probability of a good packet $\gamma$ will
cause the $1/\gamma$ term to decrease.
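For completeness, the step from the geometric distribution to (A.47) can be written out using the standard series $\sum_{m \ge 0} m q^{m} = q/(1-q)^{2}$ with $q = 1-\gamma$. The numerical line uses the goodput of approximately 97% quoted earlier for the actual traces, purely as an illustration of the formula.

```latex
\begin{align*}
E\{X\} &= \sum_{m=0}^{\infty} m\,\gamma\,(1-\gamma)^{m}
        = \gamma \cdot \frac{1-\gamma}{\gamma^{2}}
        = \frac{1-\gamma}{\gamma}
        = \frac{1}{\gamma} - 1, \\
\gamma = 0.97 \;&\Rightarrow\; \beta = \frac{1}{0.97} - 1 \approx 0.031 .
\end{align*}
```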
Until this point, we have assumed that we accurately know the value of $\gamma$, the true
probability of a successful packet on the wireless channel. In wireless simulations, an
estimate $\hat{\gamma}$ of this parameter is provided by a wireless channel model. From the last
section, we know that equations (A.43) to (A.46) provide the $\hat{\gamma}$ estimates for the BSC,
GC, FSM and CCM channel models. Given the $\hat{\gamma}$ estimates, the estimated number of
retransmissions per packet can be computed as:
$$\hat{\beta} = \frac{1}{\hat{\gamma}} - 1. \tag{A.48}$$
Plugging in equations (A.43) to (A.46) renders each channel model’s estimate of
per-packet retransmissions.
A.7.2.2 Comparison of Estimated Retransmissions
To compute the average number of retransmissions per packet from an actual trace,
we divide the trace into 100-byte packets. Then, to emulate transmission of packet $i$, we
count the burst-length of corrupted packets including and following packet $i$. This
burst-length is the number of retransmissions that packet $i$ will experience. The
burst-lengths (retransmissions) of all the emulated packets are accumulated. Finally, the
accumulated retransmission count is normalized by the total number of error-free packet
transmissions.
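The trace-based procedure can be sketched as follows. This is one possible reading of the description, not the thesis code: it assumes the trace is a sequence of per-bit error flags, that each emulated packet occupies consecutive packet-sized slots until one slot is error-free, and that an unfinished burst at the end of the trace is discarded. The function name and the `packet_bits` parameter are hypothetical.

```python
def retransmissions_per_packet(bit_errors, packet_bits=800):
    """Average retransmissions per delivered packet in a bit-error trace.

    bit_errors  : sequence of 0/1 flags, 1 marking a corrupted bit
    packet_bits : packet size in bits (800 bits = 100 bytes)
    """
    n_slots = len(bit_errors) // packet_bits
    # A slot is corrupted if it contains one or more bit-errors.
    corrupted = [
        any(bit_errors[i * packet_bits:(i + 1) * packet_bits])
        for i in range(n_slots)
    ]
    delivered = 0
    retransmissions = 0
    burst = 0                      # corrupted slots seen for the current packet
    for bad in corrupted:
        if bad:
            burst += 1
        else:
            retransmissions += burst
            burst = 0
            delivered += 1
    if delivered == 0:
        raise ValueError("trace contains no error-free packet")
    return retransmissions / delivered
```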
As before, parameters of the channel models are derived from the traces against
which they are being compared. Note that the results of (A.48) are not computed by
taking the reciprocal of the averaged goodput results of Figure 31. Instead, the retransmission
estimates are computed by applying equation (A.48) to a model that is trained specifically
for a particular trace. Since equation (A.48) takes the reciprocal of $0 \le \hat{\gamma} \le 1$, a model
with low $\hat{\gamma}$ can render very high values of $\hat{\beta}$.
Figure 32 plots the average number of retransmissions per packet observed in an
actual trace compared against the retransmission estimates provided by the
binary-symmetric, Gilbert, full-state Markov and constant-complexity channel models. It can be
clearly seen in Figure 32 that the estimates provided by the BSC model are grossly
inaccurate. For instance, at 2 Mbps the BSC model estimates the expected number of
retransmissions per packet to be approximately 700 whereas the average number of
per-packet retransmissions observed in the actual traces is about 0.02. The highly inaccurate
retransmission estimates by the BSC are mostly due to receiver-4’s traces. The goodput
estimate of the BSC model for this trace is approximately 0.0003 at 2 Mbps. Putting this
value into equation (A.48) gives an extremely inaccurate estimate of more than 3000
retransmissions per packet. This simple result shows the scale of inaccuracy that is
incurred if channel memory is completely ignored during theoretical or experimental
verification of a wireless system.
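The sensitivity of the retransmission estimate to a small goodput estimate is easy to reproduce. The sketch below is illustrative only: the function names are ours, and the memoryless packet-success form (1 - eps)^L is the textbook expression for a binary-symmetric channel, assumed here to match the BSC goodput expression referred to earlier.

```python
def expected_retransmissions(gamma):
    """Expected retransmissions per packet, 1/gamma - 1, for 0 < gamma <= 1."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return 1.0 / gamma - 1.0


def bsc_goodput(eps, packet_bits):
    """Probability that all packet_bits bits survive a memoryless channel
    with bit-error rate eps (textbook BSC form, assumed here)."""
    return (1.0 - eps) ** packet_bits


# A goodput estimate of 0.0003 yields roughly 3332 retransmissions per packet,
# consistent with the "more than 3000" figure quoted in the text.
print(expected_retransmissions(0.0003))
```

Because the memoryless model spreads bit-errors evenly over all packets instead of concentrating them in bursts, even a modest average bit-error rate drives its packet-level goodput estimate, and hence its retransmission estimate, far from the measured values.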
[Figure 32: bar chart of retransmissions per packet (axis 0 to 800) at 2 Mbps and 5.5 Mbps for the actual traces, the binary-symmetric channel model, the 2-state Gilbert channel model, the 1024-state full-state Markov channel model, and the 5-state constant-complexity channel model.]
Figure 32. Comparison of the number of retransmissions per packet estimated by BSC, Gilbert, 1024-state Markov, and 5-state CCM models; each result is averaged over five
traces.
[Figure 33: bar chart of retransmissions per packet (axis 0 to 40) at 2 Mbps and 5.5 Mbps for the actual traces, the 2-state Gilbert channel model, the 1024-state full-state Markov channel model, and the 5-state constant-complexity channel model.]
Figure 33. Number of retransmissions per packet without the BSC model.
[Figure 34: bar chart of retransmissions per packet (axis 0 to 0.16) at 2 Mbps and 5.5 Mbps for the actual traces, the 1024-state full-state Markov channel model, and the 5-state constant-complexity channel model.]
Figure 34. Number of retransmissions per packet without the BSC model.
The estimates of the BSC model are so overwhelmingly inaccurate that the remaining
plots are not clearly visible in Figure 32. Therefore, in Figure 33 we plot the results
without the BSC model. From Figure 33, it can be seen that at 2 Mbps even the Gilbert
channel provides very inaccurate estimates of the expected number of retransmissions.
The GC estimate is closer to the actual traces at 5.5 Mbps, but is still significantly worse
than the FSM and CCM models. Figure 34 only shows the estimates by the 1024-state
FSM model and the CCM. Since these channel models incorporate high-order memory,
their estimates are extremely close to the retransmissions observed in the actual traces.
CHAPTER A.8 CONCLUSIONS AND FUTURE WORK
In this part of the thesis, we showed that 802.11b MAC layer bit-errors at 2 and 5.5
Mbps are Markovian, while bit-errors at 11 Mbps are long-range dependent. We
demonstrated that high-order full-state Markov (FSM) chains can model the bit-errors at
2 and 5.5 Mbps. A multifractal wavelet model (MWM) was used to characterize 11 Mbps
bit-errors. We mitigated the complexity of FSM chains by approximating FSM behavior
using a constant-complexity model which always comprised five states and was highly
accurate. We employed the proposed channel models to estimate corrupted packet
headers in an FEC-based wireless multimedia framework. This novel framework
provided significant improvements in bandwidth utilization and multimedia quality.
Finally, we highlighted some of the inaccuracies that are incurred by using inaccurate
models. These inaccuracies can be avoided by using the constant-complexity model
proposed in this thesis.
As future work, we will study the applicability of the proposed models on other
wireless channels. Another ongoing extension of this work is to incorporate the proposed
channel models into open-source network simulators, such as ns-2 [102] and Qualnet
[103]. We are also investigating alternative methods that can reduce the complexity of the
header estimation framework. Finally, we intend to extend analysis similar to Chapter
A.7 to other wireless protocols and systems so that we can quantify the inaccuracies that
are incurred by inaccurate channel models.
PART-A REFERENCES
[1] B. D. Fritchman, “A Binary Channel Characterization using Partitioned Markov Chains,” IEEE Transactions on Information Theory, vol. 13, pp. 221–227, April 1967.
[2] S. Tsai, “Markov Characterization of the HF Channel,” IEEE Transactions on Communications Technology, vol. 17, pp. 24–32, February 1969.
[3] H. O. Burton and D. Sullivan, “Errors and Error Control,” Proceedings of the IEEE, pp. 1293–1301, November 1972.
[4] H. A. Blank and P. J. Trafton, “A Markov Error Channel Model,” IEEE National Telecommunications Conference, December 1973.
[5] R. T. Chien, A. H. Haddad, B. Goldberg and E. Moyes, “An Analytic Error Model for Real Channels,” IEEE International Conference on Communications (ICC), June 1972.
[6] A. H. Haddad, S. Tsai, B. Goldberg, G. C. Ranieri, “Markov Gap Models for Real Communication Channels,” IEEE Transactions on Communications, vol. 23, no. 11, pp. 1189–1197, 1975.
[7] L. N. Kanal and A. R. K. Sastry, “Models for Channels with Memory and Applications to Error Control,” Proceedings of the IEEE, vol. 66, no. 7, pp. 724–744, 1978.
[8] M. Yajnik, S. Moon, J. Kurose, and Don Towsley, “Measurement and Modelling of the Temporal Dependence in Packet Loss,” IEEE Infocom, March 1999.
[9] M. Zorzi and R. R. Rao, “On Channel Modeling for Delay Analysis of Packet Communications over Wireless Links,” Allerton Conference on Communications, Control and Computing, September 1998.
[10] H. Balakrishnan and R. Katz, “Explicit Loss Notification and Wireless Web Performance,” IEEE Globecom, November 1998.
[11] M. Zorzi and R. R. Rao, “On the Statistics of Block Errors in Bursty Channels,” IEEE Transactions on Communications, vol. 45, no. 6, pp. 660–667, June 1997.
[12] R. R. Rao, “Higher Layer Perspectives on Modeling the Wireless Channel,” IEEE ITW, June 1998.
[13] G. T. Nguyen, R. Katz, and B. Noble, “A Trace-based Approach for Modeling Wireless Channel Behavior,” Winter Simulation Conference, December 1996.
[14] A. Konrad, B. Y. Zhao, A. D. Joseph, and R. Ludwig, “A Markov-based Channel Model Algorithm for Wireless Networks,” ACM Wireless Networks Journal (WINET), vol. 9, pp. 189 – 199, 2003.
[15] A. Konrad, B. Y. Zhao, A. D. Joseph, and R. Ludwig, “A Markov-based Channel Model Algorithm for Wireless Networks,” ACM Mobicom Workshop on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM), July 2001.
[16] P. Ji, B. Liu, D. Towsley, Z. Ge, and J. Kurose, “Modeling Frame-level Errors in GSM Wireless Channels,” Performance Evaluation Journal, vol. 55, no. 1-2, pp. 165–181, January 2004.
[17] P. Ji, B. Liu, D. Towsley, and J. Kurose, “Modeling Frame-level Errors in GSM Wireless Channels,” IEEE Globecom, November 2002.
[18] S. A. Khayam, S. Karande, H. Radha, and D. Loguinov, “Performance Analysis and Modeling of Errors and Losses over 802.11b LANs for High-Bitrate Real-time Multimedia,” Signal Processing: Image Communication, vol. 18, no. 7, pp. 575–595, August 2003.
[19] S. Karande, S. A. Khayam, M. Krappel, and H. Radha, “Analysis and Modeling of Errors at the 802.11b Link Layer,” IEEE International Conference on Multimedia and Expo (ICME), July 2003.
[20] S. A. Khayam and H. Radha, “Markov-based Modeling of Wireless Local Area Networks,” ACM Mobicom Workshop on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM), September 2003.
[21] S. A. Khayam, S. Aviyente, H. Radha, and J. R. Deller, Jr. “Markov and Multifractal Wavelet Models for Wireless MAC-to-MAC Channels,” Performance Evaluation, to appear.
[22] S. A. Khayam, S. Aviyente and H. Radha, “On Long-Range Dependence in High-Bitrate Wireless Residual Channels,” Conference on Information Sciences and Systems (CISS), March 2005.
[23] R. R. Rao, “Higher Layer Perspectives on Modeling the Wireless Channel,” IEEE ITW, June 1998.
[24] A. M. Chen and R. R. Rao, “Wireless Channel Models – Coping with Complexity,” Wireless Multimedia Network Technologies, Kluwer Academic Publishers, pp. 271–288, 1999.
[25] A. M. Chen and R. R. Rao, “On Tractable Wireless Channel Models,” IEEE PIMRC, September 1998.
[26] A. Willig, M. Kubisch, C. Hoene, and A. Wolisz, “Measurements of a Wireless Link in an Industrial Environment using an IEEE 802.11-Compliant Physical Layer,” IEEE Transactions on Industrial Electronics, vol. 49, no. 6, pp. 1265–1282, 2002.
[27] A. Willig, “A New Class of Packet- and Bit-Level Models for Wireless Channels,” IEEE PIMRC, October 2001.
[28] A. Köpke, A. Willig, and H. Carl, “Chaotic Maps as Parsimonious Bit Error Models of Wireless Channels,” IEEE Infocom, March 2003.
[29] S. A. Khayam and H. Radha, “Linear-Complexity Models for Wireless MAC-to-MAC Channels,” ACM Wireless Networks (WINET) Journal, vol. 11, no. 5, September 2005.
[30] S. A. Khayam and H. Radha, “Constant-Complexity Models for Wireless Channels,” IEEE Infocom, April 2006.
[31] R. Caceres and L. Iftode, “Improving the Performance of Reliable Transport Protocols in Mobile Computing Environments,” IEEE Journal on Selected Areas in Communications (JSAC), vol. 13, no. 5, 1995.
[32] A. Bakre and B. R. Badrinath, “I-TCP: Indirect TCP for Mobile Hosts,” IEEE ICDCS, May 1995.
[33] R. Yavatkar and N. Bhagwat, “Improving End-to-End Performance of TCP over Mobile Internetworks,” Workshop on Mobile Computing Systems and Applications, Dec. 1994.
[34] H. Balakrishnan, V. N. Padmanabhan, S. Seshan and R. H. Katz, “A Comparison of Mechanisms for Improving TCP Performance over Wireless Links,” IEEE/ACM Transactions on Networking, 1997.
[35] G. Holland and N. Vaidya, “Analysis of TCP Performance over Mobile Ad Hoc Networks,” ACM Wireless Networks (WINET), vol. 8, pp. 275–288, 2002.
[36] Z. Fu, X. Meng, and S. Lu, “How Bad TCP Can Perform In Mobile Ad Hoc Networks,” IEEE ISCC, 2002.
[37] K. Chandran, S. Raghunathan, S. Venkatesan, and R. Prakash, “A Feedback based Scheme for Improving TCP Performance in Ad-hoc Wireless Networks,” IEEE ICDCS, 1998.
[38] T. D. Dyer and R. V. Boppana, “A Comparison of TCP Performance over Three Routing Protocols for Mobile Ad Hoc Networks,” ACM MobiHoc, Oct. 2001.
[39] C. Parsa and J. J. Garcia-Luna-Aceves, “Improving TCP Performance over Wireless Networks at the Link Layer,” Mobile Networks and Applications, vol. 5, pp. 57–71, 2000.
[40] M. Gerla, K. Tang, and R. Bagrodia, “TCP Performance in Wireless Multihop Networks,” IEEE WMCSA, 1999.
[41] L-A. Larzon, M. Degermark, S. Pink, L-E. Jonsson, and G. Fairhurst, “The Lightweight User Datagram Protocol (UDP-Lite),” RFC 3828, July 2004.
[42] L-A. Larzon, M. Degermark, and S. Pink, “UDP-Lite for real time multimedia applications,” IEEE ICC, Jun. 1999.
[43] L-A. Larzon, M. Degermark, and S. Pink, “Efficient use of wireless bandwidth for multimedia applications,” IEEE MOMUC, Oct. 2000.
[44] L-A. Larzon, M. Degermark, and S. Pink, “The Lightweight User Datagram Protocol (UDP-Lite),” RFC 3828, Jul. 2004.
[45] H. Zheng and J. Boyce, “An Improved UDP Protocol for Video Transmission Over Internet-to-Wireless Networks,” IEEE Transactions on Multimedia, vol. 3, no. 3, pp. 356--365, September 2001.
[46] H. Zheng, “Optimizing Wireless Multimedia Transmissions through Cross Layer Design,” IEEE International Conference on Multimedia and Expo (ICME), July 2003.
[47] A. Singh, A. Konrad, and A. D. Joseph, “Performance evaluation of UDP-Lite for cellular video,” ACM NOSSDAV, 2001.
[48] A. Servetti and J. C. De Martin, “Error tolerant MAC extension for speech communications over 802.11 WLANs,” IEEE VTC, 2005.
[49] C. H. Shih, Y. M. Tou, and C. K. Shieh, “A self-regulated redundancy control scheme for wireless video transmission,” IEEE WirelessCom, 2005.
[50] E. Masala, M. Bottero, and J. C. De Martin, “MAC-level partial checksum for H.264 video transmission over 802.11 ad hoc wireless networks,” IEEE VTC, 2005.
[51] S. A. Khayam, S. Karande, M. Krappel, and H. Radha, “Cross-Layer Protocol Design for Real-time Multimedia Applications over 802.11b Networks,” IEEE International Conference on Multimedia and Expo (ICME), July 2003.
[52] Z. Ye, S. V. Krishnamurthy, and S. K. Tripathi, “A Framework for Reliable Routing in Mobile Ad Hoc Networks,” IEEE Infocom, 2003.
[53] J. Tang, G. Xue, and W. Zhang, “Reliable Routing in Mobile Ad Hoc Networks Based on Mobility Prediction,” IEEE MASS, Oct. 2004.
[54] S. Mueller, R. P. Tsang, and D. Ghosal, “Multipath Routing in Mobile Ad Hoc Networks: Issues and Challenges,” Lecture Notes in Computer Science, 2004.
[55] P. Papadimitratos, Z. J. Haas, and E. G. Sirer, “Path Set Selection in Mobile Ad Hoc Networks,” ACM MobiHoc, Jun. 2002.
[56] F. Zhai, Y. Eisenberg, T. N. Pappas, R. Berry, and A. K. Katsaggelos, “Rate-Distortion Optimized Product Code Forward Error Correction for Video Transmission over IP-based Wireless Networks,” International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2004.
[57] ISO/IEC 8802-11:1999(E), “Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications,” August 1999.
[58] IEEE Std 802.11b-1999, “Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Higher-Speed Physical Layer Extension in the 2.4 GHz band,” September 1999.
[59] J. G. Kemeny and J. L. Snell, Finite Markov Chains, Springer-Verlag: New York, 1976.
[60] D. Cox, “Long-Range Dependence: A Review,” Statistics: An Appraisal, pp. 55 – 74, 1984.
[61] R. Riedi, M. Crouse, V. Ribeiro, and R. Baraniuk, “A Multifractal Wavelet Model with Application to Network Traffic,” IEEE Transactions on Information Theory, 45(3), pp. 992–1018, 1999.
[62] P. Abry, R. Baraniuk, P. Flandrin, R. Riedi, and D. Veitch, “Multiscale Nature of Network Traffic,” IEEE Signal Processing Magazine, 19(3), pp. 28–46, May 2002.
[63] V. Ribeiro, R. Riedi, and R. Baraniuk, “Wavelets and Multifractals for Network Traffic Modeling and Inference,” IEEE ICASSP, May 2001.
[64] J. Postel, “User datagram protocol,” RFC 768, Aug. 1980.
[65] P. Brockwell and R. Davis, Introduction to Time Series and Forecasting, Springer-Verlag, 1996.
[66] N. Merhav, M. Gutman, and J. Ziv, “On the Estimation of the Order of a Markov Chain and Universal Data Compression,” IEEE Transactions on Information Theory, vol. 35, pp. 1014–1019, September 1989.
[67] M. J. Weinberger, J. J. Rissanen, and M. Feder, “A Universal Finite Memory Source,” IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 643–652, 1995.
[68] W. Willinger, V. Paxson, R. H. Riedi, and M. S. Taqqu, “Long-Range Dependence and Data Network Traffic,” Long Range Dependence: Theory and Applications, Birkhäuser, pp 373- 407, 2002.
[69] P. Abry, P. Flandrin, M. Taqqu, and D. Veitch, “Wavelets for the Analysis, Estimation and Synthesis of Scaling Data,” Self Similar Network Traffic Analysis and Performance Evaluation, Wiley, 2000.
[70] R. J. Adler, R. E. Feldman, and M. Taqqu, A Practical Guide to Heavy Tails, Birkhäuser, 1998.
[71] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, Dover, pp. 564- 565, 1972.
[72] Homepage of linux-wlan-ng device drivers, http://www.linux-wlan.org.
[73] Multifractal Wavelet Model Toolbox, http://www-dsp.rice.edu/software/mwm.shtml.
[74] V. Teverovsky and M. Taqqu, “Testing for Long-range Dependence in the Presence of Shifting Mean or a Slowly Declining Trend using a Variance-type Estimator,” Journal of Time Series Analysis, vol. 18, no. 3, pp. 279–304, May 1997.
[75] L. R. Rabiner, “A Tutorial on Hidden Markov Models and its Applications,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257- 286, February 1989.
[76] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum-Likelihood from Incomplete Data via the EM Algorithm,” Journal of Royal Statistics Society Series, vol. 39, 1977.
[77] L. E. Baum, “An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of Markov Processes,” Inequalities, vol. 3, no. 1, pp. 1–8, 1972.
[78] T. Cover and J. Thomas, Elements of Information Theory, Wiley: New York, 1991.
[79] H. L. Van Trees, Detection, Estimation, and Modulation Theory, Part I, Wiley: New York, 2001.
[80] R. E. Blahut, Theory and Practice of Error Control Codes, Addison-Wesley, May 1984.
[81] E. N. Gilbert, “Capacity of a Burst Noise Channel,” Bell. Sys. Tech. Journal, vol. 39, pp. 1253–1265, September 1960.
[82] M. Mushkin and I. Bar-David, “Capacity and coding for the Gilbert-Elliot channels,” IEEE Transactions on Information Theory, vol. 35, no. 6, pp. 1277- 1290, November 1989.
[83] ISO/IEC JTC 1/SC29/WG11 and ITU-T SG16 Q.6, “Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264|ISO/IEC 14496-10 AVC),” Mar. 2003.
[84] ISO/IEC JTC 1/SC29/WG11, “Text of ISO/IEC 14496-2:2001 (Unifying N2502, N3301, N3056, and N3664),” Doc. N4350, July 2001.
[85] H.264/AVC Software Coordination webpage, http://iphome.hhi.de/suehring/tml.
[86] A. Natu and D. Taubman, “Unequal protection of JPEG2000 code-streams in wireless channels,” IEEE Globecom, Nov. 2002.
[87] M. Grangetto, E. Magli, G. Olmo, “Reliable JPEG 2000 wireless imaging by means of error-correcting coder,” IEEE ICME, June 2004.
[88] D. Krishnaswamy and S. Kalluri, “Multi-level weighted combining of retransmitted vectors in wireless communications,” IEEE VTC, 2006.
[89] C. E. Koksal and H. Balakrishnan, “Quality-aware routing metrics for time-varying wireless mesh networks,” IEEE Journal on Selected Areas in Communications (JSAC), to appear.
[90] J. Farber and K. Zeger, “Optimality of the natural binary code for quantizers with channel optimized decoders,” IEEE ISIT, July 2003.
[91] W. S. Lee, M. R. Pickering, M. R. Frater, and J. F. Arnold, “Error resilience in video and multiplexing layers for very low bit-rate video coding systems,” IEEE Journal on Selected Areas in Communications (JSAC), vol. 15, no. 9, pp. 1764- 1774, 1997.
[92] X. Luo and G. B. Giannakis, “Energy-constrained optimal quantization for wireless sensor networks,” IEEE SECON, Oct. 2004.
[93] M. U. Ilyas and H. Radha, “End-to-end channel capacity of a wireless sensor network under reachback,” CISS, Mar. 2006.
[94] M. Godavarti and A. O. Hero III, “Diversity and degrees of freedom in wireless communications,” IEEE ICASSP, May 2001.
[95] H. Dong, I. D. Chakares, A. Gersho, E. Belding-Royer, and J. D. Gibson, “Selective bit-error checking at the MAC layer for voice over mobile ad hoc networks with IEEE 802.11,” IEEE WCNC, Mar. 2004.
[96] L. Bononi, M. Conti, and E. Gregori, “Runtime Optimization of IEEE 802.11 Wireless LANs Performance,” IEEE Transactions on Parallel and Distributed Computing, vol. 15, no. 1, 2004.
[97] C-F. Chiasserini and E. Magli, “Energy-Efficient Coding and Error Control for Wireless Video-Surveillance Networks,” Telecommunication Systems, vol. 26, no. 2, pp. 369- 387, 2004.
[98] Z-H. Tan, P. Dalsgaard, and B. Lindberg, “A subvector-based error concealment algorithm for speech recognition over mobile networks,” IEEE ICASSP, May 2004.
[99] W. S. Lee, M. R. Frater, M. R. Pickering, and J. F. Arnold, “A robust codec for transmission of very low bit-rate video over channels with bursty errors,” IEEE Transactions on Circuits and Systems for Video Technology (CSVT), vol. 10, no. 8, pp. 1403- 1412, December 2000.
[100] L. Zhong, F. Alajaji, and G. Takahara, “A queue-based model for wireless Rayleigh fading channels with memory,” IEEE VTC, Sep. 2005.
[101] J. Postel, “Transmission control protocol,” RFC 793, Sep. 1981.
[102] Homepage of the ns-2 network simulator, http://www.isi.edu/nsnam/ns/.
[103] Homepage of the Qualnet network simulator, http://www.scalable-networks.com/.
PART B
SELF-PROPAGATING MALWARE DETECTION AT NETWORK ENDPOINTS
USING INFORMATION-THEORETIC TOOLS
CHAPTER B.1 INTRODUCTION
A recent and dramatic increase in automated network intrusions has necessitated
defense mechanisms that can curb the spread of self-propagating malicious software³
(malware) in real-time. Moreover, rapid evolution and mutation of malware stipulate
detection of novel (i.e., previously unknown) attacks with few, if any, assumptions about
the attack strategy. To that end, network-based anomaly detectors attempt to flag
behavior that is anomalous or abnormal for a networked entity or a user [1]–[29]. The
challenge of anomaly detection systems is the characterization of benign behavior. Most
of the contemporary anomaly detectors are either (a) network-based systems that detect
anomalies by observing unusual network traffic patterns [2]–[24] or (b) host-based
systems that detect anomalies by monitoring an endpoint’s operating system (OS)
behavior, for instance by tracking OS audit logs, processes, command-lines or keystrokes
[25]–[29]. Contemporary anomaly detectors tend to be computationally complex with
high false alarm rates and slow response times [30]–[32].
Since network endpoints⁴ are serving as extremely potent and viable launch pads and
carriers for malware infections [34], [35], it is important that real-time and effective
defenses be developed specifically for network endpoints. Recently, there has been some
interest in network- and host-based malware detection at endpoints [25]–[29], [19]–[21]
or at servers close to the endpoints [22]–[24]. Most of these studies leverage some
characteristics of past malware for endpoint-based malware detection. While these
malware characteristics hold true for some of the contemporary malware, their validity
and efficacy are currently being questioned [37]–[42]. Consequently, there is a growing
interest in developing behavioral signatures of benign/legitimate behavior [1]. Once a
robust behavioral model is in place, malicious activity can be detected using deviations
from benign behavior rather than relying on prior experiences of malicious activity.

³ Due to the present focus on detection of self-propagating malicious software, throughout this thesis the term malware corresponds to self-propagating malware.
⁴ “An endpoint is an individual computer system or device that acts as a network client and serves as a workstation or personal computing device.” [33].
The objective of behavioral anomaly detection is the characterization of an endpoint’s
benign behavior. Naturally, it is desirable to identify behavioral features that will get
perturbed if the endpoint is compromised by any (past, present, or future) self-
propagating malware. This work identifies such behavioral features using information-
theoretic tools and leverages these features for real-time malware detection at network
endpoints.
B.1.1 Overview of Contributions
To obtain benign behavioral profiles of end-users, we have spent 12 months
collecting traffic statistics of a diverse set of endpoints in home, office, and university
settings. An endpoint’s traffic profile contains information about session-level network
activity, such as one-way hashed source and destination IP addresses, session direction
(incoming or outgoing), source and destination ports, timestamps, and keystrokes that are
used to initiate sessions. For malicious activity, we use a diverse set of real and simulated
worms. These worms vary in their propagation rates and scanning techniques. We
evaluate the benign data profiles for behavioral features that get perturbed when the
endpoint is compromised by a self-propagating malware. Based on the identified features,
we propose three malware detection techniques.
The first malware detection technique proposed in this thesis is truly network-based
since it only uses traffic features for malware detection. This technique relies on the
premise that the vulnerabilities targeted by any malware are associated with a small
number of source or destination ports. Thus, on a compromised machine, the distribution
of source or destination ports on which a host communicates should be perturbed after
infection. Information-theoretic measures can quantify such perturbations in port
distributions. We first evaluate whether entropy of port distributions can be used to detect
worms. We observe that in many cases entropy cannot identify malware-related port
perturbations because it captures the variance of a distribution rather than the frequencies
of individual ports. As an alternative technique, we propose the use of the Kullback-
Leibler (K-L) divergence measure [43] to characterize perturbations in source and
destination port profiles as a means of detecting attempts of malware propagation. In our
framework, we record a small version of each endpoint’s benign traffic profile and
continuously compare it using the K-L divergence measure to the port histograms
observed in the last window of t time-units. Our results with the collected benign and
malware data show that K-L divergence of port histograms is perturbed significantly on
compromised endpoints, which allows very accurate detection of malicious activity by
simply observing each host’s traffic. We also experiment with three other information
divergence measures, namely the Jensen-Shannon (J-S) divergence, the K-divergence
and the resistor-average (R-A) divergence [44], [45]. However, these divergence
measures do not provide any substantial improvements over the K-L divergence.
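As an illustration of the comparison described above, the following minimal sketch computes the K-L divergence between a stored benign port histogram and the histogram of the most recent window. The function names, the additive smoothing constant, and the direction of the divergence (window relative to benign) are assumptions made for this sketch; they are not taken from the detector described in this thesis.

```python
import math
from collections import Counter

def port_distribution(ports, support, eps=1e-6):
    """Normalized port histogram over `support`.

    A small constant `eps` (an assumption of this sketch) is added to every
    bin so that the divergence stays finite when a port is unseen.
    """
    counts = Counter(ports)
    total = sum(counts[p] for p in support) + eps * len(support)
    return {p: (counts[p] + eps) / total for p in support}

def kl_divergence(p, q):
    """D(p || q) in bits for two distributions defined on the same keys."""
    return sum(p[k] * math.log2(p[k] / q[k]) for k in p)

def window_divergence(benign_ports, window_ports):
    """K-L divergence of the current window's ports from the benign profile."""
    support = sorted(set(benign_ports) | set(window_ports))
    benign = port_distribution(benign_ports, support)
    current = port_distribution(window_ports, support)
    return kl_divergence(current, benign)

# Hypothetical destination ports: mostly web traffic in the benign profile,
# and a burst of scans to port 445 in the infected window.
benign_profile = [80] * 90 + [443] * 10
infected_window = [80] * 9 + [443] * 1 + [445] * 90
```

With these hypothetical inputs, the divergence of a window that matches the benign profile is zero, while the window dominated by a previously unused port yields a large value.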
We use a very small subset of K-L divergences derived from the normal user profiles
and the malware data to train support vector machines (SVMs) [46] for each endpoint.
The trained SVMs are then tested using all other malware which are embedded at
multiple random instances in the normal profiles. For all our experimental evaluations,
we observe almost 100% detection accuracy and negligible false alarm rates. We
compare the performance of our proposed malware detector with two existing anomaly
detectors, namely the maximum-entropy detector [14] and the rate-limiting detector [20].
We show that the proposed K-L/SVM-based detector provides consistently and
substantially better performance than the techniques of [14] and [20].
The remaining two malware detection techniques proposed in this thesis are joint
network-host anomaly detectors which exploit the observation that when a user is
actively using his/her computer most of the benign traffic is triggered by a small subset of
keystrokes and mouse clicks. Based on this observation, we propose to correlate the last
input from the keyboard or mouse hardware buffer with every new network session. We
use marginal keystroke data to show that the session initiation keys are not necessarily
used as frequently by an end-user. To effectively exploit the session-keystroke correlation
in a real-time and automated fashion, we propose two information-theoretic measures,
namely keystrokes’ entropy and session-keystroke mutual information [43].
We compute the keystrokes’ entropy and mutual information on a window-by-
window basis. We observe that the entropy is consistently low and mutual information is
somewhat high in the time windows containing benign data. However, once malicious
traffic with a marginal keystroke distribution is inserted into the benign profile, there is a
significant increase in the entropy and simultaneously there is a decrease in the mutual
information. These entropy and mutual information perturbations arise because
many keys that are generally used very frequently by the users are never used to
initiate legitimate network activity. For a user who is active on his/her endpoint, the
malicious network sessions that are not initiated by the user are logged with unlikely and
diverse keystrokes thereby changing the keystrokes’ distribution.
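The two window-level measures can be sketched as follows. This is a minimal illustration that assumes each window is given as a list of (session feature, key code) pairs; using the destination port as the session feature is an assumption of this sketch, as are the function names.

```python
import math
from collections import Counter

def entropy(symbols):
    """Sample entropy (in bits) of a list of symbols, e.g. key codes."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mutual_information(pairs):
    """Sample mutual information (in bits) between the two pair components."""
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
    return sum(c / n * math.log2(c * n / (left[x] * right[y]))
               for (x, y), c in joint.items())

# Hypothetical window: every session is initiated with the same key code.
benign_keys = [13] * 20
# Hypothetical window after sessions logged with diverse keys are mixed in.
mixed_keys = benign_keys + list(range(65, 85))
```

In this toy example the key entropy rises from zero to several bits once the diverse keys are mixed in, which is the kind of perturbation the entropy detector looks for.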
To create an automated detection tool based on the keystroke distributions, we use a
small subset of the benign profiles to generate the joint and marginal distributions of
keystrokes and network sessions. Based on the statistics of these distributions, we
develop entropy and mutual information thresholds: an alarm is raised when the entropy rises above its threshold or the mutual information falls below its threshold. For
both entropy and mutual information based detectors, we observe almost 100% detection
accuracy and very low false-alarm rates. Overall the mutual information detector has
lower false alarm rates than the entropy detector. Nevertheless, both detectors provide
significantly better performance than the existing maximum-entropy and rate-limiting
detectors.
B.1.2 Organization of this Part
The rest of this part is structured as follows. Chapter B.2 describes related work in
this area. Chapter B.3 provides brief background on self-propagating malware and
support vector machines. Chapter B.4 details the benign endpoint profiles and malware
collected/simulated for this study. Chapter B.5 proposes a network-based information-
theoretic technique which detects malware by leveraging the K-L divergence between
benign and real-time traffic features in an SVM framework. Chapter B.6 presents two
other techniques which respectively employ entropy and mutual information of
keystrokes and network sessions to detect malware. Chapter B.7 identifies possible
attacks on the proposed malware detectors and discusses defenses against these attacks.
Chapter B.8 summarizes key conclusions of this part and outlines our future research
directions.
CHAPTER B.2 RELATED WORK
Most of the contemporary studies perform network-based anomaly detection at the
enterprise network perimeter or the local network perimeter. Zou et al. [2] propose a
malware warning center (MWC) and distributed ingress and egress sensors at a local
network’s perimeter. Similarly, Wu et al. [3] propose a network architecture and a
distributed algorithm to detect multi-vector worms. Schechter et al. [4] use a combination
of rate limiting and portscan detection in a local network worm detector. Jung et al. [5]
develop a network-level fast portscan detector that uses a threshold random walk (TRW)
on typical access patterns to infer whether a host is malicious or benign. Weaver et al. [6]
simplify the TRW algorithm to make it more amenable to hardware and software
implementations. The simplified algorithm of [6] can accurately detect very low rate
worms. Soule et al. [11] apply a Kalman filter to normal traffic and then use multiple
anomaly detection techniques to detect abnormal behavior. Kim et al. [12] propose that
gateway routers score each packet based on its legitimacy. Similarly, anomaly detectors
that monitor blocks of unused IP addresses are also becoming increasingly popular
[15]–[18].
There has been some recent interest in detecting malware at servers near the
endpoints. Whyte et al. [22] detect worms by monitoring (at the gateway router)
connections that are not preceded by a DNS address resolution request. Gupta and Sekar
[23] detect changes in traffic volume at a mail server to detect mass mailing worms.
Xiong [24] traces attachments at mail servers to detect mass mailing worms.
Barford et al. [9] use time-frequency signal analysis to develop a change detection
algorithm. Krishnamurthy et al. [10] propose a sketch-based change detection algorithm.
Lakhina et al. [7], [8] propose a subspace method to detect and characterize network-
wide volumetric traffic anomalies. The authors then extend their work in [13] and use
entropy to detect anomalies. Another recent study by Gu et al. [14] uses maximum-
entropy estimation to quantify a baseline distribution at a network gateway or router,
which is in turn used to classify anomalous activity using the K-L divergence.
The most commonly used endpoint-based network-level malware detection technique
is rate limiting. This technique proposed by Twycross and Williamson [19], [20] limits
the rate of an endpoint’s network traffic to curb and detect malware propagation. Sellke
et al. [21] extend rate limiting by proposing a branching worm propagation model and in
turn using this model to develop a window-based rate limiting mechanism.
Wong et al. [37], [38] show that rate limiting is not very effective at endpoints or
at the local network perimeter, but can provide effective malware throttling if deployed on
backbone routers. Panjwani et al. [42] evaluated whether portscans are precursors to
malicious attacks. It was concluded in [42] that over 50% of attacks are not preceded by a
portscan and, therefore, “port scans should not be considered as precursors to an attack.”
Moreover, Li et al. [39] show that statistical filtering-based defense mechanisms are
effective when they are adapted in accordance with an attack. In [39] it is also shown that
the performance of a statistical filter degrades significantly if the attacker is more
adaptive than the filter.
In the host-based anomaly detection context, most of the existing detectors
characterize benign user behavior by modeling commands given by a user in a textual OS
environment [26]–[29]. Due to the high market penetration of graphical operating
systems, it is important to model graphical behavioral features of end-users.
A recent technique called BINDER [25] correlates keystrokes with OS processes and
raises an alarm whenever a process is initiated without an end-user’s input. There are
important differences between BINDER and the detector proposed in this thesis. First,
BINDER is purely host-based and does not employ any network session information.
Second, BINDER cannot detect memory-resident malicious codes because its detector is
invoked only when a new process is created. (There have been many well-known worms
that were memory-resident; the two most famous examples are CodeRed II and Witty.)
Since our technique uses both network and host information, it can detect memory-
resident malware. Lastly, BINDER requires a whitelist of legitimate applications before
deployment. The detector proposed in this thesis can be deployed out-of-the-box after
which all training is done online.
CHAPTER B.3 BACKGROUND
In this section, we provide background material which is required to understand the
contributions of this part.
B.3.1 Self-Propagating Malware
Self-propagating malware is a recent term that is used to refer to a malicious code that
has the ability to spread from one compromised computer to another computer without
any human intervention. These malicious codes generally target vulnerabilities in
background processes or services that are continuously running on vulnerable hosts. After
compromising a vulnerable computer, a self-propagating malware tries to locate and
infect other vulnerable hosts on the network. The process of locating vulnerable hosts is
called scanning. Over the last few years, malware have evolved to use very sophisticated
scanning and infection techniques [41].
There are two prevalent types of self-propagating malware:
• Worms: A worm is a standalone malicious code that propagates copies of itself to
vulnerable computers;
• Bots: A bot is a malware which after infecting a computer contacts a central
command and control server. This server in turn makes the compromised computer
part of a bot network (botnet) of compromised computers. These botnets are
subsequently remote controlled by the central server.
After compromising vulnerable computers, malware can use these computers to
launch distributed denial of service (DDoS) attacks, relay spam or steal personal
information.
B.3.2 Support Vector Machines
Given training vectors $\mathbf{x}_i \in \mathbb{R}^n$, $i = 1, 2, \ldots, l$, in two classes, and a vector $\mathbf{y} \in \mathbb{R}^l$
such that each $y_i \in \{+1, -1\}$, a C-SVM for non-separable data considers the following
primal optimization problem [46]:
$$\min \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_{i=1}^{l} \alpha_i y_i \left( \mathbf{w}^T K(\mathbf{s}_i, \mathbf{x}) + b \right)$$
such that derivatives of the objective function vanish with respect to $\alpha_i$ and subject to the
constraint that $\alpha_i \geq 0$, $i = 1, 2, \ldots, l$. In the objective function, $\mathbf{w}$ is a perpendicular to the
hyperplane that separates the positive and negative points, $C$ is a parameter that is used
to cost the $\alpha_i$'s, $K(\mathbf{s}_i, \mathbf{x})$ is a non-linear kernel that maps the input data to another
(possibly infinite dimensional) Euclidean space, and the $\mathbf{s}_i$'s are points called the support
vectors that maximize the separation between the positive and negative examples. We use
a degree-3 radial basis kernel function to train the C-SVM.
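To make the role of the support vectors and the kernel concrete, the sketch below evaluates the standard kernel-SVM decision rule, sign(sum_i alpha_i y_i K(s_i, x) + b), with a Gaussian radial basis kernel. The support vectors, multipliers, bias, and the kernel width `gamma` are hypothetical values chosen for illustration; in practice they are produced by training, and this sketch does not reproduce the training configuration used in this thesis.

```python
import math

def rbf_kernel(s, x, gamma=0.5):
    """Gaussian radial basis kernel exp(-gamma * ||s - x||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(s, x)))

def svm_decide(x, support_vectors, alphas, labels, b, gamma=0.5):
    """Return +1 or -1 from the sign of sum_i alpha_i * y_i * K(s_i, x) + b."""
    score = b + sum(a * y * rbf_kernel(s, x, gamma)
                    for s, a, y in zip(support_vectors, alphas, labels))
    return 1 if score >= 0 else -1

# Hypothetical one-dimensional feature (e.g. a divergence value): a small
# value is labeled +1 (benign) and a large value is labeled -1 (malware).
support_vectors = [(0.1,), (5.0,)]
alphas = [1.0, 1.0]
labels = [+1, -1]
bias = 0.0
```

A test point near the first support vector is assigned the label +1 and a point near the second is assigned -1, because the kernel value decays quickly with distance.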
CHAPTER B.4 DATA COLLECTION AND SIMULATION
In this section, we explain the two main datasets collected for this study. The first
dataset comprises benign traffic and keystroke profiles collected from several hosts with
regular human users. The second dataset comprises real and simulated malware traffic.
Since university policy and user reservations prohibited us from infecting operational
endpoints with malware, we first identify network- and host-based features perturbed by
the introduction of malicious code into each system and then perform offline analysis by
inserting malicious traffic at random instances in the endpoints’ benign traffic profiles.
B.4.1 Benign Traffic-Keystroke Profiles
Our first step towards the development of a network-based malware detector was to
collect pertinent network and OS-based data. We started by investing up to 12 months in
monitoring network/OS profiles of a diverse set of 13 endpoints. Users of these
endpoints included home users, research students, and technical/administrative staff with
Windows 2000/XP laptop and desktop computers. The laptop endpoints were used by
their users both at home and at work. Some endpoints, in particular home computers,
were shared among multiple users. The endpoints used in this study were running
different types of applications, including peer-to-peer file sharing software, online
multimedia applications, network games, SQL/SAS clients etc.
Data were collected by a multi-threaded Windows application called argus, which
runs as a background process storing network and keystroke activity in a log file. The log
file is periodically and securely uploaded to a secure copy (SCP) server. argus only
logs session-level information where a session corresponds to bidirectional
communication between two IP addresses. Communication between the same pair of IP addresses
on different ports is considered part of the same network session. This session-level
granularity reduces the complexity of the malware detector, while providing complete
information about sessions originating from or terminating at an endpoint. Each session is
logged using the information contained in the first packet of the session. A session
expires if it does not send/receive a packet for more than τ seconds. In the collected data,
τ is set to 10 minutes.
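The session rule above can be sketched in a few lines. This is an illustrative reconstruction and not the argus implementation: the packet representation (a dictionary with `time` in seconds and `remote_ip`) and the function name are assumptions.

```python
TAU_SECONDS = 600  # session timeout of 10 minutes, as in the collected data

def sessionize(packets, tau=TAU_SECONDS):
    """Return the first packet of every session.

    A session is keyed by the remote IP address only, so packets on different
    ports to the same address extend the same session. A new session starts
    when the address has been idle for more than `tau` seconds.
    """
    last_seen = {}
    first_packets = []
    for pkt in sorted(packets, key=lambda p: p["time"]):
        ip = pkt["remote_ip"]
        if ip not in last_seen or pkt["time"] - last_seen[ip] > tau:
            first_packets.append(pkt)  # logged using the first packet only
        last_seen[ip] = pkt["time"]
    return first_packets

# Hypothetical packet trace (times in seconds).
trace = [
    {"time": 0, "remote_ip": "A", "dst_port": 80},
    {"time": 50, "remote_ip": "B", "dst_port": 25},
    {"time": 100, "remote_ip": "A", "dst_port": 443},  # same session as t=0
    {"time": 800, "remote_ip": "A", "dst_port": 80},   # idle > tau: new session
]
```

For this hypothetical trace, three sessions are logged: two with address A (at 0 s and 800 s) and one with address B.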
For each logged session, argus also logs the last keystroke or mouse click that was
pressed before the first packet of the session. We generically refer to keyboard and mouse
inputs as keystrokes or keys in this thesis. The last keystroke is associated with a session
only if the key was pressed no more than λ seconds before the session. If there was no
key pressed in the last λ seconds before a session then a void keystroke value of zero is
inserted. In the collected traces, λ is set to 10 seconds. Throughout this thesis, we only
focus on sessions with non-zero keys. We assume that the last pressed key has initiated
the associated session, that is, an inherent correlation relationship is assumed between the
last key and the consequent session. Clearly, this correlation will not be present when a
malicious code is trying to propagate from an oblivious end-user’s computer, and hence
perturbations in the session-keystroke correlation can be leveraged at that point to detect
the malicious code.
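The association between a session and its last keystroke can be sketched as below. The sorted list representation of key events and the function name are assumptions of this sketch; the 10-second window and the void value of zero follow the description above.

```python
import bisect

LAMBDA_SECONDS = 10  # maximum gap between key press and session start
VOID_KEY = 0         # logged when no key was pressed within the window

def last_key_before(session_time, key_times, key_codes, lam=LAMBDA_SECONDS):
    """Return the code of the last key pressed within `lam` seconds before
    `session_time`, or VOID_KEY if there is none.

    `key_times` must be sorted in increasing order and aligned with
    `key_codes`.
    """
    idx = bisect.bisect_right(key_times, session_time) - 1
    if idx >= 0 and session_time - key_times[idx] <= lam:
        return key_codes[idx]
    return VOID_KEY

# Hypothetical key events: times in seconds and virtual key codes.
key_times = [1.0, 5.0, 30.0]
key_codes = [13, 32, 65]
```

A session starting at 6.0 s is associated with key code 32, while a session starting at 20.0 s receives the void value because the most recent key press is 15 seconds old.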
Each entry of the log file has the following seven fields:
<session id, direction, protocol, src port, dst port,
timestamp, virtual key code>,
whose explanation is given below:
• session id: 20-byte SHA-1 hash [47] of the concatenated hostname and remote IP
address. Hashing preserves privacy, which is important because the collected data are
going to be publicly available;
• direction: one byte flag indicating outgoing unicast, incoming unicast, outgoing
broadcast, or incoming broadcast packets;
• protocol: transport-layer protocol (i.e., TCP or UDP) of the packet;
• src port: source port of the packet;
• dst port: destination port of the packet;
• timestamp: millisecond-resolution time of session initiation;
• virtual key code: one byte virtual key code, as defined by Microsoft’s MSDN
library [48], of the last (keyboard or mouse) keystroke that was pressed before the
session. In view of our stringent privacy considerations, we only log the very last
keystroke that was pressed right before the first packet of a new session. Throughout
this thesis, we refer to this jointly collected session and keystroke data as session-key
or key-session data. Moreover, keystrokes observed in this joint profile are referred to
as the session initiation keys.
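The seven-field log entry can be represented as a small record type. The sketch below is illustrative only: the Python field names, the integer encoding of the direction flag, and the UTF-8 encoding of the hashed string are assumptions, while the 20-byte SHA-1 digest of the concatenated hostname and remote IP address follows the description above.

```python
import hashlib
from dataclasses import dataclass

def make_session_id(hostname: str, remote_ip: str) -> bytes:
    """20-byte SHA-1 digest of the concatenated hostname and remote IP."""
    return hashlib.sha1((hostname + remote_ip).encode("utf-8")).digest()

@dataclass(frozen=True)
class LogEntry:
    session_id: bytes       # SHA-1 digest, 20 bytes
    direction: int          # one-byte flag; the value encoding is assumed
    protocol: str           # "TCP" or "UDP"
    src_port: int
    dst_port: int
    timestamp_ms: int       # session initiation time in milliseconds
    virtual_key_code: int   # 0 when no key preceded the session

# Hypothetical entry for an outgoing TCP session.
entry = LogEntry(
    session_id=make_session_id("host-01", "192.0.2.7"),
    direction=0,
    protocol="TCP",
    src_port=1045,
    dst_port=80,
    timestamp_ms=1_000,
    virtual_key_code=13,
)
```

Because only the digest is stored, two entries can be matched to the same remote endpoint without revealing its address.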
Some pertinent statistics of the collected benign data are listed in Table 6. Diversity
of the endpoints used in this study is evident from Table 6, which shows that the
endpoints operate in different environments (and hence run different types of
applications). Also, the total size of the dataset (i.e., total number of sessions) varies from
11,996 for endpoint 13 to 444,345 for endpoint 4. In general, we observed that home
computers generate significantly higher traffic volumes than office and university
computers because: (i) they are generally shared between multiple users, and (ii) they run
peer-to-peer and multimedia applications. The high traffic volumes of home computers
are also evident from the high mean sessions per second [column 4].
Another interesting observation is that, with the exception of home computers, the
observed endpoints generally use a small set of source and destination ports very
frequently [columns 5 and 6]. (The source and destination port frequencies in Table 6 are
computed for outgoing unicast packets.) This observation holds particularly true for
destination ports because in most cases ten destination ports are used approximately 90%
of the time; endpoints 3 and 4 are the exceptions here. This is a preliminary
Table 6. Statistics of the Benign Profiles
Columns: (1) Endpoint ID; (2) Endpoint Type (Home/Office/Univ); (3) Total Sessions;
(4) Mean Session Rate (sps); (5) Cumulative Freq of Ten Most-Used Src Ports (%);
(6) Cumulative Freq of Ten Most-Used Dst Ports (%); (7) Cumulative Freq of Ten
Most-Used Session Keys (%)
(1)  (2)        (3)      (4)   (5)    (6)    (7)
1    Office     33,487   0.25  90.37  88.06  96.01
2    Office     21,066   0.22  47.8   87.53  92.32
3    Home       373,009  1.92  3.95   37.29  94.01
4    Home       444,345  5.28  5.86   10.82  94.86
5    Home/Univ  27,873   0.44  15.91  99.27  95.25
6    Univ       60,979   0.19  54.95  94.0   95.49
7    Univ       171,601  0.28  40.7   96.75  95.56
8    Univ       41,809   0.52  66.1   96.44  96.13
9    Univ       235,133  0.41  44.1   94.84  95.48
10   Univ       152,048  0.21  75.19  95.11  95.27
11   Univ       207,187  0.31  38.85  95.2   95.14
12   Home/Univ  100,702  0.33  24.78  95.0   95.13
13   Univ       11,996   0.23  44.56  95.98  95.95
indication that port usage is a statistic that is somewhat consistent across endpoints, and
therefore can be leveraged to detect malicious activity. Also, later in the thesis it is shown
that the different benign behavior of home endpoints poses a considerable challenge to
malware detectors.
The last important observation is that without exception all of the observed endpoints
use a small set of session initiation keys very frequently [column 7]. (The session
initiation key frequencies in Table 6 are computed for outgoing unicast packets with non-
zero keys.) In fact, on all hosts more than 90% of the sessions are initiated using 10
keys. This is a preliminary indication that the correlation of the session-key data is
consistent across endpoints and therefore can be leveraged to detect malicious activity.
The joint session-key data described above provides correlated information about
keystrokes and sessions. In other words, this data can be used to develop a joint session-
key probability distribution. In addition to the correlated/joint data, the keystroke-based
detectors proposed later in this thesis also require marginal distributions of keystrokes.
That is, we need a distribution of all the keystrokes that are pressed on an endpoint. The
following section describes this data.
B.4.2 All-Keystrokes’ Profiles
To develop a marginal distribution of keystrokes, we had to log all the keys that are
pressed on a host. Due to strict privacy constraints imposed by the university, and due in
part to user reservations, it was not possible to collect such data on all the participating
hosts. We installed a custom-developed keylogger on two computers [endpoints 5 and
12] and collected keystroke data for more than a month. Each entry of the keylogger
contains two fields: <timestamp, keystroke>, which are in the same format as
described in the last section.
This dataset is referred to as the all-keys data. For the remaining endpoints, an
average of the all-keys data of endpoints 5 and 12 is used for the keystrokes’ marginal
distribution. This marginal keystroke distribution is simply a normalized histogram of the
frequency of usage of the keystrokes.
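A normalized keystroke histogram and the averaged distribution used for the remaining endpoints can be sketched as follows. Averaging the two normalized histograms key by key (rather than pooling the raw counts) is an assumption of this sketch, as are the function names and the toy key codes.

```python
from collections import Counter

def key_distribution(keys):
    """Normalized histogram of key usage: key code -> relative frequency."""
    n = len(keys)
    return {k: c / n for k, c in Counter(keys).items()}

def average_distributions(d1, d2):
    """Key-by-key average of two distributions; absent keys count as 0."""
    return {k: 0.5 * (d1.get(k, 0.0) + d2.get(k, 0.0))
            for k in set(d1) | set(d2)}

# Hypothetical all-keys logs of two endpoints (virtual key codes).
keys_a = [65, 65, 66, 67]
keys_b = [65, 68]
marginal = average_distributions(key_distribution(keys_a),
                                 key_distribution(keys_b))
```

The averaged result is itself a probability distribution, since each input sums to one and the two inputs are weighted equally.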
In addition to benign data, we have also collected malware data generated by real
malicious codes. The following section explains collection of the malicious traffic data.
B.4.3 Malware Classification
To generate traffic patterns for each malware, we infected a vulnerable machine with
a malware and observed the traffic generated by the malware using the argus data
utility described in the previous section. (The vulnerable machines used here are different
from the operational endpoints used for benign profile collection.) This section details the
malware collected and simulated in this study. Before we describe malware data
collection, explanation of some terminology is in order.
After compromising a vulnerable host, a malware tries to infect other computers by
sending out scan packets with infectious payloads. A vulnerable machine gets infected if
it receives and processes a scan packet. Throughout this text, scan packets generated by a
malware after compromising a vulnerable host are referred to as outgoing scan packets.
Based on the outgoing scan packets, we classify malware into two broad categories:
• Destination-port malware: destination ports of scan packets are fixed, but the source
ports may be arbitrary;
• Source-port malware: source ports of scan packets are fixed, but the destination ports
may be arbitrary.
In the former case, we call the destination ports of a malware attack ports and source
ports non-attack ports. In the latter case, the roles are reversed and we call source ports
attack and destination ports non-attack. With the exception of the Witty worm [57],
[60], all contemporary malware are destination-port malware. However, the above
classification is important to understand later results. Note that a source/destination port
malware can be multi-vector [41] targeting multiple vulnerabilities simultaneously. We
now describe the malware used in this study.
B.4.4 Real Malware
A critical aim of our study is to use real and diverse malware data to test our detection
techniques. To this end, we installed original and unpatched releases of Windows 2000
and Windows XP on a computer using Microsoft Virtual PC 2004 [49]. The advantage of
using virtual machines (VMs) was that once a virtual host was infected, we could
reinstall it by overriding just a few key files. We assigned static IP addresses to both
virtual machines and connected them to the Internet. These hosts were then compromised
by the following malware: Zotob.G [50], Forbot-FU [51], Sdbot-AFR [52], and
Dloader-NY [53]. We also requested network administrators and research
collaborators in our university to share malware binaries and source codes with us. This
way we acquired SoBig.E@mm [54] and the C source code of MyDoom.A@mm [55],
which are mass-mailing worms. Finally, we downloaded binaries or source codes of the
following worms from the Internet: Blaster [56], Rbot-AQJ [57], and RBOT.CCC
[58].
Table 7 shows the diversity of the malware used in this thesis. The malware have
different (and sometimes multiple) attack ports and transport protocols. Also, these
malware include both high- and low-rate malware; Dloader-NY has the highest scan
rate of 46.84 scans per second (sps), while MyDoom-A and Rbot-AQJ have very low
scan rates of 0.14 and 0.68 sps, respectively. We show later that the low-rate
MyDoom-A and Rbot-AQJ are more difficult to detect than high-rate malware. Blaster is one
of the two worms that are used to generate negative examples for SVM training later in
the document.
All real malware collected for this study fall into the widely prevalent category of
destination-port malware. While these malware provided us with a good base for
evaluating our proposed techniques, we wanted to test our methods against an even
broader class of attacks. Consequently, we simulated three additional malware that were
somewhat different from the ones described above. These simulated malware and their
distinguishing characteristics are described next.
Table 7. Information of Malware Used in This Study
Malware       Release Date  Avg. Scan Rate (sps)  Port(s) Used
Blaster       Aug 2003      10.5                  TCP 135, 4444, UDP 69
Dloader-NY    Jul 2005      46.84                 TCP 135, 139
Forbot-FU     Sep 2005      32.53                 TCP 445
MyDoom-A      Jan 2006      0.14                  TCP 3127-3198
RBOT.CCC      Aug 2005      9.7                   TCP 139, 445
Rbot-AQJ      Oct 2005      0.68                  TCP 139, 769
Sdbot-AFR     Jan 2006      28.26                 TCP 445
SoBig.E       Jun 2003      21.57                 TCP 135, UDP 53
Zotob.G       Jun 2003      39.34                 TCP 135, 445, UDP 137
Witty         Mar 2004      357.0                 UDP 4000
CodeRed II    Jul 2004      4.95                  TCP 80
Sim Src Port  Simulated     3.57                  TCP 1500
B.4.5 Simulated Malware
The first malware simulated for this study is the Witty worm [59], [60]. Among
other distinguishing characteristics, this worm has two unique properties that are of direct
consequence here: (a) it uses a fixed source port 4000 to propagate, while the destination
port is selected randomly; and (b) after every 20,000 transmitted packets Witty
overwrites a random block on the hard disk of the compromised host. Therefore, Witty
not only falls in the rare source-port malware category, but it also potentially crashes
compromised hosts after dispatching only 20,000 scan packets. On an endpoint with
broadband connectivity, Witty demonstrates an average scan rate of 357 sps, peaking
out at 970 sps [60]. At this rate, 20,000 scan packets can be transmitted (and the infected
host crashed) in less than a minute, which presents a tremendous challenge to real-time
detectors. We simulated the Witty worm using the exact pseudo random number
generator parameters and pseudo code provided in [60]. We only test the worst-case
scenario with 20,000 scan packets at the average scan rate of 357 sps.
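The "less than a minute" figure follows directly from the packet count and the scan rates quoted above. The short calculation below makes the arithmetic explicit; the function name is introduced only for this illustration.

```python
def seconds_to_send(num_packets, scans_per_second):
    """Time in seconds needed to emit num_packets at a constant scan rate."""
    return num_packets / scans_per_second

WITTY_PACKETS = 20_000

at_average_rate = seconds_to_send(WITTY_PACKETS, 357)  # roughly 56 seconds
at_peak_rate = seconds_to_send(WITTY_PACKETS, 970)     # roughly 21 seconds
```

Even at the average rate, the detector therefore has under one minute of malicious traffic to work with before the host may crash.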
In addition to Blaster, we employ Witty as the second worm for training the
SVMs in the network-based malware detector proposed in the following chapter. To
comprehensively evaluate the performance of the proposed detector for source port
malware, we simulate a worm that sends scan packets with a fixed TCP source port of
1500 at an average scan rate of 3.57 sps; note that this scan rate is exactly 100 times
lower than Witty’s average scan rate, which makes this simulated worm challenging to
detect.
The last simulated malware of this study is an HTTP worm. We acknowledge that it
is unlikely that an endpoint will be running a service that can be infected by an HTTP
malware. Nevertheless, we simulate an HTTP worm because it uses destination port
80, which is a very common port in the benign profile of an endpoint. Thus it is quite
challenging for network-based frequency/histogram detectors to detect malicious HTTP
traffic. We simulate the HTTP-based CodeRed II worm [61] using an average scan
rate of 4.95 sps [62]. Table 7 gives additional information about the simulated malware.
B.4.6 Inserting Malware Data in Benign Traffic Profiles
A vulnerable VM was infected with each of the malicious codes. We then used
argus to log malicious traffic traces from the VM in the same format as the benign
session-key data. While this provided us complete information about the malicious
sessions, we did not have information about the keystrokes that a user will be pressing
when a malicious code is trying to propagate after compromising his/her machine. The
only way to realistically generate such data is to infect participating endpoints with
malicious codes without informing the user of that endpoint. Clearly, such a procedure is
not possible. Therefore, for each malicious session we generate an associated keystroke
using the marginal keystroke distribution generated from the all-keys data.
Armed with this information, we insert T minutes of malicious traffic data of each
malicious code in the benign session-key profile of each endpoint at a random time
instance. Specifically, for a given endpoint’s benign session-key profile, we first generate
a random infection time $t_I$ (with millisecond accuracy) between the endpoint’s first and
last session times. Given $n$ malicious sessions starting at times $t_1, \ldots, t_n$, where $t_n \leq T$,
we create a special infected profile of each host with these sessions appearing at times
$t_I + t_1, \ldots, t_I + t_n$. Thus in most cases once a malware’s traffic is completely inserted
into a benign profile, the resultant profile contains interleaved benign and malicious
sessions starting at $t_I$ and ending at $t_I + t_n$. For all malware used in this study, we use
$T = 15$ minutes.
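The insertion procedure can be sketched as follows. This is a minimal illustration that reduces each session to its start time in milliseconds; the tuple representation, the labels, and the function name are assumptions of this sketch.

```python
import random

def insert_malware(benign_times_ms, malware_offsets_ms, rng=random):
    """Interleave malicious sessions into a benign profile.

    A random infection time is drawn (with millisecond resolution) between
    the first and last benign session times, and every malicious session is
    shifted by that amount. Returns the infection time and the merged,
    time-sorted list of (time, label) entries.
    """
    t_infect = rng.randint(min(benign_times_ms), max(benign_times_ms))
    profile = [(t, "benign") for t in benign_times_ms]
    profile += [(t_infect + t, "malware") for t in malware_offsets_ms]
    profile.sort(key=lambda entry: entry[0])
    return t_infect, profile

# Hypothetical benign session times and malicious session offsets (ms).
benign = [0, 1_000, 5_000, 900_000]
offsets = [0, 10, 20]
t_infect, infected_profile = insert_malware(benign, offsets, random.Random(0))
```

Passing a seeded `random.Random` instance, as in the usage lines, makes a particular infected profile reproducible across runs.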
We are now ready to use the infected profiles to characterize traffic and keystroke
perturbations observed when an endpoint is compromised by a malicious code. In the
following two chapters, we propose two malware detection techniques that use the data
described in this section.
CHAPTER B.5 MALWARE DETECTION USING TRAFFIC FEATURES
In this chapter, we propose the first of the two information-theoretic malware
detection techniques developed in this thesis. This technique is purely network-based and
does not utilize the keystroke data described in the last section. Thus in this chapter
malware are detected using only traffic perturbations. Like prior endpoint-based studies,
throughout this thesis we focus solely on outgoing unicast traffic since incoming packets
can be easily blocked using firewalls.
We observe that the vulnerabilities targeted by all malware are associated with a
small number of source or destination ports. Thus, on a compromised machine the
distribution of source or destination ports on which a host communicates should be
perturbed after infection. These perturbations can be quantified using information-
theoretic measures. This chapter evaluates the efficacy of using port perturbations as
features for malware detection and identifies appropriate measures to quantify these
perturbations.
B.5.1 Malware Detection Using Sample Entropy
Lakhina et al. [13] in a recent work showed that sample entropy of source and
destination ports observed at a border router can reveal traffic anomalies. We first
evaluate whether entropy is an appropriate feature to detect traffic anomalies at an
endpoint and then propose an alternative framework that significantly surpasses the
performance of prior detectors when used at endpoints.
B.5.1.1 Entropy of Source and Destination Ports
Entropy characterizes the degree of dispersal (or concentration) of a probability
distribution without regard to the actual values of the random variable under
consideration. This degree of dispersal is characterized by the variance of a probability
distribution. To compute sample traffic entropy as proposed in [13], we generate usage
frequency histograms of source and destination ports for outgoing packets using a
20-second window (other window sizes produce qualitatively similar results). Source and
destination port histograms for each window are computed by counting the number of
times a particular port is used during the window.
Let $S_n$ and $D_n$ denote the sets of source and destination ports observed in window
$n$, respectively. Define $X_n = \{p_i^n, i \in S_n\}$ and $Y_n = \{q_j^n, j \in D_n\}$ to be respectively
the source and destination port histograms derived from window $n$, where $p_i^n$ is the
number of times source port $i$ was used in time-window $n$ and $q_j^n$ is the number of
times destination port $j$ was used in time-window $n$. Also let us define
$p^n = \sum_{i \in S_n} p_i^n$ as the aggregate frequency of source ports observed in window $n$ and
$q^n = \sum_{j \in D_n} q_j^n$ as the corresponding frequency of destination ports. Then sample
entropies of the source and destination port histograms for window $n$ can be computed
as:
$$H(X_n) = -\sum_{i \in S_n} \frac{p_i^n}{p^n} \log_2 \frac{p_i^n}{p^n} \quad \text{and} \quad H(Y_n) = -\sum_{j \in D_n} \frac{q_j^n}{q^n} \log_2 \frac{q_j^n}{q^n}. \qquad \text{(B.49)}$$
If there is no traffic in a window $n$ (i.e., $p^n = 0$ or $q^n = 0$) then malware detection is
not performed. For simplicity, we refer to sample entropy as entropy and the normalized
port histograms as port distributions.
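The computation in (B.49) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sample_entropy`, the per-packet list input, and the port numbers are assumptions made here, not part of the detector described in this chapter.

```python
import math
from collections import Counter

def sample_entropy(ports):
    """Sample entropy (in bits) of the ports seen in one time window.

    `ports` holds one entry per packet (hypothetical input format).
    Returns None for an empty window, mirroring the rule that detection
    is skipped when there is no traffic.
    """
    counts = Counter(ports)        # p_i^n: uses of port i in window n
    total = sum(counts.values())   # p^n: aggregate frequency
    if total == 0:
        return None
    # sum p*log2(1/p) is the same quantity as -sum p*log2(p) in (B.49)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Illustrative windows with made-up port numbers (not taken from the dataset).
concentrated = [80] * 8                 # every packet uses the same port
dispersed = [1025, 1026, 1027, 1028]    # one packet per port

print(sample_entropy(concentrated))     # no dispersal: zero bits
print(sample_entropy(dispersed))        # uniform over four ports: two bits
```

A concentrated window yields zero entropy and a uniform window over four ports yields two bits, which matches the reading of entropy as a measure of dispersal.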
B.5.1.2 Entropy-based Traffic Perturbations in the Infected Profiles
Figure 35 shows source and destination port entropies of four different endpoints
infected with one random instance of Blaster, MyDoom-A, Rbot-AQJ, and Witty.
[Figure 35: four panels, each plotting destination port entropy and source port entropy against time window. (a) endpoint 1, Blaster; (b) endpoint 5, MyDoom; (c) endpoint 9, Rbot-AQJ; (d) endpoint 13, Witty.]
Figure 35. Source and destination port entropies at infected endpoints. Infection start
times are marked with a circle. Infections in (a), (b), and (c) last approximately 15 minutes, while that in (d) lasts approximately one minute. Each non-overlapping time-window is 20 seconds.
Figure 35 shows that the attack port entropies do not reveal any discernible perturbations.
However, in some cases entropy perturbations in non-attack ports can provide useful
information about infection. For instance, Figure 35 (a) shows that the entropy of source
ports exhibits a sudden increase at the time of infection. Similar behavior is observed in
Figure 35 (d), where the non-attack (destination) ports’ entropy jumps at the infection
time. This phenomenon is not observed for low-rate (MyDoom-A and Rbot-AQJ)
malware as shown in Figure 35 (b) and (c). The jump in non-attack ports’ entropy for
high-rate malware is due to the fact that most endpoints initiate only a few sessions
during any given time-window [see Table 6]. Once compromised, while the attack port is
fixed, an endpoint starts communicating through a large number (i.e., one per scan
packet) of non-attack ports. Thus the degree of dispersal of non-attack ports increases
dramatically, in turn leading to an increase in the entropy. Since low-rate malware does not
initiate many simultaneous sessions, no perturbations in the non-attack ports are
discernible for low-rate malware. Thus we conclude that entropy cannot detect
perturbations caused by low-rate malware.
Results of Figure 35 are at odds with [13] which showed that the entropy of
destination ports was perturbed significantly on compromised networks. The failure of
entropy-based anomaly detection in Figure 35 is due to the huge difference in the volume
of traffic observed at a network’s perimeter as opposed to that observed at an endpoint.
During an attack, a perimeter router still observes a considerable amount of traffic on
benign ports, thus perturbing the port distributions enough so as to allow entropy-based
detectors to discern the attack. However, this phenomenon does not occur at individual
endpoints as explained by the following example. Consider two windows of activity
observed by an endpoint’s entropy-based anomaly detector. The first window has benign
activity with 9 HTTP sessions on port 80 and 1 FTP session on port 21. The second
window contains malicious activity with 900 malicious sessions on port 135 and 100
malicious sessions on port 500. After normalization, both of these windows will render
the same port distribution $\{p_i^n / p^n, i \in S_n\} = \{0.9, 0.1\}$, which is a Bernoulli random
variable with parameter $0.9$. Consequently, the entropy of both malicious and benign
windows will be exactly the same although the traffic behavior in each case is completely
different. Thus anomalous activity can go undetected because entropy does not take the
actual values of source/destination ports into consideration.
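Working the example through (B.49) makes the collision explicit. The labels "benign" and "malicious" below are shorthand introduced here for the two windows of the example:

```latex
\begin{align*}
H(\text{benign})    &= -\left(\tfrac{9}{10}\log_2\tfrac{9}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10}\right)
                     \approx 0.137 + 0.332 = 0.469 \text{ bits},\\
H(\text{malicious}) &= -\left(\tfrac{900}{1000}\log_2\tfrac{900}{1000} + \tfrac{100}{1000}\log_2\tfrac{100}{1000}\right)
                     \approx 0.469 \text{ bits}.
\end{align*}
```

Both windows reduce to the same pair of probabilities, so the port numbers (80 and 21 versus 135 and 500) never enter the calculation.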
Results presented in this section show that the entropy-based framework of [13] is not
a robust indicator of infection when applied to attack ports, and that it cannot be used
to detect non-attack port perturbations for low-rate malware. Entropy fails to highlight
anomalies because it does not take the actual values of the source and destination ports
into consideration. To address this problem, the following section employs an
information-theoretic measure that compares port probabilities in the current window to
the corresponding port probabilities in an endpoint’s benign profile.
B.5.2 Malware Detection Using Information Divergence
At this point, we have established that for effective malware detection we need to use
a measure that compares the frequencies of individual ports. Information divergence
measures can provide such comparison. In this section, we evaluate four information
divergence measures. First, we evaluate the widely-used Kullback-Leibler divergence.
B.5.2.1 Kullback-Leibler Divergence of Source and Destination Ports
The Kullback-Leibler (K-L) divergence [43] is an information-theoretic measure of
the similarity or dissimilarity between two probability distributions. Let us denote the
benign source and destination port histograms derived from an endpoint’s benign profile
as $X = \{p_i, i \in S\}$ and $Y = \{q_j, j \in D\}$, where $S$ and $D$ respectively denote the sets
of source and destination ports observed in the benign profile. Then the K-L divergence
between the benign and currently observed port histograms can be expressed as:
$$D(X_n \| X) = \sum_{i \in S_n} \frac{p_i^n}{p^n} \log_2 \frac{p_i^n / p^n}{p_i / p} \quad \text{and} \quad D(Y_n \| Y) = \sum_{j \in D_n} \frac{q_j^n}{q^n} \log_2 \frac{q_j^n / q^n}{q_j / q}, \tag{B.50}$$
where $p = \sum_{i \in S} p_i$ and $q = \sum_{j \in D} q_j$ respectively represent the aggregate source and
destination port frequencies observed in the benign profile. Note that K-L divergence is
an asymmetric measure. The advantages of using window-based metrics $X_n$ and $Y_n$ as
primary distributions of the K-L divergence are twofold: (a) fewer sessions are observed
in a window as opposed to the benign profile, $|S_n| < |S|$ and $|D_n| < |D|$, which
reduces the complexity of real-time detection; and (b) better detection accuracy can be
achieved if we focus on the specific ports engaged in communication during the current
window $n$.
We generate port histograms of benign profiles using the first 100 sessions on an
endpoint. The training time for the endpoints of this study ranged from 12 hours to 5
days with an average of approximately 2 days. We train with only 100 sessions to
quantify worst-case performance of the proposed detector.
To effectively leverage K-L divergence in the present endpoint-based anomaly
detector, we introduce the following provisions. First, in (B.50) if $p_i = 0$ and $p_i^n > 0$
for any $i$, then $D(X_n \| X)$ is set to $\infty$. The same problem arises for $D(Y_n \| Y)$ in (B.50).
In other words, $X_n$ and $Y_n$ must be absolutely continuous with respect to $X$ and $Y$, respectively.
To achieve this, before training we initialize the benign histograms with $p_i = 1$ and
$q_i = 1$, for $i = 0, \ldots, 65535$, which assigns never-used ports very small, non-zero
frequencies.
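A minimal sketch of (B.50) together with the initialization provision is given below. The function names, the list-based histogram, and the toy training and test windows are assumptions made for illustration; the constant normalization discussed next is not applied here.

```python
import math
from collections import Counter

NUM_PORTS = 65536  # ports 0..65535

def train_benign_histogram(benign_ports):
    """Benign port counts, with every port initialised to 1 before training."""
    hist = [1] * NUM_PORTS
    for port in benign_ports:
        hist[port] += 1
    return hist

def kl_divergence(window_ports, benign_hist):
    """D(X_n || X) in bits, summed only over the ports seen in the window."""
    counts = Counter(window_ports)   # p_i^n
    n_total = sum(counts.values())   # p^n
    b_total = sum(benign_hist)       # p
    div = 0.0
    for port, c in counts.items():
        p_win = c / n_total
        p_ben = benign_hist[port] / b_total   # never zero after initialisation
        div += p_win * math.log2(p_win / p_ben)
    return div

# Toy data (made up): 100 benign training sessions, then two test windows.
benign_hist = train_benign_histogram([80] * 90 + [21] * 10)
d_benign = kl_divergence([80] * 9 + [21], benign_hist)
d_scan = kl_divergence([135] * 900 + [500] * 100, benign_hist)
print(d_benign < d_scan)  # the scan window diverges more from the benign profile
```

Unlike entropy, the two toy windows are now separated, because each window probability is compared against the benign frequency of the same port.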
Second, it is well-known that scaling of training data improves the performance of
learning tools by making the training process better behaved and by mitigating the bias
towards larger input values. Therefore, we normalize the K-L divergence values by a
constant factor.
Finally, to reduce complexity and to filter out noise due to benign data, we introduce
a provision to ignore overtly benign behavior. From the training data, we generate a
histogram of session volume (i.e., total number of sessions) in a window. After
normalization, we compute the histogram’s mean $\mu_e$ and variance $\sigma_e^2$ for each endpoint
$e$. We invoke malware detection only when the total number of sessions observed in a
window is greater than $\gamma = \mu_e + \sigma_e$. The value of $\gamma$ varied between 3 and 13
sessions per 20-second window, with an average of 6.6 sessions per 20-second window,
for the endpoints considered in this study.
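The volume filter can be sketched as follows. The function names and the training counts are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def session_volume_threshold(sessions_per_window):
    """gamma = mean + standard deviation of benign per-window session counts.

    Uses the population standard deviation (an assumption of this sketch).
    """
    mu = statistics.mean(sessions_per_window)
    sigma = statistics.pstdev(sessions_per_window)
    return mu + sigma

def should_run_detection(sessions_in_window, gamma):
    """Invoke the detector only for windows busier than the benign threshold."""
    return sessions_in_window > gamma

# Made-up benign training windows: mean 5 sessions, standard deviation 2.
training_counts = [2, 4, 4, 4, 5, 5, 7, 9]
gamma = session_volume_threshold(training_counts)

print(gamma)
print(should_run_detection(8, gamma), should_run_detection(7, gamma))
```

Windows at or below the threshold are skipped entirely, which is what keeps overtly benign traffic out of the divergence computation.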
[Figure 36: six panels, each plotting destination port K-L divergence and source port K-L divergence against time window. (a) endpoint 1, Blaster; (b) endpoint 5, MyDoom; (c) endpoint 9, Rbot-AQJ; (d) endpoint 3, SoBig; (e) endpoint 10, Zotob; (f) endpoint 13, Witty.]
Figure 36. Source and destination ports’ K-L divergences at infected endpoints.
B.5.2.2 K-L-based Traffic Perturbations in the Infected Profiles
The K-L divergences of different endpoints randomly infected with a single infection
of each malware are outlined in Figure 36. We first focus on malware with high scan
rates. From Figure 36 (a), (d), (e), and (f), it is clear that the K-L divergence highlights
anomalous behavior in both attack and non-attack ports for malware with high scan rates.
Comparing Figure 36 (a), (d), (e), and (f) with entropy-based perturbations of Figure 35
(a) and (d) establishes the effectiveness of using a port-by-port divergence measure to
highlight traffic anomalies. Specifically, a K-L-based anomaly detector can reveal
perturbations in the attack port distribution, which is an important characteristic that was
completely missed by entropy. Moreover, for high scan rate malware of Figure 36 (a),
(d), (e), and (f), perturbations in the non-attack port distribution in the K-L divergence are
much more pronounced than the entropy perturbations. These perturbations are revealed for
both destination [i.e., Blaster, Zotob.G, and SoBig.E] and source [i.e., Witty]
port malware.
For the low-rate malware [MyDoom-A and Rbot-AQJ], Figure 36 (b) and (c) show
obvious perturbations in the attack port divergence. Comparing these two figures with
Figure 35 (b), (c) clearly establishes the advantages of using K-L-based detection features
as opposed to entropy. Due to the low rate of these malware, even the K-L divergence
cannot reveal non-attack-port perturbations. Nevertheless, our results show that
perturbations in the attack port feature are more than sufficient to detect infection.
B.5.2.3 Evaluating Traffic Perturbations with Other Information Divergences
In the last section, we observed that the attack ports’ K-L divergence always gets
perturbed on a compromised endpoint. However, the non-attack ports are not perturbed
for low-rate worms. In this section, we evaluate three other information-theoretic
divergence measures with the objective of identifying a measure which can
simultaneously highlight perturbations in both attack and non-attack port distributions for
low-rate worms. A brief description of the three measures follows.
As before, let $X$ and $Y$ respectively represent the source and destination port
distributions observed in the benign profiles, and $X_n$ and $Y_n$ respectively represent the
source and destination port distributions observed in window $n$. The first information
measure that we employ is the Jensen-Shannon (J-S) divergence measure [44] defined
as:
$$J(X_n \| X) = H(\pi_1 X_n + \pi_2 X) - \pi_1 H(X_n) - \pi_2 H(X)$$
$$\text{and} \quad J(Y_n \| Y) = H(\pi_1 Y_n + \pi_2 Y) - \pi_1 H(Y_n) - \pi_2 H(Y), \tag{B.51}$$
where $\pi_1$ and $\pi_2$ are weighting factors such that $\pi_1 + \pi_2 = 1$, and $H(\cdot)$ is the entropy
function.
The second information divergence measure used in this work is the K directed
divergence measure defined as [44]:
$$K(X_n \| X) = \sum_{i \in S_n} \frac{p_i^n}{p^n} \log_2 \frac{p_i^n / p^n}{\frac{p_i^n}{2 p^n} + \frac{p_i}{2 p}}$$
$$\text{and} \quad K(Y_n \| Y) = \sum_{j \in D_n} \frac{q_j^n}{q^n} \log_2 \frac{q_j^n / q^n}{\frac{q_j^n}{2 q^n} + \frac{q_j}{2 q}}, \tag{B.52}$$
where parameters of the above expressions are defined in (B.49) and (B.50).
The third and last information measure that we use is the Resistor-Average (R-A)
divergence measure defined as [45]:
$$\frac{1}{R(X_n \| X)} \equiv \frac{1}{D(X_n \| X)} + \frac{1}{D(X \| X_n)}$$
$$\text{and} \quad \frac{1}{R(Y_n \| Y)} \equiv \frac{1}{D(Y_n \| Y)} + \frac{1}{D(Y \| Y_n)}, \tag{B.53}$$
where $D(\cdot \| \cdot)$ is the K-L divergence.
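The three measures can be compared side by side with a short sketch. It operates on already-normalized probability lists over a common set of ports, which is a simplification made here; the function names and the two example distributions are likewise hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability list."""
    return sum(x * math.log2(1 / x) for x in p if x > 0)

def kl(p, q):
    """D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def jensen_shannon(p, q, pi1=0.5, pi2=0.5):
    """J-S divergence with weights pi1 + pi2 = 1."""
    mix = [pi1 * x + pi2 * y for x, y in zip(p, q)]
    return entropy(mix) - pi1 * entropy(p) - pi2 * entropy(q)

def k_divergence(p, q):
    """K directed divergence: K-L divergence of p from the midpoint of p and q."""
    mid = [0.5 * x + 0.5 * y for x, y in zip(p, q)]
    return kl(p, mid)

def resistor_average(p, q):
    """R-A divergence: the two directed K-L divergences combined like parallel resistors."""
    d_pq, d_qp = kl(p, q), kl(q, p)
    if d_pq == 0 or d_qp == 0:
        return 0.0
    return 1.0 / (1.0 / d_pq + 1.0 / d_qp)

# Made-up window and benign distributions over two ports.
p_window = [0.9, 0.1]
p_benign = [0.5, 0.5]

print(jensen_shannon(p_window, p_benign))
print(k_divergence(p_window, p_benign))
print(resistor_average(p_window, p_benign), resistor_average(p_benign, p_window))
```

Note that `resistor_average` needs both directed K-L divergences, so both distributions must be positive on the same ports; this is the source of the extra cost relative to a single K-L computation.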
Figure 37 shows source and destination port perturbations characterized by J-S, K
and R-A divergences. We only show perturbations for the low-rate MyDoom worm
because perturbations due to high-rate worms are adequately highlighted by the K-L
divergence. Figure 37 clearly shows that none of the three divergences under
consideration can highlight perturbations in the source (non-attack) ports. Thus in the
non-attack ports’ context, the divergences under consideration do not provide any
advantages. Also, from Figure 37 (a) and (b) it can be seen that even the destination
(attack) port perturbations in the J-S and K divergences are not as clear and pronounced as
the K-L case [see Figure 36 (b)]. The R-A divergence, however, provides clear
perturbations in the destination ports [Figure 37 (c)]. This divergence measure also has
the advantage of being symmetric, i.e., $R(X_n \| X) = R(X \| X_n)$ and
$R(Y_n \| Y) = R(Y \| Y_n)$. On the other hand, R-A divergence is more complex than K-L
divergence because it requires computation of two K-L divergences. One of these
divergences has to be computed over the entire sample space of the benign profile,
thereby presenting a significant complexity overhead for an endpoint. Hence, in problem
areas where divergence symmetry is important and complexity is not an issue, R-A
divergence is more appropriate than K-L divergence. In the present malware detection
context, we continue to use the K-L divergence for the remainder of this chapter.
In the following section, we train a machine learning tool using the K-L divergence of
benign and malicious data, which is then used for automated malware detection and
comparison of our approach with prior methods.
[Figure 37: three panels, each plotting the destination port divergence and the source port divergence against time window. (a) J-S Divergence, endpoint 5, MyDoom; (b) K-Divergence, endpoint 5, MyDoom; (c) Resistor-Average Divergence, endpoint 5, MyDoom.]
Figure 37. Jensen-Shannon (J-S), K- and resistor-average (R-A) divergences of source and destination ports at infected endpoints.
B.5.3 Leveraging K-L Perturbations in an SVM-based Framework
To use K-L divergences of source and destination ports for automated malware
detection, we first used a simple thresholding mechanism where K-L values above and
below a certain threshold were flagged as anomalous. This simple technique, however,
resulted in high false alarm rates. Consequently, in this section we resort to the
more sophisticated support vector machines (SVMs) [46] for real-time malware detection. We
first train the SVMs using K-L divergence values derived from a subset of the benign
profiles and malware data. The SVMs are then used to detect malware in the infected
profiles. We also compare the performance of the proposed detector with the techniques
proposed in [14] and [20].
From contemporary machine learning tools, we select SVMs to classify K-L
divergence values because: (a) SVMs are not probabilistic in nature. Probabilistic
intrusion detectors generally do not take the basic rate of incidence into account, thus
yielding low Bayesian detection rates. (b) SVMs employ a small subset of training
examples (called support vectors) for classification, and all remaining examples are
irrelevant to the classification task. Thus SVMs can train with very few positive (benign)
and negative (malware-based) examples, allowing timely and low-complexity training.
Few negative examples also improve detection rates for novel malware. (c) SVMs are
inherently designed for binary-decision tasks, such as anomaly detection.
B.5.3.1 SVM Training
In this section, we use a small subset of malicious and benign data to train support
vector machines for real-time detection of malware using the K-L divergence. Given
positive and negative training examples, an SVM finds a classification boundary that
maximizes the distance between the two classes, while minimizing classification error.
We use a degree-3 radial basis kernel function to train a C-SVM [46].
It should be noted that the use of malware-based, negative training examples does not
compromise the proposed detector’s ability to detect novel malware. The negative
examples only provide a rough quantification of the magnitude of K-L perturbations on a
compromised endpoint. This quantification can be provided by any malware that
highlights perturbations in the source and/or destination ports’ distributions. In general, a
high-rate source-port malware in conjunction with a high-rate destination-port malware
can encompass the increase and decrease in the K-L divergences of source and
destination ports. Malware traces can be hardcoded into the detector, and the training
algorithm can merge the malware traces with an endpoint’s traffic logs to compute the
negative K-L divergence examples.
We use the source and destination ports’ K-L divergence values to train two SVMs
for each endpoint. To train the SVMs, we take ten K-L divergence values from the
benign traffic profile. These values comprise the positive examples. We then take a total
of 13 negative examples by computing K-L divergence of benign traffic windows with
Blaster- and Witty-infected windows. Performance evaluations in the next section
illustrate that this small subset of the available training data can provide highly accurate
detection of novel malware, where the term novel refers to all the remaining malware not
used for SVM training.
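The training step can be sketched as below, assuming the scikit-learn library; the thesis does not name the toolkit it used, so the `SVC` call, its hyperparameters, and the K-L values are all illustrative assumptions rather than the configuration of the proposed detector.

```python
from sklearn.svm import SVC

def train_kl_svm(benign_kl, malware_kl):
    """Fit a C-SVM with an RBF kernel on scalar K-L divergence values.

    Labels follow the text: +1 for benign (positive) examples and
    -1 for malware-based (negative) examples.
    """
    X = [[v] for v in benign_kl + malware_kl]
    y = [1] * len(benign_kl) + [-1] * len(malware_kl)
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # assumed hyperparameters
    clf.fit(X, y)
    return clf

# Made-up K-L values, chosen only to be well separated.
benign_kl = [2.0, 2.1, 2.2, 2.1, 2.0, 2.3, 2.2, 2.1, 2.0, 2.2]        # 10 positive examples
malware_kl = [3.2, 3.4, 3.3, 3.5, 3.1, 3.6, 3.3,
              3.4, 3.2, 3.5, 3.3, 3.4, 3.6]                          # 13 negative examples

dst_svm = train_kl_svm(benign_kl, malware_kl)  # one SVM per port feature
print(dst_svm.predict([[2.1], [3.4]]))         # labels for a benign-like and a malware-like value
```

In the scheme described above, a second SVM would be trained the same way on the source ports’ K-L values, giving two classifiers per endpoint.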
B.5.3.2 Performance Evaluation and Comparison with Existing Techniques
In this section, we compare the performance of the proposed malware detector with
two existing techniques proposed in [14] and [20]. The rate-limiting detector [20] is the
only other technique that is designed specifically for endpoints and the maximum-entropy
detector [14] is one of the only two information-theoretic anomaly detection techniques.
We use the same parameter values and learning/detection algorithms that were
employed in [14] and [20]. We also tried to compare with the entropy-based technique by
Lakhina et al. [13]. However, we observed that it was impractical to migrate the detector
of [13] to endpoints because the detector required projection of high-dimensional feature
metrics into benign and anomalous subspaces at a border router. On an endpoint, the
same technique will result in only 3 possible subspaces, and in most cases it is not
possible to classify them as benign and anomalous using the thresholding technique of
[13].
[Figure 38: two panels plotted against endpoint ID (1 to 13), with one curve each for the Proposed K-L/SVM-based Detector, the Maximum-Entropy Detector, and the Rate-Limiting Detector. (a) detection rate (average detection rate, %); (b) false-alarm rate (average false alarm rate, %).]
Figure 38. Comparison of detection and false-alarm rates of the proposed K-L/SVM-based malware detector with maximum-entropy and rate-limiting detectors. Each point is averaged over 12 malware with 100 random infections per malware per endpoint.
We inserted 100 non-overlapping infections of each malware in every endpoint’s
benign profile. As discussed earlier, each infection lasted approximately $T = 15$ minutes,
with the exception of Witty, for which each infection lasted approximately $T = 1$
minute (i.e., 20,000 packets at 357 sps). Hence, all results provided in this section are
averaged over one hundred experiments per endpoint per malware. We compute detection
and false alarm rates for each experiment as follows. For 100 infections of a particular
malware on an endpoint, the percentage detection rate for that malware is computed by
simply counting the number of infections that are detected by the malware detector. The
false alarm rate is computed by taking the ratio of the total number of false alarms to
the total number of evaluated time-windows (i.e., windows with one or more sessions).
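The two metrics defined above amount to the following arithmetic. The function names and the counts are hypothetical and are not results from this study.

```python
def detection_rate(detected_flags):
    """Percentage of inserted infections that the detector caught.

    `detected_flags` holds one boolean per inserted infection.
    """
    return 100.0 * sum(detected_flags) / len(detected_flags)

def false_alarm_rate(false_alarms, evaluated_windows):
    """Percentage of evaluated windows (one or more sessions) that raised a false alarm."""
    return 100.0 * false_alarms / evaluated_windows

# Illustrative counts only: 97 of 100 infections detected, 5 false alarms in 1000 windows.
flags = [True] * 97 + [False] * 3
print(detection_rate(flags))
print(false_alarm_rate(5, 1000))
```

Note that the denominators differ: detection is counted per infection, whereas false alarms are counted per evaluated time-window.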
The average detection and false alarm rates for each endpoint are shown in Figure 38.
It can be seen in Figure 38 (b) that the proposed K-L/SVM-based detector has negligible
false alarm rates at all endpoints. The highest false alarm rate we observed was
approximately 0.45%, with endpoints 1, 6, 7, and 8 exhibiting almost no false alarms
at all. Also, the detection rate of the proposed technique is 100% for all endpoints except
endpoints 2 and 4; for endpoints 2 and 4, some instances of the low-rate MyDoom-A
and Rbot-AQJ worms were not detected. Nevertheless, even for endpoints 2 and 4, the
average detection rate is above 90%. Hence, overall the proposed K-L/SVM-based
malware detector provides very high accuracy for the diverse set of endpoints considered
in this study.
Let us now compare the proposed detector and the maximum-entropy detector of
[14]. Figure 38 (a) shows that the proposed K-L/SVM-based detector provides much
higher detection rates than the maximum-entropy detector. Also, for the maximum-entropy
detector, the false alarm rates for the home endpoints [endpoints 3 and 4] are
extremely high. We believe that the high false alarm rates are due to peer-to-peer
applications running on the home endpoints of this study. Moreover, the maximum-entropy
detector was designed for deployment at the perimeter where even in a short period of
time most of the 2,348 packet classes of [14] were observed. On an endpoint, many of
these classes are not present in the benign training data. We observed that even if the
maximum-entropy training is performed using a lot of benign data, the performance still
does not improve. (The maximum-entropy model was trained using 100 and 1000
benign sessions, but the performance in both cases was identical.) Also note that due to
the use of a sliding window, the maximum-entropy detector has higher training
complexity and incurs an inherent detection delay that is not present in our detector. The
run-time complexities of the two techniques are comparable as the maximum-entropy
technique requires frequent computation of K-L divergence over a large sample space of
2,348 outcomes, whereas our technique computes K-L divergence over small sample
spaces followed by SVM classification.
For the rate-limiting detector [20], the detection rates of all endpoints except endpoint
2 are much lower than the proposed K-L/SVM-based detector. Also, much like the
maximum-entropy detector, the false-alarm rates for home endpoints are quite high. (A
false alarm is raised when the rate-limiter reports an anomaly, but the session queue of
the rate limiter has no malicious sessions.) Thus the performance of the rate-limiting
detector, although better than the maximum-entropy detector, is still much worse than the
K-L/SVM-based detector proposed in this thesis. The inferior performance of the rate-
limiting detector shows that simply monitoring traffic volume at an endpoint is not
sufficient. In addition to session volume, the actual characteristics of the traffic must also
be taken into account for accurate detection.
Based on the results of this section, we conclude that the K-L/SVM-based malware
detector proposed in this chapter provides significantly better performance than the
techniques of [14] and [20].
[Figure 39: flowchart. Box labels: Start; Network traces of a source port worm and a destination port worm; Generate source and destination port histogram from benign profile data; Store benign histograms of source and destination ports, and µ and s; Train SVMs using K-L divergence of benign and worm data; SVM parameters; Observe all sessions on an endpoint; Source and destination port histograms in last window; Compute source and destination ports’ K-L divergences using benign and window-based histograms; Use SVMs to classify the source and destination ports’ K-L values as benign or anomalous. Decision labels (Yes/No): Benign data > d sessions; Time since last detection > t seconds; Sessions in last window > µ+s. Edge labels: reset, update.]
Figure 39. A generalized flow diagram of the proposed K-L/SVM-based malware
detector. The shaded area contains real-time components.
B.5.4 Summary and Discussion
Figure 39 outlines the data flow of the proposed malware detection technique. In
summary, once deployed the detector initially uses d sessions to characterize benign
source and destination port histograms. Traces of a high-rate source port and a high-rate
destination port malware are hardcoded in the detector. K-L divergence of the benign and
malware-based histograms is used to train SVMs. Parameters of the SVMs are then used
for real-time detection of malware. After every window of t seconds, the detector checks
whether the total number of sessions in the window exceeds the session-volume threshold
learned from the benign profile. If so, the detector computes the K-L divergence between
the window-based histograms and the benign histograms. The trained SVMs are then
used to classify the source and destination port K-L divergences.
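The real-time part of this flow can be sketched as a single per-window check. Everything named below is hypothetical: the divergence and classifier arguments are passed in as callables, the toy stand-ins at the bottom are not K-L divergences or SVMs, and raising an alarm when either port feature is classified as anomalous is an assumption, since the text does not state how the two SVM outputs are combined.

```python
def check_window(sessions, gamma, kl_src, kl_dst, svm_src, svm_dst):
    """Return True if the window is flagged as anomalous.

    sessions: list of (src_port, dst_port) tuples seen in the last window
    gamma:    benign session-volume threshold
    kl_*:     callables mapping a list of ports to a divergence value
    svm_*:    callables mapping a divergence value to True (anomalous) or False
    """
    if len(sessions) <= gamma:
        return False                      # overtly benign volume: skip detection
    src_ports = [s for s, _ in sessions]
    dst_ports = [d for _, d in sessions]
    src_anomalous = svm_src(kl_src(src_ports))
    dst_anomalous = svm_dst(kl_dst(dst_ports))
    return src_anomalous or dst_anomalous  # assumed combination rule

# Toy stand-ins so the sketch runs end to end.
toy_divergence = lambda ports: float(len(set(ports)))  # placeholder score: distinct ports
toy_classifier = lambda value: value > 5               # placeholder decision rule

quiet = [(1025, 80)] * 3
busy_benign = [(1025, 80)] * 10
scan = [(2000 + k, 135) for k in range(20)]             # one source port per scan packet

for window in (quiet, busy_benign, scan):
    print(check_window(window, 6.6, toy_divergence, toy_divergence,
                       toy_classifier, toy_classifier))
```

Passing the divergence and classifier in as arguments keeps the control flow of Figure 39 separate from the particular measure and learner chosen.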
In this chapter, we proposed a network-based malware detector that can detect self-
propagating malicious codes in real-time by leveraging the K-L divergence of benign and
real-time traffic features. In the following chapter, we use the data collected in this thesis
to develop another malware detection technique that correlates both network and host/OS
features to detect self-propagating malware.
CHAPTER B.6 MALWARE DETECTION USING JOINT NETWORK-HOST FEATURES
Traditional anomaly detectors are either host- or network-based. We argue that
significant improvements can be achieved if both network and host features are correlated
and then employed in a joint framework. To that end, in this chapter we propose two
endpoint-based joint network-host anomaly detectors both of which exploit the
observation that when a user is actively using his/her computer most of the benign traffic
is triggered by a small subset of keystrokes and mouse clicks. Based on this observation,
we propose to correlate the last input from the keyboard or mouse hardware buffer with
every new network session in a novel entropy-based information-theoretic framework.
B.6.1 Correlation in the Session-Key Data
As mentioned before, we focus solely on outgoing unicast traffic. Also, for the
present anomaly detector we only focus on the scenario when the end-user is actively
using his/her computer, although he/she may not be accessing the Internet. This is
achieved by only processing sessions with non-zero keystroke values; recall that a zero
keystroke value implies that no key was pressed right before the session. Detection when
a user is inactive cannot employ keystroke data, thereby requiring purely network-based
approaches.
[Figure 40: two bar charts of frequency against virtual key code. (a) endpoint 5, key codes in plotted order: 1, 13, 40, 32, 4, 77, 46, 73, 9, 2, 38, 65, 82, 8, 69, 34, 37, 66, 89, 17; (b) endpoint 12, key codes in plotted order: 1, 13, 40, 83, 9, 32, 65, 2, 162, 38, 46, 8, 77, 34, 37, 73, 82, 4, 39, 70.]
Figure 40. Normalized histograms of 20 most-used session initiation keystrokes. Histograms are generated from the session-key data. Virtual key codes 1 and 13
correspond to the left mouse click and the Enter key, respectively [48].
Figure 40 shows the normalized frequencies of the 20 most-used session initiation
keys for two endpoints. In both cases, more than 85% of the time network sessions are
initiated by the left mouse click or the Enter key. (Similar results are
observed for the remaining endpoints.) Figure 41 shows the normalized histograms of all
the keystrokes that are pressed on a host. Note that the all-keys distribution looks quite
different from the session-key distribution of Figure 40. For one thing, the marginal all-
keys distribution of Figure 41 is much more spread out than the session-key distribution
of Figure 40. That is, the variance of the marginal all-keys distribution is greater than that of the
session-key distribution. Also, contrary to the session-key-based keystroke histogram,
less than 50% of sessions are initiated by the two most commonly used keys. Lastly, neither the left
mouse click nor the Enter key is among the two most commonly used keys in either
Figure 41 (a) or (b). These results can be summarized as follows: (i) users frequently
employ only a few session initiation keys to trigger network sessions, thus there is strong
correlation between these few session initiation keys and network sessions; (ii)
frequencies of session initiation keys are very consistent across different users,
consequently making this a common benign feature that can be leveraged to detect
abnormal behavior; (iii) frequencies of keys that are generally used on a host are quite
different from frequencies of session initiation keys.
[Figure 41: two bar charts of frequency against virtual key code. (a) endpoint 5, key codes in plotted order: 40, 17, 162, 1, 38, 39, 16, 161, 37, 32, 8, 34, 69, 83, 65, 13, 9, 33, 84, 77; (b) endpoint 12, key codes in plotted order: 40, 38, 1, 37, 8, 160, 16, 32, 39, 162, 69, 17, 65, 13, 84, 83, 73, 33, 34, 79.]
Figure 41. Normalized histograms of 20 most-used keystrokes. Histograms are generated from the all-keys data. Virtual key codes 40, 38 and 17 correspond to the down arrow
key, the up arrow key and the control key, respectively [48].
Based on the above discussion, we deduce that session-key correlation is a feature
that is common across users and can be used for malware detection. There are two
information-theoretic measures that can formally leverage this observation for real-time
worm detection. The first measure is the entropy of the keystroke histogram observed in a
time window. Since entropy quantifies the degree of dispersal or concentration of a
probability distribution, according to Figure 41 the keystroke entropy in a malware-infected
window should be higher than in benign windows, where only a few keystrokes
are being used to initiate sessions. The second information-theoretic measure that we use
to quantify the keystroke perturbations is mutual information. From Figure 40 it can be
deduced that in a benign time window mutual information of sessions and keystrokes that
are used to initiate the sessions should be very high. On the other hand, in a malware-
infected window this mutual information should decrease as the keystrokes will be drawn
from the marginal all-keys distributions. The following sections formally describe the
entropy and mutual information based detectors.
B.6.2 Malware Detection Using Keystroke Entropy
B.6.2.1 Definition of Keystroke Entropy
Entropy is an information-theoretic measure that can capture the spread/variance of a
distribution quite effectively [43]. Define $X_n = \{p_i^n, i \in K_n\}$ as the histogram of
keystrokes in a time-window $n$, where $p_i^n$ is the number of times keystroke $i$ was used
in time-window $n$. Note that due to MSDN’s virtual key code definition,
$K_n = \{1, 2, \ldots, 255\}$. Let $p^n = \sum_{i \in K_n} p_i^n$ be the aggregate frequency of keystrokes
observed in window $n$. Then sample entropy of the keystroke histogram for window $n$
is
$$H(X_n) = -\sum_{i \in K_n} \frac{p_i^n}{p^n} \log_2 \frac{p_i^n}{p^n}. \tag{B.54}$$
If there is no traffic in a window $n$ (i.e., $p^n = 0$) then malware detection is not
performed. Based on previous results, we know that for legitimate sessions, $X_n$ has
small variance and therefore the keystrokes’ entropy should be low. On the other hand,
once a self-propagating malicious code starts initiating sessions, the keystrokes will be
drawn from the marginal keystroke distribution of the all-keys data. Hence the variance
and consequently the entropy of $X_n$ should increase.
We compute keystroke entropy on a window-by-window basis. The results reported in this chapter use a window size of 60 seconds. In each window with one or more sessions, we compute the keystroke histogram $X_n$, which is used in equation (B.54) to compute the entropy. The marginal keystroke histogram is generated from the first 500 entries of the all-keys data.
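The per-window computation of (B.54) can be illustrated with a minimal Python sketch. The function name `keystroke_entropy` and the two example windows are hypothetical and are not taken from the thesis data; the sketch only assumes that each window is available as a list of the virtual key codes that initiated sessions in that window.

```python
import math
from collections import Counter

def keystroke_entropy(keys):
    """Sample entropy (in bits) of one window's keystroke histogram, as in (B.54).

    `keys` lists the virtual key codes (1..255) that initiated sessions in the
    window. Returns None for an empty window, where detection is not performed.
    """
    counts = Counter(keys)          # p_i^n: number of uses of key i in window n
    total = sum(counts.values())    # p_n: aggregate keystroke count in window n
    if total == 0:
        return None
    # -sum(p log2 p) written as sum(p log2(1/p)) with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())

benign = [13, 13, 13, 13]      # hypothetical window: a single key initiates all sessions
infected = [13, 40, 38, 17]    # hypothetical window: four distinct keys
print(keystroke_entropy(benign))     # 0.0
print(keystroke_entropy(infected))   # 2.0
```

A window dominated by one key yields zero entropy, while keys spread evenly over four codes yield two bits, which is the direction of change the detector relies on.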
B.6.2.2 Entropy Perturbations in the Infected Profiles
We use the infected profiles described in Section B.4.6 to evaluate the performance of
the entropy-based detector throughout this chapter. Since the present detector does not
rely on source and destination ports, there is no need to evaluate against the simulated
malware described in Section B.4.5. Therefore, throughout this chapter we only focus on
detection using the 9 real worms collected for this study. When we used keystroke-
entropy for detection of randomly inserted infections, we observed a number of noisy
spikes due to variations in benign user behavior. We use a median filter to remove the
spikes that arise due to inherent changes in legitimate user behavior. Henceforth, all
results use an order-7 median filter.
The entropies of different endpoints randomly infected with a single infection of a
malicious code are outlined in Figure 42. It can be observed in Figure 42 that keystrokes’
entropy clearly highlights anomalous behavior in all cases. The increase in entropy is
revealed for both high- and low-rate malware, and for endpoints with high and low
session rates. Thus we conclude that entropy of keystroke histograms is a robust feature
that can be leveraged for self-propagating malware detection on network endpoints.
[Figure 42: six panels plotting keystroke entropy (y-axis) against time window (x-axis): (a) endpoint 1, Blaster; (b) endpoint 3, Forbot-FU; (c) endpoint 6, MyDoom-A; (d) endpoint 9, Rbot-AQJ; (e) endpoint 11, SoBig.E; (f) endpoint 13, Zotob.G.]
Figure 42. Entropy of the keystroke histograms at infected endpoints. Infection start times are marked with a circle. Infections last approximately 15 minutes. Each non-overlapping time-window is 60 seconds.
B.6.3 Malware Detection Using Session-Key Mutual Information
In this section, in addition to the keystroke distribution, we also characterize the
session information in a probabilistic framework. We show that the conditional mutual
information of the session and keystroke distributions can clearly highlight anomalous
behavior.
B.6.3.1 Mutual Information of Sessions and Keys
Mutual information [43] is an information-theoretic measure of the statistical dependence between two random variables. Consider two random variables $X$ and $Y$ with marginal distributions $p(x)$ and $p(y)$, and a joint distribution $p(x, y)$. The mutual information of these random variables is defined as

$$I(X;Y) = \sum_x \sum_y p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}. \tag{B.55}$$

Mutual information is a non-negative measure of the dependence between $X$ and $Y$, with $I(X;Y) = 0$ when $X$ and $Y$ are independent. In general, $I(X;Y)$ increases with an increase in the correlation between $X$ and $Y$.
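As a small illustration of (B.55), the following Python sketch evaluates the mutual information of a joint probability table. The function name `mutual_information` and the two 2x2 tables are hypothetical examples chosen to show the two extreme cases; they are not data from this study.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits for a joint probability table joint[x][y], as in (B.55)."""
    px = [sum(row) for row in joint]           # marginal p(x)
    py = [sum(col) for col in zip(*joint)]     # marginal p(y)
    total = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:                        # terms with p(x, y) = 0 contribute 0
                total += pxy * math.log2(pxy / (px[x] * py[y]))
    return total

independent = [[0.25, 0.25], [0.25, 0.25]]    # p(x, y) = p(x) p(y)
dependent = [[0.5, 0.0], [0.0, 0.5]]          # Y is fully determined by X
print(mutual_information(independent))   # 0.0
print(mutual_information(dependent))     # 1.0
```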
To leverage mutual information in the present context, we define $X$ as a binary random variable which characterizes the probability of whether or not a session was initiated in the last time window. That is,

$$X \in \{0 \Rightarrow \text{no session in time window},\ 1 \Rightarrow \text{one or more sessions in time window}\}.$$

Moreover, we define $Y$ as a random variable characterizing the keystrokes’ probability distribution. Specifically, the marginal $p(Y)$ distribution is simply the normalized all-keys histogram, such as the ones shown in Figure 41. Then the session-keystroke mutual information can be written as:

$$I(X;Y) = \sum_{y \in K_n} \left[ p(x=0, y) \log_2 \frac{p(x=0, y)}{p(x=0)\,p(y)} + p(x=1, y) \log_2 \frac{p(x=1, y)}{p(x=1)\,p(y)} \right]. \tag{B.56}$$
We derive the marginal $p(X)$ distribution using the first 500 entries of each endpoint’s benign session-key profile. More specifically, $p(X)$ is computed by counting the total number of windows $n$ with one or more sessions between the 1st session and the 500th session. We also count the total number of windows $N$ (with and without sessions) in that time frame. Then, $p(X=1) = n/N$ and $p(X=0) = 1 - p(X=1)$. The joint distribution $p(x=1, y=j)$ then simply corresponds to the joint probability that a network session was initiated using keystroke $j$.
From the data collection chapter, we know that the keystroke information is not logged when there are no network sessions in a window. That is, we do not have the distribution $p(x=0, y)$. Hence we cannot use the mutual information expression of (B.56) in its present form. To resolve this problem, we employ a partial mutual information measure $I(X=1;Y)$, which only uses the $p(x=1, y)$ probability distribution. Since the partial mutual information employs only one outcome of the random variable $X$, it can be written as

$$I(X=1;Y) = \sum_y p(x=1, y) \log_2 \frac{p(x=1, y)}{p(x=1)\,p(y)}. \tag{B.57}$$
Note that due to the binary nature of the session random variable $X$, the partial mutual information $I(X=1;Y)$ is in fact the self-information of $p(x=1, y)$ normalized by $p(x=1)$. For brevity, we continue to refer to this measure as mutual information.
The above characterization describes the correlation between network sessions and keystrokes in a simple and intuitive manner. Based on previous results, we know that for legitimate activity $X$ and $Y$ are highly correlated. Therefore, their mutual information should be high. Once a self-propagating malicious code starts initiating sessions, the keystrokes will be drawn from the marginal $p(Y)$ distribution and therefore the correlation between $X$ and $Y$ should drop.
As in the last section, results reported in this chapter use a window size of 60 seconds. In each window with one or more sessions, we compute the joint conditional distribution $p(x, y \mid x=1)$, which is then used to compute the conditional mutual information. The marginals $p(Y)$ and $p(X)$ are generated from the first 500 values of the all-keys and session-key data, respectively.
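One way to evaluate the partial measure of (B.57) for a single window is sketched below in Python. Several choices here are assumptions made for illustration and are not stated in the text: the joint probability is formed as $p(x=1, y) = p(x=1)\,p(y \mid x=1)$ with $p(y \mid x=1)$ taken from the window's normalized keystroke histogram, keys missing from the all-keys marginal are given a small floor probability, and the names `partial_mutual_information`, `p_x1`, `p_y` and `floor`, as well as the example numbers, are hypothetical.

```python
import math
from collections import Counter

def partial_mutual_information(window_keys, p_x1, p_y, floor=1e-6):
    """Partial mutual information I(X=1;Y) of (B.57) for one window, in bits.

    window_keys: key codes that initiated sessions in the window
    p_x1:        trained p(X=1), i.e. the fraction of windows with sessions
    p_y:         dict mapping key code -> all-keys marginal probability p(y)
    floor:       assumed probability for keys absent from p_y (avoids log of 0)
    """
    counts = Counter(window_keys)
    total = sum(counts.values())
    if total == 0:
        return None                             # no sessions: window is not evaluated
    mi = 0.0
    for key, c in counts.items():
        p_joint = p_x1 * c / total              # p(x=1, y) = p(x=1) * p(y | x=1)
        p_key = max(p_y.get(key, 0.0), floor)
        mi += p_joint * math.log2(p_joint / (p_x1 * p_key))
    return mi

# Hypothetical all-keys marginal in which key 13 is rare overall.
p_y = {13: 0.05, 40: 0.30, 38: 0.30, 17: 0.35}
benign = partial_mutual_information([13, 13, 13, 13], 0.5, p_y)
infected = partial_mutual_information([13, 40, 38, 17], 0.5, p_y)
print(benign > infected)
```

Under these assumptions the measure is larger when the window's keys differ strongly from the all-keys marginal, and it shrinks as the window's keys begin to resemble that marginal.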
B.6.3.2 Mutual Information Perturbations in the Infected Profiles
Similar to the entropy-based keystroke perturbations, we observed some noisy mutual information spikes. Therefore, as with the entropy-based technique, we use an order-7 median filter to remove these spikes. The mutual information of different endpoints randomly infected with a single infection of a malicious code is shown in Figure 43. Session-keystroke mutual information clearly highlights anomalous behavior for both high- and low-rate malware and endpoints. In the benign data, the mutual information is consistently high because only a few keys are used to initiate most of the sessions. Once compromised, the endpoint’s marginal keystrokes get flagged as session initiation keys. The mutual information drops in Figure 43 occur because the marginal all-keys distribution has very little correlation with network sessions.
The keystroke-based measures proposed in this chapter are fairly independent of the
rate of session initiation. This is a unique attribute of the present techniques because other
network-based anomaly detectors implicitly or explicitly use this rate for detection.
Consequently detection and false alarm rates of such detectors are dependent on the
scanning rate of the malicious code. The techniques proposed in this chapter jointly
consider sessions and keystrokes and are therefore not entirely dependent on the session
rate.
In the following section, we develop an automated tool that uses keystroke entropy
and mutual information values for real-time malware detection.
[Figure 43: six panels plotting session-key mutual information (y-axis) against time window (x-axis): (a) endpoint 1, Blaster; (b) endpoint 3, Forbot-FU; (c) endpoint 6, MyDoom-A; (d) endpoint 9, Rbot-AQJ; (e) endpoint 11, SoBig.E; (f) endpoint 13, Zotob.G.]
Figure 43. Mutual information of the session and keystroke random variables at infected endpoints. Infection start times are marked with a circle. Infections last approximately 15 minutes. Each non-overlapping time-window is 60 seconds.
B.6.3.3 Automated Detection using Keystroke Perturbations
As mentioned in previous sections, we use an order-7 median filter to filter out the noise in the keystroke entropy and mutual information values. To leverage the filtered values in a real-time and automated fashion, we train the entropy detector using the first 50 benign keystroke entropy values of an endpoint, and the mutual information based detector using the first 10 benign mutual information values. We find the sample mean and sample standard deviation of the entropy values of an endpoint. An alarm is raised when the filtered entropy value observed in a window is more than the mean plus three standard deviations. Similarly, we find the sample mean and sample standard deviation of the mutual information values. An alarm is raised when the filtered mutual information value in a window is less than the mean plus one standard deviation.
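The filtering and thresholding steps described above can be sketched in Python as follows. This is an illustrative sketch, not the tool used in this study: it assumes a trailing (causal) median window so that the filter can run online, it assumes that training uses the filtered values, and the function names and the synthetic test series are hypothetical. The thresholds follow the rules stated in the text.

```python
import statistics

def median_filter(values, order=7):
    """Trailing median over the last `order` values (assumed causal variant)."""
    return [statistics.median(values[max(0, i - order + 1): i + 1])
            for i in range(len(values))]

def train_threshold(benign_values, k):
    """Sample mean plus k sample standard deviations of the training values."""
    return statistics.mean(benign_values) + k * statistics.stdev(benign_values)

def entropy_alarms(entropies, n_train=50, k=3.0, order=7):
    """Windows whose filtered entropy is more than mean + k * std."""
    filtered = median_filter(entropies, order)
    threshold = train_threshold(filtered[:n_train], k)
    return [i for i, h in enumerate(filtered) if i >= n_train and h > threshold]

def mutual_info_alarms(mis, n_train=10, k=1.0, order=7):
    """Windows whose filtered mutual information is less than mean + k * std."""
    filtered = median_filter(mis, order)
    threshold = train_threshold(filtered[:n_train], k)
    return [i for i, v in enumerate(filtered) if i >= n_train and v < threshold]

# Synthetic series: 60 benign windows followed by 10 high-entropy windows.
series = [1.0, 1.2] * 30 + [5.0] * 10
print(entropy_alarms(series))   # [63, 64, 65, 66, 67, 68, 69]
```

With a trailing order-7 median, the filtered value only crosses the threshold once four of the last seven windows are anomalous, which is why the first alarm in the synthetic series appears at window 63 rather than window 60.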
We use the infected profiles used in the last chapter for performance evaluation of the present malware detectors. Thus there are 100 non-overlapping random infections of each malicious code in every endpoint’s benign profile. As discussed earlier, each infection lasts approximately $T = 15$ minutes. Hence, all results provided in this section are averaged over one hundred experiments per endpoint per malicious code. We compute detection and false alarm rates for each experiment as follows. For 100 infections of a particular malicious code on an endpoint, the percentage detection rate for that malicious code is computed by simply counting the number of infections that are detected by the malware detector. The false alarm rate is computed by taking the ratio of the total number of false alarms to the total number of evaluated time-windows (i.e., windows with one or more sessions).
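The two rates defined above can be computed as in the following Python sketch. The representation is an assumption made for illustration: infections are given as hypothetical (start, end) window-index pairs with the end exclusive, an infection counts as detected if at least one alarm falls inside its span, and any alarm outside every infection span counts as a false alarm.

```python
def detection_rate(infections, alarms):
    """Percentage of infections with at least one alarm inside their span."""
    alarm_set = set(alarms)
    detected = sum(any(w in alarm_set for w in range(start, end))
                   for start, end in infections)
    return 100.0 * detected / len(infections)

def false_alarm_rate(infections, alarms, evaluated_windows):
    """Percentage of evaluated windows that raised an alarm outside any infection."""
    infected = {w for start, end in infections for w in range(start, end)}
    false_alarms = sum(1 for w in alarms if w not in infected)
    return 100.0 * false_alarms / len(evaluated_windows)

# Hypothetical experiment: two 15-window infections, three alarms, 100 evaluated windows.
infections = [(10, 25), (40, 55)]
alarms = [12, 30, 41]
print(detection_rate(infections, alarms))                     # 100.0
print(false_alarm_rate(infections, alarms, range(100)))       # 1.0
```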
The average detection and false alarm rates of the entropy and mutual information based detectors are shown in Figure 44. Figure 44 (a) shows that the detection rate of the entropy-based technique is 100% for all endpoints and all malware. The detection rate of the mutual information detector is 100% for all endpoints except endpoint 1, which has an average detection rate of 99.66%. Thus both of the proposed detectors provide very high detection accuracy. Figure 44 (b) shows that the mutual information detector has negligible false alarm rates. The keystroke-entropy detector has slightly higher false alarm rates than the mutual information detector; the highest false alarm rate of 2.39% was observed at endpoint 12. Hence, overall, both malware detectors proposed in this chapter provide very high accuracy for the diverse set of endpoints and malware considered in this study.
The proposed detectors provide much higher detection rates than the maximum-entropy and the rate-limiting detectors. As mentioned before, the false alarm rates of the maximum-entropy and rate-limiting detectors for the high session rate endpoints (endpoints 3 and 4) are extremely high. The reasons for the inferior performance of these detectors have been highlighted in the last chapter.
The detection accuracy of the keystroke-based detectors proposed in this chapter is better than that of the K-L/SVM-based detector of the last chapter. The false alarm rate of the keystroke-entropy detector is slightly higher than that of the K-L/SVM-based detector. The mutual information detector provides false alarm rates which are comparable to those of the K-L/SVM detector. Also, the keystroke-based detectors have lower complexity than the K-L/SVM based detector since they do not require a complex learning tool for automated detection. The complexity of computing the keystrokes’ entropy and mutual information is also low because these measures are computed on a very small sample space comprising only the session initiation keystrokes used in the last time-window. However, the training time required for the keystroke-based detectors is higher than that of the K-L/SVM detector. The high detection accuracy and low complexity of the keystroke-based malware detectors are a consequence of jointly using network- and host/OS-level information. In summary, if high detection accuracy and low complexity are the main objectives, then the keystroke-based detectors should be used. If low false alarm rates and small training times are desired, then the K-L/SVM-based detector is more suitable. Nevertheless, all detectors proposed in this thesis provide highly accurate and fast detection of self-propagating malware.
[Figure 44: panels (a) detection rate and (b) false-alarm rate; x-axis: endpoint ID; y-axes: average detection rate % and average false alarm rate %; legend: Mutual Info Detector, Key-Entropy Detector, MaxEnt Detector, Rate-Limiting Detector.]
Figure 44. Comparison of detection and false-alarm rates of the mutual information based and keystroke-entropy based malware detectors with maximum-entropy [14] and rate-limiting [20] detectors. Each point is averaged over 9 malicious codes with 100 random infections per malicious code per endpoint.
CHAPTER B.7 ATTACKS AND COUNTERMEASURES
In this chapter, we discuss attacks that can circumvent the proposed malware
detectors and possible countermeasures to mitigate these attacks.
B.7.1 Mimicry Attack
In a mimicry attack [66], a malware tries to hide its traffic inside benign traffic to
avoid detection. There are two mimicry attacks that can be launched against the
K-L/SVM based malware detector. Under the first attack, a malware can use ports that are
frequently used by an endpoint. While this attack can mimic non-attack ports, mimicry of
attack ports is not possible because vulnerabilities targeted by a malware are associated
with fixed ports, and consequently the destination ports of outgoing scan packets are
fixed. Thus, even with mimicked non-attack ports, the proposed detector can detect
perturbations in the attack port distribution, as shown by the CodeRed II results in
Section B.5.2.2.
Another type of mimicry attack on the K-L/SVM detector can be launched by a very low-rate malware which can hide its traffic within benign traffic, while keeping the total number of sessions under γ, where γ is the threshold number of sessions below which malware detection is not invoked. As mentioned in Section B.5.2.1, for the endpoints of this study the values of γ were very small, ranging between 0.15 and 0.65 sessions per minute, with an average of 0.33 sessions per minute. A mimicking malware with less than γ sessions per time-window will have a very slow propagation rate, and hence will allow human countermeasures.
A mimicry attack can be launched against the keystroke-based detectors by a malware
which always initiates its scanning sessions after a certain predefined time has elapsed
since the last keystroke. Such a malicious session will not be evaluated by the proposed
keystroke-based detectors. To mitigate this attack, the time threshold for logging the
session initiation keystroke can be made adaptive. Also, we are currently investigating
the efficacy of the keystroke-based detectors in a scenario when the last keystroke is
always logged irrespective of the time elapsed since that keystroke.
B.7.2 Attack by Acquiring System-Level Privileges
On an endpoint where security policies and user-privileges are not appropriately
defined, a malware after compromising the endpoint can gain system-level privileges and
can in turn disable the malware detector or overwrite keyboard/mouse buffers [33]. This
vulnerability is a consequence of the design of contemporary operating systems and the
lack of appropriate user rights management. All endpoint-based malware detectors suffer
from this vulnerability. This attack can be mitigated by appropriate security policing and
user management. To completely defeat this attack, a trusted computing platform [67] or
a virtual machine [49] must be employed. Design of such operating systems is presently
an area of active research [68]-[71].
CHAPTER B.8 CONCLUSIONS AND FUTURE WORK
In this part, we proposed information-theoretic malware detection techniques for
network endpoints. The first technique leveraged the K-L divergence from an endpoint’s
benign port usage to detect malicious activity. The second set of techniques used the
entropy and mutual information of keystrokes that are used to initiate network sessions to
detect malware propagation. All of the proposed techniques were highly accurate and
provided significant improvements over existing methods.
As future work, we intend to increase the number of endpoints on which data are
collected. Moreover, we are currently collecting data on local area networks to see if the
network-based malware detector of Section B.5 can provide good performance when
deployed on LANs. We are also investigating effective countermeasures against the
attacks outlined in the last section.
PART-B REFERENCES
[1] D. Ellis, J. G. Aiken, K. S. Attwood, and S. D. Tenaglia, “A behavioral approach to worm detection,” ACM WORM, October 2004.
[2] C. C. Zou, L. Gao, W. Gong, and D. Towsley, “Monitoring and early warning of Internet worms,” ACM CCS, October 2003.
[3] J. Wu, S. Vangala, and L. Gao, “An effective architecture and algorithm for detecting worms with various scan techniques,” NDSS, February 2004.
[4] S. E. Schechter, J. Jung, and A. W. Berger, “Fast detection of scanning worm infections,” RAID, September 2004.
[5] J. Jung, V. Paxson, A. W. Berger, and H. Balakrishnan, “Fast portscan detection using sequential hypothesis testing,” IEEE Symposium on Security and Privacy, May 2004.
[6] N. Weaver, S. Staniford, and V. Paxson, “Very fast containment of scanning worms,” Usenix Security Symposium, August 2004.
[7] A. Lakhina, M. Crovella, and C. Diot, “Diagnosing network-wide traffic anomalies,” ACM Sigcomm, August/September 2004.
[8] A. Lakhina, M. Crovella, and C. Diot, “Characterization of network-wide traffic anomalies in traffic flows,” ACM/Usenix IMC, October 2004.
[9] P. Barford, J. Kline, D. Plonka, and A. Ron, “A signal analysis of network traffic anomalies,” ACM/Usenix IMC, November 2002.
[10] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen, “Sketch-based change detection: Methods, evaluation, and applications,” ACM/Usenix IMC, October 2003.
[11] A. Soule, K. Salamatian, and N. Taft, “Combining filtering and statistical methods for anomaly detection,” ACM/Usenix IMC, October 2005.
[12] Y. Kim, W. C. Lau, M. C. Chuah, and H. J. Chao, “PacketScore: Statistics-based overload control against distributed denial-of-service attacks,” IEEE Infocom, March 2004.
[13] A. Lakhina, M. Crovella, and C. Diot, “Mining anomalies using traffic feature distributions,” ACM Sigcomm, August 2005.
[14] Y. Gu, A. McCullum, and D. Towsley, “Detecting anomalies in network traffic using maximum entropy estimation,” ACM/Usenix IMC, October 2005.
[15] D. Moore, C. Shannon, G. M. Voelker, and S. Savage, “Network Telescopes,” CAIDA technical report, http://www.caida.org/outreach/papers/2004/tr-2004-04/.
[16] E. Cooke, M. Bailey, Z. M. Mao, D. Watson, F. Jahanian, and D. McPherson, “Toward Understanding Distributed Blackhole Placement,” ACM WORM, October 2004.
[17] M. Bailey, E. Cooke, F. Jahanian, J. Nazario, and D. Watson, “The Internet Motion Sensor: A distributed blackhole monitoring system,” NDSS, February 2005.
[18] D. Dagon, X. Qin, G. Gu, and W. Lee, “HoneyStat: Local worm detection using Honeypots,” RAID, September 2004.
[19] J. Twycross and M. M. Williamson, “Implementing and testing a virus throttle,” Usenix Security Symposium, August 2003.
[20] M. M. Williamson, “Throttling viruses: Restricting propagation to defeat malicious mobile code,” ACSAC, December 2002.
[21] S. Sellke, N. B. Shroff, and S. Bagchi, “Modeling and automated containment of worms,” DSN, June/July 2005.
[22] D. Whyte, E. Kranakis, and P. C. van Oorschot, “DNS-based detection of scanning worms in an enterprise network,” NDSS, February 2005.
[23] A. Gupta and R. Sekar, “An approach for detecting self-propagating email using anomaly detection,” RAID, September 2003.
[24] J. Xiong, “ACT: Attachment chain tracing scheme for email virus detection and control,” ACM WORM, October 2004.
[25] W. Cui, R. H. Katz and W-T. Tan, “BINDER: An Extrusion-based Break-In Detector for Personal Computers,” Usenix Security Symposium, April 2005.
[26] K. Ilgun, R. A. Kemmerer, and P. A. Porras, “State Transition Analysis: A Rule-based Intrusion Detection Approach,” IEEE Transactions on Software Engineering, vol. 21, no. 3, pp. 181-199, March 1995.
[27] S. Jha, K. Tan, and R.A. Maxion, “Markov Chains, Classifiers, and Intrusion Detection,” IEEE CSFW, June 2001.
[28] N. Ye, “A Markov Chain Model of Temporal Behavior for Anomaly Detection,” IEEE Workshop on Information Assurance and Security, June 2000.
[29] W. DuMouchel, “Computer Intrusion Detection Based on Bayes Factors for Comparing Command Transition Probabilities,” Tech. Rep. 91, National Institute of Statistical Sciences, 1999.
[30] A. Lazarevic, A. Ozgur, L. Ertoz, J. Srivastava, and V. Kumar, “A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection,” SIAM Conference on Data Mining, May 2003.
[31] R. P. Lippmann et al., “The 1998 DARPA/AFRL Off-line Intrusion Detection Evaluation,” RAID, September 1998.
[32] R. P. Lippmann, J. W. Haines, D. J. Fried, J. Korba, and K. Das, “The 1999 DARPA Off-line Intrusion Detection Evaluation,” ACM Computer Networks, vol. 34, no. 4, October 2000.
[33] Endpoint Security Homepage, http://www.endpointsecurity.org/.
[34] “Symantec Internet Security Threat Report - Trends for January 05 - June 05,” Volume VIII, September 2005.
[35] T. Raschke, “The New Security Challenge: Endpoints,” IDC/F-Secure, August 2005.
[36] N. Weaver, D. Ellis, S. Staniford, and V. Paxson, “Worms vs. Perimeters: The case for Hard-LANs,” IEEE Symposium on High Performance Interconnects (Hot Interconnects), August 2004.
[37] C. Wong, C. Wang, D. Song, S. Bielski, and G. R. Ganger, “Dynamic quarantine of Internet worms,” DSN, July 2004.
[38] C. Wong, S. Bielski, A. Studer, and C. Wang, “Empirical Analysis of Rate Limiting Mechanisms,” RAID, September 2005.
[39] Q. Li, E-C Chang, and M. C. Chan, “On effectiveness of DDOS attacks on statistical filtering,” IEEE Infocom, March 2005.
[40] A. Kuzmanovic and E. W. Knightly, “Low-rate TCP-targeted denial of service attacks,” ACM Sigcomm, August 2003.
[41] S. Staniford, V. Paxson, and N. Weaver, “How to 0wn the Internet in your spare time,” Usenix Security Symposium, August 2002.
[42] S. Panjwani, S. Tan, K. M. Jarrin, and M. Cukier, “An experimental evaluation to determine if port scans are precursor to an attack,” DSN, June/July 2005.
[43] T. M. Cover and J. A. Thomas, “Elements of Information Theory,” Wiley-Interscience, 1991.
[44] J. Lin, “Divergence Measures Based on the Shannon Entropy,” IEEE Transactions on Information Theory, vol. 37, no. 3, January 1991.
[45] D. H. Johnson and S. Sinanovic, “Symmetrizing the Kullback-Leibler Distance,” Technical Report, March 2001.
[46] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[47] “The Secure Hash Algorithm,” FIPS PUB 180-1, April 1995.
[48] MSDN Library, http://msdn.microsft.com.
[49] Microsoft Virtual PC 2004, http://www.microsoft.com/Windows/virtualpc.
[50] Symantec Security Response, W32.Zotob.G, http://securityresponse.symantec.com/avcenter/venc/data/w32.zotob.g.html.
[51] Sophos Virus Info, W32/Forbot-FU, http://www.sophos.com/virusinfo/analyses/w32forbotfu.html.
[52] Sophos Virus Info, W32/Sdbot-AFR, http://www.sophos.com/virusinfo/analyses/w32sdbotafr.html.
[53] Sophos Virus Info, Troj/Dloader-NY, http://www.sophos.com/virusinfo/analyses/trojdloaderny.html.
[54] Symantec Security Response, W32.SoBig.E@mm, http://securityresponse.symantec.com/avcenter/venc/data/[email protected].
[55] Symantec Security Response, W32.MyDoom.A@mm, http://securityresponse.symantec.com/avcenter/venc/data/[email protected].
[56] Symantec Security Response, W32.Blaster.Worm, http://securityresponse.symantec.com/avcenter/venc/data/w32.blaster.worm.html.
[57] Symantec Security Response, W32/Rbot-AQJ, http://www.sophos.com/virusinfo/analyses/w32rbotaqj.html.
[58] TrendMicro Virus Encyclopedia, WORM_RBOT.CCC, http://au.trendmicro-europe.com/smb/vinfo/encyclopedia.php?LYstr=VMAINDATA&vNav=3&VName=WORM_RBOT.CCC.
[59] Symantec Security Response, W32.Witty.Worm, http://securityresponse.symantec.com/avcenter/venc/data/w32.witty.worm.html.
[60] C. Shannon and D. Moore, “The spread of the Witty worm,” IEEE Security & Privacy, vol. 2, no. 4, pp. 46-50, July/August 2004.
[61] Symantec Security Response, CodeRed II, http://securityresponse.symantec.com/avcenter/venc/data/codered.ii.html.
[62] D. Moore, C. Shannon, and J. Brown, “Code-Red: A case study on the spread and victims of an Internet worm,” ACM/Usenix IMC, November 2002.
[63] A. Kumar, V. Paxson, and N. Weaver, “Exploiting underlying structure for detailed reconstruction of an Internet-scale event,” ACM/Usenix IMC, October 2005.
[64] W. S. Sarle, “AI FAQ,” http://www.faqs.org/faqs/ai-faq/neural-nets/.
[65] S. Axelsson, “The base-rate fallacy and its implications for the difficulty of intrusion detection,” RAID, September 1999.
[66] D. Wagner and P. Soto, “Mimicry Attacks on Host-Based Intrusion Detection Systems,” ACM CCS, Nov. 2002.
[67] Trusted Computing Alliance, https://www.trustedcomputinggroup.org.
[68] G. Dunlap, S. King, S. Cinar, M. Basrai, and P. Chen, “ReVirt: Enabling intrusion analysis through virtual-machine logging and replay,” Usenix OSDI, December 2002.
[69] T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh, “Terra: A virtual machine-based platform for trusted computing,” ACM SOSP, October 2003.
[70] B. W. Lampson, “Computer security in the real world,” IEEE Computer, vol. 37, no. 6, pp. 37–46, June 2004.
[71] M. Rosenblum and T. Garfinkel, “Virtual Machine Monitors: Current technology and future trends,” IEEE Computer, vol. 38, no. 5, pp. 39–47, May 2005.