
Resilience and Survivability for future networking: framework, mechanisms, and experimental evaluation

Deliverable number 2.2a

Deliverable name First draft on new challenge detection approaches

WP number 2

Delivery date 28/02/2010

Date of Preparation 5/3/2010

Editor S. Martin (ULg)

Contributor(s) P. Smith (ULanc), M. Fry (USyd), S. Martin (ULg), L. Chiarello (ULg), M. Fischer (ETHZ), Ch. Rohner (UU), G. Popa (ETHZ)

Internal reviewer M. Schoeller, Ch. Lac

Deliverable D2.2a ResumeNet

Summary

This deliverable presents the problem of challenge detection, that is, how we allow a system to “understand” a challenging situation by letting it identify its occurrences and assess its impact. These two outputs are required to later select a proper mitigation strategy. We motivate this approach and detail how it improves on previous anomaly, intrusion and DoS detection techniques.

We exemplify this approach in the context of wireless mesh networks and opportunistic networks on specific challenges. The different stages of detection are mapped onto monitoring, analysis and configuration activities suited to the considered scenario.

We then present the design of a distributed information store, which assists and captures information exchanges between the detection and correlation algorithms. The gathered data are then automatically organised and stored so that they can be fed to the machine-learning based processes involved in detection, remediation and later diagnostic tasks. We illustrate its behaviour in the well-studied case of resource starvation challenges such as DDoS.

2 out of 58


Contents

1 Introduction / Deliverable Overview

2 Challenge Detection for Resilience
2.1 Motivating Scenarios
2.1.1 Internet Resource Starvation Challenges
2.1.2 Challenges in Community Wireless Mesh Networks
2.1.3 Opportunistic Networks Selfishness Challenges
2.2 Requirements Summary

3 Challenge Detection : State of the Art Review
3.1 Challenge Onset Detection
3.1.1 Scanning Attacks
3.1.2 Denial of Service Attacks
3.1.3 Anomaly Detection
3.1.4 Fault Detection
3.1.5 Summary of Issues
3.2 Challenge Identification
3.2.1 Identifying Scanning Attacks
3.2.2 DoS Attack Classification
3.2.3 Traffic Classification
3.2.4 Fault Identification
3.3 Impact Analysis
3.3.1 Off-Line Impact Analysis
3.3.2 On-Line (or Real-Time) Impact Analysis
3.4 Real-Time and Distributed Challenge Monitoring
3.4.1 Gathering Information
3.4.2 Computational Issues for Traffic Classification
3.4.3 Distributed Monitoring
3.5 Summary: Missing Links and Next Steps

4 Challenge Detection in WMNs
4.1 Analysis: Impact of Interference
4.1.1 Measurement Metrics
4.1.2 Results
4.2 Architecture: A Distributed Challenge Detection, Remediation and Recovery System
4.2.1 Detection
4.2.2 Remediation
4.2.3 Recovery
4.2.4 Resilience Engine
4.2.5 Support Engine
4.2.6 Information Provider
4.3 Realisation: A Distributed Interference Detection and Remediation System
4.3.1 Detection
4.3.2 Analysis
4.3.3 Minimisation
4.3.4 Consulting
4.3.5 Deployment

5 Challenges Detection in Opportunistic Networks
5.1 Resilience for Opportunistic Networks
5.1.1 Service Specification of a Store-Carry-Forward Transport
5.1.2 Realising the D2R2 + DR Strategy
5.1.3 Resilience Metrics for Opportunistic Networking
5.2 Evaluation
5.3 Conclusions

6 Distributed Information Store
6.1 Design Principles
6.1.1 Peer-to-Peer Distributed System
6.1.2 Multi-Resolution Information
6.1.3 Learning-Ready Data Model
6.1.4 Keeping Management Apart
6.2 Offered Services
6.3 Internal Architecture
6.3.1 Publish & Subscribe System
6.3.2 Filtering and Aggregation System
6.3.3 Distributed Storage System
6.3.4 Heterogeneous Storage
6.3.5 Connectivity Layer
6.4 Available Components
6.4.1 Resilient Communication Infrastructure
6.4.2 Peer-to-peer Storage Systems
6.4.3 Pub/Sub Systems
6.4.4 Correlation Engines
6.4.5 Standard Network Monitoring Protocols
6.5 Concept Validation : DDoS Detection
6.5.1 Network-Network Interaction
6.5.2 Network-Server Interaction

7 Conclusion and Future Works

A List Of Publications

B Top 5 Challenges in WMNs


1 Introduction / Deliverable Overview

The broad aim of ResumeNet is to find ways of making networks more resilient to a wide range of challenges including malicious attacks, mis-configurations, accidental faults, and operational overloads. Our definition of resilience is therefore a superset of commonly used definitions for survivability, dependability and fault tolerance. Resilience solutions potentially impact all levels of network architecture, design and operations.

Figure 1: The D2R2 + DR feedback loops for adaptation (real-time) and evolution (long-term) to meet resilience goals.

Our strategy for resilience is summarised in the D2R2 + DR model. In essence, we firstly need to build in mechanisms to defend against challenges, to make the network more robust. However, defence mechanisms may not be practical for permanent and wide deployment (e.g. processing cost is too high), or they may simply fail, for example in the face of new and unknown challenges. So we require mechanisms to detect, in real time, the onset of challenges that are impairing normal operation. We then need to remediate the effects of the challenge by mitigating its damage and minimising its impacts. We then fully recover from the challenge to the original operating state, for example by removing its root cause. This sequence can be viewed as an on-line control loop, as depicted in Fig. 1. Our resilience strategy is completed by two background steps which firstly diagnose the root cause of a discovered challenge and its impact on operations, and then refine our strategy where necessary and possible through deployment of new resilience mechanisms for defence, detection and remediation.

A crucial part of this strategy thus involves detection of the onset of challenges in real time, followed by identification of the challenge in order to initiate appropriate remedial action, allowing an acceptable level of service to be maintained. We believe that current approaches to anomaly detection are insufficient for this task. One goal of this deliverable is therefore to motivate and advocate new autonomic, distributed challenge detection approaches.

We then validate the proposed approach by detailing how we can apply these guidelines to tame the problem of channel interference in wireless mesh networks. Indeed, while this scenario is fairly distinct from DDoS mitigation, it still requires collaborative analysis of measurements and correlation of multiple metrics. A proof-of-concept prototype is described, which allows us to further illustrate how our abstract activities of symptom detection and root cause identification can be mapped onto processes and interact with elementary remediation actuators in this specific context.

In a similar mindset, we have explored the feasibility of our strategy in opportunistic networks. In such networks, node mobility compensates for the lack of end-to-end connectivity, but node selfishness can severely affect performance and compromise reliability of data transmission. Our results in this area show that alternate forwarding protocols can increase reliability even in challenging situations, and we investigate which detection mechanisms can keep operating cost at an acceptable level.

Finally, in order to ease future development of resilience solutions, we propose a distributed information store design that bridges components involved in the real-time resilience control loop (detection / remediation / recovery) through a simple publish/subscribe interface. By participating in their message exchanges, it automatically captures and organises the information that is later used during diagnostic and refinement steps. It can be conceptually seen as a natural extension of the Information Sensing and Sharing (ISS) framework presented in the earlier deliverable D1.5 [SSF+09], where information exchanges by components are used to automate organisation of that information into a range-queriable database. We provide a description of the service’s programming interface and illustrate the use of such a service on a DDoS challenge.

This document is organised as follows: section 2 presents the three scenarios that motivate our work, and our approach to challenge detection. Section 3 reviews state-of-the-art techniques proposed for detecting attacks on the Internet and pinpoints their inadequacies in bringing resilience to the network. We illustrate our approach further on wireless mesh and opportunistic networks in sections 4 and 5, respectively. Finally, the design of the Distributed Store for Challenges and their Outcome (DISco) is presented in section 6.

2 Challenge Detection for Resilience

The detection phase of the resilience strategy must provide enough information for appropriate remediation to be invoked. In essence, we need to be able to do the following, which together build up a representation of the challenge in progress:

1. Detect the challenge (e.g., via anomaly or signature-based detection techniques)

2. Classify the challenge, or understand its root cause

3. Understand the (potential) impact of the challenge

We could apply simple remediation, e.g., aggressive ingress filtering of traffic, on the initial detection of a challenge, e.g., a DoS attack. We argue below that such simplistic approaches are typical of the current state-of-the-art. However, the more we know about the nature of the challenge (i.e., what it is, its cause and the impact on services), the better the remediation that can be invoked. This characterises our novel approach to challenge detection.

2.1 Motivating Scenarios

To motivate our requirements, we first outline three diverse scenarios. These are challenge scenarios that have been identified in live networks and which can significantly impact normal operations. We describe the difficulties faced in managing these challenges. We return later to these scenarios in order to validate our new approach.



2.1.1 Internet Resource Starvation Challenges

A compelling challenge scenario concerns the proper identification of apparent Distributed Denial of Service (DDoS) attacks. In the first phase of a DDoS attack, the attacker gains access to a (often large) number of “zombie” systems by exploiting system vulnerabilities, thus forming a ‘botnet’. In the second phase, the attacker orders the zombies to each launch a flood of packets towards the victim, thus overwhelming the target system and network and disabling its normal operation. DDoS attacks are increasing in frequency and sophistication, as reviewed in Sec. 3.1.2, and are utilised for serious cyber crime in the form of extortion attempts on commercial service providers and for attacks on national security infrastructures. In order to keep one step ahead of defenders, DDoS attackers have adopted a strategy of making malicious traffic difficult to distinguish from legitimate traffic. For example, an “HTTP flood” coordinates a botnet to launch a flood of legitimate HTTP queries to a web server. A difficult problem is then distinguishing between such attacks and legitimate overloads caused by “flash crowds”, since they share common symptoms. Both represent challenges to normal operation but they require different remediation strategies. This necessitates a challenge identification step, based on understanding the full range of symptoms that classify a challenge.

In this example case, the onset of either a DDoS attack or a flash crowd could initially be detected by measuring traffic at the destination network against a threshold and raising an alert. A simple approach would be to assume that the event was malicious, and to remediate it by aggressive ingress filtering. However, this would not be an appropriate response to a flash crowd. Rather, this initial alert could trigger some initial remediation such as TCP rate limiting, while also invoking further analysis of the challenge. In this particular case, analysis of HTTP sources, or HTTP protocol sequences, in the individual flows may provide evidence of challenge identity. However, relatively complex per-flow analysis is currently very difficult or impossible to achieve in real time. An alternative, more tractable approach would be to invoke analysis of the challenge on monitoring entities that are closer to the HTTP sources. This coordinated approach is more scalable. It also enables more precise analysis and identification of the challenge and thus more appropriate remediation. In the case of a malicious DDoS attack, it would permit more effective filtering, closer to the attack sources, and also the identification of sources for further remediation. For a legitimate flash crowd event, it would enable optimum placement of rate limiting remediation.
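The staged response described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of the ResumeNet design: the threshold value, the function names and the remediation labels are hypothetical, and the challenge identity is assumed to be supplied by some later analysis step.

```python
from typing import Optional

# Assumed alert threshold in packets per second; a real deployment would tune it.
ALERT_THRESHOLD_PPS = 10_000.0


def onset_detected(observed_pps: float, threshold: float = ALERT_THRESHOLD_PPS) -> bool:
    """Raise an alert when traffic at the destination network exceeds the threshold."""
    return observed_pps > threshold


def choose_remediation(observed_pps: float, identity: Optional[str] = None) -> str:
    """Pick a remediation label; more knowledge of the challenge gives a more specific choice."""
    if not onset_detected(observed_pps):
        return "none"
    if identity is None:
        # Only the onset is known: apply a cautious interim measure
        # while further analysis is invoked.
        return "tcp-rate-limit"
    if identity == "ddos":
        # Malicious traffic: filter close to the attack sources.
        return "filter-near-sources"
    if identity == "flash-crowd":
        # Legitimate overload: do not filter, place rate limiters instead.
        return "place-rate-limiting"
    # Unrecognised identity: keep the interim measure.
    return "tcp-rate-limit"
```

For example, `choose_remediation(20_000)` returns the interim `"tcp-rate-limit"` label, while `choose_remediation(20_000, "flash-crowd")` avoids filtering legitimate users.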

2.1.2 Challenges in Community Wireless Mesh Networks

Wireless Mesh Networks (WMNs) can be used to provide Internet connectivity in areas where wired deployment is prohibitively expensive, such as in rural areas [IBPR08]. A typical WMN consists of a number of wireless access points (nodes) that route packets from clients to their destination, normally a set of egress points to the Internet. Mesh nodes can run a number of services, such as routing protocols, DNS, and AAA. WMNs are intended to be self-organising and consequently have a minimal management overhead.

The challenges to normal operation that could occur in a WMN are numerous. Because of the relative ease of physical access to the mesh, WMNs are particularly susceptible to infrastructure-based attacks (e.g., man-in-the-middle attacks). This is in addition to threats that other network types face, such as Internet worms and jamming attacks. When a network is being managed by a community of non-expert users, mis-configuration of devices could be a significant issue. Finally, WMNs are particularly susceptible to the elements: for example, poor connectivity from signal attenuation is an issue and, as infrastructure is typically externally mounted, hardware failures are relatively more common.

Stopping these challenges from leading to significant network outages is difficult. The set of hardware resources that can be used to detect the onset of problems can have wildly varying computational capabilities, e.g., from highly capable end-systems to minimally resourced wireless access points. End-systems are normally considered off-limits as they belong to private users, unlike the mesh infrastructure. As mentioned earlier, WMNs can be managed by their potentially non-expert users. This often means there is an undesirable lead time before outages are fixed, because of a lack of time and expertise in dealing with the exceptional network behaviour caused by challenges.

In summary, community-driven WMNs provide an interesting domain in which to consider network resilience issues. This is because of the variety and rate of occurrence of challenges they face, and the difficulty of addressing them due to the potentially limited hardware resources available and the limited time and expertise of the network’s administrators. This last point regarding administrators suggests that some form of automatic detection and remediation is a desirable property of a resilient WMN.

2.1.3 Opportunistic Networks Selfishness Challenges

Access to the world’s networks has become a commodity in a large number of countries, where infrastructure, such as optical fibres, is readily available. However, there are regions, often remote, vast and sparsely populated, and with a relatively poor economic base, where the deployment of constant connectivity is not viable. Projects like N4C¹ or ZebraNet [JOW+02] aim to provide basic email and (cached) Web access to such remote areas using opportunistic networks, improving the quality of life for the population.

The N4C initiative, for example, aims to support the desire of the nomadic Sami people in northern Sweden to attend school remotely, allowing the family to stay together for a longer time. Providing a resilient networking service is therefore of importance.

Opportunistic networks do not assume the existence of an end-to-end path, but instead rely on node mobility, coupled with short-range communication, to disseminate information towards the intended recipient. The success of a data transfer depends on contact opportunities and physical movement of the nodes, but also on their willingness to share their storage capabilities and transmission power to deliver others’ messages.

¹ http://www.n4c.eu

2.2 Requirements Summary

These example scenarios provide motivation for new approaches to challenge detection in order to realise our resilience strategy. We have also analysed a wider range of challenge scenarios as reported in Deliverable 1.1 “Understanding Challenges and their Impact on Network Resilience” [SS09]. On the basis of this analysis, we have developed an approach to challenge detection, which is shown diagrammatically in Figure 2 as a multi-stage process that is necessary to understand challenges so that appropriate mitigation can be carried out. Firstly, it is necessary to detect the onset of a challenge based on its symptoms, and then to identify its root cause and potential impact on the system. The latter entails a practical risk management approach, which assesses the impact of the challenge vs the costs of remediation. The result of this understanding will determine the selection and deployment of appropriate mitigation. Using this process model, and an analysis of a number of real challenge scenarios including those outlined above, we can identify an initial set of requirements:

• Real time detection

• Underpinned by an evolving model of normal behaviour

• Incremental and on-demand challenge identification

• Distributed resilience framework

Figure 2: Aspects of challenge detection necessary for resilience

We need automatic mechanisms for detecting the onset of challenges in real time. These will be based initially on our current understanding of known challenges (e.g., their signatures) but must also be capable of detecting new challenges via an implicit and evolving model of normal behaviour. Resource constraints may require these mechanisms to be relatively lightweight. They may thus (in real time) be capable of simply detecting the initial onset of a potential challenge. We need an incremental strategy to further identify a challenge following its onset. There is thus a requirement for on-demand invocation and deployment of more sophisticated mechanisms for determining the root cause of the challenge and its potential impacts. While this is occurring, we may also wish to invoke some interim mitigation to reduce challenge impacts. Fully identifying a challenge may require deployment of mechanisms across different network nodes, which then cooperate to solve the problem. This requires a distributed resilience framework. Based on this discussion, we can identify the following further requirements for detection:

• Recognition of resource constraints and impacts

• Secure and robust to challenges

• Incremental and scalable deployment

• Evolvable

These requirements include mechanisms that operate under acceptable resource constraints, and which do not themselves negatively impact the operation of the networked system. Nonetheless, our mechanisms must also detect and identify challenges within reasonable time. Our framework must itself be secure and robust to outside challenges, since its critical work needs to be performed while the network is under challenge. It should permit incremental, flexible and scalable deployment, and operation across heterogeneous network environments. Finally, it must be capable of evolution as we refine our resilience strategy and deploy new mechanisms.

In the next section, we review challenge detection work described in the literature and give our assessment of the state-of-the-art. Our aim is to identify the gaps between the state-of-the-art and our proposed approach to challenge detection. Thus the issues to be addressed by our ongoing research are also identified.

3 Challenge Detection : State of the Art Review

We review work to date under three sequential headings that reflect our strategy as shown in Fig. 2. We firstly discuss initial detection of the onset of challenges as they occur, based on a set of challenge symptoms. We then review various techniques for diagnosing and identifying challenges, which is followed by discussion of impact analysis. The goal of our work is to develop practical resilience mechanisms and, as will be seen, the complexity of performing detection in real time is a recurring issue in the literature. We therefore review separately the state-of-the-art in distributed, real-time challenge monitoring.

We focus on attacks on networks and networked systems. We review a range of challenges that represent the vast bulk of challenges that fall within our definition and which have attracted most attention in the literature. Thus we consider scanning attacks, which are used to infect computer systems with malware for a variety of purposes. Denial of Service (DoS) and Distributed Denial of Service (DDoS) are specific forms of attack that are an increasingly serious problem, and are therefore reviewed. We then discuss more generic work on network anomaly detection, for example volume anomaly detection in backbone networks. Such anomalies may be caused by known (e.g. DDoS) or as yet unidentified challenges. Since our definition of challenges is a superset of malicious attacks, we also review work on network fault detection and diagnosis.

3.1 Challenge Onset Detection

3.1.1 Scanning Attacks

Scanning attacks are used by an intruder to probe for vulnerabilities in host computer systems. The aim is to infect a vulnerable system with malware which, when activated, will undertake one or more malicious attacks. Malware includes viruses, worms, Trojan horses and botnets. Substantial work has been undertaken to detect scanning attacks at the level of the local access network. There is a range of commercial and open source intrusion detection systems available (e.g. Snort and Bro). Nearly all use signature-based detection, which exploits patterns of known attacks. Accordingly, they are unable to detect as yet unknown scanning attacks.

Recently, anomaly detection techniques have been explored to detect abnormal, but unknown, patterns of access to networked hosts. For example, in [ACP09], entropy-based detection utilising distributions of IP address and TCP port access is used. However, if an unknown scanning anomaly is detected, a further step of anomaly identification is required in order to pinpoint the problem. Existing scanning attack detection algorithms work quite effectively but are continually evolving. One key issue, however, is their suitability in certain resource-constrained environments.
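As an illustration of the entropy-based idea mentioned above, the following Python sketch computes the Shannon entropy of a window of observed values (for example destination ports) and flags a window whose entropy departs from a baseline. It is a minimal sketch and not the method of [ACP09]: the function names, the `max_shift` threshold and the sample values are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def shannon_entropy(values: Iterable[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_shift_detected(baseline: Iterable[Hashable],
                           window: Iterable[Hashable],
                           max_shift: float = 1.0) -> bool:
    """Flag the window if its entropy differs from the baseline by more than `max_shift` bits."""
    return abs(shannon_entropy(window) - shannon_entropy(baseline)) > max_shift


# Hypothetical destination ports seen in a normal window and during a port scan.
normal_ports = [80] * 8 + [443] * 2
scan_ports = list(range(1000, 1010))  # one probe per port spreads the distribution
```

A scan that touches many distinct ports spreads the port distribution and raises its entropy, so `entropy_shift_detected(normal_ports, scan_ports)` is true here. As the text notes, such a flag only signals an anomaly; identifying it requires a further step.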

3.1.2 Denial of Service Attacks

DoS and DDoS detection techniques follow approaches similar to intrusion detection systems. Signature-based approaches look for patterns of known attacks, such as “ping” floods. Once an attack is identified as matching a known signature, remediation strategies can be invoked appropriate to that particular attack. However, the problem with signature-based detection is that it can only detect known attack patterns, so new forms of attack will remain undetected. The alternative approach is anomaly detection, which looks for abnormal patterns of behaviour such as source IP address spoofing or illegal/abnormal protocol sequences such as SYN floods. Such detection techniques are relatively lightweight and deployable. However, as noted above, in response attackers have adopted a strategy of making malicious traffic more difficult to distinguish from legitimate traffic.
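To make the notion of an abnormal protocol sequence concrete, the sketch below counts, per destination, the sources that sent a SYN but never completed the handshake with an ACK, and reports destinations whose count exceeds a limit. It is a deliberately simplified illustration: the segment representation, the function names and the `max_half_open` limit are assumptions, and a real detector would track individual connections (including ports and timeouts) rather than source addresses.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (source address, destination address, TCP flag), with flag "SYN" or "ACK".
Segment = Tuple[str, str, str]


def half_open_counts(segments: Iterable[Segment]) -> Dict[str, int]:
    """Count, per destination, sources whose SYN was never followed by an ACK."""
    pending: Dict[str, Set[str]] = defaultdict(set)
    for src, dst, flag in segments:
        if flag == "SYN":
            pending[dst].add(src)
        elif flag == "ACK":
            pending[dst].discard(src)
    return {dst: len(srcs) for dst, srcs in pending.items()}


def syn_flood_suspects(segments: Iterable[Segment], max_half_open: int = 100) -> List[str]:
    """Return destinations with more half-open handshakes than `max_half_open`."""
    counts = half_open_counts(segments)
    return sorted(dst for dst, n in counts.items() if n > max_half_open)
```

The check is lightweight, which matches the deployability argument above, but it shares the stated weakness: traffic that completes the handshake, such as an HTTP flood, passes unnoticed.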

There are some excellent taxonomies of DoS and DDoS attacks, and surveys of defence mechanisms [HHP03, MR04, PLR07b]. A key issue is how to distinguish between an attack and a legitimate overload caused by “flash crowds”. It is reported that HTTP flood functions are a common feature of widely available botnet software, and that spoofed attacks are declining [PLR07b].

3.1.3 Anomaly Detection

Network intrusion detection systems (IDSs) aim to detect a range of attacks including unknown attacks. Most commercial IDSs are signature-based. The power of more general anomaly detection schemes lies in their models of normal and/or abnormal behaviour. Different anomaly detection techniques are founded on different assumptions about the data, for example the relative rareness of anomalies or statistical distributions of events. If these assumptions do not match reality, the outcomes can be unacceptable rates of false negatives and false positives [CBK09]. Some approaches use training data but, as with signature-based detection, they are only as good as the data. Anomalies may be caused by malicious attacks, but may also be the result of mis-configurations (e.g. of routing tables), faults due to equipment malfunction or failure, or simply high volumes of legitimate traffic such as ‘alpha flows’.

There are now a number of well-known techniques for rate-based anomaly detection, which look for rates of traffic that diverge from an estimate derived from previous observations. Other approaches apply frequency information-based detection, applying formal methods such as spectral time series analysis and Principal Component Analysis to detect the onset of ‘volume anomalies’ in backbone traffic [HHP03, LCD04]. Entropy-based anomaly detection techniques, in contrast, have been used to measure changes in entropy of different traffic classes [SSRD07] and for more generalised anomaly detection [LCD05]. These techniques have demonstrated promise when applied to traces of network traffic but they are complex [CBK09]. Some are also shown to be sensitive to parameters of the data sets, such as the level of traffic aggregation, and suffer from other robustness problems [RSRD07].
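A rate-based detector of the kind described at the start of this paragraph can be sketched with an exponentially weighted moving average (EWMA) as the estimate. This is one common way to build such an estimate and is not taken from the cited works; the class name, the smoothing factor `alpha` and the relative `tolerance` are assumptions chosen for illustration.

```python
from typing import Optional


class EwmaRateDetector:
    """Flag traffic rates that diverge from an EWMA of previous observations."""

    def __init__(self, alpha: float = 0.2, tolerance: float = 0.5) -> None:
        self.alpha = alpha          # weight given to the newest observation
        self.tolerance = tolerance  # allowed relative divergence from the estimate
        self.estimate: Optional[float] = None

    def update(self, rate: float) -> bool:
        """Return True if `rate` diverges from the current estimate."""
        if self.estimate is None:
            # The first observation only initialises the estimate.
            self.estimate = rate
            return False
        anomalous = abs(rate - self.estimate) > self.tolerance * self.estimate
        if not anomalous:
            # Only normal samples update the estimate, so that a burst
            # does not drag the baseline towards the anomalous rate.
            self.estimate = self.alpha * rate + (1 - self.alpha) * self.estimate
        return anomalous
```

Feeding the rates 100, 105, 95, 102, 400, 101 flags only the 400 sample. Excluding anomalous samples from the estimate is a design choice with a cost: a lasting, legitimate shift in traffic level keeps being flagged until the detector is reset, which is one instance of the assumption mismatch discussed above.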

The ability to detect the onset of anomalies in real time is a function of the rate of traffic and the processing resources available. This is especially challenging on backbone links, but may become more feasible with the emergence of powerful multi-core processors. In [SW09], it is demonstrated that the execution of multiple, rate-based detection algorithms in parallel is achievable. Parallel execution can also achieve greater detection accuracy and sensitivity than any single algorithm. Nonetheless, anomaly detection is at a relatively early stage of development in terms of practical, real-time deployment. Real-time issues are explored further below. Some of these techniques also seek to identify and classify anomalies once detected, which is also discussed below.

3.1.4 Fault Detection

Most research work on fault detection has assumed that initial detection will be undertaken by network management tools. Failures or malfunctions are detected on the component and notifications are sent using a protocol such as SNMP to network management software, which raises alarms. However, a common problem is that a single event, e.g. a line failure, may give rise to multiple alarms. The problem space in the literature is therefore isolation and root cause identification of alarms. This is reviewed below, where we consider challenge identification.
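The multiple-alarm problem can be illustrated with a simple heuristic: given which component each component depends on, an alarmed component is a root cause candidate only if the component it depends on is not itself alarmed. This is a minimal sketch of one well-known correlation idea, not a technique attributed to the reviewed literature; the dependency map and component names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def root_cause_candidates(alarmed: Set[str],
                          depends_on: Dict[str, Optional[str]]) -> List[str]:
    """Return alarmed components whose upstream dependency is not alarmed."""
    return sorted(c for c in alarmed if depends_on.get(c) not in alarmed)


# Hypothetical topology: two hosts behind a router that is reached over one line.
depends_on = {
    "line-1": None,
    "router-a": "line-1",
    "host-x": "router-a",
    "host-y": "router-a",
}
```

With all four components alarmed after a line failure, the function reduces the four alarms to the single candidate `"line-1"`.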

3.1.5 Summary of Issues

A summary of issues for challenge onset detection is as follows. Most detection systems in use today are based on signatures. More generic anomaly detection may often be limited by the availability of computational resources, which are scarce in many environments such as wireless mesh networks. DDoS detection is currently carried out only on the access network that is under attack, and new forms of attack are hard to distinguish from normal events. Known anomaly detection techniques are based on a variety of underlying assumptions that may not match reality under different circumstances. Volume anomaly detection in real time is resource-intensive, leading to different choices of sampling that can impact the accuracy of detection. There are no techniques that can be applied to a variety of challenge types.

3.2 Challenge Identification

Once the onset of a challenge has been detected, the next step according to our strategy is to identify the challenge with as much precision as possible. Approaches to challenge identification range from classification of the challenge into a particular class through to localising or identifying the root cause or source of the challenge. We now review the different approaches to identification under the same challenge headings as the previous section.

3.2.1 Identifying Scanning Attacks

Scanning attack detection is generally signature-based. Thus, once a scanning attack has been detected, its identity is assumed to be that of the known attack from which the signature was derived, and thus appropriate (known) remediation can be initiated. Anomaly-based scanning techniques published thus far do not suggest how to identify or classify the attack.

3.2.2 DoS Attack Classification

A range of formal methods have been explored in order to classify DoS and DDoS attacks, including time series analysis, signal processing and wavelets. A technique that illustrates some of the issues in DoS attack classification is spectral analysis. For example, in [HHP03] the spectral characteristics of attack streams are analysed to distinguish between single-source and multiple-source attacks. This is validated through off-line analysis and simulation experiments. However, such analysis is complex and has not been demonstrated in real time. It is also claimed that spectral analysis is only valid for TCP flows and can easily be spoofed by an attacker [PLR07b]. For all analysis-based techniques, there is a trade-off between complexity (and hence processing speed required) and accuracy of detection and identification [PLR07b].

For DoS or DDoS attacks we ideally want to identify the attack source(s). However, correct identification and remediation is only possible through a distributed, coordinated approach [MR04, PLR07b, KMR02], and there are some significant barriers to achieving this. It is particularly difficult to infer attack paths for DDoS attacks with a large number of sources. There have been a number of “traceback” proposals which aim to identify the IP source of attackers. Many schemes require active monitoring and maintenance of state in routers, which is not feasible in the network core. In [CZS09], a promising traceback identification technique is demonstrated at the AS level that requires minimal overhead and can be deployed incrementally. However, this does not solve the problem of cooperation between autonomous domains.

Filtering of IP traffic at the ingress to the victim’s network is often used to mitigate an attack, but has been shown to be ineffective for certain types of attack, and does not discriminate between attack and legitimate traffic [PLR07b]. For a legitimate flash-crowd event it is more appropriate to rate-limit sources or source aggregates rather than filter. Discrimination between attack and legitimate traffic, followed by appropriate response, requires cooperation between ISPs, and there is a lack of strong incentives to deploy cooperative mechanisms [PLR07b, MR04].

3.2.3 Traffic Classification

Existing traffic classification techniques are either packet-based or flow-based. As the name suggests, packet-based classification is based on inspection of individual packets, either all packets or a sample. The most accurate classification can be achieved by inspecting the entire payload (so-called deep-packet inspection), utilising knowledge of application protocol semantics. However, this is currently computationally complex and, thus, impractical in most real-world circumstances [NA08]. A more tractable form of packet-based classification is port-based, using only transport layer ports in the packet header, but this can be spoofed by determined attackers.

Flow-based classification uses flow records generated by routers and utilises a variety of techniques including machine learning to identify certain anomalies. Volume anomaly detection techniques aim to identify the particular origin-destination (OD) flow that is the cause of the anomaly and also to quantify the extent of the anomaly. Identification accuracy is impacted by the level of sampling and aggregation, and also the trade-off with the computational resources available to the detector [SSRD07, RSRD07]. Most approaches have been verified only against trace data sets, and it is yet to be shown whether anything more than initial onset detection can be performed in real time. It is shown in [KSV07] that it is feasible to detect certain anomalies by aggregating per-flow state, thus making it computationally more realistic in real time. Subsequently, complementary monitoring techniques may be deployed to different parts of the network to further investigate the anomaly. In [ACP09], it is shown that the application of intelligent sampling of flows, specific to different anomalies and using entropy-based detection, provides detection and classification of anomalies that is superior to using un-sampled traffic. This perhaps provides an approach that can scale to real-time detection, but the authors note some significant issues for practical deployment.
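To make the notion of entropy-based detection more concrete, the following minimal sketch computes the Shannon entropy of a traffic feature observed in one measurement window. It is not the method of [ACP09]; the choice of feature (destination port), the window contents and the function name `shannon_entropy` are assumptions made purely for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical destination ports seen in two windows (illustrative data only).
baseline_ports = [80, 443, 80, 53, 443, 80, 22, 443]  # traffic concentrated on a few ports
scan_ports = list(range(1, 9))                         # one packet to each of eight ports

# A port scan spreads traffic over many ports, so the entropy of the
# destination-port distribution rises compared with the baseline window.
print(shannon_entropy(baseline_ports), shannon_entropy(scan_ports))
```

A detector built on this idea would compare the entropy of each window against a learned baseline; how that baseline and the alarm threshold are chosen is outside the scope of this sketch.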

3.2.4 Fault Identification

Work on fault management has focused primarily on identification or localisation of root causes. Detection is assumed to occur via the raising of network management alarms. However, a single event such as a failure in a large and complex network may generate a range of symptoms (alarms), so the difficulty is then to diagnose the root cause.

From its earliest days, work has focused on formal techniques for modelling faults and their relationships. Early approaches proposed probabilistic models, such as Bayesian Networks [LWD92]. The notion of modelling dependencies between network objects, for example through Dependency Graphs, also has a long history [KS95]. Subsequently, a wide range of approaches have been developed using AI techniques, such as rule-based reasoning about event correlations, and modelling of dependencies and causal relationships in networks [lSS04]. To date, much of this work has been verified through formal methods or simulation, and practical deployment of proposed techniques has been little explored.
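As an illustration of how a dependency graph can reduce a set of alarms to root-cause candidates, consider the minimal sketch below. It is not taken from [KS95] or [lSS04]; the component names, the `depends_on` mapping and the simple rule (an alarmed component is a candidate if none of its direct dependencies is alarmed) are assumptions chosen for brevity.

```python
def root_cause_candidates(depends_on, alarms):
    """Return alarmed components none of whose direct dependencies are alarmed.

    depends_on maps each component to the components it relies on.
    """
    alarmed = set(alarms)
    return sorted(
        component
        for component in alarmed
        if not any(dep in alarmed for dep in depends_on.get(component, ()))
    )


# Hypothetical topology: a line failure on link-ab raises alarms on everything above it.
depends_on = {
    "web-service": ["router-b"],
    "router-b": ["link-ab"],
    "router-a": ["link-ab"],
    "link-ab": [],
}
alarms = ["web-service", "router-a", "router-b", "link-ab"]

print(root_cause_candidates(depends_on, alarms))
```

In this example the four alarms collapse to the single candidate `link-ab`. Real systems must additionally cope with lost or spurious alarms, which is where the probabilistic models mentioned above come in.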

Open research questions for fault localisation have been identified in [lSS04]. They are summarised as follows. Faults require identification in application and service layers, and not just the lower layers, since faults may propagate between layers and change their semantics. Most known techniques are inadequate to this task and there is a need for multi-layer fault localisation. Understanding temporal relationships between events can provide information to assist localisation, but there is little work that has addressed this. Some known techniques could be used for temporal correlation, but they suffer from lack of computational scalability. Distributed fault localisation techniques for large and complex networks have been advocated by many researchers, to address problems of scalability and accuracy of diagnosis. However, very little actual work has been undertaken. Similarly, in the Internet environment, benefits can be gained via the cooperation of multiple independent organisations, but there is a lack of proposals to actually achieve such cooperation. Finally, the dynamic nature and resource constraints of wireless networks require more flexible fault models that can be built and updated over time. For wireless environments, multi-stage fault localisation, for example using on-demand probing, is a more appropriate way of “homing in” on a fault.

There is ongoing work that addresses some of these questions. For example, in [WSAL09], a technique is proposed and evaluated based on a network-aware model of fault signatures, which can also evolve through learning. This technique is also shown to be computationally scalable. However, in general, there are still many issues to be overcome.


A summary of issues for challenge identification is as follows. Classification of DoS and DDoS attacks in real time is not yet feasible, leading to a trade-off between complexity and accuracy of identification. Likewise, identification of attack paths and sources is not yet practical in resource terms at the network level, and requires cooperation between autonomous domains for which there are few incentives. Differentiating between attacks and normal traffic overloads, and thus identification of appropriate mitigation, is still difficult. Identification of anomalies requires appropriate choice of anomaly detection techniques, or combinations of techniques, to match the apparent anomaly. Distributed anomaly detection is often mooted as a key approach to these challenges and, thus, an important focus for future research [CBK09]. Formal fault identification techniques have yet to be deployed and operated in practice, and there are many challenges including multi-layer, temporal correlation of alarms, flexible models for resource-constrained environments, and multi-stage, distributed fault identification.

3.3 Impact Analysis

As discussed earlier, to enable appropriate automated mitigation of challenges, ideally, the impact of the challenge on the (socio-technical) system that is being threatened should be understood. The rationale is that if we understand that a particular challenge has the potential to have, or is actually having, a significant impact, we may be more inclined to invoke possibly risky mitigation strategies without the direct involvement of a human. Such decisions could be embodied in policies. There are a number of ways of assessing the impact of a challenge on a system.

Fundamental to being able to measure the impact of a challenge on a system and its associated services are appropriate metrics. We can broadly consider two forms of metrics when considering challenge impact: performance-based and organisational metrics. Performance-based metrics are those that are measurable in the system, such as queue lengths, round-trip delay, and so on. A resource starvation attack, e.g., a DDoS attack, will affect these metrics, with more intense attacks having a greater impact on these forms of metric. However, some forms of challenge impact are not readily measurable, or understood in terms of their effect on performance of the networked system and its associated services. The recent Conficker (Downadup) worm scans a local area network to locate computers that are vulnerable. To avoid detection, it adapts its scanning rate in relation to the available network bandwidth (determined via a number of probe HTTP requests). This worm has the potential to inflict significant damage (e.g., economic or social) on the organisation (or individuals) using the networked system. However, this is not quantifiable using performance-based metrics because of its stealthy behaviour. In some cases, there will be a relationship between these two forms of metric: SLAs describe performance-based requirements (e.g., 99.9995% packet delivery) with a monetary penalty associated with failure to meet the requirement.

Broadly speaking, there are two approaches to understanding the impact challenges have on a networked system: off-line or on-line analysis.

3.3.1 Off-Line Impact Analysis

There are a number of off-line ways of determining the impact of a component failure on a system and its organisation. For example, Component Failure Impact Analysis (CFIA) [vS08] can be used to determine the impact of a component failure on a system, which can be used to generate a Request For Change document, i.e., a document that identifies systems that are vulnerable and need investment for improvement. In summary, this process involves identifying the components of a system, and categorising them based upon whether they are provisioned with redundant components, and what form of redundancy is used (hot or cold standby). Based upon this information, a measure of the impact of component failure on the system, e.g., in monetary loss due to lack of service, can be gleaned for each item under analysis.
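The categorisation step described above can be sketched as follows. This is only an illustration and not the CFIA procedure of [vS08]: the `Component` fields, the rule that maps a redundancy category to an outage duration, and all numeric values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    redundancy: str              # "none", "cold" or "hot" (assumed categories)
    repair_hours: float          # assumed time to repair when there is no standby
    switchover_hours: float = 0.0  # assumed time to activate a cold standby


def outage_hours(component):
    """Assumed outage duration caused by a single failure of `component`."""
    if component.redundancy == "hot":
        return 0.0
    if component.redundancy == "cold":
        return component.switchover_hours
    if component.redundancy == "none":
        return component.repair_hours
    raise ValueError(f"unknown redundancy category: {component.redundancy}")


def failure_impact(components, loss_per_hour):
    """Monetary impact of each component failing, as outage time times hourly loss."""
    return {c.name: outage_hours(c) * loss_per_hour for c in components}


# Hypothetical inventory and hourly loss figure.
inventory = [
    Component("core-router", "none", repair_hours=4.0),
    Component("dns-server", "cold", repair_hours=8.0, switchover_hours=0.5),
    Component("web-server", "hot", repair_hours=2.0),
]
print(failure_impact(inventory, loss_per_hour=1000.0))
```

Components without redundancy dominate the result, which is the kind of finding a Request For Change would be based on.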

A drawback of the CFIA approach is that it only considers failure modes that do not also affect the redundant services, e.g., random hardware failures. Intelligent adversaries and software bugs (that are present across non-diverse redundant nodes) could cause the failure of primary and secondary components. Clearly, a more developed challenge model is needed for this approach to give a reliable measure of the impact of device failure in the context of these forms of challenge.

Another approach is to use attack trees [SDP08] to determine the most likely attack scenarios. Here, a tree structure is developed at the root of which is a compromised system state; the leaves and branches of the tree are populated with scenarios that could lead to the compromised state. Attack trees can be annotated with the cost of an attack or whether a specific edge is possible or not. This information can be used to determine the likely attacks for a given system. Similar approaches can be used with fault and event trees to determine the probability of failures occurring for non-attack scenarios. The challenge with these approaches is the need to be comprehensive, which may be difficult for complex networked systems, and the difficulty of obtaining reliable statistics on the probability of challenge occurrence, particularly for attacks by intelligent adversaries.
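The following minimal sketch shows how cost and feasibility annotations on an attack tree can be evaluated. It assumes the common formulation with AND and OR inner nodes, which the text above does not spell out; the tree contents, the dictionary layout and the function name `min_attack_cost` are hypothetical.

```python
import math


def min_attack_cost(node):
    """Cheapest cost of reaching `node`; math.inf if no feasible attack exists."""
    kind = node["type"]
    if kind == "leaf":
        return node["cost"] if node.get("possible", True) else math.inf
    child_costs = [min_attack_cost(child) for child in node["children"]]
    if kind == "or":    # any one child scenario is sufficient
        return min(child_costs)
    if kind == "and":   # all child scenarios are required
        return sum(child_costs)
    raise ValueError(f"unknown node type: {kind}")


# Hypothetical tree whose root is the compromised system state.
tree = {
    "type": "or",
    "children": [
        {"type": "leaf", "name": "guess password", "cost": 50},
        {
            "type": "and",
            "children": [
                {"type": "leaf", "name": "phish credentials", "cost": 10},
                {"type": "leaf", "name": "bypass second factor", "cost": 30},
            ],
        },
        {"type": "leaf", "name": "physical access", "cost": 5, "possible": False},
    ],
}

print(min_attack_cost(tree))
```

The cheapest feasible scenario here is the two-step phishing branch; the nominally cheaper physical-access leaf is excluded because it is marked as not possible.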

3.3.2 On-Line (or Real-Time) Impact Analysis

There has been much work that investigates how inferences can be made about the impact and nature of failures of the Internet routing infrastructure from end-systems that work collectively to probe the network [ZMZ08, BCKM02, FABK03]. The aim is to infer this information without explicit support from network providers and proprietary toolsets. These approaches take a performance-based measure of impact, and consider round-trip delay, packet loss, and route reachability as metrics for understanding impact. In our framework, we can leverage these approaches to get performance-based metrics on the impact of routing failures without having explicit support from infrastructure.

Hariri et al. describe an agent-based framework for determining the impact of attacks or faults that affect the performance of networked systems [HGD+03, Sch99]. Although the framework is generic in nature, they discuss it in the context of resource starvation attacks on network devices (e.g., routers) and servers. Central to this approach are measures of component and system impact factors that take into consideration an understanding of normal and unacceptable behaviour. For example, a router’s buffer utilisation may normally reside at 50% and be considered unacceptable at 90%. For an attack to move the system into a vulnerable state, and therefore invoke some remedial activities, the router’s buffer utilisation must be greater than or equal to 90%.
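One possible way to express such an impact factor is sketched below. The linear normalisation between the normal and unacceptable operating points is our assumption for illustration and is not necessarily the definition used in [HGD+03]; the function names are hypothetical. The 50% and 90% values are those of the buffer utilisation example above.

```python
def impact_factor(measured, normal, unacceptable):
    """Position of `measured` between normal (0.0) and unacceptable (1.0), clamped."""
    if unacceptable == normal:
        raise ValueError("normal and unacceptable operating points must differ")
    fraction = (measured - normal) / (unacceptable - normal)
    return min(1.0, max(0.0, fraction))


def is_vulnerable(measured, unacceptable):
    """The component is in a vulnerable state once the unacceptable level is reached."""
    return measured >= unacceptable


# Router buffer utilisation in percent: normal at 50, unacceptable at 90.
for utilisation in (50.0, 70.0, 90.0):
    print(utilisation, impact_factor(utilisation, 50.0, 90.0), is_vulnerable(utilisation, 90.0))
```

A policy could then trigger remediation only when `is_vulnerable` holds, while the graded impact factor could be used to rank concurrent challenges.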

For our framework, this approach to impact analysis is very interesting. However, it relies on a measure of normal behaviour, which may be difficult to ascertain in some contexts, e.g., highly dynamic networks such as MANETs. It is clearly targeted at understanding the impact of faults and attacks that have a readily measurable impact on the networked system, and so may prove unsuitable, for example, for determining the impact of stealthy attacks such as Conficker, as discussed earlier. Without some form of root-cause analysis, it may also be difficult to attribute a measured impact to a particular fault or attack in the presence of multiple instances, e.g., in a coordinated DDoS attack.


We have argued that to support appropriate automatic selection of remediation strategies, an understanding of the impact of a challenge would be beneficial. Without this information, it may not be clear whether it is appropriate to initiate a potentially risky remedy. We have suggested there are two forms of metric that are useful for this task: performance-based metrics, which are derived from measurable aspects of the system, and organisational impacts, for example monetary cost, which are not easily measurable from the system features. We have described a number of off-line approaches to understanding the organisational impact of a challenge, including attack trees and component failure impact analysis. An issue for further investigation is to determine how organisational measures of impact can be determined on-line, e.g., via a knowledge base that can be queried and that contains information such as that determined using CFIA. It is also not clear how the two measures of impact relate to each other and how they should be combined into a single measure of impact. For example, should organisational metrics be used to support a decision-making process in the absence of strong performance measures, and how should they be applied in the absence of a strong understanding of the root cause of a challenge?

3.4 Real-Time and Distributed Challenge Monitoring

The problem of performing challenge detection and analysis in real time is a recurring issue. Distributed monitoring is also proposed by many as an ideal solution. We therefore briefly survey the current related work.

3.4.1 Gathering Information

Traditional network management, such as SNMP, periodically polls for low-level management information from the different layers of the protocol stack at end or intermediate systems, such as the current state of protocol variables. As previously mentioned, this provides the raw (alarm) data for many fault management systems. However, this does not meet the information capture needs of more generic distributed challenge detection, such as anomaly detection. While there have been proposals for a common, network-wide measurement plane, such as [CPRW03], there has been no practical realisation of this idea. Nonetheless, some useful work has addressed part of the problem space.

One approach to deriving more than just primitive network management data is cross-layering, which is a technique that has been widely studied as a means to enhance application performance, e.g., [KPS+06]. To date, most cross-layer solutions have been very application and/or network specific, and lack generality or re-usability. In [BCD06], common interfaces are defined to permit sharing of information across layers. Cross-layering can also potentially be used to ‘open up’ communication systems to more sophisticated forms of monitoring, by dynamic correlation of inputs from different layers.

3.4.2 Computational Issues for Traffic Classification

In Section 3.2.3, we explored the power of traffic classification, but there are also limits to the complexity of processing that can be performed on volume data in real time. While there are commonly used tools to capture data, e.g., NetFlow, performing actions such as traffic classification is subject to resource constraints. There has been significant work, however, that investigates possible trade-offs. In [BTA+06], it is demonstrated that flows can be classified by application with a reasonable level of accuracy by sampling just the headers of the first few packets in the flow. [CDGS07] uses a statistical fingerprinting technique, based on training data, to classify flows by application. These approaches reduce the amount of data that must be processed. Another approach [LB07] has shown that simply classifying flows into either one-way or two-way flows can reveal information, such as malicious attacks.
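The one-way versus two-way distinction can be illustrated with the minimal sketch below. It is not the classifier of [LB07]; the packet representation as (source, destination) endpoint pairs and the function name `classify_flows` are assumptions.

```python
from collections import defaultdict


def classify_flows(packets):
    """Label each unordered endpoint pair as 'one-way' or 'two-way'.

    packets is an iterable of (source, destination) endpoint tuples.
    """
    directions = defaultdict(set)
    for src, dst in packets:
        key = tuple(sorted((src, dst)))   # same key for both directions
        directions[key].add((src, dst))   # remember which directions were seen
    return {
        key: "two-way" if len(seen) == 2 else "one-way"
        for key, seen in directions.items()
    }


# Hypothetical observations: a normal exchange and an unanswered probe.
observed = [
    ("10.0.0.1:40000", "10.0.0.2:80"),
    ("10.0.0.2:80", "10.0.0.1:40000"),
    ("10.0.0.9:50000", "10.0.0.2:23"),
]
print(classify_flows(observed))
```

Flows that never receive a reply, such as the probe to port 23 in this example, are the ones that may indicate scanning or other malicious activity.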

As for complexity, in [WZA06], a number of machine-learning algorithms are applied to traffic classification and their computational performance is compared. This shows considerable divergence in classifier performance, indicating that certain techniques are much more useful for on-line analysis. It is also shown that “feature reduction” can reduce complexity with little impact on accuracy. Memory resources can also be a constraint. For example, ‘lossy counting’ is a technique that has been used to identify ‘heavy hitter’ flows, but it consumes considerable amounts of memory. In [DHK08], a probabilistic variant of lossy counting is proposed which significantly reduces memory requirements. However, this is not shown to be executable in real time.
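For reference, the basic (deterministic) lossy counting scheme can be sketched as follows. This is the standard textbook form of the algorithm, not the probabilistic variant of [DHK08]; the class name, the parameter values and the synthetic stream are assumptions for illustration.

```python
import math


class LossyCounter:
    """Approximate frequency counts with error at most epsilon * n."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> [count, maximum undercount]

    def add(self, item):
        self.n += 1
        bucket = (self.n + self.width - 1) // self.width  # current bucket id
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def heavy_hitters(self, support):
        """Items whose counted frequency is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


# Synthetic stream: one dominant flow mixed with many flows seen only once.
counter = LossyCounter(epsilon=0.01)
for i in range(1000):
    counter.add("flowA" if i % 2 == 0 else f"flow{i}")
print(counter.heavy_hitters(support=0.3), len(counter.entries))
```

The table size depends on epsilon and on the traffic mix, which is why memory can become the limiting factor on high-rate links with many distinct flows.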

It has also been demonstrated that data captured at the network level is increasingly being obscured by encryption and tunnelling. In [CMD+06], it is shown that effective monitoring can be achieved at end systems, thus overcoming this issue. Cross-layering can also be applied to this approach. However, to be effective network wide, an end-system approach will require distributed cooperation.

3.4.3 Distributed Monitoring

There has been some preliminary work to date on how distributed monitors may cooperate and share information. For example, Li et al. [LBZ+06] propose and validate a distributed query processing system for monitored data, while [PLR07a] proposes and analyses techniques for optimum sharing of distributed information. However, an unsolved problem is that the ‘best’ distributed solutions require cooperation between ISPs, for which there is currently little economic incentive.


In summary, for distributed challenge monitoring, there are still unsolved problems of scalability. There is a trade-off between processing resource constraints and accuracy. The volume of monitored traffic may constrain the level of capture and analysis. There may be insufficient resources available, for example in wireless network nodes [HSHR09]. This suggests a need for a hybrid, multi-level approach to challenge monitoring. However, major issues then arise, such as the robustness of the challenge detection platform itself, and the lack of incentives for cooperation between ISPs.

3.5 Summary: Missing Links and Next Steps

The presented review identifies substantial gaps between our resilience requirements and the current state-of-the-art in directly related fields. This, in turn, identifies major issues to be addressed.

• The complexity of real-time detection under resource constraints

• The generality and robustness of known detection techniques, and selection of appropriate techniques in order to analyse challenges in real time

• Staged approaches to detection and identification

• Multi-layer and temporal correlation of events

• Robust, scalable distributed and co-operative monitoring

• Mechanisms and incentives for cross domain co-operation

• Evolvability of the challenge detection system (diagnosis and refinement)

Ongoing research is pursuing a number of directions to address these issues. Existing and emerging detection techniques are being further explored with a view to finding one or a small number of techniques that can be flexibly deployed against a wide range of challenges, and which can evolve over time through the diagnosis and refinement loop. In this regard, entropy-based techniques show some promise [LCD05, LX01].

We are exploring policy-based approaches to specifying our resilience strategies, in particular staged detection and identification, followed by invocation of appropriate remediation [SCS+09]. In conjunction, and by using case studies of real ISPs, we are developing a physical architecture of our distributed detection system.

Refinement and evolution of the system require a record of challenges that have been detected, identified and remediated. This is one motivation for having a Distributed Information Store as a key component of our overall strategy.

4 Challenge Detection in WMNs

In this section we provide a validation of our approach to challenge detection and its requirements, as identified above, in the context of wireless mesh networks (WMNs). Our intention is to provide a proof of concept by mapping our strategy in more detail to show how it can address real challenges in WMNs. We focus on one important challenge, the problem of interference, and show how it can be addressed effectively in WMN environments using our overall approach. Firstly, let us describe in more detail the challenges faced in real WMNs.

In wireless mesh networks, challenges are different from the ones in other networks: the usage of wireless communication makes them prone to attacks and faults that do not exist in wired networks. Challenges often have a bigger impact on WMNs than on wired networks, because of the usage of a less reliable (compared to Ethernet) wireless technology but also because of high dependencies between the mesh nodes and the links between them. For example, a challenge which causes high traffic to be sent between two nodes can, due to interference or congestion, affect the available bandwidth on a link between two other nodes. Finally, due to the broadcast nature of wireless networks, attacks can be performed much more easily than in wired networks.

Figure 3 shows a graph of challenges appearing in WMNs which may lead to high delay or low throughput in the application layer. The challenges are separated by layer and arrows indicate how challenges cause each other; challenges whose cause is unknown, or of no further interest, are called root causes (grey background). As the figure shows, challenges from lower layers cause challenges on upper layers but rarely vice versa.

WMNs are often deployed in rural areas as an easy and affordable way to provide Internet access to a community. In such a deployment, many of the attacks mentioned in Figure 3 are not likely to happen, due to lack of incentives as well as because the network is much less exposed compared to a corporate network. On the other hand, problems on the link and physical layer have a bigger impact and become more important with the increasing use of real-time and high-bandwidth networking applications, such as video streaming or online gaming.

We focus on interference because this problem plays a very important role in WMNs. Other important challenges are detailed in Appendix B. Due to the meshed structure, interference is not only caused by other access points in proximity but also arises within the network itself. With high usage of the mesh network, the presence and impact of interference increase, which can lead to degraded, or even unusable, quality of service. We therefore view it as an excellent example of a resilience challenge that is to be met via our strategy. The challenge can impact the dependability of the entire network to an unacceptable extent, as well as impacting the quality of service of individual applications. Interference can be due to one or more distinct root causes that are not immediately identifiable from the initial symptoms. Thus detection and remediation of interference are good examples where neighbouring nodes support each other to detect a challenge and find a remediation in order to optimise the state of the whole system. The nature of an interferer can be detected more easily and more accurately by analysing how it appears to different nodes, and mitigation of interference may require the whole system to change the wireless channel.

A WMN uses wireless communication, both for its backhaul as well as for client access; in general such a network (as shown in Figure 4) consists of four different types of nodes:

Mesh Node: A node that is connected to other mesh nodes through a wireless interface.

Client Access Point: A mesh node that additionally provides wireless access to client hosts.

Internet Gateway: A mesh node which is connected to the Internet. The WMN should contain at least one Internet gateway in order to provide Internet access to the clients.

Client Host: A host accessing a mesh node’s client network. Even though a client host is part of the network, we assume that we have no control over its behaviour and software.

Figure 3: Challenges in a WMN separated by layers, focusing on problems caused by interference and high utilisation on the physical and link layer.

Figure 4: A wireless mesh network with four nodes: A (a gateway), B (a mesh node without any additional functionality), C and D (client access points with three client hosts each).

We assume that each mesh node is a Client Access Point and has two wireless interfaces, one used for the backhaul and the other one for client access, as is the case in the Community WMN in Wray (see footnote 2). In order to prevent interference between the two, the backhaul interface uses 802.11a and the client access interface uses 802.11b/g technology. This also ensures that all clients can access the WMN, because 802.11b/g is more common in commercial laptops than 802.11a.

These assumptions require all mesh nodes to use the same channel for the backhaul interface in order to retain a connected network. This fact, and hence the lack of possibility to apply local remediations, together with the fact that other devices are more likely to interfere with the 2.4 GHz frequencies than with 5 GHz, convinced us to have a closer look at interference that degrades the service of the client access networks and is caused by other wireless devices in close proximity to the WMN as well as by mesh nodes.

Before we present any solutions, let us analyse further the problem of interference to client access points in WMNs in Section 4.1. In Section 4.2, we first provide a general design for distributed detection, remediation and recovery of challenges in a WMN, prior to describing its implementation for interference detection and remediation in Section 4.3.

4.1 Analysis: Impact of Interference

In order to point out the impact of interference on the experienced quality of service, but also to better understand its impact on lower-level metrics, we are currently performing a number of experiments, all of which use the setup shown in Figure 5. We use two client access mesh nodes and two client hosts (laptops). In the first pair (link a), traffic is transmitted on a single channel during the whole duration of the experiment, both from the access point to the station, and vice versa. On the second link (b), the interferers alternately remain silent for a given time and then generate traffic for the same duration. The interferers’ channel is increased after each traffic generation period from channel 1 to channel 13. We also measure interference between so-called non-overlapping channels, since we cannot assume complete independence, as shown in [FVR07] and also confirmed by our measurements.

2. A village near Lancaster, which served as a use case earlier in D1.1 [SS09].

Figure 5: Experiment setup with station S connected to access point A (link a) and station S_I connected to access point A_I (link b); the latter two can be combined as a single interferer I.

In general, interference can occur in two places: at the sender level or at the receiver level. Interference at the sender level is caused by ongoing transmissions from other nodes which prevent the sender from sending a frame. Interference at the receiver level happens when a sent frame cannot be successfully received by the receiver due to the presence of another transmission.

We measure the impact of interference in three scenarios, as shown in Figure 6. Note that this is different from the hidden terminal problem, which can be solved by using the optional RTS/CTS mechanism, since interferer I is possibly in another network and/or channel and therefore does not communicate with S and A, respectively. In scenario (a), all nodes are visible to each other; in scenario (b), the station S does not see the interferer I. Scenario (c) is very similar to (b), since both A and S act partly as sender and receiver. However, because the access point A has additional functionality like beacon transmission and association management, we expect different results from scenarios (b) and (c).

4.1.1 Measurement Metrics

Interference in wireless networks cannot simply be reduced to physical interference between multiple frequencies. It also involves complex interactions between the MAC protocols, some of which are, despite the standardisation, very specific to certain manufacturers. In order to expose as many aspects of interference as possible, we monitor a diverse set of metrics, which are retrieved either from statistics tools or from the /proc/net filesystem.

Figure 6: Interferer I (consisting of an access point and a station), client access point A and station S with their interference ranges, placed in three different scenarios: (a) all nodes can directly communicate with each other, (b) interferer I is not visible for station S, (c) interferer I is not visible for access point A.

Madwifi Statistics Madwifi (http://madwifi-project.org) is a wireless driver for Linux systems which allows creation of multiple wireless virtual access point interfaces per physical device. It therefore separates the notion of interface (link layer) and device (physical layer) and exposes their statistics separately through the 80211stats and athstats commands, respectively. It further provides some data about successful and failed transmissions for different frame sizes and transmission rates in the /proc/net/athX/ratestats_Y files (X being an interface number and Y one of the available packet size limits). Many further statistics can be exposed from these and other layers of the protocol stack (e.g. TCP). For the purpose of our experiments, we use the following metrics, which are indicative of interference.

Bad CRC Number of packets received but dropped because of an invalid CRC value.

High On-Chip Retries (or too many retries) Number of packets dropped because the wireless device failed too many times to successfully transmit the packet.

Element Unknown Number of packets received which couldn't be forwarded to a host because they originated from another network or an unassociated station.

Channel Mismatch Number of packets received which were sent on another channel.

SSID Mismatch Number of packets dropped because they were sent with another SSID.

4.1.2 Results

Figures 7 and 8 show two graphs with different metrics measured on the access point A operating on channel 1, while an interferer was partially transmitting on channels 1 to 9. The nodes were situated according to scenario (a) and used the highest possible transmission power settings. A single TCP stream is generated from node A to node S, which results in A sending but also partly receiving data (acknowledgements).

Figure 7 shows an increased rate of received packets with unknown destination for channels 1 to 3, a good indication that another network is using the same or adjacent channels. For channels 4 to 6, Figure 8 shows an increased number of packets received with invalid CRC. Due to the close proximity of all involved nodes, transmissions can be clearly received by all stations using the same and even adjacent channels (up to an offset of two). An increased rate of packets with bad CRC is visible only for channels that are further apart.

While sending packets, for channels 1 to 3, the sender A can mostly hear the transmissions by the interferer, and collision avoidance prevents it from sending packets while packets from the interferer are in the air. However, some collisions still happen. With increasing channel offset, sender A cannot hear the interferer's ongoing transmissions and sends its packets nonetheless, which leads to interference and failed reception at node S. This explains the increasing rate of packets dropped due to too many failed transmissions and retries in Figure 8.

Figure 7: Measurements of the 80211stats command: packets with unknown destination, packets with wrong channel number and packets with wrong SSID, respectively, relative to the total number of received packets. The yellow shaded areas show the times when an interferer was transmitting on the indicated channel, simultaneously with the monitored access point, which was communicating on channel 1.

Figure 8: Measurements of the athstats command: packets dropped because of bad CRC relative to all received packets, and packets dropped because of too many unsuccessful retries relative to all sent packets. The yellow shaded areas show the times when an interferer was transmitting on the indicated channel, simultaneously with the monitored access point, which was communicating on channel 1.

4.2 Architecture: A Distributed Challenge Detection, Remediation and Recovery System

In this section we describe how the ResumeNet D²R² + DR strategy, and more specifically the detection, remediation and recovery cycle shown in Figure 9, can be applied to challenges in WMNs. Here, we further split detection, remediation and recovery into local-only and remote (distributed) support phases, consistent with the requirements of our approach.

Figure 9: D²R² + DR strategy and the steps of challenge detection, remediation and recovery in a distributed system.

We also propose an initial architecture by describing a set of components and tools: the Resilience Engine for scheduling and local execution of the detection and mitigation phases (Section 4.2.4), the Support component for message exchange and support by remote hosts (Section 4.2.5), and the Local Information Provider for data collection and aggregation (Section 4.2.6). Figure 10 shows these components as well as the exchanged messages and the detection, remediation and recovery phases. Support Handlers, Challenge Handlers and (Remote) Commands can be implemented to detect and mitigate different types of challenges. Note that at this proof-of-concept stage we are not yet utilising the Distributed Information Store proposed in Section 6.

We divide the general process of challenge detection and mitigation into the phases that we have previously proposed, split further in recognition of WMN resource constraints and to minimise message overhead. Only after local detection of a problem is a more complex analysis involving other hosts triggered, in order to identify and understand the challenge more fully. Likewise, during the remediation phase, the challenge's impact is minimised locally first, before a distributed algorithm is used to find a better solution for the overall system. Finally, a distributed observation phase synchronises the point in time at which all nodes revert to a standard configuration.

Figure 10: Diagram showing the components of a distributed challenge detection, remediation and recovery system on one host.

4.2.1 Detection

In order to mitigate a challenge, the system needs to be aware that a certain challenge is currently in progress. The process of determining whether a challenge is ongoing and analysing its nature and impact is called Detection and consists of the following two phases:

Detect Data from the Information Provider (see Section 4.2.6) is consolidated in order to decide whether a challenge is present. The kind of algorithm used, e.g. machine-learning-based, entropy-based, etc., is left to the implementation.

Analyse In order to prevent false positives and to gain a more advanced understanding of the challenge, further data from the Information Provider as well as from other nodes can be requested. We provide a Messaging Service in order to easily exchange messages with other nodes in the WMN.

4.2.2 Remediation

After the type, cause and impact of a challenge have been determined locally and/or with the help of neighbouring nodes, certain actions need to be taken to mitigate the challenge. The process of dealing with the challenge either temporarily or permanently is called Remediation; distributed Remediation can be split into three phases:

Minimise Based on the previously gained view of the challenge, a local action may be helpful to temporarily mitigate the problem, or might even be necessary in order to be able to execute the next phases. For example, a system under extreme traffic load might need to limit traffic rates, or drop packets from certain sources, before more advanced algorithms can be used to detect whether the load is caused by a DDoS attack or a legitimate flash crowd.

Consult Not only detection but also remediation might need collaboration among the mesh nodes. Different challenges need different types of algorithms for agreement on a global solution. As a general solution, we only provide a Messaging Service which can be used to exchange an arbitrary number of messages between any pair of nodes. In the remediation framework [SCS+09], such negotiation phases and concerted decisions fit into a “consultant” component.

Deploy Once a solution has been found in the previous phase, in the deployment phase the involved hosts are requested to execute the corresponding actions; this might also involve a synchronisation technique.

4.2.3 Recovery

Certain remediation techniques may be appropriate as long as the challenge is present; afterwards, reverting to the original configuration or a permanently changed version thereof might leave the system in a better operational state. This is the process of Recovery, which is split into two phases:

Assess Once a remediation action has been deployed, the assess phase regularly monitors and exchanges information in order to determine whether the challenge is still ongoing.

Revert Once the system agrees on a new solution, or on using the original one, in the revert phase the involved hosts are informed to revert to normal operation.

All these phases can be executed either by the Resilience Engine (described in Section 4.2.4), which periodically triggers a cycle, or by an incoming request from another node (described in Section 4.2.5). Figure 11 shows host A communicating with its neighbouring hosts B, C and D in order to detect and mitigate challenges.

Figure 11: Diagram showing the components used for distributed operation of the system, e.g. host A communicating with hosts B, C and D.

4.2.4 Resilience Engine

The Resilience Engine is the most important component for detection, remediation and recovery. It executes the previously described phases in cycles, as shown in Figure 12. First, a local detection algorithm is executed. If a challenge or a suspicion of a challenge is detected, control is passed to the analysis phase; if not, the engine returns to the wait state until a new execution of detect is triggered. The analysis phase consolidates further data either to negate the previously detected challenge, in which case the system returns to the wait state, or to confirm it, so that the Resilience Engine executes all the following phases in sequence.

Figure 12: Phases of detection, remediation and recovery. Left: the cycle for the local Remediation Engine; right: the cycle for remote Support.

Note that all phases are implemented entirely by a Challenge Handler, which is specific to each challenge. Hence, the detect phase can be implemented to always return a suspicion if local-only detection is not possible or helpful. Also, any of the following phases can be implemented to simply do nothing if it is not needed. The Challenge Handler can store the data used in previous phases, so that each phase can access it.
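The cycle described above can be illustrated with a minimal sketch. The class name ChallengeHandler, the method names and the run_cycle function are hypothetical and chosen only to mirror the phase names in the text; they are not the interface of the actual implementation.

```python
class ChallengeHandler:
    """Hypothetical base class: one subclass per challenge type."""

    def __init__(self):
        self.data = {}  # state shared between phases of one handler

    def detect(self):
        # Default: always report a suspicion (local-only detection not possible).
        return True

    def analyse(self):
        # Default: negate the suspicion, which ends the cycle.
        return False

    # The remaining phases do nothing unless a subclass needs them.
    def minimise(self): pass
    def consult(self): pass
    def deploy(self): pass
    def assess(self): pass
    def revert(self): pass


def run_cycle(handler):
    """Run one engine cycle and return the names of the executed phases."""
    executed = ["detect"]
    if not handler.detect():
        return executed              # no suspicion: back to the wait state
    executed.append("analyse")
    if not handler.analyse():
        return executed              # suspicion negated: back to the wait state
    for phase in ("minimise", "consult", "deploy", "assess", "revert"):
        getattr(handler, phase)()
        executed.append(phase)
    return executed


class ConfirmedHandler(ChallengeHandler):
    """Toy handler whose analysis always confirms the challenge."""

    def analyse(self):
        self.data["confirmed"] = True
        return True


print(run_cycle(ChallengeHandler()))
print(run_cycle(ConfirmedHandler()))
```

Keeping every phase as an overridable no-op reflects the statement that a handler may leave phases empty; a real engine would additionally schedule the cycle periodically.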

4.2.5 Support Engine

Some of the phases allow a node to request help from other nodes in the WMN; these nodes in turn are obliged to respond to the requests as far as the algorithm requires. Incoming requests are handled by the Messaging Server, which looks up the responsible Challenge Handler and executes the corresponding Support Phase with the parameters extracted from the message. Each node manages a Handler List, a list of all available Challenge Handlers, which the Messaging Server uses to find the appropriate handler.

Unlike in the local Resilience Engine cycle, in the Support phases not all data needed to handle an incoming request is available directly from the current request. Also, a node shouldn't be able to request a certain action without the appropriate history. Therefore, state also needs to be maintained for incoming requests. However, in order to prevent resource starvation due to too many handled messages, this state is deleted after a given timeout, after which it can be assumed that it is no longer needed.
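A minimal sketch of such per-request state with expiry is shown below. The class RequestState, its method names and the use of an explicit `now` argument are assumptions made for illustration; the text does not specify how the state is stored or keyed.

```python
class RequestState:
    """Hypothetical per-request history on a supporting node, expired after a timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.entries = {}  # request_id -> (last_seen, list of phases seen so far)

    def expire(self, now):
        """Drop all entries that have not been touched for longer than the timeout."""
        stale = [r for r, (seen, _) in self.entries.items() if now - seen > self.timeout]
        for request_id in stale:
            del self.entries[request_id]

    def record(self, request_id, phase, now):
        """Remember that a support phase was executed for this request."""
        self.expire(now)
        _, history = self.entries.get(request_id, (now, []))
        history.append(phase)
        self.entries[request_id] = (now, history)

    def has_history(self, request_id, required_phase, now):
        """Check whether a request may proceed, i.e. the required earlier phase was seen."""
        self.expire(now)
        entry = self.entries.get(request_id)
        return entry is not None and required_phase in entry[1]


state = RequestState(timeout=60)
state.record("A-1", "analyse", now=0)
print(state.has_history("A-1", "analyse", now=30))   # within the timeout
print(state.has_history("A-1", "analyse", now=100))  # state has expired
```

Passing the time explicitly keeps the sketch deterministic; a deployment would more likely read a monotonic clock.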

In a distributed system like a WMN it is possible that multiple nodes detect the same challenge independently. While multiple detections of the same challenge aren't a problem per se, multiple (possibly even contradicting) remediations definitely are. Therefore, before dealing with an incoming request, it has to be made sure that the challenge is not already being dealt with. Here, a well-defined deterministic algorithm is needed in order to avoid multiple nodes dropping each other's requests, which would prevent the distributed system from working properly.
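The text leaves the deterministic algorithm open. One simple possibility, shown here purely as an assumption and not as the rule used by the system, is to let the node with the smallest identifier among all concurrent detectors take responsibility. Every node evaluates the same rule on the same set, so exactly one of them proceeds.

```python
def responsible_node(detectors):
    """Hypothetical tie-break: the smallest node identifier wins."""
    return min(detectors)


def should_handle(own_id, competing_ids):
    """True if this node, among all nodes that detected the challenge, is responsible."""
    return own_id == responsible_node({own_id, *competing_ids})


detectors = {"A", "B", "C"}
for node in sorted(detectors):
    print(node, should_handle(node, detectors - {node}))
```

This only works if all detectors learn about each other, for instance through the exchanged requests; the sketch does not address lost messages.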

4.2.6 Information Provider

During some of the phases mentioned above, data from different commands needs to be consolidated. However, the output of a single run of a command might be meaningless, because it is the increase of a counter during a certain time that is of interest. Therefore, it is important that commands get executed regularly and that data is already available and pre-processed when needed. Pre-processing may also include calculations such as the average, maximum or minimum value over a given time. On the other hand, there are some commands whose data is only rarely needed, and it would be an unnecessary overhead to collect that data regularly. The Local Information Provider returns data either from the Store or, if no cached data is available, by invoking a Command Module.

The main purpose of the Information Provider and the Store, however, is aggregation of data: with a single query, data from multiple CEP modules can be returned in one object without executing them. A collector background process executes certain CEP modules regularly and feeds their output to the Store. In order to limit the Store's size, only a given number of samples can be stored; we use a round robin database for this purpose. This doesn't have an impact on the system, since for detection and analysis only current and recent data is of interest.
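The idea of a bounded store holding counter samples, with rates and simple aggregates derived from them, can be sketched as follows. The class RoundRobinStore is a hypothetical in-memory stand-in for the round robin database mentioned above, and the sample values are made up.

```python
from collections import deque
from statistics import mean


class RoundRobinStore:
    """Bounded sample store: the oldest sample is overwritten once the store is full."""

    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)  # (timestamp, counter_value)

    def add(self, timestamp, counter_value):
        self.samples.append((timestamp, counter_value))

    def rates(self):
        """Per-second increase of the counter between consecutive samples."""
        items = list(self.samples)
        return [(v1 - v0) / (t1 - t0) for (t0, v0), (t1, v1) in zip(items, items[1:])]

    def summary(self):
        """Pre-processed view: average, maximum and minimum rate, or None if too few samples."""
        r = self.rates()
        if not r:
            return None
        return {"avg": mean(r), "max": max(r), "min": min(r)}


store = RoundRobinStore(capacity=4)
# Made-up readings of a cumulative "bad CRC" counter taken every 10 seconds.
for t, bad_crc in [(0, 100), (10, 120), (20, 180), (30, 190), (40, 250)]:
    store.add(t, bad_crc)
print(store.summary())
```

Because only the increase between samples matters, the absolute counter values never need to be kept beyond the configured capacity.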

4.3 Realisation: A Distributed Interference Detection and Remediation System

As a proof of concept for our new challenge detection approach and as a solution to the problems described in Section 4.1, we provide a realisation of the abstract system which deals with interference.

First we introduce some variables and definitions which help to describe the phases of detection and remediation of interference:

• The set of mesh access points $V(n)$ visible to node $n$, and $V_j(n) = \{v \in V(n) \mid c(v) = j\}$, the subset thereof using channel $j$, where $c(v)$ denotes the channel used by $v$.

• The set of visible external access points $E(n)$, and $E_j(n) = \{e \in E(n) \mid c(e) = j\}$, the subset thereof using channel $j$.

• S(n), the set of stations associated with mesh access point n.

• The set of available channels C.

• R(n), the Received Signal Strength of a visible access point n.

• U(n), the utilisation of a network identified by access point n.

We define the metric of interference impact caused by external nodes on channel i as:

$$M^{E}_{i} = \sum_{j \in C} \alpha_{i,j} \sum_{e \in E_j} R(e) \cdot U(e)$$

with $\alpha_{i,j}$ being the cross-channel interference coefficients.

Accordingly, we define the metric for mesh nodes themselves:

$$M^{V}_{i} = \sum_{j \in C} \alpha_{i,j} \sum_{v \in V_j} R(v) \cdot U(v)$$

and the total interference impact (due to linearity):

$$M^{E+V}_{i} = M^{E}_{i} + M^{V}_{i}$$
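The metrics above can be evaluated with a few lines of code. The sketch below is illustrative only: the coefficient function alpha (a linear fall-off over five channels), the use of a linear scale for R and of an airtime fraction for U, and all numeric values are assumptions, since the text does not give the measured coefficients or units.

```python
def interference_metric(i, channels, alpha, nodes_on, rss, util):
    """M_i = sum over channels j of alpha(i, j) * sum over nodes n on j of R(n) * U(n)."""
    return sum(
        alpha(i, j) * sum(rss[n] * util[n] for n in nodes_on.get(j, ()))
        for j in channels
    )


def alpha(i, j):
    """Placeholder cross-channel coefficient: 1 on the same channel, 0 from 5 channels apart."""
    return max(0.0, 1.0 - abs(i - j) / 5.0)


CHANNELS = range(1, 14)
external_on = {1: ["e1"], 6: ["e2"]}        # E_j: external access points per channel
mesh_on = {1: ["v1"]}                       # V_j: mesh access points per channel
rss = {"e1": 0.5, "e2": 0.8, "v1": 0.2}     # R, linear scale (assumption)
util = {"e1": 0.4, "e2": 0.5, "v1": 1.0}    # U, fraction of airtime (assumption)

m_e = interference_metric(1, CHANNELS, alpha, external_on, rss, util)
m_v = interference_metric(1, CHANNELS, alpha, mesh_on, rss, util)
m_total = m_e + m_v
print(m_e, m_v, m_total)
```

Computing the metric once over the union of external and mesh nodes gives the same total, which is the linearity property used in the last equation.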

4.3.1 Detection

In order to detect interference, we could use a machine-learning based classification algorithm. The data collected for the measurements (as mentioned in Section 4.1.1) can be used to train the classification algorithms; however, before the data can be used, certain features need to be selected. Further, the data needs to be normalised relative to the amount of received and transmitted traffic.

The detection algorithm used should have a low false negative rate in order to detect as many cases of interference as possible. On the other hand, a slightly higher false positive rate is acceptable, since in the following analysis step the node may still determine that there is no interference present and stop the current cycle. This trade-off suits our setting, since the initial detection algorithm also needs to be relatively lightweight given the resource constraints of WMN nodes.
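As a rough illustration of normalisation and of a lightweight first-stage check, the sketch below divides receive-side counters by the number of received packets and transmit-side counters by the number of sent packets, then flags a suspicion if any rate exceeds a threshold. It is a simple stand-in, not the machine-learning classifier discussed above; the dictionary keys and the threshold values are hypothetical.

```python
def normalise(counters):
    """Turn raw counter increases into per-packet rates (key names are hypothetical)."""
    rx = counters["rx_packets"]
    tx = counters["tx_packets"]
    return {
        "bad_crc": counters["bad_crc"] / rx if rx else 0.0,
        "element_unknown": counters["element_unknown"] / rx if rx else 0.0,
        "too_many_retries": counters["too_many_retries"] / tx if tx else 0.0,
    }


def suspect_interference(features, thresholds):
    """Raise a suspicion as soon as any single feature exceeds its threshold."""
    return any(features[name] > limit for name, limit in thresholds.items())


# Deliberately low, made-up thresholds: they favour few false negatives
# and leave it to the analysis phase to reject false positives.
THRESHOLDS = {"bad_crc": 0.05, "element_unknown": 0.05, "too_many_retries": 0.02}

sample = {"rx_packets": 2000, "tx_packets": 1000,
          "bad_crc": 40, "element_unknown": 300, "too_many_retries": 5}
features = normalise(sample)
print(features, suspect_interference(features, THRESHOLDS))
```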

4.3.2 Analysis

After the classification algorithm detects interference, our staged approach invokes further analysis in order to gain more understanding of the challenge. The analysis phase can be divided into two parts, a local part and a remote part.

Locally, a network scan is performed in order to detect all available networks. In order not to disconnect the node's clients, however, some advanced techniques need to be used. On the client side, a station can enter power-save mode while performing a network scan in order not to disconnect from the access point. For an access point, however, this is not possible. In [PAM09], the authors use the network allocation vector to virtually mark the channel as busy. While all stations assume that another station is currently transmitting and remain silent, the access point can safely perform a network scan.

In addition to the scan, the node should also gain an estimate of the utilisation of the detected networks. This can be done by sniffing traffic; sniffing on channels other than the one currently in use, however, is again only possible by either disconnecting or using the above technique.

More accurate estimation of the utilisation of nodes in the same network requires distributed cooperation. Accordingly, an Analysis Request is sent to all visible neighbours in order to request statistics about their usage. This request also ensures that multiple nodes don't deal with the problem independently: if, for example, nodes A and B both detect interference, they might both change their channel to the same number, which again causes interference. Through an analysis request, the problem gets merged and only one of the neighbouring nodes is actually responsible for finding a new channel allocation.

4.3.3 Minimisation

Using the previously collected data and calculating the metric of interference impact $M^{E+V}_{c}$ for all channels $c$, a node can minimise interference by choosing the channel $c_{\min}$ for which $M^{E+V}_{c_{\min}}$ is minimal.
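Given the total metric per channel, the local choice reduces to an argmin. The sketch below assumes the metric values have already been computed; the function name best_channel and the numbers are illustrative only.

```python
def best_channel(metric_by_channel):
    """Return the channel whose total interference metric is smallest."""
    return min(metric_by_channel, key=metric_by_channel.get)


# Made-up total interference values for three candidate channels.
metric_by_channel = {1: 0.40, 6: 0.28, 11: 0.05}
c_min = best_channel(metric_by_channel)
print(c_min)
```

If several channels share the minimal value, this sketch returns the first one in the dictionary's order; the text does not say how such ties should be resolved.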

4.3.4 Consulting

Figure 13: Three mesh nodes A, B, C and an interfering node I with their channel allocation (superscript); the edges between the nodes show dependencies of the channel allocation: (a) without an interferer, (b) with an interferer before minimisation, (c) after minimisation, (d) after negotiation/remediation.

Even though the minimisation phase minimises the interference for the current channel allocation in the network, a new allocation of channels may further decrease the impact of interference, as shown in Figure 13. In (c) the best available channel is chosen without requiring any other nodes to change their channels. In the consulting/remediation phase, the whole network can be involved in finding a new channel allocation; however, as few nodes as possible should actually change their channel. In (d) nodes A and C simply switch their channel numbers and the network is again in an interference-free state.

The problem of finding a channel allocation in a network with certain dependencies comes down to a node colouring problem, where the nodes are the network nodes and the edges are dependencies between the nodes, e.g. the requirement that two connected nodes shouldn't have the same channel allocated.
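To make the colouring view concrete, the following sketch repairs a conflicting allocation greedily. It is one possible heuristic chosen for illustration, not the negotiation algorithm of the system: nodes keep their channel when it does not clash with an already processed neighbour, and otherwise take a free channel, preferring one that no still-unprocessed neighbour currently holds. It does not guarantee the minimum number of changes in general. The topology and channel numbers are made up.

```python
def recolour(edges, current, channels):
    """Greedy repair: keep a node's channel unless it conflicts with a processed neighbour."""
    neighbours = {node: set() for node in current}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    assignment = {}
    for node in sorted(current):
        taken = {assignment[m] for m in neighbours[node] if m in assignment}
        if current[node] not in taken:
            assignment[node] = current[node]      # no conflict: no change needed
            continue
        # Channels held by neighbours that have not been processed yet.
        pending = {current[m] for m in neighbours[node] if m not in assignment}
        free = [c for c in channels if c not in taken]
        if not free:
            raise ValueError(f"no conflict-free channel for {node}")
        free.sort(key=lambda c: c in pending)     # stable sort: unclaimed channels first
        assignment[node] = free[0]
    return assignment


edges = [("A", "B"), ("B", "C")]
current = {"A": 1, "B": 1, "C": 6}                # A and B clash on channel 1
new = recolour(edges, current, channels=[1, 6, 11])
changed = [n for n in current if current[n] != new[n]]
print(new, changed)
```

Here only node B changes its channel, in line with the goal that as few nodes as possible should switch.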

4.3.5 Deployment

Once the consulting phase has found a suitable solution for the network, the channels have to be changed in all involved nodes at the same time. If this were done while still searching for the solution, one channel change would cause interference at another node, which would in turn cause another channel change and more interference, leading to a chain reaction. This further validates our staged, distributed approach to challenge detection and remediation.

5 Challenges Detection in Opportunistic Networks

The ResumeNet strategy describes a real-time control loop to allow dynamic adaptation of a networked system in response to challenges, and an off-line loop that aims to improve the performance of the network (the real-time loop) via a process of reflection.

We have applied this general resilience strategy to an opportunistic networking scenario, illustrating its application. First, we show how our current implementation of the transport protocol fits into the strategy picture, followed by a full circle through the two control loops at the centre of this strategy. These two control loops are triggered by the introduction of an additional challenge to the simulation scenario, namely malicious nodes that do not behave according to the transport service specification, in that they do not forward data. The impact of the presence of such nodes is assessed, as well as the success of a potential remedy, including its costs.

This chapter is organised as follows. The resilience strategy is exemplified in a case study on opportunistic networking in Section 5.1. Thereafter, the experimentation results derived with our opportunistic networking emulator are presented in Section 5.2. Our conclusions and an outlook on future work close the chapter in Section 5.3.

5.1 Resilience for Opportunistic Networks

In this section, we describe the application of our resilience strategy to an opportunistic networking scenario.

5.1.1 Service Specification of a Store-Carry-Forward Transport

In opportunistic networks, no end-to-end path is assumed. Typically, mobile nodes store, carry, and forward messages upon encounter with other nodes using short range communication. A store-carry-forward (SCF) transport service [SKH+02] allows the forwarding of data to a node that is not connected to the source at the time the data is sent, if there exists a ‘temporal path’ between source and destination. A ‘temporal path’ is a set of links that connects a source and destination over time, where link $l_{n+1}$ exists after link $l_n$ has existed, and the data is forwarded. Node mobility is thus important for data dissemination, in that it causes contact opportunities between different nodes and also allows nodes to physically transport data to bridge areas where no connectivity might be available.
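The notion of a temporal path can be checked mechanically on a list of contacts. The sketch below is an illustration under simplifying assumptions that are not part of the service specification: contacts are instantaneous, symmetric and given as (time, node, node) tuples, and data is handed over at every contact. The function name and the example contacts are made up.

```python
def temporal_path_exists(contacts, source, destination, start_time=0):
    """Return True if data created at the source can reach the destination over time.

    contacts: iterable of (time, node_a, node_b); each contact is a usable link
    only for nodes that already hold the data at that time.
    """
    holds_since = {source: start_time}            # node -> time it obtained the data
    for time, a, b in sorted(contacts):           # process contacts in time order
        for giver, taker in ((a, b), (b, a)):
            if giver in holds_since and holds_since[giver] <= time:
                holds_since.setdefault(taker, time)
    return destination in holds_since


# S meets relay R at t=10, R meets D at t=20: the links occur in a usable order.
print(temporal_path_exists([(10, "S", "R"), (20, "R", "D")], "S", "D"))
# Same links in the opposite order: R met D before it had the data.
print(temporal_path_exists([(20, "S", "R"), (10, "R", "D")], "S", "D"))
```

The second call shows why the ordering of links matters: the same set of links forms no temporal path when the later link is needed first.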

A malicious node which does not forward data properly breaks this service specification by not forwarding data when a ‘temporal path’ exists. It is important to note that this service specification assumes unlimited storage capacity. The results presented below are also based on this assumption. We are currently working on refining the simulations, including this service specification, to reflect a more realistic opportunistic network by limiting the nodes' memory. However, this also introduces complications, which we detail in Section 5.3 when pointing to our future work.

5.1.2 Realising the D²R² + DR Strategy

Based on the presented network design and the D²R² + DR resilience strategy described in [SSF+09], we now illustrate how the system could be enhanced to cope with misbehaving nodes that do not forward data for other nodes. The overall strategy we adopt to mitigate misbehaving nodes is depicted in Figure 14. In summary, we start with an understanding of the potential capabilities of the network, e.g., in terms of delivery ratio and delay, based for example on simulations or modelling. If we detect a deviation from this because of malicious nodes, we adapt the configuration of the store-carry-forward transport mechanism in order to remediate. A more detailed description of the realisation of the strategy follows.

[Figure 14 diagram. Labels: Store and Forward Transport; Store copy counters of previous communication; Adapt SCF-configuration; Reference values from simulation; Delivered service; Defence; Detection; Remediation/Recovery]

Figure 14: Control loop steering the transport protocol

Defence A network for opportunistic communication using a SCF transport service is inherently built to cope with the challenge of episodic connectivity, as described above. The absence of an end-to-end path is the most dominant challenge the system has to deal with. It implements the error isolation defence line, as it contains the challenge within the service, making the absence of the end-to-end path transparent to the application. Other defence mechanisms could include, for example, using a game theoretical approach to promoting node participation, as proposed by [BDFV10]; a similar approach is being developed in our project.

Detection A simple approach to detecting the presence of misbehaving (or unhelpful) nodes is for sources to maintain a history of the nodes a message was sent via. In turn, receiver nodes maintain a list of nodes they have seen, and of the nodes that successfully delivered a message to them. Either when a source and destination pair meet during normal operation, or during an off-line period, e.g., when the devices are attached to some infrastructure, the receiver transmits its state to the sender. Using the information received from the receiver, the sender can deduce which nodes were helpful in forwarding a given message (i.e., delivered the message) and which were not, either because they never saw the receiver or because they chose not to transmit the message (i.e., they saw the receiver, but did not forward the message).

This simple algorithm can be extended in various ways. For example, nodes could begin to share their local knowledge about the utility of various nodes, i.e. disseminate information about unhelpful nodes and about those that are useful for certain destinations. The utility of this algorithm in different forwarding scenarios (here we assume two-hop SCF), and the way various parameters of the algorithm affect its utility, e.g., the amount of state to maintain, will be investigated in the project.

Remediation If a node detects the presence of maliciously behaving nodes, it adapts the SCF transport service configuration, enabling epidemic forwarding. In contrast to two-hop forwarding, epidemic forwarding utilises multi-hop paths from source to destination. A forwarding node is not restricted to forwarding the data to its destination but can also forward it to other nodes in the network. Thereby, a wider spreading of the data is realised, leading to more potential ‘temporal paths’.

Recovery Using epidemic forwarding as a remedy against malicious nodes comes at the cost of increased utilisation of storage on the network nodes (see Section 5.2 for details). Therefore, the SCF transport service should recover to its normal operation using two-hop forwarding as soon as the malicious nodes have disappeared from the system. This requires additional detection capabilities, which are currently under investigation in the project.

Diagnosis A diagnosis of the success of epidemic forwarding as a remedy against maliciously behaving nodes reveals that all data can be forwarded to its destination and, as a side-effect, the delivery delay is decreased. But the diagnosis also reveals the high costs this forwarding scheme incurs on the system.

Refinement Based on this diagnosis, measures to better balance the resource usage of the scheme against the delivery success can be designed. We added an aging mechanism to the management of the forwarding storage. Messages older than a certain threshold get deleted from the store. The results of this measure are also shown below.
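The Detection step described in the list above amounts to a few set operations once the sender has received the receiver's state. The sketch below is a minimal reading of that description for a single message; the function name, the three-way classification labels and the relay names are hypothetical.

```python
def classify_relays(sent_via, seen_by_receiver, delivered_by):
    """Classify the relays of one message from the sender's point of view.

    sent_via:          relays the sender handed the message to (sender's history)
    seen_by_receiver:  nodes the receiver reports having met
    delivered_by:      nodes the receiver reports as having delivered the message
    """
    helpful = sent_via & delivered_by
    no_contact = sent_via - seen_by_receiver                  # never met the receiver
    suspicious = (sent_via & seen_by_receiver) - delivered_by  # met it, did not forward
    return helpful, no_contact, suspicious


helpful, no_contact, suspicious = classify_relays(
    sent_via={"r1", "r2", "r3"},
    seen_by_receiver={"r1", "r2", "r9"},
    delivered_by={"r1"},
)
print(helpful, no_contact, suspicious)
```

Only the last group points to misbehaviour; relays that never met the receiver were unhelpful for this message but not necessarily malicious.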

5.1.3 Resilience Metrics for Opportunistic Networking

One of the challenging tasks is to quantitatively characterise resilience in order to evaluate the efficacy of architectures and mechanisms being developed in the ResumeNet project. This is an especially hard problem because of the numerous levels at which networks are addressed and the interaction between these levels. Given our multi-level resilience approach, we are developing a framework that enables resilience evaluation at any arbitrary level. First, we define a service at any given layer boundary. We then quantify the resilience of the network at this boundary using a two-dimensional state-space model [MHS06]. Along one dimension, we characterise the service at a given layer boundary using the metrics that are desired from such a service (e.g., storage size). Along the other dimension, metrics that define the operational state at the layer boundary (i.e., metrics that affect those defined in the service dimension) are specified (e.g., data delivery ratio). Finally, we quantify resilience as a measure of service degradation in the presence of challenges (perturbations) to the operational state of the network.

[Figure 15 plot. Axes: Node storage, Data delivery ratio. Regions: acceptable / unacceptable; Normal operation / Degraded operation. States: N, C, E, A]

Figure 15: Resilience metric state space

In Figure 15, this approach is depicted for our opportunistic networking scenario. The system's normal operation (N) is affected by the presence of malicious nodes, which degrade the provided service (C). Applying epidemic forwarding as a remedy allows the system to deliver the desired service again, but at an increased cost (E). Introducing aging during the refinement phase allows the system to maintain the specified service while reducing the cost again (A). Finally, recovery brings the system back into its normal operation (N).

5.2 Evaluation

We evaluate the proposed resilience strategy using the presented example of an opportunistic network scenario. The experimentation results are obtained with the Haggle architecture [NGR09] running on 20 (virtual) nodes with controlled connectivity. Connectivity between two nodes follows a two-state Markov model with an average contact time of 30 seconds and an average inter-contact time of 150 seconds. This topology avoids large connected clusters but still gives enough contact opportunities to exchange data. The nodes generate data every 2 seconds with a random destination among the 20 nodes. Haggle offers the possibility to dynamically change the forwarding strategy and resource management policies, for example to age out data carried for a long time. As a default, we use a two-hop forwarding scheme [GT02] and no data aging. Two-hop forwarding has the property of limiting the number of transmissions in the network: intermediate forwarding nodes may only get data directly from the source, and once they acquire data, they can only pass it on to the destination node.
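As a small worked consequence of the connectivity parameters above (our reading, assuming the contact and inter-contact states simply alternate with the stated mean durations), the long-run fraction of time during which a given pair of nodes is in contact is:

$$\pi_{\text{contact}} = \frac{T_{\text{contact}}}{T_{\text{contact}} + T_{\text{inter}}} = \frac{30\,\text{s}}{30\,\text{s} + 150\,\text{s}} = \frac{1}{6} \approx 0.17$$

The symbols $T_{\text{contact}}$ and $T_{\text{inter}}$ are introduced here only for this illustration.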

The study considers four scenarios: normal operation, with all nodes using the two-hop forwarding strategy; challenged operation, with 8 out of the 20 nodes refusing to forward data as an act of selfish behaviour; remediation, where, assuming perfect detection, the remaining co-operating nodes use epidemic forwarding; and refinement, where we periodically age data based on the time spent in a node's data buffer to compensate for the additional redundancy introduced by epidemic forwarding.

For our evaluation we consider the number of data items received at the destination, the end-to-end delay for all received data, and the number of data items in the data buffer of a particular co-operating node, measured over a period of 90 seconds. The results are summarised in Table 1.

| scenario | delivered data | end-to-end delay | buffer |
|---|---|---|---|
| normal (N) | 227 | 37 s | 127 |
| challenged (C) | 134 (−41%) | 31 s (−16%) | n/a |
| remediation (E) | 226 (±0%) | 18 s (−51%) | 942 (+642%) |
| refinement (A) | 246 (+8%) | 17 s (−54%) | 177 (+39%) |

Table 1: Evaluation results.
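The percentages in Table 1 are changes relative to the normal scenario (N), rounded to whole percent. They can be reproduced from the absolute values as follows; the dictionary layout and key names are chosen only for this illustration.

```python
BASELINE = {"delivered": 227, "delay_s": 37, "buffer": 127}   # normal (N)
SCENARIOS = {
    "challenged (C)": {"delivered": 134, "delay_s": 31},       # buffer: n/a
    "remediation (E)": {"delivered": 226, "delay_s": 18, "buffer": 942},
    "refinement (A)": {"delivered": 246, "delay_s": 17, "buffer": 177},
}


def relative_change(value, base):
    """Change relative to the baseline, in whole percent."""
    return round(100 * (value - base) / base)


changes = {
    name: {key: relative_change(value, BASELINE[key]) for key, value in values.items()}
    for name, values in SCENARIOS.items()
}
for name, row in changes.items():
    print(name, row)
```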

Selfish behaviour of 8 out of the 20 nodes reduces the number of delivered data significantly, to 59% of the data delivered during normal operation. The end-to-end delay to deliver data is reduced by a few seconds, most likely because data with a longer delivery time in normal operation did not get delivered at all in challenged operation and thus does not contribute to the result. By using epidemic forwarding as the remediation mechanism, the co-operating nodes compensate for the challenge introduced by the selfish nodes, achieving almost the same delivery of data as in normal operation. Furthermore, the end-to-end delay of the delivered data is much smaller because the restriction on redundancy of two-hop forwarding no longer applies, and faster but longer paths can be used to deliver the data. However, as redundancy is no longer restricted, the number of data items on the inspected node increases dramatically, from 127 during normal operation to 942 for remediation. The refinement process addresses this aspect by aging out data that has been stored in a node for more than 45 seconds. As a result, the stored data is reduced to 177 items, which is again close to normal operation. End-to-end delay was not affected, while the number of delivered messages even improved.

To understand this improvement, note that Haggle ranks the data to be transferred to another node in order of importance and only transfers a certain number of data items to limit congestion. Also, the longer data is in the buffer of a node, the higher the likelihood that it has already been delivered to the destination by other nodes. By aging out older data, we thus give priority to newer data. In addition, the shorter end-to-end delay allows more data to actually arrive at the destination during the period of observation.

5.3 Conclusions

We have applied the ResumeNet resilience strategy to an opportunistic networking scenario, showing first results on how this strategy can enhance the network over time.

The next step of our work is to refine the service specification for the store-carry-forward transport service to reflect the resource limitations of the network nodes. These limitations introduce several complications that affect the detection mechanisms.



6 Distributed Information Store

Identification of a challenge implies the processing of information coming from sensors and context information to produce “challenge notification” events. Here, we call a sensor any component that reports a situation. It can be protocol-specific variable monitoring (e.g. number of pending TCP connections) as well as a more sophisticated entropy-based or anomaly detection system. The detection algorithms that translate these reports into challenge notifications will typically gather information from multiple sources in order to increase their confidence level, including reports from sensors (or ranges of sensors) located in remote places of the network, sensors from other protocols, or background information such as routing table entries.

Notifications of challenges trigger the activation of mitigation components, which take actions to remediate the challenge. Because these decisions are taken automatically and require a response time that no human operator can offer, it is important to record the progress of the challenge as a whole (triggering conditions, evolution of the impact, side-effects) so that the fitness of the automated solution can be analysed later to adjust response thresholds and parameters.

We identified three key problems that hinder the development and deployment of efficient detection and remediation techniques, and we suggest that a common information dispatching and storage system is the proper abstraction to address them:

• There may be many sensors, reporting more information than we can afford to relay on the network. Because we expect that multiple algorithms will be deployed, each operating as an autonomous agent to identify a specific kind of challenge, we want the sensors to remain unaware of the number and identity of their listeners. Moreover, the relative network location of detection algorithms and sensors impacts the accuracy we need on the information we receive.

• Detection, remediation and diagnostic actions are delayed with respect to sensing activities, yet they may require detailed information on past events that preceded a trigger. The required lifetime of individual events is therefore hard to predict, but it requires careful management given the potential amount of generated information.

• New components will be deployed over time, to better identify and remediate unforeseen and foreseen challenges. They will likely alter the coupling between data by introducing new relationships and attributes. While this is an essential feature to guarantee successful evolution of machine-learning algorithms beyond their initial programming, it also implies that no database schema can be established in advance. Yet the dynamics of information make identifier-based solutions such as IF-MAP [TGC09] not applicable “as is”.

To address these problems, the Distributed Information Store for Challenges and their Outcome (DISco) provides the following features:

• an aggregation-capable publish/subscribe function that relays information between sensors, detectors, and mitigators.

• an annotation system, coupled with more conventional database-like lookups, that allows detectors and mitigators to further classify sensor information and adjust its lifetime accordingly.



• a distributed (peer-to-peer) storage system that provides system-wide longer-term persistence for data that have been “elected for diagnostic”, taking into account the existence of “natural” storage space such as routing tables.

6.1 Design Principles

The following design principles steered us from the problem description to the architecture proposed in sections 6.2 and 6.3.

6.1.1 Peer-to-Peer Distributed System

Distributed monitoring infrastructures typically follow a strongly hierarchical approach, where a device at a low level of the hierarchy receives and processes a large amount of information about a small area and reports its conclusions to the level immediately above, up to a central device that has a coarse but complete view of the network and takes decisions that are forwarded and enforced down the hierarchy. We argue that such an approach lacks scalability because it excessively decouples data-plane monitoring and enforcement from decisions that are delegated to the management plane.

Advances in unstructured peer-to-peer hash tables and messaging (especially pub/sub) systems would, by comparison, allow any number of devices to cooperate so that the control plane of a device can obtain the network-wide context it lacks to locally process fine-grained events describing its own behaviour, decide on the required changes, and inform peers of its decision to avoid inconsistent global behaviours.

6.1.2 Multi-Resolution Information

When a detection or remediation algorithm receives an event report, it is essential that it can gather additional information that was not explicitly included in that report. Thanks to the coupling between the pub-sub system and the distributed database, algorithms must be able to “zoom” into an event by collecting additional information with a specified scope in time, location and layers. This is a principle we share with [Xia09], although the rest of our approach differs from that proposal.

6.1.3 Learning-Ready Data Model

Information relayed by DISco must ultimately be useful as input to machine learning algorithms, especially classification and regression algorithms that will provide the configuration parameters of the adaptive detection and remediation blocks, but also to automate the identification of meaningful symptoms for a given problem among a huge number of measurable parameters. In that regard, we preferred a machine-oriented representation (tuples of numbers) over log entries. Alarms, reports and notifications are mapped to such tuples, where a special member identifies the nature of the event (the event identifier) and the rest consists of an arbitrary number of attributes.

Identifiers for events and attributes relayed through DISco need to have commonly agreed semantics for all components in the system. We propose to reuse for this purpose the concept of Vocabulary Specification Trees (VSTs) described in the monitoring framework of the European ANA project [GGH+09]. Vocabularies provide an easy-to-manage and extensible collection of terms that are hierarchically organised by an “IS-A” relationship. We would, for instance, express that bandwidth is a connection-related metric by placing it under the connection node of the vocabulary tree rooted at metrics. As with ontologies, VSTs make it possible to distinguish concepts: metrics.connection.bandwidth is different from resources.link.bandwidth, which expresses how fast raw data can be sent over a link. Each concept is thus a node in a tree and can be refined by adding child concepts to it.
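To make the “IS-A” organisation concrete, here is a minimal sketch of a vocabulary tree keyed by dotted names. The `Vocabulary` class is an assumption made for illustration; it is not the VST implementation of the ANA project.

```python
class Vocabulary:
    """Toy vocabulary tree: each dotted name is a concept below its prefix."""

    def __init__(self):
        self._parent = {}  # concept -> parent concept (None for a root)

    def add(self, dotted_name):
        """Register a concept together with all of its ancestors."""
        parts = dotted_name.split(".")
        for i in range(1, len(parts) + 1):
            concept = ".".join(parts[:i])
            parent = ".".join(parts[:i - 1]) or None
            self._parent.setdefault(concept, parent)

    def is_a(self, concept, ancestor):
        """True if concept equals ancestor or lies below it in the tree."""
        while concept is not None:
            if concept == ancestor:
                return True
            concept = self._parent.get(concept)
        return False

vocab = Vocabulary()
vocab.add("metrics.connection.bandwidth")
vocab.add("resources.link.bandwidth")
```

With this structure, the two bandwidth concepts of the example remain distinct because they hang under different roots.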

6.1.4 Keeping Management Apart

Tasks in the management plane typically occur at a different pace and with a level of abstraction that differs from the control and data planes. Therefore, our proposal doesn’t include any mechanism for, e.g., (human-readable) self-description of exchanged information. Assignment of numerical identifiers to events and attributes, for instance, can be synchronised independently with the assignments of other devices at initialisation. Numerical IDs are suited to run-time processing, while their connection to concepts in the VST allows extension to a richer database, e.g., by linking reports to metrics as “challenge.X impairs metrics.Y”. This database can be stored in a separate relation table available to human-assisted, or solver-based, diagnostic and refinement tasks.

6.2 Offered Services

DISco provides four main services that allow components of the resilience framework to share information. Events can be published and subscribed to, as in any publish & subscribe system, specifying at the same time some filters and/or aggregators. The identifiers for the events and their attributes are known throughout the system thanks to a vocabulary, as explained in section 6.1.3. When an event is received, the subscriber can reply to this event in order to provide feedback (such as annotation tags). Finally, lookups can be performed in order to retrieve information about past events.

Publish

Input: - event id
       - list of attribute-value pairs

A published event is composed only of the event identifier and a list of associated attributes. The publisher remains a simple process that sends information to the system without considering how (and how much) aggregation or filtering has to be performed on these events before they are transmitted to the subscribers (if any is interested in its publications).

Subscribe

Input: - event id
       - rejected attributes (“attribute filters”)
       - constraints on attributes (“event filters”)
       - specification of aggregation

When interested in a particular type of event, the subscriber provides the event identifier and specifies the desired granularity and the information carried by received events. Filters can be defined to reject some attributes (considered by the subscriber to be of no interest); these are called attribute filters. This allows more flexibility than having to list explicitly all the desired attributes. Other filters can be used to associate matching rules with attributes, such that only events with matching values will be forwarded to the subscriber (event filters).
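The two kinds of filters can be illustrated with a small sketch. The function names, the dictionary layout and the event identifier `event.tcp.pending` are assumptions made for this example; the deliverable does not define a concrete programming interface.

```python
def make_subscription(event_id, rejected_attributes=(), constraints=None):
    """Describe a subscription: attribute filters plus event filters."""
    return {
        "event_id": event_id,
        "rejected": set(rejected_attributes),  # attribute filters
        "constraints": constraints or {},      # event filters: name -> predicate
    }

def deliver(event_id, attributes, subscription):
    """Return the attributes to forward, or None if the event is filtered out."""
    if event_id != subscription["event_id"]:
        return None
    for name, predicate in subscription["constraints"].items():
        if name not in attributes or not predicate(attributes[name]):
            return None  # an event filter rejected the whole event
    return {k: v for k, v in attributes.items()
            if k not in subscription["rejected"]}

sub = make_subscription(
    "event.tcp.pending",                     # hypothetical event identifier
    rejected_attributes=["router"],
    constraints={"count": lambda c: c > 100},
)
forwarded = deliver("event.tcp.pending", {"count": 150, "router": "R"}, sub)
dropped = deliver("event.tcp.pending", {"count": 50, "router": "R"}, sub)
```

Note that the publisher side is unchanged by either filter, in line with the principle that publishers remain unaware of their listeners.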



Together with these filters, an aggregator can be appointed, based on pre-defined aggregation types (depending on the event) plus the granularity level. The granularity can be specified either in terms of maximum aggregation level (i.e., no more than x elementary events in an aggregated one) or publication rate (i.e., no more than x aggregated events per second).

Reply

Input: - event id
       - constraints on attributes
       - constraints on timestamps
       - tags

Replies allow a subscriber to send information back to the system following a received event. The main purpose of this service is to provide an easy mechanism to annotate events. These annotations (tags) are useful to keep track of correlation steps and to indicate the relevance of events in order to adjust their lifetime in the store. A single reply can be used to annotate a group of similar events based on timestamps and optional attribute constraints.

Lookup

Input: - searched events
       - searched tags
       - constraints on attributes
       - constraints on timestamps
       - specification of aggregation

Output: - list of tuples

Lookups are used to retrieve past events from the persistent store. These are basically range-based queries on attributes and timestamps for particular events and/or tags. As with database requests, lookups can return all the matching events, or aggregate them (reusing the aggregators that can be used during subscription) in order to limit the size of the response.
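The interplay between Reply and Lookup can be sketched with a purely local, in-memory store. The `EventStore` class, its method signatures and the event and attribute names are assumptions for illustration; the actual store is meant to be distributed.

```python
class EventStore:
    """Toy single-node store supporting publish, reply (tagging) and lookup."""

    def __init__(self):
        self.records = []  # dicts with event_id, timestamp, attributes, tags

    def publish(self, event_id, timestamp, **attributes):
        self.records.append({"event_id": event_id, "timestamp": timestamp,
                             "attributes": attributes, "tags": set()})

    def _match(self, rec, event_id, t_min, t_max, ranges):
        if rec["event_id"] != event_id or not t_min <= rec["timestamp"] <= t_max:
            return False
        return all(name in rec["attributes"]
                   and lo <= rec["attributes"][name] <= hi
                   for name, (lo, hi) in ranges.items())

    def reply(self, event_id, t_min, t_max, tags, ranges=None):
        """Annotate every matching event with tags; return how many matched."""
        hits = [r for r in self.records
                if self._match(r, event_id, t_min, t_max, ranges or {})]
        for r in hits:
            r["tags"].update(tags)
        return len(hits)

    def lookup(self, event_id, t_min, t_max, ranges=None, tag=None):
        """Range query on timestamps and attributes, optionally on one tag."""
        return [r for r in self.records
                if self._match(r, event_id, t_min, t_max, ranges or {})
                and (tag is None or tag in r["tags"])]

store = EventStore()
store.publish("event.network.drops", 10.0, ttl=1)
store.publish("event.network.drops", 20.0, ttl=3)
store.publish("event.network.drops", 90.0, ttl=1)
tagged = store.reply("event.network.drops", 0.0, 30.0, {"ddos-candidate"},
                     ranges={"ttl": (0, 2)})
kept = store.lookup("event.network.drops", 0.0, 100.0, tag="ddos-candidate")
```

A single reply here annotates all events in a time window that satisfy the attribute constraint, and a later lookup retrieves them by tag.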

6.3 Internal Architecture

The general, high-level architecture of the Distributed Information Store is depicted in Figure 16. Basically, it consists of a publish & subscribe system, which uses a distributed storage to provide data persistence in the short and longer term. Information to be published may first be filtered and/or aggregated. A lookup proxy will be used to take advantage of already existing local stores (such as routing tables), giving access to such information in a transparent way if relevant. These sub-systems will run on top of a connectivity layer ensuring sufficient reliability for the internal communications. A brief description of all these elements is given hereafter.

6.3.1 Publish & Subscribe System

The Publish & Subscribe mechanism is used for near real-time notification of events, and is thus the preferred path for information exchange between detection and remediation blocks. While detection blocks are quite naturally associated with publishers, and remediation blocks with subscribers, it is also expected that several “correlation blocks” will use the information extracted from the various sources they subscribed to in order to publish higher-level events. In typical situations, several steps of correlation will be necessary to gradually obtain a reasonably complete view of an ongoing challenge:



Figure 16: General architecture of the Distributed Information Store. [Diagram labels: Sensors, Correlators, Mitigators; Publish, Subscribe, Reply, Look up, Store; DISco Core Functions; Publish & Subscribe; Filters / Aggregators; Lookup Proxy; Distributed Store; Retention Manager; Local Stores; Connectivity.]

• At the bottom, simple “sensors” publish purely descriptive data on what they are monitoring.

• Detection blocks may subscribe to such raw data, and publish an event when they detect suspicious behaviour.

• One step further, other blocks gather such events and try to correlate them, publishing new higher-level notifications.

• Hopefully, at the final step, the challenge can be fully characterised and the remediation blocks are notified.

• Still, more information can be published by the remediation blocks, for instance to specify the countermeasures taken.

Besides these two actions (publish, subscribe), we envision a reply mechanism. This “reply” to publications can be used to annotate the data, e.g., a subscriber tells the notifier that the published summary of some data is of importance and that the detailed data should be kept for a longer duration. The annotation tags can provide a link between low-level and high-level (correlated) events. Replies might also guide the auto-configuration of aggregators (see next section).



6.3.2 Filtering and Aggregation System

DISco is intended to be deployed in large systems, with a possibly huge volume of information to process. While there is no a priori limitation on the amount of published data and the granularity of events, a practical solution has to keep bandwidth and storage usage as low as possible. This leads to two possible solutions: either publishers are responsible for limiting the number of events they generate, or the subscribers specify to the system the granularity level they are interested in.

Nevertheless, only the second approach will succeed in efficiently reducing the data volume without losing essential information, mainly for two reasons. First, publishers have little or no knowledge of the level of interest in their publications, possibly leading them to largely under- (or over-)estimate the optimal granularity. Second, subscribers will be able to dynamically reconfigure (e.g., through a new subscription) the desired granularity in order to tell publishers to be more verbose when they detect suspicious events. This keeps management overhead relatively low during nominal operation, while gathering more precise data during challenges, enabling better detection and/or diagnosis.

When subscribing to a particular event, the subscriber can specify limitations on the amount of published notifications through

• Filters, discarding non-relevant events and/or attributes, and

• Aggregators, gathering several events to produce a single, coarser-grained notification.

It is important to underline that the only role of aggregation is to merge similar events, the combination of which is an event of the same type (attributes are altered, though). This must not be confused with correlation, which extracts information from several events in order to deduce one of a new type. Correlation will not be performed by DISco itself, although correlation blocks are the typical example of elements that are both a subscriber and a publisher.

When specifying aggregation, the system has to be told how and how much to aggregate. Although one could imagine leaving full freedom in this specification, through, for instance, programmable aggregators, this would raise huge implementation (and possibly security) problems. Instead, predefined aggregators will be selectable, depending on the type of event, and following the subscribers’ needs. These will be defined, for instance, using the provided vocabulary.

Regarding the granularity level, two kinds of specification can be provided. In the first, the event rate has to be controlled strictly, and the aggregator delivers a periodic summary containing more or fewer base events, depending on the number of events generated during the interval considered. In the second, a single aggregated event always contains the same number of base events. In this case, traffic will not be uniform over time, which helps to better capture critical situations.
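The two granularity specifications can be contrasted with a short sketch operating on a list of (timestamp, value) base events. The function names and the summary format are assumptions; the predefined aggregators of DISco are not specified at this level of detail.

```python
def summarise(chunk):
    """Merge base events into one event of the same type with altered attributes."""
    values = [v for _, v in chunk]
    return {"first": chunk[0][0], "last": chunk[-1][0],
            "count": len(values), "sum": sum(values)}

def aggregate_by_count(events, max_events):
    """Fixed aggregation level: one summary per max_events base events.

    The last summary may hold fewer events if the input does not divide evenly.
    """
    return [summarise(events[i:i + max_events])
            for i in range(0, len(events), max_events)]

def aggregate_by_period(events, period_s):
    """Fixed rate: at most one summary per period (empty periods are skipped)."""
    buckets = {}
    for ts, value in events:
        buckets.setdefault(int(ts // period_s), []).append((ts, value))
    return [summarise(buckets[k]) for k in sorted(buckets)]

events = [(0.1, 1), (0.2, 1), (0.3, 1), (5.0, 1)]  # a burst, then a late event
by_count = aggregate_by_count(events, max_events=2)
by_period = aggregate_by_period(events, period_s=1.0)
```

In this example the count-based aggregator emits its first summary within the burst, whereas the period-based one folds the whole burst into a single summary for that second.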

6.3.3 Distributed Storage System

The published events are kept in a distributed storage across participating nodes for further analysis. This serves both detection algorithms requesting recent events (short-term storage) and delayed processes running diagnosis on a larger scale (long-term storage).



Many distributed stores are based on distributed hash tables (DHTs), basically using a hash of an element identifier to determine the node on which it has to be stored. While this approach is used by many peer-to-peer systems which need to search single elements based on their identifiers, DISco is required to handle more advanced lookups, supporting range-based queries on several attributes (such as IP range, time intervals, thresholds on values, and so on), while considering others as wildcards. In practice, two multi-attribute, range-queriable systems of interest have been identified and are briefly discussed in Section 6.4.2.

As already mentioned, particular attention has to be paid to managing the lifetime of the stored reports. Ideally, element removal will be handled autonomously by DISco itself through the Retention Manager. It will use information such as number of subscribers, past lookups, and annotations as hints that a specific data entry needs to be “promoted” to longer storage (typically, for the diagnosis phase). Static and manual configuration should be used only to define defaults and characterise retention length depending on the hints mentioned above.

6.3.4 Heterogeneous Storage

While DISco provides, from a logical point of view, a single store (even if physically distributed) for all published events, it could be beneficial to take into account pre-existing “natural” stores of information (such as BGP routing tables), and give access to them through DISco in a unified way. Instead of duplicating this information directly in DISco, the goal is to keep only location hints, relying on the lookup proxy to make this information accessible transparently (possibly performing some data formatting too).

6.3.5 Connectivity Layer

The connectivity layer corresponds to a resilient communication infrastructure that can be provisioned with limited, but guaranteed bandwidth (i.e., defended against volume-based attacks), and possibly using secure channels (i.e., encrypted communications and known partners). Its goal is to decouple DISco-related traffic (i.e., management traffic) from the monitored traffic. We also assume that it provides the peer authentication and integrity of DISco-internal traffic that we need to ensure that no forgery or falsification of reports can occur.

6.4 Available Components

6.4.1 Resilient Communication Infrastructure

A resilient communication infrastructure is being developed in Task 3.5, using overlay connectivity for management traffic. It should meet all the requirements of the connectivity layer mentioned in the previous section.

6.4.2 Peer-to-peer Storage Systems

DISco storage capabilities rely on an underlying distributed system, which is required to support multi-attribute range queries in order to be able to perform any necessary lookup. This excludes many DHT-based systems, since the use of hash functions to (pseudo-)randomly distribute elements amongst participating nodes makes range queries hard to perform. Other structures have been developed to handle this kind of query, but we identified only two of interest that support multi-attribute range queries: Mercury [BAS04] and SkipTree [AGT10].

Mercury. Mercury is the distributed database at the root of ANA’s M.C.I.S [CGG+08]. It supports multi-attribute range-based search and performs explicit load balancing. Its internal working is briefly described hereafter.

The nodes of the system are logically divided into attribute hubs, each of them responsible for a specific attribute. Each hub is organised as a circular overlay, and each node in that hub is responsible for a contiguous range of the attribute. When routing a particular request with several specified ranges, the attribute corresponding to the a priori most restrictive range (i.e. the one expected to return the fewest results) is selected. The request is then routed to the associated hub, and other attribute ranges are only used to filter the answer.

The consequence is that for each stored object, a copy is held in each attribute hub. This means that an object is duplicated for each of its attributes on which we want to be able to perform range queries efficiently, which quickly becomes an issue in high-dimensional spaces.
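The hub selection step can be sketched as follows. This is a deliberately simplified, centralised illustration: it assumes that selectivity can be estimated as the fraction of the attribute domain covered by the queried range (i.e. uniformly distributed values), which is our assumption and not a description of Mercury's actual estimator. All names are hypothetical.

```python
def pick_hub(query, domains):
    """Return the attribute whose queried range covers the smallest share of its domain."""
    def selectivity(attr):
        lo, hi = query[attr]
        dom_lo, dom_hi = domains[attr]
        return (hi - lo) / (dom_hi - dom_lo)
    return min(query, key=selectivity)

def answer(query, domains, objects):
    """Resolve a multi-attribute range query through a single attribute hub."""
    hub = pick_hub(query, domains)
    lo, hi = query[hub]
    # Range scan inside the selected hub (every hub holds a copy of each object).
    candidates = [o for o in objects if lo <= o[hub] <= hi]
    # The remaining ranges are only used to filter the answer.
    return [o for o in candidates
            if all(l <= o[a] <= h for a, (l, h) in query.items())]

domains = {"timestamp": (0, 1000), "ttl": (0, 255)}
query = {"timestamp": (100, 200), "ttl": (0, 200)}
objects = [{"timestamp": 150, "ttl": 10},
           {"timestamp": 150, "ttl": 250},
           {"timestamp": 900, "ttl": 10}]
hub = pick_hub(query, domains)
matches = answer(query, domains, objects)
```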

SkipTree. The SkipTree structure heavily relies on SkipNet [HJS+03], keeping its fundamental locality properties while extending it in order to handle multi-dimensional spaces. Basically, the key space is partitioned into n regions corresponding to the n network nodes. A binary partition tree is used to assign regions to nodes and to help route any request to the appropriate nodes.

The locality property refers to the ability to explicitly place data on specific nodes, or on a specific subset of nodes, for instance only within a given organisation. Moreover, it guarantees that messages between two nodes within the same organisation are routed within that organisation only.

SkipTree has very interesting properties, but is still under development and being evaluated on large testbeds. In contrast, Mercury has long been available and is already used in many systems.

6.4.3 Pub/Sub Systems

Many decentralised publish & subscribe systems are based on an underlying DHT (or other distributed structure), while, similarly, many distributed stores also come with an implementation of a publish & subscribe interface. For instance, Mercury provides such a feature through its subscription language and protocol.

An example of another decentralised pub/sub system is SCRIBE [RKCD01], built on top of Pastry [RD01] (a DHT-based system), which it uses to create topics and disseminate events through efficient multicast trees. A ready-to-use implementation of SCRIBE is already available for the OMNeT++ network simulator (http://www.omnetpp.org/), allowing for a quick deployment of the base functionalities of DISco.


6.4.4 Correlation Engines

Correlating events to draw conclusions is a key feature of the challenge detection process. We have specifically investigated the possibilities offered by ISS and Chronicles, two engines developed by project partners, and present here the way they can be integrated with DISco.

The Information Sensing and Sharing framework (ISS [SFH07]) has been developed at Lancaster as part of the functional composition framework of the ANA project. While it wouldn’t really be used inside DISco, it is a good way to build correlators and more sophisticated sensors on nodes. DISco’s publish and subscribe system naturally extends the point-to-point data delivery that is already present in ISS.

Chronicles recognition is a mechanism for temporal event correlation that has been successfully applied to network intrusion detection at Orange Labs [MD03]. It features its own inter-connection mechanisms to build a hierarchically organised distribution of detection efforts, which again could benefit from DISco’s peer-to-peer nature to improve scalability and dependability. Its strong dependency on time-related aspects puts an interesting constraint on how DISco could perform aggregation and filtering.

We are aware that those two mechanisms may not cover the full spectrum of detection techniques that we might want to apply to provide network resilience. This speaks in favour of keeping correlation tasks out of DISco and implementing them as part of the DISco “clients”.

6.4.5 Standard Network Monitoring Protocols

Out of the existing network monitoring protocols, we have specifically investigated NetFlow, SNMP and syslog for interoperability with DISco. They serve different purposes and are widely deployed in existing products. They typically fit a sensor component of Fig. 16, but, as with the external data stores discussed in Section 6.3.4, they need an additional translation function that converts their reports into DISco event reports.

This translation typically includes the identification of the event label as well as attribute extraction and conversion by matching the external notification against known patterns. In the case of NetFlow, the mapping can be fairly straightforward, especially thanks to the availability of compound value support in DISco. On the other hand, syslog entries would require a deep knowledge of the applications that generated them to proceed with matching and extraction. To a large extent, it would be preferable to alter the applications so that they natively support DISco pub/sub interfaces and to have an external exporter translate DISco events into human-readable syslog events, rather than the other way round.
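As an illustration of such a translation function for flow records, the sketch below maps a record given as a plain dictionary to an event identifier and attributes, grouping the flow identification into a compound value. The dictionary keys are loosely modelled on NetFlow v5 field names, and the event identifier and attribute names are hypothetical; none of this is prescribed by the deliverable.

```python
def netflow_to_disco(record, router):
    """Translate one flow record (a dict) into an (event_id, attributes) pair."""
    flow = (record["srcaddr"], record["dstaddr"],
            record["srcport"], record["dstport"], record["prot"])
    attributes = {
        "flow": flow,                   # compound value: the flow 5-tuple
        "packets": record["dPkts"],
        "bytes": record["dOctets"],
        "location": router,             # reporting router
        "timestamp": record["last"],    # end-of-flow time as given by the exporter
    }
    return "event.network.flow.report", attributes  # hypothetical identifier

sample = {"srcaddr": "10.0.0.1", "dstaddr": "4.2.17.9", "srcport": 40000,
          "dstport": 80, "prot": 6, "dPkts": 12, "dOctets": 5400, "last": 123456}
event_id, attrs = netflow_to_disco(sample, router="R")
```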

Finally, the SNMP protocol provides much more than the functionality we propose in DISco, and its trap mechanism (used to report notifications asynchronously) is the most interesting feature for our real-time approach. It should be noted, however, that although an SNMP trap is linked to an object whose value can be looked up later, SNMP daemons typically do not keep track of the evolution of individual data, and it wouldn’t be possible to look in more detail at a reported situation unless the aggregation happened after the translation step. Again, publishing translated SNMP traps into DISco should thus be seen as a cheap, transitional alternative to native support of DISco in the network stack.


Figure 17: Network under DDoS challenge, L being the overloaded bottleneck link and R−T−V the path to the victim through the network

6.5 Concept Validation: DDoS Detection

6.5.1 Network-Network Interaction

To illustrate the way DISco works, let us consider the following denial-of-service attempt on the network depicted in Fig. 17. Attackers target link L, which is required to reach a victim attached to V. They additionally identified that traffic towards destinations attached to U also uses this link and thus use addresses Ui as well to dilute the signature of their attack. We also assume that attackers decided to have their attack traffic dropped a few hops after L (by means of the Time To Live field or any similar hop count limiting technique) in order to further evade detection by systems in V’s network. This, however, makes their traffic look singular in the network containing L.

Routers in this example monitor the number of IP packets that are rejected by the forwarding process, including those whose time to live is exhausted. They report this through event.network.drops.forwarding.rfc791-ttl-exceeded, which contains (as attributes) flow identification (made of source/destination addresses, ports and transport protocol, in the case of IPv4), location of the reporting router and timestamp of occurrence. These events are published by V and U in the distributed store. Similarly, a large number of “queue full” events occurring at R will trigger the execution of DoS-detection algorithms local to R, such as identifying the destinations of the largest flows. A further analysis process A that was dormant in a system close to R previously subscribed to “any heavy-flow report event” from R (e.g., computed by a lossy counting approach, see sec. 3.4), and possibly from other routers in the same point of presence (PoP).

Upon reception of heavy-flow reports from R, A will additionally subscribe to events reporting network-related errors downstream from R, enabling collection of reports from T, V and U. We assume here that A is an “expert” software agent that looks for and identifies the specific kind of DDoS attack we described above. The publish/subscribe mechanism fully allows multiple similar systems to execute concurrently and perform their own analysis using the same initial events.

The following features of DISco are highlighted in this example:

Local holding of data: information about packets dropped at V and U is put under the control of DISco, but not transferred to a remote system until interest in such information is expressed through a subscribe call. Yet, it is important that such data can be looked up a posteriori, for instance when process A tries to gather recent past statistics to figure out the dynamics of the challenge. Temporal aggregation can still be applied to reduce the available granularity of information over time.

Selective subscription: while A needs extra information from T, U and V, it is only interested in information related to a fraction of the traffic those routers forward. For instance, if it identified 4.2.0.0/16 to be the destination of heavy hitters, it will add a constraint on attributes stating that attribute.flow.rfc791-destination-address must match that range.

Compound values: note that while A describes filtering on “destination address”, V and U put flow identification together in a compound value. The schema of this compound needs to be known by DISco peers so that filtering/aggregation components are able to extract and compare the destination addresses.

Aggregating events from multiple sources: A typically makes no distinction between reports coming from U and V as long as they match the filter. This highlights the need for describing a region of the network through an attribute constraint.

Flexibility: A could broaden its monitoring criterion by subscribing to event.network.drops*, and receive notification of packet losses in the network regardless of whether they are due to TTL issues, congestion (queue full or early notifications), broken link, unknown destination, etc. This relieves A from knowing the actual network protocol stack details (as an ICMP snooping agent would have to) and allows monitoring of events originating from different layers as needed.
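The selective subscription, the extraction from a compound value and the wildcard subscription can be sketched together. This assumes that the flow compound is represented as a dictionary with a `dst` member and that the trailing `*` behaves like a shell-style wildcard; both are assumptions made for illustration, as are the function names.

```python
import ipaddress
from fnmatch import fnmatchcase

def event_matches(pattern, event_id):
    """Match an event identifier against a subscription pattern such as 'a.b*'."""
    return fnmatchcase(event_id, pattern)

def destination_in(prefix):
    """Build an event filter that extracts the destination from the flow compound."""
    network = ipaddress.ip_network(prefix)
    return lambda flow: ipaddress.ip_address(flow["dst"]) in network

pattern = "event.network.drops*"
heavy_hitter_filter = destination_in("4.2.0.0/16")

report = {
    "event_id": "event.network.drops.forwarding.rfc791-ttl-exceeded",
    "flow": {"src": "10.1.2.3", "dst": "4.2.17.9",
             "sport": 1234, "dport": 80, "proto": 6},
    "location": "V",
}
# A accepts the report whatever router (U or V) published it.
accepted = (event_matches(pattern, report["event_id"])
            and heavy_hitter_filter(report["flow"]))
```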

6.5.2 Network-Server Interaction

We consider here a more “classical” DDoS attack, where the victim server S is really receivingapplication-level requests through transport-level connections. S locally observes those requestpatterns and their effect on system resources such as CPU and memory load, or access tointernal databases. Deviation from sustainable behaviours are reported as events deriving fromevent.server.overload* and include at least flow identification of the “faulty” connection.

When the agent A coaching a router like R observes an abnormal traffic share towards S’s prefix (again, using, e.g., a lossy counting mechanism), it may subscribe to “server overload” events to help decide whether the currently observed challenge is a DDoS attempt. This assumes that a DDoS is more likely to use resource-intensive requests while, during a flash crowd, the time and resources needed to serve the requests do not deviate from normal behaviour and only the number of requests per unit of time surges.
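As an illustration of how an abnormal traffic share could be spotted, the following is a minimal sketch of the basic (deterministic) lossy counting algorithm. It is not the probabilistic variant of [DHK08], and nothing here is taken from an existing agent implementation: the LossyCounter class, the choice of counting per destination prefix and the parameter values are assumptions.

```python
import math


class LossyCounter:
    """Basic lossy counting: approximate per-key counts with an
    undercount of at most epsilon * n after n observations."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # items seen so far
        self.entries = {}                    # key -> [count, delta]

    def add(self, key):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if key in self.entries:
            self.entries[key][0] += 1
        else:
            # delta bounds how many occurrences may have been missed.
            self.entries[key] = [1, bucket - 1]
        if self.n % self.width == 0:
            # End of bucket: drop keys that cannot be frequent.
            self.entries = {
                k: v for k, v in self.entries.items()
                if v[0] + v[1] > bucket
            }

    def heavy_hitters(self, support):
        """Keys whose share of the stream may reach `support`."""
        threshold = (support - self.epsilon) * self.n
        return {k: c for k, (c, _) in self.entries.items()
                if c >= threshold}


# Assumed workload: 40% of packets go to one prefix, the rest are
# spread over destinations that each appear once.
counter = LossyCounter(epsilon=0.01)
for i in range(1000):
    counter.add("4.2.0.0/16" if i % 5 < 2 else f"other-{i}")
hitters = counter.heavy_hitters(support=0.2)
```

Memory stays bounded because infrequent keys are pruned at each bucket boundary, which is what makes such a mechanism attractive on a router-side agent.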

Similarly, a resilience agent co-hosted with S would subscribe to “challenge detection reports” and “remediation action reports” produced by networks delivering traffic to S. If needed, the DISco peers that receive this subscription can ensure that the agent only subscribes to information it is entitled to receive (that is, check the presence of an IP-destination filter).

This approach is especially attractive in content delivery networks (CDNs) [KMS+09], where a single economic entity owns both the access network/routers and server farms. It can still be of high interest as a way to train learning-capable detectors. In that alternative, detection and remediation algorithms do not directly use “server overload” events, but instead use information coming from multiple symptoms to identify symptom combinations that reveal a DDoS challenge. In a refinement step, diagnostic agents look up overload events in the time interval between “challenge detected” and “end of challenge detected”. The correlation between the expected state (not challenged, detection in progress, remediation applied, ...) and the number of “overload” reports is used to assess the suitability of the detection process.
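The refinement step can be sketched as follows, under simplifying assumptions: the expected state is given as a list of half-open time intervals, overload reports are reduced to timestamps, and the "correlation" is approximated by a per-state report rate. The function name and the data layout are hypothetical and the numbers are made up for the example.

```python
from collections import defaultdict


def overload_rate_per_state(intervals, overload_times):
    """Return the number of overload reports per unit of time for each
    expected state. `intervals` holds (start, end, state) tuples that
    cover the half-open range [start, end)."""
    duration = defaultdict(float)
    count = defaultdict(int)
    for start, end, state in intervals:
        duration[state] += end - start
        count[state] += sum(1 for t in overload_times if start <= t < end)
    return {s: count[s] / duration[s] for s in duration if duration[s] > 0}


# Assumed timeline, in seconds since the start of the observation.
intervals = [
    (0, 100, "not challenged"),
    (100, 120, "detection in progress"),
    (120, 200, "remediation applied"),
]
overload_times = [5, 101, 103, 110, 115, 119, 130]
rates = overload_rate_per_state(intervals, overload_times)
```

A detection process that behaves well would be expected to show a markedly higher rate while detection is in progress than in the other two states; similar rates across all states would suggest that the detected challenge and the server overload are unrelated.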

7 Conclusion and Future Works

We have presented a set of requirements of an architecture for a challenge detection system for network resilience. The aim of the architecture is to enable mechanisms for remediation (such as those presented in D2.3a [SCS+09]) to make appropriate mitigation decisions. We identify three main activities a detection system should carry out: detection of the symptoms of a challenge (e.g., via anomaly or signature-based methods), challenge identification (e.g., via classification or root cause analysis), and impact analysis. We have surveyed the state of the art in these areas and highlighted where future work is necessary. Specifically, for example, we show there are issues surrounding the complexities of performing real-time detection in resource-constrained environments, and the generality and robustness of known detection techniques in deployment scenarios.

Taking the identified limitations into account, further work on an architecture for detection will be conducted, using that presented in Section 4 for WMNs as a basis. We will investigate appropriate monitoring mechanisms, in conjunction with activities on Task 1.5 in the project, and when possible draw on research results from the EU-funded ANA and ECODE projects for monitoring and detection functionality, respectively. Ongoing work on detection for WMNs will involve validating the general architecture in the context of other challenges, e.g., an attack, and further developing the prototype implementation of the architecture to detect wireless interference, such that appropriate remediation (e.g., channel selection) can be carried out.

To validate our general architecture for resilience detection, future work will involve further developing the scenarios through which we want to evaluate our work, in addition to the WMN scenario presented here. Specifically, we wish to use a scenario based around an operator’s network, using a representative topology and challenge, such as a botnet-based DDoS attack. (Ongoing activities involve determining current trends in botnet-based attacks, which we will draw on to develop this scenario.) As part of developing the scenario, we will determine appropriate simulation environments and challenge models (e.g., traffic traces) to be used.

We also have applied the ResumeNet resilience strategy to an opportunistic networking scenario, showing first results on how this strategy can enhance the network over time.

The next step of our work is to refine the service specification for the store-carry-forward transport service to reflect resource limitations of the network nodes. From this limitation, several complications arise that impact the detection mechanisms.

We presented in this deliverable the design and service API of DISco, our distributed information store that is intended to glue together detection, remediation and diagnostic steps of the D2R2+DR strategy. Along these lines, we intend to bring up a prototype implementation of DISco’s core features in OMNeT++ in order to validate our design and provide an environment for future performance/dimensioning studies.

We will also bring additional care to the actual aggregation mechanisms that will be embedded in DISco. A critical aspect, for instance, is the expression of aggregates of network addresses or network flows with sufficient flexibility and the ability to invert this aggregation on reply calls in order to update the initial reports with the additional information provided, e.g., by root cause analysis components.

In addition to the OMNeT++ implementation, we plan to work on an integration of DISco with the ad hoc Information Provider, Support Engine and Messaging Client in the prototype described in Section 4.2. Special care will be taken to guarantee interoperability despite the multi-language nature of the implementation.



A List Of Publications

• “On Realising a Strategy for Resilience in Opportunistic Networks”, Marcus Scholler, Paul Smith, Christian Rohner, Merkouris Karaliopoulos, Abdul Jabbar, James P.G. Sterbenz, and David Hutchison, to be published at Future Network and Mobile Summit 2010, Florence, Italy, 2010

B Top 5 Challenges in WMNs

The following list comprises some of the most important challenges which can affect wireless mesh networks:

1. Jamming. This is a challenge which can affect any wireless communication system. In general, the jammer aims to make the signal undecodable at the destination. Therefore, producing the effects of a denial of service at the physical layer is relatively easy in the case of wireless networks. It is often assumed that the attacker has access to a more advanced set of equipment (directional antennas with high transmission power, among others). The attack is successful when the jamming power (as received by the jammed device) is so high that it disrupts the message reception. FHSS and DSSS are being used to prevent jamming, but they have limitations.

2. Attacks on routing and neighborhood discovery. Due to the completely decentralized nature of wireless mesh networks, there is a large number of attacks which can be performed with respect to routing. The general goals of an attacker are: increase control over the communication channel between targeted nodes, degrade the quality of service and increase the resource consumption at some nodes (CPU, memory etc.). The techniques being used include, but are not limited to: forging control packets with fake information or sending them under false identities (spoofing), creating wormholes/tunnels, eavesdropping, replaying or deleting routing packets. A typical attack consists in redirecting a route via a specified node, which subsequently performs certain actions in accordance with the aforementioned attacker goals. It is relatively difficult to prevent such attacks in the absence of a full trust relation between nodes that form the network.

3. Network layer selfishness/packet forwarding selfishness. Nodes selected as part of a route can decide to drop, deterministically or probabilistically, packets that they should in fact forward. This is done either to limit the local resource consumption, so that the resources can be reallocated for the node's own purposes, or to limit the power consumption (in case of mobile devices with limited power). Selfish devices will generally declare during the route discovery stage that they will perform packet forwarding in order to avoid detection of their selfish behaviour by other nodes, which may in turn refuse to offer them forwarding services. When applied to a congestion control mechanism such as TCP, carefully planned packet dropping can create the Jellyfish attack, which creates at the source level the impression that one of the nodes on the route is severely congested, without the possibility to actually identify it.

4. Selfishness at MAC layer. This occurs when one or more stations try to reserve more airtime for themselves, thus breaking the specifications of the MAC protocol. In the case of IEEE 802.11 protocols, this occurs by colliding packets with other stations to force them to back off, manipulating Network Allocation Vector (NAV) values or by setting the size of the contention window deliberately to low levels.

5. Interference. Ideally, the radio links in a wireless mesh network would be scheduled in such a way as to maintain the fairness between all the data flows, while at the same time maximizing the global throughput. However, in practice, estimating locally the interference and avoiding network regions with high interference is relatively difficult to achieve. Therefore, flows may experience congestion and packet dropping.

References

[ACP09] Georgios Androulidakis, Vassilis Chatzigiannakis, and Symeon Papavassiliou. Network anomaly detection and classification via opportunistic sampling. IEEE Network, 23(1):6–12, 2009.

[AGT10] Saeed Alaei, Mohammad Ghodsi, and Mohammad Toossi. Skiptree: A new scalable distributed data structure on multidimensional data supporting range-queries. Comput. Commun., 33(1):73–82, 2010.

[BAS04] A. R. Bharambe, M. Agrawal, and S. Seshan. Mercury: supporting scalable multi-attribute range queries. In Proc. ACM SIGCOMM ’04, pages 353–366, 2004.

[BCD06] E. Borgia, M. Conti, and F. Delmastro. Mobileman: design, integration, and experimentation of cross-layer mobile multihop ad hoc networks. Communications Magazine, IEEE, 44(7):80–85, July 2006.

[BCKM02] Anat Bremler-Barr, Edith Cohen, Haim Kaplan, and Yishay Mansour. Predicting and bypassing end-to-end internet service degradations. In IMW ’02: Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurment, pages 307–320, New York, NY, USA, 2002. ACM.

[BDFV10] Levente Buttyan, Laszlo Dora, Mark Felegyhazi, and Istvan Vajda. Barter trade improves message delivery in opportunistic networks. Ad Hoc Netw., 8(1):1–14, 2010.

[BTA+06] Laurent Bernaille, Renata Teixeira, Ismael Akodkenou, Augustin Soule, and Kave Salamatian. Traffic classification on the fly. SIGCOMM Comput. Commun. Rev., 36(2):23–26, 2006.

[CBK09] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):1–58, 2009.

[CDGS07] Manuel Crotti, Maurizio Dusi, Francesco Gringoli, and Luca Salgarelli. Traffic classification through simple statistical fingerprinting. SIGCOMM Comput. Commun. Rev., 37(1):5–16, 2007.

[CGG+08] F. Cantin, V. Goebel, B. Gueye, D. Kaafar, G. Leduc, S. Martin, M. May, L. Peluso, T. Plagemann, R. Roth, M. Siekkinen, and T. Zseby. The monitoring part of the ana architecture (v1). Technical Report FP6-IST-27489, D.3.3v1, Sixth Framework Programme - Situated and Autonomic Communications (SAC), February 2008.



[CMD+06] Evan Cooke, Richard Mortier, Austin Donnelly, Paul Barham, and Rebecca Isaacs. Reclaiming network-wide visibility using ubiquitous endsystem monitors. In ATEC ’06: Proceedings of the annual conference on USENIX ’06 Annual Technical Conference, pages 32–32, Berkeley, CA, USA, 2006. USENIX Association.

[CPRW03] David D. Clark, Craig Partridge, J. Christopher Ramming, and John T. Wroclawski. A knowledge plane for the internet. In SIGCOMM ’03: Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, pages 3–10, New York, NY, USA, 2003. ACM.

[CZS09] Andre Castelucio, Artur Ziviani, and Ronaldo M. Salles. An as-level overlay network for ip traceback. Netwrk. Mag. of Global Internetwkg., 23(1):36–41, 2009.

[DHK08] Xenofontas Dimitropoulos, Paul Hurley, and Andreas Kind. Probabilistic lossy counting: an efficient algorithm for finding heavy hitters. SIGCOMM Comput. Commun. Rev., 38(1):5–5, 2008.

[FABK03] Nick Feamster, David G. Andersen, Hari Balakrishnan, and M. Frans Kaashoek. Measuring the effects of internet path faults on reactive routing. In SIGMETRICS ’03: Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 126–137, New York, NY, USA, 2003. ACM.

[FVR07] P. Fuxjager, D. Valerio, and F. Ricciato. The myth of non-overlapping channels: interference measurements in ieee 802.11. In Wireless on Demand Network Systems and Services, 2007. WONS ’07. Fourth Annual Conference on, pages 1–8, Jan. 2007.

[GGH+09] V. Goebel, B. Gueye, T. Hossmann, G. Leduc, S. Martin, Ch. Mertz, E. Munthe-Kaas, T. Plagemann, M. Siekkinen, and D. Witaszek. Integrated monitoring support in ana (v1). Technical Report FP6-IST-27489, D.3.7v1, Sixth Framework Programme - Situated and Autonomic Communications (SAC), February 2009.

[GT02] Matthias Grossglauser and David N.C. Tse. Mobility increases the capacity of ad hoc wireless networks. In IEEE/ACM Transaction on Networking, volume 10, pages 477–486, August 2002.

[HGD+03] S. Hariri, Qu Guangzhi, T. Dharmagadda, M. Ramkishore, and C.S. Raghavendra. Impact analysis of faults and attacks in large-scale networks. Security Privacy, IEEE, 1(5):49–54, Sept.-Oct. 2003.

[HHP03] Alefiya Hussain, John Heidemann, and Christos Papadopoulos. A framework for classifying denial of service attacks. In SIGCOMM ’03: Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, pages 99–110, New York, NY, USA, 2003. ACM.

[HJS+03] Nicholas J. A. Harvey, Michael B. Jones, Stefan Saroiu, Marvin Theimer, and Alec Wolman. Skipnet: a scalable overlay network with practical locality properties. In Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4, pages 9–9, Berkeley, CA, USA, 2003. USENIX Association. Available from World Wide Web: http://portal.acm.org/citation.cfm?id=1251460.1251469.



[HSHR09] Fabian Hugelshofer, Paul Smith, David Hutchison, and Nicholas J.P. Race. Openlids: a lightweight intrusion detection system for wireless mesh networks. In MobiCom ’09: Proceedings of the 15th annual international conference on Mobile computing and networking, pages 309–320, New York, NY, USA, 2009. ACM.

[IBPR08] J. Ishmael, S. Bury, D. Pezaros, and N. Race. Deploying Rural Community Wireless Mesh Networks. IEEE Internet Computing, 12(4):22–29, 2008.

[JOW+02] Philo Juang, Hidekazu Oki, Yong Wang, Margaret Martonosi, Li Shiuan Peh, and Daniel Rubenstein. Energy-efficient computing for wildlife tracking: design tradeoffs and early experiences with zebranet. SIGOPS Oper. Syst. Rev., 36(5):96–107, 2002.

[KMR02] Angelos D. Keromytis, Vishal Misra, and Dan Rubenstein. Sos: secure overlay services. In SIGCOMM ’02: Proceedings of the 2002 conference on Applications, technologies, architectures, and protocols for computer communications, pages 61–72, New York, NY, USA, 2002. ACM.

[KMS+09] Rupa Krishnan, Harsha V. Madhyastha, Sridhar Srinivasan, Sushant Jain, Arvind Krishnamurthy, Thomas E. Anderson, and Jie Gao. Moving beyond end-to-end path information to optimize cdn performance. In Anja Feldmann and Laurent Mathy, editors, Internet Measurement Conference, pages 190–201. ACM, 2009. Available from World Wide Web: http://dblp.uni-trier.de/db/conf/imc/imc2009.html#KrishnanMSJKAG09.

[KPS+06] S. Khan, Y. Peng, E. Steinbach, M. Sgroi, and W. Kellerer. Application-driven cross-layer optimization for video streaming over wireless networks. Communications Magazine, IEEE, 44(1):122–130, Jan. 2006.

[KS95] Irene Katzela and Mischa Schwartz. Schemes for fault identification in communication networks. IEEE/ACM Trans. Netw., 3(6):753–764, 1995.

[KSV07] Ramana Rao Kompella, Sumeet Singh, and George Varghese. On scalable attack detection in the network. IEEE/ACM Trans. Netw., 15(1):14–25, 2007.

[LB07] DongJin Lee and Nevil Brownlee. Passive measurement of one-way and two-way flow lifetimes. SIGCOMM Comput. Commun. Rev., 37(3):17–28, 2007.

[LBZ+06] X. Li, F. Bian, H. Zhang, C. Diot, R. Govindan, W. Hong, and G. Iannaccone. MIND: A distributed Multi-dimensional Indexing system for Network Diagnosis. In INFOCOM 2006. 25th IEEE International Conference on Computer Communications. Proceedings, pages 1–12, April 2006.

[LCD04] Anukool Lakhina, Mark Crovella, and Christophe Diot. Diagnosing network-wide traffic anomalies. In SIGCOMM ’04: Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications, pages 219–230, New York, NY, USA, 2004. ACM.

[LCD05] Anukool Lakhina, Mark Crovella, and Christophe Diot. Mining anomalies using traffic feature distributions. SIGCOMM Comput. Commun. Rev., 35(4):217–228, 2005.



[lSS04] Małgorzata Steinder and Adarshpal S. Sethi. A survey of fault localization techniques in computer networks. Science of Computer Programming, 53(2):165–194, 2004. Available from World Wide Web: http://www.sciencedirect.com/science/article/B6V17-4CRY8NV-1/2/8a821aa4536c9ca7c829bbd3a68d675b. Topics in System Administration.

[LWD92] A.A. Lazar, Weiguo Wang, and R.H. Deng. Models and algorithms for network fault detection and identification: a review. In Singapore ICCS/ISITA ’92. ’Communications on the Move’, pages 999–1003 vol.3, Nov 1992.

[LX01] Wenke Lee and Dong Xiang. Information-theoretic measures for anomaly detection. In SP ’01: Proceedings of the 2001 IEEE Symposium on Security and Privacy, page 130, Washington, DC, USA, 2001. IEEE Computer Society.

[MD03] Benjamin Morin and Herve Debar. Correlation of intrusion symptoms: an application of chronicles. In RAID ’03: Proceedings of the 6th International Conference on Recent Advances in Intrusion Detection, pages 94–112, 2003.

[MHS06] Abdul Jabbar Mohammad, David Hutchison, and James P.G. Sterbenz. Poster: Towards quantifying metrics for resilient and survivable networks. In ICNP ’06: Proceedings of the 14th IEEE International Conference on Network Protocols, pages 17–18, November 2006.

[MR04] Jelena Mirkovic and Peter Reiher. A taxonomy of ddos attack and ddos defense mechanisms. SIGCOMM Comput. Commun. Rev., 34(2):39–53, 2004.

[NA08] Thuy T. T. Nguyen and Grenville J. Armitage. A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys and Tutorials, 10(1-4):56–76, 2008. Available from World Wide Web: http://dblp.uni-trier.de/db/journals/comsur/comsur10.html#NguyenA08.

[NGR09] Erik Nordstrom, Per Gunningberg, and Christian Rohner. A search-based network architecture for mobile devices. Technical Report 2009-003, Uppsala University, Department of Information Technology, http://www.it.uu.se/research/publications/reports/2009-003/, 2009.

[PAM09] S. Pediaditaki, P. Arrieta, and M.K. Marina. A learning-based approach for distributed multi-radio channel allocation in wireless mesh networks. In Proc. IEEE International Conference on Network Protocols (ICNP), pages 31–41. IEEE, Oct. 2009.

[PLR07a] Tao Peng, Christopher Leckie, and Kotagiri Ramamohanarao. Information sharing for distributed intrusion detection systems. J. Netw. Comput. Appl., 30(3):877–899, 2007.

[PLR07b] Tao Peng, Christopher Leckie, and Kotagiri Ramamohanarao. Survey of network-based defense mechanisms countering the dos and ddos problems. ACM Comput. Surv., 39(1):3, 2007.

[RD01] Antony Rowstron and Peter Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), 2001.



[RKCD01] Antony Rowstron, Anne-Marie Kermarrec, Miguel Castro, and Peter Druschel. Scribe: The design of a large-scale event notification infrastructure. Networked Group Communication, pages 30–43, 2001. Available from World Wide Web: http://dx.doi.org/10.1007/3-540-45546-9_3.

[RSRD07] Haakon Ringberg, Augustin Soule, Jennifer Rexford, and Christophe Diot. Sensitivity of pca for traffic anomaly detection. SIGMETRICS Perform. Eval. Rev., 35(1):109–120, 2007.

[Sch99] Bruce Schneier. Attack trees. Dr Dobb’s Journal, 24(12), December 1999.

[SCS+09] P. Smith, C. Doer, M. Schoeller, N. Kheir, and J. Lessmann. First draft on the remediation, recovery and measurement framework, ResumeNet deliverable D2.3a. deliverable, ResumeNet, March 2009.

[SDP08] Vineet Saini, Qiang Duan, and Vamsi Paruchuri. Threat modeling using attack trees. J. Comput. Small Coll., 23(4):124–131, 2008.

[SFH07] M. Sifalakis, M. Fry, and D. Hutchison. A common architecture for cross layer and network context awareness. In Proc. 1st International Workshop on Self-Organising Systems, IWSOS 07, 2007.

[SKH+02] James P. G. Sterbenz, Rajesh Krishnan, Regina Rosales Hain, Alden W. Jackson, David Levin, Ram Ramanathan, and John Zao. Survivable mobile wireless networks: issues, challenges, and research directions. In WiSE ’02: Proceedings of the 1st ACM workshop on Wireless security, pages 31–40, New York, NY, USA, 2002. ACM.

[SS09] P. Smith and M. Schoeller. Understanding challenges and their impact on network resilience, deliverable 1.1b. deliverable, ResumeNet, October 2009.

[SSF+09] P. Smith, M. Schoeller, A. Fessi, M. Karaliopoulos, C. Doerr, and R. Bruncak. First interim strategy document for resilient networking, ResumeNet deliverable D1.5a. deliverable, ResumeNet, August 2009.

[SSRD07] Augustin Soule, Fernando Silveira, Haakon Ringberg, and Christophe Diot. Challenging the supremacy of traffic matrices in anomaly detection. In IMC ’07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 105–110, New York, NY, USA, 2007. ACM.

[SW09] Shashank Shanbhag and Tilman Wolf. Accurate anomaly detection through parallelism. Netwrk. Mag. of Global Internetwkg., 23(1):22–28, 2009.

[TGC09] Trusted Computing Group (TNC) IF-MAP binding for SOAP specification version 1.1. http://www.trustedcomputinggroup.org/files/resource files/51F74E9B-1D09-3519-AD2DAE1472A3A846/TNC IFMAP v1 1 r5.pdf, May 2009.

[vS08] Leo van Selm. ISO/IEC 20000: An Introduction. Van Haren Publishing, 2008.

[WSAL09] Ting Wang, Mudhakar Srivatsa, Dakshi Agrawal, and Ling Liu. Learning, indexing, and diagnosing network faults. In KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 857–866, New York, NY, USA, 2009. ACM.



[WZA06] Nigel Williams, Sebastian Zander, and Grenville Armitage. A preliminary performance comparison of five machine learning algorithms for practical ip traffic flow classification. SIGCOMM Comput. Commun. Rev., 36(5):5–16, 2006.

[Xia09] Yang Xiao. Flow-net methodology for accountability in wireless networks. IEEE Network, 23(5):30–37, September/October 2009.

[ZMZ08] Ying Zhang, Z. Morley Mao, and Ming Zhang. Effective diagnosis of routing disruptions from end systems. In NSDI ’08: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, pages 219–232, Berkeley, CA, USA, 2008. USENIX Association.
