
Discovering Network Topology in the Presence of Byzantine Faults

Mikhail Nesterenko and Sebastien Tixeuil

Abstract: We pose and study the problem of Byzantine-robust topology discovery in an arbitrary asynchronous network. The problem is an abstraction of fault-tolerant routing. We formally state the weak and strong versions of the problem. The weak version requires that either each node discovers the topology of the network or at least one node detects the presence of a faulty node. The strong version requires that each node discovers the topology regardless of faults. We focus on noncryptographic solutions to these problems and explore their bounds. We prove that the weak topology discovery problem is solvable only if the connectivity of the network exceeds the number of faults in the system. Similarly, we show that the strong version of the problem is solvable only if the network connectivity is more than twice the number of faults. We present solutions to both versions of the problem. The presented algorithms match the established graph connectivity bounds. The algorithms do not require the individual nodes to know either the diameter or the size of the network. The message complexity of both programs is low polynomial with respect to the network size. We describe how our solutions can be extended to add the property of termination, handle topology changes, and perform neighborhood discovery.

Index Terms—Fault tolerance, topology discovery, Byzantine faults.


1 INTRODUCTION

In this paper, we investigate the problem of Byzantine-tolerant distributed topology discovery in an arbitrary network. Each node is only aware of its neighboring peers and needs to learn the topology of the entire network.

As reliability demands on distributed systems increase, the interest in developing robust distributed applications grows. One of the strongest fault models is Byzantine [12]: the faulty node behaves arbitrarily. This model encompasses a rich set of fault scenarios. Moreover, Byzantine fault tolerance has security implications, as the behavior of an intruder can be modeled as Byzantine.

However, in most studies to date, Byzantine faults are considered in completely connected networks. That is, the routing between the nodes in the system is assumed to be established in spite of the faults. In this paper, we address the problem of Byzantine-robust routing. To allow us to abstract from the details of routing table establishment and maintenance, we state the problem of topology discovery and study its properties and solutions. Besides being attractive in their own right, the solutions to the topology discovery problem easily translate into routing algorithms.

One approach to dealing with Byzantine faults is to enable the nodes to use cryptographic operations such as digital signatures or certificates. This limits the power of a Byzantine node, as a nonfaulty node can verify the validity of received topology information and authenticate the sender across multiple hops. However, this option may not be available. For example, the nodes may not have enough resources to manipulate digital signatures. Moreover, cryptographic operations implicitly assume the presence of a trust infrastructure: secure channels to a key server or a public key infrastructure. Establishing and maintaining such infrastructure in the presence of Byzantine faults may be problematic.

Another way to limit the power of a Byzantine process is to assume synchrony: all processes proceed in lock-step. Indeed, if a process is required to send a message with each pulse, a Byzantine process cannot refuse to send a message without being detected. However, the synchrony assumption may be too restrictive for practical systems.

1.1 Our Contribution

In this study, we explore the fundamental properties of topology discovery. We select the weakest practical programming model, establish the limits on the solutions, and present the programs matching those limits.

Specifically, we consider networks of arbitrary topology where up to a fixed number k of nodes are faulty. The execution model is asynchronous. We are interested in solutions that do not use cryptographic primitives. The solutions should be terminating, and the individual processes should not be aware of network parameters such as the network diameter or the total number of nodes.

We state two variants of the topology discovery problem: weak and strong. In the former, either each nonfaulty node learns the topology of the network or one of them detects a fault; in the latter, each nonfaulty node has to learn the topology of the network regardless of the presence of faults.

As negative results, we show that no solution to the weak topology discovery problem can ascertain the presence of an edge between two faulty nodes. Similarly, no solution to the strong variant can determine the presence of an edge between a pair of nodes at least one of which is faulty. Moreover, the solution to the weak variant requires the network to be at least $(k+1)$-connected. In the case of the strong variant, the network must be at least $(2k+1)$-connected.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 12, DECEMBER 2009

M. Nesterenko is with the Computer Science Department, Kent State University, Kent, OH 44240. E-mail: [email protected].

S. Tixeuil is with LIP6 and INRIA Grand Large LIP6-CNRS 7606, Universite Pierre et Marie Curie - Paris 6, 104 avenue du President Kennedy 75016 Paris, France. E-mail: [email protected].

Manuscript received 9 Mar. 2008; revised 2 Dec. 2008; accepted 8 Jan. 2009; published online 10 Feb. 2009. Recommended for acceptance by P. Srimani. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-2008-03-0092. Digital Object Identifier no. 10.1109/TPDS.2009.25.

1045-9219/09/$25.00 © 2009 IEEE Published by the IEEE Computer Society

The main contributions of this study are the algorithms that solve the two problems: Detector and Explorer. The algorithms match the respective connectivity lower bounds. To the best of our knowledge, these are the first asynchronous Byzantine-robust solutions to the topology discovery problem that do not use cryptographic operations. In practice, these bounds can be viewed as working tolerances of the algorithms. That is, the algorithms are specified to operate correctly until the number of faults exceeds the theoretical tolerance limit.

Explorer solves the stronger problem. However, Detector has better message complexity. In both algorithms, faulty nodes may generate an arbitrary number of messages. However, in fault-free operation, Detector determines the topology in $O(\delta n^3)$ messages, where $\delta$ and $n$ are the maximum neighborhood size and the number of nodes in the system, respectively. Explorer finishes in $O(n^4)$ messages.

Detector is not as robust as Explorer. Detector just alerts the system about the presence of the fault instead of masking it. The system has to have an external mechanism to cope with faults. For example, the system operators may have to manually take measures to neutralize the faulty node. However, Detector requires weaker connectivity and is asymptotically more efficient. Thus, the system designers may consider this variant of the solution in case the faults are rare or the system cannot guarantee the connectivity required for a more robust solution.

We extend our algorithms to

1. terminate,
2. handle topology changes,
3. discover neighbors if ports are known,
4. discover a fixed number of routes instead of complete topology, and
5. reliably propagate arbitrary information instead of topological data.

1.2 Related Work

A number of researchers employ cryptographic operations to counter Byzantine faults. Avramopoulos et al. [2] consider the problem of secure routing; see the references therein for other secure routing solutions that rely on cryptography. Perrig et al. [21] survey robust routing methods in ad hoc sensor networks. The techniques covered there also assume that the processes are capable of cryptographic operations.

A naive approach to solving the topology discovery problem without cryptography would be to use a Byzantine-resilient broadcast [3], [7], [10], [20]: each node advertises its neighborhood. However, all existing solutions for arbitrary topology known to us require that the graph topology be known a priori to the nodes.

Let us survey the noncryptography-based approaches to Byzantine fault tolerance. Most programs described in the literature [1], [13], [14], [18] assume completely connected networks and cannot be easily extended to deal with arbitrary topology. Dolev [7] considers Byzantine agreement on arbitrary graphs. He states that for agreement in the presence of up to k Byzantine nodes, it is necessary and sufficient that the network is $(2k+1)$-connected and the number of nodes in the system is at least $3k+1$. However, his solution requires that the nodes are aware of the topology in advance. Also, this solution assumes the synchronous execution model. Recently, the problem of Byzantine-robust reliable broadcast has attracted attention [3], [10], [20]. However, in all cases, the topology is assumed to be known. Bhandari and Vaidya [3] and Koo [10] assume a two-dimensional grid. Pelc and Peleg [20] consider arbitrary topology but assume that each node knows the exact topology a priori. A notable class of algorithms tolerates Byzantine faults with either space [17], [19], [22] or time [16] locality. Yet, the emphasis of space-local algorithms is on containing the fault as close to its source as possible. This is only applicable to problems where the information from remote nodes is unimportant, such as vertex coloring, link coloring, or dining philosophers. Also, the time-local algorithms presented so far can hold at most one Byzantine node and are not able to mask the effect of Byzantine actions. Thus, the local containment approach is not applicable to topology discovery.

Masuzawa [15] considers the problem of topology discovery and update. However, Masuzawa is interested in designing a self-stabilizing solution to the problem, and thus his fault model is not as general as Byzantine: he considers only transient and crash faults.

1.3 Outline

The rest of the paper is organized as follows: After stating our programming model and notation in Section 2, we formulate the topology discovery problems and state the impossibility results in Section 3. We present Detector and Explorer in Sections 4 and 5, respectively. We discuss the composition of our programs and their extensions in Section 6. In conclusion, we state several problems that our research opens in Section 7.

2 NOTATION, DEFINITIONS, AND ASSUMPTIONS

2.1 Graphs

A distributed system (or program) consists of a set of processes and a neighbor relation between them. This relation is the system topology. The topology forms a graph $G$. Let $n$ and $e$ be the number of nodes¹ and edges in $G$, respectively. Two processes are neighbors if they share an edge in $G$. A set P of neighbors of process p is the neighborhood of p. In the sequel, we use small letters to denote singleton variables and capital letters to denote sets. In particular, we use a small letter for a process and a matching capital one for this process' neighborhood. Since the topology is symmetric, if $q \in P$, then $p \in Q$. Let $\delta$ be the maximum number of nodes in a neighborhood.

A node-cut of a graph is a set of nodes $U$ such that $G \setminus U$ is disconnected or trivial. The node-connectivity (or just connectivity) of a graph is the minimum cardinality of a node-cut of this graph. In this paper, we make use of the following fact about graph connectivity that follows from Menger's theorem [24]: if a graph is k-connected (where k is some constant), then for every two vertices u and v, there exist at least k internally node-disjoint paths connecting u and v in this graph.
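The definition of connectivity can be made concrete with a small brute-force sketch. The function names and the adjacency-dictionary representation below are assumptions made for this illustration and do not come from the paper; the exhaustive search is exponential and is meant only for small example graphs.

```python
from itertools import combinations


def is_connected(adj, removed):
    """Return True if the graph without the `removed` nodes is connected."""
    nodes = [v for v in adj if v not in removed]
    if not nodes:
        return False
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)


def node_connectivity(adj):
    """Minimum size of a node set whose removal leaves the graph
    disconnected or trivial (a single node), found by exhaustive search."""
    n = len(adj)
    for size in range(n):
        for cut in combinations(adj, size):
            if n - size <= 1 or not is_connected(adj, set(cut)):
                return size
    return n
```

For instance, a ring of four nodes has connectivity 2 under this definition, so by the fact above any two of its nodes are joined by two internally node-disjoint paths.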


1. We use terms process and node interchangeably.


2.2 Program Model

A process contains a set of variables. When it is clear from the context, we refer to a variable var of process p as var.p. Every variable ranges over a fixed domain of values. For each variable, certain values are initial. Each process p initially contains a set of neighbor identifiers P. Each pair of neighbor processes shares a pair of special variables called channels. We denote by Ch.b.c the channel from process b to process c. Process b is the sender and c is the receiver. The value for a channel variable is chosen from the domain of (potentially infinite) sequences of messages. That is, each channel is a FIFO queue. Each process is capable of recognizing the sender of the message.

A state of the program is the assignment of a value to every variable of each process from its corresponding domain. A state is initial if every variable has an initial value. Each process contains a set of actions. An action has the form $\langle name \rangle : \langle guard \rangle \longrightarrow \langle command \rangle$. A guard is a Boolean predicate over the variables of the process. A command is a sequence of assignment and branching statements. A guard may be a receive statement that accesses the incoming channel. A command may contain a send statement that modifies the outgoing channel. A parameter is used to define a set of actions as one parameterized action. For example, let j be a parameter ranging over values 2, 5, and 9; then a parameterized action ac.j defines the set of actions

$$ac.(j = 2) \;[\,]\; ac.(j = 5) \;[\,]\; ac.(j = 9).$$

Either guard or command can contain quantified constructs [6] of the form $(\langle quantifier \rangle \langle bound\ variables \rangle : \langle range \rangle : \langle term \rangle)$, where range and term are Boolean constructs.
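The guarded-action notation can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the names `Action`, `parameterized`, and `step`, the state variable `x`, and the deterministic choice of the first enabled action are all hypothetical and are not part of the paper's model, which permits any enabled action to be selected.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional

State = Dict[str, int]


@dataclass
class Action:
    name: str
    guard: Callable[[State], bool]      # Boolean predicate over the state
    command: Callable[[State], None]    # updates the state in place


def parameterized(name: str, values: Iterable[int],
                  make_guard, make_command) -> List[Action]:
    """Expand one parameterized action into one action per parameter value."""
    return [Action(f"{name}.(j = {j})", make_guard(j), make_command(j))
            for j in values]


def step(state: State, actions: List[Action]) -> Optional[str]:
    """Execute one enabled action atomically; return its name, or None.
    Assumption: the first enabled action is picked for simplicity."""
    for action in actions:
        if action.guard(state):
            action.command(state)
            return action.name
    return None


# The example from the text: parameter j ranges over 2, 5, and 9.
ac = parameterized(
    "ac", (2, 5, 9),
    make_guard=lambda j: (lambda s: s["x"] == j),
    make_command=lambda j: (lambda s: s.update(x=0)),
)
```

Calling `step` on a state where `x` equals 5 executes only the action for j = 5; once no guard holds, `step` returns `None`, which corresponds to a state where no action is enabled.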

2.3 Semantics

An action of a process of the program is enabled in a certain state if its guard evaluates to true. An action containing a receive statement is enabled when an appropriate message is at the head of the incoming channel. The execution of the command of an action updates variables of the process. The execution of an action containing a receive statement removes the received message from the head of the incoming channel and inserts the value the message contains into the specified variables. The execution of a send statement appends the specified message to the tail of the outgoing channel.

A computation of the program is a maximal fair sequence of states of the program such that the first state $s_0$ is initial and for each state $s_i$, the state $s_{i+1}$ is obtained by executing the command of an action that is enabled in $s_i$. That is, we assume that the action execution is atomic. The maximality of a computation means that the computation is either infinite or it terminates in a state where none of the actions are enabled. The fairness means that if an action is enabled in all but finitely many states of an infinite computation, then this action is executed infinitely often. That is, we assume weak fairness of action execution. Note that we require the receive statement to appear as a stand-alone guard of an action. This means that if a message of the appropriate type is at the head of the incoming channel, the receive action is enabled. Due to the weak fairness assumption, this leads to the fair message receipt assumption: each message in the channel is eventually received. Observe that our definition of a computation considers asynchronous computations.

To reason about program behavior, we define Boolean predicates on program states. A program invariant is a predicate that is true in every initial state of the program, and if the predicate holds before the execution of a program action, it also holds afterward. Note that by this definition, a program invariant holds in each state of every program computation.

2.4 Faults

Throughout a computation, a process may be either Byzantine (faulty) or nonfaulty. A Byzantine process contains an action that assigns to each local variable an arbitrary value from its domain. This action is always enabled. Yet, the weak fairness assumption does not apply to this action. That is, we consider computations where a faulty process does not execute any actions. Observe that we allow a faulty node to send arbitrary messages. We assume, however, that messages sent by such a node conform to the format specified by the algorithm: each message carries the specified number of values, and the values are drawn from appropriate domains. This assumption is not difficult to implement, as message syntax checking logic can be incorporated in the receive action of each process. We assume oral record [12] of message transmission: the receiver can always correctly identify the message sender. The channels are reliable: the messages are delivered in FIFO order and without loss or corruption. Throughout the paper, we assume that the maximum number of faulty nodes in the system is bounded by some constant k.

2.5 Graph Exploration

The processes discover the topology of the system by exchanging messages. Each message contains the identifier of the process and its neighborhood. Process p explored process q if p received a message with $(q, Q)$. When it is clear from the context, we omit the mention of p. An explored subgraph of a graph contains only explored processes. A Byzantine process may potentially circulate information about processes that do not exist in the system at all. A process is fake if it does not exist in the system; a process is real otherwise.

3 THE TOPOLOGY DISCOVERY PROBLEM: STATEMENT AND SOLUTION BOUNDS

Problem statement.

Definition 1 (Weak Topology Discovery Problem). A program is a solution to the weak topology discovery problem if each of the program's computations satisfies the following properties: termination (either all nonfaulty processes determine the system topology or at least one process detects a fault); safety (for each nonfaulty process, the determined topology is a subset of the actual system topology); validity (the fault is detected only if there are faulty processes in the system).

Definition 2 (Strong Topology Discovery Problem). A program is a solution to the strong topology discovery problem if each of the program's computations satisfies the following properties: termination (all nonfaulty processes determine the system topology); safety (the determined topology is a subset of the actual system topology).


According to the safety property of both problem definitions, each nonfaulty process is only required to discover a subset of the actual system topology. However, the desired objective is for each node to discover as much of it as possible. The following definitions capture this idea. A solution to a topology discovery problem is complete if every nonfaulty process always discovers the complete topology of the system. A solution to the problem is node-complete if every nonfaulty process discovers all nodes of the system. A solution is adjacent-edge complete if every nonfaulty node discovers each edge adjacent to at least one nonfaulty node. A solution is two-adjacent-edge complete if every nonfaulty node discovers each edge adjacent to two nonfaulty nodes.

Observe that for $(k+1)$-connected graphs, an adjacent-edge complete solution is also node-complete, and therefore, simply complete. If $k = 1$, then an edge complete solution is also node-complete and simply complete.
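The safety property and the two edge-completeness notions can be phrased as set comparisons on edges. The sketch below is an illustration only: the function names, the adjacency-dictionary representation, and the choice to check a single process's discovered topology are assumptions, not definitions taken from the paper.

```python
def edges(adj):
    """Undirected edge set of a graph given as {node: set_of_neighbors}."""
    return {frozenset((u, v)) for u in adj for v in adj[u]}


def is_safe(discovered, actual):
    """Safety: the determined topology is a subset of the actual one."""
    return edges(discovered) <= edges(actual)


def adjacent_edge_complete(discovered, actual, faulty):
    """Every edge with at least one nonfaulty endpoint is discovered."""
    required = {e for e in edges(actual) if any(v not in faulty for v in e)}
    return required <= edges(discovered)


def two_adjacent_edge_complete(discovered, actual, faulty):
    """Every edge whose endpoints are both nonfaulty is discovered."""
    required = {e for e in edges(actual) if all(v not in faulty for v in e)}
    return required <= edges(discovered)
```

As a usage example, take a triangle on p, q, r where q is faulty. A process that discovers only the edge between p and r satisfies safety and is two-adjacent-edge complete, but it is not adjacent-edge complete, because the edges touching q each have a nonfaulty endpoint.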

3.1 Solution Bounds

To simplify the presentation of the negative results in this section, we assume more restrictive execution semantics. Each channel contains at most one message. The computation is synchronous and proceeds in rounds. In a single round, each process consumes all messages in its incoming channels and outputs its own messages into the outgoing channels. Note that the negative results established for these semantics apply to the more general semantics used in the rest of the paper.

Theorem 1. There does not exist a complete solution to the weak topology discovery problem if the number of faults k is greater than one.

Proof. Assume that there exists a complete solution to the problem. Consider $k \geq 2$ and a topology $G_1$ that is not completely connected. Let none of the nodes in $G_1$ be faulty. By the validity property, none of the nodes may detect a fault in such a topology. Consider a computation $s_1$ of the solution program where each node discovers $G_1$. Let $p \in G_1$, $q \neq p$, and $r \neq p$ be three nodes in $G_1$, with q and r being nonneighbor nodes in $G_1$. Since $G_1$ is not completely connected, we can always find two such nodes.

We form topology $G_2$ by connecting q and r in $G_1$. Let q and r be faulty in $G_2$. We construct a computation $s_2$ which is identical to $s_1$. That is, q and r, being faulty, in every round output the same messages as in $s_1$. Since $s_2$ is otherwise identical to $s_1$, process p determines that the topology of the system is $G_1 \neq G_2$. Thus, the assumed solution is not complete. □

Theorem 2. There does not exist a node or adjacent-edge-complete solution to the weak topology problem if the connectivity of the graph is less than or equal to the total number of faults k.

Proof. Assume the opposite. Let there be a node and adjacent-edge-complete program that solves the problem for graphs whose connectivity is k or less. Let $G_1$ and $G_2$ be two graphs of connectivity k.

This means that $G_1$ and $G_2$ contain the respective cut node sets $A_1$ and $A_2$ whose cardinality is k. Rename the processes in $G_2$ such that $A_1 = A_2$. By definition, $A_1$ separates $G_1$ into two disconnected sets $B_1$ and $C_1$. Similarly, $A_2$ separates $G_2$ into $B_2$ and $C_2$. Assume without loss of generality that $B_1 \neq B_2$. Specifically, there is a node $q_1$ such that $q_1 \in B_1$ but $q_1 \notin B_2$ and a pair of nodes $q_2$ and $q_3$ that share an edge in $B_1$ but not in $B_2$.

Since $A_1 = A_2$, we can form graph $G_3$ as $A_1 \cup B_2 \cup C_1$.

Let $s_1$ be any computation of the assumed program in the system of topology $G_1$ and no faulty nodes. Since the program solves the weak topology discovery problem, the computation has to comply with all the properties of the problem. By the validity property, no fault is detected in $s_1$. By the termination property, each node in $G_1$, including some node $p \in C_1$, eventually discovers the system topology.

By the safety property, the topology discovered by p is a subset of $G_1$. Since the solution is complete, the discovered topology is exactly $G_1$. Let $s_2$ be any computation of the assumed program in the system of topology $G_2$ and no faulty nodes. Again, none of the nodes detects a fault and all of them discover the complete topology of $G_2$ in $s_2$.

We construct a new computation $s_3$ of the assumed program as follows. The system topology for $s_3$ is $G_3$, where all nodes in $A_1$ are faulty. Each faulty node $r \in A_1$ behaves as follows. In the channels connecting r to the nodes of $C_1 \subset G_3$, in each round r outputs the messages as in $s_1$. Similarly, in the channels connecting r to the nodes of $B_2 \subset G_3$, r outputs the messages as in $s_2$. The nonfaulty nodes of $B_2$ and $C_1$ behave as in $s_2$ and $s_1$, respectively.

Observe that for the nodes of $B_2$, the topology and communication are indistinguishable from those of $s_2$. Similarly, for the nodes of $C_1$, the topology and communication are indistinguishable from those of $s_1$. Notice that this means that none of the nonfaulty nodes detects a fault in the system. Moreover, node $p \in C_1$ decides that the system topology is a subset of $G_1$. Yet, by construction, $G_1 \neq G_3$. Specifically, $B_1 \neq B_2$. None of the nodes in $B_2$ are faulty. If this is the case, then by the assumptions on $B_1$ and $B_2$, either $s_3$ violates the safety property of the problem or the assumed solution is neither adjacent-edge nor node-complete. The theorem follows. □
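The spliced graph used in this indistinguishability argument can be illustrated on a tiny instance with k = 1. The sketch below is a hypothetical example: the helper names `induced` and `splice`, the concrete graphs, and the decision to take the union of edges inside the cut set are assumptions made for illustration, since the proof does not prescribe them.

```python
def induced(adj, keep):
    """Subgraph of `adj` induced by the node set `keep`."""
    return {u: adj[u] & keep for u in adj if u in keep}


def splice(g1, g2, a, b2, c1):
    """Form G3 from the cut set A, the side B2 of G2, and the side C1 of G1.
    Assumption: edges inside A are the union of those in G1 and G2."""
    g3 = {}
    for part in (induced(g2, a | b2), induced(g1, a | c1)):
        for u, nbrs in part.items():
            g3.setdefault(u, set()).update(nbrs)
    return g3


# Cut set A = {a}; B1 = {b1, b2} in G1, B2 = {b2} in G2; C1 = C2 = {c}.
g1 = {"a": {"b1", "b2", "c"}, "b1": {"a", "b2"}, "b2": {"a", "b1"}, "c": {"a"}}
g2 = {"a": {"b2", "c"}, "b2": {"a"}, "c": {"a"}}
g3 = splice(g1, g2, a={"a"}, b2={"b2"}, c1={"c"})
```

In this instance, node c has the same neighborhood in `g1` and in `g3`, and node b2 has the same neighborhood in `g2` and in `g3`, although `g1` and `g3` differ. If the faulty node a replays toward c the messages of the fault-free computation on `g1`, node c has no local basis for telling the two systems apart.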

Theorem 3. There does not exist an adjacent-edge-complete solution to the strong topology discovery problem.

Proof. Assume that such a solution exists. Consider system graph $G_1$ that is not completely connected. Let $p \in G_1$ be an arbitrary node. Let $q \neq p$ and $r \neq p$ be two nonneighbor nodes of $G_1$. We form topology $G_2$ by connecting q and r in $G_1$.

We construct computations $s_1$ and $s_2$ as follows. Let $s_1$ and $s_2$ be executed on $G_1$ and $G_2$, respectively. Let q be faulty in $s_1$ and r be faulty in $s_2$. Set the output of q in each round to be identical in $s_1$ and $s_2$. Similarly, set the output of r to be identical in both computations as well. Since the output of q and r in both computations is identical, we construct the behavior of the rest of the nodes in $s_1$ and $s_2$ to be the same.

Due to the termination property, p has to decide on the system topology in both computations. Due to the safety property, in $s_1$, process p has to determine that the topology of the graph is a subset of $G_1$. However, since the behavior of p in $s_2$ is identical to that in $s_1$, p decides that the topology of the system graph is $G_1$ in $s_2$ as well. This means that p does not include the edge between q and r in the explored topology in $s_2$. Yet, one of the nodes adjacent to this edge, namely q, is not faulty. An adjacent-edge complete program should include such edges in the discovered topology. Therefore, the assumed program is not adjacent-edge complete. □

Theorem 4. There does not exist a node or two-adjacent-edge complete solution to the strong topology problem if the connectivity of the graph is less than or equal to twice the total number of faults k.

Proof. Assume that there is a program that solves the problem for graphs whose connectivity is $2k$ or less. Let $G_1$ and $G_2$ be two different graphs whose connectivity is $2k$. Similar to the proof of Theorem 2, we assume that $G_1 = A_1 \cup B_1 \cup C_1$ and $G_2 = A_2 \cup B_2 \cup C_2$, where the cardinalities of $A_1$ and $A_2$ are $2k$, $A_1 = A_2$, $B_1 \cap C_1 = \emptyset$, and $B_2 \cap C_2 = \emptyset$. Also, $B_1 \neq B_2$. Specifically, there is a node $q_1$ such that $q_1 \in B_1$ but $q_1 \notin B_2$ and a pair of nodes $q_2$ and $q_3$ that share an edge in $B_1$ but not in $B_2$.

Form $G_3 = A_1 \cup B_2 \cup C_1$. Divide $A_1$ into two subsets $A_1'$ and $A_1''$ of the same number of nodes. Construct a computation $s_1$ with system topology $G_1$ where all nodes in $A_1'$ are faulty, and another computation $s_3$ with system topology $G_3$ where all nodes in $A_1''$ are faulty. The faulty nodes in $s_1$, in the channels connecting $A_1'$ to $C_1$, communicate as the (nonfaulty) nodes of $A_1'$ in $s_3$. Similarly, the faulty nodes in $s_3$, in the channels connecting $A_1''$ to $C_1$, communicate as the nodes of $A_1''$ in $s_1$. Observe that $s_1$ and $s_3$ are indistinguishable to the nodes in $C_1$. Let the nodes in $C_1$, including $p \in C_1$, behave identically in both computations. According to the termination property of the strong topology discovery problem, every node, including p, has to determine the system topology in both $s_1$ and $s_3$. Due to safety, the topology that p determines in $s_1$ is a subset of $G_1$. However, p behaves identically in $s_3$. This means that p decides that the system topology in $s_3$ is also a subset of $G_1$. Since $G_1 \neq G_3$ (specifically, $B_1 \not\subset B_2$) and none of the nodes in $B_2$ are faulty, either $s_3$ violates the safety property of the problem or the assumed solution is neither two-adjacent-edge nor node-complete. The theorem follows. □

4 DETECTOR

4.1 Outline

Detector solves the weak topology discovery problem for system graphs whose connectivity exceeds the number of faulty nodes k. The algorithm leverages the connectivity of the graph. For each pair of nodes, the graph guarantees the presence of at least one path that does not include a faulty node. The topology data travel along every path of the graph. Hence, the process that collects information about another process can find the potential inconsistency between the information that proceeds along the path containing faulty nodes and the path containing only nonfaulty ones.

Care is taken to detect the fake nodes whose information is introduced by faulty processes. Since the processes do not know the size of the system, a faulty process may potentially introduce an infinite number of fake nodes. Graph connectivity is leveraged to detect such fake nodes. As faulty processes are the only source of information about fake nodes, all the paths from the real nodes to the fake ones have to contain a faulty node. Yet, the graph connectivity is assumed to be greater than k. If a fake node is ever introduced, one of the nonfaulty processes eventually detects a graph with too few paths leading to the fake node.

To simplify the presentation, we describe Detector assuming that the system contains at most one faulty node ($k = 1$). We then explain how to generalize the algorithm to an arbitrary fixed k.

4.2 Detailed Description

The program is shown in Fig. 1. Each process p stores the identifiers of its immediate neighbors. They are kept in set P. Each process is aware of the upper bound $k = 1$ on the number of faulty processes in the system. Process p maintains the following variables. Boolean variable detect indicates if p discovers a fault in the system. Boolean variable start guards the execution of the action that sends p's neighborhood information to its neighbors. Set TOP (for topology) stores the subgraph explored by p; TOP contains tuples of the form: (process identifier, its neighborhood). In the initial state, TOP contains $(p, P)$.

Fig. 1. Process of Detector.

Function connectivity evaluates the topology of the subgraph stored in TOP. Recall that a node u is unexplored by p if for every tuple $(s, S) \in TOP$, s is not the same as u. That is, u may appear in S only. We construct graph $G'$ by adding an edge between every pair of unexplored processes present in TOP. We calculate the value of connectivity as follows. If the information of TOP is inconsistent, that is

$$(\exists u, v, U, V : ((u, U) \in TOP) \land ((v, V) \in TOP) : (u \in V) \land (v \notin U)),$$

then connectivity returns 0. Otherwise, the function returns the connectivity of graph $G'$. In the proof of the algorithm's correctness, we demonstrate that the connectivity of $G'$ may fall below k only if there is a fault in the graph.
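The evaluation just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of Fig. 1: TOP is modeled as a dictionary from an explored process to its reported neighborhood, all function names are hypothetical, and node connectivity is computed by exhaustive search, which is practical only for small examples.

```python
from itertools import combinations


def inconsistent(top):
    """Some explored u is listed as a neighbor by explored v, but not vice versa."""
    return any(u in top[v] and v not in top[u] for u in top for v in top)


def auxiliary_graph(top):
    """Graph from TOP plus an edge between every pair of unexplored nodes."""
    adj = {u: set() for u in top}
    for u, nbrs in top.items():
        for v in nbrs:
            adj.setdefault(v, set()).add(u)
            adj[u].add(v)
    unexplored = [v for v in adj if v not in top]
    for x, y in combinations(unexplored, 2):
        adj[x].add(y)
        adj[y].add(x)
    return adj


def is_connected(adj, removed):
    nodes = [v for v in adj if v not in removed]
    if not nodes:
        return False
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)


def node_connectivity(adj):
    """Smallest node set whose removal disconnects or trivializes the graph."""
    n = len(adj)
    for size in range(n):
        for cut in combinations(adj, size):
            if n - size <= 1 or not is_connected(adj, set(cut)):
                return size
    return n


def connectivity(top):
    """Return 0 on inconsistent data, else the connectivity of the auxiliary graph."""
    return 0 if inconsistent(top) else node_connectivity(auxiliary_graph(top))
```

In the initial state of a process p with two neighbors, the auxiliary graph is a triangle and the function returns 2. If an explored neighbor q reports an extra node f that nobody else mentions, f hangs off q alone and the function returns 1, which is the "too few paths" situation that signals a possible fake node when k = 1.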

Processes exchange messages of the form (process identifier, its neighborhood id set). A process contains two actions: init and accept. Action init starts the propagation of p's neighborhood throughout the system. Action accept receives the neighborhood data of some process, records it, checks it against other data already available to p, and possibly further disseminates the data. If the data received from neighbor q about a process r contradict what p already holds about r in TOP, or if the newly arrived information implies that G is less than 2-connected (that is, the connectivity of the resultant graph is less than $k+1$), p indicates that it detected a fault by setting detect to true. Alternatively, if p did not previously have the information about r, p updates TOP and sends the received information to all its neighbors.
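The two actions can be sketched as methods of a small Python class. This is a hedged reading of the prose description rather than a transcription of Fig. 1: the class name, the `outbox` list standing in for outgoing channels, the injected `connectivity_fn`, and the choice not to record data that triggers detection are all assumptions of this illustration.

```python
class DetectorProcess:
    """Sketch of one Detector process for a fault bound k (default k = 1)."""

    def __init__(self, pid, neighbors, connectivity_fn, k=1):
        self.pid = pid
        self.neighbors = set(neighbors)
        self.connectivity_fn = connectivity_fn  # evaluates a candidate TOP
        self.k = k
        self.detect = False
        self.start = True
        self.top = {pid: frozenset(neighbors)}  # TOP initially holds (p, P)
        self.outbox = []                        # (receiver, message) pairs

    def _send_all(self, message):
        for q in self.neighbors:
            self.outbox.append((q, message))

    def init(self):
        """Start propagating this process's own neighborhood, once."""
        if self.start:
            self.start = False
            self._send_all((self.pid, self.top[self.pid]))

    def accept(self, sender, r, R):
        """Handle message (r, R) received from neighbor `sender`."""
        R = frozenset(R)
        if r in self.top:
            if self.top[r] != R:        # contradicts what is already held
                self.detect = True
            return
        candidate = dict(self.top)
        candidate[r] = R
        if self.connectivity_fn(candidate) <= self.k:   # less than k + 1
            self.detect = True
            return
        self.top = candidate            # first information about r
        self._send_all((r, R))          # forward to all neighbors
```

A usage example: a process p with neighbors q and r that first receives q's neighborhood records it and forwards it to both neighbors; a later message carrying a different neighborhood for q sets `detect` to true.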

Observe that the propagation of information about the neighborhood of a certain process is independent of the information propagation of another process. Thus, we will focus on the propagation of the information about a particular nonfaulty process a.

Let COR contain each process b such that b is not faulty and TOP.b holds $(a, A)$. Let a itself belong to COR if start.a is false.

Lemma 1. The following predicate is an invariant of Detector:

$$(\forall \text{ nonfaulty } b, c : b \in COR,\ c \in B : (c \in COR) \lor ((a, A) \in Ch.b.c)) \;\lor\; (\exists \text{ nonfaulty } j : j \in N : detect.j = \mathbf{true}). \qquad (1)$$

The predicate states that unless one of the nonfaulty processes in the program detects a fault, if a process b belongs to COR, then each neighbor c of b either belongs to COR as well or the channel from b to c contains $(a, A)$.

Proof. To prove that Predicate 1 is an invariant of the program, we need to show that it holds in the initial state of any computation and that it is closed under the execution of actions of Byzantine as well as nonfaulty processes. The predicate holds initially as the first disjunct is vacuously true.

Note that no action of a Byzantine process immediately affects the validity of the predicate. Observe also that a nonfaulty process can only set detect to true. Thus, once this happens, the predicate holds throughout the rest of the computation. Suppose detect is false in all processes of the program. Then, the predicate is violated only if there is a nonfaulty pair of neighbors b and c such that b belongs to COR, c does not, and there is no message (a, A) in the channel from b to c. Note that a nonfaulty process adds the first value (r, R) to TOP and never changes it afterward. Thus, provided that detect = false, to violate the predicate, a process has to join COR without sending (a, A) to its neighbors or consume a message with (a, A) without joining COR. Let us examine the actions of a nonfaulty process and ensure that neither of these happens.

Observe that init is only of interest in a. This action sets start.a = false, which, by definition, adds a to COR. Also, init atomically sends (a, A) to all neighbors of a. Thus, the predicate is not violated by the execution of init.

Let us now consider accept in an arbitrary nonfaulty process u. Let the message received by u carry (r, R). Observe that accept affects Predicate 1 only if r = a. accept may make u join COR or consume a message with (a, A). Note that if u is already in COR, the receipt of a message with (a, A) does not violate the predicate. Also, u joins COR only if it receives (a, A). Hence, the only case we have to consider is when u does not belong to COR before the execution of accept, u receives (a, A), and joins COR.

The behavior of u in this case depends on whether it has an element (s, S) in TOP.u such that s = a. Since u ∉ COR, if (a, S) ∈ TOP.u, then S differs from A. In this case, if u receives (a, A), then it sets detect = true. This preserves the validity of the predicate. Alternatively, if such an entry in TOP.u does not exist, then the receipt of (a, A) causes u to join COR and forward (a, A) to all its neighbors. This preserves the predicate as well.

Thus, Predicate 1 holds in the initial state of every computation of the program and is preserved by its every action, which means that this predicate is an invariant of the program. □
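The invariant can also be exercised on a small fault-free run. The sketch below is an illustration under assumptions, not part of the proof: the graph is a plain adjacency dictionary, channels are FIFO queues, messages are delivered in an arbitrary fixed order, and process a is assumed to ignore copies of its own neighborhood. Predicate 1 is asserted after every step.

```python
from collections import deque

def predicate1(graph, a, A, top, start, channels, detect):
    """Check Predicate 1 when every process in `graph` is nonfaulty."""
    if any(detect.values()):
        return True
    def in_cor(b):
        return (not start[a]) if b == a else top[b].get(a) == A
    return all(
        in_cor(c) or (a, A) in channels[(b, c)]
        for b in graph if in_cor(b)
        for c in graph[b]
    )

def flood(graph, a):
    """Propagate (a, A) with Detector-style forwarding, asserting the invariant."""
    A = frozenset(graph[a])
    top = {p: {} for p in graph}
    start = {p: True for p in graph}
    detect = {p: False for p in graph}
    channels = {(b, c): deque() for b in graph for c in graph[b]}
    assert predicate1(graph, a, A, top, start, channels, detect)  # initial state

    start[a] = False                       # action init at a
    for c in graph[a]:
        channels[(a, c)].append((a, A))
    assert predicate1(graph, a, A, top, start, channels, detect)

    while any(channels.values()):
        (b, c), ch = next((k, v) for k, v in channels.items() if v)
        r, R = ch.popleft()                # action accept at c
        if r in top[c]:
            if top[c][r] != R:
                detect[c] = True
        elif c != r:                       # assumption: a ignores its own data
            top[c][r] = R
            for d in graph[c]:
                channels[(c, d)].append((r, R))
        assert predicate1(graph, a, A, top, start, channels, detect)
    return top
```

On a four-node ring, the run ends with every process other than a holding (a, A) in its TOP, which matches the statement that every nonfaulty process eventually joins COR when no fault is detected.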

Lemma 2. If a computation of Detector contains a state where there is a process u that belongs to COR that has a nonfaulty neighbor v that does not, then further in the computation, either some nonfaulty process sets detect = true or v joins COR.

Proof. According to Lemma 1, Predicate 1 is an invariant of the program. Hence, unless some nonfaulty process has already set detect = true, if u belongs to COR and its nonfaulty neighbor v does not, then channel Ch.u.v contains a message with (a, A). Due to the fair message receipt assumption, (a, A) is eventually received. Observe that if v is not in COR and it receives (a, A), then v either sets detect = true or joins COR. □

Lemma 3. Every computation of Detector contains a state where either detect = true in some nonfaulty process or every nonfaulty process belongs to COR.

Proof. The proof is by induction on the number of nonfaulty processes in the program. As a base case, we show that a itself eventually joins COR. Recall that we assume that a itself is not faulty. Observe that the program starts in a state where start.a is true. If this is so, init is enabled. Moreover, init is the only action that sets start.a to false. Thus, init stays enabled until executed. By the weak fairness assumption, init is eventually executed. When this happens, a joins COR.

Assume that COR contains i processes, 1 ≤ i < n, at some state of a computation and there is a nonfaulty process that does not belong to COR. We assume that the connectivity of the graph exceeds the maximum number of faulty processes. Thus, there is a nonfaulty process u ∈ COR that has a nonfaulty neighbor v ∉ COR. According to Lemma 2, this computation contains a state where either detect = true in some nonfaulty process or COR contains v. Thus, unless a fault is detected, every nonfaulty process eventually joins COR. □

Lemma 4. If a computation of Detector contains a state where nonfaulty process u explores a fake process v, then this computation contains a state where detect = true in some nonfaulty process.

Proof. Observe that the only source of fake process information is a Byzantine process. Hence, if u explores a fake process v, then every path to v leads through a Byzantine process. Thus, in a graph with a fake node, the maximum number of node disjoint paths between a real and a fake node is no more than k = 1.

According to Lemma 3, eventually, either detect = true at a nonfaulty process or u explores every nonfaulty process in the system. In the latter case, u detects that all paths to the fake node v lead through no more than k processes and sets detect = true. □
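A small example makes the argument concrete. In the sketch below, the topology, the node names, and the helper `reachable` are hypothetical: a ring u, x, z, y has one Byzantine node z that advertises an edge to a nonexistent node "fake". Deleting z alone separates "fake" from u, so by Menger’s theorem there is at most k = 1 internally node disjoint path between them, which is below the required k + 1.

```python
def reachable(adj, source, removed):
    """Nodes reachable from `source` once the nodes in `removed` are deleted."""
    seen, stack = set(), [source]
    while stack:
        x = stack.pop()
        if x in seen or x in removed:
            continue
        seen.add(x)
        stack.extend(adj.get(x, ()))
    return seen

# Hypothetical claimed topology as it could appear in TOP at process u.
claimed = {
    "u": {"x", "y"},
    "x": {"u", "z"},
    "y": {"u", "z"},
    "z": {"x", "y", "fake"},   # z is Byzantine and invents the node "fake"
    "fake": {"z"},
}

# Removing the single node z cuts "fake" off, while real nodes stay reachable.
survivors = reachable(claimed, "u", removed={"z"})
fake_cut_off = "fake" not in survivors
```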

Lemma 5. If the system does not contain a faulty process, then the connectivity of the explored graph G′ is at least k + 1.

Proof. We prove the lemma by inductively removing each node p of G that is not present in G′ and replacing it with edges between p’s neighbors. If a graph is (k + 1)-connected, then each node has at least k + 1 neighbors. Note that at least one node is always explored in G′. Thus, the number of nodes in G′ is no less than k + 1. In our proof, we make use of Menger’s theorem and show that such removal does not decrease the number of node-independent paths connecting two arbitrary nodes below k + 1.

Let G′ = G. If G is (k + 1)-connected, so is G′. Let G′.i be a graph with i > (k + 1) nodes of G remaining in G′. Assume that the connectivity of G′.i is greater than k + 1.

Consider how the removal of an arbitrary node p ∈ G′.i affects node-independent paths connecting a pair of arbitrary nodes. Note that it is sufficient to only consider paths whose ends are the neighbors of p as other affected paths include the neighbors of p as segments.

Let us start with the case where p shares an edge with every node in G′.i. If p is removed, G′.(i − 1) becomes completely connected and its connectivity is k + 1 > k. Now, let us consider the case where there is a node u that is not a neighbor of p in G′.i. Let v and w be two neighbors of p.

Consider the case where v and w do not share an edge. There may be an independent path v, p, w that connects the two nodes in G′.i. After the removal of p, v and w are connected. This means that a path v, p, w is replaced by another path v, w. That is, the independent path is preserved in G′.(i − 1).

Consider the case where v and w share an edge in G′.i. That is, an edge that connects them is not added in G′.(i − 1). In this case, the two nodes may lose an internally node disjoint path v, p, w that connects them. Assume that the number of nodes in the neighborhood of p is greater than k + 1. Since the neighborhood is completely connected in G′.(i − 1), there are at least k + 1 internally node disjoint paths between v and w in G′.(i − 1). Assume that the number of nodes in p’s neighborhood is exactly k + 1. In this case, there are only k internally node disjoint paths between v and w that only use the neighbors of p. We now show that there is another path.

According to Menger’s theorem, there are at least k + 1 internally node disjoint paths that connect p and u in G′.i. Each path then has to go through a different neighbor of p. This includes v and w. That is, there exist two paths p, v, ..., u and p, w, ..., u in G′.i that do not share nodes except p and u. Neither do these paths contain any other neighbors of p. Consider path v, ..., u, ..., w composed of the segments of these two paths. The new path does not contain neighbors of p. This new path exists in G′.(i − 1) and it is internally node disjoint with the other k paths that connect v and w. That is, the connectivity of G′.(i − 1) remains at least k + 1. The lemma follows by induction. □

Lemma 6. Any computation of the Detector program contains a state where a Byzantine process is detected only if there is indeed a Byzantine process in the system.

Proof. A nonfaulty process sets detect to true if it encounters divergent information about some node’s neighborhood or when it detects that connectivity is less than 2 (i.e., k + 1). However, a nonfaulty process never modifies the neighborhood information about other processes. Hence, if the program does not have a faulty process, all the information about a particular neighborhood that is circulated in the system is identical. Also, according to Lemma 5, if there are no faulty processes in the system, the connectivity never falls below 2 (i.e., k + 1). Hence, detect is set to true only if the system indeed contains a faulty process. □

Theorem 5. Detector is an adjacent-edge complete solution to the weak topology discovery problem in case the connectivity of the system topology graph exceeds the number of faults.

Proof. To prove the theorem, we show that every computation of Detector conforms to the properties of the problem. We then show that the discovered topology is adjacent-edge complete.

The termination property follows from Lemma 3, safety from Lemma 4, and validity from Lemma 6. Note that Lemma 3 states that unless a fault is detected, every nonfaulty process joins COR, that is, it records the neighborhood of each nonfaulty process a. Hence, edges adjacent to nonfaulty processes are detected by every nonfaulty process. Thus, Detector is adjacent-edge complete. Hence the theorem. □

Note that with k = 1, an adjacent-edge complete solution is also node and simply-complete. Hence the following corollary:

Corollary 1. Detector provides a complete solution to the topology discovery problem in case the system is doubly connected and contains at most one fault.

4.3 Efficiency Evaluation

Since we consider an asynchronous model, the number of messages a Byzantine process can send in a computation is infinite. We evaluate the efficiency of Detector in case there are no faults in the system. To make the estimation fair, we assume that the unit is log(n) bits. Since it takes that many bits to assign unique process identifiers to n processes, we assume that one identifier is exactly one unit of information. A message in Detector carries up to δ + 1 identifiers, where δ is the maximum number of nodes in the neighborhood of a process. Observe that a process can receive at most n messages from each incoming channel. Thus, the total number of messages that can be sent by Detector is 2en, where e is the number of edges in the graph. The message complexity of the program is in O(2enδ). If e is proportional to n², then the complexity of the program is in O(δn³).
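The counting argument can be checked with a few lines of arithmetic. The function name and the sample graph below are illustrative assumptions; the formulas are the ones stated in the text, with `delta` standing for the maximum neighborhood size.

```python
def detector_cost(n, e, delta):
    """Fault-free upper bounds for Detector, following the counting in the text."""
    messages = 2 * e * n            # at most n messages on each of the 2e directed channels
    units = messages * (delta + 1)  # each message carries up to delta + 1 identifiers
    return messages, units

n = 10
e = n * (n - 1) // 2                # complete graph on 10 nodes: 45 edges
bound = detector_cost(n, e, delta=n - 1)
```

For this complete graph, the sketch gives at most 900 messages and 9,000 identifier units, in line with the O(δn³) estimate when e grows as n².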

4.4 Modification to Handle k > 1 Faults

Observe that the above discussion and correctness proofs are still valid if the possible number of faulty nodes is greater than one and the required network graph connectivity is k + 1.

5 EXPLORER

Outline. The main idea of Explorer is for each process to collect information about some node’s neighborhood such that the information travels along more than twice as many paths as the maximum number of Byzantine nodes. As long as the paths are node disjoint, the information is correct if it arrives across the majority of the paths. In this case, the recipient is in possession of confirmed information. It turns out that the topology information does not have to come directly from the source. Instead, it can come from processes with confirmed information. The detailed description of Explorer follows.
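The majority rule behind this outline can be illustrated in isolation. The sketch below is a simplification rather than the Explorer program itself: it assumes that the recipient already holds exactly one report per node disjoint path, and the function name `majority_value` is hypothetical.

```python
from collections import Counter

def majority_value(reports, k):
    """Return the value delivered over at least k + 1 of the disjoint paths.

    `reports` holds one received value per node disjoint path (2k + 1 paths
    in the intended setting). With at most k Byzantine nodes, at most k
    paths are corrupted, so a value seen on k + 1 or more paths was carried
    by at least one fault-free path.
    """
    if not reports:
        return None
    value, count = Counter(reports).most_common(1)[0]
    return value if count >= k + 1 else None
```

With k = 1 and three paths, two matching reports are enough to outvote the single path that a Byzantine node may control.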

To simplify the presentation, we describe and prove correct the version of Explorer that tolerates only one Byzantine fault. We describe how this version can be extended to tolerate multiple faults at the end of the section.

5.1 Description

Since we first describe the 1-fault-tolerant version of Explorer, we assume that the graph is 3-connected. The program is shown in Fig. 2. Similar to Detector, each process p in Explorer stores the ids of its immediate neighbors. Process p maintains the variable start, whose function is to guard the execution of the action that initiates the propagation of p’s own neighborhood. Unlike Detector, however, p maintains two sets that store the topology information of the network: uTOP and cTOP. Set uTOP stores the topology data that are unconfirmed; cTOP stores confirmed topology data. Set uTOP contains the tuples of neighborhood information that p received from other nodes. Besides the process id and the set of its neighbor ids, each such tuple contains the set of identifiers of the processes that relayed the information. We call it the visited set. The tuples in cTOP do not require a visited set.

Processes exchange messages in which a visited set is propagated along with the neighbor identifiers of a certain process. A process contains two actions: init and accept. The purpose of init is similar to that in the process of Detector. Action accept receives the identifier of some process r and its neighborhood R, which were relayed by the nodes in set S. The information is received from p’s neighbor q.

First, accept checks if the information about r is already confirmed. If so, the only manipulation is to record the received information in uTOP. Actually, this update of uTOP is not necessary for the correct operation of the program, but it makes the proof of correctness easier to follow.

If the received information does not concern an already confirmed process, accept checks if this information differs from what is already recorded in uTOP either in r or in R. If it differs in either, the information is broadcast to all neighbors of p. Before broadcasting, p appends the sender q to the visited set S.

If the information about r and R has already been received and recorded in uTOP, accept checks if the previously recorded information came along an internally node disjoint path. If so, the information about r is added to cTOP. In this case, this information is also broadcast to all p’s neighbors. Note, however, that p is now sure of the information it received. Hence, the visited set of nodes in the broadcast message is empty.
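The description above can be turned into a compact sketch of the 1-fault-tolerant process. It is reconstructed from the prose, not from Fig. 2, so several details are assumptions: the class and attribute names, the `outbox` list standing in for channels, the choice to place a process's own neighborhood in cTOP at init, and the disjointness test, which compares each recorded visited set against the incoming visited set with the sender q already appended.

```python
class ExplorerProcess:
    """Hypothetical rendering of one 1-fault-tolerant Explorer process p."""

    def __init__(self, pid, neighbors):
        self.pid = pid
        self.neighbors = frozenset(neighbors)
        self.start = True
        self.utop = set()   # unconfirmed tuples (r, R, visited set)
        self.ctop = {}      # confirmed neighborhoods: r -> R
        self.outbox = []    # (destination, (r, R, visited set))

    def _broadcast(self, r, R, visited):
        for q in self.neighbors:
            self.outbox.append((q, (r, R, visited)))

    def init(self):
        if self.start:
            self.start = False
            self.ctop[self.pid] = self.neighbors   # assumption: own data is confirmed
            self._broadcast(self.pid, self.neighbors, frozenset())

    def accept(self, q, r, R, S):
        R, visited = frozenset(R), frozenset(S) | {q}
        earlier = [V for (r2, R2, V) in self.utop if (r2, R2) == (r, R)]
        self.utop.add((r, R, visited))
        if r in self.ctop:
            return                                 # already confirmed: only record
        if not earlier:
            self._broadcast(r, R, visited)         # first copy of (r, R) seen by p
        elif any(V & visited <= {r} for V in earlier):
            self.ctop[r] = R                       # second copy over a disjoint path
            self._broadcast(r, R, frozenset())     # confirmed: empty visited set
```

In this sketch a neighborhood received directly from its owner and again through one relay is confirmed, whereas two copies that both passed through the same neighbor are not.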

5.2 Correctness Proof

Similar to the Detector algorithm, we focus on the propagation of the neighborhood information A of a single nonfaulty process a. Note that we use A to denote the correct neighborhood information. We use A′ for the neighborhood information of a that may not necessarily be correct.

To aid us in the argument, we introduce an auxiliary set SENT to be maintained by each process. Since this set does not restrict the behavior of processes, we assume that the Byzantine process maintains this set as well. SENT contains each message sent by the process throughout the computation. Note that uTOP records every message received by the process in the computation. Hence, the comparison of uTOP and SENT allows us to establish the channel contents.


Fig. 2. Process of Explorer.


Since a message cannot be received without being sent and vice versa, the following proposition states that the predicate affirming this is an invariant.

Proposition 1. The following predicate is an invariant of the Explorer program:

$$(\forall\, b, \text{ nonfaulty } c, A', V : c \in B : (((a, A', V) \in \mathit{Ch}.b.c) \lor ((a, A', V \cup \{b\}) \in \mathit{uTOP}.c)) \Leftrightarrow ((a, A', V) \in \mathit{SENT}.b)). \tag{2}$$

The predicate states that for any process b and its nonfaulty neighbor c, the information about the neighborhood of a is recorded in SENT.b if and only if this information is en route from b to c or is recorded in uTOP.c with b appended to the set of visited nodes V.

Before we proceed with the correctness argument, we have to introduce additional notation. We say that some process c confirms (a, A′) if it adds this tuple to cTOP.c. We view the propagation of A′ as construction of a tree of processes that relayed A′. This tree carries A′. A tree contains two types of nodes: a root and a nonroot. If process c is nonroot, then for some V, (a, A′, V) ∈ SENT.c and (a, A′, V) ∈ uTOP.c. That is, a nonroot is a process that forwarded the information received from elsewhere without alteration. If c is a root, then (a, A′, V) ∈ SENT.c but (a, A′, V) ∉ uTOP.c. Node c’s ancestor in a tree is a node that lies on the path from c to the root.

Observe that the root of a tree can only be the process a itself, the Byzantine node, or a node that confirms (a, A′). Note also that since each nonfaulty process c sends a message about a’s information at most twice, c can belong to at most two trees. Moreover, c has to be the root of one of those trees.

The proposition below follows from Proposition 1.

Proposition 2. If some process d is the ancestor of another process c in a tree carrying (a, A′) and (a, A′, V) ∈ uTOP.c, then d ∈ V.

Lemma 7. If a nonfaulty node c confirms (a, A′), then A′ = A and a is real.

Proof. Let us first suppose that a is real. Further, suppose c is the first nonfaulty process in the system, besides a, to confirm (a, A′). To add (a, A′) to cTOP.c, any process c ≠ a has to contain (a, A′, V) ∈ uTOP.c and receive a message from one of its neighbors b carrying (a, A′, V′) such that V ∩ V′ ⊆ {a}. In our notation, this means that c belongs to a tree that carries (a, A′) and receives a message from b (possibly belonging to a different tree) that carries the same information: (a, A′). Let us consider two cases: b and c belong to the same tree or to different trees.

Suppose b and c belong to the same tree. If this is the case, the messages that c receives have to share nodes in the visited sets V and V′. However, for c to confirm (a, A′), the intersection of V and V′ has to be a subset of {a}. That is, the only common node between the two sets is a. Observe that a does not forward the information about its own neighborhood if it receives it from elsewhere. Thus, if a belongs to a tree, then a is its root. In this case, A′ = A.

Suppose b and c belong to different trees. Recall that for c to confirm (a, A′), both of these trees have to carry (a, A′). However, if A′ ≠ A, then the root of the tree is either the faulty node or another node that confirmed (a, A′). Yet, we assumed that c is the first node to do so. Thus, if c receives a message from b, the only tree that carries the information (a, A′) such that A′ ≠ A is rooted in the faulty node. Thus, even if b and c belong to different trees, A′ = A.

Similarly, if a is fake, unless another node confirms (a, A′), there is only one tree that carries (a, A′) and it is rooted in the faulty node. In this case, no other node confirms (a, A′). □

Lemma 8. Every computation of Explorer contains a state where each nonfaulty process belongs to at least one tree carrying (a, A).

Proof. We prove the lemma by induction on the number of nodes in the system. To prove the base case, we observe that the init action is enabled in a in the beginning of every computation. This action stays enabled unless executed. Thus, due to the weak fairness of action execution assumption, init is eventually executed in a. When it is executed, a forms a tree carrying (a, A).

Let us assume that there are i nonfaulty nodes, 1 ≤ i < n, that belong to trees carrying (a, A). Since the network is at least 3-connected, there exists a nonfaulty process c that does not belong to such a tree but has a neighbor b that does.

If b belongs to a tree carrying (a, A), then SENT.b contains an entry (a, A, V) for some set of visited nodes V. If c does not belong to such a tree, then, by definition, (a, A, V′) ∉ uTOP.c. In this case, according to Proposition 1, Ch.b.c contains (a, A, V). A similar argument applies to the other neighbors of c that belong to trees carrying (a, A). That is, c has incoming messages from every such neighbor.

According to the fair message receipt assumption, these messages are eventually received. We can assume, without loss of generality, that c receives a message from b first. Since c does not contain an entry (a, A, V′) in uTOP.c, upon receipt of the message from b, c sends a message with (a, A, V ∪ {b}), attaches this message to SENT.c, and includes it in uTOP.c. This means that c joins the tree carrying (a, A).

Thus, every nonfaulty node eventually joins a tree carrying correct neighborhood information about a. □

A branch of a tree is either a subtree without the root or the root process alone. The following proposition follows from Proposition 1:

Proposition 3. If a computation of Explorer contains a state where a nonfaulty node c and its neighbor b either belong to two different trees carrying the same information (a, A) or to two different branches of the tree rooted in a, then this computation also contains a state where c confirms (a, A).

Lemma 9. Every nonfaulty process c eventually confirms (a, A).

Proof. The proof is by induction on the number of nodes in the system. The base case trivially holds as a itself confirms (a, A) in the beginning of every computation. Assume that i nonfaulty processes have (a, A) in cTOP, where 1 ≤ i < n. We show that if there exists another nonfaulty process c, it eventually confirms (a, A). Two cases have to be considered: there exists only one tree carrying (a, A), and there are multiple such trees.

Let us consider the first case. Note that in every computation, there eventually appears a tree rooted in a. In this case, we may only consider a tree so rooted. Since the network is at least 3-connected, there exists a simple cycle containing a and not containing the faulty process. According to Lemma 8, every process in the cycle eventually joins this tree. Observe that, by our definition of a tree branch, there always is a pair of neighbor processes b and c that belong to different branches of a tree rooted in a and carrying (a, A). In this case, according to Proposition 3, one of the two nodes eventually confirms (a, A).

Let us now consider the case of multiple trees carrying (a, A). Again, according to Lemma 8, each nonfaulty process in the system joins at least one of these trees. Since the network is at least 3-connected, there exists a nonfaulty process c belonging to one tree that has a neighbor b belonging to a different tree. In this case, according to Proposition 3, c confirms (a, A).

By induction, every nonfaulty process in the system eventually confirms (a, A). □

Theorem 6. Explorer is a two-adjacent-edge complete solution to the strong topology discovery problem in case of one fault if the system topology graph is at least 3-connected.

Proof. Explorer conforms to the termination and safety properties of the problem as a consequence of Lemmas 9 and 7, respectively.

Observe that a nonfaulty node may potentially confirm incorrect neighborhood information about a Byzantine node. That is, an edge reported by the faulty process may be either missing or fake. However, due to the above two lemmas, if two nodes are nonfaulty, the information whether there is an edge between them is discovered by every nonfaulty node. Hence, Explorer is two-adjacent-edge complete. □

5.3 Modification to Handle k > 1 Faults

Observe that Explorer confirms the topology information about a node’s neighborhood when it receives two messages carrying it over internally node disjoint paths. Thus, the program can handle a single Byzantine fault. Explorer can handle k > 1 faults if it waits until it receives k + 1 messages before it confirms the topology information. All the messages have to travel along internally node disjoint paths. For the correctness of the algorithm, the topology graph has to be (2k + 1)-connected.

Proposition 4. Explorer is a two-adjacent-edge complete solution to the strong topology discovery problem in case of k faults if the system topology graph is at least (2k + 1)-connected.
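The generalized confirmation rule amounts to finding k + 1 recorded copies whose visited sets are pairwise disjoint apart from the source. The following sketch shows one way to test this; it is an assumption about how the check could be coded, the function name is hypothetical, and the exhaustive search over subsets is meant for small illustrations only.

```python
from itertools import combinations

def can_confirm(visited_sets, source, k):
    """True if k + 1 of the recorded visited sets are pairwise disjoint
    apart from `source`, i.e., the copies travelled along internally node
    disjoint paths.
    """
    paths = [frozenset(v) - {source} for v in visited_sets]
    return any(
        all(not (x & y) for x, y in combinations(group, 2))
        for group in combinations(paths, k + 1)
    )
```

For k = 1, this reduces to the two-message rule of the basic program: any two copies whose visited sets share only the source suffice.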

5.4 Efficiency Evaluation

Similar to Detector, the number of messages a faulty node may send in Explorer is arbitrarily large. However, we can estimate the message complexity of Explorer in the absence of faults. Each message carries a process identifier, a neighborhood of this process, and a visited set. The number of the identifiers in a neighborhood is no more than δ, and the number of identifiers in the visited set can be as large as n. Hence, the message size is bounded by δ + n + 1, which is in O(n).

Note that for the neighborhood A of each process a, every process broadcasts a message twice: when it first receives the information and when it confirms it. Thus, the total number of sent messages is 4e · n and the overall message complexity of Explorer in the absence of faults is in O(n⁴).
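The estimate can be reproduced numerically. In the sketch below, the function name and the sample values are illustrative assumptions. One broadcast by a process costs as many messages as it has neighbors, so one round of broadcasts by all processes costs 2e messages; two rounds for each of the n neighborhoods give 4e · n.

```python
def explorer_cost(n, e, delta):
    """Fault-free upper bounds for Explorer, following the counting in the text."""
    messages = 4 * e * n                 # two broadcasts per process per neighborhood
    units = messages * (delta + n + 1)   # neighborhood + visited set + one identifier
    return messages, units

n = 10
e = n * (n - 1) // 2                     # complete graph on 10 nodes: 45 edges
bound = explorer_cost(n, e, delta=n - 1)
```

For this graph, the sketch yields at most 1,800 messages and 36,000 identifier units, consistent with the O(n⁴) bound when e grows as n².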

6 COMPOSITION AND EXTENSIONS

6.1 Composing Detector and Explorer

Observe that Detector has better message complexity than Explorer if the neighborhood size is bounded. Hence, if the incidence of faults is low, it is advantageous to run Detector and invoke Explorer only if a fault is detected. We assume that the processes can distinguish between message types of Explorer and Detector. In the combined program, a process running Detector switches to Explorer if it discovers a fault. Other processes follow suit when they receive their first Explorer messages. They ignore Detector messages henceforth. A Byzantine process may potentially send an Explorer message as well, which leads to the whole system switching to Explorer. Observe that if there are no faults, the system will not invoke Explorer. Thus, the complexity of the combined program in the absence of faults is the same as that of Detector. Note that even though Detector alone only needs (k + 1)-connectivity of the system topology, the combined program requires (2k + 1)-connectivity.
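The switching rule of the combined program is a two-state machine, sketched below. The class, its method names, and the string labels for message types are assumptions made for illustration; the actual Detector and Explorer handlers are represented only by a log of the messages that would be passed to them.

```python
class CombinedProcess:
    """Hypothetical mode switch for the combined Detector/Explorer program."""

    def __init__(self):
        self.mode = "detector"
        self.handled = []   # (protocol, payload) pairs given to the active protocol

    def on_fault_detected(self):
        self.mode = "explorer"          # local detection triggers the switch

    def on_message(self, protocol, payload):
        if protocol == "explorer":
            self.mode = "explorer"      # first Explorer message: follow suit
        elif self.mode == "explorer":
            return                      # Detector messages are ignored henceforth
        self.handled.append((protocol, payload))
```

Because the switch is one-way, a single Explorer message, even one sent by a Byzantine process, moves the receiver to Explorer permanently.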

6.2 Message Termination

We have shown that Detector and Explorer comply with the functional termination properties of the topology discovery problem. That is, all processes eventually discover topology. However, the performance aspect of termination, viz., message termination, is also of interest. Usually, an algorithm is said to be message terminating if all its computations contain a finite number of sent messages [5].

However, a Byzantine process may send messages indefinitely. To capture this, we weaken the definition of message termination. We consider a Byzantine-tolerant program message terminating if the system eventually arrives at a state where 1) all channels are empty except for the outgoing channels of a faulty process and 2) all actions in nonfaulty processes are disabled except for possibly the receive actions of the incoming channels from Byzantine processes. These receive actions do not update the variables of the process. That is, in a terminating program, each nonfaulty process eventually starts to discard the messages it receives from its Byzantine neighbors.

Making Detector terminating is fairly straightforward. Once a process detects a fault, it floods the announcement throughout the system. Since the topology graph for Detector is assumed to be (k + 1)-connected, every process receives such an announcement. When a process learns of the detection, it stops processing or forwarding the messages. Note that the initiation of the flood by a Byzantine node itself only accelerates the termination of Detector as the other processes quickly learn of the faulty node’s existence.


The addition of termination to Explorer is more involved. To ensure termination, restrictions have to be placed on message processing and forwarding. However, the restrictions should be delicate as they may compromise the liveness properties of the program. By the design of Explorer, each process may send at most one message about its own neighborhood to its neighbors. Hence, the subsequent messages can be ignored. However, a faulty process may send messages about neighborhoods of other processes. These processes may be real or fake. We discuss these cases separately.

Note that each process in Explorer can eventually obtain an estimate of the identities of the processes in the system and disregard fake process information. Indeed, a path to a fake node can only lead through faulty processes. Thus, if a process discovers that there may be at most k internally node disjoint paths between itself and a certain node, this node is fake. Therefore, the process may cease to process messages about the fake node’s neighborhood. Note that since the system is (2k + 1)-connected, messages about real nodes will always be processed. Therefore, the liveness properties of Explorer are not affected.

As to the real processes, they can be either Byzantine or nonfaulty. Recall that each nonfaulty process of Explorer eventually confirms neighborhoods of all other nonfaulty processes. After the neighborhood of a process is confirmed, further messages about it are ignored.

The last case is a Byzantine process u sending a message to its correct neighbor v about the neighborhood of another Byzantine process w. By the design of Explorer, v relays the message about w provided that the neighborhood information about w differs from what was previously received about w. As we discussed above, eventually v estimates the identities of all real processes in the system. Therefore, there is a finite number of possible different neighborhoods of w that u can create. Hence, eventually they will be exhausted, and v starts ignoring further messages from u about w.

Thus, Explorer can be made terminating as well.

6.3 Handling Topology Updates

In the topology discovery problem statement, it is assumed that the system topology does not change. However, Detector and Explorer can be adapted to manage topology changes as well. Observe that apart from detecting fake nodes in Explorer, both algorithms propagate the information of one process neighborhood independently of the others. We first describe how this propagation can be done in case the topology changes and then address the fake node detection. Each time the neighborhood of a process p changes, p starts a new version of the topology discovery algorithm for its neighborhood. Observe that a faulty process may also start a new version for p.

The versions are distinguished by version numbers. Each process maintains the version numbers of p. Each related message carries the version number. Each process outputs the discovered neighborhood of p with the highest received version number. Observe that in the case of Explorer, the processes only output confirmed information. Note that if a faulty process sends incorrect information about p’s neighborhood with a certain version number, this incorrect information will be handled by the basic Detector or Explorer within that version. For example, the faulty messages of version i about p’s neighborhood will be countered by the correct messages of the same version. Note that a faulty process in Explorer may start a version j for p’s neighborhood such that it is higher than the highest version i that p itself started. However, according to the basic Explorer, the incorrect information in version j will not be confirmed.
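The bookkeeping for versions can be sketched as a small store keyed by process and version number. The class and its methods are hypothetical; the sketch assumes that an entry is recorded only after the Detector or Explorer instance for that version has accepted it (for Explorer, confirmed it), so forged higher versions that are never confirmed do not appear in the store.

```python
class VersionedView:
    """Hypothetical per-process store of versioned neighborhood information."""

    def __init__(self):
        self.by_version = {}   # (process id, version) -> accepted neighborhood

    def record(self, p, version, neighborhood):
        """Store what the instance for this (p, version) pair accepted."""
        self.by_version[(p, version)] = frozenset(neighborhood)

    def output(self, p):
        """Neighborhood of p with the highest recorded version, if any."""
        versions = [v for (q, v) in self.by_version if q == p]
        return self.by_version[(p, max(versions))] if versions else None
```

Keeping every version, as done here, is the simplest form; only the entry with the highest version number is ever read back.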

There are two specific modifications to the basic Detector.If the faulty process sends a message concerning p with theversion number higher than that of p, p itself detects thefault. To detect fake nodes generated by a faulty process,each node has to compile the topology TOP graph of thehighest version number for each node in the system andensure that its connectivity does not fall below kþ 1.Observe that Detector is unable to differentiate betweentemporary lack of connectivity from malicious behavior ofthe faulty nodes. Therefore, the connectivity of the dis-covered network at each node should never fall below kþ 1.For that, we assume that throughout a computation, theintersection of all system topologies is kþ 1-connected.Similarly, the topology has to always be 2kþ 1-connectedfor Explorer to operate correctly.

The topology update mechanism can be optimized in obvious ways. For Detector, each process has to keep the information for p with only the highest version number. Obsolete information can be safely discarded. For Explorer, the process may keep the latest version of confirmed neighborhood information. Observe that this extension of the topology discovery algorithm assumes infinite-size counters. Care must be taken when implementing these counters in the actual hardware, as the faulty processes may try to compromise topology discovery if the counter values are reused. Hence, such an implementation would require a Byzantine-robust counter synchronization algorithm. Lamport and Melliar-Smith [11] proposed such an algorithm for completely connected systems. Extending it to arbitrary topology systems is an attractive avenue of further research.

6.4 Discovering Neighbors

As described, in the initial state of Detector and Explorer, each process has access to correct information about its immediate neighborhood. Note that, in general, obtaining this information in the presence of Byzantine processes may be difficult as they can mount a Sybil attack [8]. In such an attack, a faulty process is able to use an arbitrary identifier while a correct process cannot determine whether two messages were sent by the same process or by different processes. A Sybil attack is difficult to handle [4], [23]. However, Detector and Explorer can be modified to handle neighborhood discovery with known ports. That is, each process does not know the identities of its neighbors but can determine if a message is coming from the same process. Also note that we assume that there are no actual topology changes during the computation, as they would interfere with the neighborhood discovery. Observe that with known ports, a faulty process may not be able to use more than one identifier per correct neighbor without being detected.

Note that there are certain limits of what can be discovered under these assumptions. First, no algorithm can determine the identifier of a faulty process as it may assume an arbitrary one. Moreover, with more than one faulty process, the system is subject to a “black hole” attack: a pair of colluding faulty nodes may deceive their nonfaulty neighbors into believing that they share an edge. When communicating to a nonfaulty node a, its faulty neighbor b assumes the identity of another nonfaulty node d. Similarly, a faulty neighbor c of d assumes the identity of a. This way, nonfaulty nodes a and d are led to believe they share an edge.

The modified algorithms contain two phases: the neighborhood discovery phase and the topology discovery proper phase. In the first phase, each process broadcasts its identifier to its neighbors. Observe that faulty processes may not send these initial messages at all. Thus, a process should not wait for a message from every possible neighbor. Instead, as soon as a process p gets a message from a new port carrying the identifier q, p may start the second phase with {q} as its neighborhood. Every time p gets a new distinct identity from a fresh port, p treats it as a topology update, increments its counter, and re-initiates the topology discovery. Note that a faulty process may select the identifier of another neighbor r. Due to known ports, the recipient will be able to distinguish between the two processes. In the case of Detector, the recipient can immediately detect a fault. In the case of Explorer, the proposed procedure is as follows. The recipient process labels its neighbors with the same identity as r and r', and propagates the information to the other processes in this form. The other processes treat these identities as distinct.
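The first phase can be sketched as a small per-process table that maps ports to announced identifiers. This is an illustrative sketch only: the class `PortNeighborhood`, its return values, and the labeling of repeated identities as `(identifier, occurrence)` pairs are hypothetical. The handling of a second announcement on an already bound port is also an assumption, since the text above does not specify it.

```python
class PortNeighborhood:
    """Hypothetical bookkeeping for neighborhood discovery with known ports."""

    def __init__(self):
        self.port_to_id = {}  # port -> identifier announced on that port
        self.version = 0      # counter incremented on every neighborhood update

    def on_identifier(self, port, identifier):
        """Process one announcement; return 'update', 'duplicate', 'ignored' or 'fault'."""
        if port in self.port_to_id:
            # Assumption: a bound port repeating its identifier is ignored, and
            # a bound port switching identifiers is reported as a fault.
            return "ignored" if self.port_to_id[port] == identifier else "fault"
        duplicate = identifier in self.port_to_id.values()
        self.port_to_id[port] = identifier
        self.version += 1  # fresh port: treat as a topology update
        # Detector would signal a fault on 'duplicate'; Explorer keeps both.
        return "duplicate" if duplicate else "update"

    def neighborhood(self):
        """Label neighbors so that repeated identities (r and r') stay distinct."""
        counts = {}
        labels = set()
        for identifier in self.port_to_id.values():  # order of arrival
            occurrence = counts.get(identifier, 0)
            counts[identifier] = occurrence + 1
            labels.add((identifier, occurrence))
        return labels
```

In this sketch, the pair `("r", 0)` plays the role of r and `("r", 1)` the role of r', so that other processes can treat the two as distinct identities.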

This identity discovery procedure can be further streamlined. Recall that for Detector and Explorer, the topology graph has to be (k + 1)-connected and (2k + 1)-connected, respectively. Thus, depending on the algorithm, each process is guaranteed to have at least 1 or k + 1 nonfaulty neighbors. Therefore, each process may delay initiating topology discovery until it gets this minimum number of distinct identities. However, through the updates, each process learns the identities of all correct processes in the system.

6.5 Other Extensions

Observe that Explorer is designed to disseminate the information about the complete topology to all processes in the system. However, it may be desirable to just establish the routes from all processes in the system to one or to a fixed number of distinguished ones. To accomplish this, Explorer needs to be modified as follows. No neighborhood information is propagated. Instead of the visited set, each message carries the propagation path of the message. That is, the order of the relays is significant.

Only the distinguished processes initiate the message propagation. The other processes only relay the messages. Just as in the original Explorer, a process confirms a path to another process only if it receives 2k + 1 internally process-disjoint paths from the source or from other confirming processes. Again, like in Explorer, such a process rebroadcasts the message, but empties the propagation path. In the outcome of this program, for every distinguished process, each nonfaulty process will contain paths to at least 2k + 1 processes that lead to this distinguished process. Out of these paths, at least k + 1 ultimately lead to the distinguished process.
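The confirmation test on propagation paths can be illustrated with a brute-force check. This is a sketch under assumptions, not the authors' procedure: the function names are hypothetical, each received path is represented as the ordered sequence of relay identifiers with the source and the receiver excluded, and the search over combinations is exponential and only suitable for small sets of paths.

```python
from itertools import combinations


def internally_disjoint(paths):
    """True if no relay identifier appears on two different paths."""
    seen = set()
    for path in paths:
        relays = set(path)
        if seen & relays:
            return False
        seen |= relays
    return True


def has_disjoint_paths(received, needed):
    """True if `received` contains `needed` pairwise internally disjoint paths."""
    unique = {tuple(path) for path in received}  # drop repeated paths
    return any(
        internally_disjoint(choice) for choice in combinations(unique, needed)
    )
```

For k faults, a process in this sketch would confirm once `has_disjoint_paths(received, 2 * k + 1)` holds, and would then rebroadcast with an empty propagation path.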

In Explorer, for each process, the propagation of its neighborhood information is independent of the other neighborhoods. Thus, instead of topology, Explorer can be used for efficient fault-tolerant propagation of arbitrary information from the processes to the rest of the network.

7 CONCLUSION AND OPEN PROBLEMS

In this paper, we explored an unorthodox approach to Byzantine-robust routing. It relies on topology itself rather than on cryptographic primitives. It is best suited where the more conventional cryptography-based solution is infeasible due to lack of computing, storage, and transmission resources, such as on motes [9]. However, our approach may be an attractive alternative even if the platform technically allows cryptographic operation. In topology-based Byzantine tolerance, the fault recovery is based on route-based redundancy of information propagation. If the system allows such redundancy of routes, then our approach promises significant computation savings since no cryptographic operations are required.

Note that our approach requires that the network be designed to be dense enough to tolerate the acceptable number of faults. Alternatively, the connectivity or density of the network dictates the maximum number of faults it is able to tolerate.

In conclusion, we would like to outline several interesting research directions that this paper opens. The existence of a Byzantine-robust topology discovery solution poses the question of the theoretical limits of efficiency of such programs. The obvious lower bound on message complexity can be derived as follows. Every process must transmit its neighborhood to the rest of the nodes in the system. Transmitting information to every node requires at least n messages, so the overall message complexity is at least δn². If k processes are Byzantine, they may not relay the messages of other nodes. Thus, to ensure that other nodes learn about its neighborhood, each process has to send at least k + 1 messages. Thus, the complexity of any Byzantine-robust solution to the topology discovery problem is at least in Ω(δn²k).

Observe that Explorer and Detector may not explicitly identify faulty nodes or the inconsistent view of their immediate neighborhoods. We believe that this identification can be accomplished using the technique used by Dolev [7]. In case there are 3k + 1 nonfaulty processes, they may exchange the topologies they collected to discover the inconsistencies. This approach may potentially expedite termination of Explorer at the expense of greater message complexity: if a certain Byzantine node is discovered, the other processes may ignore its further messages.

In Section 6, we briefly considered the problem of transformation of the known ports model to the oral record [12] model in the context of topology discovery. However, the problem can be posed in a more general setting. That is, what are the necessary and sufficient conditions for enabling such a transformation in an arbitrary system?

Note that our solution assumes that the failures may affect arbitrary nodes. As the system scale increases, this may lead the designers to build the system with a lot of redundant links. It would be interesting to study a problem with less potent faults, for example, where the placement of faulty nodes is randomized.

1788 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 12, DECEMBER 2009


ACKNOWLEDGMENTS

Some of the results in this paper were presented at the 13th Colloquium on Structural Information and Communication Complexity, Chester, UK, in July 2006. Mikhail Nesterenko was supported in part by DARPA contract OSU-RF#F33615-01-C-1901 and by US National Science Foundation (NSF) CAREER Award 0347485. Part of this work was done while this author was visiting Paris Sud University. Sebastien Tixeuil was supported in part by the ANR grant SOGEA from the “Securite, Systemes Embarques et Intelligence Ambiante” program, and the ANR grant SHAMAN from the VERSO program. Part of this work was done while this author was visiting Kent State University.

REFERENCES

[1] H. Attiya and J. Welch, Distributed Computing: Fundamentals, Simulations, and Advanced Topics. McGraw-Hill Publishing Company, May 1998.

[2] I.C. Avramopoulos, H. Kobayashi, R. Wang, and A. Krishnamurthy, “Highly Secure and Efficient Routing,” Proc. IEEE INFOCOM, Mar. 2004.

[3] V. Bhandari and N.H. Vaidya, “On Reliable Broadcast in a Radio Network,” Proc. 24th Ann. ACM SIGACT-SIGOPS Symp. Principles of Distributed Computing (PODC ’05), July 2005.

[4] S. Delaet, P.S. Mandal, M. Rokicki, and S. Tixeuil, “Deterministic Secure Positioning in Wireless Sensor Networks,” Proc. ACM/IEEE Int’l Conf. Distributed Computing in Sensor Networks (DCOSS ’08), June 2008.

[5] E.W. Dijkstra and C.S. Scholten, “Termination Detection for Diffusing Computations,” Information Processing Letters, vol. 11, no. 1, pp. 1-4, Aug. 1980.

[6] E.W. Dijkstra and C.S. Scholten, Predicate Calculus and Program Semantics. Springer-Verlag, 1990.

[7] D. Dolev, “The Byzantine Generals Strike Again,” J. Algorithms, vol. 3, no. 1, pp. 14-30, 1982.

[8] J.R. Douceur, “The Sybil Attack,” Proc. First Int’l Workshop Peer-to-Peer Systems (IPTPS ’02), Mar. 2002.

[9] J.L. Hill and D.E. Culler, “Mica: A Wireless Platform for Deeply Embedded Networks,” IEEE Micro, vol. 22, no. 6, pp. 12-24, Nov./Dec. 2002.

[10] C.-Y. Koo, “Broadcast in Radio Networks Tolerating Byzantine Adversarial Behavior,” Proc. 23rd Ann. ACM Symp. Principles of Distributed Computing (PODC ’04), pp. 275-282, 2004.

[11] L. Lamport and P.M. Melliar-Smith, “Byzantine Clock Synchronization,” Operating Systems Rev., vol. 20, no. 3, pp. 10-16, 1986.

[12] L. Lamport, R. Shostak, and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Programming Languages and Systems, vol. 4, no. 3, pp. 382-401, July 1982.

[13] D. Malkhi, Y. Mansour, and M.K. Reiter, “Diffusion without False Rumors: On Propagating Updates in a Byzantine Environment,” Theoretical Computer Science, vol. 299, nos. 1-3, pp. 289-306, Apr. 2003.

[14] D. Malkhi, M. Reiter, O. Rodeh, and Y. Sella, “Efficient Update Diffusion in Byzantine Environments,” Proc. 20th IEEE Symp. Reliable Distributed Systems (SRDS ’01), pp. 90-98, Oct. 2001.

[15] T. Masuzawa, “A Fault-Tolerant and Self-Stabilizing Protocol for the Topology Problem,” Proc. Second Workshop Self-Stabilizing Systems, pp. 1.1-1.15, 1995.

[16] T. Masuzawa and S. Tixeuil, “Bounding the Impact of Unbounded Attacks in Stabilization,” Proc. Int’l Conf. Stabilization, Safety, and Security (SSS), A.K. Datta and M. Gradinariu, eds., pp. 440-453, 2006.

[17] T. Masuzawa and S. Tixeuil, “Stabilizing Link-Coloration of Arbitrary Networks with Unbounded Byzantine Faults,” Int’l J. Principles and Applications of Information Science and Technology (PAIST), vol. 1, no. 1, pp. 1-13, Dec. 2007.

[18] Y. Minsky and F.B. Schneider, “Tolerating Malicious Gossip,” Distributed Computing, vol. 16, no. 1, pp. 49-68, 2003.

[19] M. Nesterenko and A. Arora, “Tolerance to Unbounded Byzantine Faults,” Proc. 21st IEEE Symp. Reliable Distributed Systems, pp. 22-29, 2002.

[20] A. Pelc and D. Peleg, “Broadcasting with Locally Bounded Byzantine Faults,” Information Processing Letters, vol. 93, no. 3, pp. 109-115, 2005.

[21] A. Perrig, J. Stankovic, and D. Wagner, “Security in Wireless Sensor Networks,” Comm. ACM, vol. 47, no. 6, pp. 53-57, June 2004.

[22] Y. Sakurai, F. Ooshita, and T. Masuzawa, “A Self-Stabilizing Link-Coloring Protocol Resilient to Byzantine Faults in Tree Networks,” Proc. Int’l Conf. Principles of Distributed Systems (OPODIS ’04), Dec. 2004.

[23] A. Vora, M. Nesterenko, S. Tixeuil, and S. Delaet, “Universe Detectors for Sybil Defense in Ad Hoc Wireless Networks,” Proc. Int’l Conf. Stabilization, Safety, and Security (SSS ’08), Nov. 2008.

[24] J. Yellen and J.L. Gross, Graph Theory & Its Applications. CRC Press, 1998.

Mikhail Nesterenko received the PhD degree in 1998 from Kansas State University. Presently he is an associate professor at Kent State University. He is interested in distributed algorithms.

Sebastien Tixeuil received the PhD degree in 2000 from the Universite Paris Sud-XI. He is now a professor at the Universite Pierre & Marie Curie, Paris 6. His research interests include fault and attack tolerance in networks and distributed systems.
