
The Power of Reconfiguration

Y. Ben-Asher, D. Peleg, R. Ramaswami, A. Schuster

April 5, 1998

Abstract

This paper concerns the computational aspects of the reconfigurable network model. The computational power of the model is investigated under several network topologies and assuming several variants of the model. In particular, it is shown that there are reconfigurable machines based on simple network topologies that are capable of solving large classes of problems in constant time. These classes depend on the kinds of switches assumed for the network nodes. Reconfigurable networks are also compared with various other models of parallel computation, like PRAM's and Branching Programs.

Part of this work is to be presented at the 18th International Colloquium on Automata, Languages, and Programming (ICALP), July 1991, Madrid.

Y. Ben-Asher: Department of Computer Science, The Hebrew University, Jerusalem 91904, Israel. E-mail: [email protected]. Supported by an Eshcol Fellowship.
D. Peleg: Department of Applied Mathematics and Computer Science, The Weizmann Institute, Rehovot 76100, Israel. E-mail: [email protected]. Supported in part by an Allon Fellowship, by a Bantrell Fellowship and by a Walter and Elise Haas Career Development Award.
R. Ramaswami: IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598, USA. E-mail: [email protected].
A. Schuster: Department of Computer Science, The Hebrew University, Jerusalem 91904, Israel. E-mail: [email protected]

1 Introduction

Most currently available parallel computers are based on a collection of processors and switching nodes connected by a fast interconnection network. While traditional designs are based on a network of fixed topology, the concept of a reconfigurable network (RN) has been around for a while. However, only recently did this idea become the focus of much interest, due to technological developments that have made it more viable.

The basic idea of a reconfigurable network is to enable flexible connection patterns, by allowing nodes to connect and disconnect their adjacent edges in various patterns. This yields a variety of possible topologies for the network, and enables the program to exploit this topological variety in order to speed up the computation.

Informally, a reconfigurable network operates as follows. Essentially, the edges of the network are viewed as building blocks for larger bus components. The network dynamically reconfigures itself at each time step, where an allowable configuration is a partition of the network into paths, or, in other words, a set of edge-disjoint busses. A crucial point is that the reconfiguration process is carried out locally at each processor (or switch) of the network. That is, at the beginning of each step during the execution of a program on the RN, each switch of the network fixes its local configuration by partitioning its collection of edges into any combination of pairs and singletons. Two adjacent edges grouped by a switch into the same pair are viewed as (hardware) connected, and the processor may choose to listen to any incoming or passing message.

Reconfigurable networks have been proposed and built in the past. The CHiP project [Sn82] focused on the construction of a globally controlled reconfigurable mesh. The decision on the next configuration is taken by a dedicated front-end processor, and is left unchanged throughout the entire computation. The CHiP machine is capable of emulating any fixed-connection architecture taken from a predefined (constant-size) collection of architectures. Our approach is different in two ways. First, it allows the connection pattern to change in every step during the computation. Secondly, the network is considerably more flexible, and allows a much larger number of possible configurations. These differences are clearly reflected by our results, implying that the real power of RN's comes from these two additional features.

Several papers have considered the general RN model as presented above. Rothstein et al. [MR79, Ro76, RD79] investigated the Bus Automata and its computational power. Nakatani [Na87] considered comparison-based operations like merging, sorting and selection on reconfigurable arrays. Miller, Stout, Reisis and Kumar [MPRS88a] and Reisis and Kumar [RK87] considered parallel computations and data movement operations on the reconfigurable mesh. They showed that many CRCW PRAM algorithms are also suited to this model and that the pyramid and the mesh of trees architectures may be simulated by it. Wang and Chen [WCL90, WC90] present several constant-time algorithms, and Li and Maresca [LM89a, LM89b, ML89] show that the model is suited for VLSI implementation and for image processing. Recently two of us developed a new method for reconfigurable network analysis [BS90c]. The method consists of the novel efficiency measure of bus usage, which is shown to trade off against computation time.

The basic assumption concerning the behavior of the reconfigurable model (and indeed, of any other bus model) is that in any configuration, the time it takes to transmit along any bus is constant, regardless of the bus length. This assumption is somewhat unrealistic on currently realizable parallel architectures, and partially explains why the RN model has not gained wide acceptance. With traditional semiconductor technology, the conductance and capacitance of wires and transistors imply a constant lower bound on the switching rate of a single switch, even when it is already set to the right configuration.

Despite the above discouraging discussion, there already exist experimental and commercial reconfigurable chips having thousands of switches and processors. Suggested implementations are both electronic [X86, MKS89, GK89, LM89b] and optical [TCS89, BS90a]. This may be explained by the following argument: an important property of the reconfigurable networks model is the lack of diameter considerations. This advantage calls for underlying network topologies which are simple and easy to build, e.g. meshes, rectangles and cubes (three-dimensional linear arrays). Simple configurations, in turn, imply short wires and simple wiring (typically, machines based on these topologies can be embedded so that every edge has wiring length O(1)),
which shortens the cycle of the machine and makes the fixed-bus transmission assumption acceptable, even for massively parallel machines.

Under the above assumptions, reconfigurable networks outperform PRAM's in the computation of several natural problems. For example, the simplest connected reconfigurable topology, namely the path, is capable of computing the OR of N Boolean inputs in a single step, whereas the EREW PRAM requires Ω(log N) steps (it is assumed that the number of processors in both machines is N, and the input bits are given one at each processor). Another example concerns the computation of parity (or summation mod 2). The parity of N^{1/2} inputs is computed in O(1) steps by an N^{1/2} × N^{1/2} reconfigurable mesh (see Section 3), compared to the lower bound of Ω(log N / log log N) steps for a CRCW PRAM having any polynomial number of processors [BH87]. In general, RN's seem to run log N times faster than EREW PRAM's and log N / log log N times faster than CRCW PRAM's. In Section 5 we establish this gap more generally, by introducing simulations of PRAM's by RN's and vice versa.

Another model that should be compared with the RN model is the (fixed-topology) mesh with multiple busses [A86, BP91, Bo84, JS79, R86, S83, S86]. This model is based on a regular mesh augmented with a collection of busses providing broadcast facilities along every row and every column. It is important to note that multiple-bus models use the same assumptions as discussed above for RN's. Namely, multiple-bus models assume both broadcast capability and that a message may travel along any bus in a single step, where busses are of unbounded length. In spite of this fact, the gap between the two models is quite large. As an example, the above constant-time N-input parity algorithm using N^2 switches implies also an O(log log N)-time N-input parity algorithm on a reconfigurable N^{1/2} × N^{1/2} mesh [MPRS87]. In Section 4 it is shown how to compute parity in constant time using a simple reconfigurable topology with only 4N switches. These results should be compared with the Ω(N^{1/6}) steps required to compute this function on the square mesh with multiple busses [PR87], and the Ω(N^{1/8}) steps needed for computing parity on any rectangular N-processor mesh with multiple busses [BP91].

It is particularly important to compare reconfigurable networks with ordinary (fixed-topology) networks, since the latter are frequently suggested as a "realizable" model of parallel computation, and are in fact the basis for most currently available machines. Obviously, RN's are superior to fixed-topology networks as well; a fixed-topology network is easily simulated by an RN with the same underlying topology with no time loss, but various problems solvable in constant time by RN's require at least Ω(D) time on fixed-topology D-diameter networks.
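To make the earlier OR example concrete, here is a minimal simulation of a single step on the path topology (our own sketch, not code from the paper; the function name is invented). Every processor holding a 1 speaks on the single bus formed by the path; even a collision, the Error state, tells every reader that some input is 1:

```python
def or_on_path_bus(bits):
    """One RN step on a path whose edges form a single bus.

    Every processor with input 1 becomes a speaker; all processors read.
    An Idle bus means OR = 0, while both the Speak and the (detectable)
    Error state mean OR = 1, so the OR is known after a single step.
    """
    speakers = [i for i, b in enumerate(bits) if b == 1]
    if not speakers:
        state = "Idle"      # no speaker: every reader concludes OR = 0
    elif len(speakers) == 1:
        state = "Speak"     # one speaker: the message arrives intact
    else:
        state = "Error"     # collision, still detectable by all readers
    return 0 if state == "Idle" else 1

assert or_on_path_bus([0, 0, 0, 0]) == 0
assert or_on_path_bus([0, 1, 0, 1]) == 1
```

The point of the sketch is that the answer is extracted from the *state* of the bus, not from the message contents, which is why one step suffices.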
In contrast, when comparing various popular fixed network topologies with the basic RN topologies considered here, it turns out that the two models have similar embedding parameters. Consider, for example, some well-known multistage interconnection networks such as the butterfly or the shuffle-exchange (cf. [Ul84]). VLSI chip embeddings of such O(N log N)-node networks require area A = Ω(N^2) [Ul84]. Thus the average wire length is Θ(N / log N). In contrast, the underlying RN topologies studied here, such as the mesh, are simple and short-wired, so that a bus consisting of k connected edges is of length O(k). For instance, in Section 3 we present an area-optimal constant-time sorting algorithm on an RN, with maximum bus length Θ(N). In summary, while the basic reconfiguration assumption is not that far from its "fixed-connection" counterpart, it makes the model much faster.

In this paper we investigate the theoretical limits of the computational power of constant-degree RN's with a polynomial number of switches. We focus mainly on operations requiring constant time. The rest of this paper is organized as follows. In Section 2 we give a precise definition of the RN model. Section 3 presents some basic tools, exemplified via the design of efficient algorithms for the problems of addition and sorting. In particular, a constant-time, optimal-area sorting network is constructed. In Section 4, the relationships between RN's and Branching Programs are explored in some detail. Configurations taken by RN's are viewed as paths that are chosen for the flow of "signals" in a branching program. It is shown that fairly simple constructions exist for computing all functions in NC^1 by an RN in constant time. Section 5 investigates the relationships between the RN and PRAM models. It is shown that an EREW PRAM may simulate any RN in logarithmic time, thus bounding the power of RN's from above. In Section 6 it is shown that by introducing a different kind of switches, the power of the model is enhanced so that every problem in PTIME can be answered in constant time by a reconfigurable mesh. Section 7 extends the guided-optics implementation suggested in [BS90a], giving guidelines for how massively parallel reconfigurable architectures may actually be built.

2 The Reconfigurable Network model

A reconfigurable network (RN) operates in the single-instruction multiple-data (SIMD) mode. That is, the processors residing at the nodes of the network perform the same program synchronously, taking local decisions and calculations according to the input and locally stored data. Initially, each input bit (or item) is available at a single node of the network.

A single node of the network may consist of a computing unit, a small buffer and a switch with reconnection capability. The buffer holds either an input item or something that was previously read from approaching busses.
Sometimes the sole objective of the computing unit is to decide the next state of the switch according to the data stored at the buffer. In the sequel, we interchange the notions of switch, processor and network node.

A single time step of an RN computation is composed of the following substeps.

Substep 1: The network selects a configuration H of the busses, and reconfigures itself to H. This is done by local decisions taken at each switch individually.

Substep 2: One or more of the processors connected by a bus transmit a message on the bus. These processors are called the speakers of the bus. The message length is bounded by the bandwidth parameter of the network.

Substep 3: Some of the processors connected by the bus read the message transmitted on the bus by the speaker(s). These processors are referred to as the readers of the bus.

Substep 4: A constant-time local computation is taken by every processor.

At each time step, a bus may take one of the following three states.

- Idle: no processor transmits,
- Speak: there is a single speaker,
- Error: there is more than one speaker.

An Error state, reflecting a collision of several speakers, is detectable by all processors connected by the corresponding bus, but the messages are assumed to be destroyed.

The reconfigurable mesh is the underlying topology considered by most of the literature. The N × N reconfigurable mesh is composed of an array of N columns and N rows of switches, with edges connecting each switch to its four neighbors (or fewer, for borderline switches). We refer to the switch at the ith row and the jth column as s_{i,j}. Similarly, we refer to the ith row as s_{i,*} and to the ith column as s_{*,i}.

There are 10 local configurations that may be taken by a switch of the mesh. These configurations are depicted in Figure 1. For convenience, we envision the mesh as embedded in the plane so that row 1 is at the bottom, and column 1 is to the left.

In order to simplify the presentation of our algorithms, we describe the global configurations taken by the network and the actions performed by the processors in a non-formal way. For example, by saying that "each processor of the bottom row transmits (some message) on its column" we mean the following. First, set the global configuration so that all columns are connected, forming vertical busses, and processors at different columns are disconnected. This is done by using the U-D local configuration at all switches. Then, let s_{1,i}, the ith processor of the bottom row, be the speaker of the bus created on column i. The readers are all the rest of the processors of that column.

There are several efficiency measures one may consider for reconfigurable networks.

Figure 1: Local configurations allowable for a switch of a mesh. [The ten settings partition the four edge ends L, R, U, D into connected pairs and singletons: {L-R, U-D}, {L-D, R-U}, {L-U, R-D}, the all-disconnected setting, U-D, L-D, R-U, L-U, R-D, and L-R.]

Time: The time complexity of an RN computation is the number of parallel steps required to complete the computation.

Cycle: The length of a bus is the number of edges it contains. The basic RN model assumption is that transmission along a bus of any length takes a single time unit. However, one cannot ignore the implications of the maximal bus length on the actual performance of implementations of the model, namely, the cycle time of an RN machine. We attempt to capture this notion by formulating the following parameter: the cycle of an RN computation is the maximal number of edges participating in any single bus during that computation.

Size: The size of an RN is the number of switches it consists of.

Area: The grid model of VLSI chips [Th80] applies to RN's, along with the area-time tradeoffs that are derived for this model. The area of an RN is the area of its layout on a 2-dimensional grid (see Section 3).

The speed (or cycle length), size and efficiency (or computation time) of an RN algorithm seem to be the important parameters for evaluating RN implementations. Moreover, we shall later see that the number of switches, or the size of the network, can be traded for computation time.
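Before moving on, the step structure above can be made concrete with a toy simulation (entirely our own sketch; the function names and data layout are invented). Substep 1 is modelled by merging the edges that some switch pairs together into edge-disjoint busses, and the Idle/Speak/Error rule classifies Substeps 2-3:

```python
from collections import defaultdict

def buses_from_pairings(edges, pairing):
    """Substep 1: derive edge-disjoint busses from local switch settings.

    `edges` is a list of undirected edges (u, v); `pairing[node]` is a list
    of edge-index pairs that the switch at `node` connects.  Two edges lie
    on the same bus iff some switch pairs them, so busses are the connected
    components of edges under the pairing relation (via union-find).
    """
    parent = list(range(len(edges)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for node, pairs in pairing.items():
        for e1, e2 in pairs:
            parent[find(e1)] = find(e2)
    buses = defaultdict(set)
    for i, (u, v) in enumerate(edges):
        buses[find(i)].update((u, v))       # nodes reachable on this bus
    return list(buses.values())

def bus_state(speakers_on_bus):
    """Substeps 2-3: classify the bus as Idle, Speak, or Error."""
    n = len(speakers_on_bus)
    return "Idle" if n == 0 else ("Speak" if n == 1 else "Error")

# A 4-node path a-b-c-d; switches b and c each pair their two edges,
# so all three edges fuse into one bus spanning every node.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
pairing = {"b": [(0, 1)], "c": [(1, 2)]}
bus = buses_from_pairings(edges, pairing)[0]
assert bus == {"a", "b", "c", "d"}
assert bus_state(["a"]) == "Speak"
```

The union-find step is exactly the "partition into pairs and singletons" of the model: a switch that leaves two adjacent edges unpaired leaves them on different busses.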

3 Basic Algorithms

In this section we introduce some simple computational paradigms and tools for the reconfigurable network model. These serve as an illustration of the model and its capabilities, and also provide the necessary basic components for our later constructions. In particular, we look at the problems of addition and sorting.

3.1 Addition

Let us next consider the problem of efficiently adding up N numbers of N bits each. Wang et al. give an algorithm for this operation on an N^2 × N^2 reconfigurable mesh [WCL91]. Here we take a different approach. The operation is carried out by algorithm Add on a reconfigurable three-dimensional mesh of N^3 switches s_{h,i,j}, 1 ≤ h, i, j ≤ N, in which each switch is connected to its (up to 6) immediate neighbors. In fact, the topology that is actually needed, termed the mesh of meshes, is simpler than that of the cube. It consists of a basic mesh, consisting of the processors s_{1,*,*}, with an additional individual reconfigurable mesh attached to each of the columns of the basic mesh. These additional meshes are termed binary counting meshes.

The algorithm Add makes use of a procedure for solving the subproblem of summing up N bits, defined as follows. The inputs to the bit summation operation are N bits b_i ∈ {0, 1}, 1 ≤ i ≤ N, initially stored each at a different node of an RN. The problem is to compute the sum of these inputs.

An algorithm Bit Sum for performing this operation was proposed in [MPRS87, MPRS88b], and serves as an instructive example of a procedure that uses the power of switch setting rather than relying on arithmetic operations. This algorithm is described next.

Algorithm Bit Sum:

Input: Each processor s_{1,i} at the bottom row of the mesh holds an input b_i.

The algorithm operates by performing the following two steps.

1. Each switch s_{1,i} on the bottom row transmits its input b_i on its column, and all the switches of column i read it.

2. The switches of the ith column fix their local configuration according to the input b_i as follows. If b_i = 0, the configuration taken is U-D, R-L. If b_i = 1, the switch connects as L-U, D-R. (See Figure 1.)

Now, a "signal" message is sent from the leftmost column, crossing the mesh from left to right. It starts at the leftmost node of one of the two bottom rows; specifically, at s_{1,1} if b_1 = 0, or at s_{2,1} if b_1 = 1. The signal is finally detected by processor s_{j,N} in the rightmost column, such that j = Σ_i b_i.

Note that the content of the signal message is immaterial, and its sole purpose is to "mark" the termination row j at the rightmost column. Essentially, the signal "jumps" one row upward whenever crossing a column i with input b_i = 1, thus reaching the rightmost column at the correct row.

Proposition 3.1 [MPRS87, MPRS88b] Bit summation can be carried out on an N × N reconfigurable mesh in a constant number of steps.

Let us remark that in fact, the procedure Bit Sum does not use the entire N × N mesh, but only a "triangle" attached to the bottom row. This observation may serve to reduce the number of switches used in the network roughly by a factor of 2.

The bit summing procedure demonstrates how the capability of dynamically setting the switches, resulting in controllable destinations for a signal, may speed up the computation of an RN. In Section 4 we generalize this method and show how to use it in order to compute arbitrary Boolean functions in constant time, using special RN constructions.

The sole use of the binary counting meshes in the addition algorithm Add is in applying algorithm Bit Sum in order to sum up bits stored at the columns of the basic mesh. We next present algorithm Add.

Algorithm Add:

Assumption: It is assumed that the bandwidth, i.e., the maximum bit-length of a correctly transferred message in a single step, is at least N. Thus, all data movements in the following algorithm may be performed in a single time unit, by transmitting a single N-bit number on a bus.

Input: The N input numbers x_i, 1 ≤ i ≤ N, reside at the leftmost column of the basic mesh, i.e., at the nodes of s_{1,*,1}. The desired answer is X = Σ_i x_i.

The algorithm operates in phases as follows.
At the first step, the nodes s_{1,i,1} of the leftmost basic column transmit their numbers on the basic rows, so that the ith number x_i is read by all the processors of the ith basic row, s_{1,i,*}. Next, the numbers are locally "decomposed" into bits, and the jth processor at the ith basic row, s_{1,i,j}, fetches the jth bit of the number x_i. Now each basic column performs bit summation of the fetched bits, using algorithm Bit Sum on the binary counting mesh attached to it. The result of each such summation is a (log N)-bit number, and overall we have N numbers c_1, ..., c_N of log N bits each, stored at the bottom basic row s_{1,1,*}. These numbers should be interpreted as the coefficients of powers of 2, where the exponent is the corresponding column number, such that their sum yields X. In other words, the numbers stored at the bottom basic row satisfy

Σ_{1≤i≤N} c_i 2^i = X.

These N numbers c_i are next moved in a single step to the leftmost basic column s_{1,*,1}, and the process described above is repeated to obtain N + log N numbers of log log N bits. Again, their sum as coefficients of the corresponding powers of 2 is X.

This process is iterated log* N times. Eventually, the leftmost basic column stores N numbers, each consisting of a single bit. These bits compose the desired result.

Two problematic points complicate the algorithm somewhat. The first is that the fetched bits correspond to appropriate powers of 2. Since at iterations following the first, each number itself is a coefficient of a different power of 2, this must be taken into account. Essentially, the jth node at the ith basic row, s_{1,i,j}, fetches the (j - i)th bit at the kth round if

log^{(k)} N > j - i ≥ 0,

where log^{(k)} N denotes log applied k times, log ... log N.

The second point is that after the first iteration we actually need N + log N columns of the basic mesh to sum all coefficients of powers of 2. Similarly, the third iteration requires N + log N + log log N basic columns, and N + log N basic rows, etc. A second summation phase at each iteration may solve this problem. In the second phase, coefficients of powers of 2 that are higher than N are summed up.

The algorithm described above establishes the following:

Proposition 3.2 The sum of N numbers of N bits each can be computed on an N × N × N mesh of meshes within O(log* N) steps. The result is stored at an array of N switches, each bit at a node.

Remark 1: In Section 4.3 we prove that N numbers of N bits can be summed (by an RN of a different topology) in constant time. In that algorithm, too, each bit of the result eventually resides at a different switch. The size and cycle of the network, however, might increase significantly.
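As an aside, the signal-steering idea behind algorithm Bit Sum is easy to sanity-check in a few lines (our own sketch; it abstracts the mesh away and tracks only the row on which the signal travels, as described in Section 3.1):

```python
def bit_sum_signal(bits):
    """Follow the Bit Sum signal across the columns, left to right.

    Column i is configured from b_i alone: with b_i = 0 the horizontal
    bus passes straight through (the U-D, R-L setting); with b_i = 1 the
    signal is steered one row upward (the L-U, D-R setting of Figure 1).
    Rows here are counted relative to the entry row, so the exit row at
    the rightmost column equals sum(bits).
    """
    row = 0                      # signal enters at the bottom-left corner
    for b in bits:               # cross columns left to right
        if b == 1:
            row += 1             # the signal "jumps" one row upward
    return row                   # exit row = number of 1-inputs

for trial in ([0, 0, 0], [1, 1, 1, 0, 1], [1]):
    assert bit_sum_signal(trial) == sum(trial)
```

No switch ever performs arithmetic: the sum is read off from *where* the signal exits, which is why the whole procedure fits in a constant number of RN steps.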

Remark 2: The bandwidth assumption, given at the beginning of the algorithm, is actually superfluous and is given for the sake of clarity of description. More precisely, it is possible to implement the same algorithm using single-bit messages. This corresponds to the "bit model" (in contrast to the "word model", cf. [Lei84]). The number of switches remains O(N^3), asymptotically the same as the mesh of meshes. However, the binary counting meshes attached to the basic mesh columns are replaced by a somewhat different construction, similar to those presented in Section 4. The requirement is that the input bits of the different input numbers are given initially at different locations, one at a switch.

3.2 Sorting

Sorting N keys on the N × N × N reconfigurable cube can be completed in a constant number of steps. The algorithm was given independently in [WCL90, BS90b]. It actually uses the same subgraph of the three-dimensional array that was used by the addition algorithm Add of Section 3.1, namely the mesh of meshes. Intuitively, the constant-time bit summation algorithm Bit Sum yields constant-time ranking. This enables us to reduce sorting to the problem of permutation routing, defined as follows.

Permutation routing: Initially, N messages (or packets) reside at N particular nodes of the network, one at each node. Let W denote the set of message-holding nodes. Each message is destined to one of the message-holding nodes in W, and each of the nodes in W is the destination of exactly one message. In other words, the routing instance is defined by a permutation of the nodes of W.

The following algorithm, Perm Route, serves as a basic building block in several of the algorithms studied later on. In fact, this algorithm is used implicitly as a component in several other algorithms in the literature [MPRS87, MPRS88a, WCL90].

Algorithm Perm Route:

Input: It is assumed that the message-holding nodes are the nodes of the bottom row of the mesh, s_{1,*}. More precisely, at the beginning of the computation each processor s_{1,i} at the bottom row holds a packet p_i with destination d_i, and is also the destination of exactly one packet.

The algorithm operates by performing the following three steps.

1. The network is reconfigured to create busses along the columns, and each of the processors s_{1,i} of the bottom row transmits its packet p_i on its column i. The packet is read by all the nodes on the column. However, it is stored only by the node s_{d_i,i}, i.e., the node whose row number fits the packet's destination.

2. The network is reconfigured along the rows, and each node s_{d_i,i} transmits the packet p_i held by it on its row d_i, so that it is read by the corresponding node at the leftmost column, s_{d_i,1}. Since the destination numbers d_i, 1 ≤ i ≤ N, are all distinct, each node of the leftmost column now stores exactly one packet.

3. The leftmost column is "copied" onto the bottom row, by shipping the packet at s_{d_i,1} to s_{1,d_i} using the natural configuration.

Lemma 3.1 Permutation routing of N packets can be performed on an N × N reconfigurable mesh in three steps.

Corollary 3.1 [WCL90, BS90b] Sorting N elements can be performed on an N × N × N reconfigurable mesh of meshes in O(1) steps.

One important parameter that is ignored in the above construction is the number of switches, naturally translating also into VLSI area. We now present a more area-efficient and size-efficient solution. The model considered in this section is the grid model of VLSI chip layout (see [Th80]). This framework models a chip as a grid with evenly spaced vertical and horizontal tracks. Nodes of the network are placed at the intersections of these tracks. The area A of a chip is defined to be the number of grid points it contains. The grid is allowed to have a constant "thickness," or height. The basic requirement governing the embedding of a network in the chip is that two network edges crossing each other at some point of the grid must be embedded at different "levels" (or heights) of the chip. This may require a path of the network to occasionally "switch levels" in the chip. Switching levels is allowed only at the grid points.

The grid model is appropriate for modeling RN's, since it captures an inherent property, namely, the required amount of hardware, regardless of the underlying graph, wire length and switch operation.
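Returning to the routing procedure for a moment, the three steps of Perm Route can be replayed directly on a small array (a sketch with invented names; rows are 0-indexed here, and each step corresponds to one configure/transmit/read cycle of the model):

```python
def perm_route(packets, dest):
    """Route packet i (held by bottom-row node i) to bottom-row node dest[i].

    Step 1: column busses -- node (0, i) speaks p_i; only (d_i, i) stores it.
    Step 2: row busses    -- node (d_i, i) speaks; node (d_i, 0) stores it.
    Step 3: the leftmost column is copied back onto the bottom row.
    Each step is a single RN time unit, so the routing takes three steps.
    `dest` must be a permutation of 0..N-1, as in the problem definition.
    """
    n = len(packets)
    # Step 1: after the column broadcast, cell (d_i, i) holds packet i.
    grid = {(dest[i], i): packets[i] for i in range(n)}
    # Step 2: row broadcasts move each packet to the leftmost column;
    # the destinations are distinct, so no row carries two speakers.
    left_col = {r: p for (r, _c), p in grid.items()}
    # Step 3: copy the leftmost column onto the bottom row.
    return [left_col[r] for r in range(n)]

assert perm_route(["a", "b", "c", "d"], [2, 0, 3, 1]) == ["b", "d", "a", "c"]
```

The distinctness of the d_i is what keeps every bus in the Speak state: a non-permutation instance would put two speakers on one row bus and trigger an Error.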
Several lower bounds on the area-time product required for various operations (including sorting) come from bounding from below a quantity referred to as the information crossing of the operation, which is a purely information-theoretic notion. As such, it is not dependent on the configuration of the underlying graph.

The mesh of meshes has N^3 switches. Thus, its layout on a chip based on an X × Y × Z grid, where Z is constant, is of area X × Y = Ω(N^3). A possible layout of the mesh of meshes is depicted in Figure 2.

Figure 2: N^3-area embedding of the mesh of meshes.

The area is X × Y × Z = 2N^3 with X = N, Y = N^2 and Z = 2. The construction consists of the N basic columns of the basic mesh placed in a row, one after another, forming an array of length N^2 (these are the boldface horizontal lines in Fig. 2). On one side of this length-N^2 array (the upper side in Fig. 2) we attach a binary counting mesh to each of the columns. On the other side we use two layers, for connecting each basic column to the basic column on its left and to the basic column on its right.

The tradeoff for sorting is AT^2 ≥ N^2 (using the word model, see e.g. [Lei84]). Thus, for constant-time sorting, the chip area must be at least A = X × Y = Ω(N^2). Leighton [Lei84] showed that the bound is achievable with log N ≤ T ≤ √(N log N), on a regular network. We will see that it is attainable also with T = O(1) and size N^{1+ε} on an RN. For the construction we recursively use Leighton's Columnsort algorithm [Lei84], terminating the recursion with a single application of Ranksort.

In the following description we ignore minor details (such as rounding) for the sake of clarity; for example, when writing r = 2N^{2/3}, it is assumed that N is chosen so that r is an integer.

Let us briefly review the Columnsort algorithm. Think of the keys as items arranged in an r × s matrix, where s divides r and r ≥ 2s^2. The sorting algorithm takes eight steps. In steps 1, 3, 5 and 7, each of the columns is sorted in ascending order from top to bottom. In steps 2, 4, 6 and 8, the items are permuted in the following ways: step 2 involves a transpose operation that permutes the matrix elements from column-major order to row-major order. Step 4 is the inverse of the transpose permutation.
Step 6 involves a cyclic shift of all the matrix elements by r/2 positions along the row-major order; the last r/2 items get shifted into the first r/2 positions of the next column, which was previously empty and is completed with r/2 dummy items valued +∞; the first r/2 positions in the first column are filled with dummy items valued -∞. Step 8 is the inverse of this permutation.

Figure 3: An N^{7/3}-area embedding of the constant-time Column/Rank sorting network.

In order to complete sorting in constant time, we use Columnsort, choosing s = (1/2)N^{1/3} and r = 2N^{2/3}. Ranksort is used to sort the columns in steps 1, 3, 5 and 7. For the permutations in steps 2, 4, 6 and 8 there are two possibilities: either use a mesh of N^2 switches (relying on Lemma 3.1) or wire the two permutations directly between the switches. Note that this direct wiring is possible since the permutations are oblivious to the order of the input keys. Both options require area A = Θ(N^2) (see [Ul84]); however, direct wiring does not require extra switches.

Let us now estimate the complexity of the resulting sorting algorithm. Ranksort takes constant time as well (specifically, 7 steps), so the entire algorithm requires 32 steps. The size of the RN (assuming direct wiring of permutations) is O(s r^3) = O(N^{7/3}). The entire sorting network can be embedded on a chip of dimensions

X × Y × Z = N × N^{4/3} × O(1),

schematically described by Figure 3. (Z = 7 is sufficient, as two layers are used for Ranksort, one layer for moving each column, so that altogether the columns of length 2N^{2/3} form a single row of length N, and two additional layers for each permutation of that row.)

We now apply the above construction with recursion depth 2, i.e., the r × s matrix is sorted using Columnsort, where each of the column-sorting steps 1, 3, 5 and 7 uses the above Column/Rank sorting network. The result is a sorting RN with Size = O(N^{1/3} · N^{(2/3)(1/3)} · N^{(2/3)(2/3)·3}) = O(N^{17/9}), X × Y × Z = N × N × O(1) (roughly Z = 11), Time = 132 and Cycle = O(N).

In general, using depth-k recursion, we have

Theorem 3.1 For each k ≥ 2 there exists a sorting RN with the following properties:

- Time = 7 · 4^k + Σ_{i=1}^{k} 4^i = O(4^k),
- Size = O(N^{1 + 2·(2/3)^k}),
- Chipsize = X × Y × Z = N × N × (3 + 4k),
- Cycle = O(N).

The obvious question remaining is: what is the minimal size that is required for an RN to sort in constant time? From information-theoretic considerations, the answer is Ω(N log N). More generally,

Proposition 3.3 Ω(N log N / T) switches are necessary for an RN sorting in T time steps.

Proof: Let R be an RN sorting in T time steps, and let S be its size. Let C denote the number of different states (local configurations) of a single switch of the RN. We assume that C is constant.

Let G_i denote the (global) configuration taken by R at the ith step of the sorting program. The computation of R is the ordered list <G_1, G_2, ..., G_T> of the T configurations assumed by R throughout the execution of the program.

The RN R outputs N! different permutations of its inputs, thus the number of possible computations is at least N!, since otherwise there would be two distinct input orders for which the output permutation is the same. Thus

(C^S)^T ≥ N!,

implying the claim.
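Written out, the counting step yields the stated bound directly; taking logarithms and using log N! = Θ(N log N):

```latex
(C^{S})^{T} \ge N!
\;\Longrightarrow\;
S\,T \log C \ge \log N! = \Theta(N \log N)
\;\Longrightarrow\;
S = \Omega\!\left(\frac{N \log N}{T}\right),
```

since C is a constant, so log C contributes only a constant factor.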

4 Branching programs and reconfigurable networks

In this section we investigate the relationships between the computational model of branching programs and reconfigurable networks.

4.1 Branching programs

Let us start by providing precise definitions concerning the branching program model. For a more thorough introduction, the reader is referred to [We88].

Branching Programs: A branching program (BP) is a directed acyclic graph (dag) consisting of a single source node (with in-degree 0), internal (computation) nodes with out-degree 2, each labeled by a Boolean variable from among {X_1, ..., X_N}, with the two out-edges (or exits) labeled by 0 and 1, and sinks (with out-degree 0) labeled 0 or 1.

Let ~x denote the input, consisting of N Boolean values, ~x = (x_1, ..., x_N), x_i in {0,1}. The computation of a branching program starts at the source, and proceeds by following a path down the dag according to the input. Specifically, upon reaching an internal node labeled by X_i, the computation proceeds through the exit labeled by the value x_i.

Let B_N denote the collection of Boolean functions on N Boolean inputs, f : {0,1}^N -> {0,1}. A branching program P computes a function f in B_N if for every input ~x, the computation of P terminates at a sink labeled by f(~x).

Size & Length: The size of a BP is the number of internal nodes in the dag. The length of a BP is the length of the longest path from its source to any of its sinks.

Layered BP: A BP is layered if for each node v, all paths from the source to v have the same length, referred to as the depth of v, d(v). The nodes of a layered BP are arranged in levels, with level j containing all nodes v that are at depth d(v) = j.

The width of level j of a layered BP, denoted w(j), is the number of nodes it contains. The width of the entire (layered) BP is the maximum width of any of its levels.

Oblivious BP: A layered BP is oblivious if at each (internal) level, all the nodes are labeled by the same Boolean variable.

Permutation BP: An oblivious BP is called a permutation BP (PBP) if for every j ≥ 0, the 0-labeled edges going from level j to level j+1 form a one-to-one correspondence (or permutation) between the nodes of these levels, and similarly for the 1-labeled edges.

4.2 Simulating BP's by RN's

A branching program P can be translated into a reconfigurable network R in the following manner. Construct the main subnetwork of the reconfigurable network R with the same topology as the (undirected) graph underlying the dag P, i.e., think of the nodes (respectively, arcs) of P as switches (resp. edges) of R. In the first step, every switch is informed of the current (input) value x_i of the variable X_i labeling it. The task of delivering the inputs to the switches is performed by an additional subnetwork, referred to as the initializing network. The switch now sets its configuration, by connecting its incoming edge(s) to the out-edge labeled x_i.

In the next step, let the source send a (meaningless) signal message. Clearly, the signal follows the precise route down the dag specified by the input. The sink nodes simply listen to detect the signal; for every input, precisely one sink succeeds, and subsequently announces the outcome.

Several problems may appear when dealing with RN equivalents of general BP's. The first has to do with the need for using "OR switches". As a switch does not "know" from which input edge the signal may arrive, all input edges need to be connected to the chosen out-edge, thus requiring the switch to perform an "OR"-like operation. Such a switch operation is not included in the original model (as presented in Section 2): only pairs of edges may be connected, so that only linear busses are formed.

A second problem concerns the input pattern. In the first step of the simulation, the appropriate inputs should be delivered to the switches.
Unfortunately, general BP's require inputs to appear at many internal nodes, with no uniform pattern, whereas the RN model allows only for a single occurrence of each input, at a single location. This may require complicated wiring of the initializing network.

Finally, the high fan-in of the program may raise a problem too. The number of inputs (in-edges) of an internal BP node may be very high, up to the order of the total number of nodes. The corresponding RN switch may thus have high fan-in, perhaps polynomial in N, the number of inputs. As already mentioned, our attention is focused in this paper on constant-degree RN's.

However, for permutation (layered, oblivious) BP's, the construction of the corresponding RN's is simplified. The layered structure implies regularity of the construction. The number of connections between two successive levels is bounded by twice the width of the BP. The permutation condition implies that the fan-in of a switch is 2. All nodes in a given level read the input variable from the same bus, and the relevant in-edge may be determined in a single step, conditioned on the input value of the previous level.

Wiring the initializing network is easy; for example, it may consist of N "input busses" on which the input bits are given at the first step, and an additional "initializing bus" for each level, which is also connected to the appropriate "input busses". Let X_i and X_j denote the input variables that label the nodes of levels l and l-1, respectively. It is necessary that a switch at level l is given both X_i and X_j, in order for it to be able to determine its state: X_i determines the out-edge to be used by the node, and X_j determines the in-edge that is to be connected to the selected out-edge. Thus, the bus initializing level l is connected to both "input busses" carrying X_i and X_j.

We summarize the above discussion in the following theorem:

Theorem 4.1 Every permutation branching program P with M nodes, L layers and width K can be realized as a constant-time reconfigurable network R of M + 2L + O(1) switches and cycle L. If P computes some Boolean function f in B_N, then R computes f in constant time.

Figures 4 and 5 describe a PBP for computing 4-input parity and its corresponding RN. Note the rather large portion of the construction devoted to the initializing network. Obviously, for any input size N, the total network size is linear in N.
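The parity PBP of Figure 4 can be simulated directly: node p of level j records the parity of the first j bits, and between successive levels the 0-labeled edges form the identity permutation while the 1-labeled edges form the swap. A minimal sketch (function names are ours, not from the paper):

```python
def parity_pbp(n):
    """Width-2 permutation BP for n-bit parity.
    Level j has nodes {0, 1}; node p means 'parity of the first j bits is p'.
    pbp[j][b] is the permutation (as a tuple) applied on input bit b."""
    identity, swap = (0, 1), (1, 0)
    # On bit 0 keep the running parity; on bit 1 flip it.
    return [{0: identity, 1: swap} for _ in range(n)]

def run_pbp(pbp, x):
    """Follow the unique source-to-sink path selected by input x."""
    node = 0                      # source: node 0 of level 0
    for level, bit in zip(pbp, x):
        node = level[bit][node]   # cross to the next level via the bit's permutation
    return node                   # the sink's label equals the parity of x

assert run_pbp(parity_pbp(4), [1, 0, 1, 1]) == 1
assert all(run_pbp(parity_pbp(4), [a, b, c, d]) == (a ^ b ^ c ^ d)
           for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
```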
A CRCW PRAM, in contrast, requires Ω(log N / log log N) time steps for parity, even when equipped with any polynomial number of processors [BH87].

Remark: In this context, it is instructive to imagine RN's as implemented by fixing a source of light at the source, acting as the speaker, implementing switches and busses as "light guides" through which the light flows, and placing light detectors at the sinks, to act as readers. A guided-optics implementation of RN's, motivated by this relation, is given in Section 7.

An obvious question is what kinds of computations can be performed by PBP's of "reasonable" width, length and size. This question is addressed by the following theorem.

Figure 4: A Branching Program for the computation of Parity of four inputs.

Figure 5: A Reconfigurable Network computing the Parity of four input bits in three initialization steps and a single computation step.

Theorem 4.2 [Ba86] Let f in B_N be computed by a depth-d circuit S over the basis {NOT, AND, OR}. Then there is a PBP G of width 5 and length 2^{2d} computing f.

Given the circuit S, Barrington's proof is constructive. It therefore implies, for instance:

Corollary 4.1 For every problem in (non-uniform) NC^1 there exists a polynomial-size constant-time RN for solving it.

Remark: Note, though, that a similar claim for NC^2 problems would be considerably stronger, since the PRAM simulation from Section 5 would then give O(log N) programs for NC^2, implying NC^2 = NC^1.

Cai and Lipton [CL89] showed that the length of the (width-5) branching program may be reduced to O(2^{1.81d}). Recently, Cleve [Cl90] generalized the construction, so that the length of the branching program may be reduced significantly, while increasing its width. His result, restricted to the Boolean case, can be stated as follows.

Theorem 4.3 [Cl90] For any fixed K and f in B_N that is computable by a depth-d circuit S over the basis {NOT, AND, OR, XOR}, there exists a PBP G of width 2^{2+K} - 1 and length O(2^{d(1+2/K)}) computing f.

This theorem yields shorter PBP constructions. However, while the improvement may shorten the cycle of the corresponding RN's to as little as the square root of the one achieved by Barrington's construction, it complicates the wiring of the permutations between successive levels, since the width of the PBP's increases.

4.3 Addition and Multiplexers

In Section 3.1 it was shown how to add N N-bit numbers in log* N steps. We reconsider this problem as an example of a complex function that is accelerated using the results of the previous section, concerning constant-time computations of Boolean functions.

Corollary 4.2 For any fixed ε > 0, the sum of N N-bit numbers can be computed by a constant-time RN of cycle O(N^{C(1+ε)}), where C < 5.13. The result is stored at an array of N switches, each bit at a node.

Proof: The N input numbers x_i, 1 ≤ i ≤ N, are initially given at N^2 different switches, a single bit at each switch. Let r_j denote the j'th bit of the sum X = Σ_{i=1}^{N} x_i. Clearly r_j in B_{N^2} for all 1 ≤ j ≤ N*, where N* = N + log N + log log N + ... + 1.

Addition of N N-bit numbers may be carried out by a circuit D of depth C log N for some constant C. For example, using log_{3/2} N applications of the "2-3 trick" (cf. [KR]), and naively taking 3 circuit layers for each such application, we get C ≤ 5.13. The circuit D has N^2 inputs and N* outputs. We may (naively, again) view each of the outputs as a different circuit D_j for the computation of a Boolean function r_j. Applying Cleve's theorem to D_j, we get that for any given ε there exist a fixed K and a PBP P_j for the computation of r_j, such that P_j is of width K and length N^{(1+ε)C}. Applying Theorem 4.1, we get for all 1 ≤ j ≤ N* an RN R_j computing r_j in constant time.

The initialization of the N* RN's may be done either by a single, global initializing network, or by N^2 additional busses handing the inputs to the "input busses" of all the RN's, as described in the previous subsection.

Remark: Corollary 4.2 should be contrasted with the log* N-step algorithm Add of Section 3.1, which uses the mesh of meshes of size N^3. Note, however, that the cycle of the mesh of meshes is O(N), whereas the cycle of the RN derived in Corollary 4.2 might exceed N^3 considerably.

Next, let us consider the issue of having all the result bits reside at the same switch. Unfortunately, Cleve's construction is not directly translatable to multi-output circuits, since we would then need as many as 2^{N*} sinks. Extending our current approach, the question is: given m bits, one at each network switch, how fast can we deduce the number composed of these bits? If m = N, then a polynomial-size RN answering this question in t steps, where t ≤ log N, would imply a t-step polynomial-size addition algorithm for N N-bit numbers.
Beame [Be86] proved that such an operation requires Ω(log N) steps on a CRCW PRAM having any polynomial number of processors. Thus, such a network would imply an efficiency gap of log N between the RN model and the CRCW PRAM model. In Section 5 we show that this is not the case, by giving an EREW PRAM simulation of RN's with a Θ(log N) slowdown factor.

If m = log N, then the task of collecting the bits is easy. The RN topology is a rectangle of m = log N columns and N = 2^m rows.
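The row-matching idea behind Algorithm Multiplex below (row j of the rectangle stays silent exactly when the m broadcast bits spell j in binary) can be sanity-checked in software; the following sketch uses our own function name and assumes the bits are given most-significant first:

```python
def multiplex(bits):
    """Collect m bits into the number they encode, mimicking the m x 2^m
    rectangle: row j transmits iff some bit of its index disagrees with the
    broadcast bits, so exactly one row stays silent."""
    m = len(bits)
    silent = [j for j in range(2 ** m)
              if all(((j >> (m - 1 - i)) & 1) == bits[i] for i in range(m))]
    assert len(silent) == 1       # exactly one silent row
    return silent[0]

assert multiplex([1, 0, 1]) == 5
```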

Algorithm Multiplex:

Input: The input bits are assumed to reside at the switches of the bottom row.

The algorithm operates by performing the following two steps.

1. The i'th switch of the bottom row transmits its input bit along the i'th column, so that all switches in that column read it.

2. Consider switch s_{j,i}, and let b_1 b_2 ... b_m be the binary representation of j. The switch s_{j,i} transmits a signal message on the j'th row if the bit that it read at the previous step differs from b_i.

Clearly, there is exactly one row of switches on which nothing is transmitted at the second step, and the id of this row corresponds to the required number.

Proposition 4.1 It is possible to collect m input bits, given each at a switch, into a single switch in a constant number of steps, using an m × 2^m reconfigurable rectangle.

4.4 Universal RN's

So far, we have considered only non-uniform families of permutation branching programs, so that each PBP had to be translated individually into an RN. Let us now consider the question of constructing a universal reconfigurable network (URN), capable of simulating an entire family of BP's.

Theorem 4.4 For every (fixed) K and L, there is a (simply wired) universal reconfigurable network of cycle O(L log K), simulating in constant time all PBP's of length L and width K.

Proof: The input to the URN is both the PBP description (e.g., the one given by Cleve's construction) and the N input values to the function. Conceptually, the URN consists of L virtual levels of K switches each. The "PBP description" component of the input is given as two permutations for each pair of adjacent levels, and the input variable labeling each level (note that this labeling may differ for different functions). Together with the appropriate input value to the function, the input may be viewed as L-1 consecutive permutations.

The URN consists of L-1 permutation networks (PN's), such as the Benes or butterfly multi-layered networks (cf. [Ul84]).
Such networks have K inputs and K outputs, and are built of 2⌈log K⌉ + 1 levels, each consisting of K/2 switches. Two levels are connected only if they are adjacent. These PN networks are capable of implementing any permutation from their inputs to their outputs. A PN switch has two inputs and two outputs, connected in either of the two one-to-one states. Thus a PN switch structurally conforms with the switch of the RN model.

The initializing network consists of L busses on which the input permutations are given. Each of these busses is read by all the switches of a single PN. We save, for each PN switch, its state (a single bit) for any required permutation. This state is selected by the switch according to the input. Once all the permutations are fixed, the computation terminates in a single step: the signal sent by the source is detected by the sinks (namely, the nodes at the last level of the URN).

Combining the above theorem with Thm. 4.3, we thus get:

Corollary 4.3 For every fixed ε > 0 and C > 0 there exists a uniform RN of cycle O(N^{C(1+ε)}) computing in constant time all B_N functions that are computable by circuits of depth C log N.

5 PRAM algorithms and reconfigurable networks

The previous sections have demonstrated the power of RN's by showing that many computational tasks can be performed in constant time using relatively simple RN topologies. In this section we consider the question of just how powerful polynomial-size RN's are compared to the common PRAM model.

Theorem 5.1 A T-step computation of an N-switch RN with E edges can be simulated by an O(E)-processor EREW PRAM in time O(T log N).

Proof: We first note that any RN program R can be simulated by a directed program R_D, in which each edge is split into two directed edges (in opposite directions), and every switch has a degree that is twice its original (undirected) degree. The configurations are defined as the natural extension of the original ones (for instance, a U-L configuration at a switch is simulated by connecting its incoming U edge to its outgoing L edge, and vice versa).
The rule for the speakers is modified accordingly. This simulation does not change the time complexity of the program.

A PRAM algorithm for simulating the directed RN program R_D can be constructed as follows. The PRAM gets as its input both the adjacency matrix of the RN and the input to the RN. Each step of the RN is simulated by the PRAM in four phases, as described below.

(1) The first phase incorporates only N of the PRAM processors, each simulating a single RN switch. This phase is dedicated to the internal computation taken by the RN switches, in which the bus splitting, speaking, reading and the (virtual) local configuration are decided.

Once a switch decides on a certain local configuration, its edges are grouped into singletons and connected pairs. Thus the global configuration of the RN can be represented by an augmented graph ~R, in which each original RN switch whose degree in the underlying network topology is d is represented by at most 2d nodes. The total number of nodes in this augmented graph ~R is thus at most 4E, and each of these nodes has degree at most two. The connected components of ~R are composed of directed paths only.

(2) Each of the nodes of ~R is represented by an EREW processor. In the second phase, the local configuration is read by all processors. If d is the highest degree of any RN switch, then this phase requires log(d-1) steps.

(3) The third phase involves moving the message from the processors representing speakers to the rest of the processors in a connected path. This is done in O(log E) steps. The processors of the EREW PRAM, standing for nodes of the global configuration graph ~R, construct a balanced spanning tree for each path using a variant of the "recursive doubling" technique of [SV82].

(4) Speakers of a bus use the tree constructed in phase (3) to broadcast messages (and detect errors).

Remark: If the RN model is further enhanced, so that any connected component of the network may become a bus (but busses are edge-disjoint), then a spanning tree algorithm (e.g., the one from [SV82]) may be used in order to find the connected components and their spanning trees in O(log N) steps, using an (N + 2E)-processor CRCW PRAM.

Corollary 5.1 Problems of input size N that are computed by a T(N)-step, polynomial-size RN have O(T(N) log N)-step EREW PRAM programs.
In particular, problems having O(log^K N)-step, polynomial-size RN's with uniformly generated underlying topologies are in (uniform) EREW(K+1).
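The "recursive doubling" used in phase (3) of the simulation above can be sketched as follows; the function name is ours, and the actual construction of [SV82] builds balanced spanning trees, but the pointer-jumping idea is the same: in each round every informed node forwards the message along its current pointer and then doubles the pointer, so after O(log n) rounds the whole path is informed.

```python
def broadcast_on_path(succ, msg):
    """Pointer jumping along disjoint directed paths.
    succ[v] is v's successor (succ[v] == v at a path's tail); msg[v] is the
    message v knows, or None. Nodes that share a path carry the same message,
    so the sequential update below is a faithful stand-in for parallel writes."""
    n = len(succ)
    for _ in range(n.bit_length()):
        nxt = list(msg)
        for v in range(n):
            if msg[v] is not None and nxt[succ[v]] is None:
                nxt[succ[v]] = msg[v]             # forward along current pointer
        succ = [succ[succ[v]] for v in range(n)]  # double the pointer
        msg = nxt
    return msg

# Path 0 -> 1 -> 2 -> 3 -> 4 with the speaker at node 0.
assert broadcast_on_path([1, 2, 3, 4, 4], ["m", None, None, None, None]) == ["m"] * 5
```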


Figure 6: CRCW PRAM simulation by a reconfigurable rectangle.

In other words, the corollary implies that problems that are "inherently sequential", i.e., that are "non-parallelizable" using traditional parallel models, maintain this property under the reconfigurable network model. In Section 6 we show that this is no longer true when the model is enhanced with different kinds of switches.

As already mentioned in the introduction, many problems requiring Ω(log N / log log N) steps on a CRCW PRAM (or Ω(log N) steps on an EREW PRAM) with a polynomial number of processors can be computed by a constant-time polynomial-size RN. The following theorem (independently given by [WC90]) shows that this is not the case for the opposite direction.

Theorem 5.2 A (priority) CRCW PRAM with P(N) processors, M(N) memory cells and T(N) time can be simulated by an O(T(N))-step, P(N)·M(N)-size RN.

Proof: An RN for simulating a given PRAM algorithm can be constructed as follows. The RN topology is shown in Figure 6. It is based on a P(N) × M(N) rectangle. The M(N) memory cells are simulated by the bottom row of the rectangle, and the P(N) processors are simulated by its leftmost column.

The algorithm uses the constant-time routing algorithm Perm Route of Section 3 and the broadcast power of the busses. Let us consider the Read and Write steps of the PRAM separately. For a Read step, each switch s(p) representing a PRAM processor p transmits the address of the element m it requests on its row.

The "intersecting switch" (i.e., the switch on p's row and m's column) forwards the request on its column to the bottom-row switch s(m) representing m, while disconnecting its up-link. At the next step, s(m) transmits in return the desired memory element on its column, where the "intersecting switch" may read it and forward it on its row, where it is read by s(p).

For a Write step, the element m to be written is transmitted by the switch s(p) simulating the writing processor p on its row. The intersecting switch transmits the element down its column, while disconnecting its up-link. This disconnection of up-links by the column switches results in priority write-conflict resolution, with the highest priority assigned to the lowest-id processor.

The problem of simulating PRAM's by reconfigurable networks was also studied in [MPRS87, MPRS88a]. The simulations proposed therein are different in nature from ours, in that they attempt to bound the size of the simulating RN as a function of the number of PRAM processors alone, at the cost of a log N slowdown. In contrast, our simulation yields constant slowdown, but requires an RN whose size depends also on the number of memory cells used by the PRAM.

6 Polynomial-time problems and non-monotonic reconfigurable networks

In this section we consider the interesting possibility of enhancing the reconfigurable network model by allowing new types of devices as possible switches of an RN. Suppose we strengthen the power of our switches by allowing some new types of operations other than simply "connect" or "disconnect". We shall show that this enhancement may result in a considerable increase in the power of the model, assuming P ≠ NC.

As in Section 4, we view the busses as "channels" in which signals flow. Signals are interpreted by switches as carrying binary information, i.e., a signal can carry either "0" or "1". A possible implementation can be based on indicating a "1" by a signal, and a "0" by the lack of a signal.
We now consider the following switch operations. Each of these operations assumes some orientation of the bus. That is, it is assumed that in each step, each switch knows in advance from which direction the message will arrive.

"NOT" switch: The "NOT" switch is used within a local configuration that has one input edge and one output edge, and its action is to complement a signal passing through it. I.e., the switch operates on the signal so that an incoming "0" signal proceeds as a "1" and vice versa. We call the RN model with "NOT" switches the non-monotonic RN.

"OR" switch: The "OR" switch has two input edges and one output edge. A signal proceeds (as a "1") through the output edge if it arrives from either input.

"Optimal Splitter": The "Optimal Splitter" has one input edge and two output edges. It duplicates any signal entering it via the input edge into two identical signals, which proceed via the two exits.

"NOT-splitter": The "NOT-splitter" is the combination of a "NOT" switch followed by an "optimal splitter," so that the complemented signal is duplicated.

"OR-splitter": The "OR-splitter" is a combination of an "OR" switch followed by an "optimal splitter."

Using the "NOT-splitter" and "OR-splitter" switches substantially strengthens the computational power of the RN model. In particular, we have the following simple simulation.

Theorem 6.1 Any circuit of depth D can be simulated by a non-monotonic RN in constant time and cycle length D.

Proof: The simulation is achieved simply by constructing an RN whose underlying topology corresponds to that of the simulated circuit. In particular, if the RN model allows polynomial cycle length, then every problem in PTIME can be simulated in this way in constant time, although the simulation is not "uniform" or universal, and each circuit needs to be simulated individually.

This theorem has immediate implications on the relationship between non-monotonic RN's and PTIME (specifically, the two classes can be shown equivalent under some appropriate assumptions). This line of investigation is pursued further in [BPS91].

As an illustrative example of the power of non-monotonic RN's, let us consider the well-studied problem of computing the Lexicographically-First Maximal Independent Set (Lex-MIS) in a given graph. This problem is often used as a canonical example of a P-complete problem, just as is the Circuit Value Problem (CVP). However, while the above theorem implies that CVP can be solved in constant time by a non-monotonic RN, that solution has a non-uniform flavor.
In contrast, in the example below we demonstrate how the Lex-MIS problem can be solved in constant time by a simple mesh-topology non-monotonic RN. Interestingly, we note that for inputs of size m = N^2 (namely, N-vertex graphs), our RN has a cycle of √m.
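As a baseline for what the mesh computes in a single signal-passing step, the greedy sequential algorithm (a node joins the set iff none of its lower-numbered neighbors has joined) can be sketched as follows; the function name is ours:

```python
def lex_mis(n, edges):
    """Greedy sequential computation of the lexicographically-first MIS
    of a graph on vertices 1..n given as a list of undirected edges."""
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    mis = set()
    for v in range(1, n + 1):    # scan vertices in increasing order
        if not any(u in mis for u in adj[v] if u < v):
            mis.add(v)           # no lower-id neighbor was selected
    return sorted(mis)

# Path 1 - 2 - 3 - 4: vertices 1 and 3 form the lexicographically-first MIS.
assert lex_mis(4, [(1, 2), (2, 3), (3, 4)]) == [1, 3]
```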

The Lexicographically-First Maximal Independent Set (Lex-MIS)

Given a graph G = (V,E) on N vertices, V = {1, ..., N}, a set of vertices U ⊆ V is independent if no two vertices in it are connected by an edge. A maximal independent set (or MIS for short) is an independent set U such that every vertex in V - U is connected by an edge to some vertex in U (so the addition of any vertex to U would result in a dependent set). Representing a set of vertices U = {v_1, ..., v_{|U|}}, v_1 < ... < v_{|U|}, by the word v_1 ... v_{|U|} induces a lexicographical ordering on these sets. The Lex-MIS of the graph is the maximal independent set that is lexicographically minimal among these sets. The Lex-MIS problem is to find this set.

The Lex-MIS problem has an interesting property which makes it seem inherently sequential: a node v occurs in the Lex-MIS if and only if its lower-id neighbors are not in it. Indeed, Cook has shown that the Lex-MIS problem is P-complete [Co85]. We shall now show that using the non-monotonic RN model (with "NOT-splitter" and "OR-splitter" switches), Lex-MIS is computable in constant time on the reconfigurable mesh.

Algorithm Lex MIS:

The algorithm is composed of a single step, in which the mesh is reconfigured and then a single signal is passed. The switch s_{i,j} of the mesh is connected according to the following cases.

i < j: the connection is fixed to "+" (L-R, U-D), with no need for switching.

i = j: the connection is fixed to a "NOT-splitter", with the input incoming from the Left (the lower-numbered columns meeting row i), and the two outputs directed Right and Up, along the directions of row i and column j.

i > j: the switch is reconfigured according to the input graph G.
The configuration depends on whether the nodes i and j are connected in G, i.e., whether (i,j) ∈ E.

(i,j) ∈ E: the node switches to an "OR-splitter", so that if a signal arrives from either of the Down or Left directions (or both), it proceeds in the Right and Up directions.

(i,j) ∉ E: the connection state is a "+" (L-R, U-D).

After the input is made available to the mesh nodes and they switch to the appropriate state, two signals are transmitted by the switch s_{1,1}, one rightward on the bottom row and one upward on the leftmost column.

Figure 7: The i'th row of the reconfigurable mesh computing Lex-MIS of the graph G. The nodes j_1, ..., j_l are the neighbors of node i in G having lower id's.

Finally, for 1 ≤ i ≤ N-1, node i of the input graph is declared to belong to the Lex-MIS if and only if a signal is detected at switch s_{i,N} of the reconfigurable mesh, coming from the Left. Node N is in the Lex-MIS if and only if no signal is detected at s_{N,N} coming from either the Down link or the Left link.

Lemma 6.1 Algorithm Lex MIS is correct.

Proof: In order to prove the claim, let us consider some row i. Its connections (see Figure 7) are a series of "OR"s with columns j_1, ..., j_l, which represent the neighbors of node i that have a lower id. These are exactly the neighbors on which the selection of i into the Lex-MIS is conditioned. To the right of these "OR" connections, there is a "NOT-splitter" at switch s_{i,i}, on the intersection of the i'th row with the i'th column. Thus, a signal reaches node s_{i,i} of the mesh from the Left if and only if it is coming from at least one of the columns j_1, ..., j_l. The switch s_{i,i} produces two signals, one to be sent rightward to reach s_{i,N} and another to continue down the i'th column; however, this happens if and only if no signal arrives from any of the columns j_1, ..., j_l at their intersection with row i. Denote as selected every node i for which a signal was detected by node s_{i,N} of the mesh. By induction, a node is selected if and only if none of its neighbors having lower id's was selected, which guarantees that the selected set is the Lex-MIS. Correctness for node N of the graph follows by the same argument.

7 Optical implementation of reconfigurable networks

In this section we discuss the possibilities and potential for an all-optical implementation of reconfigurable networks. There are two parts to the optical implementation. The first step is to do all the transmission and switching optically; i.e., the path from the source node to the destination node is all-optical, but all the processing is done electronically. The next step is to do simple processing functions optically, with the idea of eventually building an optical computer. For example, by doing simple logic functions optically, one can enhance the capabilities of RN's as discussed in Section 6. We discuss the technology available today for both steps.

The components of interest are optical switches and splitters, optical waveguides, fiber, or other interconnection material, optical amplifiers to compensate for losses, optical logic devices, and light sources and detectors.

7.1 Optical Switching and Transmission

For the switching and transmission part, the key components are fast optical switches with low latency, along with optical waveguides/fibers for transmission and light sources and detectors. For example, a mesh network would consist of a two-dimensional array of switches connected together either by optical waveguides, fibers, or free space. Assuming the busses to be unidirectional, each column requires two busses, as does each row. In order to implement the 10 states described in Figure 1, one requires a 4 × 4 switch at each node. This switch can either be built directly or as a crossbar of four 2 × 2 switches. The settings of these switches collectively determine the interconnection pattern.

Our requirements on the switches are that the switching time and the propagation delay through the switches should be small, typically subnanosecond. Titanium-diffused lithium niobate directional couplers [Al88] are the most attractive option. The coupling ratio of a 2 × 2 coupler can be varied by applying a voltage that changes the refractive index of the waveguide material.
These devices have been proposed as two-state switches, but their coupling ratios can be controlled to any desired value by setting the control voltage appropriately, and hence they can also act as splitters if required (if only splitters are desired, cheaper technologies such as fiber splitters exist, although the latency is a problem with any fiber device). The inherent switching speed is sub-picosecond; in practice, the switching speeds are limited by the control circuitry. The latency through these switches is negligible because they are only a few mm long [Ko88]. The propagation loss is commonly below 0.2 dB/cm, and the main source of loss is the interface to a fiber (if present), which is commonly 1 dB. One can potentially implement the whole mesh on optical waveguides.

A potential implementation would use optical amplifiers between switches, as proposed in [LK90]. Two types of amplifiers are commercially available today: semiconductor-laser amplifiers and doped-fiber amplifiers. Both are bidirectional devices. The latter device is usually made in fiber and is a few meters long; hence it may not be suited for this application because of the large latency. It may also be possible to dope the waveguide connecting two switches so that it acts as an amplifier itself.

As an example, consider a 1000 by 1000 mesh where the longest path goes through 1000√2 switches. Assuming millimeter-long switches and a centimeter of waveguide material between two switches, the worst-case latency through the network will be of the order of 100 nanoseconds. Even more significant is the cumulative loss when a message is to be broadcast to many other nodes through an array of couplers. Each coupler is set to tap off a small portion of the energy from the source and send the rest of the energy onwards to the following taps. Let us consider the case where one node is broadcasting a message to N other nodes through a series of couplers. It can be shown that if α is the fraction of power retained through each coupler (so that 1 - α is the excess loss per coupler), then if the coupling coefficients are set optimally, the ratio of received power to transmitted power is α^N(1-α)/(1-α^N), which becomes 1/N when α = 1, i.e., when the couplers are ideal with no excess loss. For practical values of α, the number of couplers that can be cascaded is only a few tens, but this number can be increased by two orders of magnitude by using amplifiers to compensate for the loss [LK90]. The placement of the amplifiers in the bus is addressed in [LK90]. The optimal policy, in order to maximize the signal-to-noise ratio at the receiver, is to place low-gain amplifiers every few couplers, as opposed to a high-gain stage after many couplers. Thus, in the extreme case, one can use an amplifier with a few dB of gain after every few couplers.
Such a scheme will be feasible if the amplifier can be integrated along with the waveguide connecting the two couplers.

The limitations in using these switches come from the loss when many are cascaded, the speed of the controlling electronics, and the current high cost of the technology.

The light sources used would be laser diodes or light-emitting diodes (LEDs) at 0.8, 1.3 or 1.5 microns. Cheap LEDs will suffice if the optical power requirements are small and the modulation speeds are below a few hundred Mb/s.

7.2 Optical Logic Devices

Next we discuss optical logic devices briefly. The two types of devices proposed are the self-electrooptic effect device (SEED) and the nonlinear Fabry-Perot etalon (NLFP) [Hi88]. The SEED consists of a p-i-n diode with a multiple quantum well region connected in series with a resistor. Two optical inputs are required for the SEED: a bias beam and a signal beam. When only the bias beam is present, the device is transparent, giving a '1' state at the output, but a sufficiently large incoming signal beam causes the device to absorb energy, resulting in a '0' state at the output. Thus the functionality is that of a NOR gate. Multiport SEEDs have also been demonstrated, with better tolerances on the signal inputs than for a simple SEED device. The NLFP is usually a bulk device requiring mirrors to form a resonant cavity with a nonlinear medium in between, whose refractive index can be controlled by an input beam. Depending on the input signal, the bias beam is either transmitted to an output port or reflected towards another output port. Thus, it can be configured as a NOR, NAND, AND, OR, or XOR gate. The problems with both these devices are the strict requirements on the intensity of the bias and signal beams and the limited fan-in and fan-out capabilities.
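The gate-level behavior of the SEED can be captured in a toy truth-table model: output '1' whenever only the bias beam is present, '0' whenever any signal beam is strong enough to trigger absorption. The two-signal-input form and the threshold value below are illustrative assumptions (the text describes single-signal and multiport variants); this is a behavioral sketch, not a device model.

```python
def seed_nor(signal_a, signal_b, bias=True, threshold=0.5):
    """Toy model of a SEED operated as a two-input NOR gate.

    With only the bias beam present the device is transparent and the
    output is '1'; a sufficiently large signal beam on either input makes
    the device absorb energy, giving a '0' output.  The threshold is an
    illustrative stand-in for "sufficiently large"."""
    if not bias:
        return 0  # no bias beam, hence no output light at all
    absorbed = signal_a >= threshold or signal_b >= threshold
    return 0 if absorbed else 1

# The input/output behavior matches a NOR truth table:
for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(int(a), int(b), "->", seed_nor(a, b))
```

Since NOR is functionally complete, this single behavior suffices in principle to build arbitrary logic, which is what makes the SEED attractive as an optical gate.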


References

[A86] A. Aggarwal, Optimal Bounds for Finding Maximum on Array of Processors with k Global Busses, IEEE Trans. on Computers, C-35, (1986), 62–64.
[Al88] R.C. Alferness, Waveguide electrooptic switch arrays, IEEE J. Sel. Areas Commun. 6, (1988), 1117–1130.
[SV82] Y. Shiloach and U. Vishkin, An O(log N) Parallel Connectivity Algorithm, J. of Algorithms 3, (1982), 57–67.
[Ba86] D. Barrington, Bounded-Width Polynomial-Size Branching Programs Recognize Exactly Those Languages in NC1, Proc. 18th ACM Symp. on Theory of Computing, 1986, pp. 1–5.
[Bo84] S.H. Bokhari, Finding Maximum on an Array Processor with a Global Bus, IEEE Trans. on Computers, C-33, (1984), 133–139.
[Be86] P. Beame, Limits on the Power of Concurrent-Write Parallel Machines, Proc. 18th ACM Symp. on Theory of Computing, 1986, pp. 169–176.
[BH87] P. Beame and J. Hastad, Optimal Bounds for Decision Problems on the CRCW PRAM, Proc. 19th ACM Symp. on Theory of Computing, 1987, pp. 83–93.
[BS90a] Y. Ben-Asher and A. Schuster, Optical Splitting Graphs, The 1990 Int. Topical Meeting on Optical Computing, Kobe, Japan, April 1990.
[BS90b] Y. Ben-Asher and A. Schuster, Algorithms and Optical Implementation for Reconfigurable Networks, Proc. 5th Jerusalem Conf. on Information Technology, Jerusalem, 1990, pp. 225–235.
[BS90c] Y. Ben-Asher and A. Schuster, Reconfigurable Paths and Bus Usage, Hebrew University Technical Report #90-14, January 1990.
[BPS91] Y. Ben-Asher, D. Peleg and A. Schuster, The Complexity of Reconfiguring Models, in preparation, 1991.
[BP91] A. Bar-Noy and D. Peleg, Square Meshes are not Always Optimal, IEEE Trans. on Computers 40, (1991), 196–204.

[Co85] S.A. Cook, A Taxonomy of Problems with Fast Parallel Algorithms, Inform. and Control 64, (1985), 2–22.
[Cl90] R. Cleve, Toward Optimal Simulations of Formulas by Bounded-Width Programs, Proc. 22nd ACM Symp. on Theory of Computing, May 1990, pp. 271–277.
[CL89] J. Cai and R.J. Lipton, Subquadratic Simulations of Circuits by Branching Programs, Proc. 30th IEEE Symp. on Foundations of Computer Science, 1989, pp. 568–573.
[Hi88] H.S. Hinton, Architectural considerations for photonic switching networks, IEEE J. Sel. Areas Commun. 6, (1988), 1209–1226.
[GK89] J.P. Gray and T.A. Kean, Configurable Hardware: A New Paradigm for Computation, Proc. 10th MIT VLSI Conference, 1989, pp. 279–295.
[JS79] H.F. Jordan and P.L. Sawyer, A Multimicroprocessor System for Finite Element Structural Analysis, Comput. Struct. 10, (1979), 21–29.
[KR] R.M. Karp and V. Ramachandran, A Survey of Parallel Algorithms for Shared-Memory Machines, in Handbook of Theoretical Computer Science, North-Holland, 1990.
[Ko88] S.K. Korotky and R.C. Alferness, Waveguide electrooptic devices for optical fiber communication, Chapter 11 in Optical Fiber Communications II, ed. S.E. Miller and I.P. Kaminow, Academic Press, 1988.
[Lei84] F.T. Leighton, Tight Bounds on the Complexity of Parallel Sorting, Proc. 16th Symp. on Theory of Computing, 1984, pp. 71–80.
[LK90] K. Liu and R. Ramaswami, Analysis of optical bus networks using doped fiber amplifiers, Proc. IEEE/LEOS Topical Meeting on Optical Multiaccess Networks, July 1990.
[LM89a] H. Li and M. Maresca, Polymorphic-torus architecture for computer vision, IEEE Trans. Pattern Anal. Machine Intell., Vol. 11, No. 3, pp. 233–243, March 1989.
[LM89b] H. Li and M. Maresca, Polymorphic-torus network, IEEE Trans. Comput., Vol. 38, No. 9, pp. 1345–1351, September 1989.

[ML89] M. Maresca and H. Li, Connection Autonomy in SIMD computers: A VLSI implementation, J. Parallel and Distributed Comput., Vol. 7, No. 2, pp. 302–320, April 1989.
[MR79] J.M. Moshell and J. Rothstein, Bus Automata and Immediate Languages, Information and Control 40, (1979), 88–121.
[MKS89] O. Menzilcioglu, H.T. Kung and S.W. Song, Comprehensive Evaluation of a Two-dimensional Configurable Array, Proc. 19th Symp. on Fault-Tolerant Computing, Chicago, Illinois, June 1989, pp. 93–100.
[MPRS87] R. Miller, V.K. Prasanna-Kumar, D.I. Reisis and Q.F. Stout, Parallel Computations on Reconfigurable Meshes, University of Southern California Technical Report IRIS#229, March 1987.
[MPRS88a] R. Miller, V.K. Prasanna-Kumar, D.I. Reisis and Q.F. Stout, Data Movement Operations and Applications on Reconfigurable VLSI Arrays, Proc. 1988 Intl. Conf. on Parallel Processing, Vol. I, pp. 205–208.
[MPRS88b] R. Miller, V.K. Prasanna-Kumar, D.I. Reisis and Q.F. Stout, Image Computations on Reconfigurable VLSI Arrays, Proc. 1988 Intl. Conf. on Vision and Pattern Recognition, pp. 925–930.
[Na87] T. Nakatani, Interconnections by Superposed Parallel Busses, Ph.D. dissertation, Princeton, 1987.
[PR87] V.K. Prasanna-Kumar and C.S. Raghavendra, Array Processor with Multiple Broadcasting, J. of Parallel and Distributed Computing 4, (1987), pp. 173–190.
[R86] C.S. Raghavendra, Hmesh: a VLSI Architecture for Parallel Processing, CONPAR 86, LNCS 237, Springer-Verlag, 1986, pp. 76–83.
[RK87] D. Reisis and V.K. Prasanna-Kumar, VLSI Arrays with Reconfigurable Busses, Proc. 1st Intl. Conf. on Supercomputing, 1987, pp. 732–742.
[Ro76] J. Rothstein, On the Ultimate Limitations of Parallel Processing, Proc. 1976 Int. Conf. on Parallel Processing (best paper award), pp. 206–212.
[RD79] J. Rothstein and A. Davis, Parallel Recognition of Parabolic and Conic Patterns by Bus Automata, Proc. 1979 Int. Conf. on Parallel Processing, pp. 288–297.

[Sn82] L. Snyder, Introduction to the configurable highly parallel computer, Computer 15, (1982), 47–56.
[S83] Q.F. Stout, Mesh-Connected Computers with Broadcasting, IEEE Trans. on Computers, C-32, (1983), 826–830.
[S86] Q.F. Stout, Meshes with Multiple Busses, Proc. 27th IEEE Symp. on Foundations of Computer Science, 1986, pp. 264–273.
[TCS89] X. Thibault, D. Comte and P. Siron, A Reconfigurable Optical Interconnection Network for Highly Parallel Architecture, Proc. 2nd Symp. on the Frontiers of Massively Parallel Computation, 1989.
[Th80] C.D. Thompson, A Complexity Theory for VLSI, Ph.D. thesis, 1980.
[Th83] C.D. Thompson, The VLSI Complexity of Sorting, IEEE Trans. Comput. 32, (1983), 1171–1184.
[Ul84] J.D. Ullman, Computational Aspects of VLSI, Computer Science Press, 1984.
[WC90] B.F. Wang and G.H. Chen, Two-dimensional Processor Array with a Reconfigurable Bus System is at Least as Powerful as CRCW Model, Information Processing Letters 36, No. 1, pp. 31–36, Oct. 1990.
[WC90] B.F. Wang and G.H. Chen, Constant Time Algorithms for the Transitive Closure and Some Related Graph Problems on Processor Arrays with Reconfigurable Bus Systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 4, October 1990.
[WCL90] B.F. Wang, G.H. Chen and F. Lin, Constant Time Sorting on a Processor Array with a Reconfigurable Bus-System, Information Processing Letters 34, (1990), pp. 187–192.
[WCL91] B.F. Wang, G.H. Chen and H. Li, Configurational Computation: A New Computation Method on Processor Arrays with Reconfigurable Bus Systems, Proc. 1st Workshop on Parallel Processing, Tsing-Hua Univ., Taiwan, Dec. 1990.
[We88] I. Wegener, The Complexity of Boolean Functions, John Wiley, 1988.
[X86] Xilinx Inc., The Programmable Gate Array Design Handbook, San Jose, Calif., 1986.