
JOURNAL OF SELECTED TOPICS IN QUANTUM ELECTRONICS 1

Energy-Efficient and Bandwidth-Reconfigurable Photonic Networks for High-Performance Computing (HPC) Systems

Avinash Kodi, Member, IEEE, and Ahmed Louri, Senior Member, IEEE

(Invited Paper)

Abstract—Optical interconnects are becoming ubiquitous for short-range communication within boards and racks due to higher communication bandwidth at lower power dissipation when compared to metallic interconnects. While efficient multiplexing techniques (wavelength, time and space) allow bandwidths to scale, static or predetermined resource allocation of wavelengths can be detrimental to network performance for non-uniform (adversarial) workloads. Dynamic bandwidth re-allocation based on actual traffic patterns can lead to improved network performance by utilizing idle resources. While dynamic bandwidth re-allocation (DBR) techniques can alleviate interconnection bottlenecks, power consumption also increases considerably with increases in bit rate and channel count. In this paper, we propose to improve the performance of optical interconnects using DBR techniques and simultaneously optimize the power consumption using Dynamic Power Management (DPM) techniques. DBR re-allocates idle channels to busy channels (wavelengths) to improve throughput, and DPM regulates the bit rates and supply voltages of the individual channels. A reconfigurable opto-electronic architecture and a performance-adaptive algorithm for implementing DBR and DPM are proposed in this paper. Our proposed reconfiguration algorithm achieves a significant reduction in power consumption and a considerable improvement in throughput with a marginal increase in latency for synthetic and real (Splash-2) traffic traces.

Index Terms—Performance Modeling, Power-Aware, Reconfigurable Optical Interconnects, High-Performance Computing (HPC).

I. INTRODUCTION

THE increasing bandwidth demands at higher bit rates and longer communication distances in high-performance computing (HPC) systems are constraining the performance of electrical interconnects at all levels of communication: on-chip, chip-to-chip, board-to-board and rack-to-rack [1], [2], [3], [4], [5], [6], [7], [8], [9]. This has given rise to opto-electronic networks that can support greater bandwidth through a combination of efficient multiplexing techniques (wavelength-division, time-division, and space-division). Opto-electronic interconnects provide maximum

Manuscript received March 1, 2010. This work was supported in part by the National Science Foundation grants CCR-0538945, ECCS-0725765 and CCF-0953398.

Avinash Karanth Kodi is with the School of Electrical Engineering and Computer Science, Ohio University, Athens, OH 45701, USA (e-mail: [email protected]).

Ahmed Louri is with the Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721, USA (e-mail: [email protected]).

flexibility for HPC systems by augmenting electronic processing functionalities with high-bandwidth optical communication capabilities, thereby optimizing the cost-to-performance ratio.

In an optically interconnected network, the wavelengths or channels are often statically allocated to nodes or boards using different wavelengths, fibers and time-slots [6], [10], [11], [12]. Static allocation of wavelengths in optical interconnects offers every node an equal opportunity for inter-processor communication. While static allocation improves performance for uniform or benign traffic patterns, the network congests for non-uniform or adversarial traffic patterns due to uneven resource utilization. Given the enormous bandwidth demands (in excess of terabytes per second) of future HPC systems, optical interconnects will need to be much more flexible to adapt to various application communication patterns. Therefore, dynamic re-allocation of bandwidth based on actual traffic utilization can improve performance by utilizing idle resources in the network. Prior work on dynamic reconfiguration has used active electro-optic switching elements [5], time-slot-based bandwidth re-allocation [13] and both time- and space-based bandwidth switching [14]. More recent work has used a genetic algorithm to dynamically determine possible wavelengths from any source to destination by reusing the same wavelengths [15].

While opto-electronic networks can improve performance with higher bit rates and dynamic re-allocation of bandwidth, power consumption is still a critical problem for HPC systems. As the interconnection network consumes a sizeable fraction of the system power budget, researchers have proposed several power-aware techniques to optimize power consumption for HPC systems. Dynamic power reduction techniques such as DVFS (Dynamic Voltage and Frequency Scaling) [16], [17], [18], [19] and DLS (Dynamic Link Shutdown) [20] have been suggested for electrical networks. In DVFS, the voltage and frequency of the electrical link are dynamically adjusted to different power levels according to traffic intensity to minimize power consumption. DLS, on the other hand, turns down a link when it is not used and turns it up when needed. In [17], the power-aware opto-electronic network design space is explored by regulating power consumption in response to actual network traffic; however, the efficient power regulation control policies are designed without bandwidth re-allocation. In [19], an opto-electronic network with voltage scaling from 1.2 V to 0.6 V and almost linearly scalable bandwidth from 8 Gbps to 4 Gbps has been proposed.


The motivation for designing a dynamically reconfigurable, power-aware opto-electronic network for HPC systems is twofold. First, as bandwidth demands increase, networks that can dynamically re-allocate bandwidth by adapting to shifts in network traffic can gain significant improvements in performance. Second, as spatial and temporal locality exists in inter-process communication patterns, opto-electronic power-aware networks can optimize their power consumption, and thereby improve performance, by scaling bit rates and supply voltages. While scaling down the bit rates allows opto-electronic networks to reduce their power consumption, this can adversely affect performance by increasing latency. Similarly, dynamically re-allocating bandwidth can improve network performance, but at the same time consumes more power. Taken together, this work evaluates the power-performance trade-off by balancing power consumption against improved network performance.

In this paper we propose a dynamically reconfigurable optical interconnect called E-RAPID that not only dynamically re-allocates bandwidth, but also reduces power consumption while delivering high bandwidth and high connectivity. A Dynamic Power Management (DPM) technique such as DVFS is applied in conjunction with a Dynamic Bandwidth Re-allocation (DBR) technique based on prior network utilization for various communication patterns. We propose a dynamic reconfiguration algorithm called the Lock-Step (LS) technique that adapts to changes in communication patterns. LS is a history-based distributed reconfiguration algorithm that triggers reconfiguration phases, disseminates state information, re-allocates system bandwidth, regulates power consumption and re-synchronizes the system periodically with minimal control overhead. LS has several advantages: (1) decentralized power scaling, such that every board independently makes power control decisions; (2) re-allocation of bandwidth between any system boards without affecting the on-going communication in the overall system; and (3) maximum bandwidth can be provided to system boards for hot-spot/bursty traffic patterns, where an extremely high load is placed for a short duration of time. Our simulation results on synthetic workloads indicate performance improvements of 30% to 400% with power savings in excess of 50%.

II. OPTICAL RECONFIGURABLE ARCHITECTURE: E-RAPID

An E-RAPID network is defined by a 3-tuple (C, B, D), where C is the total number of clusters, B is the total number of boards per cluster and D is the total number of nodes per board. Figure 1 shows an E-RAPID system with C = 1, B = 4 and D = 4. All nodes are connected to the scalable electrical Intra-Board Interconnect (IBI). The IBI connects the nodes for local (intra-board) communication as well as to the Scalable Remote Optical Super-Highway (SRS) for remote (inter-board) communication. All interconnects on the board are implemented electrically, whereas the interconnections from the board to the SRS are implemented with optical fibers using multiplexers and demultiplexers. The WDM and SDM features are exploited by the SRS to maximize inter-board connectivity, as explained next.

Fig. 1. Routing and wavelength assignment in E-RAPID for inter-board communication.

A. Inter-board and Intra-board Communication

The static routing and wavelength allocation (RWA) for inter-board communication for an R(1,4,4) system is shown in Figure 1. For inter-board communication, different wavelengths from various boards are selectively merged into separate channels to provide high connectivity. Inter-board wavelengths are indicated by λ_i^(s), where i is the wavelength and s is the source board number from which the wavelength originates. The wavelength assigned for a given source board s and destination board d is λ_{B−(d−s)}^(s) if d > s and λ_{(s−d)}^(s) if s > d, where B is the total number of boards in the system. For example, if any node on board 1 needs to communicate with any node on board 0, the wavelength used is λ_1^(1), and for the reverse communication, the wavelength used is λ_3^(0). The multiplexed signal received at a board is demultiplexed such that every optical receiver detects one wavelength.
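The static RWA rule above can be sketched as a small helper; the function name and the guard against intra-board traffic are illustrative, not from the paper:

```python
def inter_board_wavelength(s: int, d: int, B: int = 4) -> int:
    """Static RWA rule: wavelength index used by source board s to
    reach destination board d in a system of B boards."""
    if s == d:
        raise ValueError("intra-board traffic does not use the SRS")
    # lambda_{B-(d-s)} if d > s, lambda_{s-d} if s > d
    return B - (d - s) if d > s else s - d

# Board 1 -> board 0 uses wavelength 1; board 0 -> board 1 uses wavelength 3.
```

A useful sanity check is that, at any given destination, the wavelengths assigned to the B−1 distinct sources are all different, so the demultiplexer can steer each one to its own receiver.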

Figure 2(a) shows the intra-board interconnections for board 0. The network interface at every node is composed of send and receive ports. These send and receive ports at each node are connected to the optical transmitter and receiver ports through the bidirectional switch. Each packet, consisting of several fixed-size units called flits, that arrives in the physical input buffers progresses through various stages in the router before it is delivered to the appropriate output port. The progression of the packet can be split into per-packet and per-flit steps. The per-packet steps include route computation (RC) and virtual-channel allocation (VA), and the per-flit steps include switch allocation (SA) and switch traversal (ST) [21]. A link controller (LC) is associated with each optical transmitter and receiver, and a Reconfiguration Controller (RC) is associated with each system board. The co-ordination between RCs and LCs is essential for implementing the reconfiguration algorithm. One significant distinction should be made in E-RAPID: flits from different nodes are interleaved in the electrical domain using virtual channels, whereas packets from different boards are interleaved in the optical domain. Although flit transmission in the optical domain is feasible, flit management


Fig. 2. (a) The proposed on-board interconnect for the E-RAPID architecture with reconfiguration controller (RC) and link controllers (LC). (b) The proposed technology for reconfiguration using passive couplers and an array of lasers per transmitter port.

across multiple domains is extremely complicated.

B. Technology for Dynamic Bandwidth Re-Allocation (DBR)

From Figure 2(a), each optical transmitter is composed of an array of similar-wavelength lasers. The enabling technology for reconfigurability in E-RAPID is shown in Figure 2(b). Each optical transmitter is associated with 4 output ports (a, b, c and d), as there are 4 boards in the system. The notation λ_x^(y) is used here to indicate wavelength x originating from port y of a given transmitter. The statically assigned wavelengths, as per the communication requirements of Section II-A, are enclosed in brackets.

The ability to dynamically switch multiple wavelengths through different ports of a given transmitter simultaneously to different system boards using passive couplers forms the basis for system reconfigurability in E-RAPID. This provides the flexibility whereby more than one wavelength can be used for board-to-board communication in case of increased traffic loads. The basis of reconfiguration is to combine, at a given coupler, different wavelengths from similarly numbered ports, but from different transmitters. Referring to Figure 2(b), the multiplexed signal appearing at coupler 1 is composed of all the signals inserted by the same-numbered b ports (λ_0^(b), λ_1^(b), λ_2^(b) and λ_3^(b)), but from different transmitters. Now, when needed, different destination boards can be reached by more than one static wavelength, thereby enabling the dynamic reconfigurability of the proposed architecture. For example, assume that the traffic intensity from board 0 to board 2 is high. The static wavelength assigned for communication from board 0 to board 2 is λ_2^(c) at coupler 2. The other wavelengths λ_0^(c), λ_1^(c) and λ_3^(c) appearing at the same coupler 2 could be used if other boards (board 1, 2 or 3) release their statically allocated wavelengths (with which they can communicate with board 2) to board 0. If board 1 releases wavelength λ_1 to board 0, then board 0 can start using port c at transmitter 1 (λ_1^(c)) in addition to port c at transmitter 2 (λ_2^(c)), thereby doubling the bandwidth and reducing communication latency. The physical link over which both wavelengths λ_1^(c) and λ_2^(c) propagate is the same, whereas a different channel is formed between transmitters 1 and 2 at board 0 and different receivers on board 2. This allows contending traffic not only to use multiple wavelengths, but also to spread the traffic across the transmitter board, thereby increasing the throughput of the network.
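The release-and-borrow step described above can be sketched as toy bookkeeping; the class, its methods, and the wavelength indices in the usage note are illustrative assumptions, not the paper's implementation:

```python
# Toy bookkeeping for the release-and-borrow step: each destination
# board owns a table mapping receiver wavelengths to the source board
# currently granted that wavelength. A lightly loaded source releases
# its grant so a hot-spot source can transmit on two wavelengths at
# once. Class and method names are illustrative.
class DestinationPool:
    def __init__(self, static_owner):
        # static_owner: {wavelength index -> source board that owns it}
        self.owner = dict(static_owner)

    def release(self, src, wavelength):
        if self.owner[wavelength] != src:
            raise ValueError("only the current owner may release")
        self.owner[wavelength] = None

    def borrow(self, src, wavelength):
        if self.owner[wavelength] is not None:
            raise ValueError("wavelength is still granted")
        self.owner[wavelength] = src

    def channels(self, src):
        # all wavelengths the source may currently drive toward this board
        return sorted(w for w, o in self.owner.items() if o == src)
```

For instance, if one board releases its grant and the hot-spot board borrows it, `channels()` for the hot-spot board grows from one entry to two, mirroring the bandwidth-doubling example above.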

C. Dynamic Power Management (DPM) of Optical Interconnects

An optical link in the E-RAPID architecture consists of the transmitter, the receiver and the channel. Considering a passive channel, the total power consumption of an optical link depends on the transmitter and receiver power. Transmitter power is consumed in the laser and laser driver, whereas receiver power is consumed in the photodetector, transimpedance amplifier (TIA) and clock and data recovery (CDR) circuitry [16], [22]. While both Multiple-Quantum Wells (MQW) [22] with external modulators and VCSELs (vertical-cavity surface-emitting lasers) [22], [23] can be considered as light sources, we assume a VCSEL as the laser source, which eliminates the need for an external modulator. Moreover, commercial vendors provide one-dimensional multiple-wavelength VCSEL arrays which can be used for reconfiguration in E-RAPID [24]. In the next subsection, we evaluate the power dissipated in an opto-electronic link and the device parameters that can be controlled to regulate the power consumption.

1) Power Calculations: The total power consumed by an entire opto-electronic link is given by:

P_T = (P_Driver + P_VCSEL)_TX + (P_Photodiode + P_TIA + P_CDR)_RX    (1)

The superbuffer in the VCSEL driver is a set of cascaded inverters, and the size of each inverter is larger than the previous one by a constant factor δ. The total power dissipated in the driver stages is calculated as

P_Driver = γ C_L V_dd^2 BR    (2)

where γ is the switching factor, C_L is the total load capacitance of the superbuffer (of n inverters), V_dd is the supply voltage and BR is the bit rate. The total capacitance is the sum of the input and output capacitances of all the inverters, and is given as [22]

C_L = C_Load − C_in + Σ_{k=0}^{n−1} (C_in + C_out) δ^k    (3)

where C_Load is the load capacitance of the inverter chain, and C_in and C_out are the input and output capacitances of the minimum-sized inverters. We adopt the VCSEL with a CMOS driver from [22], where the driver circuitry consists of two NMOS transistors providing the threshold and modulation currents and a superbuffer driving the gate that delivers the modulation current. The VCSEL power consumed is given as

P_VCSEL = I_Total · V_source = (I_th + I_m γ)(V_th + I_m R_s + V_dd − V_tn)    (4)

The total current is the sum of the threshold current (I_th) and the modulation current times the switching factor. The total voltage is the sum of the VCSEL threshold voltage (V_th), the voltage drop across the series resistance (R_s) and the minimum source-drain voltage (V_dd − V_tn) that ensures the gate delivering the modulation current remains in saturation.

At the receiver, we determine the power consumed by the photodetector and the TIA. This is modeled similarly to [25], as a photodetector acting as a current source (I_d + αβI_m) and a common-source amplifier connected by a feedback resistance R_f. Here I_d is the dark current, α is the VCSEL efficiency in A/W and β is the detector efficiency in W/A. The input capacitance of the amplifier is C_in = C_D + C_g, where C_D is the diode capacitance and C_g = C_ox W L is the gate capacitance. The VCSEL needs to generate enough light, which depends on I_m, for the receiver to produce an output signal of amplitude ΔV_0, which can then be amplified by further receiver stages. This can be approximated as [25]

ΔV_0 = γ I_m β α R_f    (5)

Therefore, the power consumption of the VCSEL is defined by the needs of the receiver for a given BR and V_dd. The total power dissipated in the TIA-based receiver circuit is then given as

P_TIA = I_b V_dd + I_d^2 V_dd + γ (αβI_m)^2 R_f    (6)

where I_b is the bias current of the internal amplifier and is given by I_b = ω_3dB,int V_e C_0, where ω_3dB,int is the 3 dB bandwidth of the internal amplifier, V_e is the Early voltage and C_0 is the output capacitance. The gain-bandwidth product of the internal amplifier is GBW = A(ω) ω_3dB,int = g_m / C_0, where ω = 2π BR and g_m is the transconductance. The relationship between the internal amplifier bandwidth and the maximum bit rate is given as ω = 0.35 ω_3dB,int. The bandwidth of the TIA is assumed to be half the bandwidth of the internal amplifier; therefore, the 3 dB bandwidth of the TIA is approximated as

ω_3dB,tia = A(ω) / (R_f C_in) = ω / 0.7    (7)

Then the total power dissipated at the receiver can be obtained as

P_TIA = 0.7 A(ω) I_d^2 / (2π C_in BR) + (2π V_e C_0 V_dd / 0.35 + 2π γ ΔV_0^2 C_in / (0.7 A(ω))) BR    (8)

The desired I_m at the transmitter can then be obtained by solving (5), (7) and (8). The power dissipated in the clock and data recovery unit is given as [16]

P_CDR = γ C_CDR V_dd^2 BR    (9)

where C_CDR is the capacitance of the clock and data recovery unit.

2) Dynamic Power: At the transmitter, the VCSEL is generally biased at the threshold current I_th, and the dynamic power consumed by the VCSEL grows with the modulation current I_m. I_m is controlled by the receiver's minimum required voltage swing, as given by equation (5). For the VCSEL driver, the dynamic power is consumed by charging and discharging the capacitor chain and scales with BR and V_dd^2. At the receiver, the TIA consumes the most power, and its consumption depends on V_dd and BR as given by equation (8). The CDR can be frequency and voltage scaled as the bit rate varies (V_dd^2 and BR), from equation (9).
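The link power model above, equations (1)-(9), can be combined into a rough per-link estimate. This is a minimal sketch: every device value below is a placeholder assumption, not a parameter reported in the paper, and the driver capacitance is taken as a lumped C_L rather than evaluated from equation (3).

```python
from math import pi

def link_power(BR, Vdd, *, gamma=0.5, C_L=100e-15, C_CDR=200e-15,
               I_th=0.5e-3, I_m=2e-3, V_th=1.5, R_s=50.0, V_tn=0.4,
               I_d=1e-9, A=10.0, V_e=1.0, C0=10e-15, C_in=50e-15,
               dV0=0.1):
    """Rough per-link power (W); all device values are illustrative."""
    P_driver = gamma * C_L * Vdd**2 * BR                              # Eq. (2)
    P_vcsel = (I_th + I_m * gamma) * (V_th + I_m * R_s + Vdd - V_tn)  # Eq. (4)
    P_tia = (0.7 * A * I_d**2 / (2 * pi * C_in * BR)                  # Eq. (8)
             + (2 * pi * V_e * C0 * Vdd / 0.35
                + 2 * pi * gamma * dV0**2 * C_in / (0.7 * A)) * BR)
    P_cdr = gamma * C_CDR * Vdd**2 * BR                               # Eq. (9)
    return P_driver + P_vcsel + P_tia + P_cdr                         # Eq. (1)
```

With these placeholder values, scaling from 8 Gbps at 1.2 V down to 4 Gbps at 0.6 V (the operating points cited from [19]) cuts the estimated link power, since the driver, TIA and CDR terms all scale with BR and V_dd.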

When the bit rate scales down, the supply voltage of all the above components is also reduced, resulting in power savings. Scaling the power level focuses on reducing the delay incurred during the slow voltage transitions as compared to frequency transitions [16], [17]. As the link can remain operational during the slow voltage transitions, increasing the link speed involves increasing the voltage before scaling the frequency; similarly, the frequency is decreased before scaling the voltage. The delay penalty is limited to frequency transitions, as these require the CDR (implemented as a phase-locked loop) to relock to the bit rate and re-synchronize the clock with the incoming data.
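The transition ordering described above can be sketched as follows; the function and callback names are illustrative:

```python
def scale_link(cur_br, new_v, new_br, apply_voltage, apply_bitrate):
    """Order DVFS transitions: raise the voltage before raising the
    bit rate, and lower the bit rate before lowering the voltage, so
    the link stays operational during the slow voltage ramps."""
    if new_br >= cur_br:
        apply_voltage(new_v)   # speed up: voltage ramp first...
        apply_bitrate(new_br)  # ...then frequency (CDR/PLL relock)
    else:
        apply_bitrate(new_br)  # slow down: frequency first...
        apply_voltage(new_v)   # ...then voltage ramp
```

Because the link keeps running through the voltage ramp, the only delay penalty in either direction is the single CDR relock on the bit-rate change.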

III. DYNAMIC RECONFIGURATION

A. Power-Performance Trade-Offs

To provide more insight into the power and performance trade-offs, consider Figure 3, which shows various combinations of power regulation and bandwidth re-allocation techniques. These techniques comprise four cases: Non-Power-Aware, Non-Bandwidth Re-allocation (NP-NB); Power-Aware, Non-Bandwidth Re-allocation (P-NB); Non-Power-Aware, Bandwidth Re-allocation (NP-B); and Power-Aware, Bandwidth Re-allocation (P-B).

Let us consider three power levels, namely low power P_L, mid power P_M and high power P_H, as shown on the left y-axis of Figure 3, and three link utilization levels, low utilization U_L, mid utilization U_M and high utilization U_H, as shown on the right y-axis of Figure 3. Link utilization measures the amount of time the link is in use. Moreover, assume that these link utilization levels are measured at every reconfiguration window, R_w. Reconfiguration statistics will be gathered from


Fig. 3. Design space of power-aware and bandwidth reconfigurability.

the past reconfiguration window to predict the future link utilization, and the corresponding power and bandwidth levels.

Figure 3(a) shows the NP-NB technique. In this case, irrespective of the link utilization, the power consumption remains constant and the network cannot react to fluctuations in traffic patterns. Figure 3(b) shows the P-NB technique, where the link utilization is measured at every R_w. The P-NB technique allows link power to scale with link utilization; it is shown to reduce power consumption, but is not able to respond to increases in bandwidth demands. Figure 3(c) shows the NP-B technique, where the link utilization is again measured at every R_w. The NP-B technique allows bandwidth re-allocation to adapt to link utilization; it is shown to improve performance, but is unable to regulate power. Figure 3(d) shows the P-B technique, where power is regulated and bandwidth is re-allocated upon changes in link utilization. This technique is shown not only to reduce power, but also to improve performance.

Power consumption and latency (proportional to 1/BR) are inversely related: there is a minimum power at which the latency increases asymptotically to infinity, and a minimum latency at which the power consumption increases asymptotically to infinity. Between these extremes, however, exist several design points at which either power or latency can be optimized. Dynamic power management (DPM) allows power scaling by controlling the bit rates and supply voltages. The dynamic bandwidth re-allocation (DBR) technique allows multiple links to be operational for a given communication. Taken together, this provides a two-dimensional design-space optimization problem. In this work, we re-allocate channels for overloaded links using DBR and regulate power consumption using DPM for all the links in the network.
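As a toy illustration of this trade-off, one can scan candidate bit rates under an assumed cost model; the quadratic power curve, the latency weight, and the function names below are invented for illustration:

```python
# Power grows with bit rate while latency shrinks as 1/BR, so an
# intermediate bit rate can minimize a combined power-latency
# objective. The cost model and weight are illustrative assumptions.
def best_operating_point(bit_rates, power_of, weight=1.0):
    """Return the bit rate minimizing power plus weighted latency."""
    def cost(br):
        return power_of(br) + weight / br  # latency modeled as 1/BR
    return min(bit_rates, key=cost)
```

For example, with power_of(br) = (br/1e9)**2 and weight = 16e9, the cost over {1, 2, 4, 8} Gbps is minimized at 2 Gbps: neither extreme of the design space wins.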

B. Dynamic Reconfiguration Technique

The LS technique re-allocates link bandwidth and scales the bit rates and supply voltages based on historical information. In LS, each reconfiguration phase works in several circular stages, each implemented either as a request or a response stage between the reconfiguration controller (RC) and the link controllers (LCs). Each RC triggers the reconfiguration phase and communicates with the local LCs and other RCs to determine the network load based on state information (link and buffer utilizations) collected during the previous phase. The LS protocol works in the background and does not affect on-going communication, thereby minimizing the impact of reconfiguration latency on the overall network latency.

Reconfiguration Statistics: Historical statistics are collected with hardware counters located at each LC. Each LC is associated with an optical transmitter to measure link statistics and to turn the laser on/off. The link utilization Link_util tracks the percentage of router clock cycles during which a packet is being transmitted in the optical domain from the transmitter queue. The buffer utilization Buffer_util determines the percentage of buffers being utilized before the packet is transmitted. At low-to-medium network loads, Link_util provides accurate information regarding whether a link is being used at all, whereas Buffer_util provides accurate information regarding network congestion at medium-to-high network loads. All these statistics are measured over a sampling time window called the reconfiguration window or phase, Rw. This sampling window impacts performance, as reconfiguring too finely incurs a latency penalty and reconfiguring too coarsely may not adapt in time to traffic fluctuations. We use network simulations to determine the optimum Rw.
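The per-LC counters described above can be sketched as follows. This is an illustrative software model under stated assumptions, not the authors' hardware: a real LC implements these as hardware counters sampled once per reconfiguration window.

```python
class LinkController:
    """Sketch of the per-LC statistics counters (illustrative model)."""

    def __init__(self, rw_cycles, num_buffers):
        self.rw_cycles = rw_cycles        # reconfiguration window Rw, in router cycles
        self.num_buffers = num_buffers
        self.busy_cycles = 0              # cycles a packet occupied the optical link
        self.buffer_occupancy_sum = 0     # running sum of occupied buffer slots

    def tick(self, link_busy, buffers_occupied):
        """Called once per router clock cycle."""
        if link_busy:
            self.busy_cycles += 1
        self.buffer_occupancy_sum += buffers_occupied

    def sample(self):
        """At the end of Rw: return (Link_util, Buffer_util) and reset."""
        link_util = self.busy_cycles / self.rw_cycles
        buffer_util = self.buffer_occupancy_sum / (self.rw_cycles * self.num_buffers)
        self.busy_cycles = 0
        self.buffer_occupancy_sum = 0
        return link_util, buffer_util
```

Averaging the buffer occupancy over the window, rather than taking an instantaneous reading, is an assumption of this sketch.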

Each RCi, i = 0, 1, ..., B − 1, is connected to all the LCj, j = 0, 1, ..., D − 1, on the board. In addition, each RCi is also connected to RC(i+1) mod B in a simple electrical ring topology, separate from the optical SRS. A ring topology with a unidirectional flow of control ensures that a control packet sent in one direction around the ring is received by every other RC and eventually returns to its sender. Figure 4 shows the two communication stages of the reconfiguration implementation, RC-LC and RC-RC. Each LC associated with a transmitter has a link utilization counter, a buffer utilization


[Figure 4 layout: (a) RC-LC communication, with each Reconfiguration Controller RC0-RC3 polling its transmitter link controllers Tx LC0-Tx LC3 on Boards 0-3 via Link Request/Response packets; (b) RC-RC communication, with BoardRequest packets BR0-BR3 circulating among RC0-RC3.]

Fig. 4. Reconfiguration algorithm implementation.

counter, and an on/off binary value for every wavelength λ0, λ1, λ2, ... on a given system board.
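The unidirectional control ring connecting the RCs can be sketched as follows. The callback-based packet model is an illustrative assumption; the key property is that a BoardRequest visits every other RC exactly once before returning to its sender.

```python
def circulate_board_request(origin, B, visit):
    """Forward a BoardRequest packet around the unidirectional RC ring.

    origin : index of the RC that injects the packet
    B      : number of boards (and hence RCs) on the ring
    visit  : callback invoked at each intermediate RC, e.g. to fill in
             the link statistics that RC holds for the origin board
    Returns the list of RC indices traversed, ending back at the origin.
    """
    path = []
    rc = (origin + 1) % B           # packet leaves toward the next RC
    while rc != origin:
        visit(rc)                   # intermediate RC updates the packet
        path.append(rc)
        rc = (rc + 1) % B
    path.append(origin)             # packet returns to its sender
    return path
```

With B = 4 and origin RC1, the packet visits RC2, RC3 and RC0 before arriving back at RC1.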

The symmetry of E-RAPID with respect to the number of wavelengths provides the insight behind the reconfiguration algorithm. For example, if Λ = λ0, λ1, λ2, ..., λB−1 is the total set of wavelengths associated with the system, this is exactly the number of wavelengths transmitted/received by every system board. In other words, the number of outgoing or incoming wavelengths per system board is the same. Therefore, in order to balance the load and re-allocate wavelengths on a given link, a system board needs all link statistics on its incoming links. This is achieved by the co-ordination between the LCs and RCs, as explained in the LS algorithm.

LS Algorithm: In order to implement LS, the RCs evaluate the state information and re-allocate the bandwidth for the current Rw based on the previous Rw. After the RCs have decided which links to re-allocate, this information is disseminated back to the RCs on the other boards. The RCs then determine the power level for each link and convey the re-allocation and power level information to the LCs. The pseudo-code of the LS algorithm is shown in Table I. After Rw, in Step 2, RCi sends out LinkRequest packets to its LCs, as shown in Figure 4(a). When this packet is received back by RCi, it updates all the outgoing link statistics. In Step 3, each RCi sends a BoardRequest packet to obtain all the link statistics for its incoming links, as shown in Figure 4(b). As it sends its own packet out, due to the symmetry of the ring architecture, it receives BoardRequest packets from the other RCs. For example, when board 1 receives BR0 from board 0, it updates the field for the wavelength with which board 1 communicates with board 0, i.e., λ1, using the data stored in its outgoing link statistics. When a board's RCi receives its own BoardRequest packet, it updates all the incoming link statistics.

In Step 4, DBR is implemented. Each RCi now computes whether reconfiguration is necessary based on the buffer congestion threshold, Bcon, and the minimum link utilization, Lmin. While profiling of traffic traces can provide more accurate information regarding when the network is actually congested, setting Bcon to 0.5 is fairly reasonable for most traffic scenarios. This implies that, on average, 50% of the buffers are occupied by packets for the given reconfiguration window Rw. We set Lmin to 0.0, which indicates that no packets are being transmitted on the link. Each incoming link statistic is classified into one of three categories: under-utilized if Link_util is less than or equal to Lmin (implying that this wavelength can be re-allocated), normal-utilized if Buffer_util is less than Bcon and Link_util is greater than Lmin (implying that the wavelength is well utilized), and over-utilized if Buffer_util is greater than Bcon (implying that additional wavelengths are needed). The RC then allocates the under-utilized links to the over-utilized links. In addition, we check whether an already re-allocated link is required by its original board (Step 4(b)). In such situations, we check whether its Buffer_util is greater than 0.0, which indicates that some packet is waiting to be transmitted. In such scenarios, we reverse the allocation and return the originally assigned wavelength to the board.
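The Step 4 classification and matching can be sketched as a small function; the threshold defaults follow the values above, while the string labels and the simple pairing of donors to recipients are illustrative assumptions.

```python
def classify_link(link_util, buffer_util, l_min=0.0, b_con=0.5):
    """Classify one incoming link for DBR, per Step 4a of the LS algorithm.

    Over-utilization is checked first: a congested link needs extra
    wavelengths regardless of how busy the link itself is.
    """
    if buffer_util > b_con:
        return "over-utilized"      # additional wavelengths are needed
    if link_util <= l_min:
        return "under-utilized"     # this wavelength can be re-allocated
    return "normal-utilized"        # the wavelength is well utilized


def reallocate(links):
    """Pair under-utilized links with over-utilized ones (illustrative).

    links : list of (Link_util, Buffer_util) tuples, one per incoming link.
    Returns (donor, recipient) index pairs.
    """
    spare = [i for i, (lu, bu) in enumerate(links)
             if classify_link(lu, bu) == "under-utilized"]
    needy = [i for i, (lu, bu) in enumerate(links)
             if classify_link(lu, bu) == "over-utilized"]
    return list(zip(spare, needy))
```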

In Step 5 and as shown in Figure 4(b), each RCi now sends a BoardResponse to the RCs of all the remaining boards to update their outgoing link statistics. As in the board request stage, RCi updates its outgoing link statistics with the information received from the other RCs for the transmitters with which it communicates with those boards.

In Step 6, DPM is implemented. The power level for the next Rw is computed based on two buffer thresholds, Bmin and Bmax. While other researchers have used link utilizations to regulate power levels, link utilization does not allow for aggressive power regulation. In our power regulation technique, we aggressively push the link to be fully utilized and then evaluate based on the buffer thresholds. If Buffer_util falls below Bmin, the power level of the link is scaled down to the next lower power level, Pn−1. If Buffer_util exceeds Bmax, the link power is scaled up to the next power level, Pn+1. If Buffer_util falls between Bmin and Bmax, the link retains the same power level, Pn. While multiple bit rates can conserve more power by finely tuning the bit rates to the buffer utilization, this increases the delay penalty by re-clocking the CDR circuitry every time the bit rate is scaled. Similarly, if Rw is too small, the bit rates will be tuned too often, again


TABLE I
LOCK-STEP ALGORITHM FOR DBR AND DPM IMPLEMENTATION

Step 1: Wait for the reconfiguration window, Rw.
Step 2: Each RCi sends the LinkRequest control packet to all its outgoing LCs.
  Step 2a: Each LCi computes Link_util and Buffer_util for the previous Rw, updates the field in the LinkRequest packet, and forwards it to the next LCi+1 and finally back to RCi.
Step 3: Each RCi sends the BoardRequest control packet to all RCj, i ≠ j.
  Step 3a: RCi updates Link_util and Buffer_util for the link (wavelength) with which it communicates with RCj when it receives the BoardRequest packet from RCj.
Step 4: RCi receives its own BoardRequest packet containing utilization information for all its incoming links.
  Step 4a: RCi classifies each of its B − 1 incoming links for DBR as:
    If Link_util ≤ Lmin => Under-utilized
    If Link_util > Lmin and Buffer_util < Bcon => Normal-utilized
    If Buffer_util > Bcon => Over-utilized
    Re-allocate under-utilized links to over-utilized links.
  Step 4b: For every link, if Buffer_util > 0.0 and the link has been re-allocated as under-utilized, cancel the re-allocation.
Step 5: Each RCi sends the BoardResponse control packet with updated link information to RCj, i ≠ j.
  Step 5a: RCi updates the wavelength re-allocation for the link with which it communicates with RCj when it receives the BoardResponse packet from RCj.
Step 6: Each RCi performs DPM and classifies each link as:
    If Buffer_util < Bmin => Decrease power level (Pn−1)
    If Bmin ≤ Buffer_util ≤ Bmax => Maintain power level (Pn)
    If Buffer_util > Bmax => Increase power level (Pn+1)
Step 7: Each RCi sends the LinkResponse control packet to all its outgoing LCs with the updated link re-allocation information and new power level information.
  Step 7a: In response to DBR, each LCi turns the lasers on/off for wavelength re-allocation.
  Step 7b: In response to DPM, each LCi sends new PowerLevel packets if the new power level differs from the previous power level.
Step 8: Go to Step 1.

incurring excess delay penalty. If Rw is too large, the bit rates cannot scale to accommodate large fluctuations. We use network simulation to determine an optimum value of Rw. By using six power levels in our system architecture, we avoid multiple bit rate transitions.
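The Step 6 threshold rule can be sketched as a one-step update per link. The defaults Bmin = 0.1 and Bmax = 0.3 follow the values chosen in Section IV; clamping at the lowest and highest of the six levels is an assumption of this sketch.

```python
def next_power_level(level, buffer_util, b_min=0.1, b_max=0.3, levels=6):
    """DPM update for one link, per Step 6 of the LS algorithm.

    level indexes the six (bit rate, Vdd) operating points, from the
    lowest-power level 0 (5 Gbps) to the peak-rate level 5 (10 Gbps).
    """
    if buffer_util < b_min:
        return max(0, level - 1)            # scale down: P(n-1)
    if buffer_util > b_max:
        return min(levels - 1, level + 1)   # scale up: P(n+1)
    return level                            # maintain: P(n)
```

Because the level moves by at most one step per window, a single traffic spike cannot drive the link from the lowest to the highest power level in one Rw, which damps the fluctuations discussed above.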

In Step 7 and as shown in Figure 4(a), each board's RCi sends out LinkResponse packets, using the data held in its outgoing link statistics, to each of its LCs. Each LCi applies the state information received, either turning the lasers on/off or re-clocking to the new power level. As there is a one-to-one mapping between the transmitter and the receiver, the transmitter LCi injects a bit rate control packet on the link and stops transmission for the duration of the frequency and voltage transitions. When this bit rate control packet is received, the optical receiver then re-clocks to the new bit rate.
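The transmitter-side transition in Step 7 can be sketched as a small event sequence. The event names are illustrative; the 65-cycle stall matches the conservative value used in our simulations (Section IV), during which the link is disabled while the CDR at both ends re-locks.

```python
def bit_rate_transition(old_level, new_level, stall_cycles=65):
    """Events a transmitter LC generates when re-clocking to a new power level.

    The control packet is sent first so the receiver knows the new bit rate;
    normal packet transmission resumes only after the stall.
    """
    if new_level == old_level:
        return []                                   # nothing to do
    events = [("send_control_packet", new_level)]   # notify the receiver
    events += [("link_disabled", cycle) for cycle in range(stall_cycles)]
    events.append(("resume_at_level", new_level))
    return events
```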

IV. PERFORMANCE EVALUATION

A. Simulation Network Parameters

The performance of E-RAPID is evaluated using the YACSIM and NETSIM discrete-event simulators [26] and is compared to various non-power/power-aware and non-bandwidth/bandwidth-reconfigured network configurations. The performance of E-RAPID compared to other competing electrical networks can be found elsewhere [10], [27]. We use cycle-accurate simulations to evaluate the performance of E-RAPID. Packets were injected according to a Bernoulli process based on the network load for a given simulation run. The network load is varied

from 0.1 to 0.9 of the network capacity. The network capacity was determined from the expression Nc (packets/node/cycle), which is defined as the maximum sustainable throughput when a network is loaded with uniform random traffic [21]. The simulator was warmed up under load without taking measurements until steady state was reached. Then a sample of injected packets was labelled during a measurement interval. The simulation was allowed to run until all the labelled packets reached their destinations.
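The Bernoulli injection process above can be sketched as follows; the capacity value and cycle count used in the example are illustrative placeholders, not the simulator's actual parameters.

```python
import random

def inject_packets(load, capacity, cycles, seed=0):
    """Bernoulli packet injection for one node.

    Each cycle, a packet is generated independently with probability
    p = load * capacity (packets/node/cycle), so the offered traffic is,
    on average, the requested fraction of the network capacity Nc.
    Returns the list of cycles in which a packet was injected.
    """
    rng = random.Random(seed)
    p = load * capacity
    return [cycle for cycle in range(cycles) if rng.random() < p]

# e.g. at 50% of a capacity of 0.5 packets/node/cycle, roughly a quarter
# of the cycles carry a new packet
arrivals = inject_packets(load=0.5, capacity=0.5, cycles=10000)
```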

For the on-board router model designed for the E-RAPID architecture, we considered the channel width to be 32 bits and the router speed to be 400 MHz, resulting in a unidirectional bandwidth of 12.8 Gbps and a per-port bidirectional bandwidth of 25.6 Gbps. It takes a single router cycle for routing, virtual channel allocation and switch allocation. For most of the runs, we maintained a constant packet size of 128 bytes, resulting in an 8-flit packet.
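These figures follow directly from the channel parameters, as a quick check shows:

```python
# Quick check of the router bandwidth figures quoted above.
channel_width_bits = 32
router_clock_hz = 400e6           # 400 MHz

unidirectional_bps = channel_width_bits * router_clock_hz
bidirectional_bps = 2 * unidirectional_bps

assert unidirectional_bps == 12.8e9   # 12.8 Gbps per direction
assert bidirectional_bps == 25.6e9    # 25.6 Gbps per port, both directions

# A 128-byte packet split into 8 flits gives 16-byte (128-bit) flits.
flits_per_packet = 8
flit_bits = 128 * 8 // flits_per_packet
```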

Network workloads that accurately reflect the high temporal and spatial traffic variance of many parallel numerical algorithms employed by scientific applications are most useful for evaluating the performance of HPC systems. The power-performance of E-RAPID using the NP-NB, NP-B, P-NB and P-B techniques was evaluated for several communication patterns, including uniform, butterfly (an−1, an−2, ..., a1, a0 communicates with a0, an−2, ..., a1, an−1), complement (an−1, an−2, ..., a1, a0 communicates with node ān−1, ān−2, ..., ā1, ā0, the bitwise complement), and perfect shuffle (an−1, an−2, ..., a1, a0 communicates with node an−2, an−3, ..., a0, an−1) for a network size of 64 nodes. Here an−1, an−2, ..., a1, a0 indicates the binary address of each node. For example, with n = 6, node 5 (000101) has the binary pattern a5 = 0, a4 = 0, a3 = 0, a2 = 1, a1 = 0 and a0 = 1. While networks of varying sizes were modelled, due to space constraints we describe the performance (throughput, latency and power) for a 64-node network. The 64-node E-RAPID configuration evaluated consists of 8 boards with 8 nodes per board (C = 1, B = 8 and D = 8).

Optical Network Modelling: In the calculations for an oxide-based VCSEL [22], we considered a 50% duty cycle γ = 0.5, threshold current Ith = 0.1 mA, series resistance Rs = 250 Ω, threshold voltage Vth = 2 V, efficiency β = 0.3, and Vtn

= 0.38 V. For the driver, we considered Cload = 50 pF, and input and output capacitances of minimum-sized inverters Cin = Cout = 2 pF. For the receiver [25], we considered a minimum voltage swing ΔV0 = 100 mV, detector efficiency α = 0.4 A/W, amplifier gain A = 10, L = 0.25 µm, µn = 1300 cm²/V·s, Ve = 20 V, CD = 0.05 pF, C0 = 0.05 pF, Id = 100 nA, and CDR capacitance CCDR = 9.26 pF.

From the above parameters and by solving the equations from Section 2.3.1, we estimated the power dissipated in the link at the transmitter and the receiver. The link power is dominated by the receiver power, consisting of the TIA and CDR, whereas the VCSEL and driver dissipate minimal power. The receiver power can be further reduced by considering other low-impedance resistive circuits instead of the TIA [25]. The total power dissipated at 10 Gbps is approximately 535 mW. With the bit rate scaling from 10 Gbps to 5 Gbps and the supply voltage scaling from 1.8 V to 0.9 V, the power dissipation for a 5 Gbps link reduces to almost 108 mW, an 80% reduction in power. For the optical network, we considered 6 bit rates corresponding to 5, 6, 7, 8, 9 and 10 Gbps, and Vdd scaling from 0.9 to 1.8 V, giving us 6 different power levels {108.8 mW, 163.7 mW, 232.5 mW, 316.0 mW, 417.0 mW, 535.0 mW}. The CDR delay was estimated from [17] and normalized to our network clock cycle. In [17], the link was disabled for 12 network clock cycles (for frequency scaling) after the bit rate transitions to give the CDR time to re-lock to the input data. In our network simulation, after the control bit rate packet is transmitted, the transmitter conservatively disables the link for 65 cycles.
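The savings quoted above can be checked, and the energy cost per bit derived, from the six power levels:

```python
# The six (bit rate, power) operating points quoted above.
bit_rates_gbps = [5, 6, 7, 8, 9, 10]
power_mw = [108.8, 163.7, 232.5, 316.0, 417.0, 535.0]

# Scaling from the 10 Gbps level down to the 5 Gbps level saves ~80%.
reduction = 1 - power_mw[0] / power_mw[-1]
print(f"power reduction: {reduction:.1%}")   # prints: power reduction: 79.7%

# Energy per bit at each operating point: mW / Gbps = pJ/bit.
energy_pj_per_bit = [p / r for p, r in zip(power_mw, bit_rates_gbps)]
# The 5 Gbps level is also the most energy-efficient per bit (~21.8 pJ/bit
# versus ~53.5 pJ/bit at 10 Gbps), since power falls faster than bit rate.
```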

B. Results and Discussion

Reconfiguration Window Rw: In order to determine the optimum reconfiguration window size Rw, we performed simulations varying the window size from 500 to 4000 simulation cycles. We evaluated the latency and normalized power dissipation for the complement traffic pattern. The normalized power dissipation is calculated by averaging over the links operating at different bit rates and normalizing to the maximum bit rate. The latency and power dissipation are evaluated for low (0.2) and medium (0.5) network loads in Figures 5(a) and 5(b), respectively. At the low load of 0.2, the latency increases marginally with Rw, whereas at the load of 0.5, the latency increases almost 8x at 4000 cycles as compared to 500 cycles. At high loads, most packets that

need reconfiguring are already saturating the links; therefore, increasing Rw worsens the situation. However, at low loads the number of packets saturating the network is smaller, and therefore the impact on latency is much smaller. For the low load of 0.2, the power dissipation increases with increasing Rw. This is because, as the network warms up, the demand for reconfiguring at low loads may not exist, but at higher Rw the need to reconfigure grows faster. At high loads, the network is already experiencing saturation; therefore, for all values of Rw, reconfiguration is necessary, resulting in an almost equal power dissipation. Therefore, to balance the two constraints, we choose an Rw of 1000 simulation cycles.

Buffer Thresholds, Bmin, Bmax: In order to determine the optimum values of the buffer thresholds, we consider the two conditions shown in Figure 6(a) and Figure 6(b). In Figure 6(a), we set Bmin = 0.1 and Bmax = 0.3. As the network warms up, the bit rate is reduced and power savings are obtained. However, by Rw = 4, the bit rate starts increasing as the buffer utilization exceeds Bmax. When the link operates at the peak rate of 10 Gbps, the buffer utilization starts falling, as maximum bandwidth is provided to transmit packets. Once it falls below Bmin, the bit rate again scales down. This continues until the buffer utilization falls between Bmin and Bmax. If the difference between Bmin = 0.1 and Bmax = 0.2 is smaller, as shown in Figure 6(b), the bit rate fluctuates more often and at many instances operates at peak bit rates. Therefore, to have a larger range of stable operating points, we choose Bmin = 0.1 and Bmax = 0.3. Increasing the difference between Bmin

and Bmax will prevent fluctuations, but the latency penalty will also increase.

Throughput, Latency, Power: Figures 7, 8 and 9 show the throughput, latency and overall power consumption of a 64-node network for the uniform, complement, perfect shuffle and butterfly traffic patterns. All selected traffic patterns except uniform are adversarial. From Figure 7, for uniform traffic, NP-NB (non-power-aware, non-bandwidth-reconfigured) and NP-B (non-power-aware, bandwidth-reconfigured) show identical performance. Both P-NB (power-aware, non-bandwidth-reconfigured) and P-B (power-aware, bandwidth-reconfigured) show a 4% decrease in throughput. This is mainly due to the power-awareness algorithm, which attempts to regulate the power consumption and thereby affects the throughput. For the uniform traffic pattern, all nodes are equally likely to communicate with every other node. This balances the load on all links, leaving no under-utilized links to reconfigure. The worst-case traffic pattern for E-RAPID is complement traffic, where all nodes on a given source board communicate with a single destination board. For a 64-node network, nodes 0, 1, 2, ..., 7 on board 0 communicate with nodes 63, 62, 61, ..., 56 on board 7. Therefore, the network saturates even at low load for the E-RAPID architecture. As seen for NP-NB and P-NB, the network is saturated at very low loads. The throughput remains the same for both NP-NB and P-NB. With reconfiguration, all the remaining links can be provided to the system board, i.e., NP-B and P-B provide improved throughput. We achieve almost a 500% improvement in throughput by completely reconfiguring the network. For the perfect shuffle and butterfly patterns, the improvement in


[Figure 5 plots: latency (microsec, left axis) and normalized power consumption (right axis) versus reconfiguration window Rw from 500 to 4000 simulation cycles; (a) at network load 0.2, (b) at network load 0.5.]

Fig. 5. Reconfiguration window sizing for a network load of (a) 0.2 and (b) 0.5.

[Figure 6 plots: buffer utilization (left axis) and bit rate in Gbps (right axis) versus reconfiguration window index 1-47.]

Fig. 6. Buffer utilization and bit rate comparisons for the complement traffic pattern with (a) Bmin = 0.1 and Bmax = 0.3, and (b) Bmin = 0.1 and Bmax = 0.2.

throughput is 37% and 33%, respectively. In these communication patterns, not all nodes communicate with other boards. As these patterns have some component of local communication within the board, the percentage improvement is reduced.

From Figure 8, for uniform traffic, the network saturates at 0.4 for NP-NB and NP-B, whereas the saturation point shifts slightly, to 0.37, for P-NB and P-B due to the power awareness being implemented. There is no excess reconfiguration penalty for P-B and NP-B. This implies that LS independently evaluates whether reconfiguration is necessary; if it cannot reconfigure the network, it does not hinder the on-going communication. For complement traffic, the P-B and NP-B techniques show superior saturation values of 0.5, as opposed to P-NB and NP-NB, where the network saturates at the extremely low load of 0.1. The same is true for butterfly, which saturates at 0.5, and perfect shuffle, which saturates at 0.2. The latency is marginally higher for the P-B technique because of the power regulation being implemented.

From Figure 9, the normalized power dissipation for all the traffic patterns is shown. For the NP-NB and NP-B techniques, the power dissipated is at the maximum, as all links operate at the peak data transmission rate of 10 Gbps. For uniform traffic, by applying power awareness we can reduce power consumption by almost 40% using either the P-NB or P-B technique. For the complement traffic pattern, the power saving drops from 50% at low network loads to 20% at high loads. At high loads, bandwidth re-allocation causes more links to be active, thereby consuming more power. The same is true for both the perfect shuffle and butterfly patterns, where the power dissipated with the P-B technique is more than with the P-NB technique due to bandwidth re-allocation. In the E-RAPID architecture, power regulation and bandwidth re-allocation allow the network not only to improve performance by re-allocating idle links, but also to save power by bit rate and voltage scaling. NP-B allows only the bandwidth to be re-allocated, and P-NB allows only the power to be scaled. The new P-B technique allows both power and bandwidth to be reconfigured, leading to improved network performance.

Degree of Reconfiguration: Figure 10 shows the degree of reconfiguration for complement and butterfly traffic, in terms of throughput, for network loads of 0.1, 0.5 and 0.9. The degree of reconfiguration indicates the number of links made available for re-allocation. From Figure 10(a), at the low load of 0.1, the throughput is insensitive to the amount of available bandwidth. At medium (0.5) and high loads (0.9), the network is sensitive to the amount of available bandwidth. For a network load


[Figure 7 plots: throughput (GBps), latency (microsec) and average power dissipated per packet (mWatt) versus offered traffic (as a fraction of network capacity) for the PB, NPB, NPNB and PNB techniques, under the Uniform, Complement, Butterfly and Perfect Shuffle traffic patterns.]

Fig. 7. Throughput, latency and power dissipation for a 64-node E-RAPID configuration implementing NP-NB, NP-B, P-NB and P-B for the uniform, complement, butterfly and perfect shuffle traffic patterns.


of 0.9, at N = 4, the improvement in throughput compared to N = 2 is 27%, whereas at N = 8 compared to N = 4, the improvement in throughput is almost 47%. Therefore, allocating more links is advantageous for complement traffic, as all nodes use the same link for communication. From Figure 10(b), for butterfly traffic, the improvement in throughput at high loads (0.9) is lower: at N = 4, the improvement over N = 2 is 5%, and at N = 8 compared to N = 4 the improvement is 16%. For the butterfly traffic pattern, allocating the entire bandwidth does not improve throughput significantly. These results show that, based on the traffic pattern, link re-allocation can be optimized such that performance is improved at much lower power.

V. CONCLUSION

In this paper, we combined dynamic bandwidth re-allocation (DBR) techniques with dynamic power management (DPM) techniques and proposed a combined technique called Lock-Step (LS) for improving the performance of the opto-electronic interconnect while consuming substantially less power. We implemented LS on our proposed opto-electronic E-RAPID architecture and compared the performance of non-power/power-aware and non-bandwidth/bandwidth-reconfigured networks. Our proposed LS technique implemented the power-bandwidth (P-B) reconfiguration technique and achieved throughput and latency performance similar to a fully bandwidth-reconfigured network while consuming almost 25% to 50% less power. More power levels and corresponding bit rates can further improve the performance, as power scaling can then follow the traffic pattern more accurately. The dynamic bandwidth re-allocation technique proposed in this paper provides complete flexibility to re-allocate all system bandwidth to a given board. Cost-effective design alternatives that provide limited flexibility for reconfigurability may reduce performance, but lower the cost of the network. In the future, we will evaluate multiple power scaling techniques along with limited bandwidth reconfigurability for improving system performance, reducing power consumption and reducing the overall cost of the architecture. In addition, we will also evaluate the performance of DBR and DPM on HPC benchmarks.

REFERENCES

[1] J. Kash et al., "Bringing optics inside the box: Recent progress and future trends," in 16th Annual Meeting of the IEEE/LEOS, October 2003, p. 23.

[2] E. Mohammed et al., "Optical interconnect system integration for ultra-short-reach applications," Intel Technology Journal, vol. 8, pp. 114-127, 2004.

[3] D. A. B. Miller, "Rationale and challenges for optical interconnects to electronic chips," Proceedings of the IEEE, vol. 88, pp. 728-749, June 2000.

[4] J. Collet, D. Litaize, J. V. Campenhout, C. Jesshope, M. Desmulliez, H. Thienpont, J. Goodman, and A. Louri, "Architectural approaches to the role of optics in mono and multiprocessor machines," Applied Optics, Special Issue on Optics in Computing, vol. 39, pp. 671-682, 2000.

[5] P. Dowd et al., "Lightning network and systems architecture," Journal of Lightwave Technology, vol. 14, pp. 1371-1387, 1996.

[6] J.-H. Ha and T. M. Pinkston, "The SPEED cache coherence for an optical multi-access interconnect architecture," in Proceedings of the 2nd International Conference on Massively Parallel Processing Using Optical Interconnections, 1995, pp. 98-107.

[7] R. G. Beausoleil, P. J. Kuekes, G. S. Snider, S.-Y. Wang, and R. S. Williams, "Nanoelectronic and nanophotonic interconnect," Proceedings of the IEEE, vol. 96, no. 2, pp. 230-247, February 2008.

[8] D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N. Jouppi, M. Fiorentino, A. Davis, N. Binkert, R. Beausoleil, and J. H. Ahn, "Corona: System implications of emerging nanophotonic technology," in Proceedings of the 35th International Symposium on Computer Architecture, June 2008, pp. 153-164.

[9] C. Batten, A. Joshi, J. Orcutt, A. Khilo, B. Moss, C. Holzwarth, M. Popovic, H. Li, H. Smith, J. Hoyt, F. Kartner, R. Ram, V. Stojanovic, and K. Asanovic, "Building manycore processor-to-DRAM networks with monolithic silicon photonics," in Proceedings of the 16th Annual Symposium on High-Performance Interconnects, August 27-28, 2008.

[10] A. K. Kodi and A. Louri, "RAPID: Reconfigurable and scalable all-photonic interconnect for distributed shared memory multiprocessors," Journal of Lightwave Technology, vol. 22, pp. 2101-2110, September 2004.

[11] N. Kirman, M. Kirman, R. Dokania, J. Martinez, A. Apsel, M. Watkins, and D. Albonesi, "Leveraging optical technology in future bus-based chip multiprocessors," in Proceedings of the 39th International Symposium on Microarchitecture, December 2006.

[12] A. Shacham, B. Small, O. Liboiron-Ladouceur, and K. Bergman, "A fully implemented 12x12 data vortex optical packet switching interconnection network," Journal of Lightwave Technology, vol. 23, pp. 3066-3075, October 2005.

[13] C. M. Qiao et al., "Dynamic reconfiguration of optically interconnected networks with time-division multiplexing," Journal of Parallel and Distributed Computing, vol. 22, no. 2, pp. 268-278, 1994.

[14] P. Krishnamurthy, R. Chamberlain, and M. Franklin, "Dynamic reconfiguration of an optical interconnect," in 36th Annual Simulation Symposium, 2003.

[15] N. Kirman and J. Martinez, "A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing," in 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2010.

[16] L. Shang, L.-S. Peh, and N. K. Jha, "Dynamic voltage scaling with links for power optimization of interconnection networks," in Proceedings of the 9th International Symposium on High Performance Computer Architecture, November 2003.

[17] X. Chen, L.-S. Peh, G.-Y. Wei, Y.-K. Huang, and P. Prucnal, "Exploring the design space of power-aware opto-electronic networked systems," in 11th International Symposium on High-Performance Computer Architecture (HPCA-11), February 2005, pp. 120-131.

[18] Q. Wu, P. Juang, M. Martonosi, L.-S. Peh, and D. W. Clark, "Formal control techniques for power-performance management," IEEE Micro, vol. 25, no. 5, September/October 2005.

[19] X. Chen, G.-Y. Wei, and L.-S. Peh, "Design of low-power short-distance opto-electronic transceiver front-ends with scalable supply voltages and frequencies," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), August 2008.

[20] E. J. Kim, K. H. Yum, G. M. Link, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, M. Yousif, and C. R. Das, "Energy optimization techniques in cluster interconnects," in Proceedings of the 2003 International Symposium on Low Power Electronics and Design (ISLPED '03), August 2003.

[21] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks. San Francisco, CA, USA: Morgan Kaufmann, 2004.

[22] O. Kibar, A. V. Blerkom, C. Fan, and S. C. Esener, "Power minimization and technology comparisons for digital free-space optoelectronic interconnections," IEEE Journal of Lightwave Technology, vol. 17, pp. 546-555, April 1999.

[23] A. V. Krishnamoorthy, K. W. Goossen, L. M. F. Chirovsky, R. G. Rozier, P. Chandramani, S. P. Hui, J. Lopata, J. A. Walker, and L. A. D'Asaro, "16 x 16 VCSEL array flip-chip bonded to CMOS VLSI circuit," IEEE Photonics Technology Letters, vol. 12, no. 8, pp. 1073-1075, August 2000.

[24] A. Lindstrom, “Parallel links transform networking equipment,” Fiber-Systems International, pp. 29–32, February 2002.

[25] A. Apsel and A. G. Andreou, “Analysis of short distance optoelectroniclink architectures,” inProceedings of the 2003 International Symposiumon Circuits and Systems, May 2003.

[26] J. R. Jump, “Yacsim reference manual,”Rice University Available athttp://www-ece.rice.edu/ rppt.html, March 1993.

[27] A. K. Kodi and A. Louri, “Rapid for high-performance computing:Architecture and performance evaluation,”OSA Applied Optics, vol. 45,pp. 6326–6334, September 2006.


(Figure: two bar charts of throughput in GBps versus degree of reconfiguration, N = 2, 4, 8, at network loads of 0.1, 0.5, and 0.9.)

Fig. 8. Degree of reconfiguration at network loads of 0.1, 0.5, and 0.9 for (a) Complement traffic and (b) Butterfly traffic patterns.

(Figure: two bar charts showing (a) normalized average packet latency and (b) normalized power dissipation for NP-NB, NP-B, P-NB, and P-B on the FFT, LU, MP3D, RADIX, and WATER benchmarks.)

Fig. 9. Throughput for a 64-node E-RAPID configuration implementing NP-NB, NP-B, P-NB, and P-B for Uniform, Complement, Butterfly, and Perfect shuffle traffic patterns.

Avinash Karanth Kodi received the Ph.D. and M.S. degrees in Electrical and Computer Engineering from the University of Arizona, Tucson, in 2006 and 2003, respectively. He is currently an Assistant Professor of Electrical Engineering and Computer Science at Ohio University, Athens. His research interests include computer architecture, optical interconnects, chip multiprocessors (CMPs), and networks-on-chip (NoCs). Dr. Kodi is a member of the IEEE.

Ahmed Louri received the Ph.D. degree in Computer Engineering in 1988 from the University of Southern California (USC), Los Angeles. He is currently a full Professor of Electrical and Computer Engineering at the University of Arizona, Tucson, and the director of the High Performance Computing Architectures and Technologies (HPCAT) Laboratory (www.ece.arizona.edu/∼ocppl). His research interests include computer architecture, networks-on-chip (NoCs), parallel processing, power-aware parallel architectures, and optical interconnection networks.

Dr. Louri has served as the general chair of the 2007 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Phoenix, Arizona. He has also served as a member of the technical program committee of several conferences, including the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS) and the OSA/IEEE Conference on Massively Parallel Processors using Optical Interconnects, among others. Dr. Louri is a senior member of the IEEE and a member of the OSA.