asynchronous-logic qdi quad-rail sense-ampliﬁer half ... · rail sahb abides by qdi rules, ......

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

Asynchronous-Logic QDI Quad-Rail Sense-Amplifier Half-BufferApproach for NoC Router Design

Weng-Geng Ho, Kwen-Siong Chong, Kyaw Zwa Lwin Ne, Bah-Hwee Gwee, and Joseph S. Chang

Abstract— We propose a low area overhead and power-efficientasynchronous-logic quasi-delay-insensitive (QDI) sense-amplifier half-buffer (SAHB) approach with quad-rail (i.e., 1-of-4) data encoding. Theproposed quad-rail SAHB approach is targeted for area- and energy-efficient asynchronous network-on-chip (ANoC) router designs. There arethree main features in the proposed quad-rail SAHB approach. First,the quad-rail SAHB is designed to use four wires for selecting fourANoC router directions, hence reducing the number of transistors andarea overhead. Second, the quad-rail SAHB switches only one out offour wires for 2-bit data propagation, hence reducing the number oftransistor switchings and dynamic power dissipation. Third, the quad-rail SAHB abides by QDI rules, hence the designed ANoC router featureshigh operational robustness toward process-voltage-temperature (PVT)variations. Based on the 65-nm CMOS process, we use the proposed quad-rail SAHB to implement and prototype an 18-bit ANoC router design.When benchmarked against the dual-rail counterpart, the proposed quad-rail SAHB ANoC router features 32% smaller area and dissipates 50%lower energy under the same excellent operational robustness towardPVT variations. When compared to the other reported ANoC routers,our proposed quad-rail SAHB ANoC router is one of the high operationalrobustness, smallest area, and most energy-efficient designs.

Index Terms— Asynchronous network-on-chip (ANoC) router,energy efficient, process–voltage–temperature (PVT) variations,quad-rail data encoding, quasi-delay insensitive (QDI).

I. INTRODUCTION

Multicore processing is a promising option to increase thecomputation speed in many applications nowadays. In the multicoreprocessing, the use of network-on-chip (NoC) router to link allthe processing cores for efficient data communication is widelyadopted. To date, there are many NoC routers reported,including those based on fully synchronous logic (sync) and thosein part based on asynchronous logic (async) [1]–[10]. The designof sync NoC routers, however, becomes increasingly more chal-lenging due to the timing issues for operation correctness. Thetiming issues are further compounded in view of wider process–voltage–temperature (PVT) variations in deep-submicrometer fabri-cation processes.

Conversely, the design of async NoC (ANoC) routers [1]–[10] hasincreasingly become attractive in accommodating timing and datasynchronization issues because their operations are essentially self-timed. In the ANoC router designs, the quasi-delay-insensitive (QDI)timing model with 1-of-n rails data encoding is well known toaccommodate the PVT variations [11].

However, the ANoC router design using the async QDI handshakeprotocol has higher circuit complexity as compared to the sync NoC

Manuscript received April 25, 2017; revised July 21, 2017; acceptedAugust 29, 2017. This work was supported by the Agency for Sci-ence, Technology and Research, Singapore, through the Public SectorResearch Funding under Grant SERC1321202098. (Corresponding author:Weng-Geng Ho.)

W.-G. Ho and K.-S. Chong are with the Temasek Laboratories, NanyangTechnological University, Singapore 637553 (e-mail: [email protected];[email protected]).

K. Z. L. Ne, B.-H. Gwee, and J. S. Chang are with the Virtus, ICDesign Centre of Excellence, School of Electrical and Electronic Engi-neering, Nanyang Technological University, Singapore 639798 (e-mail:[email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TVLSI.2017.2750171

router design due to the following reasons. First, the async QDIhandshake protocol requires input completion and output completioncircuits to validate the input and output completeness for everypipeline stage [12]. Second, the use of well-established dual-raildata encoding increases the computation block complexity and thenumber of wires along the pipeline stage [11]. In short, the ANoCrouter suffers from high area overhead due to the increased circuitcomplexity and high dynamic power dissipation due to the increasednumber of switching in the data propagation.

In this brief, we propose a quad-rail sense-amplifier half-buffer (SAHB) approach with low area overhead, power-efficient, andhigh operational robustness toward PVT variations. Based on 65-nmCMOS process, we implement and prototype an ANoC router usingthe proposed quad-rail SAHB approach. When benchmarked againstthe dual-rail counterpart, the proposed quad-rail SAHB ANoC routerfeatures 32% smaller area and dissipates 50% lower energy under thesame excellent operational robustness toward PVT variations. Whencompared to the other reported ANoC routers, our proposed quad-rail SAHB ANoC router is one of the smallest area, high operationalrobustness, and most energy-efficient designs.

This brief is organized as below. Section II elaborates our proposedquad-rail SAHB cell and the circuit and pipeline analysis. Section IIIelaborates our proposed ANoC router’s architecture. Section IVpresents the chip implementation and measurement results. Finally,Section V draws the conclusion.

II. PROPOSED QUAD-RAIL SAHBWe leverage on our SAHB realization approach [12], and design

the async cells in quad-rail encoding. For ANoC router design,the basic quad-rail cells are buffers (BUF), multiplexers (MUX),and deMUX (DEMUX). The SAHB quad-rail cells are designed attransistor level. For the sake of simplicity, we only elaborate theSAHB quad-rail 4-to-1 MUX cell; the other cells can be built basedon the same concept.

Fig. 1(a) depicts the block diagram of the SAHB quad-rail4-to-1 MUX cell (with generic interface signals), which comprisesfour quad-rail input data (L0, L1, L2, and L3), a quad-rail inputselect data (S), a quad-rail output data (R), an acknowledgment signalfrom the succeeding pipeline (Rack), and acknowledgment signals tothe preceding pipeline (Lack0, Lack1, Lack2, Lack3, and LackS).The complimentary signals are bundled for simplifying the cellstructure.

Fig. 1(b) depicts the pipeline diagram of the SAHB quad-rail4-to-1 MUX, which comprises four functional blocks (FBs) plustwo control logic cum sense amplifier (CLSA), input completiondetection (ICD), output completion detection (OCD), and C-Mullertree (CT). The (four FBs + two CLSAs) block propagates L0, L1,L2, L3, and S to R based on Rack. The ICD validates the inputavailability/nullity by generating input validity signal Val for eachinput data, the OCD validates the output availability/nullity, andfinally CT generates Lack for each input data.

Fig. 1(c) depicts the cell diagram of the SAHB quad-rail4-to-1 MUX, which comprises four FBs, two CLSAs, five ICDs,an OCD, and a CT. The FBs, as depicted in Fig. 1(d), evalu-ate/precharge L0, L1, L2, L3, and S to R, depending on Rack.

1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


2 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Fig. 1. Proposed quad-rail SAHB 4-to-1 MUX cell. (a) Block diagram. (b) Pipeline diagram. (c) Cell diagram. (d) 0th-rail FB. (e) 0th-rail and 1st-railCLSAs.

The CLSAs, as depicted in Fig. 1(e), control the sequence of asyncoperation, and help amplifying/latching R. The ICDs generate ValL .0,ValL .1, ValL .2, ValL .3, and ValS to validate the input availabil-ity/nullity. The OCD and CT generate Lack0, Lack1, Lack2, Lack3,and LackS to acknowledge the preceding stage.

In the quad-rail SAHB 4-to-1 MUX cell, there are two supplyvoltages, namely, VDDA for FBs and VDD for remainingblocks (CLSA, ICD, OCD, and CT). For proper operation,VDD ≥ VDDA. VDDA is fixed as subthreshold voltage (∼0.3 V)to fully minimize the switching power without comprising theevaluation/precharge speed since the output is amplified throughCLSA. On the other hand, VDD is adjustable from nominalvoltage (1.2 V) for speed-critical applications to near-thresholdvoltage (∼0.6 V) and even subthreshold voltage (∼0.3 V) forlow-power low-speed applications.

To analyze and evaluate our proposed quad-rail SAHB cells, wecompare their transistor count and number of transitions per cycleagainst the reported dual-rail SAHB cells [12]. Since the quad-railcells propagate 2-bit data in the pipeline, we normalize the readingsto single bit for fair comparison with dual-rail counterparts (propagate1-bit data). The comparison is made for four basic NoC func-tions: 1-to-1 buffering, 1-to-2 demultiplexing, 2-to-1 multiplexing,and 4-to-1 multiplexing. For sake of simplicity, we only analyze4-to-1 multiplexing function; the other functions can be analyzedsimilarly.

First, from the circuit-level perspective, the transistor count per bit,TC(4-to-1 MUX) with the transistor count of the cell’s componentssuch as FB, CLSA, ICD, OCD, CT, and Inverter (Inv) can becalculated, as expressed in the following equation:

TC(4 − to − 1MUX)

= 1

2[4 · TC(FB) + 2 · TC(CLSA)

+ 5 · TC(ICD) + TC(OCD) + TC(CT) + 4 · TC(Inv)] (1)

Fig. 2. Three-stage marked graph behavior of SAHB quad-rail 4-to-1MUX cell.

where TC(FB) = 22, TC(CLSA) = 14, TC(ICD) = 10, TC(OCD)= 8, TC(CT) = 48, and TC(Inv) = 2

Second, from the pipeline-level perspective, we depict the three-stage marked graph behavior for our proposed quad-rail SAHB cellapproach in Fig. 2. This graph illustrates evaluate operation (Fe, Inve,ICDe, OCDe, and CTe) and precharge operation (Fp, Invp, ICDp,OCDp, and CTp) to propagate the data within i th, (i + 1)th, and(i + 2)th pipeline stages in cycle basis.

From the three-stage marked graph behavior, the number of tran-sitions per bit for one cycle, T (4-to-1 MUX) with delay, t of thecell’s components can be calculated, as expressed in the followingequation:

T (4 − to − 1MUX)

= 1

2

[t(Fe

i) + t

(Inve

i) + t

(Fe

i+1) + t

(Inve

i+1) + t

(Fe

i+2)

+ t(Inve

i+2) + t

(OCDe

i+2) + t

(CTe

i+2) + t

(Fp

i+1

)

+ t(Invp

i+1) + t

(OCDp

i+1) + t

(CTp

i+1)

(2)



TABLE I

COMPARISON OF 2-BIT NOC FUNCTIONS BETWEEN QUAD-RAIL AND DUAL-RAIL SAHB CELL APPROACHES

Fig. 3. Realization of 2-bit SAHB 4-to-1 multiplexing function.(a) Quad rail. (b) Dual rail.

where t(Fe) = t(Fp) = 2 trans., t(Inve) = t(Invp) = 1 tran.,t(OCDe) = t(OCDp) = 1 tran., and t(CTe) = t(CTp) = 2 trans.

Fig. 3(a) and (b) depicts the quad-rail and dual-rail SAHB realiza-tions of 2-bit 4-to-1 multiplexing function, respectively. In Fig. 3(a),the quad-rail SAHB requires only one quad-rail 4-to-1 MUX cell,resulting in TC = 111 and T = 9. In Fig. 3(b), the dual-rail SAHBrequires six dual-rail 2-to-1 MUX cells, resulting in TC = 240 andT = 54. In short, the quad-rail SAHB reduces 2.16× lower TC and6× lesser T than the dual-rail SAHB. This is because the quad-rail SAHB 4-to-1 MUX cell can select one out of four input databy matching each input data to each wire of the quad-rail (i.e., fourwires) select data; four input data are matched to four wires. Whereasthe dual-rail SAHB 2-to-1 MUX cell can select only one out oftwo input data, hence more number of dual-rail cells are requiredto perform the same function.

For completeness, Table I tabulates the logic gate combinations,TC and T for quad-rail and dual-rail SAHB cell approaches infour different 2-bit ANoC functions. In realizing 1-to-1 buffering,1-to-2 demultiplexing, and 2-to-1 multiplexing functions for 2-bitdata, the dual-rail SAHB embodies two dual-rail cells in contrastto the quad-rail SAHB which embodies only one quad-rail cell.Although the quad-rail cells require ∼ 2× more transistor count thanthe dual-rail cells, the quad-rail SAHB function is advantageous interms of TC, i.e., ∼7% lower TC than the dual-rail SAHB due to thesharing of completion detection circuits for 2-bit data propagation inthe quad-rail SAHB. In terms of T , the quad-rail SAHB cells onlytrigger one out of four wires, in contrast to the dual-rail SAHB cells,which trigger two out of four wires in 2-bit ANoC functions. Thedual-rail SAHB suffers from 2× more T than the quad-rail SAHB.

In summary, the quad-rail SAHB cells, on average, reduce1.33× TC and 3× T, respectively, than the dual-rail SAHB cells,resulting in better performance.

III. PROPOSED ANOC ROUTER ARCHITECTURE

The targeted ANoC router design, as depicted in Fig. 4, comprisesfive interfaces, each having an input port (IP) and an output port (OP).

Fig. 4. ANoC Router Interfaces.

Four interfaces are used to transfer data to/from four directions[i.e., north (N), east (E), south (S), and west (W)] between twoneighboring ANoC routers/IOs, and the last interface to transfer databetween the ANoC router, and its respective processing core.

Fig. 5(a) and (b) depicts the block diagram of the ANoC router’s IPand OP, respectively. The bold dark arrows represent the data packet;the straight arrows represent the control signals; and the dotted arrowsrepresent the acknowledge signals. In the IP and OP, every data packetand control signal are accompanied by an acknowledge signal for onecomplete async handshake four-phase QDI operation. More details ofthe IP and OP could be found from [13]. In the implemented ANoCrouter, the buffer, virtual channel (VC) DEMUX, direction switch,and VC switch blocks, respectively, embody SAHB quad-rail BUFcells, 1-to-2 DEMUX cells, 4-to-1 MUX cells, and 2-to-1 MUX cells.

IV. MEASUREMENT RESULTS

The proposed quad-rail SAHB ANoC router is implemented bythe full-custom approach based on the 65-nm standard-threshold-voltage CMOS process (Vt ∼ 0.5 V) and fabricated through STMi-croelectronics’s circuits multiprojects solutions. Fig. 6 depicts themicrophotograph of the proposed quad-rail SAHB ANoC router andthe test structure.

Fig. 7(a) and (b) depicts the area breakdown of the implementedquad-rail and dual-rail SAHB ANoC routers, respectively. The quad-rail SAHB ANoC router occupies 0.105 mm2, 1.47× smaller areathan the dual-rail counterpart. This is due to the simpler implemen-tation in quad-rail SAHB library cells than the dual-rail implementa-tion (which requires on average 1.33× more transistors, in Table I).As seen, the area of direction switch (4-to-1 multiplexing function)reduces significantly from 53% in dual-rail design to 38% in quad-raildesign; this mainly reduces the overall area.

To demonstrate the operational robustness toward PVT varia-tions, we compare the throughput and energy dissipation of theproposed quad-rail SAHB ANoC router against three related QDIcounterparts, i.e., dual-rail SAHB [12], quad-rail weak-conditionedHB (WCHB) [6], and quad-rail precharged HB (PCHB) [7].These comparisons are evaluated under various variations of sup-ply voltages (0.3, 0.6, 0.9, and 1.2 V), operating temperatures


4 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Fig. 5. ANoC router interface. (a) IP. (b) OP.

TABLE II

COMPARISON OF VARIOUS ANOC ROUTERS

Fig. 6. Microphotograph of the proposed SAHB ANoC router.

(−40 °C, 27 °C, and 100 °C), and process threshold voltages (lowthreshold voltage (LVT), standard threshold voltage (SVT), and highthreshold voltage (HVT)), as depicted in Fig. 8(a)–(c), respectively.

Now, we elaborate the overall comparison in these variations.When compared to the dual-rail SAHB, the proposed design fea-tures ∼50% lower energy dissipation due to the adoption of quad-rail data encoding which reduces 1.33× transistor counts and3× transitions/cycle. The tradeoff is slightly low throughput partlydue to a more complex completion detection. When compared

Fig. 7. IC area for SAHB ANoC routers. (a) Quad rail. (b) Dual rail.

to the quad-rail WCHB, the proposed design features ∼15%higher throughput and ∼13% lower energy dissipation. When com-pared to the quad-rail PCHB, the proposed design features ∼27%lower energy dissipation. These improved attributes are achievedbecause the proposed SAHB approach embodies FBs that operate insubthreshold region (VDDA = 0.3 V) to dissipate significantly lowpower (<1 µW), and CLSAs that employ sense amplifiers to enhancethe speed, resulting in high energy efficiency.

Fig. 9 depicts the latency and throughput of the proposed SAHBANoC router by varying VDD from 1.2 to 0.3 V; the results arenormalized with respect to the readings at 1.2 V. As seen, VDD



Fig. 8. Energy versus throughput of ANoC routers for variations of (a) supply voltage, (b) operating temperature, and (c) process threshold voltage.

Fig. 9. Normalized latency and throughput of the proposed SAHB ANoCrouter for various VDD.

is adjustable from nominal voltage, 1.2 V (with 5.7-ns latency,258-MHz throughput) for speed-critical applications, to near thresh-old voltage, 0.6 V (with 2.7× latency, 0.35× throughput), andsubthreshold voltage, 0.3 V (with 38× latency, 0.03× throughput)for low-power low-speed applications.

For completeness, Table II tabulates the comparison of variousANoC routers, where the results, i.e., area, latency, throughput, andenergy dissipation per bit of the reported ANoC routers [1]–[10]are normalized with respect to the proposed SAHB ANoC routerat 65-nm CMOS, 1.2 V based on technology scaling. In termsof robustness toward PVT variations, the ANoC routers based onthe QDI handshake protocol feature high operational robustness ascompared to those based on bundled data and timed handshakeprotocols. This is due to the absence of timing issues in QDIhandshake protocol.

Furthermore, their process technology, design architecture, andimplementation are different, hence a direct comparison of variousANoC routers is somewhat contentious. For example, the reportedARGO [10] and ONIZAWA [9] appear to dissipate lower normalizedenergy per bit than our SAHB design, but they are not robusttoward PVT variations. In addition, the reported ARGO needs tobe associated by an external controller for complete operations,hence dissipating additional energy overhead. When compared withinall QDI ANoC routers, our SAHB designs are, respectively, 35%,56%, and 85% more energy efficient than the reported FAUST [4],ALPIN [6], and ALHUSSIEN [7].

V. CONCLUSION

We have proposed an async QDI quad-rail SAHB approach withemphases on low area overhead and power efficient, for ANoC routerdesign. Based on the 65-nm CMOS process, we have implemented

and prototyped an 18-bit quad-rail SAHB ANoC router. Whenbenchmarked against the dual-rail counterpart, the proposed quad-rail SAHB ANoC router features 32% smaller area and dissipates50% lower energy under the same excellent operational robustnesstoward PVT variations. When compared to the other reported ANoCrouters, our proposed quad-rail SAHB ANoC router is one of thehigh operational robustness, smallest area, and most energy-efficientdesigns.

REFERENCES

[1] D. Rostislav, V. Vishnyakov, E. Friedman, and R. Ginosar, “An asyn-chronous router for multiple service levels networks on chip,” in Proc.IEEE ASYNC, Mar. 2005, pp. 44–53.

[2] T. Bjerregaard and J. Sparsø, “Implementation of guaranteed servicesin the MANGO clockless network-on-chip,” IEE Proc.-Comput. Digit.Techn., vol. 153, no. 4, pp. 217–229, Jul. 2006.

[3] J. Pontes, R. Soares, E. Carvalho, F. Moraes, and N. Calazans, “SCAFFI:An intrachip FPGA asynchronous interface based on hard macros,” inProc. IEEE ICCD, Oct. 2007, pp. 541–546.

[4] D. Lattard et al., “A reconfigurable baseband platform based on anasynchronous network-on-chip,” IEEE J. Solid-State Circuits, vol. 43,no. 1, pp. 223–235, Jan. 2008.

[5] A. Sheibanyrad, A. Greiner, and I. Miro-Panades, “Multisynchronousand fully asynchronous NoCs for GALS architectures,” IEEE Des. TestComput., vol. 25, no. 6, pp. 572–580, Nov. 2008.

[6] Y. Thonnart, P. Vivet, and F. Clermidy, “A fully-asynchronous low-powerframework for GALS NoC integration,” in Proc. DATE, Mar. 2010,pp. 33–38.

[7] A. Alhussien, C. Wang, and N. Bagherzadeh, “A scalable delay insen-sitive asynchronous NoC with adaptive routing,” in Proc. ICTEL,Apr. 2010, pp. 995–1002.

[8] M. N. Horak, S. M. Nowick, M. Carlberg, and U. Vishkin, “A low-overhead asynchronous interconnection network for GALS chip mul-tiprocessors,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.,vol. 30, no. 4, pp. 494–507, Apr. 2011.

[9] N. Onizawa, A. Matsumoto, T. Funazaki, and T. Hanyu, “High-throughput compact delay-insensitive asynchronous NoC router,” IEEETrans. Comput., vol. 63, no. 3, pp. 637–649, Mar. 2014.

[10] E. Kasapaki, M. Schoeberl, R. B. Sørensen, C. Müller,K. Goossens, and J. Sparsø, “Argo: A real-time network-on-chiparchitecture with an efficient GALS implementation,” IEEE Trans. VeryLarge Scale Integr. (VLSI) Syst., vol. 24, no. 2, pp. 479–492, Feb. 2016.

[11] A. J. Martin and M. Nystrom, “Asynchronous techniques for system-on-chip design,” Proc. IEEE, vol. 94, no. 6, pp. 1089–1120,Jun. 2006.

[12] K.-S. Chong, W.-G. Ho, T. Lin, B.-H. Gwee, and J. S. Chang, “Senseamplifier half-buffer (SAHB): A low-power high-performance asyn-chronous logic QDI cell template,” IEEE Trans. Very Large ScaleIntegr. (VLSI) Syst., vol. 25, no. 2, pp. 402–415, Feb. 2017.

[13] W. G. Ho, K. S. Chong, N. K. Z. Lwin, B. H. Gwee, andJ. S. Chang, “High robustness energy- and area-efficient dynamic-voltage-scaling 4-phase 4-rail asynchronous-logic network-on-chip (ANoC),” in Proc. IEEE ISCAS, May 2015, pp. 1913–1916.

asynchronous-logic qdi quad-rail sense-ampliﬁer half ... · rail sahb abides by qdi rules, ......

Documents